The Computer Science (/CS/) section of arXiv is out of control and needs to be pruned or spun off

The paper Taking AI Welfare Seriously was posted to arXiv on November 4th, 2024, and went viral, which motivated me to write this post.

Over the past few years it has become evident that the Computer Science (/CS/) section of arXiv has far surpassed every other section in submission volume.

The problem is that /CS/, once contained, has become a sprawling overgrowth that detracts from the rest of the preprint repository. The recent popularity of GPT-style LLMs has driven a surge in /CS/ content. Some of these papers may hold economic or theoretical value, such as improved learning models, but the section has clearly strayed from arXiv's original objective: the rapid distribution of new math and physics research, bypassing the typically time-consuming process of peer review.

I observed this trend in a September 2023 post, and it has since gotten worse, so I want to revisit it. The problem is compounded by /CS/ being full of either low-quality papers produced by obvious citation rings or articles that read more like blog or policy pieces than technical science papers. There is nothing necessarily wrong with social policy writing, but such content is better suited to something like SSRN, not arXiv. The same goes for blog-like articles: blogs are great (this is a blog, and I read blogs), but arXiv is intended for technical work.

Common characteristics of such papers:

1. Tons of references relative to the length of the paper. The above welfare paper is 62 pages, 18 of which are citations.

Heavy referencing can be intended to pad the length of the paper, to inflate its apparent authoritativeness, or to bypass arXiv's automated quality control: I believe arXiv flags papers that have too few citations for manual review, as a way of filtering out non-academic submissions.

2. Lots of authors relative to the length of the paper. This can be as many as one author per page. The above paper has 10 authors and 44 pages of non-bibliography text (62 minus 18), or about 4.4 pages per author.

3. The same authors citing and co-authoring each other's papers. (Maybe I could publish a /CS/ paper identifying citation rings in the /CS/ section? A naive sketch of how such detection might start appears after this list.)

4. Little or no math anywhere.

5. Too much emphasis on social policy or politics (e.g. “Fighting social media disinformation with language models” or “Detecting political bias on Twitter using language patterns”).

6. Lots of data mining. This can mean using software to sift through troves of API data from social networks such as Twitter or Reddit, looking for a trend or pattern (certain words or speech, say) that may be statistically significant. The result can then be published using items 1-3: padded references to inflate the page count, and many co-authors to speed up the process. A second sketch below reduces this pattern to its essence.
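
On the citation-ring idea in item 3, here is a minimal sketch of how detection might start: model authors as nodes in a directed citation graph and look for strongly connected components, i.e. groups whose members all cite one another, directly or indirectly. Everything here is hypothetical; the author names and citation edges are invented, and a serious analysis would weight edges by citation counts and control for legitimate subfield clustering.

```python
# Naive citation-ring detection over a toy author-level citation graph.
# An edge A -> B means "A cites B". All data below is made up.
import networkx as nx

citations = [
    ("alice", "bob"), ("bob", "carol"), ("carol", "alice"),  # a 3-author ring
    ("alice", "carol"), ("bob", "alice"),                    # ring reinforced
    ("dave", "alice"),                                       # ordinary citation
    ("erin", "frank"), ("frank", "erin"),                    # a 2-author ring
]

G = nx.DiGraph(citations)

# Strongly connected components with more than one member are groups of
# authors who can all reach each other through citations: candidate rings.
rings = [scc for scc in nx.strongly_connected_components(G) if len(scc) > 1]

for ring in rings:
    sub = G.subgraph(ring)
    n = len(ring)
    # Density: fraction of possible intra-ring citation edges that exist.
    density = sub.number_of_edges() / (n * (n - 1))
    print(f"candidate ring {sorted(ring)}: density {density:.2f}")
```

Ranking candidate components by density would surface the tightest mutual-citation clusters first.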
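
And item 6 reduced to its essence: count a target word in two corpora and attach a p-value. The counts below are invented purely to show the shape of the exercise; a real paper of this kind would substitute a Twitter or Reddit API dump.

```python
# The item-6 pattern in miniature: a word-frequency difference plus a
# significance test. The contingency table is entirely invented.
from scipy.stats import chi2_contingency

# Rows: two communities. Columns: target-word count, all-other-word count.
table = [
    [120, 48_000],  # community A
    [60,  52_000],  # community B
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")

# With corpora this large, nearly any difference clears p < 0.05, which is
# why this kind of pattern mining yields easy but shallow papers.
```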

/CS/ has not only overtaken the rest of the site; it also stands to dilute arXiv's mission, from a repository for technical preprints suitable for publication to a venue for 'policy pieces', simulations, and data mining in search of any pattern at all. I would recommend spinning off /CS/ in its entirety to a separate website, similar to bioRxiv.org.