sgnoohc/arxiv-submission-analysis

arXiv Submission Surge Analysis: Revisions vs. New Papers

An investigation into the reported dramatic increase in arXiv hep-th (High Energy Physics - Theory) submissions, originally noted in a Not Even Wrong blog post by Peter Woit.

Background

Peter Woit reported that arXiv hep-th submissions had roughly doubled starting around December 2025, based on counts from the arXiv advanced search:

| Period | 2022 | 2023 | 2024 | 2025 | 2026 |
|---|---|---|---|---|---|
| Dec 1-31 | 634 | 684 | 780 | 1192 | - |
| Jan 1 - Feb 1 | 583 | 531 | 626 | 659 | 1137 |
| Feb 1-15 | 299 | 266 | 271 | 333 | 581 |

We verified every one of these numbers by reproducing his exact search. However, further analysis reveals the spike is overwhelmingly driven by paper revisions/replacements, not new research output.

The Key Finding

The arXiv advanced search has a date filter with three options:

  1. "Submission date (most recent)" (submitted_date) -- counts a paper based on when its latest version was uploaded
  2. "Submission date (original)" (submitted_date_first) -- counts a paper based on when v1 was first submitted
  3. "Announcement date" -- when v1 was announced

Woit used option 1 ("most recent"). This means a paper originally submitted in 2020 that gets a revised version uploaded in December 2025 is counted as a December 2025 submission. When we re-run the same searches using option 2 ("original"), the dramatic spike largely disappears:

December 2025 (hep-th, incl. cross-lists)

| Metric | 2022 | 2023 | 2024 | 2025 | YoY change (2024 to 2025) |
|---|---|---|---|---|---|
| Most recent (Woit's) | 634 | 686 | 780 | 1192 | +53% |
| Original only | 800 | 811 | 815 | 855 | +5% |

The "doubling" is almost entirely a revision surge.
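The year-over-year figures can be rechecked with a few lines of arithmetic. The following is a minimal sketch, not the repository's own code; the helper name `yoy_change_percent` is made up for illustration, and the inputs are the December 2024 and 2025 counts from the table above.

```python
def yoy_change_percent(previous: int, current: int) -> float:
    """Return the percentage change from `previous` to `current`."""
    return 100.0 * (current - previous) / previous


# December hep-th counts for 2024 and 2025, copied from the table above.
most_recent = {2024: 780, 2025: 1192}
original_only = {2024: 815, 2025: 855}

for label, counts in (("most recent", most_recent), ("original only", original_only)):
    change = yoy_change_percent(counts[2024], counts[2025])
    print(f"{label}: {change:+.0f}%")
```

Rounded to whole percentages, this gives +53% for the "most recent" metric and +5% for the "original only" metric, matching the table.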

This Is Not Specific to hep-th

We extended the analysis to four arXiv categories from January 2018 through February 2026:

  • hep-th (High Energy Physics - Theory)
  • hep-ex (High Energy Physics - Experiment)
  • hep-ph (High Energy Physics - Phenomenology)
  • cs.AI (Computer Science - Artificial Intelligence)

All four categories show the same pattern: starting around mid-2025, the "most recent" count abruptly pulls away from the "original" count.

Plots

1. hep-th: Most Recent vs Original Submission Date

The two metrics track each other closely from 2018 to mid-2025, then diverge dramatically.

hep-th line chart

2. hep-th: Decomposed (Original + Replacements)

The blue area (new papers) grows slowly. The red area (replacements) suddenly explodes in 2025.

hep-th stacked

3. hep-th: Year-over-Year Comparison

Left panel uses Woit's metric (showing a dramatic spike in 2025). Right panel uses original submission date (showing normal growth).

hep-th year-over-year

4. All Categories: Most Recent vs Original

The red/blue divergence is replicated across all four categories.

All categories line chart

5. All Categories: Stacked Decomposition

The replacement surge is visible in every category.

All categories stacked

6. Normalized Growth (2018 = 100)

Top panel (Woit's metric): the physics categories appear to double, and cs.AI appears to grow 5x.

Bottom panel (original submissions only): the physics categories grow ~20-40% over 8 years, while cs.AI genuinely grows ~3.7x.

Normalized growth
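Normalizing a series to a base year means dividing every value by the base-year value and multiplying by 100. Below is a small sketch of that step, not the plotting script itself. Because this README does not list the 2018 values, the example uses the December hep-th "original only" counts quoted earlier and takes 2022 as the base year instead; the function name `normalize_to_base` is hypothetical.

```python
def normalize_to_base(series: dict[int, float], base_year: int) -> dict[int, float]:
    """Rescale a yearly series so that the base year equals 100."""
    base = series[base_year]
    return {year: 100.0 * value / base for year, value in sorted(series.items())}


# December hep-th "original only" counts from the table earlier in this README.
# The plot uses 2018 as the base; 2022 is used here only because it is the
# earliest year quoted in the text.
december_original = {2022: 800, 2023: 811, 2024: 815, 2025: 855}

index = normalize_to_base(december_original, base_year=2022)
for year, value in index.items():
    print(year, round(value, 1))
```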

7. Replacement Ratio Over Time

This is the "smoking gun". From 2018 to 2024, the replacement excess (how far the "most recent" count exceeds the "original" count) stayed within about 0% +/- 10% in every category. Starting in mid-2025, all four categories spike simultaneously to +30-60%.

Replacement ratio
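One plausible way to compute this ratio is sketched below. The exact formula used by the plotting script is not stated in this README, so the definition here (the difference between the two counts, relative to the "original" count) is an assumption, and `replacement_excess_percent` is a hypothetical name.

```python
def replacement_excess_percent(most_recent: int, original_only: int) -> float:
    """Percentage by which the "most recent" count exceeds the "original" count.

    Assumed definition: 100 * (most_recent - original_only) / original_only.
    """
    return 100.0 * (most_recent - original_only) / original_only


# December 2025 hep-th counts from the table earlier in this README.
excess = replacement_excess_percent(most_recent=1192, original_only=855)
print(f"replacement excess: {excess:+.1f}%")
```

Under this definition, December 2025 for hep-th falls inside the +30-60% band described above.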

Data

Monthly submission counts for each category are in the data/ directory:

| File | Category |
|---|---|
| data/arxiv_hepth_monthly.csv | hep-th |
| data/arxiv_hep-ex_monthly.csv | hep-ex |
| data/arxiv_hep-ph_monthly.csv | hep-ph |
| data/arxiv_cs_AI_monthly.csv | cs.AI |

Each CSV has columns:

  • year, month -- the time period
  • most_recent -- count using "Submission date (most recent)"
  • original_only -- count using "Submission date (original)"
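A file with these columns can be read using only the standard library. The sketch below is illustrative: `load_monthly_counts` is a hypothetical helper, and the two sample rows reuse the December counts quoted in this README rather than being copied from the actual CSV files.

```python
import csv
import io

# Inline sample with the documented column layout. The values are the
# December hep-th counts quoted in this README, not a copy of the real file.
SAMPLE = """year,month,most_recent,original_only
2024,12,780,815
2025,12,1192,855
"""


def load_monthly_counts(handle):
    """Parse rows with integer columns year, month, most_recent, original_only."""
    rows = []
    for record in csv.DictReader(handle):
        rows.append({key: int(value) for key, value in record.items()})
    return rows


rows = load_monthly_counts(io.StringIO(SAMPLE))
for row in rows:
    print(row["year"], row["month"], row["most_recent"], row["original_only"])
```

To read one of the real files, pass an open file handle instead, for example `open("data/arxiv_hepth_monthly.csv", newline="")`.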

All searches include cross-listed papers (classification-include_cross_list=include).
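The two searches differ only in the date-type value. The sketch below builds such a search URL without sending any request. Only `classification-include_cross_list=include`, `submitted_date`, and `submitted_date_first` come from this README; every other parameter name is an assumption about the advanced search form's query string and should be checked against a URL copied from the browser. The function `build_search_url` is hypothetical and covers physics archives only.

```python
from urllib.parse import urlencode

BASE_URL = "https://arxiv.org/search/advanced"


def build_search_url(archive: str, date_from: str, date_to: str, date_type: str) -> str:
    """Build an advanced-search URL for a physics archive and a date range.

    date_type is "submitted_date" (most recent version) or
    "submitted_date_first" (original v1 submission).
    """
    params = {
        # Assumed parameter names, except where noted.
        "advanced": "",
        "terms-0-operator": "AND",
        "terms-0-term": "",
        "terms-0-field": "title",
        "classification-physics": "y",
        "classification-physics_archives": archive,
        "classification-include_cross_list": "include",  # from this README
        "date-filter_by": "date_range",
        "date-from_date": date_from,
        "date-to_date": date_to,
        "date-date_type": date_type,  # values from this README
        "abstracts": "hide",
        "size": "50",
        "order": "-announced_date_first",
    }
    return BASE_URL + "?" + urlencode(params)


url_most_recent = build_search_url("hep-th", "2025-12-01", "2025-12-31", "submitted_date")
url_original = build_search_url("hep-th", "2025-12-01", "2025-12-31", "submitted_date_first")
print(url_most_recent)
print(url_original)
```

A non-physics category such as cs.AI would need different classification fields, which this sketch does not attempt to cover.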

Reproduction

Scripts used to collect the data and generate plots are in scripts/:

```bash
# Fetch hep-th data (takes ~10 min due to rate limiting)
python3 scripts/fetch_arxiv_data.py

# Fetch hep-ex, hep-ph, cs.AI data (takes ~30 min)
python3 scripts/fetch_arxiv_multi.py

# Generate hep-th plots
python3 scripts/plot_arxiv.py

# Generate cross-category comparison plots
python3 scripts/plot_all_categories.py
```

Requirements: python3, requests, matplotlib

Summary

| Claim | Reality |
|---|---|
| hep-th submissions doubled in late 2025 | Using "most recent submission date": yes, the counts reproduce (up 53% to 74% year over year in the three periods checked) |
| This represents a surge in new research | No. Using "original submission date", new papers grew ~5% YoY in December 2025 |
| The spike is specific to hep-th | No. All four categories analyzed (hep-th, hep-ex, hep-ph, cs.AI) show the same pattern |
| Something changed around mid-2025 | Yes. A surge in paper revisions/replacements began, affecting all four categories simultaneously |

The most likely explanation is a systemic change in revision behavior across arXiv -- possibly related to LLM-assisted bulk revision of existing papers.

Date

Analysis conducted on February 24, 2026.
