Artificial Intelligence Index Report 2025

400+ pages, covering the timeline up to the end of 2024.

Lots of charts showing that AI is becoming faster, cheaper, more reliable, etc.

On the first pass, the section "Will Models Run Out of Data?" caught my eye.

  • there is an estimate that the stock of existing human-generated public text (of sufficiently high quality to be used for training) might be depleted at some point between 2026 and 2032 (source).
  • that's more optimistic than the previous estimate of 2025, thanks to an updated methodology:
    • it was found that web data might be reliable enough to use alongside human-curated corpora (published scientific papers or books), which led to a 5x increase in the estimate
    • it was found that models can be trained on the same data several times without degradation, which increased the estimate by 2x-5x
  • it's interesting how they estimate the number of tokens (1 token ~ 0.8 English words on average; one image or 1 sec of video ~ 30 tokens each):
    • CommonCrawl index of open web data ~ 130 trillion tokens
    • the indexed web ~ 510 trillion
    • the entire web ~ 3,100 trillion
    • the total stock of images ~ 300 trillion
    • the total stock of video ~ 1,350 trillion
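
To get a feel for these numbers, here is a small back-of-the-envelope sketch that converts the token estimates above back into words, images and seconds of video. It only uses the rules of thumb quoted in the list (0.8 words per token, 30 tokens per image or per second of video); the names `STOCKS_TRILLION_TOKENS`, `tokens_to_words` and `tokens_to_items` are my own and are not taken from the report.

```python
# Token stock estimates quoted in the list above, in trillions of tokens.
STOCKS_TRILLION_TOKENS = {
    "CommonCrawl": 130,
    "indexed web": 510,
    "entire web": 3100,
    "images": 300,
    "video": 1350,
}

# Rules of thumb quoted above (rough averages, not exact values).
WORDS_PER_TOKEN = 0.8
TOKENS_PER_IMAGE_OR_VIDEO_SECOND = 30


def tokens_to_words(trillion_tokens):
    """Convert trillions of text tokens to trillions of English words."""
    return trillion_tokens * WORDS_PER_TOKEN


def tokens_to_items(trillion_tokens):
    """Convert trillions of tokens to trillions of images (or video seconds)."""
    return trillion_tokens / TOKENS_PER_IMAGE_OR_VIDEO_SECOND


if __name__ == "__main__":
    for name in ("CommonCrawl", "indexed web", "entire web"):
        words = tokens_to_words(STOCKS_TRILLION_TOKENS[name])
        print(f"{name}: ~{words:,.0f} trillion words")

    images = tokens_to_items(STOCKS_TRILLION_TOKENS["images"])
    seconds = tokens_to_items(STOCKS_TRILLION_TOKENS["video"])
    print(f"images: ~{images:,.0f} trillion images")
    print(f"video: ~{seconds:,.0f} trillion seconds")
```

So, under these assumptions, CommonCrawl alone is on the order of a hundred trillion English words, and the image and video stocks correspond to tens of trillions of images and seconds of video.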

For now, there is enough data to train on. Possible improvements:

  • use synthetic data
  • learn from images/video
  • improve data efficiency/develop new training approaches
