Artificial Intelligence Index Report 2025
400+ pages, covering the timeline up to the end of 2024.
Lots of charts showing that AI keeps getting faster, cheaper, more reliable, etc.
On a first pass, the section that caught my eye was "Will Models Run Out of Data?".
- there is an estimate that the existing stock of human-generated public text (of sufficiently high quality to be used for training) might be depleted at some point between 2026 and 2032 (source)
- that's more optimistic than the previous estimate of 2025, thanks to an updated methodology:
- it was found that web data might be reliable for training, not just human-curated corpora (published scientific papers or books), which raised the estimate ~5x
- it was found that models can be trained for several epochs on the same data without degradation, which raised the estimate another 2x-5x (a back-of-envelope sketch of the combined effect follows this list)
- it's interesting how they estimate the number of tokens (1 token ~ 0.8 English words on average; one image or one second of video ~ 30 tokens; a small conversion sketch also follows the list):
- CommonCrawl index of open web data ~ 130 trillion tokens
- the indexed web ~ 510 trillion
- the entire web ~ 3,100 trillion
- the total stock of images ~ 300 trillion
- the total stock of video ~ 1,350 trillion
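
To make those conversion rates concrete, here is a minimal Python sketch using only the heuristics and stock figures quoted above; the example inputs (1 billion words, 1 billion images, one hour of video) are arbitrary illustrations of mine, not numbers from the report.

```python
# Rough token accounting with the report's conversion heuristics:
# ~0.8 English words per token, ~30 tokens per image, ~30 tokens per second of video.

WORDS_PER_TOKEN = 0.8
TOKENS_PER_IMAGE = 30
TOKENS_PER_VIDEO_SECOND = 30

def tokens_from_words(n_words: float) -> float:
    """Approximate token count for a given number of English words."""
    return n_words / WORDS_PER_TOKEN

def tokens_from_images(n_images: float) -> float:
    return n_images * TOKENS_PER_IMAGE

def tokens_from_video_seconds(seconds: float) -> float:
    return seconds * TOKENS_PER_VIDEO_SECOND

# Illustrations of the heuristics (inputs here are arbitrary, not from the report):
print(f"1 billion words   ~ {tokens_from_words(1e9) / 1e9:.2f} billion tokens")
print(f"1 billion images  ~ {tokens_from_images(1e9) / 1e9:.0f} billion tokens")
print(f"one hour of video ~ {tokens_from_video_seconds(3600):,.0f} tokens")

# The stock figures quoted above, already expressed in tokens:
stocks = {
    "CommonCrawl (open-web index)": 130e12,
    "indexed web": 510e12,
    "entire web": 3_100e12,
    "total image stock": 300e12,
    "total video stock": 1_350e12,
}
for name, tokens in stocks.items():
    print(f"{name}: {tokens / 1e12:,.0f} trillion tokens")
```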
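
And a back-of-envelope sketch of how the two methodology changes compound. Only the 5x and 2x-5x factors come from the bullets above; the baseline stock is a placeholder assumption, and the report does not necessarily combine the factors exactly this way.

```python
# Back-of-envelope: both methodology changes act as multipliers on the
# effective stock of usable training text.
# The 5x and 2x-5x factors are from the report; the baseline is a placeholder.

BASELINE_STOCK_TOKENS = 100e12   # hypothetical pre-update estimate of usable text

WEB_DATA_FACTOR = 5              # counting suitable web data, not only curated corpora: ~5x
MULTI_EPOCH_FACTOR = (2, 5)      # several passes over the same data without degradation: 2x-5x

low = BASELINE_STOCK_TOKENS * WEB_DATA_FACTOR * MULTI_EPOCH_FACTOR[0]
high = BASELINE_STOCK_TOKENS * WEB_DATA_FACTOR * MULTI_EPOCH_FACTOR[1]

print(f"effective stock: {low / 1e12:,.0f}-{high / 1e12:,.0f} trillion tokens "
      f"({WEB_DATA_FACTOR * MULTI_EPOCH_FACTOR[0]}x-{WEB_DATA_FACTOR * MULTI_EPOCH_FACTOR[1]}x the baseline)")
# A 10x-25x larger effective stock is roughly what moves the projected
# depletion date from ~2025 out to the 2026-2032 window.
```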
For now, there is enough data to train on; possible ways to extend it:
- use synthetic data
- learn from images/video
- improve data efficiency / develop new training approaches