Artificial Intelligence Index Report 2025
400+ pages, covering the timeline up to the end of 2024.
Lots of charts showing that AI keeps getting faster, cheaper, more reliable, etc.
On a first pass, the section that caught my eye was "Will Models Run Out of Data?".
- there is an estimate that the existing stock of human-generated public text (of sufficiently high quality to be used for training) might be depleted at some point between 2026 and 2032 (source)
- that's more optimistic than the previous estimate of 2025, thanks to an updated methodology:
- it was found that web data might be reliable for training, not just human-curated corpora (published scientific papers or books), which raised the estimate ~5x
- it was found that models can be trained for several epochs on the same data without degradation, which raised the estimate another 2x-5x (a back-of-envelope sketch of the combined effect follows this list)
- it's interesting how they estimate the number of tokens (1 token ~ 0.8 English words on average; one image or one second of video ~ 30 tokens; a small conversion sketch also follows the list):
- CommonCrawl index of open web data ~ 130 trillion tokens
- the indexed web ~ 510 trillion
- the entire web ~ 3,100 trillion
- the total stock of images ~ 300 trillion
- the total stock of video ~ 1,350 trillion
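
To make those conversion rates concrete, here is a minimal Python sketch using only the heuristics and stock figures quoted above; the example inputs (1 billion words, 1 billion images, one hour of video) are arbitrary illustrations of mine, not numbers from the report.

```python
# Rough token accounting with the report's conversion heuristics:
# ~0.8 English words per token, ~30 tokens per image, ~30 tokens per second of video.

WORDS_PER_TOKEN = 0.8
TOKENS_PER_IMAGE = 30
TOKENS_PER_VIDEO_SECOND = 30

def tokens_from_words(n_words: float) -> float:
    """Approximate token count for a given number of English words."""
    return n_words / WORDS_PER_TOKEN

def tokens_from_images(n_images: float) -> float:
    return n_images * TOKENS_PER_IMAGE

def tokens_from_video_seconds(seconds: float) -> float:
    return seconds * TOKENS_PER_VIDEO_SECOND

# Illustrations of the heuristics (inputs here are arbitrary, not from the report):
print(f"1 billion words   ~ {tokens_from_words(1e9) / 1e9:.2f} billion tokens")
print(f"1 billion images  ~ {tokens_from_images(1e9) / 1e9:.0f} billion tokens")
print(f"one hour of video ~ {tokens_from_video_seconds(3600):,.0f} tokens")

# The stock figures quoted above, already expressed in tokens:
stocks = {
    "CommonCrawl (open-web index)": 130e12,
    "indexed web": 510e12,
    "entire web": 3_100e12,
    "total image stock": 300e12,
    "total video stock": 1_350e12,
}
for name, tokens in stocks.items():
    print(f"{name}: {tokens / 1e12:,.0f} trillion tokens")
```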
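
And a back-of-envelope sketch of how the two methodology changes compound. Only the 5x and 2x-5x factors come from the bullets above; the baseline stock is a placeholder assumption, and the report does not necessarily combine the factors exactly this way.

```python
# Back-of-envelope: both methodology changes act as multipliers on the
# effective stock of usable training text.
# The 5x and 2x-5x factors are from the report; the baseline is a placeholder.

BASELINE_STOCK_TOKENS = 100e12   # hypothetical pre-update estimate of usable text

WEB_DATA_FACTOR = 5              # counting suitable web data, not only curated corpora: ~5x
MULTI_EPOCH_FACTOR = (2, 5)      # several passes over the same data without degradation: 2x-5x

low = BASELINE_STOCK_TOKENS * WEB_DATA_FACTOR * MULTI_EPOCH_FACTOR[0]
high = BASELINE_STOCK_TOKENS * WEB_DATA_FACTOR * MULTI_EPOCH_FACTOR[1]

print(f"effective stock: {low / 1e12:,.0f}-{high / 1e12:,.0f} trillion tokens "
      f"({WEB_DATA_FACTOR * MULTI_EPOCH_FACTOR[0]}x-{WEB_DATA_FACTOR * MULTI_EPOCH_FACTOR[1]}x the baseline)")
# A 10x-25x larger effective stock is roughly what moves the projected
# depletion date from ~2025 out to the 2026-2032 window.
```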
For now, there is enough data to train on; possible ways to extend it:
- use synthetic data
- learn from images/video
- improve data efficiency / develop new training approaches