- Artificial intelligence relies on vast amounts of data for training. But Elon Musk says models have already run out of human-created data and have turned to AI-generated information to teach themselves.
AI requires an enormous amount of resources—from vast quantities of water to an estimated $1 trillion in investor dollars—but Elon Musk has warned that the technology has already depleted its primary training resource: human-generated data.
Engineers and data scientists train AI by essentially reducing the entire internet, all books, and every interesting video published into tokens that AI can digest and learn from, Musk told Mark Penn, CEO of marketing firm Stagwell, in an interview broadcast on X Wednesday. However, AI has already consumed that information and requires additional data to keep improving.
“The cumulative sum of human knowledge has been exhausted in AI training,” Musk said. “That happened basically last year.”
AI now continues to train on synthetic data, meaning information generated by AI models themselves. Musk compared the process to an AI model writing an essay and then grading it itself.
Microsoft, Google, and Meta have already used synthetic data to train their AI models. Google DeepMind trained its system AlphaGeometry to solve complex math problems using an artificially generated pool of 100 million unique examples, thus “sidestepping the data bottleneck” of human-generated information. In September, OpenAI unveiled o1, an AI model that can fact-check itself.
Musk acknowledged that the widespread use of synthetic data for model training has drawbacks. It increases the likelihood of hallucinations, or inaccurate content that AI presents as though it were entirely true.
These heaps of incomprehensible or simply incorrect information, known as AI slop, have already flooded the internet, alarming tech experts and users. Meta’s president of global affairs, Nick Clegg, said in February that the company is working to identify AI-generated content on its platforms.
“As the difference between human and synthetic content gets blurred, people want to know where the boundary lies,” Clegg wrote in a company blog post.
Musk did not respond to Fortune’s request for comment.
Scientists agree: Human data is finite
The limited availability of human-generated data for AI training has become a widely accepted issue in the tech community.
A study released in June by research group Epoch AI predicted that tech companies would run out of publicly available content to train AI language models between 2028 and 2032—a more conservative timeline than Musk’s claim that the supply was exhausted last year. Limited training resources may slow the current rate of AI development.
“There is a serious bottleneck here,” Tamay Besiroglu, a study author, told the Associated Press. “If you start running into data constraints, you won’t be able to scale up your models efficiently. And scaling up models has probably been the most important way to increase their capabilities and improve the quality of their output.”
Human-created information is becoming scarce not only because AI is digesting it all, but also because some data owners are restricting AI’s use of it. In July, the MIT-led Data Provenance Initiative published a study revealing that the once-vast well of data for AI training was running dry.
Researchers examined 14,000 web domains used in data sets for AI training and discovered that the online sources behind some of the data sets were restricting their use—in some cases by 45%—to prevent bots from scraping their data.
The restrictions are part of a trend in which data owners are becoming increasingly concerned about AI using their information and want to be fairly compensated for it.
The future of AI training
Tech companies may be running out of fresh human-generated data for AI training, but they are not without options.
“I don’t think anyone is panicking at the large AI companies,” Pablo Villalobos, lead author of the Epoch AI study, said in an interview with the scientific journal Nature. “Or at least they don’t email me if they are.”
AI developers have turned not only to synthetic data but also to private data sets and licensing agreements with publishers to gain access to their content.
According to the New York Times, OpenAI even had employees transcribe podcasts and YouTube videos to collect more training data, potentially infringing copyright law. OpenAI did not immediately respond to Fortune’s request for comment.
Still, synthetic data remains the future of AI training. OpenAI CEO Sam Altman told the Sohn Conference Foundation in 2023 that the company would run out of content to feed its models, but that as synthetic data production improves, it will help solve the content crisis.
“As long as you can get over the synthetic data event horizon where the model is good enough to create good synthetic data, I think you should be alright,” Altman said.