In a recent conversation at CES, Elon Musk said that AI developers have essentially used up the real-world data available for training models, and that synthetic data is the way forward. His view echoes that of Ilya Sutskever, OpenAI’s former chief scientist, who has said the AI field has reached “peak data.”
A New Direction for AI
Musk argues that the supply of human-created training data was effectively exhausted in 2024. Musk, who runs Tesla and founded xAI, said the most effective way to keep advancing AI is to let models generate their own training data. In this approach, a model produces data, evaluates its own output, and keeps learning from the results.
Big Tech Joins the Movement
Many major tech companies already use synthetic data. Microsoft’s recently open-sourced Phi-4 model was trained on a mix of synthetic and real data, and Google has taken a similar approach with its Gemma models. Anthropic’s Claude 3.5 Sonnet and Meta’s latest Llama series were also developed with the help of AI-generated data.
Predictions and Costs
Gartner analysts estimate that about 60 percent of the data used for AI and analytics projects in 2024 was synthetic. Cost is a major driver of the shift. The AI startup Writer says its Palmyra X 004 model cost close to $700,000 to develop, compared with an estimated $4.6 million for a comparable model from OpenAI.
Challenges Ahead
However, synthetic data has drawbacks. Researchers warn of “model collapse,” in which models trained on AI-generated data become less creative and more biased over time. The risk arises because any biases in the original dataset can be amplified each time a model generates new data from its own output.
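The amplification effect described above can be illustrated with a toy simulation. This is a minimal sketch, not how any real model is trained: the "model" is just a table of category probabilities, and each generation it is re-estimated from a finite sample of its own output. The function names, category labels, and parameter values are all hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def retrain_on_own_output(probs, sample_size, rng):
    """One generation: sample from the current model, then re-estimate it from those samples."""
    categories = list(probs)
    weights = [probs[c] for c in categories]
    samples = rng.choices(categories, weights=weights, k=sample_size)
    counts = Counter(samples)
    # A category that was never sampled gets probability 0 and can never come back.
    return {c: counts[c] / sample_size for c in categories}


def simulate(initial_probs, generations=30, sample_size=50, seed=0):
    """Return the list of distributions, starting with the original (generation 0)."""
    rng = random.Random(seed)
    history = [dict(initial_probs)]
    for _ in range(generations):
        history.append(retrain_on_own_output(history[-1], sample_size, rng))
    return history


def surviving(probs):
    """Count the categories the model can still produce."""
    return sum(1 for p in probs.values() if p > 0)


if __name__ == "__main__":
    # Hypothetical starting distribution: one dominant category and a few rare ones.
    initial = {"common": 0.70, "less_common": 0.20, "rare": 0.07, "very_rare": 0.03}
    history = simulate(initial)
    for gen in (0, 10, 20, 30):
        print(gen, surviving(history[gen]), history[gen])
```

In this toy setup the number of surviving categories can only stay the same or shrink from one generation to the next, since a category that drops to zero probability is never sampled again. That is a simplified picture of how diversity can be lost when a model learns only from its own output.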