The future – what happens when data sources for large language models run dry

Beauty’s not enough

Data may well present the most immediate bottleneck. Epoch AI, a research outfit, estimates the well of high-quality textual data on the public internet will run dry by 2026. This has left researchers scrambling for ideas. Some labs are turning to the private web, buying data from brokers and news websites. Others are turning to the internet’s vast quantities of audio and visual data, which could be used to train ever-bigger models for decades. Video can be particularly useful in teaching AI models about the physics of the world around them. If a model can observe a ball flying through the air, it might more easily work out the mathematical equation that describes the projectile’s motion. Leading models like GPT-4 and Gemini are now “multimodal”, capable of dealing with various types of data.

When data can no longer be found, it can be made. Companies like Scale AI and Surge AI have built large networks of people to generate and annotate data, including PhD researchers solving problems in maths or biology. One executive at a leading AI startup estimates this is costing AI labs hundreds of millions of dollars per year. A cheaper approach involves generating “synthetic data” in which one LLM makes billions of pages of text to train a second model. Though that method can run into trouble: models trained like this can lose past knowledge and generate uncreative responses. A more fruitful way to train AI models on synthetic data is to have them learn through collaboration or competition. Researchers call this “self-play”. In 2017 Google DeepMind, the search giant’s AI lab, developed a model called AlphaGo that, after training against itself, beat the human world champion in the game of Go. Google and other firms now use similar techniques on their latest LLMs.

Extending ideas like self-play to new domains is a hot topic of research. But most real-world problems—from running a business to being a good doctor—are more complex than a game, without clear-cut winning moves. This is why, for such complex domains, data to train models is still needed from people who can differentiate between good- and bad-quality responses. This in turn slows things down.

From The Economist

Large language models are getting bigger and better. Can they keep improving forever? 17 April 2024 (updated Apr 18th 2024)