The problem is that the types of data typically used for training language models could be depleted in the near future, possibly as early as 2026, according to a paper (not yet peer reviewed) by researchers at Epoch, an AI research and forecasting organization. The issue arises because as researchers build more powerful models with greater capabilities, they have to find ever more text to train them on. Large language model researchers are increasingly worried about running out of this kind of data, says Teven Le Scao, a researcher at the AI company Hugging Face, who was not involved in Epoch's work.
Part of the problem is that language AI researchers filter the data they use to train models into two categories: high quality and low quality. The line between the two can be blurry, says Pablo Villalobos, an Epoch researcher and the paper's lead author, but text in the former category is considered better written and is often produced by professional writers.
Data in the low-quality category consists of text such as social media posts or comments on websites like 4chan, and it far outnumbers the data considered high quality. Researchers typically train models only on data that falls into the high-quality category, because that is the kind of language they want the models to reproduce. This approach has produced impressive results for large language models such as GPT-3.
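To make the filtering idea concrete, here is a minimal sketch of heuristic quality filtering. The specific rules and thresholds are illustrative assumptions; real pipelines (Epoch's paper describes lab practice, not this code) use far more sophisticated classifiers.

```python
# Hypothetical quality filter: keep documents that pass simple heuristics.
# The heuristics and thresholds below are assumptions for illustration,
# not any lab's actual filtering rules.

def looks_high_quality(doc: str) -> bool:
    words = doc.split()
    if len(words) < 5:  # too short to carry useful signal
        return False
    avg_word_len = sum(len(w) for w in words) / len(words)
    if not (3 <= avg_word_len <= 10):  # likely gibberish or token spam
        return False
    letters = sum(c.isalpha() for c in doc)
    return letters / len(doc) > 0.6  # mostly alphabetic text

corpus = [
    "The committee published its findings after a two-year review.",
    "lol!!! 111 @@@",
    "first",
]
high_quality = [d for d in corpus if looks_high_quality(d)]
# Only the well-formed sentence survives the filter.
```

In practice the "high quality" bucket is far smaller than the raw corpus, which is exactly the scarcity problem the article describes.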
One way to overcome these data constraints would be to reevaluate what is defined as "low" and "high" quality, according to Swabha Swayamdipta, a machine learning professor at the University of Southern California who specializes in dataset quality. If data shortages push AI researchers to incorporate more diverse datasets into the training process, that would be a "net positive" for language models, says Swayamdipta.
Researchers may also find ways to extend the life of the data used to train language models. Currently, large language models are trained on the same data only once, owing to performance and cost constraints. But it may be possible to train a model several times on the same data, says Swayamdipta.
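The idea of reusing data can be sketched with a toy training loop that makes several passes (epochs) over the same dataset. The one-parameter model and learning rate below are assumptions chosen for brevity; this is not how large language models are actually trained, only an illustration of what revisiting the same data means.

```python
# Toy illustration: repeated epochs over the *same* data versus a single pass.
# Model: fit w in y = w * x by per-example gradient descent on squared error.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true relationship: y = 2x

def train(epochs: int, lr: float = 0.05) -> float:
    w = 0.0
    for _ in range(epochs):          # each epoch revisits the same examples
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)**2
            w -= lr * grad
    return w

w_one_pass = train(epochs=1)
w_many_passes = train(epochs=50)
# More passes over the same data move w closer to the true value 2.0.
```

The single pass leaves the parameter noticeably off target, while repeated passes converge; the caveat for real language models is that too many repetitions risk memorization rather than generalization.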
Some researchers believe that bigger may not be better when it comes to language models anyway. Percy Liang, a computer science professor at Stanford University, says there is evidence that making models more efficient, rather than simply larger, may improve their capabilities.
“We’ve seen how smaller models trained on higher quality data can outperform larger models trained on lower quality data,” he explains.