The Myth of Data Inefficiency in Large Language Models
A popular assertion among cognitive psychologists is that large language models (LLMs) rely on a “developmentally implausible” form of language learning. There are many points of discordance between AI and human development, but data inefficiency is one of the most commonly discussed. Specifically, it has been claimed that LLMs “receive around four or five orders of magnitude more language data than human children” (Frank, 2023).
In concrete terms: the 7-billion-parameter Llama-2 model was pre-trained on 2 trillion tokens, whereas the 100-million-token BabyLM dataset of Warstadt et al. (2023) was put forward as a more developmentally plausible amount of “pre-training data” for a human adolescent.
The data inefficiency argument misses two key details, which together suggest that LLMs may not be so far behind humans in data efficiency. First, it neglects an important consequence of LLM scaling laws: to achieve the same level of language fluency, larger models require significantly less data than smaller models. This is an easy point to miss if one only follows AI headlines, which have fixated on the major commercial labs’ strategies of massively scaling up compute, data, and model size.
The foundational scaling law papers by Kaplan et al. (2020) and Hoffmann et al. (2022) are primarily motivated by the question: given a fixed compute budget, how big a model should be trained to achieve the lowest loss? To further improve the loss, more compute can be spent either on a bigger model or on longer training, so a principled approach is needed for balancing this tradeoff.
Kaplan et al. (2020) point out in their Figure 2 (p. 4) that larger models learn more quickly, though with the sparsely labeled x-axis it’s hard to get a sense of the exact relationship between data and parameters. Fortunately for us, the data from a similar set of experiments in Hoffmann et al. (2022) has been extracted in a replication study (Besiroglu et al., 2024), and we can plot another version with dataset size as the dependent variable:
![Dataset size versus model parameters, based on the Hoffmann et al. (2022) data extracted by Besiroglu et al. (2024).](/assets/2025-02-14-data-inefficiency-llms/plot.png)
In the context of language learning, models like Llama-2 7B already achieve a level of basic language understanding roughly on par with humans. (Remember, we’re looking only at the acquisition of language, not the mastery of every intellectual activity that involves language.) So how much data would a brain-sized LLM need? A crude comparison that matches LLM parameter count to synapse count suggests that Llama-2 7B is ~14,000x smaller than the brain, which is generally thought to contain ~100T synapses. Plugging this into the revised scaling law of Besiroglu et al. (2024) gives a dataset of ~60B tokens, which is 34x smaller than the 2T tokens needed for Llama-2 7B. A brain-sized (i.e., 100T-parameter) LLM is quite feasible with today’s hardware: roughly 10 days of training on 10,000 GB200 GPUs.
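As a sanity check, here is a minimal sketch of that calculation, using the parametric loss L(N, D) = E + A/N^α + B/D^β from Hoffmann et al. (2022). The constants below are approximate values I attribute to the Besiroglu et al. (2024) refit, so treat them as assumptions and the output as a back-of-the-envelope estimate rather than a precise prediction.

```python
# Rough iso-loss estimate: how many tokens would a brain-sized model need
# to match the loss of Llama-2 7B trained on 2T tokens?
# Constants are approximate values attributed to the Chinchilla refit of
# Besiroglu et al. (2024); treat them as assumptions.

E, A, B = 1.82, 482.0, 2085.0   # irreducible loss and fit coefficients (assumed)
alpha, beta = 0.3478, 0.3658    # parameter and data exponents (assumed)

def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / n_params**alpha + B / n_tokens**beta

def tokens_to_match(target_loss: float, n_params: float) -> float:
    """Solve L(N, D) = target_loss for D at a fixed parameter count N."""
    reducible = target_loss - E - A / n_params**alpha
    return (B / reducible) ** (1 / beta)

llama_loss = loss(7e9, 2e12)                         # Llama-2 7B: 7B params, 2T tokens
brain_tokens = tokens_to_match(llama_loss, 100e12)   # "brain-sized": ~100T params
print(f"tokens for a 100T-param model: {brain_tokens:.2e}")  # roughly 6e10
```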
Furthermore, a fixed data budget can be stretched much further by training for multiple randomly ordered passes, or “epochs”. Muennighoff et al. (2023) find that text data can be reused up to four times without noticeable model degradation, and increasing the compute budget allows for even more aggressive data reuse. So even with a totally ordinary model architecture (the Transformer) and training algorithm (gradient descent), basic language understanding can likely be reached with ~15B unique tokens, and fewer still if you’re willing to spend more compute on more passes over the training dataset. This already brings LLMs within 2 orders of magnitude of humans, a 100x to 1000x smaller gap than typically claimed.
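Continuing the back-of-the-envelope arithmetic, with the ~60B-token estimate from above and the four-epoch reuse figure from Muennighoff et al. (2023):

```python
# Unique tokens needed after multi-epoch reuse, compared against the ~100M
# tokens the BabyLM organizers cite as plausible for a human adolescent.
tokens_needed = 60e9   # iso-loss estimate for a brain-sized model (from above)
max_epochs = 4         # reuse without noticeable degradation (Muennighoff et al., 2023)
human_tokens = 100e6   # BabyLM estimate for a 13-year-old

unique_tokens = tokens_needed / max_epochs   # ~15B unique tokens
gap = unique_tokens / human_tokens           # ~150x, i.e. ~2 orders of magnitude
print(f"unique tokens: {unique_tokens:.1e}, gap vs. human: {gap:.0f}x")
```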
More fundamentally, the data inefficiency argument also ignores hundreds of thousands of years of evolutionary development. Developmental psychologists have written extensively on the importance of non-learned “core knowledge” (Spelke, 2007) for enabling within-lifetime learning of language. Machine learning researchers have similar concepts of “inductive bias” or “statistical priors”, which are often specified by hand for computational models. But Nature was afforded no such help in establishing human core knowledge! Instead, it evolved across the millennia that separate our species from earlier hominids.
By some accounts, language and culture co-evolved with our species starting around 200,000 years ago, which corresponds to roughly 10,000 generations. Before an individual human has seen a single token of language, the information in their genetic code has already “experienced” the language tokens seen, prior to each act of reproduction, by the roughly 20,000 ancestors in their direct lineage (two parents in each of those 10,000 generations). It is only by ignoring this ancestral data that one can conclude that LLMs are vastly less data efficient than humans.
Our early ancestors likely used language less than we do today, so conservatively assuming 100M tokens per ancestor (i.e., what the BabyLM organizers cite as plausible for a modern 13-year-old), we arrive at 2T tokens total within one’s direct lineage. Also critical to the process of natural selection is the rest of the species: roughly 100B humans are estimated to have ever lived, each learning from even more language data.
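To make the ancestral accounting explicit, here is the same estimate as a few lines of arithmetic; every number is simply the rough assumption stated above, not a measured quantity.

```python
# Rough tally of the language data "seen" by one person's direct lineage.
years_of_language = 200_000     # assumed onset of language/culture co-evolution
years_per_generation = 20       # implied by ~10,000 generations in 200,000 years
parents_per_generation = 2
tokens_per_ancestor = 100e6     # BabyLM-style per-lifetime estimate

generations = years_of_language // years_per_generation   # 10,000
ancestors = generations * parents_per_generation          # 20,000
lineage_tokens = ancestors * tokens_per_ancestor          # ~2e12, i.e. 2T tokens
print(f"ancestors: {ancestors:,}, lineage tokens: {lineage_tokens:.1e}")
```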
To summarize, our very rough estimates hold that an LLM the size of a human brain would only require pre-training on ~15B unique tokens (reused over a few epochs) to reach the language fluency of Llama-2 7B, whereas the development of human language learning has already consumed at least 2T tokens of across-lifetime learning. And as DeepSeek showed with the R1 model, developing additional skills such as mathematical reasoning requires several orders of magnitude less data than pre-training. Perhaps this shouldn’t be too surprising, given that evolution is a zeroth-order algorithm while gradient descent is a more powerful first-order algorithm.
The pursuit of radically more data-efficient language model training is motivated by a flawed premise that misjudges current progress in AI. Instead, researchers studying both artificial and human intelligence need to consider the possibility that deep learning might already be as powerful as, if not more powerful than, human learning. Two implications follow: 1) LLMs are a better model of human language learning than many researchers currently believe, and 2) closely emulating human-like patterns of learning and thinking is within reach.
References
Besiroglu, T., et al. (2024). Chinchilla Scaling: A Replication Attempt.
Frank, M. C. (2023). Bridging the data gap between children and large language models.
Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models.
Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models.
Muennighoff, N., et al. (2023). Scaling Data-Constrained Language Models.
Spelke, E. S. (2007). Core knowledge.
Warstadt, A., et al. (2023). Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora.