Most people interested in generative AI probably already know that large language models (LLMs) – such as those behind ChatGPT, Anthropic's Claude and Google's Gemini – are trained on massive datasets: billions of words drawn from websites, books, code and, increasingly, other media such as images, audio and video. But why?
From this data, LLMs develop a statistical, generalized understanding of language, its patterns and the world – encoded in the form of billions of parameters in a network of artificial neurons (mathematical functions that transform input data into output signals).
By being exposed to all this training data, LLMs learn to detect and generalize patterns, which are reflected in the parameters of their neurons. For example, the word "apple" often appears near terms related to food, fruit or trees, and sometimes computers. The model picks up that apples can be red, green or yellow – or occasionally other colors if rotten or rare – are spelled "apple" in English, and are edible. This statistical knowledge influences how the model responds when a user enters a prompt, shaping the output it generates based on the associations it "learned" from the training data.
But a big question – even among AI researchers – remains: how much of an LLM's training data is used to build generalized representations of concepts, and how much is instead memorized verbatim, stored in a form identical or nearly identical to the original data?
This matters not only for better understanding how LLMs work – and when they go wrong – but also because model providers are defending themselves in lawsuits brought by data creators and rights owners, such as artists and record labels. If LLMs turn out to reproduce significant portions of their training data word for word, courts could be more likely to side with plaintiffs arguing that the models unlawfully copied protected material. If they do not – if the models generate outputs based on generalized patterns rather than exact replication – developers may be able to keep scraping and training on copyrighted data under existing legal defenses such as fair use.
Now, we finally have an answer to the question of how much LLMs memorize versus generalize: a new study released this week by researchers from Meta, Google DeepMind, Cornell University and NVIDIA finds that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.
To understand what 3.6 bits means in practice: a single bit encodes one of two values, so 3.6 bits can distinguish roughly 12 distinct values (2^3.6 ≈ 12.13) – about the information needed to pick a month of the year or the outcome of a 12-sided die roll. It is less than half a byte, and not enough to store even one full ASCII character (which takes 8 bits).
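A quick sanity check of that figure in Python (plain arithmetic, nothing model-specific):

# Plain arithmetic: what does a capacity of 3.6 bits per parameter amount to?
bits_per_param = 3.6

distinct_values = 2 ** bits_per_param          # how many values 3.6 bits can distinguish
print(f"{distinct_values:.2f} distinguishable values per parameter")   # ~12.13

print(f"{bits_per_param / 8:.3f} bytes per parameter")                 # 0.450, under half a byte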
This number holds across reasonable architectural variations: different depths, widths and precisions produced similar results. The estimate stayed stable across model sizes and even precision levels, with full-precision models reaching slightly higher values (up to 3.83 bits per parameter).
One of the main takeaways is that models do not memorize more when trained on more data. Instead, a model's fixed capacity is distributed across the dataset, meaning each individual data point receives less attention.
Jack Morris, the lead author, explained via the social network X that "training on more data will require models to memorize less per sample."
These findings may help ease concerns about large models memorizing copyrighted or sensitive content.
If memorization is limited and diluted across many examples, the probability of reproducing any one specific training example decreases. In essence, more training data leads to safer generalization behavior, not increased risk.
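Here is a minimal sketch of that dilution effect, assuming the paper's fixed 3.6 bits-per-parameter figure; the model size and dataset sizes below are hypothetical, chosen only to show the trend:

# Dilution sketch: a model's memorization budget is roughly fixed,
# so the share available to each training example shrinks as the dataset grows.
# The 3.6 bits/parameter figure is the study's estimate; the model size and
# dataset sizes below are hypothetical.
BITS_PER_PARAM = 3.6
n_params = 1_500_000_000                      # e.g. a 1.5B-parameter model
total_capacity_bits = BITS_PER_PARAM * n_params

for n_examples in (10**6, 10**8, 10**10, 10**12):
    per_example_bits = total_capacity_bits / n_examples
    print(f"{n_examples:>16,} examples -> ~{per_example_bits:,.4f} bits per example")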
To precisely quantify how much language models memorize, the researchers used an unconventional but powerful approach: they trained transformer models on datasets composed of uniformly random bitstrings. Each of these bitstrings was sampled independently, ensuring that no patterns, structure or redundancy existed across examples.
Because each sample is unique and devoid of shared features, any ability the model shows to reconstruct or identify these strings during evaluation directly reflects how much information it retained – memorized – during training.
The main reason for this setup was to eliminate the possibility of generalization entirely. Unlike natural language – which is full of grammatical structure, semantic overlap and recurring concepts – uniform random data contains no such information. Each example is essentially noise, with no statistical relationship to any other. In such a scenario, any performance the model achieves on test data must come purely from memorization of the training examples, because there is no distributional pattern to generalize from.
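A rough sketch of the kind of synthetic data this setup describes – uniformly random bitstrings with no shared structure. The sequence length, vocabulary and dataset size here are arbitrary illustrative choices, not the paper's actual configuration:

import torch

# Synthetic-data sketch: uniformly random bitstrings, each sampled independently,
# so there is no structure to generalize from – anything the model can reproduce
# at evaluation time must have been memorized.
torch.manual_seed(0)

seq_len = 64                                   # tokens per example
n_examples = 10_000                            # independent random strings

dataset = torch.randint(0, 2, (n_examples, seq_len))   # uniform 0/1 tokens

# Upper bound on what could possibly be memorized from this data:
# each uniform binary token carries exactly one bit of entropy.
total_entropy_bits = n_examples * seq_len
print(f"Dataset entropy: {total_entropy_bits:,} bits")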
The authors argue that their method may be one of the only principled ways to decouple memorization from learning in practice, because when LLMs are trained on real language, even when they produce an output that matches the training data, it is hard to know whether they memorized the input or merely inferred the underlying structure from the patterns they observed.
This method allowed the researchers to map a direct relationship between the number of model parameters and the total information stored. By gradually increasing model size and training each variant to saturation, across hundreds of experiments on models ranging from 500K to 1.5 billion parameters, they observed consistent results: 3.6 bits stored per parameter, which they report as a fundamental measure of LLM memory capacity.
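One simple way to read a bits-per-parameter figure off such a sweep is to fit a line through (parameter count, estimated memorized bits) pairs and take the slope. The sketch below uses made-up placeholder measurements purely for illustration – it is not the paper's estimation procedure or its data:

import numpy as np

# Reading a capacity figure off a model-size sweep: estimate the bits memorized
# by each saturated model, then fit a line through (parameter count, memorized bits);
# the slope is the capacity in bits per parameter.
param_counts   = np.array([5e5, 2e6, 1e7, 1e8, 1.5e9])                 # hypothetical sweep
memorized_bits = np.array([1.8e6, 7.1e6, 3.7e7, 3.6e8, 5.4e9])         # placeholder estimates

slope, intercept = np.polyfit(param_counts, memorized_bits, 1)
print(f"Estimated capacity: ~{slope:.2f} bits per parameter")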
The team also applied their methodology to models trained on real-world datasets. When trained on text, the models exhibited a balance between memorization and generalization.
Smaller datasets encouraged more memorization, but as dataset size increased, the models shifted toward learning generalizable patterns. This transition was marked by a phenomenon known as "double descent," where performance temporarily dips before improving once generalization kicks in.
The study also examined how model precision – comparing training in bfloat16 versus float32 – affects memorization capacity. They observed a modest increase from 3.51 to 3.83 bits per parameter when switching to full 32-bit precision. However, this roughly 9% gain is far smaller than the doubling of available bits would suggest, implying diminishing returns from higher precision.
The paper proposes a scaling law that relates a model's capacity and dataset size to the effectiveness of membership inference attacks.
These attacks attempt to determine whether a particular data point was part of a model's training set. The research shows that such attacks become unreliable as dataset size grows, supporting the argument that large-scale training helps reduce privacy risk.
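For context, one common and simple form of membership inference thresholds each example's loss: training-set members tend to receive lower loss than unseen data. The sketch below illustrates that generic loss-based approach under assumed interfaces (a model that returns next-token logits); it is not the specific attack analyzed in the study:

import torch
import torch.nn.functional as F

def loss_based_membership_scores(model, token_batches):
    """Generic loss-thresholding membership-inference sketch (not the paper's
    formulation): score each example by its average next-token loss. Lower loss
    (higher score) suggests the example was more likely seen during training.
    Assumes `model(tokens)` returns logits of shape (batch, seq_len, vocab_size)."""
    scores = []
    model.eval()
    with torch.no_grad():
        for tokens in token_batches:                      # tokens: (batch, seq_len) LongTensor
            logits = model(tokens)
            per_token_loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from position t
                tokens[:, 1:].reshape(-1),
                reduction="none",
            ).view(tokens.size(0), -1)
            scores.append(-per_token_loss.mean(dim=1))    # higher score = more "member-like"
    return torch.cat(scores)

# Usage sketch (hypothetical model and data):
#   scores = loss_based_membership_scores(my_model, my_batches)
#   predicted_members = scores > threshold   # threshold calibrated on known non-members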
Although the paper focuses on average-case behavior, some researchers have pointed out that certain kinds of data – such as highly unique or stylized writing – may still be more susceptible to memorization.
The authors acknowledge this limitation and emphasize that their method is designed to characterize general trends rather than edge cases.
By introducing a principled, quantifiable definition of memorization, the study gives developers and researchers new tools for evaluating the behavior of language models. This helps not only with model transparency but also with compliance, privacy and ethical standards in AI development. The findings suggest that more data – not less – may be the safer path when training large-scale language models.
To put the models' total memorization into perspective: at 3.6 bits per parameter, a 500K-parameter model can store roughly 1.8 million bits, or about 225 KB of raw information, while a 1.5-billion-parameter model can store about 5.4 billion bits, or roughly 675 MB – far less than the size of the datasets such models are typically trained on.
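The unit conversion behind those figures, as a small Python sketch:

# Unit conversion: total capacity for the smallest and largest models in the study's range.
BITS_PER_PARAM = 3.6

for n_params in (500_000, 1_500_000_000):
    total_bits = n_params * BITS_PER_PARAM
    print(f"{n_params:>13,} params -> {total_bits:,.0f} bits ≈ {total_bits / 8 / 1e6:,.3f} MB")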
I am neither a lawyer nor a legal expert, but I would very much expect this research to be cited in the many ongoing lawsuits between AI providers and data creators/rights owners.