
How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell


Most people interested in generative AI probably already know that large language models (LLMs) – such as those behind ChatGPT, Anthropic's Claude and Google's Gemini – are trained on massive datasets: billions of words drawn from websites, books, codebases and, increasingly, other media such as images, audio and video. But why?

From this data, LLMs develop a statistical, generalized understanding of language, its patterns and the world – encoded in the form of billions of parameters, or "weights," in a network of artificial neurons (mathematical functions that transform input data into output signals).

By being exposed to all this training data, LLMs learn to detect and generalize patterns, which are reflected in the parameters of their neurons. For example, the word "apple" often appears near terms related to food, fruit or trees, and sometimes computers. The model picks up that apples can be red, green or yellow, or occasionally other colors if rotten or rare, are spelled "apple" in English, and are edible. This statistical knowledge influences how the model responds when a user enters a prompt – shaping the output it generates based on the associations it has "learned" from the training data.

But a big question – even among AI researchers – remains: how much of an LLM's training data is used to build generalized representations of concepts, and how much is instead memorized verbatim, i.e., stored in a way that is identical or nearly identical to the original data?

This matters not only for better understanding how LLMs work – and when they go wrong – but also as model providers defend themselves in lawsuits brought by data creators and rights owners, such as artists and record labels. If LLMs are shown to reproduce significant portions of their training data word for word, courts could be more likely to side with plaintiffs arguing that the models unlawfully copied protected material. If not – if the models are found to generate outputs based on generalized patterns rather than exact replication – developers may be able to keep scraping and training on copyrighted data under existing legal defenses such as fair use.

Now, we finally have an answer to the question of how much LLMs memorize versus generalize: a new study released this week by researchers from Meta, Google DeepMind, Cornell University and Nvidia finds that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.

To understand what 3.6 bits mean in practice:

  • A single bit is the smallest unit of digital data, representing either a 0 or a 1. Eight bits make up one byte.
  • Storing 3.6 bits allows for roughly 12.13 distinct values, since 2^3.6 ≈ 12.13 (see the short calculation after this list).
  • That is about the amount of information needed to choose one option out of 12 – similar to picking a month of the year or the outcome of a roll of a 12-sided die.
  • It is not enough to store a full English letter (which needs around 4.7 bits), but it is enough to encode a character from a reduced set of 10 common English letters (which requires about 3.32 bits).
  • In bytes, 3.6 bits is 0.45 bytes – less than half the size of a typical character stored in ASCII (which uses 8 bits, or 1 byte).
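For readers who want to verify those numbers, here is a minimal Python check of the arithmetic; the 3.6-bit figure comes from the study, and the rest is basic information theory:

```python
import math

# Reproduce the arithmetic in the list above.
print(f"2^3.6    = {2 ** 3.6:.2f} distinct values")                   # ~12.13
print(f"log2(26) = {math.log2(26):.2f} bits per English letter")      # ~4.70
print(f"log2(10) = {math.log2(10):.2f} bits for a 10-letter subset")  # ~3.32
print(f"3.6 bits = {3.6 / 8:.2f} bytes")                              # 0.45
```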

This figure held constant across reasonable architectural variations: different depths, widths and precisions produced similar results. The estimate remained stable across model sizes and even precision levels, with full-precision models reaching slightly higher values (up to 3.83 bits per parameter).

More training data does not lead to more memorization – in fact, a model becomes less likely to memorize any single data point

One of the main takeaways is that models do not memorize more when trained on more data. Instead, a model's fixed capacity is spread across the dataset, meaning each individual data point receives less of it.

Jack Morris, the lead author, explained via the social network X that "training on more data will require models to memorize less per sample."
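A back-of-the-envelope sketch of that dilution effect, assuming the study's roughly 3.6 bits per parameter; the model and dataset sizes below are hypothetical:

```python
# Fixed capacity spread over a growing dataset leaves fewer bits per example.
BITS_PER_PARAM = 3.6
num_parameters = 1_500_000_000                      # a 1.5B-parameter model
total_capacity_bits = BITS_PER_PARAM * num_parameters

for num_samples in (1_000_000, 100_000_000, 10_000_000_000):
    bits_per_sample = total_capacity_bits / num_samples
    print(f"{num_samples:>14,} training samples -> {bits_per_sample:10.4f} bits of capacity per sample")
```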

These findings may help ease concerns about large models memorizing copyrighted or sensitive content.

If memorization is limited and diluted across many examples, the probability of reproducing any one specific training example decreases. In essence, more training data leads to safer generalization behavior, not increased risk.

How the researchers reached these findings

To precisely quantify how much language models memorize, the researchers used an unconventional but powerful approach: they trained transformer models on datasets composed of uniformly random bitstrings. Each of these bitstrings was sampled independently, ensuring that no patterns, structure or redundancy existed across examples.

Because each sample is unique and devoid of shared features, any ability the model shows to reconstruct or identify these strings during evaluation directly reflects how much information it retained – that is, memorized – during training.

The key reason for this setup was to eliminate the possibility of generalization entirely. Unlike natural language – which is full of grammatical structure, semantic overlap and repeated concepts – uniform random data contains no such information. Each example is essentially noise, with no statistical relationship to any other. In such a scenario, any performance the model shows on test data must come purely from memorization of the training examples, because there is no distributional pattern to generalize from.
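As an illustration of that setup, here is a minimal sketch of how such a dataset could be built in PyTorch; the function name and sizes are made up for this example and this is not the paper's actual code:

```python
import torch

def make_random_bitstring_dataset(num_examples: int, seq_len: int, seed: int = 0) -> torch.Tensor:
    """Sample uniformly random bit sequences, independently for each example.

    Because every sequence is pure noise, there is no shared structure to
    generalize from: anything a model later recovers about these sequences
    must have been memorized."""
    generator = torch.Generator().manual_seed(seed)
    return torch.randint(0, 2, (num_examples, seq_len), generator=generator)

data = make_random_bitstring_dataset(num_examples=10_000, seq_len=64)
print(data.shape)  # torch.Size([10000, 64])
```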

The authors argue that their method may be one of the only principled ways to decouple memorization from learning in practice, because when LLMs are trained on real language, it is difficult to know whether an output that matches the training data was memorized or simply inferred from the underlying structure of the patterns they have observed.

This method allowed the researchers to map a direct relationship between the number of model parameters and the total information stored. By gradually increasing model size and training each variant to saturation, across hundreds of experiments on models ranging from 500K to 1.5 billion parameters, they observed a consistent result: 3.6 bits stored per parameter, which they report as a fundamental measure of LLM memory capacity.
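To make the measurement concrete, here is an illustrative sketch of how per-example model likelihoods could be turned into a bits-per-parameter figure. It follows the spirit of the setup described above, but the function, its inputs and the numbers are hypothetical rather than the paper's actual estimator:

```python
import math

def bits_memorized_per_parameter(nll_nats_per_example, seq_len_bits, num_parameters):
    """Illustrative estimate (a sketch, not the study's exact method).

    For uniformly random bitstrings, a model that has learned nothing needs
    seq_len_bits to encode each example; any reduction in the model's code
    length (its negative log-likelihood, converted to bits) must come from
    memorization. Summing over the dataset and dividing by the parameter
    count gives bits memorized per parameter."""
    total_memorized_bits = 0.0
    for nll_nats in nll_nats_per_example:
        nll_bits = nll_nats / math.log(2)               # nats -> bits
        total_memorized_bits += max(0.0, seq_len_bits - nll_bits)
    return total_memorized_bits / num_parameters

# Hypothetical numbers: 10,000 sequences of 64 random bits and a 500K-parameter model.
fake_nlls = [30.0 * math.log(2)] * 10_000               # model "spends" only ~30 bits per sequence
print(bits_memorized_per_parameter(fake_nlls, seq_len_bits=64, num_parameters=500_000))  # ~0.68
```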

The team also applied their methodology to models trained on real-world datasets. When trained on text, the models exhibited a balance between memorization and generalization.

Smaller datasets encouraged more memorization, but as dataset size increased, the models shifted toward learning generalizable patterns. This transition was marked by a phenomenon known as "double descent," where performance temporarily dips before improving once generalization kicks in.

The study also examined how model precision – comparing training in bfloat16 versus float32 – affects memorization capacity. They observed a modest increase from 3.51 to 3.83 bits per parameter when switching to full 32-bit precision. However, this gain is far smaller than the doubling of available bits would suggest, implying diminishing returns from higher precision.
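The diminishing return is easy to see with a one-line calculation using the figures reported above:

```python
# fp32 doubles the bits per parameter relative to bf16, but measured capacity
# rises only from 3.51 to 3.83 bits per parameter.
bf16_capacity, fp32_capacity = 3.51, 3.83
relative_gain = fp32_capacity / bf16_capacity - 1
print(f"Measured capacity gain: {relative_gain:.1%} for a 100% increase in parameter width")  # ~9.1%
```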

Unique data are more likely to be memorized

The paper also proposes a scaling law that relates a model's capacity and dataset size to the effectiveness of membership inference attacks.

These attacks attempt to determine whether a particular data point was part of a model's training set. The research shows that such attacks become unreliable as dataset size grows, supporting the argument that large-scale training helps reduce privacy risk.
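For context, a membership inference attack in its simplest form is just a threshold on the model's loss. The sketch below is a generic illustration of that idea with synthetic numbers; it is not the paper's scaling-law analysis:

```python
import numpy as np

def loss_threshold_mia(member_losses, nonmember_losses, threshold):
    """Flag an example as a training-set member if its loss falls below a threshold.

    Returns the attack's true-positive rate (members correctly flagged) and
    false-positive rate (non-members wrongly flagged)."""
    tpr = float((np.asarray(member_losses) < threshold).mean())
    fpr = float((np.asarray(nonmember_losses) < threshold).mean())
    return tpr, fpr

# Synthetic losses: when a model's capacity is spread over a huge dataset, member
# and non-member loss distributions overlap heavily and the attack nears chance.
rng = np.random.default_rng(0)
members = rng.normal(loc=2.00, scale=0.5, size=5_000)
nonmembers = rng.normal(loc=2.05, scale=0.5, size=5_000)
print(loss_threshold_mia(members, nonmembers, threshold=2.02))
```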

While the paper focuses on average-case behavior, some researchers have pointed out that certain kinds of data – such as highly unique or stylized writing – may still be more susceptible to memorization.

The authors acknowledge this limitation and emphasize that their method is designed to characterize general trends rather than edge cases.

Moving toward a greater human understanding of how LLMs work

By introducing a principled, quantifiable definition of memorization, the study gives developers and researchers new tools for evaluating the behavior of language models. This helps not only with model transparency but also with compliance, privacy and ethical standards in AI development. The findings suggest that more data – not less – may be the safer path when training large-scale language models.

To put a model's total memorization capacity into perspective:

  • A 500K-parameter model can memorize roughly 1.8 million bits, or 225 KB of data (see the quick check after this list).
  • A 1.5 billion-parameter model can hold about 5.4 billion bits, or 675 megabytes of raw information.
  • That is not comparable to typical file storage such as images (an uncompressed 3.6 MB image is about 30 million bits, for example), but it is significant when distributed across discrete textual patterns.
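A quick check of those totals, assuming the 3.6 bits-per-parameter figure and decimal megabytes:

```python
# Sanity check of the capacity totals above.
def total_capacity(num_parameters: int, bits_per_param: float = 3.6) -> str:
    bits = num_parameters * bits_per_param
    return f"{num_parameters:,} params -> {bits:,.0f} bits = {bits / 8 / 1e6:.3f} MB"

print(total_capacity(500_000))        # 1,800,000 bits = 0.225 MB (225 KB)
print(total_capacity(1_500_000_000))  # 5,400,000,000 bits = 675.000 MB
```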

I am neither a lawyer nor a legal expert, but I strongly expect such research to be cited in the many ongoing lawsuits between AI providers and data creators and rights owners.



