Large language models (LLMs), usually described as AI systems trained on vast amounts of data to predict the next token (a word or part of a word), are now being viewed from a different perspective.
A recent research paper by Google’s AI subsidiary DeepMind suggests that LLMs can be seen as strong data compressors. The authors “advocate for viewing the prediction problem through the lens of compression,” offering a fresh take on the capabilities of these models.
Their experiments demonstrate that, with slight modifications, LLMs can compress information as effectively as, and in some cases better than, widely used compression algorithms. This viewpoint provides novel insights into developing and evaluating LLMs.
LLMs as data compressors
“The compression aspect of learning and intelligence has been known to some researchers for a long time,” Anian Ruoss, Research Engineer at Google DeepMind and co-author of the paper, told VentureBeat. “However, most machine learning researchers today are (or were) unaware of this crucial equivalence, so we decided to try to popularize these essential ideas.”
In essence, a machine learning model learns to transform its input, such as images or text, into a “latent space” that encapsulates the key features of the data. This latent space typically has fewer dimensions than the input space, enabling the model to compress the data into a smaller size, hence acting as a data compressor.
In their study, the Google DeepMind researchers repurposed open-source LLMs to perform arithmetic coding, a type of lossless compression algorithm. “Repurposing the models is possible because LLMs are trained with the log-loss (i.e., cross-entropy), which tries to maximize the probability of natural text sequences and decrease the probability of all others,” Ruoss said. “This yields a probability distribution over the sequences and the 1-1 equivalence with compression.”
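The equivalence Ruoss describes can be sketched in a few lines. This is a minimal illustration, not the paper's code: instead of an LLM it uses a toy adaptive byte-frequency model (an assumption made to keep the example self-contained), and it computes the ideal code length rather than emitting actual bits. An arithmetic coder driven by the same probabilities can get within a few bits of this total. The function name `ideal_code_length_bits` is hypothetical.

```python
import math
from collections import Counter


def ideal_code_length_bits(data: bytes) -> float:
    """Sum of -log2 p(byte | previous bytes) under a toy adaptive model.

    The model is an order-0 byte counter with add-one smoothing over the
    256 possible byte values. An LLM would replace this predictor with a
    much sharper one, which is what shortens the code.
    """
    counts = Counter()
    seen = 0
    total_bits = 0.0
    for b in data:
        # Probability the model assigns to the byte before seeing it.
        p = (counts[b] + 1) / (seen + 256)
        total_bits += -math.log2(p)
        # Update the model only after "encoding" the byte, so a decoder
        # could stay in sync by making the same updates.
        counts[b] += 1
        seen += 1
    return total_bits


text = b"the cat sat on the mat. " * 40
bits = ideal_code_length_bits(text)
rate = bits / (8 * len(text))
print(f"ideal code length: {bits:.0f} bits, rate: {rate:.2f} of raw size")
```

The better the model predicts each next symbol, the higher the probability it assigns and the fewer bits that symbol costs, which is why minimizing log-loss and minimizing compressed size are the same objective.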
Lossless compression algorithms, such as gzip, can perfectly reconstruct the original data from the compressed data, ensuring no loss of information.
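As a small illustration of what "lossless" means in practice, the sketch below round-trips a byte string through Python's standard `gzip` module and checks that the restored bytes are identical. The sample text is made up for the example.

```python
import gzip

# Made-up sample data; any bytes would do.
original = ("Lossless compression must reproduce every byte. " * 100).encode("utf-8")

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# The defining property of lossless compression: an exact round trip.
assert restored == original
print(f"{len(original)} bytes -> {len(compressed)} bytes -> {len(restored)} bytes")
```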
LLMs vs. classical compression algorithms
In their study, the researchers evaluated the compression capabilities of LLMs using vanilla transformers and Chinchilla models on text, image, and audio data. As expected, LLMs excelled in text compression. For example, the 70-billion parameter Chinchilla model impressively compressed data to 8.3% of its original size, significantly outperforming gzip and LZMA2, which managed 32.3% and 23% respectively.
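The percentages above are compression rates: compressed size divided by original size. A rough sketch of how such a rate can be measured for the classical baselines, using Python's standard `gzip` and `lzma` modules (the latter writes the LZMA2-based `.xz` format by default), follows. The helper `compression_rate` and the synthetic sample are assumptions for illustration, so the output will not match the figures reported in the study.

```python
import gzip
import lzma


def compression_rate(raw: bytes, compressed: bytes) -> float:
    """Compressed size as a percentage of the raw size (lower is better)."""
    return 100.0 * len(compressed) / len(raw)


# Synthetic, highly repetitive sample; real text compresses less well.
sample = " ".join(
    f"record {i}: the quick brown fox jumps over the lazy dog" for i in range(500)
).encode("utf-8")

rates = {
    "gzip": compression_rate(sample, gzip.compress(sample)),
    "lzma": compression_rate(sample, lzma.compress(sample)),
}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}% of original size")
```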
However, the more intriguing finding was that despite being primarily trained on text, these models achieved remarkable compression rates on image and audio data, surpassing domain-specific compression algorithms such as PNG and FLAC by a substantial margin.
“Chinchilla models achieve their impressive compression performance by conditioning a (meta-)trained model to a particular task at hand via in-context learning,” the researchers note in their paper. In-context learning is the ability of a model to perform a task based on examples and information provided in the prompt.
Their findings also show that LLM compressors can be predictors of unexpected modalities, including images and audio. The researchers plan to release more findings in this regard soon.
Despite these promising results, LLMs are not practical tools for data compression compared to existing compression algorithms, because of the differences in size and speed.
“Classical compressors like gzip aren’t going away anytime soon since their compression vs. speed and size trade-off is currently far better than anything else,” Ruoss said.
Classic compression algorithms are compact, no larger than a few hundred kilobytes.
In stark contrast, LLMs can reach hundreds of gigabytes in size and are slow to run on consumer devices. For instance, the researchers found that while gzip can compress 1GB of text in less than a minute on a CPU, an LLM with 3.2 million parameters requires an hour to compress the same amount of data.
“While creating a strong compressor using (very) small-scale language models is, in principle, possible, it has not been demonstrated as of this day,” Ruoss said.
Viewing LLMs in a different light
Viewing LLMs from a compression perspective also offers insight into how scale affects the performance of these models. The prevailing thought in the field is that bigger LLMs are inherently better. However, the researchers discovered that while larger models do achieve superior compression rates on larger datasets, their performance diminishes on smaller datasets.
“For each dataset, the model sizes reach a critical point, after which the adjusted compression rate starts to increase again since the number of parameters is too big compared to the size of the dataset,” the researchers note in their paper.
This suggests that a bigger model is not necessarily better for any kind of task. Scaling laws are dependent on the size of the dataset, and compression can serve as an indicator of how well the model learns the information of its dataset.
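A back-of-the-envelope sketch shows why the adjusted rate can rise again. It assumes the adjusted rate simply adds the size of the model's parameters to the compressed output before dividing by the raw size, with 2 bytes per parameter; both the function and that storage assumption are illustrative. The 8.3% raw rate is the figure quoted earlier, while the dataset sizes are hypothetical.

```python
def adjusted_compression_rate(
    compressed_bytes: int,
    n_params: int,
    raw_bytes: int,
    bytes_per_param: int = 2,  # assumption: 16-bit parameters
) -> float:
    """(compressed output + model size) / raw size, as a fraction."""
    model_bytes = n_params * bytes_per_param
    return (compressed_bytes + model_bytes) / raw_bytes


GB = 1_000_000_000
PARAMS_70B = 70_000_000_000

# Hypothetical 1 GB dataset compressed to 8.3% of its size by a 70B model.
small = adjusted_compression_rate(83_000_000, PARAMS_70B, 1 * GB)

# Same raw rate on a hypothetical 1,000 GB dataset: the fixed model cost
# is now spread over far more data.
large = adjusted_compression_rate(83_000_000_000, PARAMS_70B, 1000 * GB)

print(f"1 GB dataset:    adjusted rate = {small:.3f}")
print(f"1,000 GB dataset: adjusted rate = {large:.3f}")
```

Under these assumptions the model itself dwarfs the small dataset, so the adjusted rate is far above 1, while on the much larger dataset the same model comes out well below 1.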
“Compression provides a principled approach for reasoning about scale,” Ruoss said. “In current language modeling, scaling the model will almost always lead to better performance. However, this is just because we don’t have enough data to evaluate the performance correctly. Compression provides a quantifiable metric to evaluate whether your model has the right size by looking at the compression ratio.”
These findings could have significant implications for the evaluation of LLMs in the future. For instance, a critical issue in LLM training is test set contamination, which occurs when a trained model is tested on data from the training set, leading to misleading results. This problem has become more pressing as machine learning research shifts from curated academic benchmarks to extensive user-provided or web-scraped data.
“In a certain sense, [the test set contamination problem] is an unsolvable one because it is ill-defined. When are two pieces of text or images scraped from the internet essentially the same?” Ruoss said.
However, Ruoss suggests that test set contamination is not a problem when the model is evaluated with compression approaches that account for model complexity, an idea known as Minimum Description Length (MDL).
“MDL punishes a pure memorizer that is ‘storing’ all the training data in its parameters due to its huge complexity. We hope researchers will use this framework more frequently to evaluate their models,” Ruoss said.
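The MDL argument can be made concrete with a two-part code: the cost of describing the model plus the cost of describing the data given the model. The sketch below compares a pure memorizer with a compact model. All numbers are hypothetical and chosen only for illustration, and the `Candidate` class and `pick_by_mdl` helper are not from the paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    model_bits: int  # cost of describing the model itself
    data_bits: int   # cost of encoding the data given the model

    @property
    def description_length(self) -> int:
        return self.model_bits + self.data_bits


def pick_by_mdl(candidates):
    """Return the candidate with the shortest total description."""
    return min(candidates, key=lambda c: c.description_length)


RAW_BITS = 8_000_000_000  # hypothetical 1 GB dataset

# A memorizer stores the whole dataset in its parameters, so the data
# costs nothing extra but the model is as large as the data.
memorizer = Candidate("memorizer", model_bits=RAW_BITS, data_bits=0)

# Hypothetical compact model: 3.2M parameters at 16 bits each, encoding
# the data at an assumed 30% of its raw size.
compact = Candidate(
    "compact model",
    model_bits=3_200_000 * 16,
    data_bits=2_400_000_000,
)

best = pick_by_mdl([memorizer, compact])
print(f"{best.name}: {best.description_length} bits")
```

Because the memorizer's total is never shorter than the raw data, it gains nothing under this accounting, however well it appears to score on a contaminated test set.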