In the rapidly advancing world of artificial intelligence, large language models (LLMs) have emerged as pivotal tools in diverse applications, from personal assistants and AI healthcare to marketing strategies. As these models become more sophisticated, their ability to handle complex tasks depends increasingly on processing long contexts that incorporate extensive domain knowledge or user-specific information. However, this capability comes with a significant challenge: the need for efficient context processing to minimize response delays.

When LLMs process long contexts, they must first prefill, or read and process, the entire input before generating a response. This task can become particularly cumbersome when dealing with large inputs, such as detailed user prompts or extensive conversation histories. The processing delay grows super-linearly with the length of the context, often resulting in several seconds to tens of seconds of latency. For instance, even recent advancements that increase throughput can still leave users waiting over twenty seconds for a response to a 30,000-token context.
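
To see where the super-linear growth comes from, note that the self-attention step in every transformer layer compares each token of the context with every other token, so its cost grows roughly with the square of the context length, while the per-token feed-forward work grows only linearly. The back-of-the-envelope sketch below (illustrative model dimensions, not a measurement of any particular system) shows the quadratic attention term overtaking the linear term as the context grows.

```python
# Back-of-the-envelope prefill compute for one transformer layer.
# Illustrative only: real latency also depends on hardware, batching,
# and memory bandwidth, not just FLOP counts.

HIDDEN_DIM = 4096      # assumed model width
MLP_EXPANSION = 4      # assumed feed-forward expansion factor

def attention_flops(context_len: int) -> float:
    # Two n x n x d matrix products (QK^T and AV), 2 FLOPs per multiply-add:
    # the work grows with the *square* of the context length.
    return 2 * 2 * context_len**2 * HIDDEN_DIM

def mlp_flops(context_len: int) -> float:
    # Two per-token matrix multiplies: the work grows only linearly.
    return 2 * 2 * context_len * HIDDEN_DIM * (MLP_EXPANSION * HIDDEN_DIM)

for n in (1_000, 10_000, 30_000, 100_000):
    total = attention_flops(n) + mlp_flops(n)
    print(f"{n:>7} tokens: ~{total / 1e12:5.1f} TFLOPs/layer, "
          f"attention share {attention_flops(n) / total:.0%}")
```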

Enter CacheGen, a groundbreaking solution developed by researchers from the University of Chicago, Stanford, and Microsoft to address these challenges and improve the speed and efficiency of LLMs. 

“Natural language models can be used not just as chatbots but also as a way to analyze new data or personalized data or internal domain-specific documents,” said assistant professor Junchen Jiang. “However, if it takes a long time to process these documents, the user experience suffers.”

Large language models, such as OpenAI’s GPT-4, rely on vast amounts of data to generate coherent and contextually accurate responses. These models often need to process long inputs containing detailed domain knowledge or user-specific information. However, processing such extensive contexts can introduce significant delays. For instance, before generating a response, the entire context must be processed, which can take several seconds or even minutes, depending on the length and complexity of the input. 

Figure: An illustration of how SLOW inference can be if the LLM has to process the same long document repeatedly.

A common approach to mitigate this delay is by reusing a precomputed key-value (KV) cache. This cache stores important data from previous computations, allowing the model to bypass redundant processing. However, fetching this KV cache over a network can introduce its own set of delays, as these caches are large and can reach sizes of tens of gigabytes. This retrieval process can be time-consuming and hinder the model’s responsiveness, especially when the cache is stored on a different machine. 
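
To make KV cache reuse concrete, here is a minimal sketch using the Hugging Face transformers API rather than CacheGen's own pipeline; the model name, document, and query are placeholders. The document is prefilled once, its per-layer key/value tensors are kept, and a later query hands those tensors back to the model so only the new tokens need to be processed. These per-layer tensors are what CacheGen compresses and streams.

```python
# A minimal sketch of KV-cache reuse (Hugging Face transformers API, not
# CacheGen's pipeline). Model name, document, and query are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: any causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

document = "A long domain-specific document would go here..."
doc_ids = tok(document, return_tensors="pt").input_ids

# Prefill once: read the whole document and keep the per-layer key/value tensors.
with torch.no_grad():
    prefill = model(doc_ids, use_cache=True)
kv_cache = prefill.past_key_values  # one (key, value) tensor pair per layer

# A later query reuses the cache, so only the new tokens are processed.
query_ids = tok(" Q: what does the document say?", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(query_ids, past_key_values=kv_cache, use_cache=True)
next_token_logits = out.logits[:, -1]  # ready to generate without re-reading the document
```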

Figure: An illustration of how much FASTER inference can be if the KV cache of the long document is delivered efficiently to LLMs via CacheGen.

CacheGen is designed to tackle these inefficiencies head-on. Developed by a team led by Jiang, CacheGen offers a two-fold solution: compressing the KV cache and optimizing its streaming. Here’s how it works:

  1. KV Cache Encoding: CacheGen uses a custom tensor encoder that compresses the KV cache into a much more compact bitstream. The compression adds minimal computational overhead while sharply reducing the bandwidth needed to fetch the cache. By exploiting the distributional properties of KV cache values, CacheGen keeps the loss small enough that the LLM's responses remain accurate.
  2. Adaptive KV Cache Streaming: To further minimize delays, CacheGen adapts how the cache is streamed. When bandwidth is limited, it can increase the compression level for parts of the context or recompute certain parts of the KV cache on the fly, maintaining high performance and low latency across varying network conditions. (A simplified sketch of both ideas follows this list.)
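
As a rough illustration of these two ideas (and not CacheGen's actual encoder or streaming controller, which are considerably more sophisticated), the sketch below uniformly quantizes a KV tensor to an adjustable number of bits and picks the finest level whose transfer time fits an estimated bandwidth and latency budget, falling back to recomputation when even the coarsest level is too slow. All shapes, bandwidths, and budgets are made up.

```python
# A much-simplified sketch of the two ideas above, not CacheGen's actual
# encoder: (1) lossily quantize KV tensors into fewer bits, and (2) pick a
# quantization level that fits the estimated bandwidth and latency budget,
# recomputing instead when nothing fits. All numbers are illustrative.
from typing import Optional

import torch

def quantize(t: torch.Tensor, num_bits: int):
    """Uniform quantization to num_bits; returns integer codes plus the
    scale/offset needed to approximately reconstruct the original values."""
    levels = 2**num_bits - 1
    t_min, t_max = t.min(), t.max()
    scale = (t_max - t_min) / levels
    codes = torch.round((t - t_min) / scale).to(torch.int32)
    return codes, scale, t_min

def dequantize(codes: torch.Tensor, scale, t_min) -> torch.Tensor:
    return codes.to(torch.float32) * scale + t_min

def choose_bits(num_elements: int, bandwidth_bytes_per_s: float,
                latency_budget_s: float,
                candidates=(8, 4, 2)) -> Optional[int]:
    """Highest precision whose transfer time fits the budget, else None
    (meaning: recompute this chunk's KV cache rather than fetch it)."""
    for bits in candidates:
        transfer_s = (num_elements * bits / 8) / bandwidth_bytes_per_s
        if transfer_s <= latency_budget_s:
            return bits
    return None

# One layer's key tensor for a 30k-token context (small head count keeps the example light).
keys = torch.randn(30_000, 8, 64)  # (tokens, heads, head_dim)
bits = choose_bits(keys.numel(), bandwidth_bytes_per_s=5e6, latency_budget_s=2.0)
if bits is None:
    print("Link too slow for any level: recompute this chunk instead.")
else:
    codes, scale, t_min = quantize(keys, bits)
    error = (dequantize(codes, scale, t_min) - keys).abs().max().item()
    print(f"Chose {bits}-bit quantization; max reconstruction error {error:.3f}")
```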

The implications of CacheGen’s technology are vast and transformative. By significantly reducing the time required to process and fetch large contexts, CacheGen can enhance the user experience across various applications. 

“Cities and small businesses need infrastructure to run these models efficiently,” stated Jiang. “With CacheGen, we can achieve a 4-5x speedup, which can be even higher in real-world implementations. This is crucial for sectors like AI healthcare and personal assistance, where quick and accurate responses are vital.”

For instance, in AI-driven personal assistance, users can receive faster and more accurate responses to their queries, improving overall productivity and satisfaction.

In healthcare, where AI is increasingly used to analyze patient data and provide diagnostic support, CacheGen can accelerate the processing of medical records and research papers, enabling healthcare professionals to make quicker, more informed decisions. This speed is crucial in scenarios where time is of the essence, such as emergency care or rapid disease outbreak responses.

One of the primary challenges CacheGen addresses is the inefficient reuse of KV caches. Currently, the KV cache must often be retrieved from another machine, causing additional network delays. CacheGen’s ability to compress and efficiently reload these caches is a breakthrough, as Jiang explains: “GPU memory is very precious. You cannot keep the KV cache in GPU memory all the time, so you have to store it somewhere. Loading it back is expensive. CacheGen compresses this cache into a smaller size and reloads it efficiently.”
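
To picture the arithmetic behind that quote, the sketch below estimates the size of an uncompressed fp16 KV cache for a 30,000-token context on a hypothetical 32-layer model, then offloads a toy cache to disk and reloads it. Plain torch.save and torch.load stand in for the store-and-reload step; CacheGen replaces the raw tensors with its compact bitstream.

```python
# A minimal sketch of the storage problem Jiang describes. The model
# dimensions are hypothetical, and torch.save/torch.load stand in for
# CacheGen's compressed bitstream.
import torch

# Back-of-the-envelope size of an uncompressed fp16 KV cache:
# 2 tensors (keys and values) per layer, each (tokens x heads x head_dim).
num_layers, tokens, heads, head_dim, bytes_per_value = 32, 30_000, 32, 128, 2
size_gb = 2 * num_layers * tokens * heads * head_dim * bytes_per_value / 1e9
print(f"~{size_gb:.0f} GB for a 30,000-token context")  # ~16 GB before compression

# A toy cache (tiny shapes so the example runs anywhere), evicted from
# (GPU) memory to storage and fetched back when the same context recurs.
toy_cache = [(torch.randn(16, 4, 8), torch.randn(16, 4, 8)) for _ in range(2)]
torch.save(toy_cache, "kv_cache.pt")
restored = torch.load("kv_cache.pt")
```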

Furthermore, a follow-up project to CacheGen supports combining multiple KV caches, enabling the model to answer complex queries that draw on information from multiple documents. This flexibility is essential for applications requiring comprehensive data analysis, such as in-depth research or large-scale data integration.

CacheGen represents a significant step forward in making large language models more practical and accessible for a wide range of applications. By addressing the hidden problem of network delays in context processing, CacheGen not only enhances the efficiency of AI systems but also opens up new possibilities for their use in everyday tasks and professional settings.

As Jiang notes, “The real value of this work is in letting people know there’s this important problem in large language model services. By solving it, we’re making these models more useful and efficient for everyone.”

For more detailed information, the CacheGen code is publicly available, inviting further exploration and application by the AI community.
