TL;DR: LLM inference costs have fallen significantly with the introduction of context caching, which makes reused input tokens roughly 10x cheaper. DeepMind has made progress here with Gemini, and has also released a new small 2B Gemma model that benefits from model distillation. China-based DeepSeek has announced automatic context caching, reducing its API costs for reused input tokens by 90%. In parallel, research on inference-time scaling laws suggests that spending more compute at inference time can significantly improve LLM performance. These advances are synergistic and could make agentic LLM systems more feasible.
Disclaimer: This post has been created automatically using generative AI, including DALL-E, Gemini, OpenAI, and others. Please take its contents with a grain of salt. For feedback on how we can improve, please email us.
Introduction
This week, our eyes were again on the rapid progress in LLM inference, in particular the possibility of significantly reducing the cost of reused input tokens with context caching. We may be laboring this point, but the pace at which LLM inference compute prices are falling is truly unprecedented. In this blog post, we discuss the latest developments in LLM inference, including the innovations in context caching and the potential impact they could have on the field of AI.
DeepSeek’s 10x Cheaper Reused LLM Input Tokens
At the peak of Moore’s law, the cost per transistor fell around 4,000x over the first 14 years, up to 1982. But transistors were not getting fundamentally more capable at the same time! At this stage, it is hard to imagine progress at this pace not soon having a global impact. The innovations in context caching this week tie into a great new paper investigating how LLM performance can benefit from repeated inference samples, or “inference-time scaling laws”. Together, we think these provide a very powerful new avenue for unlocking economically useful LLM capabilities.
DeepMind’s Flurry of Activity
DeepMind followed Meta’s week in the spotlight with a flurry of activity. Google released a new experimental Gemini 1.5 Pro model which, for the first time, put DeepMind at the top of the LMSYS arena, suggesting they have finally caught up in the LLM capability race on some measures (though still behind on the LiveBench and ZeroEval benchmarks). They also announced a 5x price reduction for the Flash model coming next week (taking it to half the cost of GPT-4o-mini), a move we think partly reflects progress in distillation but is also likely a response to competitive pressure from Llama 3.1. In addition, they released an impressive (for its size) new small 2B Gemma model that benefits from model distillation, which we expect to join the LLM builder toolkit post Llama 3.1, as we discussed last week.
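To make the distillation idea concrete, here is a minimal sketch of a typical knowledge-distillation objective in PyTorch: a small student model is trained to match the teacher’s softened output distribution alongside the usual hard-label loss. The temperature, loss weighting, and function names below are illustrative assumptions, not Google’s actual Gemma training recipe.

```python
# Sketch of a standard distillation loss (illustrative, not Google's recipe):
# the student matches the teacher's softened logits plus the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft-target term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-label term: standard cross-entropy on the ground-truth tokens.
    hard = F.cross_entropy(student_logits, labels)
    # Weighted combination of the two objectives.
    return alpha * soft + (1 - alpha) * hard
```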
DeepSeek’s Context Caching on Disk
Less than 24 hours after the Gemini Flash price announcement, inference compute pricing was taken a whole level lower, with China-based DeepSeek announcing Context Caching on Disk via their API. This automatically reduces the cost of handling reused input tokens by 90%, down to $0.014 per million tokens, making them 10x cheaper than GPT-4o-mini. The caching mechanism works by storing input content it expects to be reused in a cache on disk, so that when a later request repeats the same input tokens, they can be served from the cache instead of being recomputed at full price.
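As a rough illustration of the idea (a conceptual sketch, not DeepSeek’s actual implementation, which has not been published in detail), a prefix cache can be built by hashing the prompt prefix, storing the expensive-to-compute state for it on disk, and reusing that state whenever a later request begins with the same tokens. All names below are hypothetical.

```python
# Conceptual sketch of a disk-backed prefix cache for reused input tokens.
import hashlib
import pickle
from pathlib import Path

CACHE_DIR = Path("./prefix_cache")
CACHE_DIR.mkdir(exist_ok=True)

def _cache_path(prefix_tokens: list[int]) -> Path:
    # Key the cache on a hash of the exact token prefix.
    digest = hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{digest}.pkl"

def get_or_compute_prefix_state(prefix_tokens: list[int], compute_state):
    """Return (state, cache_hit) for this prefix, computing and storing it if needed.

    `compute_state` stands in for the expensive prefill pass over the prompt;
    in a real serving stack the cached object would be the model's KV cache
    for those tokens, not a pickled Python value.
    """
    path = _cache_path(prefix_tokens)
    if path.exists():                        # cache hit: billed at ~10% of the normal input price
        return pickle.loads(path.read_bytes()), True
    state = compute_state(prefix_tokens)     # cache miss: full prefill cost
    path.write_bytes(pickle.dumps(state))
    return state, False
```

This is why the savings apply specifically to reused input tokens: only a request whose prefix exactly matches previously stored content can skip the expensive recomputation.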
This week has seen significant movement in LLM inference compute prices: reused input token inference with DeepSeek v2 is now roughly 4,300x cheaper than GPT-3 (davinci-002) inference was just 24 months ago. This is truly unprecedented progress and is expected to have a global impact.

In parallel, new research on inference-time scaling laws suggests that LLM performance can be significantly improved by spending more compute at inference time. One such approach, repeated sampling, draws many candidate answers from a model and selects among them, allowing weaker models to outperform stronger ones on certain tasks. These advances are synergistic and should make some agentic LLM systems far more feasible, both in terms of cost and latency. Lower cost and latency also open up new avenues for using LLMs in scenarios where the same input tokens are queried repeatedly, such as multi-step data analysis, codebase question answering, and multi-turn conversations. As these advances continue, we can expect even larger cost reductions and greater LLM capabilities, with a profound impact across industries and applications.
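To illustrate repeated sampling, here is a minimal best-of-k sketch: draw k candidate answers from a cheap model and keep the one a verifier scores highest. The `generate` and `score` callables are hypothetical stand-ins for a model call and a verifier (unit tests, a reward model, majority voting, and so on), not any specific API.

```python
# Minimal best-of-k repeated sampling sketch with hypothetical generate/score callables.
import random
from typing import Callable

def best_of_k(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              k: int = 16) -> str:
    # Draw k independent candidate answers, then keep the highest-scoring one.
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda answer: score(prompt, answer))

# Toy usage: the "model" guesses numbers and the "verifier" prefers guesses close to 42.
if __name__ == "__main__":
    answer = best_of_k(
        "guess the number",
        generate=lambda p: str(random.randint(0, 100)),
        score=lambda p, a: -abs(int(a) - 42),
        k=32,
    )
    print(answer)
```

The economics only work if each extra sample is cheap, which is exactly where 10x cheaper reused input tokens help: the (often long) shared prompt is cached once, and only the sampled continuations are paid for at full price.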
Crafted using generative AI from insights found on Towards AI.
Join us on this incredible generative AI journey and be a part of the revolution. Stay tuned for updates and insights on generative AI by following us on X or LinkedIn.