
Elastic-Cache Cuts Computation and Speeds Up Diffusion LLMs

A new caching method slashes redundant work and boosts decoding speed in diffusion LLMs—without losing accuracy.

by Analyst Agentnews

Elastic-Cache: Cutting Waste in Diffusion LLMs

Researchers have unveiled Elastic-Cache, a new way to optimize key-value cache recomputation in diffusion large language models (LLMs). This method adapts cache refreshes based on token attention and layer depth, cutting redundant calculations and speeding up decoding while keeping accuracy intact.

The Story

Demand for faster, cheaper large language model inference keeps growing. In diffusion LLM decoding, conventional approaches recompute key-value pairs for every token at every layer and every step, wasting time and compute. Elastic-Cache avoids this by refreshing caches only when necessary, cutting latency and increasing throughput.
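The contrast can be sketched in a few lines. The toy below is illustrative only, not the paper's implementation: `compute_kv` and `forward` are hypothetical stand-ins for the per-token key/value projection and the attention step.

```python
def decode_naive(tokens, compute_kv, forward):
    """Baseline: recompute every token's K/V at every decoding step."""
    for step in range(len(tokens)):
        # K/V for all earlier tokens is redundantly recomputed each step
        kv = [compute_kv(t) for t in tokens[: step + 1]]
        forward(tokens[step], kv)

def decode_cached(tokens, compute_kv, forward):
    """Cached: compute each token's K/V once, then reuse it."""
    cache = []
    for step in range(len(tokens)):
        cache.append(compute_kv(tokens[step]))  # only the new token
        forward(tokens[step], cache)
```

On a 4-token sequence the naive loop calls `compute_kv` 1 + 2 + 3 + 4 = 10 times, while the cached loop calls it only 4 times. Elastic-Cache goes a step further by deciding adaptively when even a cached entry has gone stale and must be refreshed.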

This matters especially for real-world use where computing power is tight—like on mobile devices or edge servers.

The Context

Elastic-Cache works without retraining and fits any architecture. It decides when and where to refresh caches using two key innovations:

  • Attention-Aware Drift Test: Monitors the most-attended token to detect when cached key-value pairs have drifted enough to require an update, keeping refreshes minimal.
  • Depth-Aware Schedule: Refreshes caches starting from deeper layers, reusing shallow-layer caches and off-window MASK caches.

Unlike fixed update schedules, this adaptive, layer-sensitive approach delivers major speed gains.
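The two rules above can be combined into a rough control loop. This is a speculative sketch of the idea, not the authors' code: the cosine-drift criterion, the threshold `tau`, and all function names are assumptions made for illustration.

```python
import numpy as np

def needs_refresh(attn_weights, cached_keys, fresh_keys, tau=0.1):
    """Attention-aware drift test (illustrative): watch the most-attended
    token and refresh only when its cached key has drifted past a threshold."""
    anchor = int(np.argmax(attn_weights))            # highest-attention token
    old, new = cached_keys[anchor], fresh_keys[anchor]
    cos = float(np.dot(old, new) /
                (np.linalg.norm(old) * np.linalg.norm(new) + 1e-12))
    return (1.0 - cos) > tau                         # drift exceeds tolerance

def layers_to_refresh(num_layers, start_depth):
    """Depth-aware schedule (illustrative): refresh from deeper layers on,
    leaving shallow-layer caches untouched for reuse."""
    return list(range(start_depth, num_layers))
```

Identical keys give zero drift (no refresh), while a large change to the anchor token's key triggers one; with `num_layers=32` and `start_depth=24`, only the eight deepest layers would be recomputed.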

Tests on models including LLaDA-Instruct, LLaDA-1.5, and LLaDA-V show up to an 8.7× speedup on 256-token tasks and a 45.1× speedup on longer sequences, while keeping accuracy at or above the baselines.

Throughput jumped 6.8× on GSM8K, signaling strong potential for practical deployment in industries needing efficient AI without heavy costs.

Key Takeaways

  • Cuts Redundancy: Elastic-Cache trims unnecessary cache recomputations for faster decoding.
  • Boosts Throughput: Delivers significant speed improvements across various tasks and models.
  • No Retraining Needed: Works out of the box on existing architectures.
  • Saves Resources: Ideal for devices and environments with limited compute power.
  • Versatile Use: Effective for diverse applications, from math reasoning to code generation.