Inference & Efficiency A
-
Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
Quality-aware self-distillation for GUI grounding in VLMs
The paper proposes a quality-aware self-distillation method for GUI grounding, the task in which vision-language models predict precise screen coordinates. It addresses how naive on-policy self-distillation can degrade coordinate-token teacher signals.
-
S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices
S4oP prunes structured state space models at the operator level
Structured state space models such as S4 and S4D capture long-range dependencies but are hard to deploy on constrained devices. S4oP introduces operator-level pruning to enable efficient deployment of SSMs on time- and resource-constrained hardware.
-
NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment
NoiseTilt injects reward gradients via the noise term in diffusion
NoiseTilt (NTRK) is a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the score kernel unchanged and needing only a single sample per step, improving reward alignment of pretrained diffusion models.
-
ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation
ConSA: controllable sparsity in hybrid attention via learnable allocation
Hybrid architectures combining full and sliding-window attention are promising for efficient LLM inference but often rely on hand-crafted rules. ConSA introduces learnable allocation to achieve controllable sparsity in hybrid attention.
-
Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation
Catastrophic forgetting is low-rank: a function-space theory
Catastrophic forgetting in continual adaptation is usually viewed through parameter drift or replay, neither of which reveals which output directions are vulnerable. The paper gives a function-space account in the NTK regime, showing that new-task training induces a low-rank drift in old-task predictions through the cross-task kernel.
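To make the role of the cross-task kernel concrete, here is a short sketch of the standard NTK-style calculation. It is not taken from the paper: the symbols (parameters \theta, model f_\theta, inputs X_old and X_new, learning rate \eta, squared loss, scalar outputs, gradient flow) are assumptions chosen for illustration.

```latex
% Assumed setup: squared loss on the new task, gradient flow with rate \eta.
% J_old and J_new are the Jacobians of the model outputs w.r.t. \theta.
\frac{d\theta}{dt} = -\eta\, J_{\mathrm{new}}^{\top} r_{\mathrm{new}},
\qquad
r_{\mathrm{new}} = f_\theta(X_{\mathrm{new}}) - y_{\mathrm{new}}

% Chain rule: how old-task predictions move while training on the new task.
\frac{d}{dt} f_\theta(X_{\mathrm{old}})
  = J_{\mathrm{old}} \frac{d\theta}{dt}
  = -\eta\, K_{\mathrm{old,new}}\, r_{\mathrm{new}},
\qquad
K_{\mathrm{old,new}} = J_{\mathrm{old}} J_{\mathrm{new}}^{\top}

% The drift lies in the column space of the cross-task kernel.
\operatorname{rank} K_{\mathrm{old,new}} \le \min(n_{\mathrm{old}},\, n_{\mathrm{new}})
```

Under these assumptions, old-task predictions can only move along directions in the column space of K_old,new, so if that matrix has only a few dominant singular directions, the drift is effectively low-rank.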
-
LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling
LoopCoder-v2: loop once for efficient test-time compute scaling
Looped transformers scale latent computation by repeating shared blocks, but sequential looping raises latency and KV-cache memory with loop count. Building on parallel loop transformers, LoopCoder-v2 makes loop count a practical knob for efficient test-time computation scaling.
-
Recursive Scaling in Masked Diffusion Models
Recursive scaling in masked diffusion models
Masked diffusion models (MDMs) have recently emerged as a generative approach. The paper investigates recursive scaling in MDMs, offering insights into their behavior and efficiency.
-
Half a Link can Be Enough to Predict a Whole Link: Understanding Generalization in Knowledge Graph Foundation Models
Half a link can predict a whole link: generalization in KG foundation models
Knowledge graph foundation models are zero-shot generalizers that, trained once, predict links on unseen graphs without retraining. The paper sheds light on when and how they robustly generalize across knowledge graphs.
-
VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination
VoidPadding lets [VOID] handle padding so [EOS] focuses on termination
In masked diffusion language models, padding and semantic termination roles get entangled. VoidPadding introduces a [VOID] token to handle padding so that [EOS] can focus on signaling semantic termination, improving generation behavior.
-
Differential Privacy of Gaussian Process Posterior Sampling
Differential privacy of Gaussian process posterior sampling
The paper studies privacy when releasing posterior sample paths from a Gaussian process where the entire training set is private. Unlike DP mechanisms that add external noise, it shows the intrinsic randomness of posterior sampling itself yields differential-privacy guarantees.
-
SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs
SoftMoE: soft differentiable routing for mixture-of-experts in LLMs
Sparse mixture-of-experts architectures scale LLM parameters but their discrete routing complicates training. SoftMoE introduces soft, differentiable routing for mixture-of-experts in LLMs to enable more stable and efficient expert selection.
-
Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations
Order-independent cell representations revisit table recognition
Multi-task table recognition jointly handles structure prediction, cell localization and content recognition, often via autoregressive decoders whose hidden states are reused. The paper revisits this structural dependency using order-independent cell-level representations.
-
AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor
AnchorKV: safety-aware KV cache compression via soft penalties
AnchorKV is a safety-aware KV cache compression method that uses a soft penalty with a refusal anchor to retain important key-value entries while reducing memory. Summary is largely title-based; details are as presented by the source and not independently verified.
-
From Drift to Coherence: Stabilizing Beliefs in LLMs
From drift to coherence: stabilizing beliefs in LLMs
LLMs are hypothesized to perform implicit Bayesian inference, yet the martingale property of predictive beliefs has been shown to fail in synthetic in-context learning. Revisiting this in typical regimes like multiple-choice QA, the paper studies how to stabilize beliefs from drift to coherence.
-
Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation
Improving low-resource ASR via bilingual fine-tuning with language ID
The study explores improving low-resource automatic speech recognition using bilingual fine-tuning combined with language identification, and evaluates the approach across languages in a cross-linguistic setting.
-
LLMs Infer Cultural Context but Fail to Apply It When Responding
LLMs infer cultural context but fail to apply it when responding
LLMs are known to overrepresent dominant, often Western cultures while marginalizing others. The paper evaluates how this affects culturally adapted response generation, finding that models can infer cultural context but fail to apply it when responding.
-
Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns
Reusing web skills via transferable interaction patterns
LLM web agents are usually deployed as tool callers that read a fresh page observation each turn and emit a structured action. The paper proposes reusing web skills across domains via transferable interaction patterns rather than domain-specific behaviors.
-
OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation
OPD-Evolver cultivates self-evolving agents via on-policy distillation
Memory is a standard substrate for self-evolving agents, but retaining experience differs from learning how to evolve through it. OPD-Evolver uses on-policy distillation to cultivate a holistic agent evolver that selects useful experience, acts on it and writes reusable knowledge.
-
Exact Posterior Score Estimation for Solving Linear Inverse Problems
Exact closed-form posterior score for linear inverse problems
The paper derives the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, showing that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot with anisotropic noise. It turns this into a training objective, Exact Posterior Score (EPS), that preserves standard denoising structure and can be trained from scratch or fine-tuned.
-
KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing
KVEraser edits the KV cache to erase context efficiently
Erasing a span from a long-context KV cache is costly because a local edit propagates to all later tokens, forcing recomputation of the suffix. KVEraser instead replaces only the erased interval's KV states with learned steering states while reusing the rest of the cache. A two-stage training pipeline teaches a transferable erasing mechanism for stale facts, wrong tool outputs, or prompt injections.
-
HAMON: Passive Optical Sequence Mixing for Long-Horizon Forecasting
HAMON: a passive optical core for long-horizon forecasting
HAMON is a passive diffractive optical forecasting core: history is encoded onto an optical aperture and cascaded trainable phase masks with free-space diffraction shape the forecast directly in the output field. Inference is a single passive optical pass with no digital sequence-mixing layer, yet it beats strong digital baselines on ETTm2.
-
ExpRL: Exploratory RL for LLM Mid-Training
ExpRL uses human QA as reward scaffolds for LLM mid-training RL
ExpRL is an RL-based mid-training method that uses large human-written QA corpora as reward scaffolds rather than imitation targets: reference answers are hidden from the policy and used only to build problem-specific grading rubrics for judging on-policy reasoning, automating skill acquisition for harder problems.
-
TokenPilot: Cache-Efficient Context Management for LLM Agents
TokenPilot cuts LLM-agent context costs ~61% while preserving prompt cache
TokenPilot is a dual-granularity context manager for LLM agents that avoids the cache invalidation caused by unconstrained pruning. Ingestion-Aware Compaction stabilizes prompt prefixes while Lifecycle-Aware Eviction offloads segments only when relevance expires, cutting costs by 61% and 56% in benchmarks.
-
TuneJury: An Open Metric for Improving Music Generation Preference Alignment
TuneJury: an open reward model for text-to-music preference
TuneJury is an open, instance-level pairwise reward model that predicts text-to-music preference scores from a prompt and an audio clip, trained on publicly available human-preference labels. Its calibrated score margins support data filtering, and an 'anchor calibration' step efficiently extends it to generators released after training.
-
Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
Bayesian audit of public frontier-AI evaluation archives proposed
The paper treats public AI evaluation archives (e.g., LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as selective time series rather than terminal leaderboards, framing them as a Bayesian inference problem. It reports that selection-aware frontier models fail synthetic recovery and calibration, while fixed audit gates remain informative.
-
ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation
ActiveSAM turns frozen SAM 3 into a training-free open-vocab segmenter
ActiveSAM is a training-free, zero-shot framework that adapts the frozen SAM 3 backbone for open-vocabulary semantic segmentation. It estimates an image-conditioned active class set from a low-resolution presence preview, then decodes only the retained classes at full resolution, improving efficiency over decoding the entire dataset vocabulary per image.
-
Agent trajectories as programs: fingerprinting and programming coding-agent behavior
Coding agents have behavioral fingerprints identifiable from trajectories
The paper compares agents procedurally rather than by benchmark scores, defining behavioral 'fingerprints.' Across ten agents, a probe over these procedural signatures attributes an unseen trajectory to the correct agent with 85.7% accuracy. Using an emergent, compressive vocabulary induction over SWE-Bench trajectories, it studies the structural distinctness of agent problem-solving.
-
Dynestyx: A Probabilistic Programming Library for Dynamical Systems
Dynestyx: a probabilistic programming library with first-class SSMs
State-space models (SSMs) are the standard formalism for Bayesian treatment of dynamical systems, yet they have been hard to incorporate into modern probabilistic programming languages. The authors introduce dynestyx, a library with first-class SSM support and state-of-the-art state and parameter estimation. Through one interface, users specify arbitrary priors for discrete- or continuous-time systems, run inference over mixed-effect data, and obtain estimates with principled uncertainty.
-
Decoupling Inference from State Updates in Low-Latency Feature Engines via Probabilistic Thinning
Probabilistic thinning decouples inference from state updates in streams
Streaming data systems increasingly underpin ML workflows maintaining many continuously updated aggregations. In production, each event triggers read-modify-write operations to storage, making high-frequency state updates a dominant source of latency, contention, and cost. This work decouples inference from persistence via probabilistic thinning: every event is scored, but durable updates fire only for informative events, using approximate disk-backed statistics with no in-memory control plane.
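The idea of scoring every event while persisting only some of them can be sketched as follows. This is an illustrative toy, not the paper's algorithm: the `StoredStats` class, the z-score as the measure of informativeness, and the `base_rate` and `z_cut` parameters are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class StoredStats:
    """Hypothetical stand-in for disk-backed running statistics."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0      # sum of squared deviations (Welford)
    writes: int = 0      # number of durable read-modify-write operations

    def std(self) -> float:
        return (self.m2 / (self.count - 1)) ** 0.5 if self.count > 1 else 0.0

    def update(self, x: float) -> None:
        # Welford's online update; each call models one durable write.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.writes += 1


def score(stats: StoredStats, x: float) -> float:
    """Read-only path: absolute z-score of x against the stored statistics."""
    s = stats.std()
    return float("inf") if s == 0.0 else abs(x - stats.mean) / s


def process(events, stats, rng, base_rate=0.1, z_cut=2.0):
    """Score every event; persist surprising ones always, others rarely."""
    scores = []
    for x in events:
        z = score(stats, x)
        scores.append(z)
        # Assumed thinning rule: informative events always update state,
        # routine events update only with probability base_rate.
        keep_prob = 1.0 if z >= z_cut else base_rate
        if rng.random() < keep_prob:
            stats.update(x)
    return scores


if __name__ == "__main__":
    data_rng = random.Random(0)
    events = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]
    stats = StoredStats()
    scores = process(events, stats, random.Random(1))
    print(f"scored={len(scores)} durable_writes={stats.writes}")
```

Because surprising events are kept more often than routine ones, the stored statistics in this toy are biased toward outliers; a real system would need to correct for that, for example by weighting each retained event by the inverse of its keep probability.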
-
Probing Low Frame Rate Degradation in Neural Audio Codecs
Probing why neural audio codecs degrade at low frame rates
Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where cost scales with sequence length. Codecs can run at 12.5 Hz and below, but the mechanisms of low-frame-rate degradation are unclear. Through a controlled frame-rate ablation, the authors reproduce a quality cliff at 6.25 Hz and test two candidate explanations, phonemic collisions and codebook saturation, finding that neither poses a fundamental barrier. The cliff instead stems from suboptimal training choices such as fixed clip duration.
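As a rough illustration of why lower frame rates are attractive when cost scales with sequence length, the arithmetic below counts codec frames for a clip at a few frame rates. The function name, the 10-second clip, the 25 Hz comparison rate, and the `codebooks` multiplier (for codecs whose codebooks are flattened into one sequence) are assumptions for the example; only 12.5 Hz and 6.25 Hz come from the summary.

```python
import math


def tokens_per_clip(duration_s: float, frame_rate_hz: float, codebooks: int = 1) -> int:
    """Number of tokens an autoregressive model must generate for one clip."""
    frames = math.ceil(duration_s * frame_rate_hz)  # partial frames round up
    return frames * codebooks


if __name__ == "__main__":
    for rate in (25.0, 12.5, 6.25):
        print(f"{rate:>5} Hz -> {tokens_per_clip(10.0, rate)} tokens per 10 s clip")
```

Halving the frame rate halves the sequence length, which is the saving that makes operating near the reported quality cliff tempting.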