Inference & Efficiency

  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
    Quality-aware self-distillation for GUI grounding in VLMs
    Computer Vision
    The paper proposes a quality-aware self-distillation method for GUI grounding, the task in which vision-language models predict precise screen coordinates. It addresses how naive on-policy self-distillation can degrade the teacher signal on coordinate tokens.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices
    S4oP prunes structured state space models at the operator level
    Fine-tuning · Inference · Reinforcement Learning
    Structured state space models such as S4 and S4D capture long-range dependencies but are hard to deploy on constrained devices. S4oP introduces operator-level pruning to enable efficient deployment of SSMs on time- and resource-constrained hardware.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward Alignment
    NoiseTilt injects reward gradients via the noise term in diffusion
    Inference
    NoiseTilt (NTRK) is a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the score kernel unchanged and needing only a single sample per step, improving reward alignment of pretrained diffusion models.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation
    ConSA: controllable sparsity in hybrid attention via learnable allocation
    Deep Learning · Inference · Reinforcement Learning
    Hybrid architectures combining full and sliding-window attention are promising for efficient LLM inference but often rely on hand-crafted rules. ConSA introduces learnable allocation to achieve controllable sparsity in hybrid attention.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Catastrophic Forgetting is Low-Rank: A Function-Space Theory for Continual Adaptation
    Catastrophic forgetting is low-rank: a function-space theory
    Fine-tuning · Reinforcement Learning
    Catastrophic forgetting in continual adaptation is usually viewed via parameter drift or replay, which do not reveal which output directions are vulnerable. The paper gives a function-space account in the NTK regime, showing that new-task training induces a low-rank drift in old-task predictions through the cross-task kernel.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation Scaling
    LoopCoder-v2: loop once for efficient test-time compute scaling
    Deep Learning · Software Engineering · Transformer
    Looped transformers scale latent computation by repeating shared blocks, but sequential looping raises latency and KV-cache memory with loop count. Building on parallel loop transformers, LoopCoder-v2 makes loop count a practical knob for efficient test-time computation scaling.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Recursive Scaling in Masked Diffusion Models
    Recursive scaling in masked diffusion models
    Deep Learning · Inference · Transformer
    Masked diffusion models (MDMs) have recently emerged as a generative approach. The paper investigates recursive scaling in MDMs, offering insights into their behavior and efficiency.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Half a Link can Be Enough to Predict a Whole Link: Understanding Generalization in Knowledge Graph Foundation Models
    Half a link can predict a whole link: generalization in KG foundation models
    Inference
    Knowledge graph foundation models are zero-shot generalizers that, trained once, predict links on unseen graphs without retraining. The paper sheds light on when and how they robustly generalize across knowledge graphs.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic Termination
    VoidPadding lets [VOID] handle padding so [EOS] focuses on termination
    Deep Learning · Inference · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    In masked diffusion language models, the roles of padding and semantic termination become entangled. VoidPadding introduces a [VOID] token to handle padding so that [EOS] can focus on signaling semantic termination, improving generation behavior.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Differential Privacy of Gaussian Process Posterior Sampling
    Differential privacy of Gaussian process posterior sampling
    Inference
    The paper studies privacy when releasing posterior sample paths from a Gaussian process where the entire training set is private. Unlike DP mechanisms that add external noise, it shows the intrinsic randomness of posterior sampling itself yields differential-privacy guarantees.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs
    SoftMoE: soft differentiable routing for mixture-of-experts in LLMs
    Inference · Mixture of Experts (MoE)
    Sparse mixture-of-experts architectures scale LLM parameters but their discrete routing complicates training. SoftMoE introduces soft, differentiable routing for mixture-of-experts in LLMs to enable more stable and efficient expert selection.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations
    Order-independent cell representations revisit table recognition
    Inference
    Multi-task table recognition jointly handles structure prediction, cell localization and content recognition, often via autoregressive decoders whose hidden states are reused. The paper revisits this structural dependency using order-independent cell-level representations.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor
    AnchorKV: safety-aware KV cache compression via soft penalties
    Inference · Reinforcement Learning
    AnchorKV is a safety-aware KV cache compression method that uses a soft penalty with a refusal anchor to retain important key-value entries while reducing memory. This summary is largely title-based; details are as presented by the source and not independently verified.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    From Drift to Coherence: Stabilizing Beliefs in LLMs
    From drift to coherence: stabilizing beliefs in LLMs
    Fine-tuning · Inference · Reinforcement Learning · Software Engineering
    LLMs are hypothesized to perform implicit Bayesian inference, yet the martingale property of predictive beliefs has been shown to fail in synthetic in-context learning. Revisiting this in typical regimes such as multiple-choice QA, the paper studies how to stabilize beliefs, moving them from drift to coherence.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation
    Improving low-resource ASR via bilingual fine-tuning with language ID
    Fine-tuning · Inference · Speech Processing
    The study explores improving low-resource automatic speech recognition using bilingual fine-tuning combined with language identification, and evaluates the approach across languages in a cross-linguistic setting.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    LLMs Infer Cultural Context but Fail to Apply It When Responding
    LLMs infer cultural context but fail to apply it when responding
    Inference · Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Software Engineering
    LLMs are known to overrepresent dominant, often Western cultures while marginalizing others. The paper evaluates how this affects culturally adapted response generation, finding that models can infer cultural context but fail to apply it when responding.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns
    Reusing web skills via transferable interaction patterns
    AI Agents · Meta · Retrieval-Augmented Generation (RAG)
    LLM web agents are usually deployed as tool callers that read a fresh page observation each turn and emit a structured action. The paper proposes reusing web skills across domains via transferable interaction patterns rather than domain-specific behaviors.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation
    OPD-Evolver cultivates self-evolving agents via on-policy distillation
    AI Agents
    Memory is a standard substrate for self-evolving agents, but retaining experience differs from learning how to evolve through it. OPD-Evolver uses on-policy distillation to cultivate a holistic agent evolver that selects useful experience, acts on it and writes reusable knowledge.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Infrastructure & Hardware extract
    Exact Posterior Score Estimation for Solving Linear Inverse Problems
    Exact closed-form posterior score for linear inverse problems
    Inference · Reinforcement Learning
    The paper derives the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, showing that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot with anisotropic noise. It turns this into a training objective, Exact Posterior Score (EPS), that preserves standard denoising structure and can be trained from scratch or fine-tuned.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing
    KVEraser edits the KV cache to erase context efficiently
    Fine-tuning · Reinforcement Learning
    Erasing a span from a long-context KV cache is costly because a local edit propagates to all later tokens, forcing recomputation of the suffix. KVEraser instead replaces only the erased interval's KV states with learned steering states while reusing the rest of the cache. A two-stage training pipeline teaches a transferable erasing mechanism for stale facts, wrong tool outputs, or prompt injections.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    HAMON: Passive Optical Sequence Mixing for Long-Horizon Forecasting
    HAMON: a passive optical core for long-horizon forecasting
    Inference · Neural Network · Transformer
    HAMON is a passive diffractive optical forecasting core: history is encoded onto an optical aperture and cascaded trainable phase masks with free-space diffraction shape the forecast directly in the output field. Inference is a single passive optical pass with no digital sequence-mixing layer, yet it beats strong digital baselines on ETTm2.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    ExpRL: Exploratory RL for LLM Mid-Training
    ExpRL uses human QA as reward scaffolds for LLM mid-training RL
    Fine-tuning · Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Software Engineering
    ExpRL is an RL-based mid-training method that uses large human-written QA corpora as reward scaffolds rather than imitation targets: reference answers are hidden from the policy and used only to build problem-specific grading rubrics for judging on-policy reasoning, automating skill acquisition for harder problems.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    TokenPilot: Cache-Efficient Context Management for LLM Agents
    TokenPilot cuts LLM-agent context costs by ~61% while preserving the prompt cache
    AI Agents · Inference · Natural Language Processing (NLP) · Reinforcement Learning
    TokenPilot is a dual-granularity context manager for LLM agents that avoids the cache invalidation caused by unconstrained pruning. Ingestion-Aware Compaction stabilizes prompt prefixes while Lifecycle-Aware Eviction offloads segments only when relevance expires, cutting costs by 61% and 56% in benchmarks.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    TuneJury: An Open Metric for Improving Music Generation Preference Alignment
    TuneJury: an open reward model for text-to-music preference
    Deep Learning · Inference
    TuneJury is an open, instance-level pairwise reward model that predicts text-to-music preference scores from a prompt and an audio clip, trained on publicly available human-preference labels. Its calibrated score margins support data filtering, and an 'anchor calibration' step efficiently extends it to generators released after training.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
    Bayesian audit of public frontier-AI evaluation archives proposed
    Inference · Reinforcement Learning
    The paper treats public AI evaluation archives (e.g., LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as selective time series rather than terminal leaderboards, framing them as a Bayesian inference problem. It reports that selection-aware frontier models fail synthetic recovery and calibration, while fixed audit gates remain informative.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation
    ActiveSAM turns frozen SAM 3 into a training-free open-vocab segmenter
    Inference · Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    ActiveSAM is a training-free, zero-shot framework that adapts the frozen SAM 3 backbone for open-vocabulary semantic segmentation. It estimates an image-conditioned active class set from a low-resolution presence preview, then decodes only the retained classes at full resolution, improving efficiency over decoding the entire dataset vocabulary per image.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Agent trajectories as programs: fingerprinting and programming coding-agent behavior
    Coding agents have behavioral fingerprints identifiable from trajectories
    AI Agents · Neural Network · Software Engineering
    The paper compares agents procedurally rather than by benchmark scores, defining behavioral 'fingerprints.' Across ten agents, a probe over these procedural signatures attributes an unseen trajectory to the correct agent with 85.7% accuracy. Using an emergent, compressive vocabulary induction over SWE-Bench trajectories, it studies the structural distinctness of agent problem-solving.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Dynestyx: A Probabilistic Programming Library for Dynamical Systems
    Dynestyx: a probabilistic programming library with first-class SSMs
    Inference · Machine Learning
    State-space models (SSMs) are the standard formalism for Bayesian treatment of dynamical systems, yet they have been hard to incorporate into modern probabilistic programming languages. The authors introduce dynestyx, a library with first-class SSM support and state-of-the-art state and parameter estimation. Through one interface, users specify arbitrary priors for discrete- or continuous-time systems, run inference over mixed-effect data, and obtain estimates with principled uncertainty.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Decoupling Inference from State Updates in Low-Latency Feature Engines via Probabilistic Thinning
    Probabilistic thinning decouples inference from state updates in streams
    Inference · Machine Learning · Neural Network · Retrieval-Augmented Generation (RAG)
    Streaming data systems increasingly underpin ML workflows that maintain many continuously updated aggregations. In production, each event triggers read-modify-write operations to storage, making high-frequency state updates a dominant source of latency, contention, and cost. This work decouples inference from persistence via probabilistic thinning: every event is scored, but durable updates fire only for informative events, using approximate disk-backed statistics with no in-memory control plane.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Probing Low Frame Rate Degradation in Neural Audio Codecs
    Probing why neural audio codecs degrade at low frame rates
    Inference · Reinforcement Learning · Speech Processing
    Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where cost scales with sequence length. Codecs can run at 12.5 Hz and below, but the mechanisms of low-frame-rate degradation are unclear. Through a controlled frame-rate ablation, the authors reproduce a quality cliff at 6.25 Hz and test candidate explanations, namely phonemic collisions and codebook saturation, finding no fundamental barrier. The cliff instead stems from suboptimal training choices such as a fixed clip duration.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
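
The probabilistic-thinning entry above ("Decoupling Inference from State Updates in Low-Latency Feature Engines") describes scoring every event while issuing durable updates only for informative ones. A minimal sketch of that general idea follows. It is not the paper's algorithm: the class name `ThinnedSum`, the `floor` and `scale` parameters, the magnitude-based write rule, and the 1/p reweighting are all illustrative assumptions chosen to keep the example small.

```python
import random


class ThinnedSum:
    """Running sum whose durable writes are thinned by event informativeness.

    Illustrative only: the write rule and the 1/p reweighting are assumptions,
    not details taken from the paper.
    """

    def __init__(self, floor=0.05, scale=10.0, seed=0):
        self.floor = floor    # hypothetical minimum write probability
        self.scale = scale    # hypothetical magnitude at which writes become certain
        self.rng = random.Random(seed)
        self.total = 0.0      # stands in for the durable, disk-backed aggregate
        self.writes = 0       # number of durable updates actually issued

    def write_probability(self, value):
        """Map |value| to a write probability in [floor, 1]."""
        return min(1.0, max(self.floor, abs(value) / self.scale))

    def process(self, value):
        """Score every event, but persist only a random, informative subset."""
        feature = self.total  # inference path: read-only, runs for every event
        p = self.write_probability(value)
        if self.rng.random() < p:
            # Reweight by 1/p so the stored sum stays unbiased in expectation.
            self.total += value / p
            self.writes += 1
        return feature


if __name__ == "__main__":
    agg = ThinnedSum()
    for _ in range(1000):
        agg.process(1.0)  # small events: each is written with probability 0.1
    print(f"writes={agg.writes} of 1000, estimated sum={agg.total:.1f}")
```

In this sketch, large events are always persisted and small ones are persisted rarely but with a larger weight, so the number of read-modify-write operations drops while the stored aggregate remains an unbiased estimate of the true sum. A real system would trade that reduction in writes against the added variance of the estimate.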