Inference & Efficiency

  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning
    UNIEGO: unified egocentric video encoder via multi-teacher distillation
    Neural Network
    UNIEGO is a unified egocentric video encoder trained via a hierarchical multi-teacher distillation framework. Representation-specific proxy models translate knowledge from teachers spanning multiple viewpoints, modalities, and foundation models into a single egocentric space, while remaining deployable from egocentric video alone.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Multi-Task Bayesian In-Context Learning
    Multi-task Bayesian inference via in-context learning
    Inference · Meta · Reinforcement Learning · Transformer
    The paper studies multi-task Bayesian in-context learning, using in-context learning to perform Bayesian predictive inference across tasks. It targets the intractability of exact inference and the cost or restrictiveness of scalable approximations, aiming for uncertainty quantification and data efficiency.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving
    Execution-State Capsules: checkpoint/restore for on-device AI serving
    AI Agents · Meta · NVIDIA · Retrieval-Augmented Generation (RAG) · Speech Processing
    The paper introduces Execution-State Capsules, a graph-bound mechanism to checkpoint and restore execution state for low-latency, small-batch, on-device physical-AI serving. It targets scenarios beyond the high-throughput, high-concurrency regime that paged or radix KV caches mainly serve.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
    LedgerAgent: structured state for policy-adherent tool-calling agents
    AI Agents · Inference · Retrieval-Augmented Generation (RAG)
    Policy-adherent tool-calling agents in customer-service domains must track task state across turns while following rules. LedgerAgent introduces structured state to help such agents stay consistent and policy-compliant.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs
    DeepSWIP: quotient-WMC counterfactuals for neural probabilistic logic programs
    Inference · Reinforcement Learning
    Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference has limits. DeepSWIP introduces quotient-WMC counterfactuals to enable counterfactual reasoning in neural probabilistic logic programs.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
    FlowEdit: associative memory for lifelong pronunciation adaptation in TTS
    Embeddings · Inference · Speech Processing
    Flow-matching text-to-speech achieves strong zero-shot quality but stays static after deployment. FlowEdit uses associative memory to enable lifelong pronunciation adaptation without full retraining.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Agents & Tool Use extract
    Efficient and Sound Probabilistic Verification for AI Agents
    Efficient and sound probabilistic verification for AI agents
    AI Agents · Deep Learning · Inference · Neural Network
    Securing AI agents that operate in complex digital environments has become critical, motivating runtime verification. This paper presents an efficient and sound probabilistic verification approach for AI agents.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution
    Marginal advantage accumulation for self-evolving memory agents
    The paper proposes marginal advantage accumulation, a cross-batch, operation-level mechanism for memory-driven agent self-evolution. In trace distillation, the same memory operation can receive contradictory feedback across batches; the mechanism aims to separate stably effective operations from accidental hits.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    UltraQuant: 4-bit KV Caching for Context-Heavy Agents
    UltraQuant: 4-bit KV caching for context-heavy agents
    AI Agents · Inference · Quantization
    Context-heavy agents put unusual pressure on the key-value cache as long prefixes are reused across calls. UltraQuant applies 4-bit quantization to compress the KV cache while preserving quality.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction
    HEPTv2: an efficient point transformer for particle tracking
    Inference · Machine Learning · Neural Network · NVIDIA · Transformer
    The paper presents HEPTv2, an end-to-end efficient point transformer for charged-particle reconstruction. It targets tracking, the reconstruction of trajectories from sparse detector measurements under extreme combinatorial ambiguity, and aims to stay accurate and efficient at the High-Luminosity LHC.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Neural network surrogates with uncertainty quantification for inverse problems in partial differential equations
    NN surrogates with uncertainty quantification for PDE inverse problems
    Inference · Neural Network · Reinforcement Learning
    The paper develops neural network surrogates with uncertainty quantification for inverse problems in partial differential equations. It targets the inference of unknown model parameters from noisy or incomplete observations, where traditional numerical methods are costly, particularly in Bayesian settings.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Infrastructure & Hardware extract
    Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
    Rethinking shrinkage bias in LLM FP4 pretraining with a UFP4 recipe
    Mixture of Experts (MoE) · NVIDIA · Quantization
    FP4 training promises large memory and compute savings for LLM pretraining but suffers from shrinkage bias. This paper analyzes its geometric origin and systemic impact and proposes a UFP4 recipe to address it.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning
    AutoPass: evidence-guided LLM agents for compiler performance tuning
    AI Agents · Fine-tuning · Inference
    Large language models show promise for code compilation tasks but struggle with runtime performance tuning. AutoPass uses evidence-guided LLM agents to perform compiler performance tuning.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Robust $Q$-learning for mean-field control under Wasserstein uncertainty in common noise
    Robust Q-learning for mean-field control under Wasserstein uncertainty
    Quantization
    This paper presents a robust Q-learning algorithm for discrete-time mean-field control problems with common noise, accounting for Wasserstein uncertainty in the model.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    SoftSkill: Behavioral Compression for Contextual Adaptation
    SoftSkill: behavioral compression for contextual adaptation
    Computer Vision · Deep Learning · Inference · Software Engineering
    Agent skills are commonly deployed as natural-language Markdown files that encode answer policies. SoftSkill compresses such behaviors to enable more efficient contextual adaptation.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Token-Operations-Oriented Inference Optimization Techniques for Large Models
    Token-operation-oriented inference optimization for large models
    Inference · Reinforcement Learning
    The paper proposes a token-operations-oriented framework for large-model inference optimization, presenting a four-layer technical architecture aimed at scalable, low-cost, and stable large-model services. The layers include components such as multi-model fusion.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Infrastructure & Hardware extract
    Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference
    Explicit knowledge conflict resolution for LLM inference
    Inference · Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Software Engineering
    Large language models perform strongly across language tasks but can hold conflicting parametric and contextual knowledge. This work proposes explicit knowledge conflict resolution to navigate unreliable knowledge during inference.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training
    HilDA: hierarchical distillation with diffusion for self-supervised LiDAR
    Computer Vision · Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    Using vision foundation models for camera-to-LiDAR knowledge distillation is promising. HilDA advances self-supervised LiDAR pre-training through hierarchical distillation with diffusion.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation
    When does streaming tool use help in streaming RAG?
    Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Software Engineering
    The paper characterizes when streaming tool use helps in streaming retrieval-augmented generation, which issues tool queries in parallel with ongoing user input to cut perceived latency. It argues the benefit is query-intrinsic and studies how tool intent stabilizes before an utterance is complete.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Infrastructure & Hardware extract
    HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
    HydraHead: head-level hybridization of linear and full attention
    Neural Network · Retrieval-Augmented Generation (RAG)
    The paper proposes HydraHead, a hybrid attention design that exploits head-level functional heterogeneity to combine linear and full attention. It moves beyond the common layer-wise hybridization strategy, addressing the difficulty of integrating linear attention with full attention for efficient long-context processing.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs
    GEMS: geometric constraints for multi-semantic activation steering
    Deep Learning · Inference
    The paper introduces GEMS, which uses geometric constraints to enable superposing multiple semantic directions in LLM activation steering. It addresses the collapse that occurs when existing single-direction steering methods inject several semantic directions at once without constraints.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal
    Lightweight pronunciation assessment via speech token surprisal
    Inference · Speech Processing
    The paper proposes a lightweight framework for automated pronunciation assessment based on discrete speech token surprisal, trained only on native speech resources. It operates unsupervised or with light calibration from a small set of scored utterances, avoiding costly labeled learner-error or non-native corpora.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Closing the Calibration Gap in Semantic Caching
    Closing the calibration gap in semantic caching
    Inference
    The paper addresses the calibration gap in semantic caching, which cuts LLM inference costs by serving cached responses to semantically similar queries. It shows that evaluating with PR-AUC, which measures only ranking and not usability at a fixed threshold, leads to systematically poor deployment choices.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
    Rubric-conditioned self-distillation rethinks reward supervision
    Neural Network · Reinforcement Learning
    Post-training of reasoning models often combines supervised distillation with reinforcement learning from verifiable rewards, but distillation relies on costly chain-of-thought annotations. This work proposes rubric-conditioned self-distillation to rethink reward supervision while cutting annotation cost.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation
    Diffusion-Proof: formal theorem proving beyond autoregressive generation
    Deep Learning · Inference · Retrieval-Augmented Generation (RAG)
    Enhancing formal math reasoning in LLMs has become a key focus, but most work relies on autoregressive generation. Diffusion-Proof offers a recipe for formal theorem proving using diffusion models, exploring proof search through a non-autoregressive framework and training approach.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Structured Inference with Large Language Gibbs
    Structured probabilistic inference over LLMs via Gibbs sampling
    Inference · Neural Network · Reinforcement Learning
    Knowledge encoded in LLMs can serve as a substrate for structured reasoning over variables describing a complex world, but accessing it probabilistically is hard. This work performs structured inference over LLMs using Gibbs sampling, enabling probabilistic reasoning across interrelated variables.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models
    DreamReasoner-8B: block-size curriculum learning for diffusion reasoning
    Inference · Machine Learning
    Block diffusion language models speed decoding via parallel block-wise denoising, but reliably scaling them for long chain-of-thought reasoning is unresolved. The authors develop DreamReasoner-8B, using block-size curriculum learning to strengthen long-CoT reasoning in diffusion reasoning models.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight
    Hardware/vision-in-the-loop validation of monocular UAV pose estimation
    Transformer
    Autonomous UAV operations on ships need reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents hardware- and vision-in-the-loop validation of deep monocular pose estimation for autonomous maritime UAV flight.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    Essential Subspace Merging for Multi-Task Learning
    Essential subspace merging for multi-task model merging
    Inference · Neural Network
    Model merging integrates the capabilities of several models fine-tuned from the same pretrained checkpoint into one, enabling multi-task learning. This work proposes Essential Subspace Merging, which extracts and merges each task's essential subspace to reduce interference and preserve multi-task performance.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL
    Discriminator-guided RL corrects flow matching using your data
    Inference · Neural Network · Reinforcement Learning
    Score- and flow-matching models often rely on preference-based RL both to align with subjective preferences and, surprisingly, to recover certain properties. This work argues the reward was in the data all along, correcting flow matching with discriminator-guided reinforcement learning.
    Read original (arXiv cs.LG (Machine Learning)) ↗