Inference & Efficiency A
-
UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning
UNIEGO: unified egocentric video encoder via multi-teacher distillation
UNIEGO is a unified egocentric video encoder trained via a hierarchical multi-teacher distillation framework. Representation-specific proxy models translate knowledge from teachers spanning multiple viewpoints, modalities, and foundation models into a single egocentric space, while remaining deployable from egocentric video alone.
-
Multi-Task Bayesian In-Context Learning
Multi-task Bayesian inference via in-context learning
The paper studies multi-task Bayesian in-context learning, using in-context learning to perform Bayesian predictive inference across tasks. It targets the intractability of exact inference and the cost or restrictiveness of scalable approximations, aiming for uncertainty quantification and data efficiency.
-
Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving
Execution-State Capsules: checkpoint/restore for on-device AI serving
The paper introduces Execution-State Capsules, a graph-bound mechanism to checkpoint and restore execution state for low-latency, small-batch, on-device physical-AI serving. It targets scenarios beyond the high-throughput, high-concurrency regime that paged or radix KV caches mainly serve.
-
LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
LedgerAgent: structured state for policy-adherent tool-calling agents
Policy-adherent tool-calling agents in customer-service domains must track task state across turns while following rules. LedgerAgent introduces structured state to help such agents stay consistent and policy-compliant.
-
DeepSWIP: Quotient-WMC Counterfactuals for Neural Probabilistic Logic Programs
DeepSWIP: quotient-WMC counterfactuals for neural probabilistic logic programs
Neurosymbolic systems such as DeepProbLog combine neural perception with probabilistic logic, but standard inference has limits. DeepSWIP introduces quotient-WMC counterfactuals to enable counterfactual reasoning in neural probabilistic logic programs.
-
FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
FlowEdit: associative memory for lifelong pronunciation adaptation in TTS
Flow-matching text-to-speech achieves strong zero-shot quality but stays static after deployment. FlowEdit uses associative memory to enable lifelong pronunciation adaptation without full retraining.
-
Efficient and Sound Probabilistic Verification for AI Agents
Efficient and sound probabilistic verification for AI agents
Securing AI agents that operate in complex digital environments has become critical, motivating runtime verification. This paper presents an efficient and sound probabilistic verification approach for AI agents.
-
Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution
Marginal advantage accumulation for self-evolving memory agents
The paper proposes marginal advantage accumulation, a cross-batch, operation-level mechanism for memory-driven agent self-evolution. It aims to distinguish stably effective memory operations from accidental hits, addressing contradictory feedback that the same operation can receive across different batches in trace distillation.
-
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
UltraQuant: 4-bit KV caching for context-heavy agents
Context-heavy agents put unusual pressure on the key-value cache as long prefixes are reused across calls. UltraQuant applies 4-bit quantization to compress the KV cache while preserving quality.
-
HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction
HEPTv2: an efficient point transformer for particle tracking
The paper presents HEPTv2, an end-to-end efficient point transformer for charged-particle reconstruction. It targets tracking, the reconstruction of trajectories from sparse detector measurements under extreme combinatorial ambiguity, and aims to stay accurate and efficient at the High-Luminosity LHC.
-
Neural network surrogates with uncertainty quantification for inverse problems in partial differential equations
NN surrogates with uncertainty quantification for PDE inverse problems
The paper develops neural network surrogates with uncertainty quantification for inverse problems in partial differential equations. It targets the inference of unknown model parameters from noisy or incomplete observations, where traditional numerical methods are costly, particularly in Bayesian settings.
-
Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Rethinking shrinkage bias in LLM FP4 pretraining with a UFP4 recipe
FP4 training promises large memory and compute savings for LLM pretraining but suffers from shrinkage bias. This paper analyzes its geometric origin and systemic impact and proposes a UFP4 recipe to address it.
-
AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning
AutoPass: evidence-guided LLM agents for compiler performance tuning
Large language models show promise for code compilation tasks but struggle with runtime performance tuning. AutoPass uses evidence-guided LLM agents to perform compiler performance tuning.
-
Robust $Q$-learning for mean-field control under Wasserstein uncertainty in common noise
Robust Q-learning for mean-field control under Wasserstein uncertainty
This paper presents a robust Q-learning algorithm for discrete-time mean-field control problems with common noise, accounting for Wasserstein uncertainty in the model.
-
SoftSkill: Behavioral Compression for Contextual Adaptation
SoftSkill: behavioral compression for contextual adaptation
Agent skills are commonly deployed as natural-language Markdown files that encode answer policies. SoftSkill compresses such behaviors to enable more efficient contextual adaptation.
-
Token-Operations-Oriented Inference Optimization Techniques for Large Models
Token-operation-oriented inference optimization for large models
The paper proposes a token-operations-oriented framework for large-model inference optimization, presenting a four-layer technical architecture aimed at scalable, low-cost, and stable large-model services. The layers include components such as multi-model fusion.
-
Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference
Explicit knowledge conflict resolution for LLM inference
Large language models perform strongly across language tasks but can hold conflicting parametric and contextual knowledge. This work proposes explicit knowledge conflict resolution to navigate unreliable knowledge during inference.
-
HilDA: Hierarchical Distillation with Diffusion for Advancing Self-Supervised LiDAR Pre-training
HilDA: hierarchical distillation with diffusion for self-supervised LiDAR
Using vision foundation models for camera-to-LiDAR knowledge distillation is promising. HilDA advances self-supervised LiDAR pre-training through hierarchical distillation with diffusion.
-
When Does Streaming Tool Use Help? Characterizing Tool-Intent Stabilization in Streaming Retrieval-Augmented Generation
When does streaming tool use help in streaming RAG?
The paper characterizes when streaming tool use helps in streaming retrieval-augmented generation, which issues tool queries in parallel with ongoing user input to cut perceived latency. It argues the benefit is query-intrinsic and studies how tool intent stabilizes before an utterance is complete.
-
HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
HydraHead: head-level hybridization of linear and full attention
The paper proposes HydraHead, a hybrid attention design that exploits head-level functional heterogeneity to combine linear and full attention. It moves beyond the common layer-wise hybridization strategy, addressing the difficulty of integrating linear attention with full attention for efficient long-context processing.
-
GEMS: Geometric Constraints Enable Multi-Semantic Superposition in LLMs
GEMS: geometric constraints for multi-semantic activation steering
The paper introduces GEMS, which uses geometric constraints to enable superposing multiple semantic directions in LLM activation steering. It addresses the collapse that occurs when existing single-direction steering methods inject several semantic directions at once without constraints.
-
Light-weight Pronunciation Assessment via Discrete Speech Token Surprisal
Lightweight pronunciation assessment via speech token surprisal
The paper proposes a lightweight framework for automated pronunciation assessment based on discrete speech token surprisal, trained only on native speech resources. It operates unsupervised or with light calibration from a small set of scored utterances, avoiding costly labeled learner-error or non-native corpora.
-
Closing the Calibration Gap in Semantic Caching
Closing the calibration gap in semantic caching
The paper addresses the calibration gap in semantic caching, which cuts LLM inference costs by serving cached responses to semantically similar queries. It shows that evaluating with PR-AUC, which measures only ranking and not usability at a fixed threshold, leads to systematically poor deployment choices.
-
Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
Rubric-conditioned self-distillation rethinks reward supervision
Post-training of reasoning models often combines supervised distillation with reinforcement learning from verifiable rewards, but distillation relies on costly chain-of-thought annotations. This work proposes rubric-conditioned self-distillation to rethink reward supervision while cutting annotation cost.
-
Diffusion-Proof: Recipe for Formal Theorem Proving Beyond Auto-Regressive Generation
Diffusion-Proof: formal theorem proving beyond autoregressive generation
Enhancing formal math reasoning in LLMs has become a key focus, but most work relies on autoregressive generation. Diffusion-Proof offers a recipe for formal theorem proving using diffusion models, exploring proof search through a non-autoregressive framework and training approach.
-
Structured Inference with Large Language Gibbs
Structured probabilistic inference over LLMs via Gibbs sampling
Knowledge encoded in LLMs can serve as a substrate for structured reasoning over variables describing a complex world, but accessing it probabilistically is hard. This work performs structured inference over LLMs using Gibbs sampling, enabling probabilistic reasoning across interrelated variables.
-
DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models
DreamReasoner-8B: block-size curriculum learning for diffusion reasoning
Block diffusion language models speed decoding via parallel block-wise denoising, but reliably scaling them for long chain-of-thought reasoning is unresolved. The authors develop DreamReasoner-8B, using block-size curriculum learning to strengthen long-CoT reasoning in diffusion reasoning models.
-
Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight
Hardware/vision-in-the-loop validation of monocular UAV pose estimation
Autonomous UAV operations on ships need reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents hardware- and vision-in-the-loop validation of deep monocular pose estimation for autonomous maritime UAV flight.
-
Essential Subspace Merging for Multi-Task Learning
Essential subspace merging for multi-task model merging
Model merging integrates the capabilities of several models fine-tuned from the same pretrained checkpoint into one, enabling multi-task learning. This work proposes Essential Subspace Merging, which extracts and merges each task's essential subspace to reduce interference and preserve multi-task performance.
-
The Reward Was in Your Data All Along: Correcting Flow Matching with Discriminator-Guided RL
Discriminator-guided RL corrects flow matching using your data
Score- and flow-matching models often rely on preference-based RL both to align with subjective preferences and, surprisingly, to recover certain properties. This work argues the reward was in the data all along, correcting flow matching with discriminator-guided reinforcement learning.