Inference & Efficiency

  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Beyond the Smile: A Hybrid Convolutional VAE for Crypto Volatility Surfaces
    A convolutional VAE for completing crypto implied-volatility surfaces
    Inference Neural Network Retrieval-Augmented Generation (RAG) Reinforcement Learning
    The authors present a convolutional variational autoencoder (VAE) for crypto implied-volatility surfaces, together with a hybrid predictor that combines the VAE with a quadratic smile re-fit through a deterministic per-tenor routing rule. Trained on 6,034 hourly Binance BTC and ETH option surfaces (May-October 2023), the model achieves a hidden-cell completion RMSE of 0.94-1.56 vol points. At 50% masking, the hybrid reaches 0.83 vol points versus 7.00 for the smile re-fit alone, roughly an eightfold reduction at no extra inference cost.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • NVIDIA Developer Blog · EN Developer Tools extract
    Boosting MoE Training Throughput with Advanced Fusion Kernels
    NVIDIA details advanced fusion kernels to boost MoE training throughput
    Deep Learning Generative AI Machine Learning Mixture of Experts (MoE) NVIDIA
    On its developer blog, NVIDIA explains advanced fusion-kernel techniques aimed at boosting training throughput for Mixture-of-Experts (MoE) models. Noting that MoE has rapidly become a foundational component of modern large-scale AI systems, the post outlines kernel-level optimizations for more efficient training.
    Read original (NVIDIA Developer Blog) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    A Causal Model of Theory of Mind in Conflict for Artificial Intelligence
    A structural causal model for when AI should engage theory of mind in conflict
    Inference
    Theory of mind (ToM), the ascription of mental states to others for prediction and inference, is widely assumed to be essential for human-machine integration. Existing AI-ToM models address how to mentalize but leave the question of when to do so largely unaddressed. The paper asks under what situational and agent-level conditions ToM engagement is causally warranted in conflict. It presents a structural causal model, expressed as a directed acyclic graph, that treats ToM as a mechanism activated by conditions rather than an always-on capacity.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter
    Study probes extrinsic and intrinsic traits of code-interpreter reasoning
    Fine-tuning Inference Retrieval-Augmented Generation (RAG) Reinforcement Learning
    This paper studies reasoning with a Code Interpreter (CI) in LLMs from two angles: extrinsic properties (crucial tokens) and intrinsic properties (code-specific cognitive behaviors). It reports that stronger CI reasoning models show more crucial tokens and behaviors—especially verification, backtracking, and backward chaining—and explores leveraging these at inference and training time. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting
    RAID: retrieval-augmented diffusion for cold-start, cross-lingual forecasting
    Embeddings Inference Meta Retrieval-Augmented Generation (RAG)
    Time-series foundation models transfer well when given a history window, but true cold-start items with no prior observations violate that assumption. The authors propose RAID (Retrieval-Augmented Iterative Diffusion), which replaces history-based correlation with metadata-driven semantic retrieval and graph-conditioned diffusion. It maps metadata into a shared semantic space via a frozen multilingual embedding model, builds an inductive retrieval graph for unseen items, and refines a forecast from neighbors.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance
    MA-SBI: misspecification-aware inference via side-channel guidance
    Inference Neural Network Reinforcement Learning
    Simulation-based inference (SBI) is often hindered by simulator misspecification, the mismatch between simulated and real observations. The recent robust method RoPE uses optimal transport between learned representations, but it needs ground-truth calibration pairs, which are unavailable in the settings where SBI is needed. Practitioners instead have unstructured side-information such as regime labels, instruction text, and policy bulletins. The authors propose Misspecification-Aware SBI (MA-SBI) to exploit this guidance.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    LESS Is More: Mutual-Stability Sampling for Diffusion Language Models
    LESS: a training-free adaptive sampler for diffusion language models
    Deep Learning Inference Neural Network Retrieval-Augmented Generation (RAG) Transformer
    The paper presents LESS, a training-free, model-agnostic adaptive sampler for diffusion LLMs that frames token commitment as an online stopping problem. Its mutual-stability rule unmasks a position only when its top-1 prediction is confident, persists across recent steps, and is distributionally stable (top-K inter-step JS divergence). It is evaluated on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
    Binary Tracking: open vision-language models for spatial QA and navigation
    AI Agents Computer Vision GPT Inference Retrieval-Augmented Generation (RAG)
    The paper addresses spatial question answering for service robots traversing long egocentric routes, returning metric coordinates that downstream navigation can act on for queries like 'where can I find a dry cleaner on the way back home?' Prior approaches rely on closed-source models such as GPT-4o, which robots cannot reliably depend on due to network instability, latency, and deployment cost. The authors propose Binary Tracking, an open-source vision-language approach that can run onboard.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens
    Anchor-token roadmap for revocable decoding in diffusion LLMs
    Deep Learning Embeddings Inference Retrieval-Augmented Generation (RAG) Speech Processing
    An arXiv paper addresses the speed-quality trade-off and error propagation in revocable decoding for diffusion LLMs (dLLMs). It proposes following a latent 'roadmap' guided by anchor tokens to mitigate failures arising in mixed-quality contexts during parallel generation. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models
    Paper: Expert Tying shares MoE expert params across layers
    DeepSeek Inference Mixture of Experts (MoE) Transformer
    An arXiv paper introduces Expert Tying, an architectural change that shares expert parameters across consecutive transformer layers while keeping independent layer-wise routing and attention, aiming to cut Mixture-of-Experts memory cost. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents
    GIST-CMTF adds goal-state inference to causal minimal tool filtering
    AI Agents Deep Learning Inference
    The paper introduces GIST-CMTF, which augments Causal Minimal Tool Filtering with goal-state inference for tool-augmented LLM agents. It addresses wrong-goal execution, where ambiguous requests such as "handle my appointment" map to multiple goals and an agent may follow a valid causal tool path toward an unintended objective.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Developer Tools extract
    LLM-based Visual Code Completion for Aerospace Geometric Design
    Paper: LLM visual-programming copilot for aerospace design
    GPT Inference Neural Network
    An arXiv paper presents an LLM-based visual programming copilot for aerospace geometric design tasks, using a visual-programming variant of the ReAct methodology. Summarized neutrally from the abstract; claims are the authors' and not independently verified.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis
    Physics-guided multi-scale framework for bearing fault diagnosis
    Inference Reinforcement Learning
    The paper proposes a progressive, physics-guided multi-scale vibration-processing pipeline for bearing fault diagnosis, using a kinematics-derived descriptor for real-time screening and fault-adaptive segmentation. The summary reflects the abstract; its claims are not independently verified.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing
    DoubtProbe: a dual-branch inference-time defense against LLM jailbreaks
    Inference Llama Retrieval-Augmented Generation (RAG)
    This arXiv paper proposes DoubtProbe, a dual-branch inference-time framework for black-box jailbreak defense in LLMs. The authors observe that many jailbreaks do not remove the harmful goal but reorganize the information needed to express it, evading safety alignment while remaining recoverable during generation. DoubtProbe combines structural verification and semantic auditing to counter this.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • Sakana AI Blog (ja) · JA New Model Releases extract
    Sakana AI Launches Its First Commercial Product, "Sakana Marlin"
    Sakana AI launches Marlin, its first commercial autonomous research assistant
    AI Agents Algorithms & Theory Inference Neural Network Reinforcement Learning
    Sakana AI has launched Sakana Marlin, its first commercial product: an autonomous research assistant for business. Given a research theme, it works autonomously for up to about eight hours, forming hypotheses and gathering and verifying information, then outputs structured summary slides and a report spanning dozens of pages. Built on the firm's long-horizon reasoning technology, it aims to act as a 'virtual CSO.' It is self-serve and available the same day, with plans ranging from free pay-per-use to Enterprise.
    Read original (Sakana AI Blog (ja)) ↗
  • Lobste.rs (AI tagged) · EN Inference & Efficiency extract
    The future of Siri, or: why private inference isn’t private enough
    The future of Siri: why private inference isn't private enough
    Inference
    An essay on the future of voice assistants like Siri, arguing that on-device or 'private' inference alone does not fully protect user privacy and that stronger guarantees are needed beyond encryption and local processing.
    Read original (Lobste.rs (AI tagged)) ↗
  • NVIDIA Developer Blog · EN Agents & Tool Use extract
    NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark
    NVIDIA tops first agentic AI benchmark for agentic coding performance
    AI Agents Generative AI Inference NVIDIA
    NVIDIA reports leading agentic coding performance on the first benchmark dedicated to agentic AI, per its developer blog. The result highlights its inference stack and GPU infrastructure as a platform for autonomous coding agents.
    Read original (NVIDIA Developer Blog) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization
    AdaSR enables adaptive streaming reasoning for reasoning models
    Machine Learning Retrieval-Augmented Generation (RAG) Reinforcement Learning Software Engineering Speech Processing
    AdaSR moves beyond the read-then-think paradigm by letting reasoning models reason incrementally as input streams in. It uses a hierarchical relative policy optimization scheme to train streaming reasoning.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    HumP-KD: A Hybrid Uncertainty-Aware Multi-Stage Progressive Knowledge Distillation Framework for Efficient Fire Classification
    HumP-KD: uncertainty-aware distillation for efficient fire classification
    Machine Learning Meta Neural Network Transformer
    HumP-KD is a hybrid, uncertainty-aware multi-stage progressive knowledge distillation framework for fire classification. It targets models that are simultaneously accurate and efficient for real-time use.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows
    Direct latent-space synthesis for parallel branches in LLM-agent workflows
    AI Agents Neural Network
    LLMs serve as execution engines for agentic systems yet still consume context through a sequential text interface, which is a poor match for modern structured workflows with independent parallel branches. The paper explores synthesizing such parallel branches directly in latent space.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing
    Route-specialized dual adapters for memory-assisted knowledge editing
    Embeddings Inference Llama
    This work targets knowledge editing that updates selected facts while preserving nearby behavior in a memory-assisted setting. It proposes route-specialized dual adapters that decide when to write and when to suppress edits.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    Abstracting Cross-Domain Action Sequences into Interpretable Workflows
    Abstracting cross-domain action sequences into interpretable workflows
    Deep Learning Inference Microsoft Reinforcement Learning
    Time-stamped interaction logs objectively record digital app usage, but their granularity and noise obscure meaningful insights into work. The paper proposes abstracting cross-domain action sequences into interpretable workflows.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms
    Structural correspondence between Beethoven's Moonlight Sonata and ML
    Embeddings Machine Learning Neural Network Natural Language Processing (NLP) Reinforcement Learning
    Through computational analysis, this paper argues that the three movements of Beethoven's Moonlight Sonata (Op. 27 No. 2) instantiate three distinct machine learning architectures by structural correspondence rather than mere analogy.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0
    A fused INT8 GEMM kernel speeds diffusion transformers on consumer GPUs
    Neural Network Quantization Transformer
    Post-training INT8 quantization of diffusion transformers is often slower than FP8/NF4 on consumer Ampere GPUs. The paper presents a fused INT8 GEMM kernel for Ideogram 4.0 that realizes native INT8 speedups.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Zero-shot generalization of transformer neural operators to larger domains
    Zero-shot generalization of transformer neural operators to larger domains
    Embeddings Inference Machine Learning Neural Network Transformer
    The paper studies whether transformer-based neural operators for PDE solution operators can generalize zero-shot to larger spatial domains than seen in training.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation
    CARE: auditable evidence review to control LLM-generated policies
    Machine Learning
    Giving LLMs direct control over costly, irreversible experiments invites unsafe exploration, while discarding their creativity sacrifices optimization. CARE controls LLM-generated policies through auditable review of evidence in scientific experimentation.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗