Multimodal A

  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Native Active Perception as Reasoning for Omni-Modal Understanding
    Active perception as reasoning for efficient omni-modal understanding
    Deep Learning Fine-tuning Machine Learning Neural Network Retrieval-Augmented Generation (RAG)
    Passive long-video models 'watch it all,' processing frames uniformly so cost grows with duration regardless of query difficulty. This work treats perception as reasoning, with native active perception that selectively attends to relevant frames for efficient omni-modal understanding.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
    Rubric-conditioned self-distillation rethinks reward supervision
    Neural Network Reinforcement Learning
    Post-training of reasoning models often combines supervised distillation with reinforcement learning from verifiable rewards, but distillation relies on costly chain-of-thought annotations. This work proposes rubric-conditioned self-distillation to rethink reward supervision while cutting annotation cost.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors
    Reference-driven generation of multi-speaker audio scenes
    Embeddings Retrieval-Augmented Generation (RAG) Reinforcement Learning Speech Processing
    Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision such as per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. This work generates multi-speaker audio scenes by drawing on in-the-wild reference priors for more natural synthesis.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
    Measuring commonsense and knowledge retention in VLA models
    AI Agents Computer Vision Fine-tuning Robotics Software Engineering
    Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet how much commonsense and factual knowledge they retain is unclear. This work measures that retention, revealing how much fine-tuning erodes prior world knowledge.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information
    Ambient sensing stratifies ICU delirium risk
    Neural Network Reinforcement Learning
    Delirium is a common, serious ICU complication linked to higher morbidity, longer stays, and greater costs, yet early detection is hard. This work uses pervasive ambient sensing information from the ICU to stratify patients' delirium risk, supporting earlier intervention.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2
    A multi-domain benchmark to detect GPT-Image-2 text-rich images
    Computer Vision GPT OpenAI Retrieval-Augmented Generation (RAG)
    Text-rich images often hold privacy-sensitive, transactional, or decision-relevant information. As multimodal generators synthesize realistic text and layouts, this work builds a multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2, assessing detector reliability.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    OneCanvas: 3D Scene Understanding via Panoramic Reprojection
    OneCanvas enables VLM 3D scene understanding via panoramic reprojection
    Computer Vision Embeddings Neural Network Robotics Software Engineering
    Existing 3D scene understanding in VLMs relies on complex, model-specific geometry encoders or large training budgets for spatial reasoning. OneCanvas instead uses panoramic reprojection, letting VLMs reason about 3D scenes efficiently without dedicated geometry encoders or heavy training.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory
    TGO-I: a spectral geometry observatory for Vision Transformers
    Computer Vision Reinforcement Learning Transformer
    Despite the wide adoption and success of Vision Transformers, understanding of their dimensional and representational geometry remains limited. The Transformer Geometry Observatory (TGO-I) studies ViTs through spectral geometry, observing and analyzing the structure of their representation spaces.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight
    Hardware/vision-in-the-loop validation of monocular UAV pose estimation
    Transformer
    Autonomous UAV operations on ships need reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents hardware- and vision-in-the-loop validation of deep monocular pose estimation for autonomous maritime UAV flight.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis
    ChronoSurv: clinical-pathway graph framework for survival analysis
    Neural Network
    Accurate survival prediction is essential for personalized treatment in head and neck cancer but is challenging given heterogeneous, high-dimensional multimodal clinical data. ChronoSurv is a clinical pathway-guided graph framework that integrates multimodal data to improve survival analysis.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
    Decoupling perception and reasoning for shortcut-resilient self-distillation
    Computer Vision Machine Learning Software Engineering
    On-policy self-distillation trains a model on its own rollouts, using a frozen copy to give dense token-level targets conditioned on a reference. This work decouples perception from reasoning—seeing before reasoning—to make multimodal on-policy self-distillation resilient to shortcut learning.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Quantifying and Auditing LLM Evaluation via Positive–Unlabeled Learning
    Auditing LLM-as-judge bias via positive-unlabeled learning
    Embeddings
    LLMs are increasingly used as judges for scalable evaluation, yet LLM-as-a-Judge systems show systematic biases decoupled from semantic quality, notably verbosity bias. This work uses positive-unlabeled learning to quantify and audit LLM evaluation, helping detect and correct such biases.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    A Hybrid LSTM–Vision Transformer Architecture for Predicting HRRR Forecast Errors
    Hybrid LSTM–Vision Transformer predicts HRRR forecast errors
    Reinforcement Learning Transformer
    Forecast errors in high-resolution numerical weather prediction such as HRRR often stem from unresolved planetary boundary layer processes, convection, and terrain-induced circulations. This work uses a hybrid LSTM–Vision Transformer architecture to predict HRRR forecast errors from vertically structured features.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection
    ThinkDeception: interpretable multimodal deception detection via RL
    Machine Learning Neural Network Reinforcement Learning
    Existing multimodal deception detection relies on end-to-end black boxes that offer no transparent reasoning. ThinkDeception is a progressive reinforcement learning framework that explicitly captures subtle cross-modal cues and produces interpretable reasoning trajectories for deception detection.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System
    CAPRA: a multi-agent LLM system for software architecture feedback
    AI Agents GPT Machine Learning Software Engineering
    Automated assessment in software engineering education has advanced, but giving quality feedback on architecture deliverables remains hard. CAPRA is a multi-agent LLM system that scales detailed feedback on software architecture deliverables.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
    RTSGameBench: an RTS benchmark for strategic reasoning by VLMs
    AI Agents Computer Vision Neural Network Retrieval-Augmented Generation (RAG)
    Modern vision-language models struggle with strategic reasoning. RTSGameBench uses real-time strategy games to benchmark VLMs on planning and situational judgment, probing their strategic reasoning abilities.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    REVES: REvision and VErification-Augmented Training for Test-Time Scaling
    REVES: revision- and verification-augmented training for test-time scaling
    Inference Reinforcement Learning Software Engineering
    Test-time scaling via sequential revision has become a powerful paradigm. REVES proposes revision- and verification-augmented training that strengthens a model's ability to revise and verify its own outputs, making extra test-time compute more effective.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Developer Tools extract
    Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction
    Learning robust pair confidence for multimodal emotion-cause extraction
    Inference Retrieval-Augmented Generation (RAG)
    Multimodal emotion-cause pair extraction requires reliable pairing of emotions and their causes. This work learns robust pair confidence, yielding emotion-cause extraction that is more resilient to noise and ambiguity.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Efficient Financial Language Understanding via Distillation with Synthetic Data
    Efficient financial language understanding via distillation with synthetic data
    Neural Network Natural Language Processing (NLP) Reinforcement Learning
    Large instruction-following models are powerful but costly to deploy, especially in finance. This work distills capabilities using synthetic data to build lightweight models that understand financial language efficiently.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction
    SAMA: semantic anchor-aligned augmentation for low-resource multimodal IE
    Machine Learning Retrieval-Augmented Generation (RAG)
    Multimodal information extraction spans many tasks but suffers from scarce data in low-resource settings. SAMA proposes semantic anchor-aligned augmentation to unify and improve multimodal information extraction under low-resource conditions.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • NVIDIA Developer Blog · EN Agents & Tool Use extract
    Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI
    NVIDIA unveils XR AI to build AI agents for AR glasses and XR devices
    AI Agents Computer Vision Generative AI NVIDIA
    NVIDIA introduced NVIDIA XR AI, a framework for developers to build AI agents for AR glasses and wearable XR devices. It targets the gap between ready hardware and the work of integrating live, real-time AI experiences. Capabilities are per NVIDIA's own announcement; third-party verification pending.
    Read original (NVIDIA Developer Blog) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Industry Adoption extract
    Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement
    VERITAS steers and self-improves robot policies at inference time
    Inference Reinforcement Learning
    The paper proposes VERITAS, a generator-verifier framework pairing a pre-trained generalist robot policy with a gradient-free visual verifier that evaluates actions at inference time, improving performance without extra training and enabling self-improvement.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?
    Do distilled sets beat coresets? Rethinking dataset distillation
    Machine Learning Retrieval-Augmented Generation (RAG)
    Dataset distillation synthesizes compact training sets for data-centric machine learning. This paper rethinks distillation for classification, asking whether distilled sets actually outperform coresets (real-data subsets) and under what conditions.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
    Quality-aware self-distillation for GUI grounding in VLMs
    Computer Vision
    The paper proposes a quality-aware self-distillation method for GUI grounding, where vision-language models predict precise screen coordinates, addressing how naive on-policy self-distillation can degrade coordinate-token teacher signals.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Uncertainty Quantification for Flow-Based Vision-Language-Action Models
    Uncertainty quantification for flow-based vision-language-action models
    Computer Vision Fine-tuning Retrieval-Augmented Generation (RAG) Reinforcement Learning
    Vision-language-action models combine vision-language backbones with expressive generative action heads trained via flow matching on large robotic datasets. These models perform strongly; the paper studies uncertainty quantification for them.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training
    STAR: spatiotemporal adaptive reward allocation for text-to-image RL
    Reinforcement Learning
    The paper proposes STAR, a spatiotemporal adaptive reward allocation method for text-to-image RL post-training, replacing a single scalar advantage applied uniformly with rewards that account for the temporal and spatial structure of generation.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?
    GameCraft-Bench: can agents build playable games end-to-end?
    AI Agents
    Game generation is an emerging coding-agent application in which natural-language specs must become playable interactive systems. GameCraft-Bench evaluates whether agents can build games end-to-end inside a real game engine, where scripts, scenes, assets, rendering, and runtime must cohere.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
    Qwen-RobotManip: alignment unlocks scale for robot manipulation models
    Computer Vision
    Language and multimodal foundation models generalize by aligning heterogeneous data under a unified formulation and training at scale. This technical report investigates applying that recipe to robotic manipulation, arguing alignment unlocks scale for manipulation foundation models.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Environment-Grounded Automated Prompt Optimization for LLM Game Agents
    Environment-grounded automated prompt optimization for LLM game agents
    AI Agents Fine-tuning Reinforcement Learning
    LLM agents in interactive environments are sensitive to prompts, yet prompt engineering stays manual and task-specific. The paper decomposes the observation-to-action pipeline and proposes an environment-grounded automated prompt optimization framework for LLM game agents.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports
    The Slop Paradox: AI-rewritten radiology reports erode clinical uncertainty
    AI clinical documentation tools increasingly summarize and reformat radiology reports with LLMs. Using 450 chest X-ray reports from the Indiana University dataset, the paper measures resulting information degradation, showing erosion of clinical uncertainty and cross-modal alignment in AI-rewritten reports.
    Read original (arXiv cs.CL (Computation and Language)) ↗