Multimodal A
-
Native Active Perception as Reasoning for Omni-Modal Understanding
Active perception as reasoning for efficient omni-modal understanding
Passive long-video models 'watch it all,' processing frames uniformly so cost grows with duration regardless of query difficulty. This work treats perception as reasoning, with native active perception that selectively attends to relevant frames for efficient omni-modal understanding.
-
Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
Rubric-conditioned self-distillation rethinks reward supervision
Post-training of reasoning models often combines supervised distillation with reinforcement learning from verifiable rewards, but distillation relies on costly chain-of-thought annotations. This work proposes rubric-conditioned self-distillation to rethink reward supervision while cutting annotation cost.
-
Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors
Reference-driven generation of multi-speaker audio scenes
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision such as per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. This work generates multi-speaker audio scenes by drawing on in-the-wild reference priors for more natural synthesis.
-
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
Measuring commonsense and knowledge retention in VLA models
Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet how much commonsense and factual knowledge they retain is unclear. This work measures that retention, revealing how much fine-tuning erodes prior world knowledge.
-
Risk Stratification for ICU Delirium using Pervasive Ambient Sensing Information
Ambient sensing stratifies ICU delirium risk
Delirium is a common, serious ICU complication linked to higher morbidity, longer stays, and greater costs, yet early detection is hard. This work uses pervasive ambient sensing information from the ICU to stratify patients' delirium risk, supporting earlier intervention.
-
A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2
A multi-domain benchmark to detect GPT-Image-2 text-rich images
Text-rich images often hold privacy-sensitive, transactional, or decision-relevant information. As multimodal generators synthesize realistic text and layouts, this work builds a multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2, assessing detector reliability.
-
OneCanvas: 3D Scene Understanding via Panoramic Reprojection
OneCanvas enables VLM 3D scene understanding via panoramic reprojection
Existing 3D scene understanding in VLMs relies on complex, model-specific geometry encoders or large training budgets for spatial reasoning. OneCanvas instead uses panoramic reprojection, letting VLMs reason about 3D scenes efficiently without dedicated geometry encoders or heavy training.
-
Transformer Geometry Observatory TGO-I: Spectral Geometry Observatory
TGO-I: a spectral geometry observatory for Vision Transformers
Despite the wide adoption and success of Vision Transformers, understanding of their dimensional and representational geometry remains limited. The Transformer Geometry Observatory (TGO-I) studies ViTs through spectral geometry, observing and analyzing the structure of their representation spaces.
-
Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight
Hardware/vision-in-the-loop validation of monocular UAV pose estimation
Autonomous UAV operations on ships need reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents hardware- and vision-in-the-loop validation of deep monocular pose estimation for autonomous maritime UAV flight.
-
ChronoSurv: A Clinical Pathway-Guided Graph Framework for Multimodal Survival Analysis
ChronoSurv: clinical-pathway graph framework for survival analysis
Accurate survival prediction is essential for personalized treatment in head and neck cancer but is challenging given heterogeneous, high-dimensional multimodal clinical data. ChronoSurv is a clinical pathway-guided graph framework that integrates multimodal data to improve survival analysis.
-
Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
Decoupling perception and reasoning for shortcut-resilient self-distillation
On-policy self-distillation trains a model on its own rollouts, using a frozen copy to give dense token-level targets conditioned on a reference. This work decouples perception from reasoning (seeing before reasoning) to make multimodal on-policy self-distillation resilient to shortcut learning.
-
Quantifying and Auditing LLM Evaluation via Positive–Unlabeled Learning
Auditing LLM-as-judge bias via positive-unlabeled learning
LLMs are increasingly used as judges for scalable evaluation, yet LLM-as-a-Judge systems show systematic biases decoupled from semantic quality, notably verbosity bias. This work uses positive-unlabeled learning to quantify and audit LLM evaluation, helping detect and correct such biases.
-
A Hybrid LSTM–Vision Transformer Architecture for Predicting HRRR Forecast Errors
Hybrid LSTM–Vision Transformer predicts HRRR forecast errors
Forecast errors in high-resolution numerical weather prediction such as HRRR often stem from unresolved planetary boundary layer processes, convection, and terrain-induced circulations. This work uses a hybrid LSTM–Vision Transformer architecture to predict HRRR forecast errors from vertically structured features.
-
ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection
ThinkDeception: interpretable multimodal deception detection via RL
Existing multimodal deception detection relies on end-to-end black boxes that offer no transparent reasoning. ThinkDeception is a progressive reinforcement learning framework that explicitly captures subtle cross-modal cues and produces interpretable reasoning trajectories for deception detection.
-
CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System
CAPRA: a multi-agent LLM system for software architecture feedback
Automated assessment in software engineering education has advanced, but giving quality feedback on architecture deliverables remains hard. CAPRA is a multi-agent LLM system that scales detailed feedback on software architecture deliverables.
-
RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
RTSGameBench: an RTS benchmark for strategic reasoning by VLMs
Modern vision-language models struggle with strategic reasoning. RTSGameBench uses real-time strategy games to benchmark VLMs on planning and situational judgment, probing their strategic reasoning abilities.
-
REVES: REvision and VErification–Augmented Training for Test-Time Scaling
REVES: revision- and verification-augmented training for test-time scaling
Test-time scaling via sequential revision has become a powerful paradigm. REVES proposes revision- and verification-augmented training that strengthens a model's ability to revise and verify its own outputs, making extra test-time compute more effective.
-
Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction
Learning robust pair confidence for multimodal emotion-cause extraction
Multimodal emotion-cause pair extraction requires reliable pairing of emotions and their causes. This work learns robust pair confidence, yielding emotion-cause extraction that is more resilient to noise and ambiguity.
-
Efficient Financial Language Understanding via Distillation with Synthetic Data
Efficient financial language understanding via distillation with synthetic data
Large instruction-following models are powerful but costly to deploy, especially in finance. This work distills capabilities using synthetic data to build lightweight models that understand financial language efficiently.
-
SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction
SAMA: semantic anchor-aligned augmentation for low-resource multimodal IE
Multimodal information extraction spans many tasks but suffers from scarce data in low-resource settings. SAMA proposes semantic anchor-aligned augmentation to unify and improve multimodal information extraction under low-resource conditions.
-
Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI
NVIDIA unveils XR AI to build AI agents for AR glasses and XR devices
NVIDIA introduced NVIDIA XR AI, a framework for developers to build AI agents for AR glasses and wearable XR devices. It targets the gap between ready hardware and the work of integrating live, real-time AI experiences. Capabilities are per NVIDIA's own announcement; third-party verification pending.
-
Visual Verification Enables Inference-time Steering and Autonomous Policy Improvement
VERITAS steers and self-improves robot policies at inference time
The paper proposes VERITAS, a generator-verifier framework pairing a pre-trained generalist robot policy with a gradient-free visual verifier that evaluates actions at inference time, improving performance without extra training and enabling self-improvement.
-
Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?
Do distilled sets beat coresets? Rethinking dataset distillation
Dataset distillation synthesizes compact training sets for data-centric machine learning. This paper rethinks distillation for classification, asking whether distilled sets actually outperform coresets (real-data subsets) and under what conditions.
-
Trust the Right Teacher: Quality-Aware Self-Distillation for GUI Grounding
Quality-aware self-distillation for GUI grounding in VLMs
GUI grounding requires vision-language models to predict precise screen coordinates. The paper proposes a quality-aware self-distillation method for this task, addressing how naive on-policy self-distillation can degrade coordinate-token teacher signals.
-
Uncertainty Quantification for Flow-Based Vision-Language-Action Models
Uncertainty quantification for flow-based vision-language-action models
Vision-language-action models combine vision-language backbones with expressive generative action heads trained via flow matching on large robotic datasets. These models perform strongly; the paper studies uncertainty quantification for them.
-
STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training
STAR: spatiotemporal adaptive reward allocation for text-to-image RL
The paper proposes STAR, a spatiotemporal adaptive reward allocation method for text-to-image RL post-training. It replaces a single, uniformly applied scalar advantage with rewards that account for the temporal and spatial structure of generation.
-
GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine?
GameCraft-Bench: can agents build playable games end-to-end?
Game generation is an emerging coding-agent application that requires turning natural-language specs into playable interactive systems. GameCraft-Bench evaluates whether agents can build games end-to-end inside a real game engine, where scripts, scenes, assets, rendering, and runtime must cohere.
-
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Qwen-RobotManip: alignment unlocks scale for robot manipulation models
Language and multimodal foundation models generalize by aligning heterogeneous data under a unified formulation and training at scale. This technical report investigates applying that recipe to robotic manipulation, arguing that alignment unlocks scale for manipulation foundation models.
-
Environment-Grounded Automated Prompt Optimization for LLM Game Agents
Environment-grounded automated prompt optimization for LLM game agents
LLM agents in interactive environments are sensitive to prompts, yet prompt engineering stays manual and task-specific. The paper decomposes the observation-to-action pipeline and proposes an environment-grounded automated prompt optimization framework for LLM game agents.
-
The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports
The Slop Paradox: AI-rewritten radiology reports erode clinical uncertainty
AI clinical documentation tools increasingly summarize and reformat radiology reports with LLMs. Using 450 chest X-ray reports from the Indiana University dataset, the paper measures resulting information degradation, showing erosion of clinical uncertainty and cross-modal alignment in AI-rewritten reports.