Multimodal A
-
Vision-language models for chest radiography do not always need the image
Medical vision-language models combine images and text for reporting. For chest radiography, the paper shows these models do not always need the image to make predictions, and discusses the implications for evaluation and clinical use.
-
EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
EnvRL learns from environment dynamics in agentic RL
EnvRL is a method that learns from environment dynamics in agentic reinforcement learning, leveraging the structure of agent-environment interaction to improve learning efficiency and performance.
-
MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block
MambaCount: efficient open-vocabulary counting via state-space duality
Text-guided open-vocabulary object counting is hard in dense scenes with large scale variation, and existing Transformer methods are limited by quadratic complexity. MambaCount uses a spatial sparse state space duality block for efficient open-vocabulary object counting.
-
Context-Aware RL for Agentic and Multimodal LLMs
ContextRL rewards picking the right context to ground answers
ContextRL is a context-aware RL method that improves long-horizon and multimodal reasoning via an indirect objective: instead of supervising only the final answer, it rewards selecting the context that supports a query-answer pair, encouraging fine-grained grounding. Trained on contrastive coding-trajectory and image data, it gains an average +2.2% over standard GRPO.
-
Geometric Action Model for Robot Policy Learning
GAM reuses a geometric foundation model for robot control
The Geometric Action Model (GAM) is a language-conditioned manipulation policy that repurposes a pretrained geometric foundation model as a shared substrate for perception, temporal prediction, and action decoding. It splits the model at an intermediate layer: shallow layers act as an observation encoder, while a causal future predictor forecasts latent tokens from language, proprioception, and action history.
-
DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
DeepRubric: evidence-tree rubrics to boost deep-research agent RL
DeepRubric is a data-construction framework for RL of deep research agents that reverses the usual query-to-rubric flow. Starting from a seed topic, it builds an evidence tree to decide what an evidence-backed report should be judged on, then synthesizes aligned query-rubric pairs for more reliable reward supervision.
-
Learning the Geometry of Data: A Mathematical Review of Shape Space Analysis
A mathematical review of shape space analysis for geometric data
This survey synthesizes the fast-growing literature on shape space analysis, a framework for data whose observations carry rich geometric form across biology, medicine, anthropology and vision. Drawing on differential geometry, statistics and ML, it organizes the work around a shared pipeline of shape representation, parameterization and metric construction.
-
FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models
FusionRS: a large-scale RGB-infrared-text remote sensing dataset
Noting that remote-sensing vision-language models remain RGB-centric, the paper introduces FusionRS, described as the first large-scale RGB-infrared-text dataset for dual-modal learning. It is built by translating public RGB images into infrared-style counterparts, pairing each with conventional and infrared-aware captions.
-
ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning
ROVE: RL that learns humanoid manipulation from imperfect interventions
ROVE is an RL framework for post-training humanoid Vision-Language-Action models from imperfect human interventions. It pairs a human-in-the-loop data pipeline with Optimistic Value Estimation to prioritize high-value behaviors in mixed-quality trajectories, and adds cross-embodiment human videos to make value estimation more robust.
-
From Tokens to Policy: Causal and Interpretable Heterogeneous Treatment Effects Identification
NEXIS identifies causal, interpretable heterogeneous treatment effects
The paper proposes NEXIS (Neural EXposure Interaction Search), a method for causally identifying heterogeneous treatment effects (HTE) in controlled experiments. By leveraging multi-modal pre-treatment measurements and scalable representations, it reframes HTE identification as Markov-blanket discovery over a sufficient, aligned representation, aiming to ease the expressivity-interpretability trade-off.
-
Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
Bayesian audit of public frontier-AI evaluation archives proposed
The paper treats public AI evaluation archives (e.g., LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as selective time series rather than terminal leaderboards, framing them as a Bayesian inference problem. It reports that selection-aware frontier models fail synthetic recovery and calibration, while fixed audit gates remain informative.
-
Task-Error Residual Learning for Real-Robot Five-Ball Juggling
Residual learning enables fast, stable real-robot five-ball juggling
For residual learning that refines existing behavior, sample efficiency hinges on how much information each rollout returns and how efficiently it is used. A standard scalar RL reward carries less information than the directional task error that defines the task. Using directional task-error supervision and a task-error model that drives sample selection, the system achieves stable three-, four-, and five-ball juggling on Barrett WAM arms, converging from the second attempt with monotonically decreasing error.
-
Functional Gradient Descent with Adaptive Representations
Functional gradient descent made practical via adaptive representations
Functional optimization is usually solved by tuning parameters of a fixed representation such as a neural network, yielding highly nonconvex losses that hinder training and analysis. Functional gradient descent (FGD), that is, gradient descent performed directly in function space, offers strong convergence guarantees and clean theory but is hard to implement because functional gradients are infinite-dimensional. The paper proposes a practical FGD using adaptive representations.
-
RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting
RAID: retrieval-augmented diffusion for cold-start, cross-lingual forecasting
Time-series foundation models transfer well given a history window, but true cold-start items with no prior observations violate that assumption. The authors propose RAID (Retrieval-Augmented Iterative Diffusion), replacing history-based correlation with metadata-driven semantic retrieval and graph-conditioned diffusion. It maps metadata into a shared semantic space via a frozen multilingual embedding model, builds an inductive retrieval graph for unseen items, and refines a forecast from neighbors.
-
Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
Binary Tracking: open vision-language models for spatial QA and navigation
The paper addresses spatial question answering for service robots traversing long egocentric routes. For queries like 'where can I find a dry cleaner on the way back home?', the system returns metric coordinates that downstream navigation can act on. Prior approaches rely on closed-source models such as GPT-4o, which robots cannot reliably depend on due to network instability, latency, and deployment cost. The authors propose Binary Tracking, an open-source vision-language approach that can run onboard.
-
Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization
Semantic Flip: synthetic OOD generation for robust refusal in embodied agents
Detecting unanswerable queries is essential for reliable embodied agents, yet vision-language models often answer overconfidently when visual memory cannot support the query, risking misleading users or physically guiding them to arbitrary locations. The paper proposes Semantic Flip, a simple method that generates synthetic out-of-distribution samples to teach embodied VLMs when to respond 'I do not know,' improving robust refusal in embodied question answering and spatial localization.
-
Data-Driven Decoding of Russell's Circumplex Model of Affect
Do Transformer embeddings recover Russell's circumplex affect geometry?
An arXiv paper tests whether Transformer latent spaces, trained on text and speech, recover the geometric regularities of Russell's circumplex model of affect. It unifies two complementary experiments to probe emotion representation, addressing the opacity of high-dimensional affective embeddings.
-
A Perception vs. Distortion Perspective on Score-Based Generative Channel Estimation
Score-based channel estimation analyzed via perception-distortion tradeoff
Score-based models are increasingly used for wireless physical-layer tasks, but it is unclear when they beat discriminative learning. Using channel estimation as a case study, the paper interprets score-based estimation through the perception-distortion tradeoff, identifying when score matching excels and quantifying the excess risk of distortion-minimizing approaches.
-
Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier
Paper: semi-supervised LLM reasoning from minimal labels
An arXiv paper presents a semi-supervised framework that scales LLM reasoning from minimal supervision, using a lightweight reasoning-correctness classifier to turn verification into a data-creation mechanism. Summarized neutrally from the abstract.
-
Connecting Speech to Words through Images
Paper: visually grounded speech-to-word learning method
An arXiv paper proposes a visually grounded method to build a vocabulary of spoken words using only images and their spoken descriptions, without explicit text supervision. Summarized neutrally from the abstract; results are the authors' own.
-
LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control
LabOSBench: a simulated testbed for computer-use agents controlling instruments
The paper proposes LabOSBench, a simulated yet realistic testbed for evaluating computer-use agents on scientific instrument control. It notes that existing benchmarks focus on software tasks in virtual systems, while real instruments require coordinated interface control and feedback-driven parameter tuning that are costly and risky to evaluate directly.
-
Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment
MST-CLIPIQA: decoupling semantics and distortions in AI-image quality
The paper introduces MST-CLIPIQA, a multi-scale two-stream framework for assessing AI-generated image quality. It argues that monolithic vision-language representations entangle semantic understanding with low-level perceptual sensitivity, and instead decouples them using dual CLIP encoders for hierarchical alignment.
-
We Need Explanation Cards to Connect Explanation Algorithms to the Real World
'Explanation Cards' add robustness and validity context to explanations
Algorithmic explanations often need expert knowledge to read and can be uninformative about complex decision functions. The authors propose Explanation Cards that augment explanations with robustness and validity information plus clear interpretation instructions, making otherwise uninformative explanations practically useful while flagging when they are not.
-
Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations
Gen-VCoT uses generated RGB visual intermediates for multimodal reasoning
Gen-VCoT replaces text-only chain-of-thought with generated RGB intermediates, staging visual grounding (SAM), depth (Marigold), and semantic reasoning (Qwen2-VL) under an adaptive router. It improves spatial (+25%) and depth (+50%) questions but can hurt simple factual ones; text CoT still wins on CLEVR, suggesting task-dependent representations.
-
Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection
A noise-amplification perspective for detecting AI-generated videos
The paper proposes detecting AI-generated videos, especially those from text-to-video models, by amplifying noise to reveal subtle artifacts that distinguish them from authentic footage. It notes that prior work largely targeted GAN-generated samples and frames text-to-video detection as still underexplored.
-
Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models
Reflective Masking elicits iterative reasoning in mask diffusion models
The paper introduces Reflective Masking, a lightweight post-training method that lets mask diffusion models iteratively revisit and revise prior outputs via multi-turn masking, plus a History Reference component. Claims reflect the abstract and are not independently verified.
-
Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models
NVIDIA explains the rise of World-Action Models for robotics
NVIDIA's technical blog surveys World-Action Models (WAMs): robot policies pretrained to "imagine" via world modeling, then fine-tuned to act. It relates them to Vision-Language-Action (VLA) models built on pretrained VLM backbones for robotics.
-
Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?
Uncertainty estimation fails as a safety net for clinical VQA
This arXiv paper tests whether uncertainty estimation (UE) gives clinical vision-language models a reliable trust-or-escalate signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering, the authors find UE quality is not intrinsic to the method but tracks model accuracy, degrading exactly where performance is weakest and reliability most needed. Under perturbations that hide the correct option, accuracy collapses while uncertainty barely changes.
-
Gaze Heads: How VLMs Look at What They Describe
'Gaze heads' in VLMs track and steer described image regions
The paper identifies a small set of attention heads, dubbed gaze heads, that track the image region a vision-language model is currently describing. Intervening on the top ~100 of them can steer the model to describe any chosen region.
-
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
CORA aligns reasoning and answers in multimodal RLVR
CORA analyzes the gap between a model's reasoning and its final answer when extending verifiable-reward RL to multimodal settings. It proposes consistency-oriented reasoning alignment to bridge that gap.