Multimodal A

  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Vision-language models for chest radiography do not always need the image
    Computer Vision Inference Software Engineering
    Medical vision-language models combine images and text for reporting. For chest radiography, the paper shows these models do not always need the image to make predictions, and discusses the implications for evaluation and clinical use.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning
    EnvRL learns from environment dynamics in agentic RL
    AI Agents Retrieval-Augmented Generation (RAG) Reinforcement Learning
    EnvRL is a method that learns from environment dynamics in agentic reinforcement learning, leveraging the structure of agent-environment interaction to improve learning efficiency and performance.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block
    MambaCount: efficient open-vocabulary counting via state-space duality
    Reinforcement Learning Transformer
    Text-guided open-vocabulary object counting is hard in dense scenes with large scale variation, and existing Transformer methods are limited by quadratic complexity. MambaCount uses a spatial sparse state space duality block for efficient open-vocabulary object counting.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Context-Aware RL for Agentic and Multimodal LLMs
    ContextRL rewards picking the right context to ground answers
    AI Agents Retrieval-Augmented Generation (RAG) Reinforcement Learning Software Engineering
    ContextRL is a context-aware RL method that improves long-horizon and multimodal reasoning via an indirect objective: instead of supervising only the final answer, it rewards selecting the context that supports a query-answer pair, encouraging fine-grained grounding. Trained on contrastive coding-trajectory and image data, it gains +2.2% on average over standard GRPO.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Geometric Action Model for Robot Policy Learning
    GAM reuses a geometric foundation model for robot control
    Computer Vision Reinforcement Learning
    The Geometric Action Model (GAM) is a language-conditioned manipulation policy that repurposes a pretrained geometric foundation model as a shared substrate for perception, temporal prediction, and action decoding. It splits the model at an intermediate layer: shallow layers act as an observation encoder, while a causal future predictor forecasts latent tokens from language, proprioception, and action history.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
    DeepRubric: evidence-tree rubrics to boost deep-research agent RL
    AI Agents Reinforcement Learning
    DeepRubric is a data-construction framework for RL of deep research agents that reverses the usual query-to-rubric flow: starting from a seed topic it builds an evidence tree to decide what an evidence-backed report should be judged on, then synthesizes aligned query-rubric pairs for more reliable reward supervision.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Developer Tools extract
    Learning the Geometry of Data: A Mathematical Review of Shape Space Analysis
    A mathematical review of shape space analysis for geometric data
    Computer Vision Deep Learning Machine Learning Neural Network Reinforcement Learning
    This survey synthesizes the fast-growing literature on shape space analysis, a framework for data whose observations carry rich geometric form across biology, medicine, anthropology and vision. Drawing on differential geometry, statistics and ML, it organizes the work around a shared pipeline of shape representation, parameterization and metric construction.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models
    FusionRS: a large-scale RGB-infrared-text remote sensing dataset
    Computer Vision
    Noting that remote-sensing vision-language models remain RGB-centric, the paper introduces FusionRS, described as the first large-scale RGB-infrared-text dataset for dual-modal learning. It is built by translating public RGB images into infrared-style counterparts, pairing each with conventional and infrared-aware captions.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning
    ROVE: RL that learns humanoid manipulation from imperfect interventions
    Computer Vision Machine Learning Reinforcement Learning
    ROVE is an RL framework for post-training humanoid Vision-Language-Action models from imperfect human interventions. It pairs a human-in-the-loop data pipeline with Optimistic Value Estimation to prioritize high-value behaviors in mixed-quality trajectories, and adds cross-embodiment human videos to robustify value estimation.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    From Tokens to Policy: Causal and Interpretable Heterogeneous Treatment Effects Identification
    NEXIS identifies causal, interpretable heterogeneous treatment effects
    The paper proposes NEXIS (Neural EXposure Interaction Search), a method for causally identifying heterogeneous treatment effects (HTE) in controlled experiments. By leveraging multi-modal pre-treatment measurements and scalable representations, it reframes HTE identification as Markov-blanket discovery over a sufficient, aligned representation, aiming to ease the expressivity-interpretability trade-off.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
    Bayesian audit of public frontier-AI evaluation archives proposed
    Inference Reinforcement Learning
    The paper treats public AI evaluation archives (e.g., LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as selective time series rather than terminal leaderboards, framing them as a Bayesian inference problem. It reports that selection-aware frontier models fail synthetic recovery and calibration, while fixed audit gates remain informative.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Task-Error Residual Learning for Real-Robot Five-Ball Juggling
    Residual learning enables fast, stable real-robot five-ball juggling
    Neural Network Reinforcement Learning
    For residual learning that refines existing behavior, sample efficiency hinges on how much information each rollout returns and how efficiently that information is used. A standard scalar RL reward carries less information than the directional task error that defines the task. Using directional task-error supervision and a task-error model that drives sample selection, the system achieves stable three-, four-, and five-ball juggling on Barrett WAM arms, converging from the second attempt with monotonically decreasing error.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Functional Gradient Descent with Adaptive Representations
    Functional gradient descent made practical via adaptive representations
    Computer Vision Deep Learning Neural Network
    Functional optimization is usually solved by tuning parameters of a fixed representation such as a neural network, yielding highly nonconvex losses that hinder training and analysis. Functional gradient descent (FGD), that is, gradient descent directly in function space, offers strong convergence guarantees and clean theory but is hard to implement because functional gradients are infinite-dimensional. The paper proposes a practical FGD using adaptive representations.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    RAID: Semantic Graph Diffusion for True Cold-Start and Cross-Lingual Forecasting
    RAID: retrieval-augmented diffusion for cold-start, cross-lingual forecasting
    Embeddings Inference Meta Retrieval-Augmented Generation (RAG)
    Time-series foundation models transfer well given a history window, but true cold-start items with no prior observations violate that. The authors propose RAID (Retrieval-Augmented Iterative Diffusion), replacing history-based correlation with metadata-driven semantic retrieval and graph-conditioned diffusion. It maps metadata into a shared semantic space via a frozen multilingual embedding model, builds an inductive retrieval graph for unseen items, and refines a forecast from neighbors.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
    Binary Tracking: open vision-language models for spatial QA and navigation
    AI Agents Computer Vision GPT Inference Retrieval-Augmented Generation (RAG)
    The paper addresses spatial question answering for service robots traversing long egocentric routes, returning metric coordinates that downstream navigation can act on for queries like 'where can I find a dry cleaner on the way back home?' Prior approaches rely on closed-source models such as GPT-4o, which robots cannot reliably depend on due to network instability, latency, and deployment cost. The authors propose Binary Tracking, an open-source vision-language approach that can run onboard.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization
    Semantic Flip: synthetic OOD generation for robust refusal in embodied agents
    AI Agents Computer Vision Neural Network Reinforcement Learning Software Engineering
    Detecting unanswerable queries is essential for reliable embodied agents, yet vision-language models often answer overconfidently when visual memory cannot support the query, risking misleading users or physically guiding them to arbitrary locations. The paper proposes Semantic Flip, a simple method that generates synthetic out-of-distribution samples to teach embodied VLMs when to respond 'I do not know,' improving robust refusal in embodied question answering and spatial localization.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Data-Driven Decoding of Russell's Circumplex Model of Affect
    Do Transformer embeddings recover Russell's circumplex affect geometry?
    Deep Learning Embeddings Speech Processing Transformer
    An arXiv paper tests whether Transformer latent spaces, trained on text and speech, recover the geometric regularities of Russell's circumplex model of affect. It unifies two complementary experiments to probe emotion representation, addressing the opacity of high-dimensional affective embeddings.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    A Perception vs. Distortion Perspective on Score-Based Generative Channel Estimation
    Score-based channel estimation analyzed via perception-distortion tradeoff
    Computer Vision Neural Network
    Score-based models are increasingly used for wireless physical-layer tasks, but it is unclear when they beat discriminative learning. Using channel estimation as a case study, the paper interprets score-based estimation through the perception-distortion tradeoff, identifying when score matching excels and quantifying the excess risk of distortion-minimizing approaches.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier
    Paper: semi-supervised LLM reasoning from minimal labels
    Neural Network Software Engineering
    An arXiv paper presents a semi-supervised framework that scales LLM reasoning from minimal supervision, using a lightweight reasoning-correctness classifier to turn verification into a data-creation mechanism.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Connecting Speech to Words through Images
    Paper: visually grounded speech-to-word learning method
    Neural Network Speech Processing
    An arXiv paper proposes a visually grounded method to build a vocabulary of spoken words using only images and their spoken descriptions, without explicit text supervision.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control
    LabOSBench: a simulated testbed for computer-use agents controlling instruments
    AI Agents Computer Vision
    The paper proposes LabOSBench, a simulated yet realistic testbed for evaluating computer-use agents on scientific instrument control. It notes that existing benchmarks focus on software tasks in virtual systems, while real instruments require coordinated interface control and feedback-driven parameter tuning that are costly and risky to evaluate directly.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment
    MST-CLIPIQA: decoupling semantics and distortions in AI-image quality
    Computer Vision Machine Learning Retrieval-Augmented Generation (RAG)
    The paper introduces MST-CLIPIQA, a multi-scale two-stream framework for assessing AI-generated image quality. It argues that monolithic vision-language representations entangle semantic understanding with low-level perceptual sensitivity, and instead decouples them using dual CLIP encoders for hierarchical alignment.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    We Need Explanation Cards to Connect Explanation Algorithms to the Real World
    'Explanation Cards' add robustness and validity context to explanations
    Algorithms & Theory Neural Network Reinforcement Learning
    Algorithmic explanations often need expert knowledge to read and can be uninformative about complex decision functions. The authors propose Explanation Cards that augment explanations with robustness and validity information plus clear interpretation instructions, making otherwise uninformative explanations practically useful while flagging when they are not.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations
    Gen-VCoT uses generated RGB visual intermediates for multimodal reasoning
    Machine Learning
    Gen-VCoT replaces text-only chain-of-thought with generated RGB intermediates, staging visual grounding (SAM), depth (Marigold), and semantic reasoning (Qwen2-VL) under an adaptive router. It improves spatial (+25%) and depth (+50%) questions but can hurt simple factual ones; text CoT still wins on CLEVR, suggesting task-dependent representations.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Developer Tools extract
    Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection
    A noise-amplification perspective for detecting AI-generated videos
    Reinforcement Learning
    The paper proposes detecting AI-generated videos, especially those from text-to-video models, by amplifying noise to reveal subtle artifacts that distinguish them from authentic footage. It notes that prior work largely targeted GAN-generated samples and frames text-to-video detection as still underexplored.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models
    Reflective Masking elicits iterative reasoning in mask diffusion models
    Retrieval-Augmented Generation (RAG) Software Engineering
    The paper introduces Reflective Masking, a lightweight post-training method that lets mask diffusion models iteratively revisit and revise prior outputs via multi-turn masking, plus a History Reference component.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • NVIDIA Developer Blog · EN Multimodal extract
    Pretrained to Imagine, Fine-Tuned to Act: The Rise of World-Action Models
    NVIDIA explains the rise of World-Action Models for robotics
    Computer Vision Generative AI NVIDIA Reinforcement Learning Robotics
    NVIDIA's technical blog surveys World-Action Models (WAMs)—robot policies pretrained to "imagine" via world modeling, then fine-tuned to act. It relates them to Vision-Language-Action (VLA) models built on pretrained VLM backbones for robotics.
    Read original (NVIDIA Developer Blog) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?
    Uncertainty estimation fails as a safety net for clinical VQA
    Computer Vision Retrieval-Augmented Generation (RAG) Software Engineering
    This arXiv paper tests whether uncertainty estimation (UE) gives clinical vision-language models a reliable trust-or-escalate signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering, the authors find UE quality is not intrinsic to the method but tracks model accuracy—degrading exactly where performance is weakest and reliability most needed. Under perturbations that hide the correct option, accuracy collapses while uncertainty barely changes.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Gaze Heads: How VLMs Look at What They Describe
    'Gaze heads' in VLMs track and steer described image regions
    Computer Vision Deep Learning Software Engineering
    The paper identifies a small set of attention heads, dubbed gaze heads, that track the image region a vision-language model is currently describing. Intervening on the top ~100 of them can steer the model to describe any chosen region.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
    CORA aligns reasoning and answers in multimodal RLVR
    Computer Vision Inference Retrieval-Augmented Generation (RAG) Reinforcement Learning Software Engineering
    CORA analyzes the gap between a model's reasoning and its final answer when extending verifiable-reward RL to multimodal settings. It proposes consistency-oriented reasoning alignment to bridge that gap.
    Read original (arXiv cs.CL (Computation and Language)) ↗