Multimodal A

  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning
    UNIEGO: unified egocentric video encoder via multi-teacher distillation
    Neural Network
    UNIEGO is a unified egocentric video encoder trained via a hierarchical multi-teacher distillation framework. Representation-specific proxy models translate knowledge from teachers spanning multiple viewpoints, modalities, and foundation models into a single egocentric space, while remaining deployable from egocentric video alone.
  • arXiv cs.AI (Artificial Intelligence) · EN Industry Adoption extract
    Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation
    Structuring and tokenizing user interest context for generative recommendation
    Neural Network
    Generative recommendation predicts a user's next interaction from past behavior, with item tokenization bridging item semantics and the recommendation model. This work proposes a way to structure and tokenize distributed user-interest context to improve generative recommenders.
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech
    Cross-attention attribution for style-captioned text-to-speech
    Reinforcement Learning · Speech Processing
    Style-captioned text-to-speech systems use natural language to control voice characteristics. This work uses cross-attention attribution to analyze how individual instruction words shape the generated speech.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
    StylisticBias: few visual cues drive most social bias in MLLMs
    Machine Learning · Reinforcement Learning
    StylisticBias investigates the visual cues that shape how multimodal large language models judge people. The study finds that a small set of human visual cues drives most of the social biases exhibited by MLLMs, which are increasingly deployed in consequential settings.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm
    SARLO-80: a worldwide 80cm slant SAR-optical dataset
    Deep Learning · Reinforcement Learning
    Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable SAR resources are scarce. SARLO-80 provides a worldwide slant-range SAR and optical dataset at 80cm resolution to fill this gap.
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
    FlowEdit: associative memory for lifelong pronunciation adaptation in TTS
    Embeddings · Inference · Speech Processing
    Flow-matching text-to-speech achieves strong zero-shot quality but stays static after deployment. FlowEdit uses associative memory to enable lifelong pronunciation adaptation without full retraining.
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
    RefRad2D: training spatially grounded radiology VLMs at scale
    Computer Vision · Fine-tuning · Neural Network · Software Engineering
    The paper studies how to train spatially grounded vision-language models for radiology without manual spatial annotations. It introduces RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with VQA and spatial grounding subsets.
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction
    HEPTv2: an efficient point transformer for particle tracking
    Inference · Machine Learning · Neural Network · NVIDIA · Transformer
    The paper presents HEPTv2, an end-to-end efficient point transformer for charged-particle reconstruction. It targets tracking—reconstructing trajectories from sparse detector measurements under extreme combinatorial ambiguity—aiming to stay accurate and efficient at the High-Luminosity LHC.
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach
    Tackling modality imbalance in federated graph learning via synthesis
    The paper addresses modality imbalance in multimodal federated graph learning with a data-synthesis-based approach. It targets two granularities of imbalance—client-level, where some clients lack entire modalities, and node-level, where individual nodes have missing modalities.
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act
    Train, retrieve, or both? Statutory citation on Ontario tenancy law
    Deep Learning · Fine-tuning · Neural Network · Retrieval-Augmented Generation (RAG)
    The paper runs a four-arm head-to-head comparison of fine-tuning, retrieval, and their combination for producing correct statutory citations on the Ontario Residential Tenancies Act and its core regulation. It targets the practical need of tenants, landlords, and help-desk staff to be pointed at the governing provision.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision
    Wall-to-wall forest structure mapping from inventory, lidar, imagery
    Computer Vision · Neural Network
    The paper integrates national forest inventory data, airborne lidar, and satellite imagery with computer vision to produce wall-to-wall maps of forest structure. It targets the persistent need for annually updated, large-landscape maps to support forest and wildfire risk management.
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
    PsyScore: psychometric essay scoring with scaffolded feedback
    Retrieval-Augmented Generation (RAG)
    The paper presents PsyScore, a psychometrically-aware framework for automated essay scoring that adapts to writing traits and provides ZPD-scaffolded feedback. It aims to unify scoring and feedback, which existing methods treat separately, balancing reliable assessment with interpretable, actionable instruction.
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval
    ELVA: ranking-driven universal multimodal retrieval
    Deep Learning · Machine Learning · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    Leveraging multimodal large language models through contrastive learning has become mainstream for retrieval. ELVA explores a ranking-driven approach to universal multimodal retrieval.
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving
    Lagrange: open-vocabulary energy-based framework for end-to-end driving
    Computer Vision · Machine Learning · Neural Network · Reinforcement Learning
    Scaling end-to-end autonomous driving to complex open-world settings demands strong perception. Lagrange offers an open-vocabulary, energy-based sparse framework for generalized end-to-end driving.
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    Confidence-Aware Automated Assessment of Student-Drawn Scientific Models
    Confidence-aware automated assessment of student-drawn science models
    Deep Learning · Retrieval-Augmented Generation (RAG) · Transformer
    Student-generated drawings are widely used in science education to assess conceptual understanding. This work introduces confidence-aware automated assessment of student-drawn scientific models.
  • arXiv cs.AI (Artificial Intelligence) · EN Training & Fine-tuning extract
    Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think
    Finetuning vision-language-action models needs fewer layers than expected
    Computer Vision · Fine-tuning · Inference · Machine Learning · Reinforcement Learning
    Vision-Language-Action models pre-trained on massive video-robot datasets have transformed robot control. This work shows that finetuning them requires fewer layers than previously assumed.
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs
    SPOT-E: test-time entropy shaping with visual spotlights for frozen VLMs
    Computer Vision · Inference · Reinforcement Learning · Software Engineering
    Vision-language models often underperform on evidence-intensive tasks by missing decisive visual cues. SPOT-E applies test-time entropy shaping with visual spotlights to improve frozen VLMs.
  • arXiv cs.AI (Artificial Intelligence) · EN Industry Adoption extract
    Augmenting Game AI with Deep Reinforcement Learning
    Augmenting game AI with deep reinforcement learning
    AI Agents · Machine Learning · Reinforcement Learning
    Immersion in video games depends not only on graphics, audio, and mechanics but also on the quality of game AI. This work augments game AI using deep reinforcement learning.
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching
    FlowMaps: long-term multimodal object dynamics with flow matching
    AI Agents · Reinforcement Learning
    Joint spatial and temporal understanding of 3D scenes is essential for deployed robots. FlowMaps models long-term multimodal object dynamics using flow matching.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion
    ReNikud: audio-supervised Hebrew grapheme-to-phoneme conversion
    Neural Network · Speech Processing
    The paper presents ReNikud, an audio-supervised approach to grapheme-to-phoneme conversion for Modern Hebrew. It addresses the ambiguity of Hebrew's abjad script, which leaves vowels largely unwritten, going beyond standard pipelines that first predict vowel diacritics (nikud).
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization
    MedRLM: recursive multimodal AI for long-context clinical reasoning
    AI Agents · Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Software Engineering
    The paper introduces MedRLM, a recursive multimodal health-intelligence system for long-context clinical reasoning, sensor-guided screening, evidence-grounded decision support, and community-to-tertiary referral optimization. It targets reasoning over heterogeneous, longitudinal patient data, beyond the single-step prompting or retrieval of current medical LLMs.
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    NAMESAKES: Probing Identity Memorization in Text-to-Image Models
    NAMESAKES: probing identity memorization in text-to-image models
    Neural Network
    The paper introduces NAMESAKES, a study probing identity memorization in text-to-image models, which can generate realistic likenesses of individuals from their names. It addresses the difficulty of telling whether a generated face is memorized or fabricated without ground-truth photos, training data, or white-box model access.
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors
    PASQA: a pitch-accent-focused speech quality assessment model
    Speech Processing
    The paper proposes PASQA, a speech quality assessment model that explicitly targets pitch-accent correctness, trained on synthetic speech containing accent errors. It addresses the insensitivity of existing utterance-level MOS prediction models to localized pitch-accent mistakes.
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship
    Self-preference is weak in verifiable instruction-following revision
    Neural Network
    The paper tests whether large language models resist valid corrections to their own writing during verifiable instruction-following revision. Across four models under genuine authorship, it finds that the documented self-preference bias is weak or absent in this revision setting.
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
    An information-theoretic look at supervising latent chain-of-thought
    The paper gives an information-theoretic analysis of what makes supervision effective in latent chain-of-thought reasoning, which internalizes reasoning in continuous hidden states. It examines why outcome supervision provides weak learning signals, making robust latent reasoning difficult.
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
    Human-model gaps in speech quality assessment under perturbations
    Speech Processing
    The paper investigates discrepancies between human judgments and MOS prediction models in speech quality assessment, using controlled acoustic and prosodic perturbations. It probes whether these models, widely used as proxy metrics in text-to-speech research, capture quality differences beyond acoustic fidelity.
  • Hacker News (Front Page) · EN New Model Releases extract
    DeepSeek Introduces Vision
    DeepSeek introduces vision capabilities
    DeepSeek
    DeepSeek has reportedly introduced vision capabilities, adding image understanding to its previously text-focused models. The multimodal upgrade broadens the range of tasks the models can handle.
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    NRITYAM: Language Models Meet Art and Heritage of Dance
    NRITYAM: a benchmark for cultural comprehension of dance traditions
    Neural Network · Reinforcement Learning · Software Engineering
    The paper presents NRITYAM, a benchmark for evaluating how well language models comprehend culture in the context of global dance traditions. It is motivated by the observation that the global effectiveness of language models depends on a nuanced understanding of local socio-cultural contexts.
  • Hacker News (Front Page) · EN Multimodal extract
    Midjourney Medical
    Midjourney Medical is a medical-focused offering from the image-generation company Midjourney. Accompanied by a demo video, it is presented as a new effort to apply generative imaging technology in the medical domain.
  • Simon Willison's Weblog · EN Infrastructure & Hardware extract
    GLM-5.2 is probably the most powerful text-only open weights LLM
    GLM-5.2 may be the most powerful text-only open weights LLM
    DeepSeek · Mixture of Experts (MoE)
    Chinese AI lab Z.ai released GLM-5.2 to coding-plan subscribers on June 13 and then published full open weights under an MIT license on June 16. Similar in size to GLM-5 and GLM-5.1, it may be the most powerful text-only open weights LLM, per Simon Willison.