Multimodal A
-
UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning
UNIEGO: unified egocentric video encoder via multi-teacher distillation
UNIEGO is a unified egocentric video encoder trained via a hierarchical multi-teacher distillation framework. Representation-specific proxy models translate knowledge from teachers spanning multiple viewpoints, modalities, and foundation models into a single egocentric space, and the resulting encoder remains deployable from egocentric video alone.
-
Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation
Structuring and tokenizing user interest context for generative recommendation
Generative recommendation predicts a user's next interaction from past behavior, with item tokenization bridging item semantics and the recommendation model. This work proposes a way to structure and tokenize distributed user-interest context to improve generative recommenders.
-
How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech
Cross-attention attribution for style-captioned text-to-speech
Style-captioned text-to-speech systems use natural language to control voice characteristics. This work uses cross-attention attribution to analyze how individual instruction words shape the generated speech.
-
StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs
StylisticBias: few visual cues drive most social bias in MLLMs
StylisticBias investigates the visual cues that shape how multimodal large language models judge people. The study finds that a small set of human visual cues drives most of the social biases exhibited by MLLMs, which are increasingly deployed in consequential settings.
-
SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm
SARLO-80: a worldwide 80cm slant SAR-optical dataset
Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable SAR resources are scarce. SARLO-80 provides a worldwide slant-range SAR and optical dataset at 80cm resolution to fill this gap.
-
FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS
FlowEdit: associative memory for lifelong pronunciation adaptation in TTS
Flow-matching text-to-speech achieves strong zero-shot quality but stays static after deployment. FlowEdit uses associative memory to enable lifelong pronunciation adaptation without full retraining.
-
Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology
RefRad2D: training spatially grounded radiology VLMs at scale
The paper studies how to train spatially grounded vision-language models for radiology without manual spatial annotations. It introduces RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with VQA and spatial grounding subsets.
-
HEPTv2: End-to-End Efficient Point Transformer for Charged Particle Reconstruction
HEPTv2: an efficient point transformer for particle tracking
The paper presents HEPTv2, an end-to-end efficient point transformer for charged-particle reconstruction. It targets tracking, the reconstruction of trajectories from sparse detector measurements under extreme combinatorial ambiguity, and aims to stay accurate and efficient at the High-Luminosity LHC.
-
Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach
Tackling modality imbalance in federated graph learning via synthesis
The paper addresses modality imbalance in multimodal federated graph learning with a data-synthesis-based approach. It targets two granularities of imbalance: client-level, where some clients lack entire modalities, and node-level, where individual nodes have missing modalities.
-
Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act
Train, retrieve, or both? Statutory citation on Ontario tenancy law
The paper runs a four-arm head-to-head comparison of fine-tuning, retrieval, and their combination for producing correct statutory citations on the Ontario Residential Tenancies Act and its core regulation. It targets the practical need of tenants, landlords, and help-desk staff to be pointed at the governing provision.
-
Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision
Wall-to-wall forest structure mapping from inventory, lidar, imagery
The paper integrates national forest inventory data, airborne lidar, and satellite imagery with computer vision to produce wall-to-wall maps of forest structure. It targets the persistent need for annually updated, large-landscape maps to support forest and wildfire risk management.
-
PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback
PsyScore: psychometric essay scoring with scaffolded feedback
The paper presents PsyScore, a psychometrically-aware framework for automated essay scoring that adapts to writing traits and provides ZPD-scaffolded feedback. It aims to unify scoring and feedback, which existing methods treat separately, balancing reliable assessment with interpretable, actionable instruction.
-
ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval
ELVA: ranking-driven universal multimodal retrieval
Leveraging multimodal large language models through contrastive learning has become mainstream for retrieval. ELVA explores a ranking-driven approach to universal multimodal retrieval.
-
Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving
Lagrange: open-vocabulary energy-based framework for end-to-end driving
Scaling end-to-end autonomous driving to complex open-world settings demands strong perception. Lagrange offers an open-vocabulary, energy-based sparse framework for generalized end-to-end driving.
-
Confidence-Aware Automated Assessment of Student-Drawn Scientific Models
Confidence-aware automated assessment of student-drawn science models
Student-generated drawings are widely used in science education to assess conceptual understanding. This work introduces confidence-aware automated assessment of student-drawn scientific models.
-
Finetuning Vision-Language-Action Models Requires Fewer Layers Than You Think
Finetuning vision-language-action models needs fewer layers than expected
Vision-Language-Action models pre-trained on massive video-robot datasets have transformed robot control. This work shows that finetuning them requires fewer layers than previously assumed.
-
SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs
Vision-language models often underperform on evidence-intensive tasks by missing decisive visual cues. SPOT-E applies test-time entropy shaping with visual spotlights to improve frozen VLMs.
-
Augmenting Game AI with Deep Reinforcement Learning
Immersion in video games depends not only on graphics, audio, and mechanics but also on the quality of game AI. This work augments game AI using deep reinforcement learning.
-
FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching
FlowMaps: long-term multimodal object dynamics with flow matching
Joint spatial and temporal understanding of 3D scenes is essential for deployed robots. FlowMaps models long-term multimodal object dynamics using flow matching.
-
ReNikud: Audio-Supervised Hebrew Grapheme-to-Phoneme Conversion
The paper presents ReNikud, an audio-supervised approach to grapheme-to-phoneme conversion for Modern Hebrew. It addresses the ambiguity of Hebrew's abjad script, which leaves vowels largely unwritten, going beyond standard pipelines that first predict vowel diacritics (nikud).
-
MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization
MedRLM: recursive multimodal AI for long-context clinical reasoning
The paper introduces MedRLM, a recursive multimodal health-intelligence system for long-context clinical reasoning, sensor-guided screening, evidence-grounded decision support, and community-to-tertiary referral optimization. It targets reasoning over heterogeneous, longitudinal patient data, beyond the single-step prompting or retrieval of current medical LLMs.
-
NAMESAKES: Probing Identity Memorization in Text-to-Image Models
The paper introduces NAMESAKES, a study probing identity memorization in text-to-image models, which can generate realistic likenesses of individuals from their names. It addresses the difficulty of telling whether a generated face is memorized or fabricated without ground-truth photos, training data, or white-box model access.
-
PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors
PASQA: a pitch-accent-focused speech quality assessment model
The paper proposes PASQA, a speech quality assessment model that explicitly targets pitch-accent correctness, trained on synthetic speech containing accent errors. It addresses the insensitivity of existing utterance-level MOS prediction models to localized pitch-accent mistakes.
-
Self-Preference Is Weak or Absent in Verifiable Instruction-Following Revision: A Four-Model Test Under Genuine Authorship
Self-preference is weak in verifiable instruction-following revision
The paper tests whether large language models resist valid corrections to their own writing during verifiable instruction-following revision. Across four models under genuine authorship, it finds that the documented self-preference bias is weak or absent in this revision setting.
-
What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
An information-theoretic look at supervising latent chain-of-thought
The paper gives an information-theoretic analysis of what makes supervision effective in latent chain-of-thought reasoning, which internalizes reasoning in continuous hidden states. It examines why outcome supervision provides weak learning signals, making robust latent reasoning difficult.
-
Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations
Human-model gaps in speech quality assessment under perturbations
The paper investigates discrepancies between human judgments and MOS prediction models in speech quality assessment, using controlled acoustic and prosodic perturbations. It probes whether these models, widely used as proxy metrics in text-to-speech research, capture quality differences beyond acoustic fidelity.
-
DeepSeek Introduces Vision
DeepSeek introduces vision capabilities
An item reporting that DeepSeek has introduced vision capabilities, adding image understanding to its previously text-focused models. The multimodal upgrade broadens the range of tasks the models can handle.
-
NRITYAM: Language Models Meet Art and Heritage of Dance
NRITYAM: a benchmark for cultural comprehension of dance traditions
The paper presents NRITYAM, a benchmark for evaluating how well language models comprehend culture in the context of global dance traditions. It starts from the premise that the global effectiveness of language models depends on a nuanced understanding of local socio-cultural contexts.
-
Midjourney Medical
An item on Midjourney Medical, a medical-focused offering from the image-generation company Midjourney. Accompanied by a demo video, it is presented as a new effort to apply generative imaging technology in the medical domain.
-
GLM-5.2 is probably the most powerful text-only open weights LLM
GLM-5.2 may be the most powerful text-only open weights LLM
Chinese AI lab Z.ai released GLM-5.2 to coding-plan subscribers on June 13 and then published full open weights under an MIT license on June 16. Similar in size to GLM-5 and GLM-5.1, it may be the most powerful text-only open weights LLM, per Simon Willison.