New Model Releases A
-
CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges
CATCH-ME: a counterspeech dataset against hate and misinformation
The paper introduces CATCH-ME, a dataset of contextually annotated multi-turn counterspeech against overlapping hate speech and misinformation. It addresses NLP's tendency to treat the two threats in isolation and the tendency of zero-shot LLMs to produce repetitive, vague counterspeech.
-
Critical Percolation as a Synthetic Data Model for Interpretability
The paper introduces critical percolation as a synthetic data model for interpretability research. It builds a family of synthetic datasets with the hierarchical, multi-scale structure of natural data, addressing the gap that typical interpretability toy datasets lack such structure.
-
Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision
Wall-to-wall forest structure mapping from inventory, lidar, imagery
The paper integrates national forest inventory data, airborne lidar, and satellite imagery with computer vision to produce wall-to-wall maps of forest structure. It targets the persistent need for annually updated, large-landscape maps to support forest and wildfire risk management.
-
ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval
ELVA: ranking-driven universal multimodal retrieval
Adapting multimodal large language models through contrastive learning has become the mainstream approach to retrieval. ELVA instead explores a ranking-driven approach to universal multimodal retrieval.
-
Lagrange: An Open-Vocabulary, Energy-Based Sparse Framework for Generalized End-to-End Driving
Lagrange: open-vocabulary energy-based framework for end-to-end driving
Scaling end-to-end autonomous driving to complex open-world settings demands strong perception. Lagrange offers an open-vocabulary, energy-based sparse framework for generalized end-to-end driving.
-
Editorial Alignment: A Participatory Approach to Engaging Editorial Expertise in LLM-mediated Knowledge Dissemination
Editorial alignment: engaging editorial expertise in LLM knowledge dissemination
LLM-driven information services are reshaping how public knowledge is produced. This work proposes a participatory approach to engage editorial expertise in LLM-mediated knowledge dissemination.
-
The Register Gap: A Meaning Intelligence Framework for Nigerian Public Discourse
The Register Gap: a meaning intelligence framework for Nigerian discourse
This work introduces the Meaning Intelligence Framework, a nine-dimension annotation and evaluation scheme, to study the register gap in Nigerian public discourse.
-
Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference
Explicit knowledge conflict resolution for LLM inference
Large language models perform strongly across language tasks but can hold conflicting parametric and contextual knowledge. This work proposes explicit knowledge conflict resolution to navigate unreliable knowledge during inference.
-
SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs
Vision-language models often underperform on evidence-intensive tasks by missing decisive visual cues. SPOT-E applies test-time entropy shaping with visual spotlights to improve frozen VLMs.
-
FlowMaps: Modeling Long-Term Multimodal Object Dynamics with Flow Matching
FlowMaps: long-term multimodal object dynamics with flow matching
Joint spatial and temporal understanding of 3D scenes is essential for deployed robots. FlowMaps models long-term multimodal object dynamics using flow matching.
-
Beyond Accuracy: Measuring Logical Compliance of Predictive Models
Machine learning models are mostly evaluated through predictive metrics such as accuracy. This work instead measures the logical compliance of predictive models.
-
Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random
Off-policy evaluation when rewards are missing not at random
The paper studies off-policy evaluation in finite-horizon MDPs when rewards are missing not at random, as in offline reinforcement learning with sparse, irregular, or censored reward records. It develops missingness-aware policies for settings such as health care and marketing.
-
MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization
MedRLM: recursive multimodal AI for long-context clinical reasoning
The paper introduces MedRLM, a recursive multimodal health-intelligence system for long-context clinical reasoning, sensor-guided screening, evidence-grounded decision support, and community-to-tertiary referral optimization. It targets reasoning over heterogeneous, longitudinal patient data, beyond the single-step prompting or retrieval of current medical LLMs.
-
NAMESAKES: Probing Identity Memorization in Text-to-Image Models
The paper introduces NAMESAKES, a study probing identity memorization in text-to-image models, which can generate realistic likenesses of individuals from their names. It addresses the difficulty of telling whether a generated face is memorized or fabricated without ground-truth photos, training data, or white-box model access.
-
HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization
HydraHead: head-level hybridization of linear and full attention
The paper proposes HydraHead, a hybrid attention design that exploits head-level functional heterogeneity to combine linear and full attention. It moves beyond the common layer-wise hybridization strategy, addressing the difficulty of integrating linear attention with full attention for efficient long-context processing.
-
Improving health intelligence in ChatGPT
OpenAI improves ChatGPT health responses with GPT-5.5 Instant
OpenAI says GPT-5.5 Instant strengthens ChatGPT's health and wellness responses through better reasoning, richer context, and clearer communication. The work is backed by physician-informed evaluations aimed at delivering more reliable, trustworthy health guidance.
-
What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis
An information-theoretic look at supervising latent chain-of-thought
The paper gives an information-theoretic analysis of what makes supervision effective in latent chain-of-thought reasoning, which internalizes reasoning in continuous hidden states. It examines why outcome supervision provides weak learning signals, making robust latent reasoning difficult.
-
When Lower Privileges Suffice: Investigating Over-Privileged Tool Selection in LLM Agents
Investigating over-privileged tool selection in LLM agents
The paper investigates over-privileged tool selection in LLM agents, which autonomously choose among tools with different privilege levels. It addresses a gap in prior tool-selection research, which focuses on safety-agnostic metadata preferences, by studying when lower-privilege tools would suffice.
-
REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
REDACT: a controlled multilingual benchmark for PII detection
The paper presents REDACT, a systematically controlled multilingual benchmark for personal information (PII) detection. It addresses three limitations of existing corpora: few entity types, ad hoc generation, and little insight into which surface conditions cause detector failures.
-
AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts
AtomMem: an LLM-agent memory system built on atomic facts
The paper proposes AtomMem, a simple and effective memory system for LLM agents built around atomic facts. It addresses the limits of fixed context windows for accumulating and reusing information across sessions, and the coarse, unstable memory of existing memory-augmented systems.
-
DeepSeek Introduces Vision
DeepSeek introduces vision capabilities
DeepSeek has introduced vision capabilities, adding image understanding to its previously text-focused models. The multimodal upgrade broadens the range of tasks the models can handle.
-
Announcing Stack Overflow for Agents
Stack Overflow has announced Stack Overflow for Agents, a service aimed at AI agents. Like the Q&A site human developers use, it seeks to let agents reference and share knowledge and code examples for solving problems.
-
Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
Selective verification for budget-aware test-time reasoning
The paper studies budget-aware test-time reasoning as a deployment allocation problem, asking whether to 'think again' or 'think longer.' It proposes selective verification because extra reasoning is not uniformly useful: it can repair failures, waste compute on answers that are already correct, or introduce harmful changes.
-
Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
Manifold Bandits: Bayesian curriculum learning for LLM reasoning
The paper proposes Manifold Bandits, a Bayesian curriculum-learning method that samples training problems over the latent geometry of large language models. It targets reinforcement learning for LLM reasoning, where training efficiency depends heavily on how prompts are selected during optimization.
-
Benchmarking Agentic Review Systems
Benchmarking agentic peer-review systems
The paper benchmarks agentic review systems, which are emerging to relieve the pressure AI-assisted research places on peer review. It evaluates two open-source systems, one proprietary system, and a zero-shot baseline, addressing the open question of how such systems should be assessed.
-
"Shadow AI": more than 70% of companies cannot keep up with countermeasures. Is "using only the AI the company chose" reaching its limit? (Gartner)
Gartner: 73% of Japanese firms cannot keep up with shadow AI
Gartner reports that 73% of Japanese companies have failed to address shadow AI, where employees use unsanctioned AI tools at work. Restricting staff to only company-approved AI is nearing its limits, making governance and enablement a shared challenge.
-
Closing the Calibration Gap in Semantic Caching
The paper addresses the calibration gap in semantic caching, which cuts LLM inference costs by serving cached responses to semantically similar queries. It shows that evaluating with PR-AUC, which measures only ranking quality and not usability at a fixed threshold, leads to systematically poor deployment choices.
-
GLM-5.2 is probably the most powerful text-only open weights LLM
GLM-5.2 may be the most powerful text-only open weights LLM
Chinese AI lab Z.ai released GLM-5.2 to coding-plan subscribers on June 13 and then published full open weights under an MIT license on June 16. Similar in size to GLM-5 and GLM-5.1, it may be the most powerful text-only open weights LLM, per Simon Willison.
-
"Students who use AI" vs. "students who don't": whose essays are more creative? A US university ran an empirical experiment in 2025
AI-using vs non-using students: whose essays are more creative?
Georgetown University researchers published a study on the homogenizing effect of LLMs on creative diversity, empirically comparing human and ChatGPT writing. The article reports how using AI affects the creativity and diversity of students' essays.
-
Native Active Perception as Reasoning for Omni-Modal Understanding
Active perception as reasoning for efficient omni-modal understanding
Passive long-video models 'watch it all,' processing frames uniformly so cost grows with duration regardless of query difficulty. This work treats perception as reasoning, with native active perception that selectively attends to relevant frames for efficient omni-modal understanding.