Safety & Evaluation A
-
REDACT: A Systematically Controlled Multilingual Benchmark for Personal Information Detection
REDACT: a controlled multilingual benchmark for PII detection
The paper presents REDACT, a systematically controlled multilingual benchmark for personal information (PII) detection. It addresses limitations of existing corpora: few entity types, ad hoc generation, and little insight into which surface conditions cause detector failures.
-
The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI
Scaling up democratic deliberation and empowering people with AI
The paper discusses options for scaling up democratic deliberation and empowering people with AI as large language models become prominent in public discourse. It weighs opportunities against persistent concerns such as linguistic constraints, biases, and the sycophantic tendencies of LLMs, beyond what red teaming addresses.
-
Large Language Models Do Not Always Need Readable Language
LLMs don't always need human-readable language
The paper investigates whether semantic information can be encoded in compact, non-standard text that sacrifices human readability while remaining usable by models. It argues large language models do not always need human-readable language, especially when the intended reader is another model.
-
Prompt, Plan, Extract: Zero-Shot Agentic LLMs Workflows for Lung Pathology Extraction from Clinical Narratives
Zero-shot agentic LLM workflows for lung pathology extraction
The paper presents Prompt, Plan, Extract, a zero-shot agentic LLM workflow for extracting lung pathology information from clinical narrative reports. It targets the labor-intensive, error-prone manual extraction needed for cancer staging and tumor registries, avoiding fully supervised NLP pipelines.
-
AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts
AtomMem: an LLM-agent memory system built on atomic facts
The paper proposes AtomMem, a simple and effective memory system for LLM agents built around atomic facts. It addresses the limits of fixed context windows for accumulating and reusing information across sessions, and the coarse, unstable memory of existing memory-augmented systems.
-
Leverage Is Not Reach: A Control-Window Law for Single-Neuron Steering in Language Models
A control-window law for single-neuron steering in LLMs
The paper develops a budget-normalized control-window framework for single-neuron steering in language models. It seeks to predict when intervening on one neuron coherently controls a behavior (such as refusal or language routing gated by sparse feed-forward neurons) rather than collapsing the output.
-
JAMER: Project-Level Code Framework Dataset and Benchmark on Professional Game Engines
JAMER: a project-level code benchmark on game engines
The paper introduces JAMER, a project-level code framework dataset and benchmark for professional game engines. It addresses the lack of large-scale datasets and deterministic evaluation for project-level code engineering, which has remained underexplored despite progress in AI-driven game asset and gameplay generation.
-
CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis
CREDENCE: semantic metrics for claim decomposition in fact-checking
The paper presents CREDENCE, an approach to decomposing compound sentences into atomic, verifiable claims for automated fact-checking. It introduces semantic metrics that avoid token-overlap measures, which underestimate quality for paraphrastic claims, and adds convergence and termination analysis.
-
CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models
CombEval: evaluating combinatorial counting in LLMs
The paper presents CombEval, a dynamic benchmark for evaluating combinatorial counting in large language models. Each problem is expressed as a typed Cofola specification over entities, combinatorial objects, dependencies, and constraints, enabling controlled generation of natural-language counting problems.
-
AgentFinVQA: A Deployable Multi-Agent Pipeline for Auditable Financial Chart QA
AgentFinVQA: an auditable multi-agent pipeline for financial chart QA
The paper presents AgentFinVQA, a deployable multi-agent pipeline for auditable financial chart question answering. It targets regulated settings where practitioners must know which answers to trust and cannot send client data to external model providers, unlike existing accuracy-focused, opaque chart-QA agents.
-
Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models
Manifold Bandits: Bayesian curriculum learning for LLM reasoning
The paper proposes Manifold Bandits, a Bayesian curriculum-learning method that samples training problems over the latent geometry of large language models. It targets reinforcement learning for LLM reasoning, where training efficiency depends heavily on how prompts are selected during optimization.
-
Benchmarking Agentic Review Systems
Benchmarking agentic peer-review systems
The paper benchmarks agentic review systems, which are emerging to relieve the pressure AI-assisted research places on peer review. It evaluates two open-source systems, one proprietary system, and a zero-shot baseline, addressing the open question of how such systems should be assessed.
-
Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings
Sequential DPO and forgetting across preference settings
The paper studies sequential Direct Preference Optimization (DPO) across different preference settings, examining how applying multiple alignment objectives one after another affects earlier ones. It looks beyond uniform forgetting to understand how later training stages interfere with previously learned preferences.
-
NRITYAM: Language Models Meet Art and Heritage of Dance
NRITYAM: a benchmark for cultural comprehension of dance traditions
The paper presents NRITYAM, a benchmark for evaluating how well language models comprehend culture in the context of global dance traditions. It starts from the observation that the global effectiveness of language models depends on a nuanced understanding of local socio-cultural contexts.
-
Is it agentic enough? Benchmarking open models on your own tooling
Hugging Face benchmarks open models' agentic skill on your own tools
Hugging Face explores how to judge whether open models are 'agentic enough' by benchmarking them on your own tooling rather than generic suites. The approach evaluates models under realistic, user-specific tool setups to better gauge practical agent capability.
-
Anthropic opens Seoul office and announces new partnerships across the Korean AI ecosystem
Anthropic opens a Seoul office, announces new Korean AI partnerships
Anthropic opened a Seoul office and announced new partnerships across Korea's AI ecosystem, including enterprises, startups, and researchers building on Claude. It frames Korea as treating innovation and safety as two sides of the same coin. Specifics come from the announcement and have not been independently verified.
-
Learning User Simulators with Turing Rewards
User simulators learned with Turing rewards for agent training
Simulating human users in interactive settings could advance training of agent assistants, evaluation of personalization systems, and social-science research. This work learns user simulators using Turing rewards, aiming to reproduce more realistic user behavior.
-
UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning
UBP2: uncertainty-balanced planning for efficient preference-based RL
Preference-based RL learns reward models from pairwise behavior comparisons, bypassing explicit reward design, but existing methods often rely on passive data collection. UBP2 introduces uncertainty-balanced preference planning to actively select comparisons and learn efficiently from fewer preferences.
-
Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation
Rubric-conditioned self-distillation rethinks reward supervision
Post-training of reasoning models often combines supervised distillation with reinforcement learning from verifiable rewards, but distillation relies on costly chain-of-thought annotations. This work proposes rubric-conditioned self-distillation to rethink reward supervision while cutting annotation cost.
-
Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors
Reference-driven generation of multi-speaker audio scenes
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision such as per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. This work generates multi-speaker audio scenes by drawing on in-the-wild reference priors for more natural synthesis.
-
Data Intelligence Agents: Interpreting, Modeling, and Querying Enterprise Data via Autonomous Coding Agents
Data Intelligence Agents query enterprise data autonomously
Production data integration is bottlenecked by repeated, lossy handoffs among data owners, engineers, and analysts who must jointly discover, structure, and query enterprise data. The authors present Data Intelligence Agents (DIA), autonomous coding agents that interpret, model, and query that data.
-
Explaining Attention with Program Synthesis
Explaining attention via program synthesis for interpretability
A longstanding goal of interpretable deep learning is to replace opaque neural computations with human-meaningful symbolic descriptions. This paper approximates the behavior of attention components with synthesized programs, offering a route to explain attention and improve interpretability.
-
Confidence is Not Reliability: Rethinking MC Dropout in Brain Tumour Segmentation
Confidence is not reliability: rethinking MC dropout in tumour segmentation
Glioma segmentation in multiparametric MRI is critical for treatment planning, and a model that fails silently on treatment-critical sub-regions is a patient-safety risk that overlap metrics miss. This work shows that MC dropout confidence does not equal reliability and rethinks uncertainty estimation accordingly.
-
Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models
Measuring commonsense and knowledge retention in VLA models
Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet how much commonsense and factual knowledge they retain is unclear. This work measures that retention, revealing how much fine-tuning erodes prior world knowledge.
-
NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning
NeSyCat Torch unifies neurosymbolic semantics via category theory
Neurosymbolic semantics is fragmented: classical, fuzzy, probabilistic, and neural systems each define truth by their own rules. Extending ULLER, NeSyCat subsumes them under a single inductive definition of truth, delivered as a differentiable tensor implementation for neurosymbolic learning.
-
Beyond Algorithms: Conceptual Innovation in Medical Imaging AI
Beyond algorithms: the case for conceptual innovation in medical imaging AI
AI has driven rapid progress in medical imaging, yielding ever more sophisticated algorithms and steady benchmark gains. Yet this algorithm-centric trajectory reveals limits. This work argues for conceptual innovation beyond algorithms to achieve clinically meaningful advances in medical imaging AI.
-
Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA
Trade-offs in medical LLM adaptation, studied on French QA
As LLMs are adapted to specialized domains and languages, the effectiveness of adaptation strategies remains unclear. This empirical study on French medical question answering analyzes the trade-offs of various domain-adaptation methods, clarifying gains and losses in performance and generality.
-
Structured Inference with Large Language Gibbs
Structured probabilistic inference over LLMs via Gibbs sampling
Knowledge encoded in LLMs can serve as a substrate for structured reasoning over variables describing a complex world, but accessing it probabilistically is hard. This work performs structured inference over LLMs using Gibbs sampling, enabling probabilistic reasoning across interrelated variables.
-
A Multi-Domain Benchmark for Detecting AI-Generated Text-Rich Images from GPT-Image-2
A multi-domain benchmark to detect GPT-Image-2 text-rich images
Text-rich images often hold privacy-sensitive, transactional, or decision-relevant information. As multimodal generators synthesize realistic text and layouts, this work builds a multi-domain benchmark for detecting AI-generated text-rich images from GPT-Image-2, assessing detector reliability.
-
DreamReasoner-8B: Block-Size Curriculum Learning for Diffusion Reasoning Models
DreamReasoner-8B: block-size curriculum learning for diffusion reasoning
Block diffusion language models speed decoding via parallel block-wise denoising, but reliably scaling them for long chain-of-thought reasoning is unresolved. The authors develop DreamReasoner-8B, using block-size curriculum learning to strengthen long-CoT reasoning in diffusion reasoning models.