Safety & Evaluation A
-
How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups
Extrinsic discourse evaluation of machine translation quality
This arXiv paper argues that standard machine-translation (MT) metrics assess quality intrinsically and miss the downstream consequences of translation errors. Under a static regime, the authors propose an entity-counting task probing referential consistency and show that high intrinsic MT quality does not reliably predict downstream discourse success. Under an interactive regime, they use the goal-oriented multi-agent Welfare Diplomacy game as a probe.
-
SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents
SING: synthetic intention graph for scalable active tool discovery
This arXiv paper addresses tool selection for LLM agents whose harnesses connect to hundreds or thousands of APIs, where exhaustive tool-schema injection is costly and imposes a closed-world assumption. Noting that one-shot retrieval often fails to align isolated tool descriptions with the agent's true intent, especially in long-horizon tasks, the authors propose SING, a Synthetic Intention Graph for scalable, active tool discovery.
-
Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?
Uncertainty estimation fails as a safety net for clinical VQA
This arXiv paper tests whether uncertainty estimation (UE) gives clinical vision-language models a reliable trust-or-escalate signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering, the authors find that UE quality is not intrinsic to the method but tracks model accuracy, degrading exactly where performance is weakest and reliability most needed. Under perturbations that hide the correct option, accuracy collapses while uncertainty barely changes.
-
The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage
BD-LSC: a new benchmark dataset for lexical semantic change detection
This arXiv paper introduces two complementary benchmark datasets for computational lexical semantic change (LSC) detection. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, loss, and stability across three time periods, targeting cases, especially slang versus standard usage, where words simultaneously gain and lose senses, which existing benchmarks struggle to capture.
-
Can LLM Coding Agents Reason About Time Series?
Can LLM coding agents reason about time series? A benchmark study
This arXiv study tests whether LLM agents can analyze the time series data that is ubiquitous in finance, healthcare, and environmental monitoring. Comparing three approaches (raw numerical data, the LLM as a coding agent, and a combination), the authors find that agents with code access can outperform models processing raw data by up to 10%, though even the best agent still answers roughly 22-34% incorrectly.
-
DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing
DoubtProbe: a dual-branch inference-time defense against LLM jailbreaks
This arXiv paper proposes DoubtProbe, a dual-branch inference-time framework for black-box jailbreak defense in LLMs. The authors observe that many jailbreaks do not remove the harmful goal but reorganize the information needed to express it, evading safety alignment while remaining recoverable during generation. DoubtProbe combines structural verification and semantic auditing to counter this.
-
Anthropic’s Safety Superpower
Stratechery: Anthropic's safety stance licenses its business aims
Stratechery argues that Anthropic's conviction in its own safety commitment grants it license to aggressively favor its business interests, and at times to challenge the U.S. government. The essay critically examines how the safety banner shapes the firm's competitive posture.
-
Statement on the US government directive to suspend access to Fable 5 and Mythos 5
Willison on the US directive to suspend Fable 5 and Mythos 5
Simon Willison comments on the US government's national-security export-control directive suspending all foreign-national access to Fable 5 and Mythos 5, calling the move extraordinary and questioning its rationale and impact.
-
Results from the first Anthropic Public Record
Anthropic shares first Public Record survey of 52,000 Americans on AI
Anthropic released first-wave results of its Public Record survey of nearly 52,000 Americans. Curing diseases topped hopes for AI (48%), job loss led fears (64%), and over 70% backed government regulation of AI across party lines.
-
ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
ClinHallu: a stage-wise hallucination diagnosis benchmark for medical MLLMs
ClinHallu is a benchmark for diagnosing where hallucinations originate in medical multimodal LLM reasoning, decomposing traces into visual recognition, knowledge recall, and reasoning integration. It provides 7,031 validated instances and uses stage-replacement interventions to localize error sources.
-
CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
CORA aligns reasoning and answers in multimodal RLVR
CORA analyzes the gap between a model's reasoning and its final answer when extending verifiable-reward RL to multimodal settings. It proposes consistency-oriented reasoning alignment to bridge that gap.
-
A Complexity Measure for Active Learning in Multi-group Mean Estimation
A complexity measure for active multi-group mean estimation
The paper studies active learning for multi-group mean estimation framed as a d-armed bandit minimizing max-risk. It introduces a complexity measure characterizing the difficulty of adaptive budget allocation.
-
Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit
Why generating 'trivia' is provably necessary for valuable mathematics
As AI coupled to proof assistants generates formal mathematics at scale, a gap opens between what a checker verifies and what mathematicians value. Through the lens of language generation in the limit, the paper argues that producing trivial, peripheral statements is provably necessary to generate valuable mathematics.
-
Optimal Hidden-Target Learning for Online Inventory Optimization on General Convex Sets
Optimal hidden-target learning for online inventory optimization
The work casts online inventory optimization as online convex optimization with memory, where carryover makes the feasible set history-dependent. It develops an optimal hidden-target learning method on general convex sets.
-
When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing
Route-specialized dual adapters for memory-assisted knowledge editing
This work targets knowledge editing that updates selected facts while preserving nearby behavior in a memory-assisted setting. It proposes route-specialized dual adapters that decide when to write and when to suppress edits.
-
Graph Structured Combinatorial Semi-Bandit with Nonlinear Reward Associations through Separable Signals
Graph-structured combinatorial semi-bandits with nonlinear rewards
The paper addresses combinatorial semi-bandit identification of optimal structures under nonlinear reward associations. It leverages separable signals to reduce sampling and computational cost.
-
Which Directions Matter? Sparse Design for Affine Robust Optimization
Sparse design identifies which directions matter in robust optimization
The work studies which uncertainty directions a model must cover in affine robust optimization defined by a finite dictionary and budget. It proposes a sparse design selecting the directions that matter.
-
Graph Diffusion Residuals for Control-Function Instrumental Variables
Graph diffusion residuals for control-function instrumental variables
Control-function IV estimators need first-stage residuals, but high-capacity models can interpolate treatment and leave too little residual. The paper proposes graph diffusion residuals to address this.
-
Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens
How DiffusionGemma actually commits tokens, neither parallel nor sequential
Diffusion language models are marketed as parallel decoders, yet their real token-commit order is rarely measured. Instrumenting DiffusionGemma, the paper shows it is neither purely parallel nor sequential.
-
A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health
Comparing deep learning for multi-horizon behavioural forecasting in mHealth
Wearables and smartphones generate rich behavioural time series for proactive health interventions, yet systematic comparisons of forecasting architectures are lacking. The paper benchmarks deep learning architectures for multi-horizon behavioural forecasting in mobile health.
-
LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations
LoSoNA benchmarks local social norm adaptation in group chats
Online group chats follow local conversational norms that are rarely stated. LoSoNA is a benchmark measuring whether LLM-based agents can recognize and adapt to these local social norms.
-
Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0
A fused INT8 GEMM kernel speeds diffusion transformers on consumer GPUs
Post-training INT8 quantization of diffusion transformers is often slower than FP8/NF4 on consumer Ampere GPUs. The paper presents a fused INT8 GEMM kernel for Ideogram 4.0 that realizes native INT8 speedups.
-
Zero-shot generalization of transformer neural operators to larger domains
Zero-shot generalization of transformer neural operators to larger domains
The paper studies whether transformer-based neural operators for PDE solution operators can generalize zero-shot to larger spatial domains than seen in training.
-
Regulating the Machine Contributor: Governance and Policy Alignment in Open Source
Governance and policy alignment for AI contributors in open source
AI-assisted development has moved from autocomplete to agents that plan changes, edit files, and submit pull requests with limited supervision, while open source evolves through human processes. The paper examines governance and policy alignment for regulating such machine contributors.
-
When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime
A longitudinal taxonomy of silent failures in a production LLM agent runtime
LLM agents increasingly run as long-lived autonomous runtimes that schedule jobs, call tools, maintain memory, and push results to humans. This longitudinal study of one persistent system presents a taxonomy of its silent failures.
-
Sensitivity Shaping for Latent Modeling
Sensitivity shaping for detecting OOD transitions in dynamics models
Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution transitions. The paper proposes sensitivity shaping for latent modeling to improve such OOD detection.
-
A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems
A temporal planning framework for disruption-aware railway routing
Route optimization is vital for safety and punctuality in railway operations, especially in heterogeneous multi-gauge networks. The paper proposes a temporal planning framework for disruption-aware dynamic route optimization.
-
CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation
CARE: auditable evidence review to control LLM-generated policies
Giving LLMs direct control over costly, irreversible experiments invites unsafe exploration, while discarding their creativity sacrifices optimization. CARE controls LLM-generated policies through auditable review of evidence in scientific experimentation.
-
Persuasion Index: A Theory-Guided Framework for Persuasion Analysis
Persuasion Index: a theory-guided framework for persuasion analysis
Identifying persuasive rhetorical cues matters for detecting manipulation, AI safety, and health communication. The paper proposes Persuasion Index, a theory-guided framework for persuasion analysis.
-
VISTA: View-Consistent Self-Verified Training for GUI Grounding
VISTA: view-consistent self-verified training for GUI grounding
Applying GRPO to GUI grounding samples rollouts from a single screenshot, so groups often turn out all-failure or all-success and yield a weak signal. VISTA introduces view-consistent, self-verified training to stabilize GUI grounding.