Safety & Evaluation (Page 10 of 11)｜AI/Tech News Trends

arXiv cs.CL (Computation and Language) · 2026-06-15 EN Safety & Evaluation extract

How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups

Extrinsic discourse evaluation of machine translation quality

This arXiv paper argues that standard machine-translation (MT) metrics assess quality intrinsically and miss the downstream consequences of translation errors. Under a static regime, the authors propose an entity-counting task that probes referential consistency and show that high intrinsic MT quality does not reliably predict downstream discourse success. Under an interactive regime, they use the goal-oriented multi-agent Welfare Diplomacy game as a probe.

arXiv cs.CL (Computation and Language) · 2026-06-15 EN Agents & Tool Use extract

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

SING: synthetic intention graph for scalable active tool discovery

AI Agents · Neural Network · Reinforcement Learning

This arXiv paper addresses tool selection for LLM agents whose harnesses connect to hundreds or thousands of APIs, where exhaustive tool-schema injection is costly and imposes a closed-world assumption. Noting that one-shot retrieval often fails to align isolated tool descriptions with the agent's true intent—especially in long-horizon tasks—the authors propose SING, a Synthetic Intention Graph for scalable, active tool discovery.

arXiv cs.CL (Computation and Language) · 2026-06-15 EN Safety & Evaluation extract

Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?

Uncertainty estimation fails as a safety net for clinical VQA

Computer Vision · Retrieval-Augmented Generation (RAG) · Software Engineering

This arXiv paper tests whether uncertainty estimation (UE) gives clinical vision-language models a reliable trust-or-escalate signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering, the authors find UE quality is not intrinsic to the method but tracks model accuracy—degrading exactly where performance is weakest and reliability most needed. Under perturbations that hide the correct option, accuracy collapses while uncertainty barely changes.

arXiv cs.CL (Computation and Language) · 2026-06-15 EN New Model Releases extract

The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage

BD-LSC: a new benchmark dataset for lexical semantic change detection

Embeddings · GPT · Machine Learning · Neural Network · Transformer

This arXiv paper introduces two complementary benchmark datasets for computational lexical semantic change (LSC) detection. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, loss, and stability across three time periods, targeting cases—especially slang versus standard usage—where words simultaneously gain and lose senses, which existing benchmarks struggle to capture.

arXiv cs.CL (Computation and Language) · 2026-06-15 EN Funding & M&A extract

Can LLM Coding Agents Reason About Time Series?

Can LLM coding agents reason about time series? A benchmark study

AI Agents · Software Engineering

This arXiv study tests whether LLM agents can analyze time series data, which is ubiquitous in finance, healthcare, and environmental monitoring. Comparing three approaches (raw numerical data, the LLM as a coding agent, and a combination of the two), the authors find that agents with code access can outperform models processing raw data by up to 10%, though even the best agent still answers roughly 22-34% of questions incorrectly.

arXiv cs.CL (Computation and Language) · 2026-06-15 EN Safety & Evaluation extract

DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing

DoubtProbe: a dual-branch inference-time defense against LLM jailbreaks

Inference · Llama · Retrieval-Augmented Generation (RAG)

This arXiv paper proposes DoubtProbe, a dual-branch inference-time framework for black-box jailbreak defense in LLMs. The authors observe that many jailbreaks do not remove the harmful goal but reorganize the information needed to express it, evading safety alignment while remaining recoverable during generation. DoubtProbe combines structural verification and semantic auditing to counter this.

Stratechery (free posts) · 2026-06-15 EN Safety & Evaluation extract

Anthropic’s Safety Superpower

Stratechery: Anthropic's safety stance licenses its business aims

Anthropic

Stratechery argues that Anthropic's conviction in its own safety commitment grants it license to aggressively favor its business interests, and at times to challenge the U.S. government. The essay critically examines how the safety banner shapes the firm's competitive posture.

Simon Willison's Weblog · 2026-06-13 EN Developer Tools extract

Statement on the US government directive to suspend access to Fable 5 and Mythos 5

Willison on the US directive to suspend Fable 5 and Mythos 5

Anthropic · Claude

Simon Willison comments on the US government's national-security export-control directive suspending all foreign-national access to Fable 5 and Mythos 5, calling the move extraordinary and questioning its rationale and impact.

Anthropic News · 2026-06-12 EN Safety & Evaluation extract

Results from the first Anthropic Public Record

Anthropic shares first Public Record survey of 52,000 Americans on AI

Anthropic · Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning

Anthropic released first-wave results of its Public Record survey of nearly 52,000 Americans. Curing diseases topped hopes for AI (48%), job loss led fears (64%), and over 70% backed government regulation of AI across party lines.

arXiv cs.AI (Artificial Intelligence) · 2026-06-12 EN Safety & Evaluation extract

ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning

ClinHallu: a stage-wise hallucination diagnosis benchmark for medical MLLMs

Fine-tuning · Machine Learning · Software Engineering

ClinHallu is a benchmark for diagnosing where hallucinations originate in medical multimodal LLM reasoning, decomposing traces into visual recognition, knowledge recall, and reasoning integration. It provides 7,031 validated instances and uses stage-replacement interventions to localize error sources.

arXiv cs.CL (Computation and Language) · 2026-06-12 EN Safety & Evaluation extract

CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment

CORA aligns reasoning and answers in multimodal RLVR

Computer Vision · Inference · Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Software Engineering

CORA analyzes the gap between a model's reasoning and its final answer when extending verifiable-reward RL to multimodal settings. It proposes consistency-oriented reasoning alignment to bridge that gap.

arXiv cs.LG (Machine Learning) · 2026-06-12 EN New Model Releases extract

A Complexity Measure for Active Learning in Multi-group Mean Estimation

A complexity measure for active multi-group mean estimation

The paper studies active learning for multi-group mean estimation framed as a d-armed bandit minimizing max-risk. It introduces a complexity measure characterizing the difficulty of adaptive budget allocation.
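For intuition only, the max-risk objective can be illustrated with a simple greedy rule: keep sampling whichever group currently has the largest estimated risk (sample variance divided by sample count). This is a generic heuristic written for illustration, not the paper's algorithm or its complexity measure; the noise levels, budget, warm-up size, and the name `allocate_adaptively` are all hypothetical.

```python
import random
import statistics

def allocate_adaptively(samplers, budget, warmup=5):
    """Greedy budget allocation across groups (illustrative heuristic).

    Each step samples the group whose estimated risk of the sample mean,
    sample variance / sample count, is currently the largest.
    """
    data = [[draw() for _ in range(warmup)] for draw in samplers]
    for _ in range(budget - warmup * len(samplers)):
        risks = [statistics.variance(xs) / len(xs) for xs in data]
        worst = max(range(len(data)), key=risks.__getitem__)
        data[worst].append(samplers[worst]())
    return data

rng = random.Random(0)
# Hypothetical groups: identical means, very different noise levels.
sigmas = [0.5, 1.0, 4.0]
samplers = [lambda s=s: rng.gauss(0.0, s) for s in sigmas]

data = allocate_adaptively(samplers, budget=600)
counts = [len(xs) for xs in data]
print(counts)  # the noisiest group should receive most of the budget
```

Under this rule the budget drifts toward the noisiest group, because equalizing variance-over-count across groups requires sample counts roughly proportional to each group's variance.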

arXiv cs.AI (Artificial Intelligence) · 2026-06-12 EN Safety & Evaluation extract

Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit

Why generating 'trivia' is provably necessary for valuable mathematics

Retrieval-Augmented Generation (RAG)

As AI coupled to proof assistants generates formal mathematics at scale, a gap opens between what a checker verifies and what mathematicians value. Through the lens of language generation in the limit, the paper argues that producing trivial, peripheral statements is provably necessary to generate valuable mathematics.

arXiv cs.LG (Machine Learning) · 2026-06-12 EN New Model Releases extract

Optimal Hidden-Target Learning for Online Inventory Optimization on General Convex Sets

Optimal hidden-target learning for online inventory optimization

The work casts online inventory optimization as online convex optimization with memory, where carryover makes the feasible set history-dependent. It develops an optimal hidden-target learning method on general convex sets.

arXiv cs.LG (Machine Learning) · 2026-06-12 EN Safety & Evaluation extract

When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing

Route-specialized dual adapters for memory-assisted knowledge editing

Embeddings · Inference · Llama

This work targets knowledge editing that updates selected facts while preserving nearby behavior in a memory-assisted setting. It proposes route-specialized dual adapters that decide when to write and when to suppress edits.

arXiv cs.LG (Machine Learning) · 2026-06-12 EN Training & Fine-tuning extract

Graph Structured Combinatorial Semi-Bandit with Nonlinear Reward Associations through Separable Signals

Graph-structured combinatorial semi-bandits with nonlinear rewards

Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning

The paper addresses combinatorial semi-bandit identification of optimal structures under nonlinear reward associations. It leverages separable signals to reduce sampling and computational cost.

arXiv cs.LG (Machine Learning) · 2026-06-12 EN Safety & Evaluation extract

Which Directions Matter? Sparse Design for Affine Robust Optimization

Sparse design identifies which directions matter in robust optimization

Machine Learning · Retrieval-Augmented Generation (RAG)

The work studies which uncertainty directions a model must cover in affine robust optimization defined by a finite dictionary and budget. It proposes a sparse design selecting the directions that matter.

arXiv cs.LG (Machine Learning) · 2026-06-12 EN Safety & Evaluation extract

Graph Diffusion Residuals for Control-Function Instrumental Variables

Graph diffusion residuals for control-function instrumental variables

Retrieval-Augmented Generation (RAG)

Control-function IV estimators need first-stage residuals, but high-capacity models can interpolate the treatment and leave too little residual variation. The paper proposes graph diffusion residuals to address this.
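To make the role of the first-stage residual concrete, here is a minimal sketch of the textbook linear control-function estimator on synthetic data. It does not implement the paper's graph diffusion residuals; the data-generating process, the true effect of 2.0, and every function name are assumptions chosen for illustration.

```python
import random

def center(xs):
    m = sum(xs) / len(xs)
    return [x - m for x in xs]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def control_function_estimate(z, d, y):
    """Linear control-function estimate of the effect of d on y."""
    z, d, y = center(z), center(d), center(y)
    # First stage: regress treatment d on instrument z, keep the residual v.
    gamma = dot(z, d) / dot(z, z)
    v = [di - gamma * zi for di, zi in zip(d, z)]
    # Second stage: regress y on d and v (2x2 normal equations, Cramer's rule).
    sdd, sdv, svv = dot(d, d), dot(d, v), dot(v, v)
    sdy, svy = dot(d, y), dot(v, y)
    det = sdd * svv - sdv * sdv  # shrinks to 0 if the residual v vanishes
    return (sdy * svv - svy * sdv) / det

rng = random.Random(1)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]          # instrument
u = [rng.gauss(0, 1) for _ in range(n)]          # unobserved confounder
d = [zi + ui + rng.gauss(0, 0.5) for zi, ui in zip(z, u)]
y = [2.0 * di + 3.0 * ui + rng.gauss(0, 1) for di, ui in zip(d, u)]

dc, yc = center(d), center(y)
naive = dot(dc, yc) / dot(dc, dc)                # biased by the confounder
cf = control_function_estimate(z, d, y)
print(round(naive, 2), round(cf, 2))
```

The determinant in the second stage is proportional to the residual sum of squares, so a first stage that fits the treatment perfectly leaves nothing to condition on and the estimate becomes unstable.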

arXiv cs.LG (Machine Learning) · 2026-06-12 EN Safety & Evaluation extract

Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens

How DiffusionGemma actually commits tokens, neither parallel nor sequential

Deep Learning · Mixture of Experts (MoE)

Diffusion language models are marketed as parallel decoders, yet their real token-commit order is rarely measured. Instrumenting DiffusionGemma, the paper shows it is neither purely parallel nor sequential.

arXiv cs.AI (Artificial Intelligence) · 2026-06-12 EN Training & Fine-tuning extract

A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health

Comparing deep learning for multi-horizon behavioural forecasting in mHealth

Deep Learning · Fine-tuning · Machine Learning · Neural Network · Transformer

Wearables and smartphones generate rich behavioural time series for proactive health interventions, yet systematic comparisons of forecasting architectures are lacking. The paper benchmarks deep learning architectures for multi-horizon behavioural forecasting in mobile health.

arXiv cs.CL (Computation and Language) · 2026-06-12 EN New Model Releases extract

LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations

LoSoNA benchmarks local social norm adaptation in group chats

AI Agents · Claude · Gemini · Software Engineering

Online group chats follow local conversational norms that are rarely stated explicitly. LoSoNA is a benchmark measuring whether LLM-based agents can recognize and adapt to these local social norms.

arXiv cs.LG (Machine Learning) · 2026-06-12 EN Inference & Efficiency extract

Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0

A fused INT8 GEMM kernel speeds diffusion transformers on consumer GPUs

Neural Network · Quantization · Transformer

Post-training INT8 quantization of diffusion transformers is often slower than FP8/NF4 on consumer Ampere GPUs. The paper presents a fused INT8 GEMM kernel for Ideogram 4.0 that realizes native INT8 speedups.
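As background on what an INT8 GEMM computes, the sketch below shows the arithmetic in plain Python: symmetric per-tensor quantization, integer accumulation, and one rescale at the end. It is not the paper's fused kernel and says nothing about GPU performance; the matrices, the per-tensor scaling choice, and the function names are illustrative assumptions.

```python
def quantize(matrix):
    """Symmetric per-tensor INT8 quantization; returns (integer rows, scale)."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 127 if max_abs else 1.0
    q = [[max(-127, min(127, round(v / scale))) for v in row] for row in matrix]
    return q, scale

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def int8_gemm(a, b):
    """Quantize both operands, multiply in integers, then rescale once."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = matmul(qa, qb)  # exact integer accumulation, no rounding here
    return [[v * scale_a * scale_b for v in row] for row in acc]

# Hypothetical small operands.
a = [[0.5, -1.25, 2.0], [1.5, 0.25, -0.75]]
b = [[1.0, -0.5], [0.25, 2.0], [-1.5, 0.75]]

exact = matmul(a, b)
approx = int8_gemm(a, b)
max_err = max(abs(x - y)
              for row_e, row_a in zip(exact, approx)
              for x, y in zip(row_e, row_a))
print(max_err)  # small, bounded by the per-element quantization step
```

In a real kernel the quantize, multiply, and rescale steps would typically be combined to avoid extra memory traffic, which is the kind of overhead a fused design targets.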

arXiv cs.LG (Machine Learning) · 2026-06-12 EN Safety & Evaluation extract

Zero-shot generalization of transformer neural operators to larger domains

Embeddings · Inference · Machine Learning · Neural Network · Transformer

The paper studies whether transformer-based neural operators for PDE solution operators can generalize zero-shot to larger spatial domains than seen in training.

arXiv cs.AI (Artificial Intelligence) · 2026-06-12 EN Policy & Regulation extract

Regulating the Machine Contributor: Governance and Policy Alignment in Open Source

Governance and policy alignment for AI contributors in open source

AI Agents · Retrieval-Augmented Generation (RAG) · Software Engineering

AI-assisted development has moved from autocomplete to agents that plan changes, edit files, and submit pull requests with limited supervision, while open source evolves through human processes. The paper examines governance and policy alignment for regulating such machine contributors.

arXiv cs.AI (Artificial Intelligence) · 2026-06-12 EN Safety & Evaluation extract

When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime

A longitudinal taxonomy of silent failures in a production LLM agent runtime