Safety & Evaluation (Page 5 of 11)｜AI/Tech News Trends

arXiv cs.AI (Artificial Intelligence) · 2026-06-17 EN Multimodal extract

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench: an RTS benchmark for strategic reasoning by VLMs

AI Agents · Computer Vision · Neural Network · Retrieval-Augmented Generation (RAG)

Modern vision-language models (VLMs) struggle with strategic reasoning. RTSGameBench uses real-time strategy games to benchmark VLMs on planning and situational judgment.

Read original (arXiv cs.AI (Artificial Intelligence)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN New Model Releases extract

SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents

SenFlow: inter-sentence flow modeling for AI-text detection

DeepSeek · Retrieval-Augmented Generation (RAG)

Sentence-level AI-generated text detection is hard in hybrid human-AI documents. SenFlow models inter-sentence flow to capture discontinuities, improving detection of AI-generated sentences within mixed documents.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking

Graph-ESBMC-PLC: SMT-based verification of PLCopen ladder diagrams

Inference · Machine Learning · Neural Network

PLCopen XML defines encodings for IEC 61131-3 Ladder Diagrams. Graph-ESBMC-PLC applies SMT-based model checking to formally verify graphical PLCopen XML Ladder Diagram programs, supporting correctness checking of industrial control software.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.AI (Artificial Intelligence) · 2026-06-17 EN Safety & Evaluation extract

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench: a risk-dimension-aware benchmark for AI4Science safety

Neural Network · Reinforcement Learning · Software Engineering

As LLMs become embedded in scientific research, evaluating their safety matters. SciRisk-Bench is a risk-dimension-aware benchmark that assesses the safety of LLMs in AI-for-science settings across multiple risk categories.

Read original (arXiv cs.AI (Artificial Intelligence)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Inference & Efficiency extract

REVES: REvision and VErification-Augmented Training for Test-Time Scaling

REVES: revision- and verification-augmented training for test-time scaling

Inference · Reinforcement Learning · Software Engineering

Test-time scaling via sequential revision has become a powerful paradigm. REVES proposes revision- and verification-augmented training that strengthens a model's ability to revise and verify its own outputs, making extra test-time compute more effective.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN New Model Releases extract

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE: stochastic prompt optimization via agent-guided exploration

Context engineering has become a primary lever for improving AI systems. SAGE is a stochastic prompt optimization method that uses agent-guided exploration to automatically discover effective prompts and improve task performance.

Read original (arXiv cs.CL (Computation and Language)) ↗

Stratechery (free posts) · 2026-06-17 EN Safety & Evaluation extract

The State of Fable, The Jailbreak Problem, SpaceX Acquires Cursor

Stratechery on Fable's state, jailbreaks, and SpaceX buying Cursor

Anthropic

A Stratechery column by Ben Thompson on three topics: the state of Anthropic's Fable model, the AI jailbreak problem, and SpaceX's acquisition of Cursor. Thompson argues the administration is likely wrong about Fable but that responsibility ultimately lies with Anthropic. Views are the author's; deal specifics are unverified.

Read original (Stratechery (free posts)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

Aligning implied statements for generalizable implicit hate detection

Speech Processing

Classifying implicit hate speech is hard because intent is rarely explicit. This work aligns implied statements and applies context-bounded semi-hard negative mining to improve the generalizability of implicit hate speech detection.
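
For readers unfamiliar with the term, semi-hard negative mining is usually defined in a triplet setting: a negative is "semi-hard" when it lies farther from the anchor than the positive does, but still within a margin of it. The sketch below illustrates only that generic criterion. The function name, the Euclidean distance, and the margin value are assumptions for illustration, and the paper's "context-bounded" variant is not represented.

```python
import math

def semi_hard_negatives(anchor, positive, negatives, margin=0.5):
    """Return indices of negatives satisfying d_ap < d_an < d_ap + margin.

    Hypothetical helper: Euclidean distance and margin=0.5 are assumptions.
    """
    d_ap = math.dist(anchor, positive)
    selected = []
    for i, neg in enumerate(negatives):
        d_an = math.dist(anchor, neg)
        # Farther than the positive, but still inside the margin band.
        if d_ap < d_an < d_ap + margin:
            selected.append(i)
    return selected

anchor = (0.0, 0.0)
positive = (1.0, 0.0)
# One hard negative (closer than the positive), one semi-hard, one easy.
negatives = [(0.5, 0.0), (1.2, 0.0), (3.0, 0.0)]
picked = semi_hard_negatives(anchor, positive, negatives)
print(picked)
```

Only the middle negative falls inside the band here; the closer one is a hard negative and the distant one is too easy to be informative.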

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Agents & Tool Use extract

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

Beyond reward engineering: a data recipe for long-context RL

AI Agents · Retrieval-Augmented Generation (RAG) · Reinforcement Learning

Long-context reasoning is essential for large language models. Rather than relying on reward engineering, this work presents a data recipe for long-context reinforcement learning that drives effective training.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN New Model Releases extract

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

GateMem: benchmarking memory governance in shared-memory agents

AI Agents · Neural Network

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared-memory governance untested. GateMem benchmarks memory governance, such as access control and management, in multi-principal shared-memory agents.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Beyond scalar scores: LLM-based metrics for radiology report significance

Inference · Machine Learning

Reliable evaluation of generated radiology reports requires strict clinical validity. Going beyond scalar scores, this work explores LLM-based metrics for clinical significance evaluation, assessing report quality in clinically meaningful terms.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

RedactionBench

RedactionBench: a benchmark for redacting sensitive information

Neural Network · Reinforcement Learning

Large language models are increasingly applied to sensitive domains. RedactionBench evaluates how well models redact sensitive information in such settings, supporting safer deployment.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

SAMA: semantic anchor-aligned augmentation for low-resource multimodal IE

Machine Learning · Retrieval-Augmented Generation (RAG)

Multimodal information extraction spans many tasks but suffers from scarce data in low-resource settings. SAMA proposes semantic anchor-aligned augmentation to unify and improve multimodal information extraction under low-resource conditions.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

LegalWorld: a life-cycle interactive environment for legal agents

AI Agents · Neural Network · Reinforcement Learning

Civil litigation is inherently a life-cycle process where documents connect across stages. LegalWorld provides an interactive environment covering the full litigation life cycle, enabling legal agents to be evaluated and trained within that flow.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Infrastructure & Hardware extract

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus: a morphology-aware neural tokenizer and embedder for Turkish

Embeddings · Inference · Retrieval-Augmented Generation (RAG)

Turkish is agglutinative, with meaning carried by morphemes that subword tokenizers fail to capture. Morpheus is a morphology-aware neural tokenizer and word embedder designed to improve Turkish language processing.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

LLMs struggle to measure item discrimination in reading assessment

Software Engineering

Item discrimination is a fundamental psychometric property that distinguishes students of different proficiency. This study shows that large language models struggle to measure item discrimination in reading comprehension assessment, exposing limits of automated evaluation.
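
As background, item discrimination is often estimated with the point-biserial correlation between a 0/1 item response and each student's total score. The sketch below shows that textbook statistic on made-up data. The function name and the sample scores are hypothetical, and the summary does not say which discrimination measure the study itself uses.

```python
from statistics import mean, pstdev

def point_biserial(item_correct, total_scores):
    """Point-biserial correlation: ((M1 - M0) / s) * sqrt(p * (1 - p)).

    item_correct: 0/1 per student; total_scores: each student's total score.
    Hypothetical helper for illustration only.
    """
    if len(item_correct) != len(total_scores):
        raise ValueError("inputs must have the same length")
    p = mean(item_correct)        # proportion answering correctly
    s = pstdev(total_scores)      # population standard deviation
    if p in (0, 1) or s == 0:
        return 0.0                # no variation, so no discrimination signal
    right = [t for c, t in zip(item_correct, total_scores) if c == 1]
    wrong = [t for c, t in zip(item_correct, total_scores) if c == 0]
    return (mean(right) - mean(wrong)) / s * (p * (1 - p)) ** 0.5

totals = [9, 8, 7, 3, 2, 1]       # made-up total scores
strong_item = [1, 1, 1, 0, 0, 0]  # only high scorers answer correctly
weak_item = [1, 0, 0, 1, 0, 1]    # correctness unrelated to proficiency
strong_r = point_biserial(strong_item, totals)
weak_r = point_biserial(weak_item, totals)
print(round(strong_r, 3), round(weak_r, 3))
```

A value near 1 means the item separates stronger from weaker students; values near zero or below mean it does not. In practice the item's own score is often removed from the total before correlating, which this sketch omits.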

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench: measuring Taiwanese legal understanding

Large language models show strong general abilities, but their grasp of region-specific law is under-tested. TW-LegalBench measures Taiwanese legal understanding, evaluating models for jurisdiction-specific legal applications.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-17 EN Safety & Evaluation extract

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim: a simulated-world forecasting benchmark

Reinforcement Learning · Software Engineering

Forecasting benchmarks for general-purpose AI usually inherit real-world events, making evaluation hard to control. ForecastBench-Sim introduces a simulated-world forecasting benchmark, enabling controlled assessment of AI forecasting ability.

Read original (arXiv cs.CL (Computation and Language)) ↗

OpenAI Blog · 2026-06-17 EN New Model Releases extract

Introducing LifeSciBench

OpenAI launches LifeSciBench for life-science research tasks

Deep Learning · Reinforcement Learning

OpenAI introduced LifeSciBench, an expert-authored and expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions. It aims to rigorously assess AI's practical usefulness in life-science research.

Read original (OpenAI Blog) ↗

arXiv cs.CL (Computation and Language) · 2026-06-16 EN New Model Releases extract

ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

ReproRepo scales reproducibility audits using GitHub repo issues

AI Agents · GPT · Machine Learning · Retrieval-Augmented Generation (RAG) · Reinforcement Learning

Reproducing results from papers and code is central to science, but existing benchmarks are hard to scale because they depend on manual effort. ReproRepo leverages GitHub repository issues to evaluate, at scale, how well LLM agents can assist with reproducibility tasks.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-16 EN New Model Releases extract

Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses

Darshana Graph: a parallel commentary corpus for Indian philosophy

Machine Learning · Neural Network

Darshana Graph is a corpus of over 125,000 text records spanning classical Hindu, Buddhist and Jain philosophical traditions, drawn from public-domain and openly licensed translations. It supports comparative Indian philosophy through stylometric and exploratory graph analyses.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-16 EN New Model Releases extract

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Zone of Proximal Policy Optimization puts the teacher in prompts, not gradients

Reinforcement Learning

Knowledge distillation is brittle for small students, as imitating a large teacher's logits concentrates on its sharpest modes and hurts generalization. The proposed Zone of Proximal Policy Optimization places the teacher in prompts rather than gradients to improve small-student generalization.
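
For context, the logit-imitation baseline mentioned above is commonly written as a KL divergence between temperature-softened teacher and student distributions. The sketch below shows that generic baseline only, not the paper's proposed method. The function names, the temperature of 2.0, and the example logits are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic logit-distillation loss; names and defaults are hypothetical.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s)
             for t, s in zip(p_teacher, p_student) if t > 0)
    return kl * temperature ** 2

matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([0.0, 0.0, 0.0], [5.0, 0.0, 0.0])
print(matched, mismatched)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, so a sharply peaked teacher pulls the student toward the teacher's dominant class.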

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.LG (Machine Learning) · 2026-06-16 EN Inference & Efficiency extract

Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?

Do distilled sets beat coresets? Rethinking dataset distillation

Machine Learning · Retrieval-Augmented Generation (RAG)

Dataset distillation synthesizes compact training sets for data-centric machine learning. This paper rethinks distillation for classification, asking whether distilled sets actually outperform coresets (real-data subsets) and under what conditions.

Read original (arXiv cs.LG (Machine Learning)) ↗

arXiv cs.AI (Artificial Intelligence) · 2026-06-16 EN Safety & Evaluation extract

Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers

Fixed-Point Reasoners: stabilizing deep looped Transformers (FPRM)

Transformer

The paper addresses the depth-induced signal propagation problem in looped Transformer architectures using pre-norm layers and residual scaling, and proposes FPRM, a looped Transformer model built on these architectural modifications.

Read original (arXiv cs.AI (Artificial Intelligence)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-16 EN Safety & Evaluation extract

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Lexical Markup Framework and TEI Lex-0

Encoding the Al-Mawrid Arabic-English dictionary with LMF and TEI Lex-0

The paper presents a methodology for systematically digitizing and encoding the legacy print Al-Mawrid Arabic-English dictionary using the ISO Lexical Markup Framework (LMF) and TEI Lex-0, producing a standardized computational lexicon that addresses a gap in Arabic lexical infrastructure.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-16 EN Safety & Evaluation extract

RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills

RubricsTree: scalable open-ended evaluation of personal health agents

AI Agents · Gemini · GPT · Meta · Neural Network

LLM personal health agents using sensor metrics promise to ease healthcare disparities, but an open-ended evaluation bottleneck limits clinical deployment. RubricsTree offers scalable, evolving open-ended evaluation across health memory and medical skills.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.CL (Computation and Language) · 2026-06-16 EN Safety & Evaluation extract

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

Red-team study evaluates Anthropic Fable 5 and Opus 4.8 robustness

Anthropic · Neural Network

A red-team study evaluates the adversarial robustness of two Anthropic frontier models, Fable 5 and Opus 4.8, against several families of automated jailbreak attacks across a multi-category harm taxonomy. Methods and figures are as stated by the paper; third-party verification not confirmed.

Read original (arXiv cs.CL (Computation and Language)) ↗

arXiv cs.AI (Artificial Intelligence) · 2026-06-16 EN Safety & Evaluation extract

The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data

SEFD: an open, layout-faithful reconstruction of SEC filings for LLMs

Reinforcement Learning

The paper introduces the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown, providing audited financial disclosures as token-efficient pretraining and evaluation data for financial language modeling.

Read original (arXiv cs.AI (Artificial Intelligence)) ↗

arXiv cs.AI (Artificial Intelligence) · 2026-06-16 EN New Model Releases extract

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

DRFLOW: a deep research benchmark for personalized workflow prediction

AI Agents · Retrieval-Augmented Generation (RAG) · Software Engineering

The paper introduces DRFLOW, a benchmark for evaluating personalized workflow prediction in deep research systems, focusing on identifying concrete action-step workflows for enterprise tasks rather than generating reports or summaries.

Read original (arXiv cs.AI (Artificial Intelligence)) ↗

arXiv cs.LG (Machine Learning) · 2026-06-16 EN Training & Fine-tuning extract

Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation

ATT&CK-labeled multi-source security log dataset with SLM evaluation

Fine-tuning · Llama · Machine Learning · Neural Network · Reinforcement Learning from Human Feedback (RLHF)

The work builds a dataset of multi-source cybersecurity logs labeled with MITRE ATT&CK and evaluates small language models (SLMs) on it. Summary is title-based and neutral; details and figures are as presented by the source and not independently verified.

Read original (arXiv cs.LG (Machine Learning)) ↗