Safety & Evaluation A
-
RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
RTSGameBench: an RTS benchmark for strategic reasoning by VLMs
Modern vision-language models struggle with strategic reasoning. RTSGameBench uses real-time strategy games to benchmark VLMs on planning and situational judgment, probing their strategic reasoning abilities.
-
SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents
SenFlow: inter-sentence flow modeling for AI-text detection
Sentence-level AI-generated text detection is hard in hybrid human-AI documents. SenFlow models inter-sentence flow to capture discontinuities, improving detection of AI-generated sentences within mixed documents.
-
Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking
Graph-ESBMC-PLC: SMT-based verification of PLCopen ladder diagrams
PLCopen XML defines encodings for IEC 61131-3 Ladder Diagrams. Graph-ESBMC-PLC applies SMT-based model checking to formally verify graphical PLCopen XML Ladder Diagram programs, supporting correctness checking of industrial control software.
-
SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety
SciRisk-Bench: a risk-dimension-aware benchmark for AI4Science safety
As LLMs become embedded in scientific research, evaluating their safety matters. SciRisk-Bench is a risk-dimension-aware benchmark that assesses the safety of LLMs in AI-for-science settings across multiple risk categories.
-
REVES: REvision and VErification--Augmented Training for Test-Time Scaling
REVES: revision- and verification-augmented training for test-time scaling
Test-time scaling via sequential revision has become a powerful paradigm. REVES proposes revision- and verification-augmented training that strengthens a model's ability to revise and verify its own outputs, making extra test-time compute more effective.
-
SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration
SAGE: stochastic prompt optimization via agent-guided exploration
Context engineering has become a primary lever for improving AI systems. SAGE is a stochastic prompt optimization method that uses agent-guided exploration to automatically discover effective prompts and improve task performance.
-
The State of Fable, The Jailbreak Problem, SpaceX Acquires Cursor
Stratechery on Fable's state, jailbreaks, and SpaceX buying Cursor
A Stratechery column by Ben Thompson on three topics: the state of Anthropic's Fable model, the AI jailbreak problem, and SpaceX's acquisition of Cursor. Thompson argues the administration is likely wrong about Fable but that responsibility ultimately lies with Anthropic. Views are the author's; deal specifics are unverified.
-
Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining
Aligning implied statements for generalizable implicit hate detection
Classifying implicit hate speech is hard because intent is rarely explicit. This work aligns implied statements and applies context-bounded semi-hard negative mining to improve the generalizability of implicit hate speech detection.
-
Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning
Beyond reward engineering: a data recipe for long-context RL
Long-context reasoning is essential for large language models. Rather than relying on reward engineering, this work presents a data recipe for long-context reinforcement learning that drives effective training.
-
GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents
GateMem: benchmarking memory governance in shared-memory agents
Memory benchmarks for LLM agents largely assume single-user settings, leaving shared-memory governance untested. GateMem benchmarks memory governance, such as access control and management, in multi-principal shared-memory agents.
-
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports
Beyond scalar scores: LLM-based metrics for radiology report significance
Reliable evaluation of generated radiology reports requires strict clinical validity. Going beyond scalar scores, this work explores LLM-based metrics for clinical significance evaluation, assessing report quality in clinically meaningful terms.
-
RedactionBench
RedactionBench: a benchmark for redacting sensitive information
Large language models are increasingly applied to sensitive domains. RedactionBench evaluates how well models redact sensitive information in such settings, supporting verification toward safer deployment.
-
SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction
SAMA: semantic anchor-aligned augmentation for low-resource multimodal IE
Multimodal information extraction spans many tasks but suffers from scarce data in low-resource settings. SAMA proposes semantic anchor-aligned augmentation to unify and improve multimodal information extraction under low-resource conditions.
-
LegalWorld: A Life-Cycle Interactive Environment for Legal Agents
LegalWorld: a life-cycle interactive environment for legal agents
Civil litigation is inherently a life-cycle process where documents connect across stages. LegalWorld provides an interactive environment covering the full litigation life cycle, enabling legal agents to be evaluated and trained within that flow.
-
Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Morpheus: a morphology-aware neural tokenizer and embedder for Turkish
Turkish is agglutinative, with meaning carried by morphemes that subword tokenizers fail to capture. Morpheus is a morphology-aware neural tokenizer and word embedder designed to improve Turkish language processing.
-
LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
LLMs struggle to measure item discrimination in reading assessment
Item discrimination is a fundamental psychometric property that distinguishes students of different proficiency. This study shows that large language models struggle to measure item discrimination in reading comprehension assessment, exposing limits of automated evaluation.
-
TW-LegalBench: Measuring Taiwanese Legal Understanding
TW-LegalBench: measuring Taiwanese legal understanding
Large language models show strong general abilities, but their grasp of region-specific law is under-tested. TW-LegalBench measures Taiwanese legal understanding, evaluating models for jurisdiction-specific legal applications.
-
ForecastBench-Sim: A Simulated-World Forecasting Benchmark
ForecastBench-Sim: a simulated-world forecasting benchmark
Forecasting benchmarks for general-purpose AI usually inherit real-world events, making evaluation hard to control. ForecastBench-Sim introduces a simulated-world forecasting benchmark, enabling controlled assessment of AI forecasting ability.
-
Introducing LifeSciBench
OpenAI launches LifeSciBench for life-science research tasks
OpenAI introduced LifeSciBench, an expert-authored and expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions. It aims to rigorously assess AI's practical usefulness in life-science research.
-
ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
ReproRepo scales reproducibility audits using GitHub repo issues
Reproducing results from papers and code is central to science, but existing benchmarks are hard to scale. ReproRepo leverages GitHub repository issues to evaluate, at scale, how well LLM agents can assist with reproducibility tasks, addressing the manual effort that limits prior reproducibility benchmarks.
-
Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses
Darshana Graph: a parallel commentary corpus for Indian philosophy
Darshana Graph is a corpus of over 125,000 text records spanning classical Hindu, Buddhist and Jain philosophical traditions, drawn from public-domain and openly licensed translations. It supports comparative Indian philosophy through stylometric and exploratory graph analyses.
-
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
Zone of Proximal PPO puts the teacher in prompts, not gradients
Knowledge distillation is brittle for small students, as imitating a large teacher's logits concentrates on its sharpest modes and hurts generalization. The proposed Zone of Proximal Policy Optimization places the teacher in prompts rather than gradients to improve small-student generalization.
-
Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?
Do distilled sets beat coresets? Rethinking dataset distillation
Dataset distillation synthesizes compact training sets for data-centric machine learning. This paper rethinks distillation for classification, asking whether distilled sets actually outperform coresets (real-data subsets) and under what conditions.
-
Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers
Fixed-Point Reasoners: stabilizing deep looped Transformers (FPRM)
The paper addresses the depth-induced signal propagation problem in looped Transformer architectures using pre-norm layers and residual scaling, and proposes FPRM, a looped Transformer model built on these architectural modifications.
-
Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0
Encoding the Al-Mawrid Arabic-English dictionary with LMF and TEI Lex-0
The paper presents a methodology to systematically digitize and encode the legacy print Al-Mawrid Arabic-English dictionary using the ISO Language Markup Framework and TEI Lex-0, addressing a gap in Arabic lexical infrastructure by producing a standardized computational lexicon.
-
RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
RubricsTree: scalable open-ended evaluation of personal health agents
LLM personal health agents using sensor metrics promise to ease healthcare disparities, but an open-ended evaluation bottleneck limits clinical deployment. RubricsTree offers scalable, evolving open-ended evaluation across health memory and medical skills.
-
A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models
Red-team study evaluates Anthropic Fable 5 and Opus 4.8 robustness
A red-team study evaluates the adversarial robustness of two Anthropic frontier models, Fable 5 and Opus 4.8, against several families of automated jailbreak attacks across a multi-category harm taxonomy. Methods and figures are as stated by the paper; third-party verification not confirmed.
-
The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data
SEFD: an open, layout-faithful reconstruction of SEC filings for LLMs
The paper introduces the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown, providing audited financial disclosures as token-efficient pretraining and evaluation data for financial language modeling.
-
DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction
DRFLOW: a deep research benchmark for personalized workflow prediction
The paper introduces DRFLOW, a benchmark for evaluating personalized workflow prediction in deep research systems, focusing on identifying concrete action-step workflows for enterprise tasks rather than generating reports or summaries.
-
Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation
ATT&CK-labeled multi-source security log dataset with SLM evaluation
The work builds a dataset of multi-source cybersecurity logs labeled with MITRE ATT&CK and evaluates small language models (SLMs) on it. Summary is title-based and neutral; details and figures are as presented by the source and not independently verified.