Safety & Evaluation

  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
    RTSGameBench: an RTS benchmark for strategic reasoning by VLMs
    AI Agents · Computer Vision · Neural Network · Retrieval-Augmented Generation (RAG)
    Modern vision-language models (VLMs) struggle with strategic reasoning. RTSGameBench uses real-time strategy games to benchmark VLMs on planning and situational judgment.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents
    SenFlow: inter-sentence flow modeling for AI-text detection
    DeepSeek · Retrieval-Augmented Generation (RAG)
    Sentence-level AI-generated text detection is hard in hybrid human-AI documents. SenFlow models inter-sentence flow to capture discontinuities, improving detection of AI-generated sentences within mixed documents.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Graph-ESBMC-PLC: Formal Verification of Graphical PLCopen XML Ladder Diagram Programs Using SMT-Based Model Checking
    Graph-ESBMC-PLC: SMT-based verification of PLCopen ladder diagrams
    Inference · Machine Learning · Neural Network
    PLCopen XML defines encodings for IEC 61131-3 Ladder Diagrams. Graph-ESBMC-PLC applies SMT-based model checking to formally verify graphical PLCopen XML Ladder Diagram programs, supporting correctness checking of industrial control software.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety
    SciRisk-Bench: a risk-dimension-aware benchmark for AI4Science safety
    Neural Network · Reinforcement Learning · Software Engineering
    As LLMs become embedded in scientific research, evaluating their safety matters. SciRisk-Bench is a risk-dimension-aware benchmark that assesses the safety of LLMs in AI-for-science settings across multiple risk categories.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    REVES: REvision and VErification--Augmented Training for Test-Time Scaling
    REVES: revision- and verification-augmented training for test-time scaling
    Inference · Reinforcement Learning · Software Engineering
    Test-time scaling via sequential revision has become a powerful paradigm. REVES proposes revision- and verification-augmented training that strengthens a model's ability to revise and verify its own outputs, making extra test-time compute more effective.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration
    SAGE: stochastic prompt optimization via agent-guided exploration
    Context engineering has become a primary lever for improving AI systems. SAGE is a stochastic prompt optimization method that uses agent-guided exploration to automatically discover effective prompts and improve task performance.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • Stratechery (free posts) · EN Safety & Evaluation extract
    The State of Fable, The Jailbreak Problem, SpaceX Acquires Cursor
    Stratechery on Fable's state, jailbreaks, and SpaceX buying Cursor
    Anthropic
    A Stratechery column by Ben Thompson on three topics: the state of Anthropic's Fable model, the AI jailbreak problem, and SpaceX's acquisition of Cursor. Thompson argues the administration is likely wrong about Fable but that responsibility ultimately lies with Anthropic. Views are the author's; deal specifics are unverified.
    Read original (Stratechery (free posts)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining
    Aligning implied statements for generalizable implicit hate detection
    Speech Processing
    Classifying implicit hate speech is hard because intent is rarely explicit. This work aligns implied statements and applies context-bounded semi-hard negative mining to improve the generalizability of implicit hate speech detection.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Agents & Tool Use extract
    Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning
    Beyond reward engineering: a data recipe for long-context RL
    AI Agents · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    Long-context reasoning is essential for large language models. Rather than relying on reward engineering, this work presents a data recipe for long-context reinforcement learning that drives effective training.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents
    GateMem: benchmarking memory governance in shared-memory agents
    AI Agents · Neural Network
    Memory benchmarks for LLM agents largely assume single-user settings, leaving shared-memory governance untested. GateMem benchmarks memory governance, such as access control and management, in multi-principal shared-memory agents.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports
    Beyond scalar scores: LLM-based metrics for radiology report significance
    Inference · Machine Learning
    Reliable evaluation of generated radiology reports requires strict clinical validity. Going beyond scalar scores, this work explores LLM-based metrics for clinical significance evaluation, assessing report quality in clinically meaningful terms.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    RedactionBench
    RedactionBench: a benchmark for redacting sensitive information
    Neural Network · Reinforcement Learning
    Large language models are increasingly applied to sensitive domains. RedactionBench evaluates how well models redact sensitive information in such settings, with the aim of supporting safer deployment.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction
    SAMA: semantic anchor-aligned augmentation for low-resource multimodal IE
    Machine Learning · Retrieval-Augmented Generation (RAG)
    Multimodal information extraction spans many tasks but suffers from scarce data in low-resource settings. SAMA proposes semantic anchor-aligned augmentation to unify and improve multimodal information extraction under low-resource conditions.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    LegalWorld: A Life-Cycle Interactive Environment for Legal Agents
    LegalWorld: a life-cycle interactive environment for legal agents
    AI Agents · Neural Network · Reinforcement Learning
    Civil litigation is inherently a life-cycle process where documents connect across stages. LegalWorld provides an interactive environment covering the full litigation life cycle, enabling legal agents to be evaluated and trained within that flow.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Infrastructure & Hardware extract
    Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
    Morpheus: a morphology-aware neural tokenizer and embedder for Turkish
    Embeddings · Inference · Retrieval-Augmented Generation (RAG)
    Turkish is agglutinative, with meaning carried by morphemes that subword tokenizers fail to capture. Morpheus is a morphology-aware neural tokenizer and word embedder designed to improve Turkish language processing.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
    LLMs struggle to measure item discrimination in reading assessment
    Software Engineering
    Item discrimination is a fundamental psychometric property that distinguishes students of different proficiency. This study shows that large language models struggle to measure item discrimination in reading comprehension assessment, exposing limits of automated evaluation.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    TW-LegalBench: Measuring Taiwanese Legal Understanding
    TW-LegalBench: measuring Taiwanese legal understanding
    Large language models show strong general abilities, but their grasp of region-specific law is under-tested. TW-LegalBench measures Taiwanese legal understanding, evaluating models for jurisdiction-specific legal applications.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    ForecastBench-Sim: A Simulated-World Forecasting Benchmark
    ForecastBench-Sim: a simulated-world forecasting benchmark
    Reinforcement Learning · Software Engineering
    Forecasting benchmarks for general-purpose AI usually inherit real-world events, making evaluation hard to control. ForecastBench-Sim introduces a simulated-world forecasting benchmark, enabling controlled assessment of AI forecasting ability.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • OpenAI Blog · EN New Model Releases extract
    Introducing LifeSciBench
    OpenAI launches LifeSciBench for life-science research tasks
    Deep Learning · Reinforcement Learning
    OpenAI introduced LifeSciBench, an expert-authored and expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions. It aims to rigorously assess AI's practical usefulness in life-science research.
    Read original (OpenAI Blog) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
    ReproRepo scales reproducibility audits using GitHub repo issues
    AI Agents · GPT · Machine Learning · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    Reproducing results from papers and code is central to science, but existing reproducibility benchmarks rely on manual effort and are hard to scale. ReproRepo leverages GitHub repository issues to evaluate, at scale, how well LLM agents can assist with reproducibility tasks.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses
    Darshana Graph: a parallel commentary corpus for Indian philosophy
    Machine Learning · Neural Network
    Darshana Graph is a corpus of over 125,000 text records spanning classical Hindu, Buddhist and Jain philosophical traditions, drawn from public-domain and openly licensed translations. It supports comparative Indian philosophy through stylometric and exploratory graph analyses.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
    Zone of Proximal Policy Optimization puts the teacher in prompts, not gradients
    Reinforcement Learning
    Knowledge distillation is brittle for small students, as imitating a large teacher's logits concentrates on its sharpest modes and hurts generalization. The proposed Zone of Proximal Policy Optimization places the teacher in prompts rather than gradients to improve small-student generalization.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Rethinking Dataset Distillation for Classification: Do Distilled Sets Outperform Coresets?
    Do distilled sets beat coresets? Rethinking dataset distillation
    Machine Learning · Retrieval-Augmented Generation (RAG)
    Dataset distillation synthesizes compact training sets for data-centric machine learning. This paper rethinks distillation for classification, asking whether distilled sets actually outperform coresets (real-data subsets) and under what conditions.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Fixed-Point Reasoners: Stable and Adaptive Deep Looped Transformers
    Fixed-Point Reasoners: stabilizing deep looped Transformers (FPRM)
    Transformer
    The paper addresses the depth-induced signal propagation problem in looped Transformer architectures using pre-norm layers and residual scaling, and proposes FPRM, a looped Transformer model built on these architectural modifications.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0
    Encoding the Al-Mawrid Arabic-English dictionary with LMF and TEI Lex-0
    The paper presents a methodology to systematically digitize and encode the legacy print Al-Mawrid Arabic-English dictionary using the ISO Language Markup Framework and TEI Lex-0, addressing a gap in Arabic lexical infrastructure by producing a standardized computational lexicon.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills
    RubricsTree: scalable open-ended evaluation of personal health agents
    AI Agents · Gemini · GPT · Meta · Neural Network
    LLM personal health agents using sensor metrics promise to ease healthcare disparities, but an open-ended evaluation bottleneck limits clinical deployment. RubricsTree offers scalable, evolving open-ended evaluation across health memory and medical skills.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models
    Red-team study evaluates Anthropic Fable 5 and Opus 4.8 robustness
    Anthropic · Neural Network
    A red-team study evaluates the adversarial robustness of two Anthropic frontier models, Fable 5 and Opus 4.8, against several families of automated jailbreak attacks across a multi-category harm taxonomy. Methods and figures are as stated by the paper; third-party verification not confirmed.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining Data
    SEFD: an open, layout-faithful reconstruction of SEC filings for LLMs
    Reinforcement Learning
    The paper introduces the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown, providing audited financial disclosures as token-efficient pretraining and evaluation data for financial language modeling.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction
    DRFLOW: a deep research benchmark for personalized workflow prediction
    AI Agents · Retrieval-Augmented Generation (RAG) · Software Engineering
    The paper introduces DRFLOW, a benchmark for evaluating personalized workflow prediction in deep research systems, focusing on identifying concrete action-step workflows for enterprise tasks rather than generating reports or summaries.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation
    ATT&CK-labeled multi-source security log dataset with SLM evaluation
    Fine-tuning · Llama · Machine Learning · Neural Network · Reinforcement Learning from Human Feedback (RLHF)
    The work builds a dataset of multi-source cybersecurity logs labeled with MITRE ATT&CK and evaluates small language models (SLMs) on it. Summary is title-based and neutral; details and figures are as presented by the source and not independently verified.
    Read original (arXiv cs.LG (Machine Learning)) ↗