Safety & Evaluation

  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups
    Extrinsic discourse evaluation of machine translation quality
    This arXiv paper argues that standard machine-translation (MT) metrics assess quality intrinsically and miss the downstream consequences of translation errors. In a static regime, the authors propose an entity-counting task that probes referential consistency and show that high intrinsic MT quality does not reliably predict downstream discourse success. In an interactive regime, they use the goal-oriented multi-agent game Welfare Diplomacy as a probe.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Agents & Tool Use extract
    SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents
    SING: synthetic intention graph for scalable active tool discovery
    AI Agents · Neural Network · Reinforcement Learning
    This arXiv paper addresses tool selection for LLM agents whose harnesses connect to hundreds or thousands of APIs, where exhaustive tool-schema injection is costly and imposes a closed-world assumption. Noting that one-shot retrieval often fails to align isolated tool descriptions with the agent's true intent—especially in long-horizon tasks—the authors propose SING, a Synthetic Intention Graph for scalable, active tool discovery.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?
    Uncertainty estimation fails as a safety net for clinical VQA
    Computer Vision · Retrieval-Augmented Generation (RAG) · Software Engineering
    This arXiv paper tests whether uncertainty estimation (UE) gives clinical vision-language models a reliable trust-or-escalate signal. Benchmarking 8 methods across 12 VLMs on clinical visual question-answering, the authors find UE quality is not intrinsic to the method but tracks model accuracy—degrading exactly where performance is weakest and reliability most needed. Under perturbations that hide the correct option, accuracy collapses while uncertainty barely changes.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage
    BD-LSC: a new benchmark dataset for lexical semantic change detection
    Embeddings · GPT · Machine Learning · Neural Network · Transformer
    This arXiv paper introduces two complementary benchmark datasets for computational lexical semantic change (LSC) detection. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, loss, and stability across three time periods, targeting cases—especially slang versus standard usage—where words simultaneously gain and lose senses, which existing benchmarks struggle to capture.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Funding & M&A extract
    Can LLM Coding Agents Reason About Time Series?
    Can LLM coding agents reason about time series? A benchmark study
    AI Agents · Software Engineering
    This arXiv study tests whether LLM agents can analyze time series data, which is ubiquitous in finance, healthcare, and environmental monitoring. Comparing three approaches (raw numerical data, the LLM as a coding agent, and a combination of the two), the authors find that agents with code access can outperform models processing raw data by up to 10%, though even the best agent still answers roughly 22-34% of questions incorrectly.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    DoubtProbe: Black-Box Jailbreak Defense via Structural Verification and Semantic Auditing
    DoubtProbe: a dual-branch inference-time defense against LLM jailbreaks
    Inference · Llama · Retrieval-Augmented Generation (RAG)
    This arXiv paper proposes DoubtProbe, a dual-branch inference-time framework for black-box jailbreak defense in LLMs. The authors observe that many jailbreaks do not remove the harmful goal but reorganize the information needed to express it, evading safety alignment while remaining recoverable during generation. DoubtProbe combines structural verification and semantic auditing to counter this.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • Stratechery (free posts) · EN Safety & Evaluation extract
    Anthropic’s Safety Superpower
    Stratechery: Anthropic's safety stance licenses its business aims
    Anthropic
    Stratechery argues that Anthropic's conviction in its own safety commitment grants it license to aggressively favor its business interests, and at times to challenge the U.S. government. The essay critically examines how the safety banner shapes the firm's competitive posture.
    Read original (Stratechery (free posts)) ↗
  • Simon Willison's Weblog · EN Developer Tools extract
    Statement on the US government directive to suspend access to Fable 5 and Mythos 5
    Willison on the US directive to suspend Fable 5 and Mythos 5
    Anthropic · Claude
    Simon Willison comments on the US government's national-security export-control directive suspending all foreign-national access to Fable 5 and Mythos 5, calling the move extraordinary and questioning its rationale and impact.
    Read original (Simon Willison's Weblog) ↗
  • Anthropic News · EN Safety & Evaluation extract
    Results from the first Anthropic Public Record
    Anthropic shares first Public Record survey of 52,000 Americans on AI
    Anthropic · Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    Anthropic released first-wave results of its Public Record survey of nearly 52,000 Americans. Curing diseases topped hopes for AI (48%), job loss led fears (64%), and over 70% backed government regulation of AI across party lines.
    Read original (Anthropic News) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
    ClinHallu: a stage-wise hallucination diagnosis benchmark for medical MLLMs
    Fine-tuning · Machine Learning · Software Engineering
    ClinHallu is a benchmark for diagnosing where hallucinations originate in medical multimodal LLM reasoning, decomposing traces into visual recognition, knowledge recall, and reasoning integration. It provides 7,031 validated instances and uses stage-replacement interventions to localize error sources.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    CORA: Analyzing and bridging thinking-answer gap in Multimodal RLVR via Consistency-Oriented Reasoning Alignment
    CORA aligns reasoning and answers in multimodal RLVR
    Computer Vision · Inference · Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Software Engineering
    CORA analyzes the gap between a model's reasoning and its final answer that arises when reinforcement learning with verifiable rewards (RLVR) is extended to multimodal settings. It proposes consistency-oriented reasoning alignment to bridge that gap.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    A Complexity Measure for Active Learning in Multi-group Mean Estimation
    A complexity measure for active multi-group mean estimation
    The paper studies active learning for multi-group mean estimation, framed as a d-armed bandit problem in which the learner minimizes the maximum risk across groups. It introduces a complexity measure that characterizes the difficulty of adaptive budget allocation.
    Read original (arXiv cs.LG (Machine Learning)) ↗
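    To make the max-risk objective concrete, here is a minimal sketch under assumptions that are not taken from the paper: risk is modeled as the mean squared error of each group's sample mean (variance divided by sample count), and the group variances are known in advance. Under those assumptions, splitting the budget in proportion to variance equalizes the per-group risks, which gives a reference point that an adaptive allocation would have to approach without knowing the variances. The function names and numbers are illustrative only.

```python
def oracle_allocation(variances, budget):
    """Split the sampling budget in proportion to each group's variance.

    Assumes the variances are known, which is exactly what an active
    learner does NOT have; this is only a reference allocation.
    """
    total = sum(variances)
    return [budget * v / total for v in variances]


def max_risk(variances, counts):
    """Worst-case MSE of the sample mean across groups (variance / count)."""
    return max(v / n for v, n in zip(variances, counts))


# Hypothetical example: three groups with very different noise levels.
variances = [1.0, 4.0, 0.25]
budget = 2100

oracle = oracle_allocation(variances, budget)
uniform = [budget / len(variances)] * len(variances)

print("oracle counts:", oracle)
print("oracle max risk:", max_risk(variances, oracle))
print("uniform max risk:", max_risk(variances, uniform))
```

    In this toy setup the uniform split spends too few samples on the noisiest group, so its worst-case risk is higher than that of the variance-proportional split.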
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit
    Why generating 'trivia' is provably necessary for valuable mathematics
    Retrieval-Augmented Generation (RAG)
    As AI coupled to proof assistants generates formal mathematics at scale, a gap opens between what a checker verifies and what mathematicians value. Through the lens of language generation in the limit, the paper argues that producing trivial, peripheral statements is provably necessary to generate valuable mathematics.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Optimal Hidden-Target Learning for Online Inventory Optimization on General Convex Sets
    Optimal hidden-target learning for online inventory optimization
    The work casts online inventory optimization as online convex optimization with memory, where carryover makes the feasible set history-dependent. It develops an optimal hidden-target learning method on general convex sets.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    When to Write and When to Suppress: Route-Specialized Dual Adapters for Memory-Assisted Knowledge Editing
    Route-specialized dual adapters for memory-assisted knowledge editing
    Embeddings · Inference · Llama
    This work targets knowledge editing that updates selected facts while preserving nearby behavior in a memory-assisted setting. It proposes route-specialized dual adapters that decide when to write and when to suppress edits.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Graph Structured Combinatorial Semi-Bandit with Nonlinear Reward Associations through Separable Signals
    Graph-structured combinatorial semi-bandits with nonlinear rewards
    Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    The paper addresses combinatorial semi-bandit identification of optimal structures under nonlinear reward associations. It leverages separable signals to reduce sampling and computational cost.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Which Directions Matter? Sparse Design for Affine Robust Optimization
    Sparse design identifies which directions matter in robust optimization
    Machine Learning · Retrieval-Augmented Generation (RAG)
    The work studies which uncertainty directions a model must cover in affine robust optimization defined by a finite dictionary and budget. It proposes a sparse design selecting the directions that matter.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Graph Diffusion Residuals for Control-Function Instrumental Variables
    Graph diffusion residuals for control-function instrumental variables
    Retrieval-Augmented Generation (RAG)
    Control-function instrumental-variable (IV) estimators need first-stage residuals, but high-capacity models can interpolate the treatment and leave too little residual variation. The paper proposes graph diffusion residuals to address this.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Neither Parallel Nor Sequential: How DiffusionGemma Actually Commits Tokens
    How DiffusionGemma actually commits tokens, neither parallel nor sequential
    Deep Learning · Mixture of Experts (MoE)
    Diffusion language models are marketed as parallel decoders, yet their real token-commit order is rarely measured. Instrumenting DiffusionGemma, the paper shows it is neither purely parallel nor sequential.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Training & Fine-tuning extract
    A Comparative Study of Deep Learning Architectures for Multi-Horizon Behavioural Forecasting for Mobile Health
    Comparing deep learning for multi-horizon behavioural forecasting in mHealth
    Deep Learning · Fine-tuning · Machine Learning · Neural Network · Transformer
    Wearables and smartphones generate rich behavioural time series for proactive health interventions, yet systematic comparisons of forecasting architectures are lacking. The paper benchmarks deep learning architectures for multi-horizon behavioural forecasting in mobile health.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    LoSoNA: A Benchmark for Local Social Norm Adaptation in Group Conversations
    LoSoNA benchmarks local social norm adaptation in group chats
    AI Agents · Claude · Gemini · Software Engineering
    Online group chats follow local conversational norms that are rarely stated explicitly. LoSoNA is a benchmark that measures whether LLM-based agents can recognize and adapt to these local social norms.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0
    A fused INT8 GEMM kernel speeds diffusion transformers on consumer GPUs
    Neural Network · Quantization · Transformer
    Post-training INT8 quantization of diffusion transformers is often slower than FP8/NF4 on consumer Ampere GPUs. The paper presents a fused INT8 GEMM kernel for Ideogram 4.0 that realizes native INT8 speedups.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Zero-shot generalization of transformer neural operators to larger domains
    Embeddings · Inference · Machine Learning · Neural Network · Transformer
    The paper studies whether transformer-based neural operators for PDE solution operators can generalize zero-shot to larger spatial domains than seen in training.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Policy & Regulation extract
    Regulating the Machine Contributor: Governance and Policy Alignment in Open Source
    Governance and policy alignment for AI contributors in open source
    AI Agents · Retrieval-Augmented Generation (RAG) · Software Engineering
    AI-assisted development has moved from autocomplete to agents that plan changes, edit files, and submit pull requests with limited supervision, while open source evolves through human processes. The paper examines governance and policy alignment for regulating such machine contributors.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    When Errors Become Narratives: A Longitudinal Taxonomy of Silent Failures in a Production LLM Agent Runtime
    A longitudinal taxonomy of silent failures in a production LLM agent runtime
    Meta
    LLM agents increasingly run as long-lived autonomous runtimes that schedule jobs, call tools, maintain memory, and push results to humans. This longitudinal study of one persistent system presents a taxonomy of its silent failures.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    Sensitivity Shaping for Latent Modeling
    Sensitivity shaping for detecting OOD transitions in dynamics models
    Neural Network
    Generative dynamics models enable planning in challenging robotic systems, but safe deployment requires reliably detecting policy-induced out-of-distribution transitions. The paper proposes sensitivity shaping for latent modeling to improve such OOD detection.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    A Temporal Planning Framework for Disruption Aware Dynamic Route Optimization in Heterogeneous Railway Systems
    A temporal planning framework for disruption-aware railway routing
    Deep Learning · Meta · Neural Network · Reinforcement Learning
    Route optimization is vital for safety and punctuality in railway operations, especially in heterogeneous multi-gauge networks. The paper proposes a temporal planning framework for disruption-aware dynamic route optimization.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    CARE: Controlling LLM-Generated Policies through Auditable Review of Evidence in Scientific Experimentation
    CARE: auditable evidence review to control LLM-generated policies
    Machine Learning
    Giving LLMs direct control over costly, irreversible experiments invites unsafe exploration, while discarding their creativity sacrifices optimization. CARE controls LLM-generated policies through auditable review of evidence in scientific experimentation.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Persuasion Index: A Theory-Guided Framework for Persuasion Analysis
    Persuasion Index: a theory-guided framework for persuasion analysis
    Identifying persuasive rhetorical cues matters for detecting manipulation, AI safety, and health communication. The paper proposes Persuasion Index, a theory-guided framework for persuasion analysis.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    VISTA: View-Consistent Self-Verified Training for GUI Grounding
    VISTA: view-consistent self-verified training for GUI grounding
    Reinforcement Learning · Software Engineering
    Applying GRPO to GUI grounding samples all rollouts from a single screenshot, so rollout groups often end up all-failure or all-success and yield a weak learning signal. VISTA introduces view-consistent, self-verified training to stabilize GUI grounding.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
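    The weak-signal problem described in the VISTA summary can be illustrated with a minimal sketch of group-relative advantages, the quantity GRPO-style training uses to weight rollouts. This is a simplified illustration, not the paper's code: it assumes binary rewards, population standard deviation, and a small `eps` for numerical safety, and the function name is hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one rollout group (GRPO-style).

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps to avoid
    division by zero).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A mixed group gives positive and negative advantages to learn from.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform groups give every rollout an advantage of zero,
# so they contribute no gradient signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_success = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:", mixed)
print("all-failure:", all_fail)
print("all-success:", all_success)
```

    When every rollout in a group receives the same reward, each reward equals the group mean, so all advantages are zero regardless of whether the group succeeded or failed.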