Safety & Evaluation

  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering
    Reasoning hop-count predicts clinical AI failure in EHR QA
    Claude GPT OpenAI Software Engineering Transformer
    An arXiv paper reports that in electronic health record (EHR) question answering, questions that need more inferential hops produce disproportionately more LLM errors. Using a pre-specified hop-count taxonomy, it links this failure pattern to theoretical limits on transformer compositionality. Neutral, abstract-based summary.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Upper Bounds on the Generalization Error of Deep Learning Models via Local Robustness and Stability
    Tighter deep-learning generalization bounds via local robustness
    Deep Learning Neural Network Reinforcement Learning
    Robustness-based generalization bounds are often vacuous in practice. The authors trace much of the looseness to the robustness term itself, especially for 0-1 loss, which is usually treated as a global measure. They propose a bound that scales the robustness term by the number of stable and unstable samples across input sub-regions, yielding tighter estimates.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Integrated Marketing Attribution: A Bayesian Framework for Privacy-Safe Granular Measurement Anchored in MMM
    IMA fuses MMM and Bayesian attribution for privacy-safe measurement
    Neural Network Retrieval-Augmented Generation (RAG)
    Retail marketing needs granular, campaign-level insight without user-level tracking, yet MMM is too coarse and MTA is unreliable under privacy limits. Integrated Marketing Attribution (IMA) combines MMM with channel-specific Bayesian attribution models, using MMM-informed priors to deliver granular, privacy-safe attribution consistent with MMM.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection
    Benchmark suite for federated noisy-label medical image segmentation
    Meta Reinforcement Learning
    Federated learning enables collaborative medical image segmentation without centralizing sensitive data, but real-world deployment faces label imperfections like contour disagreement and confused labels. The authors argue existing federated noisy-label learning relies on synthetic noise and simplified settings, and introduce a benchmark suite combining diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to guide method selection.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    HawkesNest: A Multi-Axis Synthetic Benchmark for Spatiotemporal Pattern Complexity
    HawkesNest: a synthetic benchmark for spatiotemporal point process models
    Reinforcement Learning Software Engineering
    Evaluating spatiotemporal point process (STPP) models relies on opaque real datasets where failures are hard to attribute. HawkesNest is a generator-aligned synthetic benchmark built on a multivariate Hawkes backbone, defining four complexity axes with deterministic indices so models can be stress-tested under known structural difficulty.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens
    Anchor-token roadmap for revocable decoding in diffusion LLMs
    Deep Learning Embeddings Inference Retrieval-Augmented Generation (RAG) Speech Processing
    An arXiv paper addresses the speed-quality trade-off and error propagation in revocable decoding for diffusion LLMs (dLLMs). It proposes following a latent 'roadmap' guided by anchor tokens to mitigate failures arising in mixed-quality contexts during parallel generation. Neutral, abstract-based summary.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Training & Fine-tuning extract
    Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts
    RDS Fusion: neuro-symbolic gating with compressed CoT for irony detection
    Fine-tuning Transformer
    An arXiv paper proposes Robust Dual-Signal (RDS) Fusion, a hybrid neuro-symbolic framework that compresses Chain-of-Thought reasoning without supervised fine-tuning to improve zero-shot irony detection. It reports evaluation on a held-out TweetEval test set (N=734). Neutral, abstract-based summary; figures are the authors' claims.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Data-Driven Decoding of Russell's Circumplex Model of Affect
    Do Transformer embeddings recover Russell's circumplex affect geometry?
    Deep Learning Embeddings Speech Processing Transformer
    An arXiv paper tests whether Transformer latent spaces, trained on text and speech, recover the geometric regularities of Russell's circumplex model of affect. It unifies two complementary experiments to probe emotion representation, addressing the opacity of high-dimensional affective embeddings. Neutral, abstract-based summary.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Industry Adoption extract
    Beyond Models: Reflections on Engineering AI-enabled Systems in a Project-Based Course
    Reflections on teaching the engineering of AI-enabled systems in a course
    Algorithms & Theory Machine Learning Neural Network Reinforcement Learning Software Engineering
    This paper reflects on a project-based master's course at the University of Bremen on engineering AI-enabled systems. It argues that machine learning courses emphasize model development while students lack experience in architectural design, deployment, and monitoring, and reports on the course's design and implementation.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Robust Spoofed Speech Detection via Temporal Pyramid Modeling
    Temporal Pyramid modeling for robust, generalizable spoofed-speech detection
    Neural Network Speech Processing
    The paper proposes a Temporal Pyramid Adapter for spoofed speech detection, using parallel temporal convolutions with varying receptive fields to capture multi-scale cues from local artifacts to global prosodic irregularities. It combines self-supervised XLS-R representations with front-end adapters to improve cross-dataset generalization.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Funding & M&A extract
    ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies
    ATOM-Bench evaluates atomic skills and compositional generalization in robots
    Fine-tuning Reinforcement Learning
    The paper presents ATOM-Bench, a real-world benchmark for evaluating both atomic skills and compositional generalization in robotic manipulation policies. It factorizes tabletop manipulation into motor and instruction atoms, noting that a policy may succeed on demonstrated tasks yet fail to execute fine-grained skills or recombine them in new structures.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation
    Paper: framework measures LLM search-agent endorsement risk
    AI Agents Claude Gemini GPT Speech Processing
    An arXiv paper introduces SearchGEO, a controlled framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline and a five-mode attack taxonomy across multiple backends. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • Simon Willison's Weblog · EN Safety & Evaluation extract
    "They screwed us": Personality clashes sent Anthropic's models offline
    Willison flags an Axios report on Anthropic's DC backstory
    Anthropic Claude Deep Learning Reinforcement Learning
    Developer Simon Willison's blog highlights an Axios piece that collects behind-the-scenes accounts of Anthropic's models and the US government. The piece cites a Commerce Department meeting and debates over jailbreak resistance; Willison notes that the reporting rests on anonymous sources.
    Read original (Simon Willison's Weblog) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models
    Triggering latent safety awareness to harden large reasoning models
    DeepSeek Fine-tuning Llama Retrieval-Augmented Generation (RAG) Reinforcement Learning from Human Feedback (RLHF)
    The paper observes that large reasoning models can recognize safety risks when re-presented with the original query alongside their own reasoning trace—a property it calls latent safety awareness. To exploit this without heavy manual annotation, it uses supervised fine-tuning to induce safe tags that trigger safety analysis.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Developer Tools extract
    LLM-based Visual Code Completion for Aerospace Geometric Design
    Paper: LLM visual-programming copilot for aerospace design
    GPT Inference Neural Network
    An arXiv paper presents an LLM-based visual programming copilot for aerospace geometric design tasks, using a visual-programming variant of the ReAct methodology. Summarized neutrally from the abstract; claims are the authors' and not independently verified.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control
    LabOSBench: a simulated testbed for computer-use agents controlling instruments
    AI Agents Computer Vision
    The paper proposes LabOSBench, a simulated yet realistic testbed for evaluating computer-use agents on scientific instrument control. It notes that existing benchmarks focus on software tasks in virtual systems, while real instruments require coordinated interface control and feedback-driven parameter tuning that are costly and risky to evaluate directly.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment
    MST-CLIPIQA: decoupling semantics and distortions in AI-image quality
    Computer Vision Machine Learning Retrieval-Augmented Generation (RAG)
    The paper introduces MST-CLIPIQA, a multi-scale two-stream framework for assessing AI-generated image quality. It argues that monolithic vision-language representations entangle semantic understanding with low-level perceptual sensitivity, and instead decouples them using dual CLIP encoders for hierarchical alignment.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Decision-Weighted Flow Matching for Contextual Stochastic Optimization
    DW-FM reweights flow matching toward decision-sensitive regions
    Computer Vision Neural Network Reinforcement Learning from Human Feedback (RLHF)
    Standard generative scenario models optimize uniform distributional fit rather than downstream decision quality. Decision-Weighted Flow Matching (DW-FM) reweights the velocity-regression objective using decision-sensitive endpoint information, linking downstream regret to pathwise velocity mismatch and providing regret-aligned objectives with guarantees.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations
    Gen-VCoT uses generated RGB visual intermediates for multimodal reasoning
    Machine Learning
    Gen-VCoT replaces text-only chain-of-thought with generated RGB intermediates, staging visual grounding (SAM), depth (Marigold), and semantic reasoning (Qwen2-VL) under an adaptive router. It reports gains on spatial (+25%) and depth (+50%) questions but can hurt performance on simple factual ones; text CoT still wins on CLEVR, suggesting that the best representation is task-dependent.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Training & Fine-tuning extract
    Skill-to-LoRA: From Using Skills to Learning Behaviors for Token-Efficient LLM Agents
    S2L replaces runtime SKILL.md text with skill-specific LoRA adapters
    AI Agents Deep Learning Software Engineering
    The paper proposes Skill-to-LoRA (S2L), a behavior-centric representation that replaces runtime skill text—commonly distributed as SKILL.md files—with skill-specific LoRA adapters. Rather than compressing the document, S2L models the behavioral change the skill text induces, aiming at more token-efficient LLM agents.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs
    P3B3: a benchmark for Portuguese variety bias in LLMs
    The paper introduces P3B3, an expert-curated benchmark and framework for measuring European versus Brazilian Portuguese variety bias in LLMs. It reports most models lean strongly toward pt-BR and argues for more balanced multilingual representation.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Automated jailbreak attack targeting multiple defense strategies
    UNIATTACK: a defense-oriented framework for automated black-box LLM jailbreaks
    Retrieval-Augmented Generation (RAG) Speech Processing
    The paper presents UNIATTACK, an adversarial testing framework that systematically builds effective black-box attack prompts on LLMs from a defense-oriented perspective. Unlike static templates or model-specific tuning, it extracts minimal but high-impact features from diverse existing attacks and optimizes them.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
    MyPCBench: benchmarking personal computer-use agents
    AI Agents Claude Neural Network Reinforcement Learning
    MyPCBench evaluates computer-use agents as personal assistants on a Linux desktop with 17 simulated web apps and 184 persona-seeded tasks, benchmarking six closed and open-weight models. Reported scores reflect the paper and are not independently verified.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Misinformation Propagation in Benign Multi-Agent Systems
    Study on misinformation propagation in benign multi-agent systems
    AI Agents Reinforcement Learning Software Engineering
    The paper injects intent-based misinformation into single- and multi-agent LLM systems and finds it degrades performance and persists through debate, though multi-agent debate can reduce degradation when most agents are uncontaminated. Robustness depends on group composition.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis
    Physics-guided multi-scale framework for bearing fault diagnosis
    Inference Reinforcement Learning
    The paper proposes a progressive, physics-guided multi-scale vibration-processing pipeline for bearing fault diagnosis, using a kinematics-derived descriptor for real-time screening and fault-adaptive segmentation. Reported figures reflect the abstract and are not independently verified.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Funding & M&A extract
    Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents
    Paper on evaluator preference collapse in self-evolving agents
    AI Agents DeepSeek GPT
    This arXiv paper reportedly examines preference collapse in multimodal evaluators and its cross-modal contagion within self-evolving agent systems. The source excerpt was unavailable (content filter), so this summary is based on the title only; see the original for methods and findings.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Funding & M&A extract
    SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG
    SCAR: semantic continuity-aware retrieval for RAG context expansion
    Embeddings Retrieval-Augmented Generation (RAG)
    Note: the abstract was unavailable, so this is summarized neutrally from the title alone. The paper proposes SCAR, a 'semantic continuity-aware retrieval' method aimed at efficient context expansion in retrieval-augmented generation (RAG). Specific mechanisms and evaluation results cannot be confirmed from the title.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection
    FraudSMSWalker benchmark targets URL-masked SMS-to-webpage fraud
    AI Agents Meta Neural Network Reinforcement Learning
    The paper introduces FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. It contains 699 bilingual chains (332 fraudulent, 367 benign) across ten scenarios, withholding raw URLs, hosts, and reputation metadata so models cannot rely on reputation shortcuts, and evaluates nine web agents.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI
    Survey reviews Islamic LLMs and trustworthy, hallucination-resistant AI
    Natural Language Processing (NLP) Retrieval-Augmented Generation (RAG) Reinforcement Learning Software Engineering
    This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI, spanning Arabic NLP, Qur'anic question answering, knowledge benchmarks, retrieval-augmented generation, and legal reasoning. It argues that Arabic fluency alone is insufficient, and that reliable systems need curated sources, verification modules, and citation-aware generation.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    VeriGraph: Towards Verifiable Data-Analytic Agents
    VeriGraph: a traceable neuro-symbolic framework for verifiable data agents
    AI Agents Neural Network Software Engineering
    This arXiv paper introduces VeriGraph, a traceable neuro-symbolic reasoning framework for verifiable data-analytic agents. The authors note that LLM agents' reliance on linear text trajectories makes reasoning hard to audit, entangling deterministic computations over raw data with semantic deductions over natural-language claims. VeriGraph instead has agents build an explicit heterogeneous evidence directed acyclic graph (DAG) during execution.
    Read original (arXiv cs.CL (Computation and Language)) ↗