Safety & Evaluation A
-
Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering
Reasoning hop-count predicts clinical AI failure in EHR QA
An arXiv paper shows that in electronic health record (EHR) question answering, questions needing more inferential hops yield disproportionately more LLM errors. Using a pre-specified hop-count taxonomy, it links this failure structure to theoretical limits on transformer compositionality. Neutral, abstract-based summary.
-
Upper Bounds on the Generalization Error of Deep Learning Models via Local Robustness and Stability
Tighter deep-learning generalization bounds via local robustness
Robustness-based generalization bounds are often vacuous in practice. The authors trace much of the looseness to the robustness term itself, especially for 0-1 loss, which is usually treated as a global measure. They propose a bound that scales the robustness term by the number of stable and unstable samples across input sub-regions, yielding tighter estimates.
-
Integrated Marketing Attribution: A Bayesian Framework for Privacy-Safe Granular Measurement Anchored in MMM
IMA fuses MMM and Bayesian attribution for privacy-safe measurement
Retail marketing needs granular, campaign-level insight without user-level tracking, yet MMM is too coarse and MTA is unreliable under privacy limits. Integrated Marketing Attribution (IMA) combines MMM with channel-specific Bayesian attribution models, using MMM-informed priors to deliver granular, privacy-safe attribution consistent with MMM.
-
Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection
Benchmark suite for federated noisy-label medical image segmentation
Federated learning enables collaborative medical image segmentation without centralizing sensitive data, but real-world deployment faces label imperfections like contour disagreement and confused labels. The authors argue existing federated noisy-label learning relies on synthetic noise and simplified settings, and introduce a benchmark suite combining diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to guide method selection.
-
HawkesNest: A Multi-Axis Synthetic Benchmark for Spatiotemporal Pattern Complexity
HawkesNest: a synthetic benchmark for spatiotemporal point process models
Evaluating spatiotemporal point process (STPP) models relies on opaque real datasets where failures are hard to attribute. HawkesNest is a generator-aligned synthetic benchmark built on a multivariate Hawkes backbone, defining four complexity axes with deterministic indices so models can be stress-tested under known structural difficulty.
-
Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens
Anchor-token roadmap for revocable decoding in diffusion LLMs
An arXiv paper addresses the speed-quality trade-off and error propagation in revocable decoding for diffusion LLMs (dLLMs). It proposes following a latent 'roadmap' guided by anchor tokens to mitigate failures arising in mixed-quality contexts during parallel generation. Neutral, abstract-based summary.
-
Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts
RDS Fusion: neuro-symbolic gating with compressed CoT for irony detection
An arXiv paper proposes Robust Dual-Signal (RDS) Fusion, a hybrid neuro-symbolic framework that compresses Chain-of-Thought reasoning without supervised fine-tuning to improve zero-shot irony detection. It reports evaluation on a held-out TweetEval test set (N=734). Neutral, abstract-based summary; figures are the authors' claims.
-
Data-Driven Decoding of Russell's Circumplex Model of Affect
Do Transformer embeddings recover Russell's circumplex affect geometry?
An arXiv paper tests whether Transformer latent spaces, trained on text and speech, recover the geometric regularities of Russell's circumplex model of affect. It unifies two complementary experiments to probe emotion representation, addressing the opacity of high-dimensional affective embeddings. Neutral, abstract-based summary.
-
Beyond Models: Reflections on Engineering AI-enabled Systems in a Project-Based Course
Reflections on teaching the engineering of AI-enabled systems in a course
This paper reflects on a project-based master's course at the University of Bremen on engineering AI-enabled systems. It argues that machine learning courses emphasize model development while students lack experience in architectural design, deployment, and monitoring, and reports on the course's design and implementation.
-
Robust Spoofed Speech Detection via Temporal Pyramid Modeling
Temporal Pyramid modeling for robust, generalizable spoofed-speech detection
The paper proposes a Temporal Pyramid Adapter for spoofed speech detection, using parallel temporal convolutions with varying receptive fields to capture multi-scale cues from local artifacts to global prosodic irregularities. It combines self-supervised XLS-R representations with front-end adapters to improve cross-dataset generalization.
-
ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies
ATOM-Bench evaluates atomic skills and compositional generalization in robots
The paper presents ATOM-Bench, a real-world benchmark for evaluating both atomic skills and compositional generalization in robotic manipulation policies. It factorizes tabletop manipulation into motor and instruction atoms, noting that a policy may succeed on demonstrated tasks yet fail to execute fine-grained skills or recombine them in new structures.
-
How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation
Paper: framework measures LLM search-agent endorsement risk
An arXiv paper introduces SearchGEO, a controlled framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline and a five-mode attack taxonomy across multiple backends. Summarized neutrally from the abstract.
-
"They screwed us": Personality clashes sent Anthropic's models offline
Willison flags an Axios report on Anthropic's DC backstory
Developer Simon Willison's blog highlights an Axios piece of behind-the-scenes accounts about Anthropic's models and the US government, citing a Commerce Department meeting and debates over jailbreak resistance, while noting the reporting rests on anonymous sources.
-
Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models
Triggering latent safety awareness to harden large reasoning models
The paper observes that large reasoning models can recognize safety risks when re-presented with the original query alongside their own reasoning trace, a property it calls latent safety awareness. To exploit this without heavy manual annotation, it uses supervised fine-tuning to induce safe tags that trigger safety analysis.
-
LLM-based Visual Code Completion for Aerospace Geometric Design
Paper: LLM visual-programming copilot for aerospace design
An arXiv paper presents an LLM-based visual programming copilot for aerospace geometric design tasks, using a visual-programming variant of the ReAct methodology. Summarized neutrally from the abstract; claims are the authors' and not independently verified.
-
LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control
LabOSBench: a simulated testbed for computer-use agents controlling instruments
The paper proposes LabOSBench, a simulated yet realistic testbed for evaluating computer-use agents on scientific instrument control. It notes that existing benchmarks focus on software tasks in virtual systems, while real instruments require coordinated interface control and feedback-driven parameter tuning that are costly and risky to evaluate directly.
-
Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment
MST-CLIPIQA: decoupling semantics and distortions in AI-image quality
The paper introduces MST-CLIPIQA, a multi-scale two-stream framework for assessing AI-generated image quality. It argues that monolithic vision-language representations entangle semantic understanding with low-level perceptual sensitivity, and instead decouples them using dual CLIP encoders for hierarchical alignment.
-
Decision-Weighted Flow Matching for Contextual Stochastic Optimization
DW-FM reweights flow matching toward decision-sensitive regions
Standard generative scenario models optimize uniform distributional fit rather than downstream decision quality. Decision-Weighted Flow Matching (DW-FM) reweights the velocity-regression objective using decision-sensitive endpoint information, linking downstream regret to pathwise velocity mismatch and providing regret-aligned objectives with guarantees.
-
Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations
Gen-VCoT uses generated RGB visual intermediates for multimodal reasoning
Gen-VCoT replaces text-only chain-of-thought with generated RGB intermediates, staging visual grounding (SAM), depth (Marigold), and semantic reasoning (Qwen2-VL) under an adaptive router. It improves spatial (+25%) and depth (+50%) questions but can hurt simple factual ones; text CoT still wins on CLEVR, suggesting task-dependent representations.
-
Skill-to-LoRA: From Using Skills to Learning Behaviors for Token-Efficient LLM Agents
S2L replaces runtime SKILL.md text with skill-specific LoRA adapters
The paper proposes Skill-to-LoRA (S2L), a behavior-centric representation that replaces runtime skill text (commonly distributed as SKILL.md files) with skill-specific LoRA adapters. Rather than compressing the document, S2L models the behavioral change the skill text induces, aiming at more token-efficient LLM agents.
-
P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs
P3B3: a benchmark for Portuguese variety bias in LLMs
The paper introduces P3B3, an expert-curated benchmark and framework for measuring European versus Brazilian Portuguese variety bias in LLMs. It reports most models lean strongly toward pt-BR and argues for more balanced multilingual representation.
-
Automated jailbreak attack targeting multiple defense strategies
UNIATTACK: a defense-oriented framework for automated black-box LLM jailbreaks
The paper presents UNIATTACK, an adversarial testing framework that systematically builds effective black-box attack prompts on LLMs from a defense-oriented perspective. Unlike static templates or model-specific tuning, it extracts minimal but high-impact features from diverse existing attacks and optimizes them.
-
MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
MyPCBench: benchmarking personal computer-use agents
MyPCBench evaluates computer-use agents as personal assistants on a Linux desktop with 17 simulated web apps and 184 persona-seeded tasks, benchmarking six closed and open-weight models. Reported scores reflect the paper and are not independently verified.
-
Misinformation Propagation in Benign Multi-Agent Systems
Study on misinformation propagation in benign multi-agent systems
The paper injects intent-based misinformation into single- and multi-agent LLM systems and finds it degrades performance and persists through debate, though multi-agent debate can reduce degradation when most agents are uncontaminated. Robustness depends on group composition.
-
Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis
Physics-guided multi-scale framework for bearing fault diagnosis
The paper proposes a progressive, physics-guided multi-scale vibration-processing pipeline for bearing fault diagnosis, using a kinematics-derived descriptor for real-time screening and fault-adaptive segmentation. Reported figures reflect the abstract and are not independently verified.
-
Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents
Paper on evaluator preference collapse in self-evolving agents
An arXiv paper that reportedly examines preference collapse in multimodal evaluators and its cross-modal contagion within self-evolving agent systems. The source excerpt was unavailable (content filter), so this summary is based on the title only; see the original for methods and findings.
-
SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG
SCAR: semantic continuity-aware retrieval for RAG context expansion
Note: the abstract was unavailable, so this is summarized neutrally from the title alone. The paper proposes SCAR, a 'semantic continuity-aware retrieval' method aimed at efficient context expansion in retrieval-augmented generation (RAG). Specific mechanisms and evaluation results cannot be confirmed from the title.
-
FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection
FraudSMSWalker benchmark targets URL-masked SMS-to-webpage fraud
The paper introduces FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. It contains 699 bilingual chains (332 fraudulent, 367 benign) across ten scenarios, withholding raw URLs, hosts, and reputation metadata so models cannot rely on reputation shortcuts, and evaluates nine web agents.
-
Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI
Survey reviews Islamic LLMs and trustworthy, hallucination-resistant AI
This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI, spanning Arabic NLP, Qur'anic question answering, knowledge benchmarks, retrieval-augmented generation, and legal reasoning. It argues that Arabic fluency alone is insufficient, and that reliable systems need curated sources, verification modules, and citation-aware generation.
-
VeriGraph: Towards Verifiable Data-Analytic Agents
VeriGraph: a traceable neuro-symbolic framework for verifiable data agents
This arXiv paper introduces VeriGraph, a traceable neuro-symbolic reasoning framework for verifiable data-analytic agents. The authors note that LLM agents' reliance on linear text trajectories makes reasoning hard to audit, entangling deterministic computations over raw data with semantic deductions over natural-language claims. VeriGraph instead has agents build an explicit heterogeneous evidence directed acyclic graph (DAG) during execution.