Safety & Evaluation

  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    X+Slides: Benchmarking Audience-Conditioned Slide Generation
    X+Slides benchmarks audience-conditioned slide generation
    Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    Automatically generating slide decks from documents is an important LLM application, but existing benchmarks mainly assess completeness and technical depth. X+Slides introduces a benchmark for audience-conditioned slide generation, evaluating how well decks adapt to their intended audience.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Acceleration of an algebraic multigrid pressure solver using graph neural networks
    Graph neural networks accelerate an algebraic multigrid pressure solver
    Neural Network
    Solving the pressure-Poisson equation is the main bottleneck in incompressible unstructured flow solvers, as traditional linear solvers are sensitive to mesh irregularities. This work uses graph neural networks to accelerate an algebraic multigrid pressure solver, improving solve efficiency.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
    TxBench-PP evaluates AI agents on preclinical pharmacology
    AI Agents · Claude · GPT · Reinforcement Learning from Human Feedback (RLHF) · Software Engineering
    AI agents promise to accelerate drug discovery by compressing interpretation and decision loops, but deployment needs trusted evaluation on realistic tasks. TxBench-PP is a benchmark analyzing AI agent performance on small-molecule preclinical pharmacology, assessing their practical reliability.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering
    RECOM analyzes validity vs discrimination in automatic metrics
    Neural Network · Software Engineering
    Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity) and discriminate quality. Using open-ended Reddit QA, RECOM analyzes this validity–discrimination trade-off.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift
    Polarization-aware evaluation of deepfake detectors under domain shift
    Generative AI · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    Advances in diffusion models and face-swapping enable highly realistic deepfakes and real-world harm. This work shows AUC can mislead when evaluating detectors under domain shift, and proposes a polarization-aware evaluation that better reflects deepfake detector performance across domains.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis
    Hybrid LLM-ML system uses LLMs as interfaces for appendicitis
    Machine Learning
    LLMs can broaden clinical decision support by interpreting free-text documentation, but using them directly as diagnostic engines is limited by sensitivity to prompts and information order. This work treats LLMs as interfaces, not oracles, pairing them with ML for pediatric appendicitis diagnosis.
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight
    Hardware/vision-in-the-loop validation of monocular UAV pose estimation
    Transformer
    Autonomous UAV operations on ships need reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents hardware- and vision-in-the-loop validation of deep monocular pose estimation for autonomous maritime UAV flight.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies
    A clinician-centered annotation and evaluation pipeline for ultrasound AI
    Health & Bio · Neural Network
    Clinician-centered evaluation is critical for validating medical AI, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. This work proposes a clinician-centered pipeline for annotation and evaluation in ultrasound AI studies to ground validation clinically.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection
    Pretraining-stage alignment via regular safety reflection
    Fine-tuning · Inference · Reinforcement Learning
    To achieve deeper safety alignment for LLMs, recent work pushes safety interventions earlier into pretraining, mainly by filtering unsafe data or rewriting it into safe forms. Going beyond safe data, this work embeds regular safety reflection during pretraining to instill more fundamental alignment.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages
    IndicContextEval: audio-LLM context use across 8 Indic languages
    Meta · Neural Network · Software Engineering · Speech Processing
    Audio LLMs can condition speech recognition on textual prompts such as domain descriptions or entity lists, but whether they truly use this context is unclear. IndicContextEval is a benchmark evaluating context utilisation in audio large language models across eight Indic languages.
  • arXiv cs.AI (Artificial Intelligence) · EN Training & Fine-tuning extract
    AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces
    AdsMind: physics-grounded multi-agent search for adsorption configs
    AI Agents · Machine Intelligence · Machine Learning
    Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, but exhaustive ab initio exploration is prohibitive. AdsMind is a physics-grounded multi-agent system that self-corrects to efficiently discover adsorption configurations.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Complementary Attention Head Pruning for Efficient Transformers
    Complementary attention-head pruning for efficient Transformers
    Natural Language Processing (NLP) · Reinforcement Learning · Transformer
    Transformers' success stems from architectural scaling, which inflates parameter counts and hinders deployment in resource-constrained settings. This work proposes complementary attention head pruning, removing heads so that retained ones stay complementary, preserving accuracy while improving efficiency.
  • arXiv cs.LG (Machine Learning) · EN Developer Tools extract
    OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing
    OpenAnt: LLM-powered vulnerability discovery via code decomposition
    Automated vulnerability discovery in large codebases is hard: static analysis yields high false positives while dynamic methods like fuzzing lack coverage. OpenAnt is an LLM-powered approach combining code decomposition, adversarial verification, and dynamic testing to surface real vulnerabilities.
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems
    OrthoReg: orthogonal regularization for symbolic-neural dynamical systems
    Neural Network · Reinforcement Learning
    Dynamical systems are fundamental to modeling the natural world, but modeling them trades off interpretable hand-specified mechanistic models against flexible yet opaque neural ones. OrthoReg introduces orthogonal regularization to disentangle symbolic and neural components in hybrid dynamical systems.
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction
    A formal theory of human-AI coevolution and social intelligence
    Conversational AI has advanced in language generation, personalization, and long-context interaction, but most methods model social behavior through isolated components. This work offers a formal theory of human-AI coevolution dynamics, explaining how social intelligence emerges through long-term interaction.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation
    Urdu Katib: a historical dataset for offline Urdu handwriting recognition
    Neural Network · Retrieval-Augmented Generation (RAG)
    Automatic handwritten text recognition is challenging, especially for cursive scripts. This work introduces the Urdu Katib Handwritten Dataset, a historical-document dataset for offline Urdu handwritten text recognition, and reports a CRNN-based baseline evaluation on it.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    INDEQS: Informed Neural controlled Differential EQuationS
    INDEQS: informed neural controlled differential equations for forecasting
    Neural Network · Reinforcement Learning
    Neural Controlled Differential Equations provide a powerful continuous-time framework for time-series forecasting, but standard graph-based extensions struggle to learn spatial structure. INDEQS introduces informed neural controlled differential equations to better capture that spatial structure and improve forecasting.
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
    Decoupling perception and reasoning for shortcut-resilient self-distillation
    Computer Vision · Machine Learning · Software Engineering
    On-policy self-distillation trains a model on its own rollouts, using a frozen copy to give dense token-level targets conditioned on a reference. This work decouples perception from reasoning—seeing before reasoning—to make multimodal on-policy self-distillation resilient to shortcut learning.
  • arXiv cs.AI (Artificial Intelligence) · EN Training & Fine-tuning extract
    ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL
    ProductConsistency preserves product identity in instruction-based editing
    Fine-tuning · Machine Learning · Reinforcement Learning
    Instruction-based image editing enables complex edits from natural language, but in product-centric scenarios preserving product features and branding is hard. ProductConsistency uses supervised fine-tuning and reinforcement learning to improve product identity preservation during instruction-based editing.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Context-Aware Optimization of Follow-Up Intervals for Type 2 Diabetes Care Using Markov Decision Processes
    Optimizing type-2 diabetes follow-up intervals with MDPs
    Reinforcement Learning
    Chronic disease management relies on regular patient-provider interactions to track progression and control. For Type 2 Diabetes, guidelines prescribe fixed follow-up intervals. This work uses Markov decision processes to optimize follow-up intervals in a context-aware way, tailoring scheduling to each patient.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning
    Auditing LLM-as-judge bias via positive-unlabeled learning
    Embeddings
    LLMs are increasingly used as judges for scalable evaluation, yet LLM-as-a-Judge systems show systematic biases decoupled from semantic quality, notably verbosity bias. This work uses positive-unlabeled learning to quantify and audit LLM evaluation, helping detect and correct such biases.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Adaptive Speech-to-Spike Encoding for Spiking Neural Networks
    Adaptive speech-to-spike encoding for spiking neural networks
    Deep Learning · Google · Neural Network · Speech Processing
    The mismatch between continuous acoustic signals and discrete event-driven processing is a fundamental bottleneck for neuromorphic speech processing. Rather than fixed spike encoders, this work proposes adaptive speech-to-spike encoding for spiking neural networks, improving downstream performance.
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Sumi: Open Uniform Diffusion Language Model from Scratch
    Sumi: an open uniform diffusion language model from scratch
    Deep Learning · Reinforcement Learning
    Diffusion models are a promising alternative to autoregressive ones, and uniform diffusion language models (UDLMs) let any token be updated at any step. This work releases Sumi, an open uniform diffusion language model built from scratch, supporting research and reproducibility in diffusion LMs.
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Enhancing Multilingual Reasoning via Steerable Model Merging
    Enhancing multilingual reasoning via steerable model merging
    Neural Network
    Model merging effectively composes the capabilities of a multilingual model and a reasoning model, achieving promising generalization on multilingual reasoning by aligning their feature spaces. This work introduces steerable model merging to control the composition and further boost multilingual reasoning.
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction
    TRAP benchmarks agents on task completion and privacy resistance
    AI Agents · Neural Network
    Agents are increasingly deployed in document-intensive workflows where sensitive private information is routine input—e.g., booking a flight needs passport numbers. TRAP is a benchmark evaluating agents on both task completion and resistance to active privacy-extraction attempts.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment
    G-IdiomAlign: a gloss-pivoted cross-lingual idiom benchmark
    Embeddings
    Idioms resist literal cross-lingual mapping because they are non-compositional. G-IdiomAlign anchors each idiom to an English Wiktionary gloss and adds a high-confidence reference alignment set. Two protocols (multiple-choice idiom equivalence and gloss-contrastive generation) isolate the effect of explicit glosses.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering
    Direct timestep embedding and contrastive alignment for time-series QA
    Embeddings · Machine Learning · Retrieval-Augmented Generation (RAG) · Software Engineering
    Time-series question answering casts analysis as natural-language QA. Instead of tokenizing the series, this work embeds timesteps directly and uses contrastive alignment to match language representations, avoiding the information loss of tokenization.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment
    Mitigating scoring errors in speech-based dementia assessment
    Embeddings · Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Speech Processing
    Early detection of cognitive impairment relies on neuropsychological tests whose scoring is subjective. This work mitigates scoring errors and compensates for nonverbal subtests in speech-based dementia assessment, aiming for more objective and reliable screening.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System
    CAPRA: a multi-agent LLM system for software architecture feedback
    AI Agents · GPT · Machine Learning · Software Engineering
    Automated assessment in software engineering education has advanced, but giving quality feedback on architecture deliverables remains hard. CAPRA is a multi-agent LLM system that scales detailed feedback on software architecture deliverables.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI
    A controlled benchmark of quantum-latent GAN augmentation for brain MRI
    Medical image classification is constrained by limited labeled data. This paper builds a controlled benchmark evaluating quantum-latent GAN data augmentation for brain MRI classification, measuring its effect under standardized conditions.