Safety & Evaluation A
-
X+Slides: Benchmarking Audience-Conditioned Slide Generation
X+Slides benchmarks audience-conditioned slide generation
Automatically generating slide decks from documents is an important LLM application, but existing benchmarks mainly assess completeness and technical depth. X+Slides introduces a benchmark for audience-conditioned slide generation, evaluating how well decks adapt to their intended audience.
-
Acceleration of an algebraic multigrid pressure solver using graph neural networks
Graph neural networks accelerate an algebraic multigrid pressure solver
Solving the pressure-Poisson equation is the main bottleneck in incompressible unstructured flow solvers, as traditional linear solvers are sensitive to mesh irregularities. This work uses graph neural networks to accelerate an algebraic multigrid pressure solver, improving solve efficiency.
-
TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
TxBench-PP evaluates AI agents on preclinical pharmacology
AI agents promise to accelerate drug discovery by compressing interpretation and decision loops, but deployment needs trusted evaluation on realistic tasks. TxBench-PP is a benchmark analyzing AI agent performance on small-molecule preclinical pharmacology, assessing their practical reliability.
-
RECOM: A Validity Discrimination Tradeoff in Automatic Metrics for Open Ended Reddit Question Answering
RECOM analyzes validity vs discrimination in automatic metrics
Automatic metrics are the default for evaluating LLM-generated text, yet a metric is quietly asked to do two jobs: tell genuine content alignment from surface coincidence (validity) and discriminate quality. Using open-ended Reddit QA, RECOM analyzes this validity–discrimination trade-off.
-
When AUC Misleads: Polarization-Aware Evaluation of Deepfake Detectors under Domain Shift
Polarization-aware evaluation of deepfake detectors under domain shift
Advances in diffusion models and face-swapping enable highly realistic deepfakes and real-world harm. This work shows AUC can mislead when evaluating detectors under domain shift, and proposes a polarization-aware evaluation that better reflects deepfake detector performance across domains.
-
Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis
Hybrid LLM-ML system uses LLMs as interfaces for appendicitis
LLMs can broaden clinical decision support by interpreting free-text documentation, but using them directly as diagnostic engines is limited by sensitivity to prompts and information order. This work treats LLMs as interfaces, not oracles, pairing them with ML for pediatric appendicitis diagnosis.
-
Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight
Hardware/vision-in-the-loop validation of monocular UAV pose estimation
Autonomous UAV operations on ships need reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents hardware- and vision-in-the-loop validation of deep monocular pose estimation for autonomous maritime UAV flight.
-
A Clinician-Centered Pipeline for Annotation and Evaluation in Ultrasound AI Studies
A clinician-centered annotation and evaluation pipeline for ultrasound AI
Clinician-centered evaluation is critical for validating medical AI, especially in ultrasound imaging where quantitative metrics do not always capture clinical usability. This work proposes a clinician-centered pipeline for annotation and evaluation in ultrasound AI studies to ground validation clinically.
-
Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection
Pretraining-stage alignment via regular safety reflection
To achieve deeper safety alignment for LLMs, recent work pushes safety interventions earlier into pretraining, mainly by filtering unsafe data or rewriting it into safe forms. Going beyond safe data, this work embeds regular safety reflection during pretraining to instill more fundamental alignment.
-
IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages
IndicContextEval: audio-LLM context use across 8 Indic languages
Audio LLMs can condition speech recognition on textual prompts such as domain descriptions or entity lists, but whether they truly use this context is unclear. IndicContextEval is a benchmark evaluating context utilisation in audio large language models across eight Indic languages.
-
AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces
AdsMind: physics-grounded multi-agent search for adsorption configs
Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, but exhaustive ab initio exploration is prohibitive. AdsMind is a physics-grounded multi-agent system that self-corrects to efficiently discover adsorption configurations.
-
Complementary Attention Head Pruning for Efficient Transformers
Complementary attention-head pruning for efficient Transformers
Transformers' success stems from architectural scaling, which inflates parameter counts and hinders deployment in resource-constrained settings. This work proposes complementary attention head pruning, removing heads so that retained ones stay complementary, preserving accuracy while improving efficiency.
-
OpenAnt: LLM-Powered Vulnerability Discovery Through Code Decomposition, Adversarial Verification, and Dynamic Testing
OpenAnt: LLM-powered vulnerability discovery via code decomposition
Automated vulnerability discovery in large codebases is hard: static analysis yields high false positives while dynamic methods like fuzzing lack coverage. OpenAnt is an LLM-powered approach combining code decomposition, adversarial verification, and dynamic testing to surface real vulnerabilities.
-
OrthoReg: Orthogonal Regularization for Hybrid Symbolic-Neural Dynamical Systems
OrthoReg: orthogonal regularization for symbolic-neural dynamical systems
Dynamical systems are fundamental to modeling the natural world, but modeling them trades off interpretable hand-specified mechanistic models against flexible yet opaque neural ones. OrthoReg introduces orthogonal regularization to disentangle symbolic and neural components in hybrid dynamical systems.
-
Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction
A formal theory of human-AI coevolution and social intelligence
Conversational AI has advanced in language generation, personalization, and long-context interaction, but most methods model social behavior through isolated components. This work offers a formal theory of human-AI coevolution dynamics, explaining how social intelligence emerges through long-term interaction.
-
Urdu Katib Handwritten Dataset: A Historical Document Dataset for Offline Urdu Handwritten Text Recognition with CRNN-Based Baseline Evaluation
Urdu Katib: a historical dataset for offline Urdu handwriting recognition
Automatic handwritten text recognition is challenging, especially for cursive scripts. This work introduces the Urdu Katib Handwritten Dataset, a historical-document dataset for offline Urdu handwritten text recognition, providing resources to advance recognition research on cursive scripts.
-
INDEQS: Informed Neural controlled Differential EQuationS
INDEQS: informed neural controlled differential equations for forecasting
Neural Controlled Differential Equations provide a powerful continuous-time framework for time-series forecasting, but standard graph-based extensions struggle to learn spatial structure. INDEQS introduces informed neural controlled differential equations to better capture structure and improve forecasting.
-
Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation
Decoupling perception and reasoning for shortcut-resilient self-distillation
On-policy self-distillation trains a model on its own rollouts, using a frozen copy to give dense token-level targets conditioned on a reference. This work decouples perception from reasoning (seeing before reasoning) to make multimodal on-policy self-distillation resilient to shortcut learning.
-
ProductConsistency: Improving Product Identity Preservation in Instruction-Based Image Editing via SFT and RL
ProductConsistency preserves product identity in instruction-based editing
Instruction-based image editing enables complex edits from natural language, but in product-centric scenarios preserving product features and branding is hard. ProductConsistency uses supervised fine-tuning and reinforcement learning to improve product identity preservation during instruction-based editing.
-
Context-Aware Optimization of Follow-Up Intervals for Type 2 Diabetes Care Using Markov Decision Processes
Optimizing type-2 diabetes follow-up intervals with MDPs
Chronic disease management relies on regular patient-provider interactions to track progression and control. For Type 2 Diabetes, guidelines prescribe fixed follow-up intervals. This work uses Markov decision processes to optimize follow-up intervals in a context-aware way, tailoring scheduling to each patient.
-
Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning
Auditing LLM-as-judge bias via positive-unlabeled learning
LLMs are increasingly used as judges for scalable evaluation, yet LLM-as-a-Judge systems show systematic biases decoupled from semantic quality, notably verbosity bias. This work uses positive-unlabeled learning to quantify and audit LLM evaluation, helping detect and correct such biases.
-
Adaptive Speech-to-Spike Encoding for Spiking Neural Networks
Adaptive speech-to-spike encoding for spiking neural networks
The mismatch between continuous acoustic signals and discrete event-driven processing is a fundamental bottleneck for neuromorphic speech processing. Rather than fixed spike encoders, this work proposes adaptive speech-to-spike encoding for spiking neural networks, improving downstream performance.
-
Sumi: Open Uniform Diffusion Language Model from Scratch
Sumi: an open uniform diffusion language model from scratch
Diffusion models are a promising alternative to autoregressive ones, and uniform diffusion language models (UDLMs) let any token be updated at any step. This work releases Sumi, an open uniform diffusion language model built from scratch, supporting research and reproducibility in diffusion LMs.
-
Enhancing Multilingual Reasoning via Steerable Model Merging
Enhancing multilingual reasoning via steerable model merging
Model merging effectively composes the capabilities of a multilingual model and a reasoning model, achieving promising generalization on multilingual reasoning by aligning their feature spaces. This work introduces steerable model merging to control the composition and further boost multilingual reasoning.
-
TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction
TRAP benchmarks agents on task completion and privacy resistance
Agents are increasingly deployed in document-intensive workflows where sensitive private information is routine input; for example, booking a flight requires passport numbers. TRAP is a benchmark evaluating agents on both task completion and resistance to active privacy-extraction attempts.
-
G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment
G-IdiomAlign: a gloss-pivoted cross-lingual idiom benchmark
Idioms resist literal cross-lingual mapping because they are non-compositional. G-IdiomAlign anchors each idiom to an English Wiktionary gloss and adds a high-confidence reference alignment set. Two protocols (multiple-choice idiom equivalence and gloss-contrastive generation) isolate the effect of explicit glosses.
-
Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering
Direct timestep embedding and contrastive alignment for time-series QA
Time-series question answering casts analysis as natural-language QA. Instead of tokenizing the series, this work embeds timesteps directly and uses contrastive alignment to match language representations, avoiding the information loss of tokenization.
-
Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment
Mitigating scoring errors in speech-based dementia assessment
Early detection of cognitive impairment relies on neuropsychological tests whose scoring is subjective. This work mitigates scoring errors and compensates for nonverbal subtests in speech-based dementia assessment, aiming for more objective and reliable screening.
-
CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System
CAPRA: a multi-agent LLM system for software architecture feedback
Automated assessment in software engineering education has advanced, but giving quality feedback on architecture deliverables remains hard. CAPRA is a multi-agent LLM system that scales detailed feedback on software architecture deliverables.
-
A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI
A controlled benchmark of quantum-latent GAN augmentation for brain MRI
Medical image classification is constrained by limited labeled data. This paper builds a controlled benchmark evaluating quantum-latent GAN data augmentation for brain MRI classification, measuring its effect under standardized conditions.