Safety & Evaluation

  • OpenAI Blog · EN Safety & Evaluation extract
    Predicting model behavior before release by simulating deployment
    OpenAI unveils Deployment Simulation to predict model behavior pre-release
    OpenAI
    OpenAI introduced Deployment Simulation, a method to predict an AI model's behavior before deployment by using real conversation data to simulate responses, aiming to improve safety and evaluation accuracy. The claims are OpenAI's own and not independently verified.
  • arXiv cs.CL (Computation and Language) · EN Multimodal extract
    Context-Aware RL for Agentic and Multimodal LLMs
    ContextRL rewards picking the right context to ground answers
    AI Agents · Retrieval-Augmented Generation (RAG) · Reinforcement Learning · Software Engineering
    ContextRL is a context-aware RL method that improves long-horizon and multimodal reasoning via an indirect objective: instead of supervising only the final answer, it rewards selecting the context that supports a query-answer pair, encouraging fine-grained grounding. Trained on contrastive coding-trajectory and image data, it improves on standard GRPO by +2.2% on average.
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    Geometric Action Model for Robot Policy Learning
    GAM reuses a geometric foundation model for robot control
    Computer Vision · Reinforcement Learning
    The Geometric Action Model (GAM) is a language-conditioned manipulation policy that repurposes a pretrained geometric foundation model as a shared substrate for perception, temporal prediction, and action decoding. It splits the model at an intermediate layer: shallow layers act as an observation encoder, while a causal future predictor forecasts latent tokens from language, proprioception, and action history.
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio
    A benchmark for LLM agents on Nature Portfolio meta-analyses
    AI Agents · Meta · Retrieval-Augmented Generation (RAG)
    This work introduces a benchmark that evaluates LLM agents on meta-analysis articles from Nature Portfolio. The article excerpt was unavailable, so this summary is limited to a neutral description based on the title.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Your Privacy My Cloak: Backdoor Attacks on Differentially Private Federated Learning
    DP can hide backdoors in federated learning, enabling RING attack
    Deep Learning · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    Challenging the belief that differential privacy (DP) makes federated learning robust to backdoors, the authors show empirically that complying with DP masks the statistical signatures defenses rely on, rendering them ineffective. They exploit this with RING, an attack that uses DP to conceal malicious contributions while maximizing impact, acting as a perturbation layer agnostic to the underlying backdoor technique.
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
    DeepRubric: evidence-tree rubrics to boost deep-research agent RL
    AI Agents · Reinforcement Learning
    DeepRubric is a data-construction framework for RL of deep research agents that reverses the usual query-to-rubric flow: starting from a seed topic it builds an evidence tree to decide what an evidence-backed report should be judged on, then synthesizes aligned query-rubric pairs for more reliable reward supervision.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    HAMON: Passive Optical Sequence Mixing for Long-Horizon Forecasting
    HAMON: a passive optical core for long-horizon forecasting
    Inference · Neural Network · Transformer
    HAMON is a passive diffractive optical forecasting core: history is encoded onto an optical aperture and cascaded trainable phase masks with free-space diffraction shape the forecast directly in the output field. Inference is a single passive optical pass with no digital sequence-mixing layer, yet it beats strong digital baselines on ETTm2.
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models
    FusionRS: a large-scale RGB-infrared-text remote sensing dataset
    Computer Vision
    Noting that remote-sensing vision-language models remain RGB-centric, the paper introduces FusionRS, described as the first large-scale RGB-infrared-text dataset for dual-modal learning. It is built by translating public RGB images into infrared-style counterparts, pairing each with conventional and infrared-aware captions.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    TuneJury: An Open Metric for Improving Music Generation Preference Alignment
    TuneJury: an open reward model for text-to-music preference
    Deep Learning · Inference
    TuneJury is an open, instance-level pairwise reward model that predicts text-to-music preference scores from a prompt and an audio clip, trained on publicly available human-preference labels. Its calibrated score margins support data filtering, and an 'anchor calibration' step efficiently extends it to generators released after training.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Bayesian Inference and Decision Audits for Public Archives of Frontier AI Evaluations
    Bayesian audit of public frontier-AI evaluation archives proposed
    Inference · Reinforcement Learning
    The paper treats public AI evaluation archives (e.g., LiveBench, Open LLM Leaderboard v2, LMArena, GAIA, tau-bench) as selective time series rather than terminal leaderboards, framing them as a Bayesian inference problem. It reports that selection-aware frontier models fail synthetic recovery and calibration, while fixed audit gates remain informative.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models
    Measurement study of post-hoc falsification operators for code models
    Fine-tuning · Neural Network · Retrieval-Augmented Generation (RAG)
    Per its title, this paper presents a measurement study of post-hoc 'falsification operators' applied to frozen (non-retrained) small code models, framed around selection without signal and recovery through expression. The raw excerpt was blocked by a content filter, so this summary is based on the title alone and stays deliberately neutral.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation
    ActiveSAM turns frozen SAM 3 into a training-free open-vocab segmenter
    Inference · Neural Network · Retrieval-Augmented Generation (RAG) · Reinforcement Learning
    ActiveSAM is a training-free, zero-shot framework that adapts the frozen SAM 3 backbone for open-vocabulary semantic segmentation. It estimates an image-conditioned active class set from a low-resolution presence preview, then decodes only the retained classes at full resolution, improving efficiency over decoding the entire dataset vocabulary per image.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    When in Doubt, Plan It Out: Committed Small Language Model Deliberation for Reactive Reinforcement Learning
    PACT pairs reactive RL with a deliberative small-LM planner
    Neural Network · Reinforcement Learning
    PACT (Plan, Align, Commit, Think) is a hybrid architecture combining a fast reactive RL policy with a slow, deliberative small language model (SLM) planner. The SLM is invoked asynchronously to generate and verify action plans; once validated as safe and feasible, a plan executes directly without retraining the RL policy. On three FrozenLake settings, a 2B-parameter SLM backbone outperformed all baselines.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT
    Multi-center benchmark diagnoses abdominal disease from non-contrast CT
    Deep Learning · Retrieval-Augmented Generation (RAG)
    The paper introduces a multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation that synthesizes contrast-enhanced findings from single-phase non-contrast CT, aiming to cut contrast risks and radiologist workload. Using paired NCCT-CECT studies from two centers, it benchmarks five deep-learning architectures under a unified protocol.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Analytic Torsion and Spectral Gap Capture Persistent-Laplacian Performance
    Three invariants capture persistent-Laplacian predictive power compactly
    The paper proposes a compact, fixed-length spectral representation that distills the persistent Laplacian into three invariants (Betti numbers, the spectral gap, and analytic torsion), addressing the high dimensionality and variable length of the full eigenspectrum. On benchmarks such as MNIST and QM-3D, it matches or exceeds full-spectrum performance while cutting computational overhead.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Agent trajectories as programs: fingerprinting and programming coding-agent behavior
    Coding agents have behavioral fingerprints identifiable from trajectories
    AI Agents · Neural Network · Software Engineering
    The paper compares agents procedurally rather than by benchmark scores, defining behavioral 'fingerprints.' Across ten agents, a probe over these procedural signatures attributes an unseen trajectory to the correct agent with 85.7% accuracy. Using an emergent, compressive vocabulary induction over SWE-Bench trajectories, it studies the structural distinctness of agent problem-solving.
  • arXiv cs.LG (Machine Learning) · EN Inference & Efficiency extract
    Decoupling Inference from State Updates in Low-Latency Feature Engines via Probabilistic Thinning
    Probabilistic thinning decouples inference from state updates in streams
    Inference · Machine Learning · Neural Network · Retrieval-Augmented Generation (RAG)
    Streaming data systems increasingly underpin ML workflows that maintain many continuously updated aggregations. In production, each event triggers read-modify-write operations to storage, making high-frequency state updates a dominant source of latency, contention, and cost. This work decouples inference from persistence via probabilistic thinning: every event is scored, but durable updates fire only for informative events, using approximate disk-backed statistics with no in-memory control plane.
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Task-Error Residual Learning for Real-Robot Five-Ball Juggling
    Residual learning enables fast, stable real-robot five-ball juggling
    Neural Network · Reinforcement Learning
    In residual learning, which refines existing behavior, sample efficiency hinges on how much information each rollout returns and how efficiently that information is used. A standard scalar RL reward carries less information than the directional task error that defines the task. Using directional task-error supervision and a task-error model that drives sample selection, the system achieves stable three-, four-, and five-ball juggling on Barrett WAM arms, converging from the second attempt with monotonically decreasing error.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Latent space mapping of interpretable structural coordinates from stochastic single-molecule signals
    Contrastive latent mapping of nanopore signals into molecular coordinates
    Nanopores are versatile single-molecule sensors, but stochastic translocation dynamics warp encoded information, limiting their utility. The paper shifts from time-domain analysis to a learned latent-space mapping via a contrastive encoder trained only on simulated signals from a physics-informed model. It maps nanopore signals of engineered DNA barcodes into an interpretable molecular coordinate system that responds to structural parameters but stays invariant to acquisition conditions.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    A nonparametric two-sample test using a parametric integral probability metric
    A nonparametric two-sample test via a single-node parametric IPM
    Machine Learning · Neural Network · Reinforcement Learning
    Detecting distributional differences between two independent samples is fundamental in statistics and machine learning. Nonparametric two-sample testing decides whether two samples come from the same distribution without assuming a parametric form. The paper proposes a new test statistic based on an integral probability metric (IPM) defined via a specially designed parametric discriminator class using a single neural-network node, and analyzes the resulting test's properties.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Scalable Circuit Learning for Interpreting Large Language Models
    CircuitLasso: scalable LLM circuit learning via sparse linear regression
    Retrieval-Augmented Generation (RAG)
    A major mechanistic-interpretability direction learns sparse circuits over LLM components to reveal how they jointly produce behavior, but raw neurons are polysemantic and hard to interpret. Sparse autoencoder (SAE) features help, yet their high dimensionality makes intervention-based circuit learning computationally prohibitive. The paper proposes CircuitLasso, a scalable approach based on sparse linear regression whose structural accuracy matches state-of-the-art intervention methods.
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    A Unified Causal-Origin Taxonomy of Distributional Shifts in Reinforcement Learning
    A unified causal-origin taxonomy of distributional shifts in RL
    Reinforcement Learning
    Reinforcement learning systems degrade when operating conditions diverge from training, reflecting distributional shifts in the data-generating process. These shifts arise between training and evaluation (ID vs. OOD generalization) or in non-stationary settings where dynamics evolve, yet their formal relationship is unclear and prior work emphasizes mitigation over causes. The paper proposes a unified taxonomy of the causal origins of shift within the agent-environment interaction.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    MA-SBI: Misspecification-Aware Simulation-Based Inference via Side-Channel Guidance
    MA-SBI: misspecification-aware inference via side-channel guidance
    Inference · Neural Network · Reinforcement Learning
    Simulation-based inference (SBI) is often hindered by simulator misspecification, the mismatch between simulated and real observations. The recent robust method RoPE uses optimal transport between learned representations but needs ground-truth calibration pairs, which are unavailable in the settings where SBI is needed. Practitioners instead have unstructured side-information such as regime labels, instruction text, and policy bulletins. The authors propose Misspecification-Aware SBI (MA-SBI) to exploit this guidance.
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Greed Is Learned: Visible Incentives as Reward-Hacking Triggers
    Greed Is Learned: RL agents get addicted to visible reward channels
    AI Agents · Deep Learning · Neural Network · Reinforcement Learning
    Deployed agents increasingly act with a reward proxy in view, such as a balance or KPI dashboard. The authors show reinforcement learning can make a policy 'addicted' to this visible self-benefit channel: it chases the displayed payoff across domains, sacrifices the true task, and follows the channel even when the channel is rewritten, while policies that never saw the channel stay honest. They call this 'reward-channel addiction' and study it in MoneyWorld, a synthetic sandbox where it can flip safety alignment.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    IMPACTeen: Intentions, Manipulation, Persuasion, Annotations, and Consequences in Teen Communication Dataset
    IMPACTeen: a teen-context dataset of social-influence scenarios and labels
    Neural Network
    The paper introduces IMPACTeen, a dataset of textual social-influence scenarios in adolescent interpersonal, media, and digital settings. It contains 1,021 texts and 5,100 annotation records labeled from five perspectives (teens, parents, psychologists, communication experts, teachers), built via constrained LLM generation plus two-step human editing, with Polish and English versions. Summarized neutrally from the abstract.
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    LESS Is More: Mutual-Stability Sampling for Diffusion Language Models
    LESS: a training-free adaptive sampler for diffusion language models
    Deep Learning · Inference · Neural Network · Retrieval-Augmented Generation (RAG) · Transformer
    The paper presents LESS, a training-free, model-agnostic adaptive sampler for diffusion LLMs that frames token commitment as an online stopping problem. Its mutual-stability rule unmasks a position only when its top-1 prediction is confident, persists across recent steps, and is distributionally stable (top-K inter-step JS divergence). It is evaluated on Dream-7B, LLaDA-8B, and LLaDA-1.5-8B. Summarized neutrally from the abstract.
  • arXiv cs.CL (Computation and Language) · EN Training & Fine-tuning extract
    Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences
    LOGOS: a general-purpose generative foundation model for natural sciences
    Neural Network
    This report presents LOGOS (Language Of Generative Objects in Science), a generative language model unifying heterogeneous natural-science tasks in one autoregressive framework over a shared scientific grammar. It encodes scientific objects and their spatial contacts/constraints as discrete tokens, casting tasks as next-token prediction without explicit coordinates or geometric networks, and reportedly matches or beats domain-specific baselines. Summarized neutrally from the abstract.
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Factorized Neural Operators Decompose Dynamic and Persistent Responses
    FaNO: factorized neural operators splitting dynamic and persistent responses
    Deep Learning
    Physical systems often combine fast-evolving dynamics with persistent structures, which existing neural operators struggle to capture because a single dominant inductive bias couples distinct responses into one representation. The authors introduce a unified Green's-function framework and propose Factorized Neural Operators (FaNO), decomposing spectral representations into equivariant dynamic responses and invariant persistent responses to better model multiscale physical behavior.
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization
    Semantic Flip: synthetic OOD generation for robust refusal in embodied agents
    AI Agents · Computer Vision · Neural Network · Reinforcement Learning · Software Engineering
    Detecting unanswerable queries is essential for reliable embodied agents, yet vision-language models often answer overconfidently when visual memory cannot support the query, risking misleading users or physically guiding them to arbitrary locations. The paper proposes Semantic Flip, a simple method that generates synthetic out-of-distribution samples to teach embodied VLMs when to respond 'I do not know,' improving robust refusal in embodied question answering and spatial localization.
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Contrastive-Difference CKA Reveals Concept-Specific Structural Alignment Across Language Model Architectures
    CKA_Delta reveals concept-specific alignment across LLM architectures
    Neural Network
    An arXiv paper introduces contrastive-difference CKA (CKA_Delta), a training-free diagnostic, to characterize whether different LLM architectures encode high-level concepts compatibly. It reports a geometric-functional universality dissociation: moderate geometric convergence alongside near-perfect functional transfer. Neutral, abstract-based summary.