New Model Releases A

  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    TokenPilot: Cache-Efficient Context Management for LLM Agents
    TokenPilot cuts LLM-agent context costs ~61% while preserving prompt cache
    AI Agents Inference Natural Language Processing (NLP) Reinforcement Learning
    TokenPilot is a dual-granularity context manager for LLM agents that avoids the cache invalidation caused by unconstrained pruning. Ingestion-Aware Compaction stabilizes prompt prefixes while Lifecycle-Aware Eviction offloads segments only when relevance expires, cutting costs by 61% and 56% in benchmarks.
    Read original (arXiv cs.CL (Computation and Language)) ↗
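A small sketch of the cache-invalidation problem the TokenPilot summary refers to. Prompt caches typically reuse only the longest unchanged prefix of a prompt, so dropping a segment from the middle of the context forces everything after it to be reprocessed. This is not TokenPilot's algorithm: the `shared_prefix_len` helper and the toy segment list are hypothetical, and each string stands in for a block of tokens.

```python
def shared_prefix_len(old, new):
    """Number of leading segments that are identical in both contexts."""
    n = 0
    for a, b in zip(old, new):
        if a != b:
            break
        n += 1
    return n

# Toy context: each string stands in for a block of tokens (hypothetical data).
context = ["system", "tool_docs", "turn_1", "turn_2", "turn_3"]

# Unconstrained pruning: drop an early turn from the middle of the context.
pruned_middle = ["system", "tool_docs", "turn_2", "turn_3"]

# Append-only growth: the existing prefix is left untouched.
appended = context + ["turn_4"]

# Segments a prefix cache could still reuse in each case.
reusable_after_pruning = shared_prefix_len(context, pruned_middle)
reusable_after_append = shared_prefix_len(context, appended)
```

In this toy case, pruning leaves only the first two segments reusable, while appending keeps all five, which is the kind of prefix stability the summary attributes to Ingestion-Aware Compaction.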
  • arXiv cs.LG (Machine Learning) · EN Multimodal extract
    ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning
    ROVE: RL that learns humanoid manipulation from imperfect interventions
    Computer Vision Machine Learning Reinforcement Learning
    ROVE is an RL framework for post-training humanoid Vision-Language-Action models from imperfect human interventions. It pairs a human-in-the-loop data pipeline with Optimistic Value Estimation to prioritize high-value behaviors in mixed-quality trajectories, and adds cross-embodiment human videos to robustify value estimation.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    From Tokens to Policy: Causal and Interpretable Heterogeneous Treatment Effects Identification
    NEXIS identifies causal, interpretable heterogeneous treatment effects
    The paper proposes NEXIS (Neural EXposure Interaction Search), a method for causally identifying heterogeneous treatment effects (HTE) in controlled experiments. By leveraging multi-modal pre-treatment measurements and scalable representations, it reframes HTE identification as Markov-blanket discovery over a sufficient, aligned representation, aiming to ease the expressivity-interpretability trade-off.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    TuneJury: An Open Metric for Improving Music Generation Preference Alignment
    TuneJury: an open reward model for text-to-music preference
    Deep Learning Inference
    TuneJury is an open, instance-level pairwise reward model that predicts text-to-music preference scores from a prompt and an audio clip, trained on publicly available human-preference labels. Its calibrated score margins support data filtering, and an 'anchor calibration' step efficiently extends it to generators released after training.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation
    ActiveSAM turns frozen SAM 3 into a training-free open-vocab segmenter
    Inference Neural Network Retrieval-Augmented Generation (RAG) Reinforcement Learning
    ActiveSAM is a training-free, zero-shot framework that adapts the frozen SAM 3 backbone for open-vocabulary semantic segmentation. It estimates an image-conditioned active class set from a low-resolution presence preview, then decodes only the retained classes at full resolution, improving efficiency over decoding the entire dataset vocabulary per image.
    Read original (arXiv cs.LG (Machine Learning)) ↗
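The class-pruning idea in the ActiveSAM summary can be sketched roughly as follows. The thresholding rule, the `select_active_classes` helper, and the presence scores are all assumptions made for illustration; the paper's actual selection procedure may differ.

```python
def select_active_classes(presence_scores, threshold=0.5):
    """Keep only the classes whose preview presence score reaches the threshold."""
    return [name for name, score in presence_scores.items() if score >= threshold]

# Hypothetical presence scores from a low-resolution preview pass.
preview = {"road": 0.92, "car": 0.81, "giraffe": 0.03, "sofa": 0.10, "sky": 0.67}

# Only these classes would be decoded at full resolution.
active = select_active_classes(preview)

# Share of the vocabulary that still needs a full-resolution decode.
decode_fraction = len(active) / len(preview)
```

Here three of five classes survive the preview, so the full-resolution decoder would run on 60% of the vocabulary rather than all of it.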
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT
    Multi-center benchmark diagnoses abdominal disease from non-contrast CT
    Deep Learning Retrieval-Augmented Generation (RAG)
    The paper introduces a multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation that synthesizes contrast-enhanced findings from single-phase non-contrast CT, aiming to cut contrast risks and radiologist workload. Using paired NCCT-CECT studies from two centers, it benchmarks five deep-learning architectures under a unified protocol.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Analytic Torsion and Spectral Gap Capture Persistent-Laplacian Performance
    Three invariants capture persistent-Laplacian predictive power compactly
    The paper proposes a compact, fixed-length spectral representation that distills the persistent Laplacian into three invariants (Betti numbers, the spectral gap, and analytic torsion), addressing the high dimensionality and varying length of the full eigenspectrum. On benchmarks such as MNIST and QM-3D, it matches or exceeds full-spectrum performance while cutting computational overhead.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    Agent trajectories as programs: fingerprinting and programming coding-agent behavior
    Coding agents have behavioral fingerprints identifiable from trajectories
    AI Agents Neural Network Software Engineering
    The paper compares agents procedurally rather than by benchmark scores, defining behavioral 'fingerprints.' Across ten agents, a probe over these procedural signatures attributes an unseen trajectory to the correct agent with 85.7% accuracy. Using an emergent, compressive vocabulary induction over SWE-Bench trajectories, it studies the structural distinctness of agent problem-solving.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Dynestyx: A Probabilistic Programming Library for Dynamical Systems
    Dynestyx: a probabilistic programming library with first-class SSMs
    Inference Machine Learning
    State-space models (SSMs) are the standard formalism for Bayesian treatment of dynamical systems, yet they have been hard to incorporate into modern probabilistic programming languages. The authors introduce dynestyx, a library with first-class SSM support and state-of-the-art state and parameter estimation. Through one interface, users specify arbitrary priors for discrete- or continuous-time systems, run inference over mixed-effect data, and obtain estimates with principled uncertainty.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • Simon Willison's Weblog · EN New Model Releases extract
    datasette-agent 0.3a0
    Simon Willison releases datasette-agent 0.3a0 with approval-gated SQL writes
    Neural Network
    Simon Willison released datasette-agent 0.3a0, adding a new 'execute_write_sql' tool that requests user approval before writing to a database while respecting user permissions. It extends the approval mechanism introduced in the prior 0.2a0 release, enabling agent-driven write operations under explicit user consent.
    Read original (Simon Willison's Weblog) ↗
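A minimal sketch of an approval-gated write, using only Python's standard `sqlite3` module. It does not use or reproduce the datasette-agent API: `run_write_with_approval` and the `approve` callback are hypothetical stand-ins for the tool and its user prompt, and permission checks are left out.

```python
import sqlite3

def run_write_with_approval(conn, sql, params, approve):
    """Run a write statement only if the approve callback returns True.

    `approve` stands in for a user-facing approval prompt (hypothetical).
    Returns the number of affected rows, or None if the write was declined.
    """
    if not approve(sql, params):
        return None
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(sql, params)
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

# A declined request never reaches the database.
declined = run_write_with_approval(
    conn, "INSERT INTO notes (body) VALUES (?)", ("draft",), lambda sql, p: False
)

# An approved request is executed and committed.
approved = run_write_with_approval(
    conn, "INSERT INTO notes (body) VALUES (?)", ("final",), lambda sql, p: True
)

row_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
```

Passing the statement and its parameters to the callback lets the approver see exactly what would be written before anything is executed.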
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Scalable Pairwise Kernel Learning with Stochastic Vec Trick
    SPaiK: scalable kernel learning for large-scale pairwise problems
    Neural Network Reinforcement Learning
    Pairwise learning predicts outcomes for pairs of objects. The authors introduce SPaiK, a scalable kernel method for pairwise settings that preserves kernel methods' expressive power while cutting compute and memory. Its key innovation is the stochastic generalized vec trick (sGVT), a stochastic extension of sparse Kronecker-product multiplication for efficient large-scale training. SPaiK is tested on seven drug-target affinity datasets against state-of-the-art methods.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Sobolev Approximation by Fixed-Size Neural Networks with Arbitrary Accuracy
    Fixed-size neural nets achieve arbitrary-accuracy Sobolev approximation
    Neural Network
    This work studies new activation functions enabling arbitrary-accuracy Sobolev approximation by fixed-size neural networks. It first shows any function in W^{2,inf} can be approximated to arbitrary accuracy in the W^{1,inf} norm via the Elementary Universal Activation Function (EUAF). To extend this to higher-order spaces W^{s,inf}, the authors introduce a smooth activation DUAF_inf and prove arbitrary-accuracy approximation in the W^{s-1,inf} norm, with sigmoidal variants constructed.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers
    Decade-long study of 56,800 AI papers finds rising code/data sharing
    Reinforcement Learning
    Analyzing 56,800 papers from five leading AI conferences over 2014-2024, the study reports that sharing both code and data rose nearly sixfold, from 11% to 64%. Based on documentation practices, it estimates reproducibility increased from 28% to 64%, with gains predating reproducibility checklists.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    How Much Do Reviews Really Contribute? A Study on Text-Enriched Matrix Factorization for Recommendations
    How much do reviews help? A study of text-enriched matrix factorization
    Embeddings Reinforcement Learning
    Incorporating textual reviews into recommender systems is a popular way to enrich collaborative signals with semantic information, yet their actual contribution remains unclear against strong collaborative baselines. The authors systematically investigate text's impact on matrix factorization by introducing and comparing three enrichment strategies over a common collaborative backbone, including a learnable gating mechanism that adaptively balances collaborative and textual signals.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Infrastructure & Hardware extract
    Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data
    A causal auditing framework to detect synthetic-data privacy disclosures
    Generative AI Inference Reinforcement Learning
    Generative AI and LLMs have made synthetic data a popular privacy-preserving substitute for sensitive datasets, yet it can memorize and reproduce private training data. The authors propose a customizable empirical framework distinguishing "true disclosures" (direct reproduction of user data) from "phantom disclosures" (incidental generation). Using training/holdout partitioning and statistical hypothesis testing, it checks whether disclosures match strict privacy baselines like zero-learning.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    A nonparametric two-sample test using a parametric integral probability metric
    A nonparametric two-sample test via a single-node parametric IPM
    Machine Learning Neural Network Reinforcement Learning
    Detecting distributional differences between two independent samples is fundamental in statistics and machine learning. Nonparametric two-sample testing decides whether two samples come from the same distribution without assuming a parametric form. The paper proposes a new test statistic based on an integral probability metric (IPM) defined via a specially designed parametric discriminator class using a single neural-network node, and analyzes the resulting test's properties.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Functional Gradient Descent with Adaptive Representations
    Functional gradient descent made practical via adaptive representations
    Computer Vision Deep Learning Neural Network
    Functional optimization is usually solved by tuning the parameters of a fixed representation such as a neural network, which yields highly nonconvex losses that hinder training and analysis. Functional gradient descent (FGD), which performs gradient descent directly in function space, offers strong convergence guarantees and clean theory but is hard to implement because functional gradients are infinite-dimensional. The paper proposes a practical FGD using adaptive representations.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Inference & Efficiency extract
    Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
    Binary Tracking: open vision-language models for spatial QA and navigation
    AI Agents Computer Vision GPT Inference Retrieval-Augmented Generation (RAG)
    The paper addresses spatial question answering for service robots traversing long egocentric routes, returning metric coordinates that downstream navigation can act on for queries like 'where can I find a dry cleaner on the way back home?' Prior approaches rely on closed-source models such as GPT-4o, which robots cannot reliably depend on due to network instability, latency, and deployment cost. The authors propose Binary Tracking, an open-source vision-language approach that can run onboard.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    Factorized Neural Operators Decompose Dynamic and Persistent Responses
    FaNO: factorized neural operators splitting dynamic and persistent responses
    Deep Learning
    Physical systems often combine fast-evolving dynamics with persistent structures, which existing neural operators struggle to capture because a single dominant inductive bias couples distinct responses into one representation. The authors introduce a unified Green's-function framework and propose Factorized Neural Operators (FaNO), decomposing spectral representations into equivariant dynamic responses and invariant persistent responses to better model multiscale physical behavior.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Multimodal extract
    Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization
    Semantic Flip: synthetic OOD generation for robust refusal in embodied agents
    AI Agents Computer Vision Neural Network Reinforcement Learning Software Engineering
    Detecting unanswerable queries is essential for reliable embodied agents, yet vision-language models often answer overconfidently when visual memory cannot support the query, risking misleading users or physically guiding them to arbitrary locations. The paper proposes Semantic Flip, a simple method that generates synthetic out-of-distribution samples to teach embodied VLMs when to respond 'I do not know,' improving robust refusal in embodied question answering and spatial localization.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Developer Tools extract
    Beyond Weights and Gradients: A Taxonomy of Federated Learning Messages
    A formal definition and taxonomy of federated learning messages
    Deep Learning
    Federated learning now exchanges more than weights and gradients, including synthetic data and analytics. This paper gives a formal mathematical definition of a federated message capturing utility and privacy, and a taxonomy of three categories—model structures, statistical summaries, and data-conditioned representations—evaluated on compute, communication, and privacy. A review of 202 papers shows a shift toward diverse messaging.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering
    Reasoning hop-count predicts clinical AI failure in EHR QA
    Claude GPT OpenAI Software Engineering Transformer
    An arXiv paper shows that in electronic health record (EHR) question answering, questions needing more inferential hops yield disproportionately more LLM errors. Using a pre-specified hop-count taxonomy, it links this failure structure to theoretical limits on transformer compositionality.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • Publickey · JA New Model Releases extract
    Stack Overflow releases beta of "Stack Overflow for Agents," where AI agents share technical information with each other on a message board
    Stack Overflow launches 'Stack Overflow for Agents' beta
    AI Agents Machine Learning
    Stack Overflow has launched a beta of 'Stack Overflow for Agents,' a service where AI agents share technical solutions and other information on an open message board. The move appears aimed at extending its human Q&A knowledge base into information exchange among agents.
    Read original (Publickey) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection
    Benchmark suite for federated noisy-label medical image segmentation
    Meta Reinforcement Learning
    Federated learning enables collaborative medical image segmentation without centralizing sensitive data, but real-world deployment faces label imperfections like contour disagreement and confused labels. The authors argue existing federated noisy-label learning relies on synthetic noise and simplified settings, and introduce a benchmark suite combining diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to guide method selection.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.LG (Machine Learning) · EN Safety & Evaluation extract
    HawkesNest: A Multi-Axis Synthetic Benchmark for Spatiotemporal Pattern Complexity
    HawkesNest: a synthetic benchmark for spatiotemporal point process models
    Reinforcement Learning Software Engineering
    Evaluating spatiotemporal point process (STPP) models relies on opaque real datasets where failures are hard to attribute. HawkesNest is a generator-aligned synthetic benchmark built on a multivariate Hawkes backbone, defining four complexity axes with deterministic indices so models can be stress-tested under known structural difficulty.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens
    Anchor-token roadmap for revocable decoding in diffusion LLMs
    Deep Learning Embeddings Inference Retrieval-Augmented Generation (RAG) Speech Processing
    An arXiv paper addresses the speed-quality trade-off and error propagation in revocable decoding for diffusion LLMs (dLLMs). It proposes following a latent 'roadmap' guided by anchor tokens to mitigate failures arising in mixed-quality contexts during parallel generation.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Training & Fine-tuning extract
    Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts
    RDS Fusion: neuro-symbolic gating with compressed CoT for irony detection
    Fine-tuning Transformer
    An arXiv paper proposes Robust Dual-Signal (RDS) Fusion, a hybrid neuro-symbolic framework that compresses Chain-of-Thought reasoning without supervised fine-tuning to improve zero-shot irony detection. The authors report evaluation on a held-out TweetEval test set (N=734).
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Funding & M&A extract
    ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies
    ATOM-Bench evaluates atomic skills and compositional generalization in robots
    Fine-tuning Reinforcement Learning
    The paper presents ATOM-Bench, a real-world benchmark for evaluating both atomic skills and compositional generalization in robotic manipulation policies. It factorizes tabletop manipulation into motor and instruction atoms, noting that a policy may succeed on demonstrated tasks yet fail to execute fine-grained skills or recombine them in new structures.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models
    Expert Tying shares MoE expert parameters across layers
    DeepSeek Inference Mixture of Experts (MoE) Transformer
    An arXiv paper introduces Expert Tying, an architectural change that shares expert parameters across consecutive transformer layers while keeping independent layer-wise routing and attention, aiming to cut Mixture-of-Experts memory cost.
    Read original (arXiv cs.CL (Computation and Language)) ↗
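A back-of-the-envelope sketch of why sharing experts across consecutive layers would reduce memory while routing stays per-layer. The group size, layer counts, parameter sizes, and the bias-free linear router are all assumptions for illustration; none of these numbers come from the paper.

```python
def moe_expert_params(num_layers, num_experts, params_per_expert, tie_group=1):
    """Total expert parameters when every `tie_group` consecutive layers share one expert set."""
    # Ceiling division: number of distinct expert sets across the stack.
    num_expert_sets = -(-num_layers // tie_group)
    return num_expert_sets * num_experts * params_per_expert

def router_params(num_layers, num_experts, d_model):
    """Routers stay per-layer (assumed bias-free linear maps), so tying does not change this."""
    return num_layers * d_model * num_experts

# Hypothetical configuration: 8 layers, 4 experts per layer, 1000 parameters per expert.
untied = moe_expert_params(num_layers=8, num_experts=4, params_per_expert=1000)
tied = moe_expert_params(num_layers=8, num_experts=4, params_per_expert=1000, tie_group=2)
routers = router_params(num_layers=8, num_experts=4, d_model=16)
```

With pairs of layers tied, the expert parameter count in this toy setup halves, while the router count is unchanged because each layer keeps its own routing.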
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation
    SearchGEO framework measures LLM search-agent endorsement risk
    AI Agents Claude Gemini GPT Speech Processing
    An arXiv paper introduces SearchGEO, a controlled framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline and a five-mode attack taxonomy across multiple backends.
    Read original (arXiv cs.CL (Computation and Language)) ↗