New Model Releases A
-
TokenPilot: Cache-Efficient Context Management for LLM Agents
TokenPilot cuts LLM-agent context costs ~61% while preserving prompt cache
TokenPilot is a dual-granularity context manager for LLM agents that avoids the cache invalidation caused by unconstrained pruning. Ingestion-Aware Compaction stabilizes prompt prefixes while Lifecycle-Aware Eviction offloads segments only when relevance expires, cutting costs by 61% and 56% in benchmarks.
-
ROVE: Unlocking Human Interventions for Humanoid Manipulation via Reinforcement Learning
ROVE: RL that learns humanoid manipulation from imperfect interventions
ROVE is an RL framework for post-training humanoid Vision-Language-Action models from imperfect human interventions. It pairs a human-in-the-loop data pipeline with Optimistic Value Estimation to prioritize high-value behaviors in mixed-quality trajectories, and adds cross-embodiment human videos to make value estimation more robust.
-
From Tokens to Policy: Causal and Interpretable Heterogeneous Treatment Effects Identification
NEXIS identifies causal, interpretable heterogeneous treatment effects
The paper proposes NEXIS (Neural EXposure Interaction Search), a method for causally identifying heterogeneous treatment effects (HTE) in controlled experiments. By leveraging multi-modal pre-treatment measurements and scalable representations, it reframes HTE identification as Markov-blanket discovery over a sufficient, aligned representation, aiming to ease the expressivity-interpretability trade-off.
-
TuneJury: An Open Metric for Improving Music Generation Preference Alignment
TuneJury: an open reward model for text-to-music preference
TuneJury is an open, instance-level pairwise reward model that predicts text-to-music preference scores from a prompt and an audio clip, trained on publicly available human-preference labels. Its calibrated score margins support data filtering, and an 'anchor calibration' step efficiently extends it to generators released after training.
-
ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation
ActiveSAM turns frozen SAM 3 into a training-free open-vocab segmenter
ActiveSAM is a training-free, zero-shot framework that adapts the frozen SAM 3 backbone for open-vocabulary semantic segmentation. It estimates an image-conditioned active class set from a low-resolution presence preview, then decodes only the retained classes at full resolution, improving efficiency over decoding the entire dataset vocabulary per image.
-
A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT
Multi-center benchmark diagnoses abdominal disease from non-contrast CT
The paper introduces a multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation that synthesizes contrast-enhanced findings from single-phase non-contrast CT, aiming to cut contrast risks and radiologist workload. Using paired NCCT-CECT studies from two centers, it benchmarks five deep-learning architectures under a unified protocol.
-
Analytic Torsion and Spectral Gap Capture Persistent-Laplacian Performance
Three invariants capture persistent-Laplacian predictive power compactly
The paper proposes a compact, fixed-length spectral representation that distills the persistent Laplacian into three invariants (Betti numbers, the spectral gap, and analytic torsion), addressing the high dimensionality and varying-length problems of the full eigenspectrum. On benchmarks like MNIST and QM-3D, it matches or exceeds full-spectrum performance while cutting computational overhead.
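To make the idea of a fixed-length spectral summary concrete, here is a minimal sketch that uses the ordinary graph Laplacian of a small graph rather than a persistent Laplacian. The function name `laplacian_summary`, the tolerance, and the toy graph are illustrative assumptions, not the paper's pipeline. The third value is only the log pseudo-determinant of the Laplacian, the kind of quantity torsion-style invariants are built from, not the paper's analytic torsion.

```python
import numpy as np

def laplacian_summary(n_nodes, edges, tol=1e-9):
    """Hypothetical helper: summarize a graph Laplacian spectrum in three numbers.

    Returns (betti0, spectral_gap, log_pseudo_det) no matter how large the graph is.
    """
    L = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        L[i, i] += 1.0
        L[j, j] += 1.0
        L[i, j] -= 1.0
        L[j, i] -= 1.0

    eig = np.linalg.eigvalsh(L)           # real eigenvalues, ascending
    positive = eig[eig >= tol]

    betti0 = int(np.sum(eig < tol))       # zero eigenvalues = connected components
    gap = float(positive.min()) if positive.size else 0.0
    log_pseudo_det = float(np.sum(np.log(positive)))  # log of product of nonzero eigenvalues
    return betti0, gap, log_pseudo_det

# Toy graph (assumption): a 3-node path plus a separate 2-node edge.
betti0, gap, log_pseudo_det = laplacian_summary(5, [(0, 1), (1, 2), (3, 4)])
print(betti0, gap, log_pseudo_det)
```

The output length stays at three values for any graph, which is the property the summary above attributes to the compact representation.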
-
Agent trajectories as programs: fingerprinting and programming coding-agent behavior
Coding agents have behavioral fingerprints identifiable from trajectories
The paper compares agents procedurally rather than by benchmark scores, defining behavioral 'fingerprints.' Across ten agents, a probe over these procedural signatures attributes an unseen trajectory to the correct agent with 85.7% accuracy. Using an emergent, compressive vocabulary induction over SWE-Bench trajectories, it studies the structural distinctness of agent problem-solving.
-
Dynestyx: A Probabilistic Programming Library for Dynamical Systems
Dynestyx: a probabilistic programming library with first-class SSMs
State-space models (SSMs) are the standard formalism for Bayesian treatment of dynamical systems, yet they have been hard to incorporate into modern probabilistic programming languages. The authors introduce dynestyx, a library with first-class SSM support and state-of-the-art state and parameter estimation. Through one interface, users specify arbitrary priors for discrete- or continuous-time systems, run inference over mixed-effect data, and obtain estimates with principled uncertainty.
-
datasette-agent 0.3a0
Simon Willison releases datasette-agent 0.3a0 with approval-gated SQL writes
Simon Willison released datasette-agent 0.3a0, adding a new 'execute_write_sql' tool that requests user approval before writing to a database while respecting user permissions. It extends the approval mechanism introduced in the prior 0.2a0 release, enabling agent-driven write operations under explicit user consent.
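The approval-gated write pattern described above can be sketched with Python's standard `sqlite3` module. This is not datasette-agent's implementation or API. The function signature, the `approve` callback, and the `can_write` flag are assumptions made for illustration, and only the tool name `execute_write_sql` comes from the release note.

```python
import sqlite3
from typing import Callable

def execute_write_sql(
    conn: sqlite3.Connection,
    sql: str,
    approve: Callable[[str], bool],
    can_write: bool = True,
) -> str:
    """Hypothetical sketch: run a write statement only after two checks pass."""
    # 1. Permission check (assumed to be resolved by the caller).
    if not can_write:
        return "denied: user lacks write permission"
    # 2. Explicit approval: show the exact SQL and wait for a yes/no answer.
    if not approve(sql):
        return "rejected: user declined the write"
    # 3. Execute inside a transaction; `with conn` commits or rolls back.
    with conn:
        cur = conn.execute(sql)
    return f"ok: {cur.rowcount} row(s) affected"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (body TEXT)")

approved = execute_write_sql(
    conn, "INSERT INTO notes (body) VALUES ('hello')", approve=lambda sql: True
)
declined = execute_write_sql(
    conn, "DELETE FROM notes", approve=lambda sql: False
)
row_count = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
print(approved, "|", declined, "|", row_count)
```

In an interactive setting the `approve` callback would prompt the user. Passing it in as a parameter keeps the gate itself easy to test.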
-
Scalable Pairwise Kernel Learning with Stochastic Vec Trick
SPaiK: scalable kernel learning for large-scale pairwise problems
Pairwise learning predicts outcomes for pairs of objects. The authors introduce SPaiK, a scalable kernel method for pairwise settings that preserves kernel methods' expressive power while cutting compute and memory. Its key innovation is the stochastic generalized vec trick (sGVT), a stochastic extension of sparse Kronecker-product multiplication for efficient large-scale training. SPaiK is tested on seven drug-target affinity datasets against state-of-the-art methods.
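For readers unfamiliar with the term, the classical (dense, non-stochastic) vec trick is the identity (A ⊗ B) vec(X) = vec(B X Aᵀ), which avoids materializing the Kronecker product. The sketch below checks that identity with NumPy. It is not sGVT or the paper's sparse generalization, and the function name `kron_matvec` and the matrix sizes are illustrative assumptions.

```python
import numpy as np

def kron_matvec(A: np.ndarray, B: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Compute np.kron(A, B) @ x without building the Kronecker product.

    A has shape (m, n), B has shape (p, q), and x has length n * q.
    Uses column-major (Fortran-order) vectorization throughout.
    """
    n = A.shape[1]
    q = B.shape[1]
    X = x.reshape((q, n), order="F")            # x = vec(X)
    return (B @ X @ A.T).reshape(-1, order="F")  # vec(B X A^T)

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((5, 2))
x = rng.standard_normal(4 * 2)

fast = kron_matvec(A, B, x)      # never forms the (15 x 8) product
slow = np.kron(A, B) @ x         # reference: explicit Kronecker product
print(np.allclose(fast, slow))
```

The saving comes from never storing the (m·p) × (n·q) matrix, which for pairwise kernels would be the full pairs-by-pairs kernel matrix.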
-
Sobolev Approximation by Fixed-Size Neural Networks with Arbitrary Accuracy
Fixed-size neural nets achieve arbitrary-accuracy Sobolev approximation
This work studies new activation functions enabling arbitrary-accuracy Sobolev approximation by fixed-size neural networks. It first shows any function in W^{2,inf} can be approximated to arbitrary accuracy in the W^{1,inf} norm via the Elementary Universal Activation Function (EUAF). To extend this to higher-order spaces W^{s,inf}, the authors introduce a smooth activation DUAF_inf and prove arbitrary-accuracy approximation in the W^{s-1,inf} norm, with sigmoidal variants constructed.
-
The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers
Decade-long study of 56,800 AI papers finds rising code/data sharing
Analyzing 56,800 papers from five leading AI conferences over 2014-2024, the study reports that sharing both code and data rose nearly sixfold, from 11% to 64%. Based on documentation practices, it estimates reproducibility increased from 28% to 64%, with gains predating reproducibility checklists.
-
How Much Do Reviews Really Contribute? A Study on Text-Enriched Matrix Factorization for Recommendations
How much do reviews help? A study of text-enriched matrix factorization
Incorporating textual reviews into recommender systems is a popular way to enrich collaborative signals with semantic information, yet how much reviews actually contribute over strong collaborative baselines remains unclear. The authors systematically investigate text's impact on matrix factorization by introducing and comparing three enrichment strategies over a common collaborative backbone, including a learnable gating mechanism that adaptively balances collaborative and textual signals.
-
Phantoms and Disclosures: a Causal Framework for Auditing Synthetic Data
A causal auditing framework to detect synthetic-data privacy disclosures
Generative AI and LLMs have made synthetic data a popular privacy-preserving substitute for sensitive datasets, yet it can memorize and reproduce private training data. The authors propose a customizable empirical framework distinguishing "true disclosures" (direct reproduction of user data) from "phantom disclosures" (incidental generation). Using training/holdout partitioning and statistical hypothesis testing, it checks whether disclosures match strict privacy baselines like zero-learning.
-
A nonparametric two-sample test using a parametric integral probability metric
A nonparametric two-sample test via a single-node parametric IPM
Detecting distributional differences between two independent samples is fundamental in statistics and machine learning. Nonparametric two-sample testing decides whether two samples come from the same distribution without assuming a parametric form. The paper proposes a new test statistic based on an integral probability metric (IPM) defined via a specially designed parametric discriminator class using a single neural-network node, and analyzes the resulting test's properties.
-
Functional Gradient Descent with Adaptive Representations
Functional gradient descent made practical via adaptive representations
Functional optimization is usually solved by tuning parameters of a fixed representation such as a neural network, yielding highly nonconvex losses that hinder training and analysis. Functional gradient descent (FGD), which performs gradient descent directly in function space, offers strong convergence guarantees and clean theory but is hard to implement because functional gradients are infinite-dimensional. The paper proposes a practical FGD using adaptive representations.
-
Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
Binary Tracking: open vision-language models for spatial QA and navigation
The paper addresses spatial question answering for service robots traversing long egocentric routes, returning metric coordinates that downstream navigation can act on for queries like 'where can I find a dry cleaner on the way back home?' Prior approaches rely on closed-source models such as GPT-4o, which robots cannot reliably depend on due to network instability, latency, and deployment cost. The authors propose Binary Tracking, an open-source vision-language approach that can run onboard.
-
Factorized Neural Operators Decompose Dynamic and Persistent Responses
FaNO: factorized neural operators splitting dynamic and persistent responses
Physical systems often combine fast-evolving dynamics with persistent structures, which existing neural operators struggle to capture because a single dominant inductive bias couples distinct responses into one representation. The authors introduce a unified Green's-function framework and propose Factorized Neural Operators (FaNO), decomposing spectral representations into equivariant dynamic responses and invariant persistent responses to better model multiscale physical behavior.
-
Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization
Semantic Flip: synthetic OOD generation for robust refusal in embodied agents
Detecting unanswerable queries is essential for reliable embodied agents, yet vision-language models often answer overconfidently when visual memory cannot support the query, risking misleading users or physically guiding them to arbitrary locations. The paper proposes Semantic Flip, a simple method that generates synthetic out-of-distribution samples to teach embodied VLMs when to respond 'I do not know,' improving robust refusal in embodied question answering and spatial localization.
-
Beyond Weights and Gradients: A Taxonomy of Federated Learning Messages
A formal definition and taxonomy of federated learning messages
Federated learning now exchanges more than weights and gradients, including synthetic data and analytics. This paper gives a formal mathematical definition of a federated message capturing utility and privacy, and a taxonomy of three categories (model structures, statistical summaries, and data-conditioned representations) evaluated on compute, communication, and privacy. A review of 202 papers shows a shift toward diverse messaging.
-
Compositional Reasoning Depth Predicts Clinical AI Failure: Empirical Evidence Consistent with Transformer Compositionality Limits in Electronic Health Record Question Answering
Reasoning hop-count predicts clinical AI failure in EHR QA
An arXiv paper shows that in electronic health record (EHR) question answering, questions needing more inferential hops yield disproportionately more LLM errors. Using a pre-specified hop-count taxonomy, it links this failure structure to theoretical limits on transformer compositionality. Neutral, abstract-based summary.
-
Stack Overflow releases beta of 'Stack Overflow for Agents,' where AI agents share technical information with each other on a message board
Stack Overflow launches 'Stack Overflow for Agents' beta
Stack Overflow has launched a beta of 'Stack Overflow for Agents,' a service where AI agents share technical solutions and other information on an open message board. The move appears aimed at extending its human Q&A knowledge base into information exchange among agents.
-
Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection
Benchmark suite for federated noisy-label medical image segmentation
Federated learning enables collaborative medical image segmentation without centralizing sensitive data, but real-world deployment faces label imperfections like contour disagreement and confused labels. The authors argue existing federated noisy-label learning relies on synthetic noise and simplified settings, and introduce a benchmark suite combining diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to guide method selection.
-
HawkesNest: A Multi-Axis Synthetic Benchmark for Spatiotemporal Pattern Complexity
HawkesNest: a synthetic benchmark for spatiotemporal point process models
Evaluating spatiotemporal point process (STPP) models relies on opaque real datasets where failures are hard to attribute. HawkesNest is a generator-aligned synthetic benchmark built on a multivariate Hawkes backbone, defining four complexity axes with deterministic indices so models can be stress-tested under known structural difficulty.
-
Follow the Latent Roadmap: Navigating Revocable Decoding for Diffusion LLMs with Anchor Tokens
Anchor-token roadmap for revocable decoding in diffusion LLMs
An arXiv paper addresses the speed-quality trade-off and error propagation in revocable decoding for diffusion LLMs (dLLMs). It proposes following a latent 'roadmap' guided by anchor tokens to mitigate failures arising in mixed-quality contexts during parallel generation. Neutral, abstract-based summary.
-
Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts
RDS Fusion: neuro-symbolic gating with compressed CoT for irony detection
An arXiv paper proposes Robust Dual-Signal (RDS) Fusion, a hybrid neuro-symbolic framework that compresses Chain-of-Thought reasoning without supervised fine-tuning to improve zero-shot irony detection. It reports evaluation on a held-out TweetEval test set (N=734). Neutral, abstract-based summary; figures are the authors' claims.
-
ATOM-Bench: A Real-World Benchmark for Atomic Skills and Compositional Generalization in Manipulation Policies
ATOM-Bench evaluates atomic skills and compositional generalization in robots
The paper presents ATOM-Bench, a real-world benchmark for evaluating both atomic skills and compositional generalization in robotic manipulation policies. It factorizes tabletop manipulation into motor and instruction atoms, noting that a policy may succeed on demonstrated tasks yet fail to execute fine-grained skills or recombine them in new structures.
-
Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models
Paper: Expert Tying shares MoE expert params across layers
An arXiv paper introduces Expert Tying, an architectural change that shares expert parameters across consecutive transformer layers while keeping independent layer-wise routing and attention, aiming to cut Mixture-of-Experts memory cost. Summarized neutrally from the abstract.
-
How Much Can We Trust LLM Search Agents? Measuring Endorsement Vulnerability to Web Content Manipulation
Paper: framework measures LLM search-agent endorsement risk
An arXiv paper introduces SearchGEO, a controlled framework for measuring endorsement corruption in LLM-based web-search agents, combining a web-evidence manipulation pipeline and a five-mode attack taxonomy across multiple backends. Summarized neutrally from the abstract.