New Model Releases A
Showing 121–150 of 261
-
RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical SkillsRubricsTree: scalable open-ended evaluation of personal health agentsLLM personal health agents using sensor metrics promise to ease healthcare disparities, but an open-ended evaluation bottleneck limits clinical deployment. RubricsTree offers scalable, evolving open-ended evaluation across health memory and medical skills.
-
Learning from the Self-future: On-policy Self-distillation for dLLMsOn-policy self-distillation explored for diffusion LLMsOn-policy self-distillation (OPSD) helps post-training of LLMs but is unexplored for diffusion LLMs (dLLMs). Existing OPSD methods are autoregressive-centric, injecting privileged information via left-to-right prefix conditioning; this work studies self-distillation suited to dLLMs.
-
The Stanford EDGAR Filings Dataset: Reconstructing U.S. Corporate and Financial Disclosures into Layout-Faithful and Token-Efficient Pretraining DataSEFD: an open, layout-faithful reconstruction of SEC filings for LLMsThe paper introduces the Stanford EDGAR Filings Dataset (SEFD), an open reconstruction of SEC filings into layout-faithful MultiMarkdown, providing audited financial disclosures as token-efficient pretraining and evaluation data for financial language modeling.
-
DRFLOW: A Deep Research Benchmark for Personalized Workflow PredictionDRFLOW: a deep research benchmark for personalized workflow predictionThe paper introduces DRFLOW, a benchmark for evaluating personalized workflow prediction in deep research systems, focusing on identifying concrete action-step workflows for enterprise tasks rather than generating reports or summaries.
-
Kolmogorov Regression for Robust Diffusion PoliciesKolmogorov regression yields robust diffusion policiesFinite-dimensional diffusion policies suffer temporal drift from discretization that degrades long-horizon performance. The paper introduces a backward Kolmogorov equation that lifts diffusion policies into a Cameron-Martin space to make them more robust.
-
A Diffusion Approximation for Temporal-Difference Learning with Linear Features under Markovian NoiseA diffusion approximation for TD learning under Markovian noiseThe classical continuous-time description of temporal-difference learning with linear features is an ODE capturing asymptotic mean dynamics but neglecting stochasticity. This work provides a diffusion approximation for TD learning under Markovian noise to capture those fluctuations.
-
ReAge3D: Re-Aging 3D Faces with View ConsistencyReAge3D: identity-preserving, view-consistent 3D face re-agingThe paper presents ReAge3D, a framework for identity-preserving 3D face re-aging that introduces a 2D diffusion-based re-aging model (DiffReaging) trained on synthetic image pairs and a center-out approach to maintain detail and view consistency.
-
Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI ModelsAn agentic benchmark for implicit animal welfare in frontier AIAI agents are shifting from advisors to actors that book travel and run procurement. Existing animal-welfare benchmarks grade only text answers, so this work introduces an agentic benchmark testing whether implicit animal-welfare reasoning transfers to agent actions in frontier models.
-
Descriptor: Certus Caliber Classification Gunshot Dataset (C3GD)C3GD: a public field-collected gunshot muzzle-blast sound datasetThe paper introduces the Certus Caliber Classification Gunshot Dataset (C3GD), a public dataset of firearm muzzle-blast sounds with over 8,000 field-collected data points from 28 firearms across 16 calibers, with detailed metadata.
-
Knowledge Reutilization in Meta-Reinforcement LearningA meta-knowledge reutilization framework for meta-RL across agentsThe paper proposes a meta-knowledge reutilization framework for meta-reinforcement learning that learns task-level knowledge on a dynamics-simplified agent and transfers it to heterogeneous agents, using a Bayesian non-parametric prior to organize latent task modes.
-
Towards Understanding and Measuring COGNITIVE ATROPHY in LLM BehaviourFormalizing 'cognitive atrophy' as a process-level measure of LLM behaviourThe paper formalizes 'cognitive atrophy,' a process-level behavioural measure of AI-mediated mental-health support, capturing whether interactions help users keep reflecting, coping, and deciding, a dimension distinct from safety and static response quality.
-
Unintended Effects of Geographic Conditioning in Large Language ModelsUnintended regional biases from geographic conditioning in LLMsConversational AI localizes responses using user metadata, yet the regional biases this hidden context introduces remain poorly understood. The paper analyzes the unintended effects of geographic conditioning on large language model outputs.
-
Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-EscapingStructural role injection in Handlebars-templated LLM promptsLLM apps build prompts from templates, with Handlebars the default in Microsoft Semantic Kernel. While double-brace expressions HTML-escape values, triple-brace interpolation inserts them raw. The paper studies structural role injection and the limits of HTML auto-escaping.
-
datasette-tailscale 0.1a0Simon Willison releases datasette-tailscale, an experimental Tailscale pluginSimon Willison released datasette-tailscale 0.1a0, a very experimental alpha plugin that runs a local Datasette server with a Tailscale sidecar so it is reachable inside your Tailnet via a chosen hostname. You launch it with an auth key and hostname. It relies on Python bindings for the experimental tailscale-rs library, and he filed an issue asking for a cleaner way to set up the proxy.
-
Querying an astronomical database using large language models: the ALeRCE text-to-SQL systemA text-to-SQL system for querying the ALeRCE astronomical databaseThe paper develops an LLM-based text-to-SQL system using in-context learning, applied to the ALeRCE astronomical broker database, generating executable SQL from natural language and evaluated on a dataset of 110 NL/SQL pairs via step-by-step generation.
-
HistoRAG: Embedding Historical Methodology in Retrieval-Augmented Generation Through Critical Technical PracticeHistoRAG embeds historical methodology into RAG via critical practiceRAG grounds model outputs in external evidence, but its dominant evaluations and defaults are oriented toward factual question answering. HistoRAG embeds historical methodology into retrieval-augmented generation through critical technical practice for interpretive historical studies.
-
Volterra Generative ModelsVolterra generative models add memory to diffusion perturbationsScore-based diffusion models use memoryless Brownian perturbations that yield tractable reverse-time dynamics. Volterra generative models introduce continuous-time perturbations with memory, generalizing diffusion-based generation.
-
NoiseTilt: Noise-Tilted Reverse Kernels for Diffusion Reward AlignmentNoiseTilt injects reward gradients via the noise term in diffusionNoiseTilt (NTRK) is a reward-guided diffusion sampler that injects reward gradients through the noise term, leaving the score kernel unchanged and needing only a single sample per step, improving reward alignment of pretrained diffusion models.
-
Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs RespondSecurity and privacy prompts in the wild: what users ask LLMsThe paper analyzes, in the wild, what users ask large language models about security and privacy and how the models respond, characterizing the questions, response patterns and associated concerns.
-
When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver SupportSynthetic lived experience in AI peer-like caregiver supportCaregivers seek informational and emotional support in online communities where peers draw on personal narratives. As LLMs are designed as peer-like supporters, the paper examines the tension introduced when AI claims synthetic lived experience in caregiver support.
-
Compositional Skill Routing for LLM Agents: Decompose, Retrieve, and ComposeCompositional skill routing for LLM agents: decompose, retrieve, composeLLM agents rely on reusable tool specifications (skills), but real tasks require composing multiple skills. The paper formalizes compositional skill routing: decomposing a complex query into atomic sub-tasks, retrieving relevant skills, and composing them.
-
LoopCoder-v2: Only Loop Once for Efficient Test-Time Computation ScalingLoopCoder-v2: loop once for efficient test-time compute scalingLooped transformers scale latent computation by repeating shared blocks, but sequential looping raises latency and KV-cache memory with loop count. Building on parallel loop transformers, LoopCoder-v2 makes loop count a practical knob for efficient test-time computation scaling.
-
Recursive Scaling in Masked Diffusion ModelsRecursive scaling in masked diffusion modelsMasked diffusion models (MDMs) have recently emerged as a generative approach. The paper investigates recursive scaling in MDMs, offering insights into their behavior and efficiency.
-
Reading between the Lines: Leveraging Large Language Models for Global Dementia and Depression Assessment from Clinical InterviewsUsing LLMs to assess dementia and depression from clinical interviewsDementia and depression are the most prevalent geriatric neuropsychiatric disorders, with overlapping symptoms complicating diagnosis. The study investigates open-weights LLMs for predicting dementia and depression severity from speech collected during clinical interviews.
-
Fast Nonparametric Conditional Independence Testing via Two-Stage RegressionFast nonparametric conditional independence testing via two-stage regressionConditional independence testing is fundamental to statistics and causal inference. The paper proposes a fast nonparametric conditional independence test based on two-stage regression, aiming to improve computational efficiency and power.
-
LLM Consumer Behavior Theory: Foundations of a Novel Research FieldLLM Consumer Behavior Theory: a new field for agentic marketsThe paper introduces LLM Consumer Behavior Theory, a proposed field analyzing consumer behavior in agentic markets where LLMs make consumption decisions on behalf of users, drawing on classical and behavioral economics alongside NLP.
-
C2FL: Clustered Continual Federated Learning under Spatial and Temporal DriftC2FL: clustered continual federated learning under driftCollective adaptive systems let nodes learn from locally sensed data, but privacy-sensitive data and node mobility hinder scaling. C2FL proposes clustered continual federated learning that handles spatial and temporal drift.
-
VoidPadding: Let [VOID] Handle Padding in Masked Diffusion Language Models so that [EOS] Can Focus on Semantic TerminationVoidPadding lets [VOID] handle padding so [EOS] focuses on terminationIn masked diffusion language models, padding and semantic termination roles get entangled. VoidPadding introduces a [VOID] token to handle padding so that [EOS] can focus on signaling semantic termination, improving generation behavior.
-
Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast SynthesisImproved latent modeling for 3D MRI reconstruction and synthesisThe paper proposes an improved latent modeling approach for 3D MRI reconstruction and cross-contrast synthesis, addressing the heavy computational cost of large 3D volumes by recovering semantics first to better infer absent MRI contrasts.
-
Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health DialogueFine-tuning LLMs for passive depression severity from AI dialogueThe paper fine-tunes LLMs for passive estimation of depression severity from AI mental-health dialogue, exploring how conversational signals can indicate severity. Figures and efficacy are as reported by the source and not independently verified.