New Model Releases A
-
FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs
FoMoE breaks the full-replica barrier with a federation of MoEs
Pretraining LLMs typically demands large-scale infrastructure with tightly coupled accelerators, and the demand grows with model and data scale. FoMoE proposes a federation of Mixture-of-Experts models that avoids replicating the full model across devices, easing infrastructure constraints.
-
Sumi: Open Uniform Diffusion Language Model from Scratch
Sumi: an open uniform diffusion language model from scratch
Diffusion models are a promising alternative to autoregressive ones, and uniform diffusion language models (UDLMs) let any token be updated at any step. This work releases Sumi, an open uniform diffusion language model built from scratch, supporting research and reproducibility in diffusion LMs.
-
Spotlight: Synergizing Seed Exploration and Spot GPUs for DiT RL Post-Training
Spotlight cuts DiT RL post-training cost with spot GPUs
Reinforcement learning post-training of Diffusion Transformers (DiTs) is prohibitively expensive, needing thousands of high-end GPUs. Spotlight combines seed exploration with cheap, preemptible spot GPUs to substantially reduce the cost of DiT RL post-training.
-
Enhancing Multilingual Reasoning via Steerable Model Merging
Model merging effectively composes the capabilities of a multilingual model and a reasoning model, achieving promising generalization on multilingual reasoning by aligning their feature spaces. This work introduces steerable model merging to control the composition and further boost multilingual reasoning.
-
TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction
TRAP benchmarks agents on task completion and privacy resistance
Agents are increasingly deployed in document-intensive workflows where sensitive private information is routine input; booking a flight, for example, requires a passport number. TRAP is a benchmark that evaluates agents on both task completion and resistance to active privacy-extraction attempts.
-
ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection
ThinkDeception: interpretable multimodal deception detection via RL
Existing multimodal deception detection relies on end-to-end black boxes that offer no transparent reasoning. ThinkDeception is a progressive reinforcement learning framework that explicitly captures subtle cross-modal cues and produces interpretable reasoning trajectories for deception detection.
-
Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering
Direct timestep embedding and contrastive alignment for time-series QA
Time-series question answering casts analysis as natural-language QA. Instead of tokenizing the series, this work embeds timesteps directly and uses contrastive alignment to match language representations, avoiding the information loss of tokenization.
-
CAPRA: Scaling Feedback on Software Architecture Deliverables with a Multi-Agent LLM System
CAPRA: a multi-agent LLM system for software architecture feedback
Automated assessment in software engineering education has advanced, but giving quality feedback on architecture deliverables remains hard. CAPRA is a multi-agent LLM system that scales detailed feedback on software architecture deliverables.
-
SenFlow: Inter-Sentence Flow Modeling for AI-Generated Text Detection in Hybrid Documents
SenFlow: inter-sentence flow modeling for AI-text detection
Sentence-level AI-generated text detection is hard in hybrid human-AI documents. SenFlow models inter-sentence flow to capture discontinuities, improving detection of AI-generated sentences within mixed documents.
-
SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety
As LLMs become embedded in scientific research, evaluating their safety matters. SciRisk-Bench is a risk-dimension-aware benchmark that assesses the safety of LLMs in AI-for-science settings across multiple risk categories.
-
SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration
Context engineering has become a primary lever for improving AI systems. SAGE is a stochastic prompt optimization method that uses agent-guided exploration to automatically discover effective prompts and improve task performance.
-
Improving Medical Communication using Rubric-Guided Counterfactual Recommendations
Rubric-guided counterfactual recommendations for medical communication
Text-based telemedicine increasingly relies on lightweight patient feedback. This work improves medical communication using rubric-guided counterfactual recommendations, enhancing the quality of patient-clinician interactions.
-
A near-autonomous AI chemist improves a challenging reaction in medicinal chemistry
OpenAI's near-autonomous AI chemist improves a key medicinal reaction
OpenAI and Molecule.one describe a near-autonomous AI chemist, built on GPT-5.4, that improved a challenging reaction in medicinal chemistry. The work is framed as advancing drug-discovery research; specific performance figures come from the article and are not independently verified.
-
GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents
GateMem: benchmarking memory governance in shared-memory agents
Memory benchmarks for LLM agents largely assume single-user settings, leaving shared-memory governance untested. GateMem benchmarks memory governance, such as access control and management, in multi-principal shared-memory agents.
-
Cursor announces Git hosting service 'Origin', immediately after SpaceX acquisition announcement
Cursor unveils 'Origin' Git hosting, seen as a GitHub rival
Cursor, the AI coding tool, announced 'Origin', a Git hosting service that the article frames as aimed at rivaling GitHub. The reveal reportedly came right after news of SpaceX acquiring Cursor. Acquisition terms and Origin's features come from the article and have not been verified by third parties.
-
HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space
HandwritingAgent: language-driven handwriting synthesis in vector space
Emulating natural handwriting styles remains an open problem. HandwritingAgent synthesizes handwriting in a scalable vector space from language-driven instructions, enabling generation of diverse, resolution-independent handwriting styles.
-
RedactionBench
RedactionBench: a benchmark for redacting sensitive information
Large language models are increasingly applied to sensitive domains. RedactionBench evaluates how well models redact sensitive information in such settings, supporting verification toward safer deployment.
-
Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
Improving long-document retrieval with chunk evidence aggregation
Dense retrieval matches one query vector against one document vector, but long documents get lost in a single vector. This work splits documents into chunks and aggregates per-chunk evidence to improve long-document retrieval.
-
SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction
SAMA: semantic anchor-aligned augmentation for low-resource multimodal IE
Multimodal information extraction spans many tasks but suffers from scarce data in low-resource settings. SAMA proposes semantic anchor-aligned augmentation to unify and improve multimodal information extraction under low-resource conditions.
-
Output Vector Editing for Memorization Mitigation in Large Language Models
Output vector editing for memorization mitigation in LLMs
Large language models memorize and reproduce sequences from their training data. This work edits output vectors to mitigate such memorization, reducing the risk of leaking copyrighted or private content.
-
Attention as Frustrated Synchronization
A network of oscillators that synchronizes perfectly computes nothing. This work frames attention as frustrated synchronization, offering a physics-inspired view that interprets the workings of attention through partial, non-trivial synchronization.
-
ForecastBench-Sim: A Simulated-World Forecasting Benchmark
Forecasting benchmarks for general-purpose AI usually inherit real-world events, making evaluation hard to control. ForecastBench-Sim introduces a simulated-world forecasting benchmark, enabling controlled assessment of AI forecasting ability.
-
Introducing LifeSciBench
OpenAI launches LifeSciBench for life-science research tasks
OpenAI introduced LifeSciBench, an expert-authored and expert-reviewed benchmark for evaluating how AI systems handle real-world life science research tasks and decisions. It aims to rigorously assess AI's practical usefulness in life-science research.
-
datasette 1.0a34
Datasette 1.0a34 released with in-interface row editing
Datasette 1.0a34 has been released. The headline feature is tooling to insert, edit, and delete rows directly within the Datasette interface, available on table pages so users can modify data without leaving the app.
-
GitLab announces 'Project Switch', a next-generation Git-compatible source code management service for AI agents: up to 50x faster and usable with half the tokens
GitLab unveils 'Project Switch,' a Git-compatible SCM service for AI agents
GitLab announced Project Switch, a next-generation Git-compatible source code management service aimed at AI agents, at its GitLab Transcend event in London. Reports cite up to 50x speed and roughly half the token usage; the figures reflect the announcement and remain unverified.
-
ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
ReproRepo scales reproducibility audits using GitHub repo issues
Reproducing results from papers and code is central to science, but existing benchmarks are hard to scale. ReproRepo leverages GitHub repository issues to evaluate, at scale, how well LLM agents can assist with reproducibility tasks, addressing the manual effort that limits prior reproducibility benchmarks.
-
EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation
EvolveNav: a self-evolving framework for zero-shot object-goal navigation
The paper proposes a self-evolving zero-shot object-goal navigation framework that builds an agentic rule memory by extracting actionable knowledge from past trajectories and uses a retrieval strategy to enable continuous test-time improvement.
-
Darshana Graph: A Parallel Commentary Corpus for Comparative Indian Philosophy, with Stylometric and Exploratory Graph Analyses
Darshana Graph: a parallel commentary corpus for Indian philosophy
Darshana Graph is a corpus of over 125,000 text records spanning classical Hindu, Buddhist and Jain philosophical traditions, drawn from public-domain and openly licensed translations. It supports comparative Indian philosophy through stylometric and exploratory graph analyses.
-
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
Zone of Proximal Policy Optimization puts the teacher in prompts, not gradients
Knowledge distillation is brittle for small students, as imitating a large teacher's logits concentrates on its sharpest modes and hurts generalization. The proposed Zone of Proximal Policy Optimization places the teacher in prompts rather than gradients to improve small-student generalization.
-
Looped World Models
Looped World Models refine latents iteratively for efficient simulation
World models need deep computation for faithful long-horizon simulation, but deep models are costly and accumulate errors. LoopWM introduces the first looped architectures for world modelling, iteratively refining latent states to resolve this tension.