New Model Releases A
-
Understanding the Behaviors of Environment-aware Information Retrieval
Paper: RL adapts LLM query formulation per retriever
An arXiv paper presents a systematic analysis of how LLMs can learn, via reinforcement learning, to adapt their query formulation strategies to different retrievers in retrieval-augmented generation. Summarized neutrally from the abstract.
-
GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents
GIST-CMTF adds goal-state inference to causal minimal tool filtering
The paper introduces GIST-CMTF, which augments Causal Minimal Tool Filtering with goal-state inference for tool-augmented LLM agents. It addresses wrong-goal execution, where ambiguous requests such as "handle my appointment" map to multiple goals and an agent may follow a valid causal tool path toward an unintended objective.
-
The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models
MIXGUARD: mixup-based privacy for LLM split learning
The paper presents MIXGUARD, a mixup-based privacy-preserving split-learning framework for LLMs combining token- and representation-level obfuscation with adaptive gradient perturbation to balance utility, privacy, and efficiency. Claims reflect the abstract.
-
Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment
MST-CLIPIQA: decoupling semantics and distortions in AI-image quality
The paper introduces MST-CLIPIQA, a multi-scale two-stream framework for assessing AI-generated image quality. It argues that monolithic vision-language representations entangle semantic understanding with low-level perceptual sensitivity, and instead decouples them using dual CLIP encoders for hierarchical alignment.
-
OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models
OpenClaw-Skill: collective skill tree search for LLM agents
The paper proposes Collective Skill Tree Search (CSTS), a tree-search framework that automatically builds reusable skills for LLM agents via iterative collective generation and assessment across multiple models. Claims reflect the abstract.
-
GD$^2$PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization
GD²PO eases multi-reward conflicts in LLM RL via dynamic reward decoupling
As LLM post-training RL uses multi-dimensional rewards, conflicting signals across reward groups can cancel out and hinder training. GD²PO decouples rewards into groups and, inspired by DAPO, dynamically filters near-zero-advantage rollouts, reducing conflicts and improving RL training efficiency.
-
P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs
P3B3: a benchmark for Portuguese variety bias in LLMs
The paper introduces P3B3, an expert-curated benchmark and framework for measuring European versus Brazilian Portuguese variety bias in LLMs. It reports that most models lean strongly toward pt-BR and argues for more balanced multilingual representation.
-
MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
MyPCBench: benchmarking personal computer-use agents
MyPCBench evaluates computer-use agents as personal assistants on a Linux desktop with 17 simulated web apps and 184 persona-seeded tasks, benchmarking six closed and open-weight models. Reported scores reflect the paper and are not independently verified.
-
Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection
A noise-amplification perspective for detecting AI-generated videos
The paper proposes detecting AI-generated videos, especially those from text-to-video models, by amplifying noise to reveal subtle artifacts that distinguish them from authentic footage. It notes that prior work largely targeted GAN-generated samples and frames text-to-video detection as still underexplored.
-
Misinformation Propagation in Benign Multi-Agent Systems
Study on misinformation propagation in benign multi-agent systems
The paper injects intent-based misinformation into single- and multi-agent LLM systems and finds it degrades performance and persists through debate, though multi-agent debate can reduce degradation when most agents are uncontaminated. Robustness depends on group composition.
-
Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models
Reflective Masking elicits iterative reasoning in mask diffusion models
The paper introduces Reflective Masking, a lightweight post-training method that lets mask diffusion models iteratively revisit and revise prior outputs via multi-turn masking, plus a History Reference component. Claims reflect the abstract and are not independently verified.
-
SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG
SCAR: semantic continuity-aware retrieval for RAG context expansion
Note: the abstract was unavailable, so this is summarized neutrally from the title alone. The paper proposes SCAR, a 'semantic continuity-aware retrieval' method aimed at efficient context expansion in retrieval-augmented generation (RAG). Specific mechanisms and evaluation results cannot be confirmed from the title.
-
FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection
FraudSMSWalker benchmark targets URL-masked SMS-to-webpage fraud
The paper introduces FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. It contains 699 bilingual chains (332 fraudulent, 367 benign) across ten scenarios, withholding raw URLs, hosts, and reputation metadata so models cannot rely on reputation shortcuts, and evaluates nine web agents.
-
VeriGraph: Towards Verifiable Data-Analytic Agents
VeriGraph: a traceable neuro-symbolic framework for verifiable data agents
This arXiv paper introduces VeriGraph, a traceable neuro-symbolic reasoning framework for verifiable data-analytic agents. The authors note that LLM agents' reliance on linear text trajectories makes reasoning hard to audit, entangling deterministic computations over raw data with semantic deductions over natural-language claims. VeriGraph instead has agents build an explicit heterogeneous evidence directed acyclic graph (DAG) during execution.
-
The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage
BD-LSC: a new benchmark dataset for lexical semantic change detection
This arXiv paper introduces two complementary benchmark datasets for computational lexical semantic change (LSC) detection. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, loss, and stability across three time periods. It targets cases, especially in slang versus standard usage, where words simultaneously gain and lose senses, which existing benchmarks struggle to capture.
-
Japanese Society for Artificial Intelligence: "AI will not replace humans"; four proposals for social implementation, also touching on security and copyright
JSAI marks 40th year with four proposals on AI's social adoption
On its 40th anniversary, the Japanese Society for Artificial Intelligence (JSAI) issued proposals for adopting AI across Japanese society. Asserting that AI will not replace humans, it offered four recommendations and touched on issues spanning security and copyright.
-
Java app updates sped up from one month to three days: how "IBM Bob" goes beyond "just a source-code generation AI"
IBM unveils 'IBM Bob', an AI that speeds Java app modernization
IBM's new AI tool 'IBM Bob' reportedly cut Java application modernization from 30 days to 3 at early adopters. Its distinguishing feature is that it goes beyond mere source-code generation.
-
Sakana AI releases its first commercial product, "Marlin": how capable is it? [Full output report included]
Sakana AI launches its first commercial product, Sakana Marlin
Sakana AI has launched Sakana Marlin, an AI research agent, commercializing the beta it had offered since April. Ahead of the release it held a hands-on session for the press, showing reporters the reports the AI generated from pre-collected themes.
-
ChatGPT vs. Google Search: which is more effective for learning? A study tested it in an 8-day experiment
Study: does ChatGPT or Google search aid learning more? An 8-day test
Researchers at Georgia Tech, the University of Michigan and others published a study comparing whether AI chatbots or search engines yield better learning. Over an eight-day experiment, the paper examines how generative AI shapes information seeking and learning.
-
Introducing the OpenAI Partner Network
OpenAI launches Partner Network, investing $150M to speed enterprise AI
OpenAI introduced its Partner Network, committing $150M to help global partners accelerate enterprise AI adoption, deployment, and transformation. The program aims to broaden OpenAI's reach into enterprise markets through a structured partner ecosystem.
-
By 2027, 65% of teams coding with AI agents will no longer consider an IDE essential, Gartner predicts
Gartner: by 2027, 65% of AI-coding teams find IDEs non-essential
Research firm Gartner says the enterprise AI coding-agent market has entered a new phase of growth and competitive realignment. It predicts that by 2027, 65% of teams coding with AI agents will no longer regard an IDE as essential.
-
Sakana AI begins offering its first commercial product, "Sakana Marlin"
Sakana AI launches Marlin, its first commercial autonomous research assistant
Sakana AI has launched Sakana Marlin, its first commercial product: an autonomous research assistant for business. Given a research theme, it works autonomously for up to about eight hours, forming hypotheses and gathering and verifying information, then outputs structured summary slides and a report spanning dozens of pages. Built on the firm's long-horizon reasoning technology, it aims to act as a 'virtual CSO', is self-serve and available the same day, and is offered in plans ranging from free pay-per-use to Enterprise.
-
Amazon had conveyed concerns about Anthropic's latest AI ahead of the US administration's suspension order, sources say
Amazon's Jassy flagged Anthropic AI security risks to US officials
Amazon CEO Andy Jassy was among the tech executives who raised security concerns about Anthropic's frontier models with Trump administration officials, sources told Reuters, ahead of the directive barring foreign nationals from using Fable 5 and Mythos 5.
-
luau-wasm 0.1a0
luau-wasm 0.1a0 released, bringing Luau to WebAssembly
An early 0.1a0 release of luau-wasm packages Luau, Roblox's typed Lua dialect, for WebAssembly, built using the newly enabled approach for publishing WASM wheels to PyPI for Pyodide.
-
"Claude Fable 5" and "Mythos 5" fully suspended under US government directive; Anthropic pledges early restoration
Anthropic halts Fable 5, Mythos 5 under US order, vows quick restore
On June 12 Anthropic said it would suspend its flagship Claude Fable 5 and Mythos 5 for all users after a US export-control directive barred foreign nationals from access on security grounds. Calling it a misunderstanding, the firm aims to restore service soon; other models are unaffected.
-
OpenAI WebRTC Audio Session, now with document context
Simon Willison adds document context to his OpenAI WebRTC audio tool
Simon Willison updated his browser tool for OpenAI's WebRTC realtime audio API. It now supports the newer realtime voice model touting GPT-5-class reasoning, and lets users paste document text as context for spoken conversations about it.
-
The day Toyota is overtaken: Kioxia takes the top spot; looking back at 2005's "top 10 by market cap"
Kioxia tops Toyota in market cap: MONOist weekly news roundup
MONOist editors pick the week's top stories from June 8-12, led by Kioxia overtaking Toyota as Japan's most valuable listed company, and revisit 2005's top-10 market-cap ranking to trace shifts in Japan's corporate landscape.
-
TCS and Anthropic partner to bring Claude to regulated industries
Anthropic partners with TCS to bring Claude to regulated industries
Anthropic announced a partnership with Tata Consultancy Services. TCS will deploy Claude to 50,000 employees across 56 countries, build Claude-powered products for finance, healthcare and the public sector, and join the Claude Partner Network.
-
ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
ClinHallu: a stage-wise hallucination diagnosis benchmark for medical MLLMs
ClinHallu is a benchmark for diagnosing where hallucinations originate in medical multimodal LLM reasoning, decomposing traces into visual recognition, knowledge recall, and reasoning integration. It provides 7,031 validated instances and uses stage-replacement interventions to localize error sources.
-
AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization
AdaSR enables adaptive streaming reasoning for reasoning models
AdaSR moves beyond the read-then-think paradigm by letting reasoning models reason incrementally as input streams in. It uses a hierarchical relative policy optimization scheme to train streaming reasoning.