New Model Releases A

  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Understanding the Behaviors of Environment-aware Information Retrieval
    Paper: RL adapts LLM query formulation per retriever
    Deep Learning Embeddings Retrieval-Augmented Generation (RAG) Reinforcement Learning
    An arXiv paper presents a systematic analysis of how LLMs can learn, via reinforcement learning, to adapt their query formulation strategies to different retrievers in retrieval-augmented generation. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN New Model Releases extract
    GIST-CMTF: Goal-State Inference for Causal Minimal Tool Filtering in LLM Agents
    GIST-CMTF adds goal-state inference to causal minimal tool filtering
    AI Agents Deep Learning Inference
    The paper introduces GIST-CMTF, which augments Causal Minimal Tool Filtering with goal-state inference for tool-augmented LLM agents. It addresses wrong-goal execution, where ambiguous requests such as "handle my appointment" map to multiple goals and an agent may follow a valid causal tool path toward an unintended objective.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models
    MIXGUARD: mixup-based privacy for LLM split learning
    Fine-tuning
    The paper presents MIXGUARD, a mixup-based privacy-preserving split-learning framework for LLMs combining token- and representation-level obfuscation with adaptive gradient perturbation to balance utility, privacy, and efficiency. Claims reflect the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment
    MST-CLIPIQA: decoupling semantics and distortions in AI-image quality
    Computer Vision Machine Learning Retrieval-Augmented Generation (RAG)
    The paper introduces MST-CLIPIQA, a multi-scale two-stream framework for assessing AI-generated image quality. It argues that monolithic vision-language representations entangle semantic understanding with low-level perceptual sensitivity, and instead decouples them using dual CLIP encoders for hierarchical alignment.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models
    OpenClaw-Skill: collective skill tree search for LLM agents
    AI Agents Retrieval-Augmented Generation (RAG) Reinforcement Learning
    The paper proposes Collective Skill Tree Search (CSTS), a tree-search framework that automatically builds reusable skills for LLM agents via iterative collective generation and assessment across multiple models. Claims reflect the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    GD$^2$PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization
    GD²PO eases multi-reward conflicts in LLM RL via dynamic reward decoupling
    Algorithms & Theory Reinforcement Learning Reinforcement Learning from Human Feedback (RLHF)
    As LLM post-training RL uses multi-dimensional rewards, conflicting signals across reward groups can cancel out and hinder training. GD²PO decouples rewards into groups and, inspired by DAPO, dynamically filters near-zero-advantage rollouts, reducing conflicts and improving RL training efficiency.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs
    P3B3: a benchmark for Portuguese variety bias in LLMs
    The paper introduces P3B3, an expert-curated benchmark and framework for measuring European versus Brazilian Portuguese variety bias in LLMs. It reports most models lean strongly toward pt-BR and argues for more balanced multilingual representation.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents
    MyPCBench: benchmarking personal computer-use agents
    AI Agents Claude Neural Network Reinforcement Learning
    MyPCBench evaluates computer-use agents as personal assistants on a Linux desktop with 17 simulated web apps and 184 persona-seeded tasks, benchmarking six closed and open-weight models. Reported scores reflect the paper and are not independently verified.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Developer Tools extract
    Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection
    A noise-amplification perspective for detecting AI-generated videos
    Reinforcement Learning
    The paper proposes detecting AI-generated videos, especially those from text-to-video models, by amplifying noise to reveal subtle artifacts that distinguish them from authentic footage. It notes that prior work largely targeted GAN-generated samples and frames text-to-video detection as still underexplored.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Misinformation Propagation in Benign Multi-Agent Systems
    Study on misinformation propagation in benign multi-agent systems
    AI Agents Reinforcement Learning Software Engineering
    The paper injects intent-based misinformation into single- and multi-agent LLM systems and finds it degrades performance and persists through debate, though multi-agent debate can reduce degradation when most agents are uncontaminated. Robustness depends on group composition.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models
    Reflective Masking elicits iterative reasoning in mask diffusion models
    Retrieval-Augmented Generation (RAG) Software Engineering
    The paper introduces Reflective Masking, a lightweight post-training method that lets mask diffusion models iteratively revisit and revise prior outputs via multi-turn masking, plus a History Reference component. Claims reflect the abstract and are not independently verified.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Funding & M&A extract
    SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG
    SCAR: semantic continuity-aware retrieval for RAG context expansion
    Embeddings Retrieval-Augmented Generation (RAG)
    Note: the abstract was unavailable, so this is summarized neutrally from the title alone. The paper proposes SCAR, a 'semantic continuity-aware retrieval' method aimed at efficient context expansion in retrieval-augmented generation (RAG). Specific mechanisms and evaluation results cannot be confirmed from the title.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection
    FraudSMSWalker benchmark targets URL-masked SMS-to-webpage fraud
    AI Agents Meta Neural Network Reinforcement Learning
    The paper introduces FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. It contains 699 bilingual chains (332 fraudulent, 367 benign) across ten scenarios, withholding raw URLs, hosts, and reputation metadata so models cannot rely on reputation shortcuts, and evaluates nine web agents.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    VeriGraph: Towards Verifiable Data-Analytic Agents
    VeriGraph: a traceable neuro-symbolic framework for verifiable data agents
    AI Agents Neural Network Software Engineering
    This arXiv paper introduces VeriGraph, a traceable neuro-symbolic reasoning framework for verifiable data-analytic agents. The authors note that LLM agents' reliance on linear text trajectories makes reasoning hard to audit, entangling deterministic computations over raw data with semantic deductions over natural-language claims. VeriGraph instead has agents build an explicit heterogeneous evidence directed acyclic graph (DAG) during execution.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    The BD-LSC Dataset: Facilitating the Benchmarking of Models for Lexical Semantic Change Detection in Slang and Standard Usage
    BD-LSC: a new benchmark dataset for lexical semantic change detection
    Embeddings GPT Machine Learning Neural Network Transformer
    This arXiv paper introduces two complementary benchmark datasets for computational lexical semantic change (LSC) detection. The Bi-Directional Lexical Semantic Change (BD-LSC) dataset captures sense gain, loss, and stability across three time periods, targeting cases—especially slang versus standard usage—where words simultaneously gain and lose senses, which existing benchmarks struggle to capture.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • ITmedia AI+ · JA New Model Releases extract
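    Japanese Society for Artificial Intelligence: "AI will not replace humans"; four proposals for social implementation, also touching on security and copyright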
    JSAI marks 40th year with four proposals on AI's social adoption
    On its 40th anniversary, the Japanese Society for AI issued proposals for adopting AI across Japanese society. Asserting that AI will not replace humans, it offered four recommendations and touched on issues spanning security and copyright.
    Read original (ITmedia AI+) ↗
  • ITmedia AI+ · JA New Model Releases extract
    Java app updates accelerated from one month to three days: how "IBM Bob" works, going beyond "just a source-code generation AI"
    IBM unveils 'IBM Bob', an AI that speeds Java app modernization
    IBM's new AI tool 'IBM Bob' reportedly cut Java application modernization from 30 days to 3 at early adopters. Its distinguishing feature is going beyond mere source-code generation.
    Read original (ITmedia AI+) ↗
  • ITmedia AI+ · JA New Model Releases extract
    Sakana AI releases its first commercial product, "Marlin": how capable is it? [Full output report included]
    Sakana AI launches its first commercial product, Sakana Marlin
    AI Agents Reinforcement Learning
    Sakana AI has launched Sakana Marlin, an AI research agent, commercializing the beta it had offered since April. Ahead of the release it held a hands-on session for the press, showing reporters the reports the AI had generated from themes collected in advance.
    Read original (ITmedia AI+) ↗
  • ITmedia AI+ · JA New Model Releases extract
    ChatGPT vs. Google Search: which yields better learning outcomes? A study tested in an eight-day experiment
    Study: does ChatGPT or Google search aid learning more? An 8-day test
    Generative AI Google GPT
    Researchers at Georgia Tech, the University of Michigan and others published a study comparing whether AI chatbots or search engines yield better learning. Over an eight-day experiment, the paper examines how generative AI shapes information seeking and learning.
    Read original (ITmedia AI+) ↗
  • OpenAI Blog · EN Industry Adoption extract
    Introducing the OpenAI Partner Network
    OpenAI launches Partner Network, investing $150M to speed enterprise AI
    OpenAI
    OpenAI introduced its Partner Network, committing $150M to help global partners accelerate enterprise AI adoption, deployment, and transformation. The program aims to broaden OpenAI's reach into enterprise markets through a structured partner ecosystem.
    Read original (OpenAI Blog) ↗
  • Publickey · JA New Model Releases extract
    By 2027, 65% of teams coding with AI agents will no longer consider an IDE essential, Gartner predicts
    Gartner: by 2027, 65% of AI-coding teams find IDEs non-essential
    AI Agents Machine Learning
    Research firm Gartner says the enterprise AI coding-agent market has entered a new phase of growth and competitive realignment. It predicts that by 2027, 65% of teams coding with AI agents will no longer regard an IDE as essential.
    Read original (Publickey) ↗
  • Sakana AI Blog (ja) · JA New Model Releases extract
    Sakana AI begins offering its first commercial product, "Sakana Marlin"
    Sakana AI launches Marlin, its first commercial autonomous research assistant
    AI Agents Algorithms & Theory Inference Neural Network Reinforcement Learning
    Sakana AI has launched Sakana Marlin, its first commercial product: an autonomous research assistant for business. Given a research theme, it works autonomously for up to about eight hours, forming hypotheses and gathering and verifying information, then outputs structured summary slides and a report spanning dozens of pages. Built on the firm's long-horizon reasoning technology, it aims to act as a 'virtual CSO'. It is self-serve and available the same day, with plans from free pay-per-use to Enterprise.
    Read original (Sakana AI Blog (ja)) ↗
  • ITmedia AI+ · JA New Model Releases extract
    Amazon had conveyed concerns about Anthropic's latest AI ahead of the US administration's suspension order, sources say
    Amazon's Jassy flagged Anthropic AI security risks to US officials
    Anthropic
    Amazon CEO Andy Jassy was among the tech executives who raised security concerns about Anthropic's frontier models to Trump administration officials, sources told Reuters, ahead of the directive barring foreign nationals from using Fable 5 and Mythos 5.
    Read original (ITmedia AI+) ↗
  • Simon Willison's Weblog · EN New Model Releases extract
    luau-wasm 0.1a0
    luau-wasm 0.1a0 released, bringing Luau to WebAssembly
    An early 0.1a0 release of luau-wasm packages Luau, Roblox's typed Lua dialect, for WebAssembly, built using the newly enabled approach for publishing WASM wheels to PyPI for Pyodide.
    Read original (Simon Willison's Weblog) ↗
  • ITmedia AI+ · JA Policy & Regulation extract
    "Claude Fable 5" and "Mythos 5" fully suspended under a US government directive; Anthropic pledges early restoration
    Anthropic halts Fable 5, Mythos 5 under US order, vows quick restore
    Anthropic Claude
    On June 12 Anthropic said it would suspend its flagship Claude Fable 5 and Mythos 5 for all users after a US export-control directive barred foreign nationals from access on security grounds. Calling it a misunderstanding, the firm aims to restore service soon; other models are unaffected.
    Read original (ITmedia AI+) ↗
  • Simon Willison's Weblog · EN New Model Releases extract
    OpenAI WebRTC Audio Session, now with document context
    Simon Willison adds document context to his OpenAI WebRTC audio tool
    GPT OpenAI
    Simon Willison updated his browser tool for OpenAI's WebRTC realtime audio API. It now supports the newer realtime voice model touting GPT-5-class reasoning, and lets users paste document text as context for spoken conversations about it.
    Read original (Simon Willison's Weblog) ↗
  • ITmedia AI+ · JA New Model Releases extract
    The day Toyota is overtaken: Kioxia takes the top spot; looking back at 2005's "market-cap top 10"
    Kioxia tops Toyota in market cap: MONOist weekly news roundup
    MONOist editors pick the week's top stories from June 8-12, led by Kioxia overtaking Toyota as Japan's most valuable listed company, and revisit 2005's top-10 market-cap ranking to trace shifts in Japan's corporate landscape.
    Read original (ITmedia AI+) ↗
  • Anthropic News · EN Industry Adoption extract
    TCS and Anthropic partner to bring Claude to regulated industries
    Anthropic partners with TCS to bring Claude to regulated industries
    Anthropic Claude Neural Network Reinforcement Learning
    Anthropic announced a partnership with Tata Consultancy Services. TCS will deploy Claude to 50,000 employees across 56 countries, build Claude-powered products for finance, healthcare and the public sector, and join the Claude Partner Network.
    Read original (Anthropic News) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    ClinHallu: A Benchmark for Diagnosing Stage-Wise Hallucinations in Medical MLLM Reasoning
    ClinHallu: a stage-wise hallucination diagnosis benchmark for medical MLLMs
    Fine-tuning Machine Learning Software Engineering
    ClinHallu is a benchmark for diagnosing where hallucinations originate in medical multimodal LLM reasoning, decomposing traces into visual recognition, knowledge recall, and reasoning integration. It provides 7,031 validated instances and uses stage-replacement interventions to localize error sources.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization
    AdaSR enables adaptive streaming reasoning for reasoning models
    Machine Learning Retrieval-Augmented Generation (RAG) Reinforcement Learning Software Engineering Speech Processing
    AdaSR moves beyond the read-then-think paradigm by letting reasoning models reason incrementally as input streams in. It uses a hierarchical relative policy optimization scheme to train streaming reasoning.
    Read original (arXiv cs.CL (Computation and Language)) ↗