New Model Releases A
Showing 151–180 of 250
-
Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation
CoT-enhanced reasoning for semi-supervised medical image segmentation
Semi-supervised medical image segmentation mitigates annotation scarcity via consistency regularization but relies mostly on pixel-level visual matching. The paper adds chain-of-thought-enhanced reasoning to go beyond visual cues for segmentation.
-
KANLib -- An Modular, Extensible and Fast Kolmogorov-Arnold Network Implementation
KANLib: a modular, extensible and fast KAN implementation
Kolmogorov-Arnold Networks replace linear weights with learnable univariate functions, but their high computational cost hampers practical research. KANLib provides a modular, extensible and fast implementation of KANs to ease experimentation.
-
Non-negative Elastic Net Decoding for Information Retrieval
Non-negative elastic net decoding for information retrieval
Dense retrieval has become the dominant paradigm in information retrieval. The paper applies non-negative elastic net decoding to information retrieval, aiming to improve retrieval representations and accuracy.
-
ChLogic: Evaluating Robustness of Logical Reasoning in Chinese Expressions
ChLogic evaluates logical reasoning robustness in Chinese
LLMs do well on standardized logical reasoning benchmarks, but whether this holds beyond English is unclear. ChLogic is an English-Chinese aligned benchmark testing whether models preserve logical reasoning when the same latent structure is expressed in Chinese.
-
Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models
Dynamic rollout editing reduces overthinking in RL reasoning models
Long chain-of-thought reasoning helps, but models often keep generating unnecessary reasoning after reaching a correct answer. The paper frames this as overthinking in GRPO-style RL post-training and proposes dynamic rollout editing to reduce it.
-
AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor
AnchorKV: safety-aware KV cache compression via soft penalties
AnchorKV is a safety-aware KV cache compression method that uses a soft penalty with a refusal anchor to retain important key-value entries while reducing memory. Summary is largely title-based; details are as presented by the source and not independently verified.
-
WallZero: Mastering the Game of WallGo with Strategic Analysis
WallZero masters the board game WallGo with strategic analysis
WallGo is a recently introduced strategic board game. WallZero masters WallGo through an approach incorporating strategic analysis, demonstrating game-playing performance and strategic insights.
-
Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models
Qwen-RobotManip: alignment unlocks scale for robot manipulation models
Language and multimodal foundation models generalize by aligning heterogeneous data under a unified formulation and training at scale. This technical report investigates applying that recipe to robotic manipulation, arguing alignment unlocks scale for manipulation foundation models.
-
Environment-Grounded Automated Prompt Optimization for LLM Game Agents
Environment-grounded automated prompt optimization for LLM game agents
LLM agents in interactive environments are sensitive to prompts, yet prompt engineering stays manual and task-specific. The paper decomposes the observation-to-action pipeline and proposes an environment-grounded automated prompt optimization framework for LLM game agents.
-
From Drift to Coherence: Stabilizing Beliefs in LLMs
From drift to coherence: stabilizing beliefs in LLMs
LLMs are hypothesized to perform implicit Bayesian inference, yet the martingale property of predictive beliefs has been shown to fail in synthetic in-context learning. The paper revisits this in typical regimes such as multiple-choice QA and studies how to stabilize beliefs, moving from drift to coherence.
-
When Multiple Scripts Matter: Evaluating ASR in Clinical Settings
Evaluating ASR in clinical settings when multiple scripts matter
Automatic speech recognition in non-English clinical settings faces multiscript variability, where a term appears in multiple valid orthographies. String-matching metrics treat variants as errors and underestimate performance; the paper studies ASR evaluation when multiple scripts matter.
-
Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors
Auditing deployment-interface exposure of CLIP backdoors
CLIP models are reused across downstream interfaces including feature extraction, retrieval, reranking and selection. Existing CLIP backdoors are validated on small attack-native tasks; the paper audits how far backdoors are exposed across deployment interfaces, beyond success on the native task.
-
Toward Accessible Psychotherapy Training Using AI-Driven Interactive Patient Avatars
AI-driven patient avatars for more accessible psychotherapy training
Training psychotherapists in evidence-based interventions like Acceptance and Commitment Therapy needs repeated practice with feedback, which is limited by ethical, logistical and resource constraints. The paper introduces AI-driven interactive patient avatars to make such training more accessible.
-
Vision-language models for chest radiography do not always need the image
Medical vision-language models combine images and text for reporting. For chest radiography, the paper shows these models do not always need the image to make predictions, and discusses the implications for evaluation and clinical use.
-
EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent
EComAgentBench: shopping agents on long-horizon tasks with hidden intent
As LLM-based shopping agents reach production, existing benchmarks miss how requirements arrive: implicitly, in a profile, or only when the right question is asked. EComAgentBench evaluates shopping agents on long-horizon tasks with distributed hidden intent.
-
OpenAI's advanced AI finds 10,000 vulnerabilities at SoftBank; Masayoshi Son calls it "a grave crisis"; diagnostic service to be offered to Japan's critical infrastructure companies
SoftBank unveils OpenAI-powered Patching-as-a-Service security offering
SoftBank Group announced "Patching as a Service" on June 16, a cybersecurity offering built on OpenAI technologies such as "GPT-5.5 Cyber." It simulates attacks on corporate systems to find vulnerabilities, then proposes remediation plans and their implementation, end to end. SoftBank says it will prioritize select firms supporting Japan's critical infrastructure, while chairman Masayoshi Son stressed the gravity of the cyber threat.
-
LLMs Infer Cultural Context but Fail to Apply It When Responding
LLMs infer cultural context but fail to apply it when responding
LLMs are known to overrepresent dominant, often Western cultures while marginalizing others. The paper evaluates how this affects culturally adapted response generation, finding that models can infer cultural context but fail to apply it when responding.
-
SuCo: Sufficiency-guided Continuous Adaptive Reasoning
SuCo: sufficiency-guided continuous adaptive reasoning
SuCo is a method for sufficiency-guided continuous adaptive reasoning that adapts the amount of reasoning to what is necessary and sufficient, aiming to balance efficiency and accuracy. Summary is largely title-based; details are as presented by the source.
-
Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation
Bridging correctness and runtime efficiency in LLM code translation
LLMs have advanced the functional correctness of automated code translation, but runtime efficiency of translated programs has received little attention. As Moore's law wanes, the paper works to bridge the gap between functional correctness and runtime efficiency in LLM-based code translation.
-
From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning
From trainee to trainer: LLM-designed RL training environments
RL pipelines for LLM training often rely on manually redesigned environments between stages, forcing heuristic guesses about good configurations. The paper has the LLM itself design training environments for reinforcement learning with multi-agent reasoning, moving from trainee to trainer.
-
MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block
MambaCount: efficient open-vocabulary counting via state-space duality
Text-guided open-vocabulary object counting is hard in dense scenes with large scale variation, and existing Transformer methods are limited by quadratic complexity. MambaCount uses a spatial sparse state space duality block for efficient open-vocabulary object counting.
-
OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation
OPD-Evolver cultivates self-evolving agents via on-policy distillation
Memory is a standard substrate for self-evolving agents, but retaining experience differs from learning how to evolve through it. OPD-Evolver uses on-policy distillation to cultivate a holistic agent evolver that selects useful experience, acts on it, and writes reusable knowledge.
-
Predicting model behavior before release by simulating deployment
OpenAI unveils Deployment Simulation to predict model behavior pre-release
OpenAI introduced Deployment Simulation, a method to predict an AI model's behavior before deployment by using real conversation data to simulate responses, aiming to improve safety and evaluation accuracy. The claims are OpenAI's own and not independently verified.
-
June Framework Memory and storage pricing updates
Framework updates memory and storage pricing amid volatile market
A Framework blog post reports updated memory and storage pricing for its desktop products amid a volatile memory market. It states that the 128GB Framework Desktop has risen by about $1,660 to $4,839, compared with $2,000 at launch. The piece concerns hardware market dynamics rather than AI directly and reached the feed via lobste.rs.
-
Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
HABC: hierarchical advantage weighting for RL fine-tuning of VLAs
Online RL fine-tuning of pretrained VLA policies yields only one binary outcome per episode, yet actor updates need per-transition signals. The authors argue a single scalar conflates viability and efficiency and that mixing autonomous and intervention segments misassigns credit. Their method, Hierarchical Advantage-Weighted Behavior Cloning (HABC), trains separate critic heads for the two objectives on different data subsets.
-
Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio
A benchmark for LLM agents on Nature Portfolio meta-analyses
This work introduces a benchmark that evaluates LLM agents on meta-analysis articles from Nature Portfolio. The article excerpt was unavailable, so this summary is limited to a neutral description based on the title.
-
KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing
KVEraser edits the KV cache to erase context efficiently
Erasing a span from a long-context KV cache is costly because a local edit propagates to all later tokens, forcing recomputation of the suffix. KVEraser instead replaces only the erased interval's KV states with learned steering states while reusing the rest of the cache. A two-stage training pipeline teaches a transferable erasing mechanism for stale facts, wrong tool outputs, or prompt injections.
-
DEEPRUBRIC: Evidence-Tree Rubric Supervision for Efficient Reinforcement Learning of Deep Research Agents
DeepRubric: evidence-tree rubrics to boost deep-research agent RL
DeepRubric is a data-construction framework for RL of deep research agents that reverses the usual query-to-rubric flow. Starting from a seed topic, it builds an evidence tree to decide what an evidence-backed report should be judged on, then synthesizes aligned query-rubric pairs for more reliable reward supervision.
-
HAMON: Passive Optical Sequence Mixing for Long-Horizon Forecasting
HAMON: a passive optical core for long-horizon forecasting
HAMON is a passive diffractive optical forecasting core: history is encoded onto an optical aperture, and cascaded trainable phase masks combined with free-space diffraction shape the forecast directly in the output field. Inference is a single passive optical pass with no digital sequence-mixing layer, yet it beats strong digital baselines on ETTm2.
-
FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models
FusionRS: a large-scale RGB-infrared-text remote sensing dataset
Noting that remote-sensing vision-language models remain RGB-centric, the paper introduces FusionRS, described as the first large-scale RGB-infrared-text dataset for dual-modal learning. It is built by translating public RGB images into infrared-style counterparts and pairing each with conventional and infrared-aware captions.