Training & Fine-tuning

  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    From Drift to Coherence: Stabilizing Beliefs in LLMs
    Fine-tuning Inference Reinforcement Learning Software Engineering
    LLMs are hypothesized to perform implicit Bayesian inference, yet the martingale property of predictive beliefs has been shown to fail in synthetic in-context learning. Revisiting this in typical regimes like multiple-choice QA, the paper studies how to stabilize beliefs from drift to coherence.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation
    Improving low-resource ASR via bilingual fine-tuning with language ID
    Fine-tuning Inference Speech Processing
    The study explores improving low-resource automatic speech recognition using bilingual fine-tuning combined with language identification, and evaluates the approach across languages in a cross-linguistic setting.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Developer Tools extract
    Beyond Native Success: Auditing Deployment-Interface Exposure of CLIP Backdoors
    Auditing deployment-interface exposure of CLIP backdoors
    Neural Network Reinforcement Learning
    CLIP models are reused across downstream interfaces including feature extraction, retrieval, reranking and selection. Existing CLIP backdoors are validated on small attack-native tasks; the paper audits backdoor exposure across deployment interfaces beyond native success.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    SuCo: Sufficiency-guided Continuous Adaptive Reasoning
    Fine-tuning Reinforcement Learning Software Engineering
    SuCo is a method for sufficiency-guided continuous adaptive reasoning that adapts the reasoning process to a necessary-and-sufficient extent, aiming to balance efficiency and accuracy. Summary is largely title-based; details are as presented by the source.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation
    Bridging correctness and runtime efficiency in LLM code translation
    Neural Network Retrieval-Augmented Generation (RAG)
    LLMs have advanced the functional correctness of automated code translation, but the runtime efficiency of the translated programs has received little attention. With Moore's law waning, the paper aims to close the gap between functional correctness and runtime efficiency in LLM-based code translation.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • NVIDIA Developer Blog · EN Training & Fine-tuning extract
    Fine-Tuning Biological Foundation Models with LoRA Using NVIDIA BioNeMo Recipes
    NVIDIA details LoRA fine-tuning of biological foundation models via BioNeMo
    Fine-tuning NVIDIA
    An NVIDIA developer blog post explains how to efficiently fine-tune biological foundation models—pretrained on large protein or genomic sequence corpora, such as the ESM2 protein language model—using LoRA, illustrated with the company's BioNeMo Recipes. A technical piece on applying foundation models in computational biology.
    Read original (NVIDIA Developer Blog) ↗
  • arXiv cs.CL (Computation and Language) · EN Training & Fine-tuning extract
    The Value Axis: Language Models Encode Whether They're on the Right Track
    LLMs encode a 'value axis' tracking if their strategy works
    Fine-tuning Reinforcement Learning Reinforcement Learning from Human Feedback (RLHF)
    Researchers built a 'value axis' for Qwen3-8B that captures whether its current strategy is likely to reach its goal. The axis separates high- and low-confidence rollouts, backtracking, and correct vs. corrupted code; steering it up suppresses self-correction while steering down induces exploration. DPO can raise the internal value of rewarded behaviors.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Infrastructure & Hardware extract
    Exact Posterior Score Estimation for Solving Linear Inverse Problems
    Exact closed-form posterior score for linear inverse problems
    Inference Reinforcement Learning
    The paper derives the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, showing that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot with anisotropic noise. It turns this into a training objective, Exact Posterior Score (EPS), that preserves standard denoising structure and can be trained from scratch or fine-tuned.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.LG (Machine Learning) · EN Infrastructure & Hardware extract
    Hierarchical Advantage Weighting for Online RL Fine-Tuning of VLAs from Sparse Episode Outcomes
    HABC: hierarchical advantage weighting for RL fine-tuning of VLAs
    Fine-tuning Reinforcement Learning
    Online RL fine-tuning of pretrained VLA policies yields only one binary outcome per episode, yet actor updates need per-transition signals. The authors argue a single scalar conflates viability and efficiency and that mixing autonomous and intervention segments misassigns credit. Their method, Hierarchical Advantage-Weighted Behavior Cloning (HABC), trains separate critic heads for the two objectives on different data subsets.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    KVEraser: Learning to Steer KV Cache for Efficient Localized Context Erasing
    KVEraser edits the KV cache to erase context efficiently
    Fine-tuning Reinforcement Learning
    Erasing a span from a long-context KV cache is costly because a local edit propagates to all later tokens, forcing recomputation of the suffix. KVEraser instead replaces only the erased interval's KV states with learned steering states while reusing the rest of the cache. A two-stage training pipeline teaches a transferable erasing mechanism for stale facts, wrong tool outputs, or prompt injections.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    ExpRL: Exploratory RL for LLM Mid-Training
    ExpRL uses human QA as reward scaffolds for LLM mid-training RL
    Fine-tuning Retrieval-Augmented Generation (RAG) Reinforcement Learning Software Engineering
    ExpRL is an RL-based mid-training method that uses large human-written QA corpora as reward scaffolds rather than imitation targets: reference answers are hidden from the policy and used only to build problem-specific grading rubrics for judging on-policy reasoning, automating skill acquisition for harder problems.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Safety & Evaluation extract
    Selection Without Signal, Recovery Through Expression: A Measurement Study of Post-Hoc Falsification Operators for Frozen Small Code Models
    Measurement study of post-hoc falsification operators for code models
    Fine-tuning Neural Network Retrieval-Augmented Generation (RAG)
    Per its title, this paper presents a measurement study of post-hoc 'falsification operators' applied to frozen (non-retrained) small code models, framed around selection without signal and recovery through expression. The source excerpt was unavailable, so this summary is based on the title alone.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Task-Error Residual Learning for Real-Robot Five-Ball Juggling
    Residual learning enables fast, stable real-robot five-ball juggling
    Neural Network Reinforcement Learning
    For residual learning that refines existing behavior, sample efficiency hinges on how much information each rollout returns and how efficiently it is used. A standard scalar RL reward carries less information than the directional task error that defines the task. Using directional task-error supervision and a task-error model that drives sample selection, the system achieves stable three-, four-, and five-ball juggling on Barrett WAM arms, converging from the second attempt with monotonically decreasing error.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Inference & Efficiency extract
    Exploring Extrinsic and Intrinsic Properties for Effective Reasoning with Code Interpreter
    Study probes extrinsic and intrinsic traits of code-interpreter reasoning
    Fine-tuning Inference Retrieval-Augmented Generation (RAG) Reinforcement Learning
    This paper studies reasoning with a Code Interpreter (CI) in LLMs from two angles: extrinsic properties (crucial tokens) and intrinsic properties (code-specific cognitive behaviors). It reports that stronger CI reasoning models show more crucial tokens and behaviors—especially verification, backtracking, and backward chaining—and explores leveraging these at inference and training time. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Training & Fine-tuning extract
    Speaking the Language of Science: Toward a General-Purpose Generative Foundation Model for the Natural Sciences
    LOGOS: a general-purpose generative foundation model for natural sciences
    Neural Network
    This report presents LOGOS (Language Of Generative Objects in Science), a generative language model unifying heterogeneous natural-science tasks in one autoregressive framework over a shared scientific grammar. It encodes scientific objects and their spatial contacts/constraints as discrete tokens, casting tasks as next-token prediction without explicit coordinates or geometric networks, and reportedly matches or beats domain-specific baselines. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Fantastic Pretraining Optimizers and Where to Find Them II: Hyperball Optimization
    Hyperball: an optimizer wrapper fixing Frobenius norms to speed up pretraining
    Matrix-based optimizers like Muon accelerate LLM pretraining, but their edge over AdamW shrinks at larger model and data scales under standard constant decoupled weight decay. The paper proposes Hyperball, a simple wrapper that fixes the Frobenius norms of weight matrices and their optimizer updates to constants. On Qwen3-style models up to 1.2B parameters, Muon-Hyperball reports a 20-30% token-equivalent speedup over weight-decay baselines.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • Publickey · JA New Model Releases extract
    Stack Overflow releases beta of 'Stack Overflow for Agents,' where AI agents share technical information with each other on a message board
    Stack Overflow launches 'Stack Overflow for Agents' beta
    AI Agents Machine Learning
    Stack Overflow has launched a beta of 'Stack Overflow for Agents,' a service where AI agents share technical solutions and other information on an open message board. The move appears aimed at extending its human Q&A knowledge base into information exchange among agents.
    Read original (Publickey) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Deep Q-Learning on Hölder Spaces
    Bellman-target regularity analysis motivates a tensor-product DeepONet
    Reinforcement Learning
    This work studies the operator-theoretic core of Q-learning in continuous-time stochastic control with continuous states and actions. Under uniform ellipticity and Hölder-regular coefficients, a Bellman update smooths the state while leaving Lipschitz dependence on the action, motivating a tensor-product DeepONet and yielding approximation and resource bounds with a stiffness-complexity trade-off.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN Training & Fine-tuning extract
    Robust Dual-Signal Fusion: Hybrid Neuro-Symbolic Gating with Compressed Chain-of-Thought Refinement for Irony Detection in Social Media Texts
    RDS Fusion: neuro-symbolic gating with compressed CoT for irony detection
    Fine-tuning Transformer
    An arXiv paper proposes Robust Dual-Signal (RDS) Fusion, a hybrid neuro-symbolic framework that compresses Chain-of-Thought reasoning without supervised fine-tuning to improve zero-shot irony detection. It reports evaluation on a held-out TweetEval test set (N=734). Neutral, abstract-based summary; figures are the authors' claims.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    Tying the Loop -- Tied Expert Layers in Mixture-of-Experts Language Models
    Paper: Expert Tying shares MoE expert params across layers
    DeepSeek Inference Mixture of Experts (MoE) Transformer
    An arXiv paper introduces Expert Tying, an architectural change that shares expert parameters across consecutive transformer layers while keeping independent layer-wise routing and attention, aiming to cut Mixture-of-Experts memory cost. Summarized neutrally from the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Safety & Evaluation extract
    Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models
    Triggering latent safety awareness to harden large reasoning models
    DeepSeek Fine-tuning Llama Retrieval-Augmented Generation (RAG) Reinforcement Learning from Human Feedback (RLHF)
    The paper observes that large reasoning models can recognize safety risks when re-presented with the original query alongside their own reasoning trace—a property it calls latent safety awareness. To exploit this without heavy manual annotation, it uses supervised fine-tuning to induce safe tags that trigger safety analysis.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    The Art of Mixology: Mixup-based Obfuscation for Privacy-Preserving Split Learning in Large Language Models
    MIXGUARD: mixup-based privacy for LLM split learning
    Fine-tuning
    The paper presents MIXGUARD, a mixup-based privacy-preserving split-learning framework for LLMs combining token- and representation-level obfuscation with adaptive gradient perturbation to balance utility, privacy, and efficiency. Claims reflect the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN Training & Fine-tuning extract
    Decision-Weighted Flow Matching for Contextual Stochastic Optimization
    DW-FM reweights flow matching toward decision-sensitive regions
    Computer Vision Neural Network Reinforcement Learning from Human Feedback (RLHF)
    Standard generative scenario models optimize uniform distributional fit rather than downstream decision quality. Decision-Weighted Flow Matching (DW-FM) reweights the velocity-regression objective using decision-sensitive endpoint information, linking downstream regret to pathwise velocity mismatch and providing regret-aligned objectives with guarantees.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.CL (Computation and Language) · EN New Model Releases extract
    OpenClaw-Skill: Collective Skill Tree Search for Agentic Large Language Models
    OpenClaw-Skill: collective skill tree search for LLM agents
    AI Agents Retrieval-Augmented Generation (RAG) Reinforcement Learning
    The paper proposes Collective Skill Tree Search (CSTS), a tree-search framework that automatically builds reusable skills for LLM agents via iterative collective generation and assessment across multiple models. Claims reflect the abstract.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.LG (Machine Learning) · EN New Model Releases extract
    GD$^2$PO: Mitigating Multi-Reward Conflicts via Group-Dynamic reward-Decoupled Policy Optimization
    GD²PO eases multi-reward conflicts in LLM RL via dynamic reward decoupling
    Algorithms & Theory Reinforcement Learning Reinforcement Learning from Human Feedback (RLHF)
    As LLM post-training RL uses multi-dimensional rewards, conflicting signals across reward groups can cancel out and hinder training. GD²PO decouples rewards into groups and, inspired by DAPO, dynamically filters near-zero-advantage rollouts, reducing conflicts and improving RL training efficiency.
    Read original (arXiv cs.LG (Machine Learning)) ↗
  • arXiv cs.AI (Artificial Intelligence) · EN Training & Fine-tuning extract
    Skill-to-LoRA: From Using Skills to Learning Behaviors for Token-Efficient LLM Agents
    S2L replaces runtime SKILL.md text with skill-specific LoRA adapters
    AI Agents Deep Learning Software Engineering
    The paper proposes Skill-to-LoRA (S2L), a behavior-centric representation that replaces runtime skill text—commonly distributed as SKILL.md files—with skill-specific LoRA adapters. Rather than compressing the document, S2L models the behavioral change the skill text induces, aiming at more token-efficient LLM agents.
    Read original (arXiv cs.AI (Artificial Intelligence)) ↗
  • arXiv cs.CL (Computation and Language) · EN Training & Fine-tuning extract
    SkillWiki: A Living Knowledge Infrastructure for Agent Skills
    While knowledge is managed via Wikipedia and software via GitHub, agent skills still lack infrastructure for large-scale production, governance, and evolution. SkillWiki is a living knowledge infrastructure turning heterogeneous knowledge into reusable skill assets linked to their originating evidence. It presents the full skill lifecycle, from knowledge ingestion to provenance-aware exploration, governance, and execution-driven evolution, with a live demo and source code available.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • arXiv cs.CL (Computation and Language) · EN Infrastructure & Hardware extract
    daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization
    daVinci-kernel: an RL framework co-evolving skills for GPU kernel tuning
    AI Agents Fine-tuning Reinforcement Learning
    GPU kernel optimization assumes correctness and targets execution efficiency. The authors present daVinci-kernel, an RL framework coupling skill discovery and exploitation via a dynamically evolving skill library. Three agents share one LLM backbone: a Selection Agent retrieving techniques via BM25 and LLM reranking, a Policy Agent generating CUDA/Triton kernels, and a Summary Agent distilling rollouts into reusable skills. Skills are added only after execution verification confirms speedups.
    Read original (arXiv cs.CL (Computation and Language)) ↗
  • ITmedia AI+ · JA New Model Releases extract
    ChatGPT vs. Google Search: which is more effective for learning? A study put it to the test in an 8-day experiment
    Study: does ChatGPT or Google search aid learning more? An 8-day test
    Generative AI Google GPT
    Researchers at Georgia Tech, the University of Michigan and others published a study comparing whether AI chatbots or search engines yield better learning. Over an eight-day experiment, the paper examines how generative AI shapes information seeking and learning.
    Read original (ITmedia AI+) ↗
  • Publickey · JA New Model Releases extract
    By 2027, 65% of teams coding with AI agents will no longer consider an IDE essential, Gartner predicts
    Gartner: by 2027, 65% of AI-coding teams find IDEs non-essential
    AI Agents Machine Learning
    Research firm Gartner says the enterprise AI coding-agent market has entered a new phase of growth and competitive realignment. It predicts that by 2027, 65% of teams coding with AI agents will no longer regard an IDE as essential.
    Read original (Publickey) ↗