Infrastructure & Hardware B
-
Uncertainty Is Not a Safety Net for Clinical VQA, but Can It Anticipate Model Failure?
Uncertainty estimation fails as a safety net for clinical VQA
This arXiv paper tests whether uncertainty estimation (UE) gives clinical vision-language models a reliable trust-or-escalate signal. Benchmarking 8 methods across 12 VLMs on clinical visual question answering, the authors find that UE quality is not intrinsic to the method but tracks model accuracy, degrading exactly where performance is weakest and reliability is most needed. Under perturbations that hide the correct option, accuracy collapses while uncertainty barely changes.
-
Can LLM Coding Agents Reason About Time Series?
Can LLM coding agents reason about time series? A benchmark study
This arXiv study tests whether LLM agents can analyze time series data, which is ubiquitous in finance, healthcare, and environmental monitoring. Comparing three approaches (raw numerical data, the LLM as a coding agent, and a combination of the two), the authors find that agents with code access can outperform models processing raw data by up to 10%, though even the best agent still answers roughly 22-34% of questions incorrectly.
-
daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization
daVinci-kernel: an RL framework co-evolving skills for GPU kernel tuning
GPU kernel optimization assumes correctness and targets execution efficiency. The authors present daVinci-kernel, an RL framework coupling skill discovery and exploitation via a dynamically evolving skill library. Three agents share one LLM backbone: a Selection Agent retrieving techniques via BM25 and LLM reranking, a Policy Agent generating CUDA/Triton kernels, and a Summary Agent distilling rollouts into reusable skills. Skills are added only after execution verification confirms speedups.
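The verification gate described in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: `SkillLibrary`, `add_if_verified`, and the timing arguments are hypothetical names chosen for illustration, and the sketch assumes that correctness and runtimes have already been measured elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class SkillLibrary:
    """Hypothetical skill store; names are illustrative, not from the paper."""
    skills: list[str] = field(default_factory=list)

    def add_if_verified(self, skill: str, baseline_ms: float,
                        candidate_ms: float, outputs_match: bool) -> bool:
        # Reject kernels whose outputs differ from the reference.
        if not outputs_match:
            return False
        # Reject kernels that are not strictly faster than the baseline.
        if candidate_ms >= baseline_ms:
            return False
        self.skills.append(skill)
        return True


library = SkillLibrary()
accepted = library.add_if_verified("tile loads through shared memory", 4.0, 2.5, True)
rejected_slow = library.add_if_verified("unroll inner loop", 4.0, 4.2, True)
rejected_wrong = library.add_if_verified("skip bounds check", 4.0, 1.0, False)
```

Only the first candidate enters the library: the second is slower than the baseline and the third fails the correctness check despite being faster.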
-
Mapping SQLite result columns back to their source `table.column`
A technical note exploring how to map columns in arbitrary SQLite query results back to their originating `table.column`, so that Datasette could render queries with richer, source-aware metadata.
-
OpenAI WebRTC Audio Session, now with document context
Simon Willison adds document context to his OpenAI WebRTC audio tool
Simon Willison updated his browser tool for OpenAI's WebRTC realtime audio API. It now supports the newer realtime voice model touting GPT-5-class reasoning, and lets users paste document text as context for spoken conversations about it.
-
NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark
NVIDIA tops first agentic AI benchmark for agentic coding performance
NVIDIA reports leading agentic coding performance on the first benchmark dedicated to agentic AI, per its developer blog. The result highlights its inference stack and GPU infrastructure as a platform for autonomous coding agents.
-
AdaSR: Adaptive Streaming Reasoning with Hierarchical Relative Policy Optimization
AdaSR enables adaptive streaming reasoning for reasoning models
AdaSR moves beyond the read-then-think paradigm by letting reasoning models reason incrementally as input streams in. It uses a hierarchical relative policy optimization scheme to train streaming reasoning.
-
Flood and Harvest: The Provable Necessity of Trivia for Generating Valuable Mathematics via the Lens of Language Generation in the Limit
Why generating 'trivia' is provably necessary for valuable mathematics
As AI coupled to proof assistants generates formal mathematics at scale, a gap opens between what a checker verifies and what mathematicians value. Through the lens of language generation in the limit, the paper argues that producing trivial, peripheral statements is provably necessary to generate valuable mathematics.
-
Compressed Computation is (probably) not Computation in Superposition
Compressed Computation is probably not computation in superposition
The paper examines whether the Compressed Computation toy model is an instance of computation in superposition. It argues, based on analysis, that it probably is not.
-
Towards Direct Latent-Space Synthesis for Parallel Branches in LLM-Agent Workflows
Direct latent-space synthesis for parallel branches in LLM-agent workflows
LLMs serve as execution engines for agentic systems yet still consume context through a sequential text interface, mismatching modern structured workflows with independent parallel branches. The paper explores synthesizing such parallel branches directly in latent space.
-
Moonlight in Latent Space: Chirality and Structural Correspondence Between Beethoven's Op. 27 No. 2 and Machine Learning Mechanisms
Structural correspondence between Beethoven's Moonlight Sonata and ML
Through computational analysis, this paper argues that the three movements of Beethoven's Moonlight Sonata (Op. 27 No. 2) instantiate three distinct machine learning architectures by structural correspondence rather than mere analogy.
-
A Statistical and Machine Learning Framework for Operational Threshold Detection and Deployable Dispatch Controller Development in Hydrogen Multi-Energy Systems
ML framework for threshold detection in hydrogen multi-energy systems
The study presents a statistical and machine learning framework characterizing a hydrogen-based multi-energy system. It targets operational threshold detection and deployable dispatch controller development.
-
Realizing Native INT8 Compute for Diffusion Transformers on Consumer GPUs: A Fused INT8 GEMM Kernel for Ideogram 4.0
A fused INT8 GEMM kernel speeds diffusion transformers on consumer GPUs
Post-training INT8 quantization of diffusion transformers is often slower than FP8/NF4 on consumer Ampere GPUs. The paper presents a fused INT8 GEMM kernel for Ideogram 4.0 that realizes native INT8 speedups.
-
Cluster LOCO: Feature Importance For Interpreting Clusters
Cluster LOCO gives feature importance to interpret clusters
Clustering is widely used but its outputs are hard to interpret and audit. Cluster LOCO provides feature-importance scores to explain what distinguishes each cluster.
-
VISTA: View-Consistent Self-Verified Training for GUI Grounding
When GRPO is applied to GUI grounding, every rollout in a group is sampled from a single screenshot, so groups often come out all-failure or all-success and yield a weak learning signal. VISTA introduces view-consistent, self-verified training to stabilize GUI grounding.
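Why uniform groups give a weak signal can be seen in a minimal sketch of GRPO-style group-relative advantages. It assumes the common formulation in which each reward is normalized by its group's mean and standard deviation; `group_relative_advantages` and the `eps` value are illustrative choices, not taken from the VISTA paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group separates good rollouts from bad ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform groups give every rollout an advantage of exactly zero,
# so they contribute no policy-gradient signal.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_hit = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In the mixed group the successful rollouts receive positive advantages and the failed ones negative. In both uniform groups every advantage is zero, whether the model succeeded or failed on that screenshot.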
-
Regional Climate Model Emulation with Diffusion Approaches: What is the Added Value of Generative Machine Learning?
Added value of diffusion-based generative ML for climate model emulation
Emulators cheaply reproduce regional climate models' downscaling, linking global-model predictors to high-resolution fields. The paper assesses the added value of diffusion-based generative machine learning for regional climate model emulation.
-
Every Eval Ever: A Unifying Schema and Community Repository for AI Evaluation Results
Every Eval Ever: a unifying schema and repository for AI evaluations
AI evaluations are widely used to track progress, but inconsistencies across evaluators hinder analysis and comparison. The paper proposes a unifying schema and a community repository, Every Eval Ever, for AI evaluation results.
-
Deploy Long-Context Reasoning and Agentic Workflows with MiniMax M3 on NVIDIA Accelerated Infrastructure
NVIDIA details deploying MiniMax M3 for long-context agentic workflows
NVIDIA's developer blog explains how to deploy MiniMax M3 on NVIDIA accelerated infrastructure for long-context reasoning and agentic workflows, addressing fragmented enterprise AI pipelines spanning text, vision, and other modalities.
-
Dense Coordinate-List Fine-Tuning Induces a Controllable Interference Surface in Vision-Language Models
Dense coordinate-list fine-tuning induces a controllable interference surface
Fine-tuning vision-language models to emit dense coordinate lists improves grounding but alters how they serialize, repeat, and terminate structured output. The paper shows this induces a controllable interference surface in VLMs.
-
When the Tool Decides: LLM Agents Defer Blindly to Graph Neural Network Tools, and Stronger Backbones Defer More
LLM agents defer blindly to GNN tools; stronger backbones defer more
A growing line of work equips LLM agents with graph neural networks as callable tools. The paper finds that agents defer blindly to these GNN tools, and that stronger backbones tend to defer even more.