Developer Tools B
-
"Record" on-screen operations → AI carries out the work: new "Record & Replay" feature in Codex
OpenAI adds 'Record & Replay' to Codex to automate recorded UI steps
OpenAI has added a new 'Record & Replay' feature to its Codex coding agent. Users record on-screen operations, and the AI then reproduces those steps to carry out the task automatically, according to ITmedia.
-
Datasette Apps: Host custom HTML applications inside Datasette
Datasette Apps lets you host custom HTML apps inside Datasette
Simon Willison introduced Datasette Apps, letting developers host custom HTML/JS applications inside a Datasette instance. The apps can read Datasette's databases, enabling lightweight, data-backed web apps served directly from the data exploration tool itself.
-
Optimal Deterministic Multicalibration and Omniprediction
A deterministic algorithm achieving optimal multicalibration
The paper presents a minimax-optimal multicalibration algorithm that outputs a deterministic predictor, resolving the open question of whether randomization is needed for optimal sample complexity. The result is extended to deterministic predictors satisfying outcome indistinguishability and omniprediction.
-
Predictability as a Fine-Grained Measure for Privacy
Privacy via predictability, a fine-grained privacy measure
The paper introduces 'privacy via predictability,' a fine-grained privacy framework that explicitly incorporates an attacker's core prior knowledge. It aims to ease the costly privacy-accuracy tradeoff imposed by the worst-case guarantees of differential privacy.
-
LedgerAgent: Structured State for Policy-Adherent Tool-Calling Agents
LedgerAgent: structured state for policy-adherent tool-calling agents
Policy-adherent tool-calling agents in customer-service domains must track task state across turns while following rules. LedgerAgent introduces structured state to help such agents stay consistent and policy-compliant.
-
SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm
SARLO-80: a worldwide 80cm slant SAR-optical dataset
Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable SAR resources are scarce. SARLO-80 provides a worldwide slant-range SAR and optical dataset at 80 cm resolution to fill this gap.
-
Sovereign Execution Brokers: Enforcing Certificate-Bound Authority in Agentic Control Planes
Sovereign Execution Brokers for agentic control planes
Autonomous agents are increasingly wired into cloud, deployment, and data-control workflows, straining production security. This work proposes sovereign execution brokers that enforce certificate-bound authority within agentic control planes.
-
Multi-LCB: Extending LiveCodeBench to Multiple Programming Languages
Multi-LCB: extending LiveCodeBench to multiple programming languages
LiveCodeBench has become a widely adopted benchmark for evaluating large language models on code. Multi-LCB extends it to multiple programming languages to assess multilingual code generation.
-
What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?
What safety-aligned LLMs learn from mixed compliance demonstrations
In-context demonstrations can jailbreak language models, but it has been unclear what safety-aligned models learn when demonstrations mix compliant and non-compliant behavior. This work analyzes that learning behavior.
-
Entropy Estimation in Multi-Qutrit Systems via Variational and Classical Neural Networks
Estimating entropy in multi-qutrit systems with VQAs and CNNs
The paper presents a systematic study of von Neumann entropy estimation in multi-qutrit quantum systems, comparing variational quantum algorithms with classical convolutional neural networks on an ideal noise-free simulator for systems of up to three qutrits.
-
Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems
Contagion Networks: evaluator bias propagation in multi-agent LLMs
When large language models act as evaluators in multi-agent systems, their systematic evaluation biases can spread through the system. This work analyzes how such evaluator bias propagates across agents.
-
Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems
Hierarchical recovery for cross-device agent systems
The paper proposes a hierarchical recovery mechanism for cross-device agent systems, moving beyond coarse-grained global replanning. It targets real-world computer-use tasks that span multiple applications and devices and must coordinate heterogeneous environments under dynamic runtime failures.
-
Optimal Order of Multi-Agent and General Many-Body Systems
Optimal order of multi-agent and general many-body systems
This paper develops a general framework for analyzing multi-agent systems with feedback loops between agents, as well as general many-body systems, and characterizes their optimal order.
-
Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users
Aligning LLMs with implicit user feedback from mouse and gaze
The paper proposes aligning large language models using implicit user signals, such as mouse and eye movements, instead of explicit human feedback. It addresses the limitation that users rarely provide explicit ratings, which makes high-quality preference data scarce for reward modeling.
-
New usage analytics and updated spend controls for enterprises
OpenAI adds usage analytics and spend controls to ChatGPT Enterprise
OpenAI introduced new usage analytics and updated spend controls for ChatGPT Enterprise, helping organizations track and manage AI costs while scaling with confidence. Admins gain visibility into per-team consumption and can set limits to optimize spend.
-
Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution
Marginal advantage accumulation for self-evolving memory agents
The paper proposes marginal advantage accumulation, a cross-batch, operation-level mechanism for memory-driven agent self-evolution. It aims to distinguish stably effective memory operations from accidental hits, addressing contradictory feedback that the same operation can receive across different batches in trace distillation.
-
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
UltraQuant: 4-bit KV caching for context-heavy agents
Context-heavy agents put unusual pressure on the key-value cache as long prefixes are reused across calls. UltraQuant applies 4-bit quantization to compress the KV cache while preserving quality.
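For readers unfamiliar with the idea, the sketch below shows what generic 4-bit quantization of a vector looks like: values are mapped to 16 integer levels and packed two per byte. This is not UltraQuant's method, which the summary does not describe; the function names `quantize_4bit` and `dequantize_4bit` and the simple min/max scaling scheme are assumptions chosen only for illustration.

```python
def quantize_4bit(values):
    """Generic asymmetric 4-bit quantization of a list of floats (illustrative only).

    Returns (packed_bytes, scale, minimum, count).
    """
    lo, hi = min(values), max(values)
    # 4 bits give 16 levels (codes 0..15); fall back to 1.0 if all values are equal.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so the codes pair up into whole bytes
    packed = bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))
    return packed, scale, lo, len(values)


def dequantize_4bit(packed, scale, lo, count):
    """Unpack two 4-bit codes per byte and map them back to approximate floats."""
    codes = []
    for byte in packed:
        codes.append(byte >> 4)      # high nibble
        codes.append(byte & 0x0F)    # low nibble
    return [lo + c * scale for c in codes[:count]]


# Hypothetical slice of a cached key vector.
cache_slice = [-1.0, -0.5, 0.0, 0.25, 1.0]
packed, scale, lo, count = quantize_4bit(cache_slice)
restored = dequantize_4bit(packed, scale, lo, count)
```

Five floats are stored in three bytes here, and each restored value differs from the original by at most half a quantization step (`scale / 2`). Practical schemes usually quantize per group or per channel so that one outlier does not stretch the range for every value.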
-
Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems
Analyzing defensive misdirection against attacks on agentic AI
Agentic AI systems increasingly rely on language-model components to interpret instructions, exposing them to attacks. This paper analyzes defensive misdirection as a countermeasure against model-guided automated attacks.
-
Fisher-Geometric Sharpness and the Implicit Bias of SGD toward Flat Minima
Fisher-geometric sharpness and SGD's implicit bias to flat minima
The paper introduces a Fisher-geometric notion of sharpness to study the implicit bias of SGD toward flat minima. It addresses the fact that standard Euclidean flatness measures, such as the trace or maximum eigenvalue of the loss Hessian, are not invariant under reparametrizations that preserve the network function.
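The non-invariance mentioned above can be seen in a two-parameter toy example, sketched here under stated assumptions: the "network" depends only on the product w1 * w2, so rescaling (w1, w2) to (a * w1, w2 / a) leaves the function unchanged. The toy loss and the helper `hessian_trace` are illustrative and are not taken from the paper.

```python
def loss(w1, w2):
    # Toy model whose output depends only on the product w1 * w2.
    return (w1 * w2 - 1.0) ** 2


def hessian_trace(f, w1, w2, h=1e-4):
    """Central finite-difference estimate of d2f/dw1^2 + d2f/dw2^2."""
    f0 = f(w1, w2)
    d11 = (f(w1 + h, w2) - 2 * f0 + f(w1 - h, w2)) / h**2
    d22 = (f(w1, w2 + h) - 2 * f0 + f(w1, w2 - h)) / h**2
    return d11 + d22


# Both points are global minima with w1 * w2 = 1, i.e. the same function.
trace_balanced = hessian_trace(loss, 1.0, 1.0)  # analytically 2 * (1^2 + 1^2) = 4
trace_rescaled = hessian_trace(loss, 2.0, 0.5)  # analytically 2 * (2^2 + 0.5^2) = 8.5
```

The two parameter settings compute the same function and reach the same loss, yet the Hessian trace roughly doubles after rescaling, so a Euclidean sharpness measure would rate the second minimum as sharper.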
-
Agentic Symbolic Search: Characterizing PDEs Beyond Hand-crafted Expressions, Meshes, and Neural Networks
Agentic symbolic search for characterizing PDE solutions
The paper proposes agentic symbolic search, an approach to characterize partial differential equation solutions through mathematical structures rather than tables of computed values. It targets the structural understanding that neither numerical simulation nor neural networks produce directly, traditionally derived by hand.
-
Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation
Repurposing a speech classifier for guided diffusion speech generation
Classifier guidance controls diffusion generation using a noise-conditioned classifier. This work repurposes an existing speech classifier to guide diffusion-based speech generation.
-
SSH-Net: A Deep Neural Network for Predicting Failure Time Distribution Functions under Competing Risks with Application to GPU Data
SSH-Net: predicting failure-time distributions under competing risks
The paper proposes SSH-Net, a deep neural network for predicting failure-time distribution functions under competing risks. It targets time-to-event modeling in complex engineering settings and is demonstrated on GPU failure data, building on the flexibility of neural networks for competing-risk prediction.
-
Evolutionary Two-Stage Hyperparameter Optimization Strategies for Physics-Informed Neural Networks
Evolutionary two-stage hyperparameter optimization for PINNs
The paper proposes evolutionary two-stage hyperparameter optimization strategies for physics-informed neural networks (PINNs). It targets PINNs' unstable convergence, training plateaus, and strong sensitivity to architectural and optimization hyperparameters arising from their highly non-convex training.
-
Interpretable Sperm Morphology Classification via Attention-Guided Deep Learning
Interpretable sperm morphology classification via attention-guided deep learning
Male infertility is a major cause of couple infertility and is often linked to abnormal sperm morphology. This work uses attention-guided deep learning for interpretable sperm morphology classification.
-
Multi-View Decompilation for LLM-Based Malware Classification
Multi-view decompilation for LLM-based malware classification
Malware analysts often inspect compiled binaries through decompiled pseudo-C when source code is unavailable. This work uses multi-view decompilation to improve LLM-based malware classification.
-
Neural network surrogates with uncertainty quantification for inverse problems in partial differential equations
NN surrogates with uncertainty quantification for PDE inverse problems
The paper develops neural network surrogates with uncertainty quantification for inverse problems in partial differential equations. It targets the inference of unknown model parameters from noisy or incomplete observations, where traditional numerical methods are costly, particularly in Bayesian settings.
-
On the Redundancy of Timestep Embeddings in Diffusion Models
Are timestep embeddings redundant in diffusion models?
The paper challenges the necessity of explicit timestep embeddings in diffusion models, which are typically used to modulate denoising across noise scales. Through empirical analysis of U-Net and Diffusion Transformer architectures, together with theoretical arguments, it examines whether these temporal signals are redundant.
-
Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach
Tackling modality imbalance in federated graph learning via synthesis
The paper addresses modality imbalance in multimodal federated graph learning with a data-synthesis-based approach. It targets two granularities of imbalance: client-level, where some clients lack entire modalities, and node-level, where individual nodes have missing modalities.
-
CRAX: Fast Safe Reinforcement Learning Benchmarking
CRAX: fast benchmarking for safe reinforcement learning
Safety is a core concern when deploying reinforcement learning agents in real-world domains. CRAX provides a framework for fast benchmarking of safe reinforcement learning methods.
-
Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation
Using a de-biased VLM 3D judge to improve single-image 3D generation
The paper presents a de-biased VLM-as-3D-judge protocol for single-image 3D generation. Building on a cross-model judge that ranks single-image-to-3D mesh quality where geometry and CLIP proxies fall short, it asks whether the judge's preferences can cheaply specialize a strong open generator, TRELLIS, on one asset class such as furniture without human labels.