Developer Tools B
-
A Hybrid LSTM–Vision Transformer Architecture for Predicting HRRR Forecast Errors
Hybrid LSTM–Vision Transformer predicts HRRR forecast errors
Forecast errors in high-resolution numerical weather prediction such as HRRR often stem from unresolved planetary boundary layer processes, convection, and terrain-induced circulations. This work uses a hybrid LSTM–Vision Transformer architecture to predict HRRR forecast errors from vertically structured features.
-
Sumi: Open Uniform Diffusion Language Model from Scratch
Sumi: an open uniform diffusion language model from scratch
Diffusion models are a promising alternative to autoregressive ones, and uniform diffusion language models (UDLMs) let any token be updated at any step. This work releases Sumi, an open uniform diffusion language model built from scratch, supporting research and reproducibility in diffusion LMs.
-
G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment
G-IdiomAlign: a gloss-pivoted cross-lingual idiom benchmark
Idioms resist literal cross-lingual mapping because they are non-compositional. G-IdiomAlign anchors each idiom to an English Wiktionary gloss and adds a high-confidence reference alignment set. Two protocols (multiple-choice idiom equivalence and gloss-contrastive generation) isolate the effect of explicit glosses.
-
ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection
ThinkDeception: interpretable multimodal deception detection via RL
Existing multimodal deception detection relies on end-to-end black boxes that offer no transparent reasoning. ThinkDeception is a progressive reinforcement learning framework that explicitly captures subtle cross-modal cues and produces interpretable reasoning trajectories for deception detection.
-
Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering
Direct timestep embedding and contrastive alignment for time-series QA
Time-series question answering casts analysis as natural-language QA. Instead of tokenizing the series, this work embeds timesteps directly and uses contrastive alignment to match language representations, avoiding the information loss of tokenization.
-
Mitigating Scoring Errors and Compensating for Nonverbal Subtests in Speech-Based Dementia Assessment
Mitigating scoring errors in speech-based dementia assessment
Early detection of cognitive impairment relies on neuropsychological tests whose scoring is subjective. This work mitigates scoring errors and compensates for nonverbal subtests in speech-based dementia assessment, aiming for more objective and reliable screening.
-
A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI
A controlled benchmark of quantum-latent GAN augmentation for brain MRI
Medical image classification is constrained by limited labeled data. This paper builds a controlled benchmark evaluating quantum-latent GAN data augmentation for brain MRI classification, measuring its effect under standardized conditions.
-
GraphPO: Graph-based Policy Optimization for Reasoning Models
GraphPO: graph-based policy optimization for reasoning models
Reinforcement learning with verifiable rewards has become standard for reasoning models. GraphPO introduces a graph-based policy optimization method that exploits structure across reasoning steps to improve reasoning performance.
-
RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
RTSGameBench: an RTS benchmark for strategic reasoning by VLMs
Modern vision-language models struggle with strategic reasoning. RTSGameBench uses real-time strategy games to benchmark VLMs on planning and situational judgment, probing their strategic reasoning abilities.
-
As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language
Can LLMs interpret negation in figurative language?
Figurative language and negation both challenge current language models. This study assesses how well large language models interpret negation embedded in figurative expressions, revealing model limitations where the two phenomena intersect.
-
Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction
Learning robust pair confidence for multimodal emotion-cause extraction
Multimodal emotion-cause pair extraction requires reliable pairing of emotions and their causes. This work learns robust pair confidence, yielding emotion-cause extraction that is more resilient to noise and ambiguity.
-
Improving Medical Communication using Rubric-Guided Counterfactual Recommendations
Rubric-guided counterfactual recommendations for medical communication
Text-based telemedicine increasingly relies on lightweight patient feedback. This work improves medical communication using rubric-guided counterfactual recommendations, enhancing the quality of patient-clinician interactions.
-
The State of Fable, The Jailbreak Problem, SpaceX Acquires Cursor
Stratechery on Fable's state, jailbreaks, and SpaceX buying Cursor
A Stratechery column by Ben Thompson on three topics: the state of Anthropic's Fable model, the AI jailbreak problem, and SpaceX's acquisition of Cursor. Thompson argues the administration is likely wrong about Fable but that responsibility ultimately lies with Anthropic. Views are the author's; deal specifics are unverified.
-
Efficient Financial Language Understanding via Distillation with Synthetic Data
Efficient financial language understanding via distillation with synthetic data
Large instruction-following models are powerful but costly to deploy, especially in finance. This work distills capabilities using synthetic data to build lightweight models that understand financial language efficiently.
-
Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining
Aligning implied statements for generalizable implicit hate detection
Classifying implicit hate speech is hard because intent is rarely explicit. This work aligns implied statements and applies context-bounded semi-hard negative mining to improve the generalizability of implicit hate speech detection.
-
ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement
ScholarSum: student-teacher abstractive summarization with KG reasoning
Abstractive summarization enables efficient understanding. ScholarSum combines a student-teacher framework with knowledge-graph reasoning and reflective refinement to produce summaries with improved factuality and coherence.
-
Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning
Beyond reward engineering: a data recipe for long-context RL
Long-context reasoning is essential for large language models. Rather than relying on reward engineering, this work presents a data recipe for long-context reinforcement learning that drives effective training.
-
Cursor announces Git hosting service "Origin" right after the SpaceX acquisition announcement
Cursor unveils 'Origin' Git hosting, seen as a GitHub rival
Cursor, the AI coding tool, announced 'Origin', a Git hosting service that the article frames as aimed at rivaling GitHub. The reveal reportedly came right after news of SpaceX acquiring Cursor. Acquisition terms and Origin's features are article-based, and third-party verification is unconfirmed.
-
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports
Beyond scalar scores: LLM-based metrics for radiology report significance
Reliable evaluation of generated radiology reports requires strict clinical validity. Going beyond scalar scores, this work explores LLM-based metrics for clinical significance evaluation, assessing report quality in clinically meaningful terms.
-
HandwritingAgent: Language-Driven Handwriting Synthesis in Scalable Vector Space
HandwritingAgent: language-driven handwriting synthesis in vector space
Emulating natural handwriting styles remains an open problem. HandwritingAgent synthesizes handwriting in a scalable vector space from language-driven instructions, enabling generation of diverse, resolution-independent handwriting styles.
-
RedactionBench
RedactionBench: a benchmark for redacting sensitive information
Large language models are increasingly applied to sensitive domains. RedactionBench evaluates how well models redact sensitive information in such settings, supporting verification toward safer deployment.
-
Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation
Improving long-document retrieval with chunk evidence aggregation
Dense retrieval matches one query vector against one document vector, but long documents get lost in a single vector. This work splits documents into chunks and aggregates per-chunk evidence to improve long-document retrieval.
-
LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment
LLMs struggle to measure item discrimination in reading assessment
Item discrimination is a fundamental psychometric property that distinguishes students of different proficiency. This study shows that large language models struggle to measure item discrimination in reading comprehension assessment, exposing limits of automated evaluation.
-
Attention as Frustrated Synchronization
Attention as frustrated synchronization
A network of oscillators that synchronizes perfectly computes nothing. This work frames attention as frustrated synchronization, offering a physics-inspired view that interprets the workings of attention through partial, non-trivial synchronization.
-
Hitachi ramps up its collaboration with OpenAI: legacy system modernization with "Codex", plus cyber defense
Hitachi deepens OpenAI tie-up, using Codex to modernize legacy systems
Hitachi is expanding its partnership with OpenAI, pairing the code-analysis AI "Codex" with its own systems-development expertise. It aims to establish an AI-driven workflow that runs from visualizing upstream specifications out of existing code through to migration testing, and also cites cybersecurity defense as a use case.
-
SpaceX to acquire AI coding tool "Cursor" for 9.6 trillion yen; "major improvements soon"
SpaceX reported to acquire AI coding tool Cursor for 9.6 trillion yen
SpaceX is reported to be acquiring the AI coding tool "Cursor" for 9.6 trillion yen. Cursor said on its official X account that "major improvements are coming soon," according to the article. Deal details and the headline figure are based on the report and remain unverified by third parties.
-
GrapheneOS has been ported to Android 17
GrapheneOS ported to Android 17, releases coming soon
A forum post reports that the privacy-focused mobile OS GrapheneOS has been ported to Android 17, with official releases said to be coming soon. Porting details are based on the community announcement.
-
Variable-Width Transformers
Variable-width transformer cuts FLOPs ~22% via x-shaped layer widths
The paper proposes an x-shaped transformer that keeps early and late layers wide while narrowing the middle, using a parameter-free residual resizing mechanism. Across dense 200M-2B and 3B MoE decoder-only models it outperforms parameter-matched uniform baselines and reduces FLOPs by about 22% under loss-matched scaling, with smaller KV cache.
-
ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
ReproRepo scales reproducibility audits using GitHub repo issues
Reproducing results from papers and code is central to science but existing benchmarks are hard to scale. ReproRepo leverages GitHub repository issues to evaluate, at scale, how well LLM agents can assist with reproducibility tasks, addressing the manual effort that limits prior reproducibility benchmarks.
-
EvolveNav: Proactive Preflection and Self-Evolving Memory for Zero-Shot Object Goal Navigation
EvolveNav: a self-evolving framework for zero-shot object-goal navigation
The paper proposes a self-evolving zero-shot object-goal navigation framework that builds an agentic rule memory by extracting actionable knowledge from past trajectories and uses a retrieval strategy to enable continuous test-time improvement.