-
P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs
P3B3: a benchmark for Portuguese variety bias in LLMs
The paper introduces P3B3, an expert-curated benchmark and framework for measuring European versus Brazilian Portuguese variety bias in LLMs. It reports that most models lean strongly toward pt-BR and argues for more balanced multilingual representation.
-
Automated jailbreak attack targeting multiple defense strategies
UNIATTACK: a defense-oriented framework for automated black-box LLM jailbreaks
The paper presents UNIATTACK, an adversarial testing framework that systematically builds effective black-box attack prompts on LLMs from a defense-oriented perspective. Unlike static templates or model-specific tuning, it extracts minimal but high-impact features from diverse existing attacks and optimizes them.
-
Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents
Paper on evaluator preference collapse in self-evolving agents
An arXiv paper reportedly examining preference collapse in multimodal evaluators and its cross-modal contagion within self-evolving agent systems. The source excerpt was unavailable (content filter), so this summary is based on the title only; see the original for methods and findings.
-
SCAR: Semantic Continuity-Aware Retrieval for Efficient Context Expansion in RAG
SCAR: semantic continuity-aware retrieval for RAG context expansion
Note: the abstract was unavailable, so this is summarized neutrally from the title alone. The paper proposes SCAR, a 'semantic continuity-aware retrieval' method aimed at efficient context expansion in retrieval-augmented generation (RAG). Specific mechanisms and evaluation results cannot be confirmed from the title.
-
Islamic Large Language Models: From Knowledge Acquisition to Trustworthy and Hallucination-Resistant AI
Survey reviews Islamic LLMs and trustworthy, hallucination-resistant AI
This survey reviews the emerging field of Islamic LLMs and trustworthy Islamic AI, spanning Arabic NLP, Qur'anic question answering, knowledge benchmarks, retrieval-augmented generation, and legal reasoning. It argues that Arabic fluency alone is insufficient, and that reliable systems need curated sources, verification modules, and citation-aware generation.
-
How Far Can Machine Translation Quality Take You? Extrinsic Discourse Evaluation in Goal-Oriented Setups
Extrinsic discourse evaluation of machine translation quality
This arXiv paper argues that standard machine-translation (MT) metrics assess quality intrinsically and miss the downstream consequences of translation errors. Under a static regime, the authors propose an entity-counting task that probes referential consistency and show that high intrinsic MT quality does not reliably predict downstream discourse success. Under an interactive regime, they use the goal-oriented multi-agent Welfare Diplomacy game as a probe.
-
Can LLM Coding Agents Reason About Time Series?
Can LLM coding agents reason about time series? A benchmark study
This arXiv study tests whether LLM agents can analyze the time series data that is ubiquitous in finance, healthcare, and environmental monitoring. Comparing three approaches (raw numerical data, the LLM as a coding agent, and a combination of the two), the authors find that agents with code access can outperform models processing raw data by up to 10%, though even the best agent still answers roughly 22-34% of questions incorrectly.
-
Cohere triples UK footprint with new London office to support R&D growth
Cohere triples its UK footprint with a new London R&D office
Cohere announced it will move to a larger London office at 100 New Oxford Street, nearly tripling its UK footprint. The expansion backs the city's AI talent and R&D base and supports growing demand for secure, enterprise-grade sovereign AI across the UK and Europe.
-
How to Earn a Billion Dollars
Paul Graham essay lays out how to build a billion-dollar fortune
In a new essay, Y Combinator co-founder Paul Graham examines how to earn a billion dollars, arguing the most reliable path is founding a fast-growing startup that makes something people genuinely want. He frames wealth creation in terms of scale, market choice, and entrepreneurial ambition, in his characteristically plain-spoken style.
-
Quoting Andrew Singleton
Simon Willison shares a quote from Andrew Singleton
Simon Willison's blog features a quotation from Andrew Singleton. Based on its tags, the post touches on Meta, software engineering, and neural networks; see the original post for the full quote and context.
-
Which Directions Matter? Sparse Design for Affine Robust Optimization
Sparse design identifies which directions matter in robust optimization
The work studies which uncertainty directions a model must cover in affine robust optimization defined by a finite dictionary and budget. It proposes a sparse design that selects the directions that matter.
-
olmo-eval: An evaluation workbench for the model development loop
AllenAI releases olmo-eval, evaluation workbench for model dev loop
Allen Institute for AI published olmo-eval, an evaluation workbench for the model development loop. The tool appears to support continuous evaluation of models during training, building on the team's OLMo open-model development work.
-
Rethinking Global Average Pooling: Your Classifier Is Secretly a Multi-Instance Learner
Rethinking GAP: your classifier is secretly a multi-instance learner
Modern image classifiers widely use global average pooling (GAP) followed by a linear head. The paper shows this linearity makes GAP-based classifiers behave as multi-instance learners, prompting a rethink of global average pooling.