We’re excited to share a few new programs launching in our open science community—each designed to support curiosity, collaboration, and deeper dives into key ML topics:
🚀 ML Understanding – Break down complex machine learning ideas, one concept at a time.
🏢 ML Industry – Explore how ML is being applied in the real world, from early-stage startups to large-scale deployments.
🔍 Retrieval & Search – Dive into the systems that help us find what we’re looking for—fast and smart.
🤖 ML Agents – Investigate intelligent agents and their role in interactive, decision-making environments.
📚 In addition to these new programs, our long-standing Beginners in Research Driven Studies (BIRDS) group has launched its latest sprint, designed to help beginners become familiar with NLP. Topics will include tokenization, text classification, neural networks, embeddings, and more.
We’re looking forward to connecting with many of you at ACL 2025 in Vienna, Austria. Several members of our research team will be attending and would love to discuss our recent work.
Check out Papers in the Park! It’s been great to see community members in Toronto gathering to review research papers over pizza.
If you're based in the Toronto area, you can join them: the group will be meeting throughout the summer for more paper discussions and in-person connection.
Can smarter training reduce our reliance on prompt engineering? In our work “Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers” we introduce a framework that enriches training data with fine-grained markers—like task, domain, and length—to improve controllability and performance, especially on underrepresented use cases. In evaluation, this approach achieved a 5.7% lift in Arena-Hard-Auto win rate, with gains of over 9% in underrepresented domains, 14.1% on tasks like CodeRepair, and 35.3% on length-following instructions. Led by: Daniel D'souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, and Sara Hooker.
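For a rough sense of how training-time markers work, here is a minimal sketch that prefixes training examples with metadata tags; the tag format and field names are illustrative assumptions, not the paper’s exact implementation.

```python
# Illustrative sketch of the training-time marker idea: prefix each training
# example with fine-grained metadata (task, domain, target length) so the same
# tags can be used to steer the model at inference time. The tag format and
# field names here are hypothetical, not the paper's exact scheme.

def add_markers(example: dict) -> dict:
    """Prepend fine-grained markers to a supervised training example."""
    markers = (
        f"<task={example['task']}>"
        f"<domain={example['domain']}>"
        f"<length={example['target_length']}>"
    )
    return {
        "prompt": f"{markers} {example['prompt']}",
        "completion": example["completion"],
    }

example = {
    "task": "code_repair",
    "domain": "software",
    "target_length": "short",
    "prompt": "Fix the off-by-one error in this loop: ...",
    "completion": "...",
}

print(add_markers(example)["prompt"])
# At inference, setting these markers explicitly lets you target
# underrepresented tasks, domains, or length requirements directly.
```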
Can we boost LLM performance at inference without heavy sampling or specialized reward models? Our latest work “When Life Gives You Samples: The Benefits of Scaling up Inference Compute for Multilingual LLMs” introduces LLMonade, a new inference-time recipe for multilingual settings. It starts with Hedged Sampling, which blends deterministic and stochastic strategies to improve multilingual output—averaging a +7.6% gain over single-sample baselines. Next, our CHOPS and X-MBR methods enhance selection efficiency, with X-MBR achieving a +12% gain using just five samples. Together, these techniques show that smart sampling and selection—using generalist LLMs as judges—can deliver strong results with minimal compute. Led by: Ammar Khairi, Daniel D'souza, Ye Shen, Julia Kreutzer, and Sara Hooker.
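As a rough illustration of the recipe, the sketch below draws a small candidate pool with a mix of greedy and temperature sampling, then picks a winner with a simple MBR-style utility. The `generate` and `similarity` functions are placeholders, and the code is an approximation of the approach rather than the authors’ implementation.

```python
import random

# Minimal sketch of inference-time scaling for multilingual LLMs: sample a
# small candidate pool with a mix of deterministic (temperature 0) and
# stochastic draws, then select the best candidate with a simple
# agreement-based utility in the spirit of MBR. `generate` and `similarity`
# are stand-ins for a real model and scoring function.

def generate(prompt: str, temperature: float) -> str:
    # Placeholder for a call to a multilingual LLM.
    return f"candidate(t={temperature:.1f}, r={random.random():.3f})"

def similarity(a: str, b: str) -> float:
    # Placeholder utility; in practice this could be an LLM judge or an
    # embedding / metric-based score.
    return -abs(len(a) - len(b))

def hedged_sample(prompt: str, n: int = 5) -> list[str]:
    """One greedy draw plus (n - 1) stochastic draws."""
    temps = [0.0] + [0.7] * (n - 1)
    return [generate(prompt, t) for t in temps]

def mbr_select(candidates: list[str]) -> str:
    """Pick the candidate that agrees most with the rest of the pool."""
    def score(c: str) -> float:
        return sum(similarity(c, other) for other in candidates if other is not c)
    return max(candidates, key=score)

pool = hedged_sample("Translate to Yoruba: 'Good morning'", n=5)
print(mbr_select(pool))
```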
Most pretrained models struggle to adapt to new languages, as English and a few widely spoken languages dominate pretraining data. In this work, “One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers,” we ask: what can be done before pretraining to improve language plasticity—the ability to learn new languages later? We explore a universal tokenizer designed to improve adaptability without sacrificing pretraining performance. This approach leads to 2× better language adaptation using just 1/8 of the data. Even for completely unseen, low-resource languages, the model performs up to 5% better—highlighting a more efficient and inclusive path to multilingual AI. Led by: Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker.
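To make the idea concrete, here is a minimal sketch of training one shared tokenizer across several languages with the Hugging Face tokenizers library; the toy corpus, vocabulary size, and BPE setup are illustrative assumptions and do not reflect the paper’s actual tokenizer design.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Toy multilingual corpus standing in for text drawn from many more languages
# than the pretraining mixture itself covers.
corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "El zorro marrón salta sobre el perro perezoso.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Le renard brun rapide saute par-dessus le chien paresseux.",
    "Hızlı kahverengi tilki tembel köpeğin üzerinden atlar.",
]

# A single BPE tokenizer trained across all languages at once, so its
# vocabulary is not biased toward English alone.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Later adaptation to a new language reuses this shared vocabulary rather than
# retrofitting an English-centric one.
print(tokenizer.encode("Der faule Hund schläft.").tokens)
```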
How much does English-only work dominate LLM safety research relative to other languages? By a huge margin, unfortunately, and the language gap is growing: even high-resource non-English languages receive minimal attention. Non-English languages are rarely studied in their own right, and to make matters worse, English-language safety research exhibits poor language documentation practices. In “The State of Multilingual LLM Safety Research: From Measuring the Language Gap to Mitigating It” we present a comprehensive survey of this language gap and make several recommendations for *ACL venues as well as for future multilingual LLM safety research. Led by Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer.
This month we are spotlighting Andrej Jovanović, one of our leads for the ML Theory Program! Andrej is a research assistant with the CaMLSys group at the University of Cambridge, where he focuses on compression and efficiency in large-scale distributed (federated) optimisation.
And if you want to see past sessions from our community events, check out our Community Talks video playlist!
We're excited to launch the Cohere Labs Open Science Community Summer School, a learning initiative featuring some of the leading minds in machine learning from INRIA, Meta (FAIR), Google DeepMind, Cohere Labs, and more. These speakers have shaped the field across areas like foundation models, retrieval, multimodal learning, and AI for social impact. Now, they’re coming together to share their insights with you!