Apriel-H1: The Surprising Key to Distilling Efficient Reasoning Models

We converted our 15B reasoning model to a Mamba hybrid achieving 2.1x throughput with minimal quality loss. The key? A non-obvious insight about what data to distill on, and why intuition fails here. When MiniMax published their M2 post-mortem in October explaining why they abandoned efficient attention at 230B scale, the narrative briefly became “efficient attention is dead.” Within days, Kimi Linear proved otherwise. The real lesson: it depends on your constraints. Our constraint was simple: we had a strong […]
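The teaser's core mechanism, teacher-student distillation, can be illustrated generically. Below is a minimal sketch of a token-level forward-KL objective in plain Python, with made-up logits; it is an illustration of the general technique, not ServiceNow's actual training code:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_kl(teacher_logits, student_logits):
    """Forward KL(teacher || student) over one vocabulary slice:
    the loss a student (e.g. a Mamba hybrid) minimizes to match
    the teacher's next-token distribution at a given position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical logits -> zero loss; diverging logits -> positive loss.
print(distill_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
print(distill_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]) > 0)  # True
```

The post's insight concerns *which data* this loss is computed on, which the sketch deliberately leaves open.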

Read more

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

While everyone (and their grandma 👵) is spinning up new ASR models, picking the right one for your use case can feel more overwhelming than choosing your next Netflix show. As of 21 Nov 2025, there are 150 Audio-Text-to-Text and 27K ASR models on the Hub 🤯 Most benchmarks focus on short-form English transcription (<30s) and overlook other important tasks, such as (1) multilingual performance and (2) model throughput, which can be a deciding factor for long-form audio like meetings […]
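Throughput for long-form audio is commonly reported as an inverse real-time factor (RTFx: audio duration divided by transcription time, as on the Open ASR Leaderboard). A quick illustration with made-up timings:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per
    second of wall-clock compute. RTFx > 1 means faster than real time."""
    return audio_seconds / processing_seconds

# A 60-minute meeting recording transcribed in 90 s of compute:
print(rtfx(3600, 90))  # 40.0 -> 40x faster than real time
```

For a one-off short clip an RTFx of 5 vs. 40 barely matters; for hours of meeting audio it is the difference between minutes and most of an hour.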

Read more

20x Faster TRL Fine-tuning with RapidFire AI

Hugging Face TRL now officially integrates with RapidFire AI to accelerate your fine-tuning and post-training experiments. TRL users can now discover, install, and run RapidFire AI as the fastest way to compare multiple fine-tuning/post-training configurations to customize LLMs, without major code changes and without bloating GPU requirements. Why this matters: when fine-tuning or post-training LLMs, teams often do not have the time and/or budget to compare multiple configs, even though doing so can significantly boost eval metrics. RapidFire AI […]

Read more

OVHcloud on Hugging Face Inference Providers 🔥

We’re thrilled to share that OVHcloud is now a supported Inference Provider on the Hugging Face Hub! OVHcloud joins our growing ecosystem, enhancing the breadth and capabilities of serverless inference directly on the Hub’s model pages. Inference Providers are also seamlessly integrated into our client SDKs (for both JS and Python), making it super easy to use a wide variety of models with your preferred providers. This launch makes it easier than ever to access popular open-weight models like gpt-oss, […]

Read more

Building Deep Research: How we Achieved State of the Art

Research agents are rapidly becoming one of the most important applications of AI. Research is a foundational knowledge-work task: collecting, reading, and synthesizing information underpins everything from writing and decision-making to coding itself. Yet human-driven research is constrained by memory, reading speed, and time. AI research agents, by contrast, can process vast amounts of information, synthesize insights instantly, and scale effortlessly. Because of this, research agents are emerging as a top use case for AI today and will soon become […]

Read more

Continuous batching

TL;DR: in this blog post, starting from attention mechanisms and KV caching, we derive continuous batching by optimizing for throughput. If you’ve ever used Qwen, Claude, or any other AI chatbot, you’ve probably noticed something: it takes a while for the first word of the response to appear, and then words appear one-by-one on your screen with (hopefully) a regular and fast-paced frequency. That’s because at the heart of it, all LLMs are just fancy next token predictors. An LLM […]
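The "fancy next-token predictor" idea can be sketched with a toy stand-in model: the prompt is processed once (the prefill that delays the first word), then one token is produced per forward pass (the streaming you see on screen). This is a toy loop, not a real LLM:

```python
def toy_model(tokens):
    """Stand-in for an LLM forward pass: returns one 'next token'.
    A real model would process the whole prompt once (prefill,
    filling the KV cache), then reuse the cache at each decode step."""
    return sum(tokens) % 50 + 1  # deterministic dummy prediction

def generate(prompt_tokens, max_new_tokens, eos=7):
    """Greedy autoregressive decoding: append each predicted token
    and feed the growing sequence back in, until EOS or the cap."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        nxt = toy_model(tokens)  # one token per pass
        if nxt == eos:
            break
        tokens.append(nxt)
        generated.append(nxt)
    return generated

print(generate([3, 9, 4], 5))  # [17, 34, 18, 36, 22]
```

Continuous batching, the subject of the post, is about keeping the GPU busy by interleaving many such loops at different stages of completion.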

Read more

Welcome FLUX.2 – BFL’s new open image generation model 🤗

FLUX.2 is the latest series of image generation models from Black Forest Labs, succeeding the FLUX.1 series. It is an entirely new model with a new architecture, pre-trained from scratch! In this post, we discuss the key changes introduced in FLUX.2, how to perform inference with it under various setups, and LoRA fine-tuning. 🚨 FLUX.2 is not meant to be a drop-in replacement for FLUX.1, but a new image generation and editing model. […]

Read more

Transformers v5: Simple model definitions powering the AI ecosystem

Transformers’ version v4.0.0rc-1, the initial release candidate for version 4, was released on November 19th, 2020. Five years later, we now release v5.0.0rc-0. Today, as we launch v5, Transformers is installed more than 3 million times each day via pip – up from 20,000/day in v4 🤯. Altogether, it has now surpassed 1.2 billion installs! The ecosystem has expanded from 40 model architectures in v4 to over 400 today, and the community has contributed more than 750,000 model checkpoints on […]

Read more

DeepMath: A lightweight math reasoning Agent with smolagents

By Intel AI Software Group. DeepMath is an aligned math reasoning agent built on Qwen3-4B Thinking and fine-tuned with GRPO (Group Relative Policy Optimization). Instead of verbose text, the model emits tiny Python snippets for intermediate steps, runs them in a secure sandbox, and folds the results back into its reasoning, reducing errors and output length. The agent is implemented using the smolagents library. We evaluate DeepMath on four math datasets: MATH500, AIME, HMMT, and HLE, and […]
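The emit-snippet/execute/fold-back pattern can be sketched without the real model. Below, a hypothetical `fake_agent_step` stands in for the LLM, returning tiny Python expressions that are evaluated and folded back into the context; this is a toy loop illustrating the pattern, not Intel's DeepMath code or the smolagents API:

```python
def fake_agent_step(context):
    """Stand-in for the LLM: returns (name, snippet) for the next
    tiny Python step, or None when reasoning is finished.
    Here it 'solves' 12 * 7 + 5 in two scripted steps."""
    if "step1" not in context:
        return "step1", "12 * 7"
    if "step2" not in context:
        return "step2", f"{context['step1']} + 5"
    return None

def solve():
    context = {}
    while (step := fake_agent_step(context)) is not None:
        name, snippet = step
        # Execute the snippet and fold the numeric result back in.
        # A real agent would run this in a proper sandbox; stripping
        # builtins here is only a gesture in that direction.
        context[name] = eval(snippet, {"__builtins__": {}})
    return context

print(solve())  # {'step1': 84, 'step2': 89}
```

The appeal of the pattern is that arithmetic is delegated to the interpreter, so the model never has to "compute" 12 × 7 in natural language.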

Read more