Judge Arena: Benchmarking LLMs as Evaluators

LLM-as-a-Judge has emerged as a popular way to grade natural language outputs from LLM applications, but how do we know which models make the best judges? We’re excited to launch Judge Arena – a platform that lets anyone easily compare models as judges side-by-side. Just run the judges on a test sample and vote for the judge you agree with most. The results will be organized into a leaderboard that displays the best judges.

Read more

Introduction to the Open Leaderboard for Japanese LLMs

LLMs are now increasingly capable in English, but it’s quite hard to know how well they perform in other widely spoken languages, each of which presents its own set of linguistic challenges. Today, we are excited to fill this gap for Japanese! We’d like to announce the Open Japanese LLM Leaderboard, composed of more than 20 datasets spanning classical to modern NLP tasks, built to shed light on the underlying mechanisms of Japanese LLMs. The Open Japanese LLM Leaderboard was built by the LLM-jp, […]

Read more

Faster Text Generation with Self-Speculative Decoding

Self-speculative decoding, proposed in LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding, is a novel approach to text generation. It combines the strengths of speculative decoding with early exiting from a large language model (LLM): the same model’s early layers draft tokens, and its later layers verify them. This technique not only speeds up text generation but also achieves significant memory savings and reduces computational latency. In order to obtain an […]
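
For a concrete sense of how this looks in practice, here is a minimal sketch, assuming a LayerSkip-style checkpoint and a recent transformers release that supports the assistant_early_exit generation argument; the checkpoint name and exit layer below are illustrative, not taken from the post:

```python
# Minimal sketch of self-speculative decoding, assuming a recent transformers
# release with the `assistant_early_exit` generation argument. The checkpoint
# and exit layer are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "facebook/layerskip-llama2-7B"  # assumed LayerSkip checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The capital of Japan is", return_tensors="pt").to(model.device)
# Draft tokens with the first 4 layers, then verify them with the full model
# in the same forward pass -- no separate draft model is needed.
outputs = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```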

Read more

Letting Large Models Debate: The First Multilingual LLM Debate Competition

Over the past year, static evaluations and user-driven arenas have shown their limitations and biases. Here, we explore a novel way to evaluate LLMs: debate. Debate is an excellent way to showcase reasoning strength and language ability, and it has been used throughout history, from the debates in the Athenian Ecclesia in the 5th century BCE to today’s World Universities Debating Championship. Do today’s large language models exhibit debate skills similar to humans? Which model is currently the best at debating? What […]

Read more

Rearchitecting Hugging Face Uploads and Downloads

As part of the Hugging Face Xet team’s work to improve the Hub’s storage backend, we analyzed a 24-hour window of upload requests to better understand access patterns. On October 11th, 2024, we saw uploads from 88 countries, 8.2 million upload requests, and 130.8 TB of data transferred. The map below visualizes this activity, with countries colored by bytes uploaded per hour. Currently, uploads are stored in an S3 bucket in us-east-1 and optimized using S3 Transfer Acceleration. […]
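
For context, the upload requests counted above are the ones produced by clients such as the huggingface_hub library. A minimal sketch of one such upload, with hypothetical file and repo names:

```python
# Minimal sketch of a client-side upload to the Hugging Face Hub; the file
# and repo names are hypothetical. Each call like this produces the kind of
# upload request analyzed above.
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="model.safetensors",  # local file to upload
    path_in_repo="model.safetensors",     # destination path in the repo
    repo_id="my-user/my-model",           # hypothetical repo
)
```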

Read more

Open Source Developers Guide to the EU AI Act

Not legal advice. The EU AI Act, the world’s first comprehensive legislation on artificial intelligence, has officially come into force, and it’s set to impact the way we develop and use AI – including in the open source community. If you’re an open source developer navigating this new landscape, you’re probably wondering what this means for your projects. This guide breaks down key points of the regulation with a focus on open source development, offering a clear introduction to this […]

Read more

Investing in Performance: Fine-tune small models with LLM insights – a CFM case study

Overview: This article presents a deep dive into Capital Fund Management’s (CFM) use of open-source large language models (LLMs) and the Hugging Face (HF) ecosystem to optimize Named Entity Recognition (NER) for financial data. By leveraging LLM-assisted labeling with HF Inference Endpoints and refining data with Argilla, the team improved accuracy by up to 6.4% and reduced operational costs, achieving solutions up to 80x cheaper than large LLMs alone. In this post, you will learn: How to use LLMs for […]
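
As a rough illustration of the LLM-assisted labeling step described above, a pre-labeling pass with huggingface_hub’s InferenceClient might look like the sketch below; the endpoint URL, prompt, and label set are hypothetical assumptions, not CFM’s actual setup:

```python
# Hypothetical sketch of LLM-assisted pre-labeling for NER: an LLM hosted on
# a Hugging Face Inference Endpoint proposes entity spans that humans then
# review before fine-tuning a small model.
from huggingface_hub import InferenceClient

# Hypothetical Inference Endpoint URL; replace with a real deployment.
client = InferenceClient(model="https://my-ner-endpoint.endpoints.huggingface.cloud")

def prelabel(text: str) -> str:
    """Ask the LLM to propose entity spans for later human review."""
    prompt = (
        "Extract the financial named entities (companies, tickers, currencies) "
        "from the sentence below as a JSON list of {\"span\": ..., \"label\": ...} "
        "objects.\n\nSentence: " + text
    )
    response = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return response.choices[0].message.content

print(prelabel("Apple (AAPL) fell 2% after the latest EUR/USD move."))
```

The proposed spans would then be reviewed and corrected by annotators (e.g., in Argilla) before fine-tuning the small NER model.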

Read more

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

In the rapidly evolving landscape of large language models (LLMs), comprehensive and robust evaluation methodologies remain a critical challenge, particularly for low-resource languages. In this blog, we introduce AraGen, a generative-tasks benchmark and leaderboard for Arabic LLMs, based on 3C3H, a new evaluation measure for NLG which we hope will inspire similar work for other languages. The AraGen leaderboard makes three key contributions. The first is the 3C3H measure, which scores a model’s response and is central to this framework. […]

Read more