Rearchitecting Hugging Face Uploads and Downloads

As part of Hugging Face’s Xet team’s work to improve Hugging Face Hub’s storage backend, we analyzed a 24-hour window of Hugging Face upload requests to better understand access patterns. On October 11th, 2024, we saw uploads from 88 countries, 8.2 million upload requests, and 130.8 TB of data transferred. The map below visualizes this activity, with countries colored by bytes uploaded per hour. Currently, uploads are stored in an S3 bucket in us-east-1 and optimized using S3 Transfer Acceleration. […]

Open Source Developers Guide to the EU AI Act

Not legal advice. The EU AI Act, the world’s first comprehensive legislation on artificial intelligence, has officially come into force, and it’s set to impact the way we develop and use AI – including in the open source community. If you’re an open source developer navigating this new landscape, you’re probably wondering what this means for your projects. This guide breaks down key points of the regulation with a focus on open source development, offering a clear introduction to this […]

Investing in Performance: Fine-tune small models with LLM insights – a CFM case study

Overview: This article presents a deep dive into Capital Fund Management’s (CFM) use of open-source large language models (LLMs) and the Hugging Face (HF) ecosystem to optimize Named Entity Recognition (NER) for financial data. By leveraging LLM-assisted labeling with HF Inference Endpoints and refining data with Argilla, the team improved accuracy by up to 6.4% and reduced operational costs, achieving solutions up to 80x cheaper than large LLMs alone. In this post, you will learn: How to use LLMs for […]

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

In the rapidly evolving landscape of large language models (LLMs), comprehensive and robust evaluation methodologies remain a critical challenge, particularly for low-resource languages. In this blog, we introduce AraGen, a generative tasks benchmark and leaderboard for Arabic LLMs, based on 3C3H, a new evaluation measure for NLG which we hope will inspire work for other languages as well. The AraGen leaderboard makes three key contributions: 3C3H Measure: The 3C3H measure scores a model’s response and is central to this framework. […]

Welcome PaliGemma 2 – New vision language models by Google

We are excited to welcome Google’s all-new vision language models, PaliGemma 2, a new iteration of PaliGemma. Like its predecessor, PaliGemma 2 uses the same powerful SigLIP for vision, but it upgrades to the latest Gemma 2 for the text decoder part. PaliGemma 2 comes with new pre-trained (pt) models, in sizes of 3B, 10B, and 28B parameters. All of them support various input resolutions: 224×224, 448×448, and 896×896. These combinations provide a lot of flexibility for different use cases, […]

Open Preference Dataset for Text-to-Image Generation by the 🤗 Community

The Data is Better Together community releases yet another important dataset for open source development. Due to the lack of open preference datasets for text-to-image generation, we set out to release an Apache 2.0 licensed dataset for text-to-image generation. This dataset is focused on text-to-image preference pairs across common image generation categories, while mixing different model families and varying prompt complexities. TL;DR? All results can be found in this collection on the Hugging Face Hub and code for pre- and […]

Use Hugging Face models with Amazon Bedrock

We are excited to announce that popular open models from Hugging Face are now available on Amazon Bedrock in the new Bedrock Marketplace! AWS customers can now deploy 83 open models with Bedrock Marketplace to build their Generative AI applications. Under the hood, Bedrock Marketplace model endpoints are managed by Amazon SageMaker JumpStart. With Bedrock Marketplace, you can now combine the ease of use of SageMaker JumpStart with the fully managed infrastructure of Amazon Bedrock, including compatibility with high-level APIs […]

LeMaterial: an open source initiative to accelerate materials discovery and research

Today, we are thrilled to announce the launch of LeMaterial, an open-source collaborative project led by Entalpic and Hugging Face. LeMaterial aims to simplify and accelerate materials research, making it easier to train ML models, identify novel materials and explore chemical spaces. ⚛️🤗 As a first step, we are releasing a dataset called LeMat-Bulk, which unifies, cleans and standardizes the most prominent material datasets, including Materials Project, Alexandria and OQMD — giving rise to a single harmonized data format with […]

Introducing the Synthetic Data Generator – Build Datasets with Natural Language

Introducing the Synthetic Data Generator, a user-friendly application that takes a no-code approach to creating custom datasets with Large Language Models (LLMs). The best part: a simple step-by-step process that makes dataset creation a non-technical breeze, allowing anyone to create datasets and models in minutes and without any code. What is synthetic data and why is it useful? Synthetic data is artificially generated information that mimics real-world data. It allows overcoming data limitations by expanding or enhancing […]
