SafeCoder vs. Closed-source Code Assistants

For decades, software developers have designed methodologies, processes, and tools that help them improve code quality and increase productivity. For instance, agile, test-driven development, code reviews, and CI/CD are now staples in the software industry. In “How Google Tests Software” (Addison-Wesley, 2012), Google reports that fixing a bug during system tests – the final testing stage – is 1000x more expensive than […]

Read more

Overview of natively supported quantization schemes in 🤗 Transformers

We aim to give a clear overview of the pros and cons of each quantization scheme supported in transformers to help you decide which one you should go for. Currently, quantized models are used for two main purposes: running inference of a large model on a smaller device, and fine-tuning adapters on top of quantized models. So far, two integration efforts have been made and are natively supported in transformers: bitsandbytes and auto-gptq. Note that some additional quantization schemes are […]
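
As a minimal sketch of the two integrations, here is roughly how a model might be loaded with each scheme; the checkpoint ids are placeholders, and the bitsandbytes and auto-gptq packages must be installed respectively:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# bitsandbytes: quantizes any checkpoint on the fly at load time,
# no calibration dataset needed.
bnb_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model id
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# auto-gptq: loads weights that were already quantized with GPTQ
# (GPTQ requires a one-time calibration pass before the weights are saved).
gptq_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",  # example of a pre-quantized checkpoint
    device_map="auto",
)
```

For the second use case, adapters can then be fine-tuned on top of the quantized model with PEFT (the QLoRA approach, in the bitsandbytes case).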

Read more

Fine-tuning Llama 2 70B using PyTorch FSDP

In this blog post, we will look at how to fine-tune Llama 2 70B using PyTorch FSDP and related best practices. We will be leveraging Hugging Face Transformers, Accelerate, and TRL. We will also learn how to use Accelerate with SLURM. Fully Sharded Data Parallelism (FSDP) is a paradigm in which the optimizer states, gradients, and parameters are sharded across devices. During the forward pass, each FSDP unit performs an all-gather operation to get the complete weights; computation is performed […]
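
As a minimal sketch of that sharding scheme (not the post's full recipe, which uses Accelerate's FSDP plugin, auto-wrap policies, and low-memory loading for 70B), wrapping a model in PyTorch FSDP looks roughly like this; the checkpoint id is a small placeholder:

```python
# Minimal FSDP sketch; launch with: torchrun --nproc_per_node=8 train.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")  # torchrun sets the required env variables
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Small placeholder model; the post fine-tunes meta-llama/Llama-2-70b-hf,
# which additionally needs low-memory loading and an auto-wrap policy.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Wrapping shards the parameters across ranks; gradients and the optimizer
# state created afterwards are sharded too. Each forward pass all-gathers
# one FSDP unit's full weights, then frees them again.
model = FSDP(model, device_id=torch.cuda.current_device())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
```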

Read more

Introducing Würstchen: Fast Diffusion for Image Generation

What is Würstchen? Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by orders of magnitude. Training on 1024×1024 images is far more expensive than training on 32×32. Usually, other works make use of a relatively small compression, in the range of 4x – 8x […]
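
As a quick usage sketch, assuming the diffusers integration and the warp-ai/wuerstchen checkpoint on the Hub (and a CUDA GPU), generating an image might look like this:

```python
import torch
from diffusers import AutoPipelineForText2Image

# Assumes the Würstchen diffusers integration and the warp-ai/wuerstchen
# checkpoint; the prompt is just an example.
pipeline = AutoPipelineForText2Image.from_pretrained(
    "warp-ai/wuerstchen", torch_dtype=torch.float16
).to("cuda")

# The text-conditional stage runs in the highly compressed latent space;
# decoding back to pixel space happens only at the end.
image = pipeline("an astronaut riding a horse, photorealistic").images[0]
image.save("wuerstchen_sample.png")
```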

Read more

Optimizing your LLM in production

Note: This blog post is also available as a documentation page on Transformers. Large Language Models (LLMs) such as GPT3/4, Falcon, and Llama are rapidly advancing in their ability to tackle human-centric tasks, establishing themselves as essential tools in modern knowledge-based industries. Deploying these models in real-world tasks remains […]

Read more

Introduction to 3D Gaussian Splatting

3D Gaussian Splatting is a rasterization technique described in “3D Gaussian Splatting for Real-Time Radiance Field Rendering” that allows real-time rendering of photorealistic scenes learned from small samples of images. This article will break down how it works and what it means for the future of graphics.
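
As a conceptual toy only (the actual method rasterizes millions of anisotropic 3D Gaussians in a tile-based GPU renderer, nothing like this snippet), the core idea of “splatting” can be illustrated by compositing a few 2D Gaussians onto an image front-to-back:

```python
# Toy illustration of splatting: draw soft Gaussian blobs with
# front-to-back alpha compositing. Not the paper's implementation.
import numpy as np

H = W = 128
ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
image = np.zeros((H, W, 3))
transmittance = np.ones((H, W))  # how much light still passes each pixel

# Each splat: center (x, y), isotropic std-dev, RGB color, opacity.
# Assume they are already sorted front (nearest) to back.
splats = [
    ((40.0, 40.0), 10.0, (1.0, 0.2, 0.2), 0.8),
    ((70.0, 60.0), 18.0, (0.2, 0.4, 1.0), 0.6),
    ((90.0, 90.0), 14.0, (0.2, 1.0, 0.3), 0.7),
]

for (cx, cy), sigma, color, opacity in splats:
    # Gaussian falloff gives each splat a soft footprint on the image plane.
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma**2))
    alpha = opacity * g
    image += (transmittance * alpha)[..., None] * np.array(color)
    transmittance *= 1.0 - alpha  # occlusion from splats already drawn
```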

Read more

Inference for PROs

Today, we’re introducing Inference for PRO users – a community offering that gives you access to APIs of curated endpoints for some of the most exciting models available, as well as improved rate limits for the free Inference API. Hugging Face PRO users now have access to exclusive API endpoints for a curated list of powerful models that benefit from ultra-fast inference powered by text-generation-inference. This is a benefit on […]
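
As a minimal sketch of calling one of these endpoints with huggingface_hub's InferenceClient (the model id below is an example; the curated list changes over time):

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # your Hugging Face access token

# Example model id; substitute any model from the curated PRO list.
output = client.text_generation(
    "Explain sharded data parallelism in one sentence.",
    model="meta-llama/Llama-2-70b-chat-hf",
    max_new_tokens=100,
)
print(output)
```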

Read more

Llama 2 on Amazon SageMaker: A Benchmark

Deploying large language models (LLMs) and other generative AI models can be challenging due to their computational requirements and latency needs. To provide useful recommendations to companies looking to deploy Llama 2 on Amazon SageMaker with the Hugging Face LLM Inference Container, we created a comprehensive benchmark analyzing over 60 different deployment configurations for Llama […]
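
As a sketch of one such deployment configuration, assuming the sagemaker Python SDK and a SageMaker execution role (the model id, instance type, and TGI settings below are illustrative examples, not the benchmark's recommendation):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # inside SageMaker; otherwise pass a role ARN

llm_model = HuggingFaceModel(
    role=role,
    # Hugging Face LLM Inference Container (powered by text-generation-inference)
    image_uri=get_huggingface_llm_image_uri("huggingface"),
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-13b-chat-hf",  # example model size
        "SM_NUM_GPUS": "4",            # tensor-parallel degree on the instance
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
        # Llama 2 is gated, so "HUGGING_FACE_HUB_TOKEN": "hf_..." is also needed.
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # example GPU instance (4x A10G)
)
```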

Read more