The NLP Cypher | 11.21.21
Hey … so have you ever deployed a state-of-the-art, production-level inference server? Don’t know how to do it?
Well… last week, Michael Benesty dropped a bomb when he published one of the first truly detailed blog posts on how to not only deploy a production-level inference API but also benchmark some of the most widely used servers, such as FastAPI and Triton, and runtime engines, such as ONNX Runtime (ORT) and TensorRT (TRT). In the end, Michael matched Hugging Face’s 1–2 ms inference latency with MiniLM on a T4 GPU. 👀
🥶🥶🥶🥶🥶🥶
Code:
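If you want a feel for the ONNX Runtime leg of that pipeline, here’s a minimal sketch: export a MiniLM encoder with torch.onnx.export and run it through an onnxruntime InferenceSession. The checkpoint name, opset, and provider list are illustrative assumptions on my part, not the exact setup from Michael’s post (his repo also covers the TensorRT and Triton configs):

```python
# Minimal sketch (assumptions noted above): export MiniLM to ONNX, then run it
# with ONNX Runtime on GPU if available. Not Michael's benchmark code.
import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed MiniLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# return_dict=False so the traced model emits a plain tuple the exporter can handle
model = AutoModel.from_pretrained(model_name, return_dict=False).eval()

# Dummy inputs define the graph; dynamic axes keep batch size / sequence length flexible.
dummy = tokenizer("warmup sentence", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "minilm.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=13,
)

# CUDAExecutionProvider targets the GPU (e.g. a T4); ORT falls back to CPU if absent.
session = ort.InferenceSession(
    "minilm.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
enc = tokenizer("ONNX Runtime inference is fast.", return_tensors="np")
outputs = session.run(
    None,
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
)
print(outputs[0].shape)  # (1, seq_len, hidden_size)
```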
Another Tutorial for Triton and Hugging Face Inference
NVIDIA’s Triton Server Update
PyTorch Lite Inference Toolkit (PyTorch-LIT): works with the Hugging Face pipeline.
Here’s an example of text generation with GPT-J (a 6-billion-parameter model):
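Below is a minimal sketch of that idea using the standard Hugging Face text-generation pipeline with GPT-J’s half-precision weights; it is the vanilla transformers route, not PyTorch-LIT’s own loading API, and it assumes a GPU with roughly 16 GB of memory for the fp16 checkpoint:

```python
# Hedged sketch: plain Hugging Face text generation with GPT-J's fp16 weights.
# This is the standard transformers path, not PyTorch-LIT's own API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# The "float16" revision roughly halves the download/memory footprint (~12 GB of weights).
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

out = generator(
    "The secret to fast transformer inference is",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
)
print(out[0]["generated_text"])
```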