The NLP Cypher | 11.21.21
Hey … so have you ever deployed a state-of-the-art, production-level inference server? Don’t know how to do it?
Well… last week, Michael Benesty dropped a bomb when he published one of the first truly detailed blog posts on how to not only deploy a production-level inference API but also benchmark some of the most widely used servers, such as FastAPI and Triton, and runtime engines, such as ONNX Runtime (ORT) and TensorRT (TRT). In the end, Michael matched Hugging Face’s 1–2 ms inference latency with MiniLM on a T4 GPU. 👀
🥶🥶🥶🥶🥶🥶
Code:
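If you want a feel for the ONNX Runtime leg of that pipeline, here’s a minimal sketch: export a MiniLM encoder with torch.onnx.export and run it through an onnxruntime InferenceSession. The checkpoint name, opset, and provider list are illustrative assumptions on my part, not the exact setup from Michael’s post (his repo also covers the TensorRT and Triton configs):

```python
# Minimal sketch (assumptions noted above): export MiniLM to ONNX, then run it
# with ONNX Runtime on GPU if available. Not Michael's benchmark code.
import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

model_name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed MiniLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# return_dict=False so the traced model emits a plain tuple the exporter can handle
model = AutoModel.from_pretrained(model_name, return_dict=False).eval()

# Dummy inputs define the graph; dynamic axes keep batch size / sequence length flexible.
dummy = tokenizer("warmup sentence", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "minilm.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state", "pooler_output"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=13,
)

# CUDAExecutionProvider targets the GPU (e.g. a T4); ORT falls back to CPU if absent.
session = ort.InferenceSession(
    "minilm.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
enc = tokenizer("ONNX Runtime inference is fast.", return_tensors="np")
outputs = session.run(
    None,
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]},
)
print(outputs[0].shape)  # (1, seq_len, hidden_size)
```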
Another Tutorial for Triton and Hugging Face Inference
NVIDIA’s Triton Server Update
PyTorch Lite Inference Toolkit (PyTorch-LIT): works with the Hugging Face pipeline.
Here’s an example of text generation with GPT-J (a 6-billion-parameter model):
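Below is a minimal sketch of that idea using the standard Hugging Face text-generation pipeline with GPT-J’s half-precision weights; it is the vanilla transformers route, not PyTorch-LIT’s own loading API, and it assumes a GPU with roughly 16 GB of memory for the fp16 checkpoint:

```python
# Hedged sketch: plain Hugging Face text generation with GPT-J's fp16 weights.
# This is the standard transformers path, not PyTorch-LIT's own API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# The "float16" revision roughly halves the download/memory footprint (~12 GB of weights).
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

generator = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)

out = generator(
    "The secret to fast transformer inference is",
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
)
print(out[0]["generated_text"])
```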