Hugging Face Inference-as-a-Service

Launch of Hugging Face Inference-as-a-Service powered by NVIDIA NIM, a new service on the Hugging Face Hub.

So, we can use open models with the accelerated compute platform, of NVIDIA DGX Cloud for inference serving.

Code is fully compatible with OpenAI API, allowing you to use the openai’ sdk for inference.

Note: You need access to an Organization with a Hugging Face Enterprise subscription to run Inference.

📌 So NVIDIA NIMs is an inference microservices that provide models as optimized containers — to deploy on clouds, data centers or workstations, giving them the ability to easily build generative AI applications for copilots, chatbots and more, in minutes rather than weeks.

📌 Maximizes infrastructure investments and compute efficiency. For example, running Meta Llama 3-8B in a NIM produces up to 3x more generative AI tokens on accelerated infrastructure than without NIM.

📌 NIM containers are pre-built to speed model deployment for GPU-accelerated inference and can include NVIDIA CUDA® software, NVIDIA Triton Inference Server™ and NVIDIA TensorRT™-LLM software.

📌 Many community models are available to experience as NIM endpoints on ai.nvidia. com, including Meta Llama 3.1, Databricks DBRX, Google’s open model Gemma, Meta Llama 3, Microsoft Phi-3, Mistral Large, Mixtral 8x22B and Snowflake Arctic.

Read other articles:

Intro to HuggingFace Transformers

July 30, 2024

Tags:

Hub, Hugging Face