Hugging Face puts squeeze on Nvidia's AI microservice play

Hugging Face this week announced HUGS, its answer to Nvidia's Inference Microservices (NIMs), which the AI repo claims will let customers deploy and run LLMs and other models on a much wider variety of hardware.

Like Nvidia's previously announced NIMs, Hugging Face Generative AI Services (HUGS) are essentially just containerized model images that contain everything a user might need to deploy the model. The idea is that rather than having to futz with vLLM or TensorRT LLM to get a large language model running optimally at scale, users can instead spin up a preconfigured container image in Docker or Kubernetes and connect to it via standard OpenAI API calls.
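To make the "standard OpenAI API calls" part concrete, here is a minimal sketch of the kind of request a client would send to a locally running container. The host, port, and `model` value are illustrative assumptions, not values from Hugging Face's documentation; the block only builds and prints the request body, and notes where an HTTP client would come in.

```python
import json

# A HUGS-style container exposes an OpenAI-compatible HTTP endpoint.
# The endpoint URL below is an assumption for illustration.
ENDPOINT = "http://localhost:8080/v1/chat/completions"

# Standard OpenAI-style chat completion payload.
payload = {
    "model": "tgi",  # placeholder model name; the served model is fixed by the container
    "messages": [
        {"role": "user", "content": "Summarize what an inference microservice is."}
    ],
    "max_tokens": 128,
}

# In practice you would POST this with any HTTP client, for example:
#   requests.post(ENDPOINT, json=payload).json()
body = json.dumps(payload)
print(body)
```

Because the wire format matches OpenAI's, existing OpenAI client libraries can usually be pointed at the container simply by overriding their base URL.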

HUGS are built around Hugging Face's open source Text Generation Inference (TGI) and Transformers frameworks and libraries, which means they can be deployed on a variety of hardware platforms, including Nvidia and AMD GPUs. Support will eventually be extended to more specialized AI accelerators, such as Amazon's Inferentia and Google's TPUs. Apparently there's no love for Intel Gaudi just yet.

Despite being based on open source technologies, HUGS, like NIMs, aren't free. Deployed in AWS or Google Cloud, they'll run you about $1 an hour per container.

For comparison, Nvidia charges $1 per hour per GPU for NIMs deployed in the cloud, or $4,500 a year per GPU on-prem. For a larger model, say Meta's Llama 3.1 405B, which spans eight GPUs, that works out to roughly $8 an hour for a NIM versus $1 an hour for a single HUGS container, making Hugging Face's offering significantly less expensive to deploy. What's more, support for alternative hardware types means customers won't be limited to Nvidia's hardware ecosystem.

Whether HUGS will be more performant or better optimized than NIMs remains to be seen.

For those looking to deploy HUGS at a smaller scale, Hugging Face will also make the images available on DigitalOcean's cloud platform at no additional cost, but you'll still have to pay for the compute.

DigitalOcean recently announced the availability of GPU-accelerated VMs based on Nvidia's H100 accelerators, which will run you between $2.50 and $6.74 per hour per GPU, depending on whether you opt for a single accelerator or sign a 12-month commitment for eight.

Finally, subscribers shelling out $20 a month per user for Hugging Face's Enterprise Hub will have the option to deploy HUGS on their own infrastructure.
In terms of models, Hugging Face is fairly conservative and focuses on some of the most popular open models, including:

  • Meta's Llama 3.1 8B, 70B, and 405B (FP8)
  • Mistral AI's Mixtral 8x7B, 8x22B, and Mistral 7B
  • Nous Research's Hermes fine-tunes of Meta's three Llama 3.1 models and Mistral AI's Mixtral 8x7B
  • Google's Gemma 2 9B and 27B
  • Alibaba's Qwen 2.5 7B

We expect Hugging Face will quickly expand support to additional models, such as Microsoft's Phi series of LLMs.

But, if paying for what essentially is a bundle of open source software and model files doesn't strike your fancy, nothing stops anyone from building their own containerized models using vLLM, Llama.cpp, TGI, or TensorRT LLM. You can find our hands-on guide on containerizing AI apps here.
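As a sketch of the DIY route, the commands below spin up an OpenAI-compatible serving container with vLLM. The image tag, model name, and port are assumptions for illustration rather than a tested recipe, and a real deployment would also need GPU drivers and, for gated models, a Hugging Face access token.

```shell
# Serve a model behind an OpenAI-compatible API using vLLM's container image.
# Image tag, model, and port are illustrative assumptions.
docker run --gpus all -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Llama-3.1-8B-Instruct

# Once the server is up, it speaks the same OpenAI-style API:
curl http://localhost:8000/v1/models
```

The result is functionally similar to a HUGS or NIM container, minus the vendor's tuning work.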

With that said, what you're really paying for with Hugging Face's HUGS, or Nvidia's NIMs for that matter, is the time and effort spent tuning and optimizing the containers for maximum performance. ®

Source: theregister.com