Hugging Face Acquires XetHub

We are excited to announce that XetHub, a Seattle-based company, has been acquired by Hugging Face. XetHub was founded by Yucheng Low, Ajit Banerjee, and Rajat Arya, who previously worked at Apple, where they built and scaled Apple’s internal machine learning infrastructure. The mission of XetHub has always been to bring software engineering best practices to AI development.

XetHub has developed technologies that allow Git to scale to terabyte-sized repositories and enable teams to explore, understand, and collaborate on large, evolving datasets and models. The team has grown to 12 members, all of whom are joining Hugging Face. You can follow their work on the team's Hugging Face organization page: hf.co/xet-team.

Our Shared Vision at Hugging Face

Julien Chaumond, CTO of Hugging Face, shared his enthusiasm about the acquisition, stating,

“The XetHub team will help us unlock the next five years of growth for Hugging Face datasets and models by transitioning to our own, optimized version of Large File Storage (LFS) as the storage backend for the Hub’s repositories.”

When Hugging Face first built the Hub in 2020, it chose Git LFS as the storage solution. While that was a logical choice at the time, the team always knew that a more optimized storage and versioning system would eventually be necessary. Although LFS stands for Large File Storage, Git LFS wasn’t designed to handle the massive files common in AI development.

Future Use Cases and Enhancements

The integration of XetHub’s technology will bring significant improvements to how large files are handled on the Hugging Face Hub. For example, consider a 10 GB Parquet file to which you need to add just a single row. Currently, this requires re-uploading the entire 10 GB file, because Git LFS stores each version of a file as one whole object. With XetHub’s chunked files and deduplication technology, only the chunks containing the new row would need to be re-uploaded.

Another use case involves GGUF model files. Imagine a scenario where a single metadata value in the header of a Llama 3.1 405B model needs updating. In the future, users will only need to re-upload a small chunk of data, making the process far more efficient.
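
To make the idea behind both examples concrete, here is a minimal sketch of chunk-level deduplication. It is not XetHub's actual implementation: the gear-style rolling hash, the chunk-size parameters, and the function names chunk and bytes_to_upload are all assumptions chosen for illustration. The file is cut into chunks at positions determined by its content, each chunk is identified by its SHA-256 digest, and only chunks the server has not seen before count toward the upload.

```python
import hashlib
import random

# Hypothetical parameters for this sketch, not XetHub's real settings.
MASK = (1 << 12) - 1     # cut when the low 12 hash bits are zero (~4 KiB between cuts)
MIN_CHUNK = 1024         # never cut before a chunk reaches this size
MAX_CHUNK = 16 * 1024    # always cut once a chunk reaches this size

# Fixed table of random 32-bit values, one per byte value (gear-style hash).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries using a rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes are shifted out of the 32-bit state, so the hash
        # depends only on the most recent bytes, not on the file offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def bytes_to_upload(old: bytes, new: bytes) -> int:
    """Count the bytes of `new` that sit in chunks not already stored for `old`."""
    known = {hashlib.sha256(c).digest() for c in chunk(old)}
    return sum(len(c) for c in chunk(new)
               if hashlib.sha256(c).digest() not in known)


if __name__ == "__main__":
    data_rng = random.Random(1)
    original = bytes(data_rng.getrandbits(8) for _ in range(200_000))

    # Case 1: append a small "row" to the end of the file.
    appended = original + b"new-row" * 10
    # Case 2: change a single byte near the start, like a header field.
    patched = original[:10] + bytes([original[10] ^ 0xFF]) + original[11:]

    print("file size:          ", len(original))
    print("upload after append:", bytes_to_upload(original, appended))
    print("upload after patch: ", bytes_to_upload(original, patched))
```

In this sketch, appending data changes only the final chunk and whatever follows it, and editing a byte near the start changes only the first chunk, so in both cases the upload is bounded by roughly one maximum-size chunk instead of the whole file. A production system would use much larger chunks and additional indexing; the sketch only shows why the upload size tracks the size of the change.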

As AI models scale to trillions of parameters in the coming months, this new technology aims to unlock new levels of scale for both the community and enterprise companies.

Collaboration on large datasets and models also presents challenges. Questions like “How do teams work together on large data, models, and code?” and “How do users track changes in their data and models?” will be central to our ongoing efforts to find better solutions.

Current Hub Statistics

Here are some impressive current statistics from the Hugging Face Hub:

Number of repositories: 1.3 million models, 450,000 datasets, 680,000 spaces
Total cumulative size: 12 petabytes stored in LFS (280 million files) and 7.3 terabytes stored in Git (non-LFS)
Daily number of requests: 1 billion
Daily CloudFront bandwidth: 6 petabytes

With XetHub now part of the Hugging Face family, we look forward to pushing the boundaries of AI development even further.

August 8, 2024