Hugging Face Moves from Git LFS to Xet Storage

Hugging Face is replacing Git LFS with Xet storage on its Hub to improve AI model distribution.
Xet offers faster downloads and uploads by serving only necessary data chunks and reusing existing ones.
The new system significantly reduces storage costs through intelligent deduplication, storing identical file chunks only once.
Xet’s content-addressable nature enhances caching and verification speed.
The migration is largely transparent to users, maintaining existing workflows and commands.
This shift aims to lower barriers to entry for AI researchers and developers by making model access faster and cheaper.

Downloading large AI models from Hugging Face’s Hub can be a slow and frustrating experience. Files may take a long time to transfer, processes can stall, and users might have to restart downloads. This is about to change as Hugging Face transitions its Hub storage from Git LFS to a new system called Xet. This technical shift offers significant benefits, including faster downloads, reduced storage usage, and lower costs for AI model distribution.

The Challenges of Using Git LFS for Large AI Files

Git is excellent for tracking code changes, but it struggles with the massive binary files characteristic of AI models. Git LFS (Large File Storage) was developed as an add-on to handle these large files by storing them separately with pointers in the main repository. However, this approach has limitations.
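To make the pointer idea concrete, here is a minimal Python sketch that builds the small text stub Git LFS commits in place of a large file. The three-line layout follows the published Git LFS pointer format; the function name `make_lfs_pointer` and the sample bytes are illustrative assumptions, not part of any tool.

```python
import hashlib


def make_lfs_pointer(content: bytes) -> str:
    """Build the text of a Git LFS pointer for the given file content."""
    oid = hashlib.sha256(content).hexdigest()  # hash of the WHOLE file
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


weights = b"\x00" * 1024  # stand-in for a multi-gigabyte model file
pointer = make_lfs_pointer(weights)
print(pointer)
```

Because the `oid` is a hash of the entire file, changing a single byte yields a different pointer and a new full-size object on the server.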

Speed Issues with Git LFS Downloads

When cloning a repository that uses Git LFS, users download both the code and the large model files for the revision they check out. Because Git LFS treats each file as a single object, any change to a multi-gigabyte file, however small, means transferring the whole file again. It’s like having to pull out an entire drawer to get a single document.

High Costs and Wasted Storage

Hosting files through Git LFS means paying for both storage and data transfer. For a platform like Hugging Face, which hosts a very large number of models and datasets, these costs escalate quickly. Furthermore, Git LFS deduplicates only whole files: two uploads share storage only if they are byte-for-byte identical. A fine-tuned model that differs slightly from its base is stored as a complete separate copy, which wastes significant space.

Frustrating User Experience

For researchers and developers who frequently download multiple models for testing, the slow speeds and potential connection drops with Git LFS create a frustrating user experience. The Hugging Face team recognized the need for a more scalable solution.

Introducing Xet Storage: A Smarter Approach

Xet is a storage system built by the XetHub team, which Hugging Face acquired in 2024, and it has since been adapted to the Hub’s needs. It uses content-addressable storage, meaning each piece of data is identified by a digital fingerprint (a hash) computed from its content.
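A toy example may help show what "identified by its content" means in practice. The sketch below is not Xet's implementation; the class name `ContentStore` and the choice of SHA-256 as the fingerprint are assumptions made for illustration.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is the SHA-256 of the value."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # identical content maps to one entry
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
key_a = store.put(b"layer-0 weights")
key_b = store.put(b"layer-0 weights")  # same bytes, so the same fingerprint
key_c = store.put(b"layer-1 weights")
print(key_a == key_b, len(store))
```

The caller never chooses a name for the data. The address falls out of the bytes themselves, which is what makes duplicates detectable.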

How Xet’s Deduplication Works

Unlike Git LFS, Xet breaks files into smaller chunks. These chunks are stored based on their fingerprints. If a chunk already exists in the system, Xet only stores a reference to it, avoiding redundant copies. This deduplication process significantly reduces storage requirements. For example, if two books share the same first chapter, Xet stores that chapter only once.
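The two-books example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: chunks are cut at a fixed size of 4 bytes and fingerprinted with SHA-256, and the helper names (`split_chunks`, `store_file`, `load_file`) are hypothetical. Production systems generally derive chunk boundaries from the content itself, so that inserting a few bytes does not shift every later chunk.

```python
import hashlib

CHUNK_SIZE = 4  # tiny on purpose; a real system would use far larger chunks


def split_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_file(data: bytes, chunk_store: dict) -> list:
    """Store a file as chunks; return the ordered list of chunk fingerprints."""
    manifest = []
    for chunk in split_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()
        chunk_store.setdefault(digest, chunk)  # known chunk: keep a reference only
        manifest.append(digest)
    return manifest


def load_file(manifest: list, chunk_store: dict) -> bytes:
    """Rebuild a file by concatenating its chunks in manifest order."""
    return b"".join(chunk_store[d] for d in manifest)


chunk_store = {}
book_a = b"CHAPTER1" + b"AAAABBBB"  # shared first chapter, then unique text
book_b = b"CHAPTER1" + b"CCCCDDDD"
manifest_a = store_file(book_a, chunk_store)
manifest_b = store_file(book_b, chunk_store)
print(len(manifest_a) + len(manifest_b), "chunk references,", len(chunk_store), "chunks stored")
```

Each book is described by four chunk references, but the two chunks of the shared chapter are stored once, so the store holds six chunks rather than eight.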

Content Addressability and Efficiency

Because Xet stores data by its content hash, it enables fast caching and verification. The system is designed to be efficient, akin to a smart library that stores each paragraph only once and tracks which books use it.
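One way to see why content hashes help with both caching and verification is the sketch below. It assumes SHA-256 fingerprints and uses a hypothetical `fetch_remote` callback in place of a real network request; none of these names come from Xet itself.

```python
import hashlib
from typing import Callable


def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def fetch_verified(digest: str, cache: dict,
                   fetch_remote: Callable[[str], bytes]) -> bytes:
    """Return the chunk for `digest`, using the cache when possible."""
    if digest in cache:
        return cache[digest]  # a content-addressed entry can never be stale
    data = fetch_remote(digest)
    if fingerprint(data) != digest:  # verification is just re-hashing
        raise ValueError(f"chunk {digest[:8]} failed verification")
    cache[digest] = data
    return data


remote = {fingerprint(b"tensor shard"): b"tensor shard"}
calls = []


def fake_remote(digest: str) -> bytes:
    calls.append(digest)  # record each simulated network round trip
    return remote[digest]


cache = {}
wanted = fingerprint(b"tensor shard")
first = fetch_verified(wanted, cache, fake_remote)
second = fetch_verified(wanted, cache, fake_remote)  # served from the cache
print(len(calls))
```

Since the key determines the content, a cached chunk never needs revalidation against the server, and a corrupted transfer is caught by a single hash comparison.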

Intelligent Compression and Versioning

Xet also handles compression and versioning efficiently. When a new version of a model is uploaded, only the chunks that differ from those already stored need to be added, which keeps space usage low as models evolve.
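A rough sense of the savings for versioned models can be had from the following sketch. It assumes fixed 8-byte chunks and SHA-256 fingerprints, and the functions `chunk_digests` and `bytes_added` are hypothetical helpers rather than anything Xet exposes.

```python
import hashlib

CHUNK_SIZE = 8  # illustrative and deliberately tiny


def chunk_digests(data: bytes) -> dict:
    """Map each chunk's SHA-256 fingerprint to its bytes."""
    chunks = (data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE))
    return {hashlib.sha256(c).hexdigest(): c for c in chunks}


def bytes_added(previous: bytes, updated: bytes) -> int:
    """Bytes of new chunk data needed to store `updated` on top of `previous`."""
    known = chunk_digests(previous)
    new = chunk_digests(updated)
    return sum(len(c) for d, c in new.items() if d not in known)


# Version 1: ten distinct chunks. Version 2: only the last chunk differs.
v1 = b"".join(bytes([i]) * CHUNK_SIZE for i in range(10))
v2 = v1[:9 * CHUNK_SIZE] + b"\xff" * CHUNK_SIZE
extra = bytes_added(v1, v2)
print(f"v2 is {len(v2)} bytes, but only {extra} new bytes need storing")
```

Whole-file storage would keep both 80-byte versions in full. With chunk-level storage, the second version costs only the one chunk that changed.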

Benefits of Xet Storage: Speed and Cost Savings

Xet’s design translates directly into tangible benefits for users and Hugging Face.

Faster Downloads and Uploads

Xet can serve only the specific chunks of data a user needs, and it reuses existing chunks from previous downloads. This dramatically speeds up download times, with users reporting 2 to 5 times faster speeds in some cases. Uploads are also quicker, as only new chunks need to be transferred.
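The chunk reuse described above amounts to a set difference between what a file needs and what is already on disk. The sketch below illustrates that planning step only; the function name `plan_download` and the short fingerprints are made up for the example.

```python
def plan_download(manifest: list, local_chunks: set) -> list:
    """Return the chunk fingerprints that still have to be fetched, in order."""
    seen = set(local_chunks)
    missing = []
    for digest in manifest:
        if digest not in seen:
            missing.append(digest)
            seen.add(digest)  # a chunk repeated within the file is fetched once
    return missing


manifest = ["a1", "b2", "c3", "b2", "d4"]  # hypothetical short fingerprints
local = {"a1", "c3"}                        # left over from an earlier download
to_fetch = plan_download(manifest, local)
print(to_fetch)
```

Uploads can use the same logic in reverse: the client sends only the chunks the server does not already hold.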

Reduced Infrastructure Costs

Deduplication and efficient data transfer lead to significant cost savings on storage and bandwidth for Hugging Face. The system’s efficiency also means it can handle high loads with fewer servers, further reducing operational costs.

Efficient Model Versioning

Storing new versions of models is much more efficient with Xet. Only changed chunks are stored, making it cost-effective for researchers who manage multiple model iterations.

Improved Repository Cloning

Cloning repositories becomes much faster with Xet, as many chunks can be cached or streamed efficiently, reducing the time spent waiting for large files to download.

The Migration Process to Xet Storage

Hugging Face is gradually migrating its Hub storage from Git LFS to Xet. This process is largely happening in the background, aiming for a seamless transition for users.

User Experience During Migration

For most users, the migration is transparent. The user interface and APIs remain unchanged, and existing commands continue to work. The primary change is the backend storage system. Users do not need to learn new commands or alter their scripts.
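As an illustration of a script that should not need changes, here is a typical download written against the `huggingface_hub` Python library. The repository and file names are arbitrary examples, and the wrapper function `download_config` is hypothetical; the point is that nothing in the call refers to the storage backend.

```python
def download_config(repo_id: str = "bert-base-uncased") -> str:
    """Download one file from the Hub and return its local path."""
    # Imported lazily so the sketch can be loaded without the package installed.
    from huggingface_hub import hf_hub_download

    # The call names a repository and a file, not a storage system.
    return hf_hub_download(repo_id=repo_id, filename="config.json")


# Example call (requires network access and the huggingface_hub package):
#   path = download_config()
```

Whether the bytes arrive through Git LFS or Xet is decided on the server side, which is why the article describes the change as a backend one.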

Potential Considerations

Users with custom Git LFS hooks or scripts might need to update them. There could be a brief warm-up period where some operations are slightly slower as the Xet cache populates. Compatibility with certain Git LFS-specific commands may also differ, though Hugging Face is working to maintain high compatibility.

Automated and Monitored Transition

The migration is automated, with models moved in batches. Hugging Face monitors the process closely to prevent data loss and ensure a smooth transition. Users do not need to take any action for their existing models.

Impact of Xet Storage on the AI Community

The shift to Xet storage has significant implications for the broader AI community.

Lowering Barriers to Entry

Faster downloads enable researchers to experiment with more models in less time, accelerating the pace of innovation. This allows them to focus more on analysis and less on waiting for files.

Potential Cost Reductions

Hugging Face’s infrastructure savings from Xet could lead to more accessible free tiers and lower prices for premium services, making AI resources more affordable.

Setting a New Industry Standard

As a leading AI platform, Hugging Face’s adoption of Xet could influence other platforms to move towards similar content-addressable storage solutions, fostering a more efficient AI ecosystem.

Addressing Potential Risks

While Xet offers many benefits, potential risks like vendor lock-in and system complexity are being addressed. Hugging Face emphasizes open standards and has invested in robust monitoring and testing to ensure data integrity and system reliability.

The Future of AI Data Storage

The evolution of AI models and datasets necessitates advancements in storage solutions. Xet represents a significant step forward, but innovation in this area is ongoing.

Content Addressable Storage for All AI Data

The principles of Xet, such as deduplication, could be applied to other large AI artifacts like datasets and embeddings, leading to substantial space savings.

Advancements in Streaming and Integration

Future storage systems may offer more advanced streaming capabilities, allowing users to access only the necessary parts of models or datasets. Tighter integration between storage systems and AI frameworks could also speed up training pipelines.

Continued Development and Open Source

Hugging Face plans to continue developing Xet, adding new features and improving performance. It is also open-sourcing parts of the system, encouraging community contributions and the development of related tools.

A More Efficient AI Ecosystem

The migration to Xet is a move towards a faster, cheaper, and more efficient AI ecosystem. As storage technologies evolve, they will better support the growing demands of modern artificial intelligence.

Frequently Asked Questions

Why is Hugging Face moving from Git LFS to Xet storage?

Hugging Face is moving from Git LFS to Xet storage to address the slow download speeds and high costs associated with handling large AI model files. Xet offers a more efficient and scalable way to store and distribute these files.

What are the main benefits of Xet storage for users?

Xet storage provides significantly faster download and upload speeds for AI models. It also reduces storage space usage through deduplication and can lead to cost savings, potentially making AI resources more accessible.

How does Xet storage work differently from Git LFS?

Xet breaks files into smaller chunks and stores them based on their content fingerprint, reusing existing chunks to save space and time. Git LFS stores and transfers each file as a single object, which is less efficient for large AI models that often differ only slightly from one another.

Will users need to learn new commands for Xet storage?

No, the migration to Xet is designed to be seamless for users. Existing commands and workflows for interacting with the Hugging Face Hub will continue to work as before, with the changes happening in the backend.

Are there any potential downsides to Xet storage?

Potential concerns include the complexity of content-addressable storage and the risk of vendor lock-in. However, Hugging Face is committed to open standards and has implemented robust systems to ensure data integrity and reliability.

How will Xet storage impact the cost of using Hugging Face?

Xet reduces Hugging Face's infrastructure costs for storage and bandwidth, and those savings could be passed on to users. This might take the form of more generous free tiers or lower prices for premium services, though nothing has been confirmed.

References

Migrating the Hub from Git LFS to Xet – Original report (Hugging Face)
Is Hugging Face's Xet Storage the Future of AI Repositories? – Industry perspective (Analytics India Magazine)
