mmBERT: A Multilingual AI Encoder for 1,800+ Languages, Released on Hugging Face

Key Takeaways

mmBERT is a new AI model, released on Hugging Face by researchers at Johns Hopkins University, that understands over 1,800 languages.
It runs two to four times faster than older multilingual models like mBERT and XLM-R.
The model is built on ModernBERT, incorporating architectural improvements for efficiency and speed.
mmBERT was trained on 3 trillion tokens, allowing for broad language coverage.
It can be used for applications like cross-lingual search and sentiment analysis, and as a component of machine translation systems, across many languages.
Limitations include performance variations across languages and its encoder-only nature, meaning it cannot generate text.

Imagine an AI that can understand text in more than 1,800 languages, from widely spoken English to rare indigenous tongues, and do it faster than earlier multilingual models. That is what mmBERT, a new model released on Hugging Face, promises.

Researchers at Johns Hopkins University have released a new model called mmBERT on Hugging Face, the platform that hosts many popular open-source AI models. It is a type of language model that can process text in over 1,800 languages. Its developers say it is two to four times faster than previous models that do similar work.

The model is already available on Hugging Face’s model hub. Researchers and developers can download and use it for free.

What is mmBERT?

To understand mmBERT, it helps to know a little about BERT. BERT stands for Bidirectional Encoder Representations from Transformers. It is a model that learns the meaning of each word by looking at the words on both sides of it. BERT was first released by Google in 2018 and changed how computers handle language.

mmBERT is an encoder-only model. That means it reads text and creates a rich representation of each word’s meaning. It does not generate new text like models such as GPT. Instead, it is designed to understand and classify text. For example, it can determine whether a sentence is positive or negative, or find names of people and places in a document.
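What "creating a representation" means can be sketched in a few lines of Python. This is a minimal illustration, assuming the `transformers` and `torch` packages are installed. The checkpoint name `jhu-clsp/mmBERT-base` is an assumption about where the model is published on the hub, so check the model card for the exact ID. The helper names `mean_pool` and `embed` are hypothetical.

```python
def mean_pool(token_vectors, mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / max(count, 1) for value in total]


def embed(texts, model_id="jhu-clsp/mmBERT-base"):  # model ID is an assumption
    """Return one sentence vector per text (downloads the model on first use)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        # Shape: (number of texts, tokens per text, hidden size)
        hidden = model(**batch).last_hidden_state
    return [
        mean_pool(vectors.tolist(), mask.tolist())
        for vectors, mask in zip(hidden, batch["attention_mask"])
    ]
```

Calling `embed(["Hello world", "Hola mundo"])` would return two lists of numbers, one per sentence. The encoder outputs vectors, not text, and a classifier or search index is then built on top of them.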

The ‘mm’ in mmBERT stands for ‘Modern Multilingual.’ It is built on top of ModernBERT, a newer and more efficient version of the original BERT architecture. ModernBERT introduced several improvements that make it faster and better at handling long documents. mmBERT takes those improvements and applies them to a huge number of languages.

How mmBERT Achieves Speed and Scale

Speed is one of mmBERT’s biggest selling points. Its developers say it runs two to four times faster than earlier multilingual encoders like mBERT and XLM-R. At that rate, a batch of documents that took four seconds to process would take one to two seconds.

The speed comes from changes in the underlying architecture. ModernBERT uses a more efficient attention mechanism. Attention is the part of the model that decides which words matter most for understanding a sentence. Older models compared every word with every other word in every layer, which gets very slow on long texts. ModernBERT restricts most of its layers to a local window of nearby words and keeps full attention only in some layers, which is faster and uses less memory.
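A rough way to see why attention over long texts gets expensive is to count how many word pairs the model has to compare. The sketch below compares full attention with a local sliding window. The window size of 128 and the text lengths are illustrative assumptions, and the counts are not measurements of mmBERT itself.

```python
def full_attention_pairs(seq_len):
    """Every token attends to every token: cost grows with seq_len squared."""
    return seq_len * seq_len


def local_attention_pairs(seq_len, window):
    """Each token attends only to tokens within window // 2 positions of itself."""
    half = window // 2
    pairs = 0
    for i in range(seq_len):
        lo = max(0, i - half)
        hi = min(seq_len - 1, i + half)
        pairs += hi - lo + 1
    return pairs


for n in (512, 2048, 8192):
    full = full_attention_pairs(n)
    local = local_attention_pairs(n, window=128)
    print(f"{n:>5} tokens: full={full:>10,} local={local:>9,} ratio={full / local:.1f}x")
```

The full count grows with the square of the text length, while the local count grows roughly in proportion to it, so the gap widens as documents get longer.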

Another change is in how the layers do their work. ModernBERT avoids spending computation on padding, the filler added to make texts in a batch the same length, so less effort is wasted. It also uses something called ‘rotary positional embeddings’, which encode word order by rotating parts of each word’s internal vector, to keep track of position without slowing down.
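The idea behind rotary positional embeddings can be shown with a small pure-Python sketch. Each pair of numbers in a vector is rotated by an angle that depends on the word's position, and the similarity between two rotated vectors then depends only on how far apart the words are. The vectors, positions, and the `base` value of 10000 are illustrative assumptions, not values taken from mmBERT.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary positional embedding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair of dimensions rotates at its own frequency.
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.3, -1.2, 0.8, 0.5]
key = [1.0, 0.4, -0.7, 0.2]

# The same offset (3 positions apart) at two different places in the text.
near_start = dot(rotate(query, 5), rotate(key, 2))
far_along = dot(rotate(query, 105), rotate(key, 102))
print(near_start, far_along)
```

The two printed scores should agree up to floating-point rounding, because only the distance between the two positions affects the result.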

All these changes add up. mmBERT can process large amounts of text in many languages without needing expensive computer hardware. This makes it more practical for real-world applications.

The model was trained on 3 trillion tokens. A token is a piece of text, usually a word or part of a word. For comparison, the original English BERT was trained on a corpus of about 3.3 billion words. mmBERT used roughly nine hundred times more data. This massive training set is what allows mmBERT to cover so many languages.
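The scale comparison is simple arithmetic and can be checked directly. The sketch below treats the two figures as comparable units, which is a simplification. The variable names are illustrative.

```python
mmbert_tokens = 3_000_000_000_000   # 3 trillion, as reported for mmBERT
bert_corpus_size = 3_300_000_000    # about 3.3 billion, the original BERT corpus

ratio = mmbert_tokens / bert_corpus_size
print(f"mmBERT saw roughly {ratio:.0f} times more text")
```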

The training data came from a variety of sources, including web pages, books, and other texts. The mmBERT team has not released the exact breakdown of how many tokens came from each language. But they have said that the data includes both high-resource languages like English and Chinese, and low-resource languages with very little digital text available.

Languages Covered: From Common to Rare

One of the most impressive things about mmBERT is its language coverage. It understands over 1,800 languages. That includes all the major world languages like English, Spanish, Mandarin, Arabic, and Hindi. But it also includes many smaller languages that are often ignored by AI.

For example, mmBERT can process texts in Quechua, an indigenous language spoken by millions in the Andes. It can handle Swahili, a widely spoken language in East Africa. It also covers many languages from Papua New Guinea, which has over 800 languages.

Earlier multilingual models like mBERT covered about 104 languages. XLM-R, another popular model, covered about 100 languages. mmBERT covers more than 17 times as many. This is a big step forward for people who speak languages that are not well served by technology.

However, not all languages are treated equally. The model performs best on languages that had more training data. For languages with very little text available, the model’s understanding may be weaker. The mmBERT team has acknowledged this and says they plan to improve coverage for low-resource languages in future versions.

The inclusion of so many languages is possible because of the way the model was trained. Instead of training separate models for each language, mmBERT learns a shared representation. It finds patterns that are common across languages. This allows it to use knowledge from high-resource languages to better understand low-resource ones.

Comparing mmBERT with Previous Multilingual Encoders

To see how mmBERT improves on earlier work, it is useful to compare it with two well-known models: mBERT and XLM-R.

mBERT was released by Google in 2018. It was one of the first multilingual BERT models. It covered 104 languages and was trained on Wikipedia text. For its time, it was a breakthrough. But it had limitations. It could only handle languages that had a large Wikipedia presence. It was also slow by today’s standards.

XLM-R came from Facebook AI in 2019. It improved on mBERT by using more data, drawn from web crawls rather than only Wikipedia, and a better training method. It covered about 100 languages but performed better on low-resource languages than mBERT. It is still not as fast as mmBERT.

mmBERT goes much further. It covers more than 17 times as many languages. It is two to four times faster than both mBERT and XLM-R. And it was trained on 3 trillion tokens, compared to the billions used for earlier models.

In terms of accuracy, the mmBERT team reports that the model matches or exceeds the performance of older models such as mBERT and XLM-R on many tasks, while being much faster.

One area where mmBERT may lag is on very high-resource languages like English. Because the model has to spread its capacity across so many languages, it might not be as good as a specialized English-only model. But for multilingual tasks, the mmBERT multilingual encoder offers a good balance of speed and coverage.

Where mmBERT Can Be Used: Applications

mmBERT is not a chatbot or a text generator. It is a tool for understanding text. That makes it useful for many practical applications.

One use is machine translation. An encoder like mmBERT cannot produce translations by itself, but it can serve as one component of a translation system. It can also be used for cross-lingual search. For example, a search engine could use mmBERT to find documents in one language that match a query in another language.
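Cross-lingual search of this kind usually comes down to comparing vectors. A minimal sketch, assuming each document and the query have already been turned into vectors by an encoder, ranks documents by cosine similarity. The three-number vectors and document names below are made-up stand-ins for real embeddings, which have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Toy vectors standing in for real sentence embeddings.
documents = {
    "es_weather": [0.9, 0.1, 0.0],   # Spanish article about the weather
    "sw_football": [0.1, 0.9, 0.2],  # Swahili article about football
    "hi_cooking": [0.0, 0.2, 0.9],   # Hindi article about cooking
}
query = [0.8, 0.2, 0.1]  # English query about the weather

print(rank(query, documents))
```

With these toy numbers the Spanish weather article ranks first, even though the query is in English, because the language never enters the comparison. Only the vectors do.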

Another application is sentiment analysis. Companies that operate in multiple countries can use mmBERT to analyze customer feedback in many languages. Instead of building separate systems for each language, they can use one model that handles all of them.

Information extraction is another area. mmBERT can be trained to find names, dates, places, and other key information in text. This works across languages, so a single model can process documents from around the world.

There are also applications in academia and research. Linguists can use mmBERT to study patterns across many languages. Historians can analyze texts in ancient or rare languages if the model covers them.

Because mmBERT is open source and available on Hugging Face, anyone can use it. Developers can fine-tune it for specific tasks. That means a small team can build a multilingual AI tool without needing huge resources.
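As a starting point for fine-tuning, the sketch below loads the encoder with a fresh classification layer on top, here for three-way sentiment. It assumes the `transformers` package is installed and that the checkpoint is published as `jhu-clsp/mmBERT-base`, which is an assumption to verify on the model card. The label names and the `load_classifier` helper are hypothetical, and the new layer is untrained until it is fine-tuned on labeled examples.

```python
LABELS = ["negative", "neutral", "positive"]  # illustrative label set
id2label = dict(enumerate(LABELS))
label2id = {name: idx for idx, name in id2label.items()}


def load_classifier(model_id="jhu-clsp/mmBERT-base"):  # model ID is an assumption
    """Load the tokenizer and the encoder with a new, untrained classification head."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

The returned model can then be trained on labeled feedback in any mix of languages using a standard training loop.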

Limitations and Future Developments

Despite its impressive capabilities, mmBERT has limitations. The biggest one is that performance varies by language. Languages with more training data get better results. Speakers of very rare languages may find the model’s understanding is limited.

Another limitation is that mmBERT is an encoder-only model. It cannot generate text. For tasks that require writing, like chatbots or translation, you would need to pair it with a decoder model. That adds complexity.

There are also ethical considerations. Any large language model trained on internet data can pick up biases. Those biases can include stereotypes, offensive language, or cultural assumptions. Because mmBERT covers so many languages, it may have learned biases from many different cultures. The mmBERT team has not released a detailed bias analysis, but they have said they are working on it.

Another ethical issue is the potential for misuse. A model that can understand text in 1,800 languages could be used for surveillance or censorship. Governments or companies could use it to monitor communications across many languages. The model’s developers and Hugging Face can encourage responsible use but cannot control how the model is deployed.

Looking ahead, the mmBERT team plans to continue improving the model. They want to add more languages and improve performance on low-resource ones. They also plan to release more detailed benchmarks and documentation.

The release of mmBERT is a significant step forward for multilingual AI. It shows that it is possible to build a single model that can understand thousands of languages while being faster than ever. For researchers, developers, and anyone who works with multiple languages, mmBERT is a powerful new tool.

As with any new technology, it will take time to see how it is used. But the potential is huge. In a world where language barriers still cause problems, mmBERT offers a way to break them down.

Frequently Asked Questions

What is mmBERT?

mmBERT is a 'Modern Multilingual' AI language model developed by researchers at Johns Hopkins University and released on Hugging Face. It is designed to understand and process text in over 1,800 languages, making it a powerful tool for multilingual natural language processing tasks.

How much faster is mmBERT than previous models?

The mmBERT team states that the model runs two to four times faster than earlier multilingual encoders such as mBERT and XLM-R. This speed improvement is due to architectural enhancements in the underlying ModernBERT model.

How many languages does mmBERT support?

mmBERT supports an impressive range of over 1,800 languages. This includes major global languages as well as many less common and indigenous languages that are often underserved by AI technology.

What kind of tasks can mmBERT be used for?

As an encoder-only model, mmBERT excels at understanding text. It can be used for tasks like cross-lingual information retrieval, sentiment analysis, and information extraction across multiple languages, and as one component of a machine translation system.

Are there any limitations to mmBERT?

Yes, mmBERT's performance can vary depending on the amount of training data available for each language, with lower-resource languages potentially having weaker understanding. Additionally, as an encoder, it cannot generate text on its own.

Is mmBERT available for public use?

Yes, mmBERT is open-source and available on Hugging Face's model hub. Researchers and developers can download and use it for free, and it can be fine-tuned for specific applications.

References

mmBERT: ModernBERT goes Multilingual – Hugging Face Blog – Original announcement.
Hugging Face Introduces mmBERT, a Multilingual Encoder for 1,800+ Languages – InfoQ – News article confirming the model's coverage of 1,800+ languages and its encoder-only nature.
Meet mmBERT: An Encoder-only Language Model Pretrained on 3T Tokens of Multilingual Text in over 1800 Languages and 2–4× Faster than Previous Models – MarkTechPost – Article detailing the model's pretraining on 3 trillion tokens, speed improvements, and language coverage.
