LaBSE, a new language-agnostic embedding model, supports 109 languages with SOTA accuracy
Aug 19, 2020
The fields of natural language processing (NLP) and natural language generation (NLG) have benefited greatly from the inception of the transformer architecture. Transformer models like BERT and its derivatives have been applied to a range of domains including sentiment analysis and classification.
In recent years, significant effort has gone into making these models even more robust, particularly by extending masked language model (MLM) pre-training and combining it with translation language modeling (TLM) to make the models language-agnostic. While this combination of MLM and TLM has proved helpful for fine-tuning on downstream tasks, it has thus far not directly produced multilingual sentence embeddings, which are critical for translation tasks.
With this in mind, researchers at Google introduced a multilingual BERT embedding model called "Language-agnostic BERT Sentence Embedding", or LaBSE for short, which produces language-agnostic cross-lingual sentence embeddings for 109 languages in a single model.
Succinctly, LaBSE combines MLM and TLM pre-training on a 12-layer transformer with a 500,000-token vocabulary, followed by a translation ranking task using bi-directional dual encoders.
LaBSE's dual-encoder architecture. Image via Google AI
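The translation ranking objective above can be sketched in a few lines: each sentence and its translation are embedded by the two (weight-shared) encoders, and every other target sentence in the batch acts as a negative. The following is a minimal pure-Python illustration of an in-batch ranking loss with an additive margin on the positive pair; the function names and the margin/scale values are illustrative choices, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def translation_ranking_loss(src_emb, tgt_emb, margin=0.3, scale=10.0):
    """In-batch translation ranking loss with an additive margin.

    src_emb[i] and tgt_emb[i] are embeddings of a translation pair.
    Every other target in the batch serves as a negative. Subtracting
    the margin from the true pair's score before the softmax pushes
    translations to score higher than non-translations by at least
    that margin.
    """
    n = len(src_emb)
    loss = 0.0
    for i in range(n):
        scores = [scale * cosine(src_emb[i], t) for t in tgt_emb]
        scores[i] -= scale * margin  # additive margin on the positive pair
        log_z = math.log(sum(math.exp(s) for s in scores))
        loss += log_z - scores[i]    # -log softmax of the true translation
    return loss / n
```

As a sanity check, embeddings where translation pairs are aligned should yield a lower loss than the same embeddings with the targets shuffled.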
To train the model, the researchers used 17 billion monolingual sentences and 6 billion bilingual sentence pairs. Once trained, LaBSE was evaluated on the Tatoeba corpus, where the model was tasked with finding the nearest-neighbor translation for a given sentence using cosine distance.
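This evaluation reduces to a simple retrieval check: for each source sentence, find the closest target embedding by cosine distance and count it as correct when it is the reference translation. A small sketch (with the convention, assumed here, that the reference translation sits at the same index in the target pool):

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def retrieval_accuracy(src_emb, tgt_emb):
    """Fraction of source sentences whose nearest neighbor in the
    target pool (by cosine distance) is the reference translation,
    i.e. the embedding at the same index."""
    correct = 0
    for i, s in enumerate(src_emb):
        nearest = min(range(len(tgt_emb)),
                      key=lambda j: cosine_distance(s, tgt_emb[j]))
        if nearest == i:
            correct += 1
    return correct / len(src_emb)
```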
As a result, the model proved effective even on low-resource languages for which no data was available during training. In addition, LaBSE established a new state of the art (SOTA) on multiple parallel-text (bitext) retrieval tasks. Specifically, as the number of languages increased, traditional models such as m~USE and LASER showed a sharper decline in average accuracy than LaBSE.
The pre-trained model is now available on TensorFlow Hub.
It can be used out of the box or fine-tuned on a dataset of your choosing. If you are interested in further details, you can study the original paper.
The potential applications of LaBSE include mining parallel text from the web. The researchers applied it to CommonCrawl, finding potential translations from a pool of 7.7 billion English sentences pre-processed and encoded by LaBSE. Translation models trained on the mined pairs reached BLEU scores of 35.7 and 27.2, which "is only a few points away from current state-of-art-models trained on high-quality parallel data," Google wrote. Google also noted that the "reduction in accuracy from the LaBSE model with increasing numbers of languages is much less significant, outperforming LASER significantly, particularly when the full distribution of 112 languages is included (83.7% accuracy vs. 65.5%)."
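Mining parallel text from a web-scale pool works much like the retrieval evaluation, except that there is no guarantee a translation exists, so the best match is kept only when its similarity clears a threshold. A minimal sketch of that filtering step; the threshold value and function names here are illustrative, not the researchers' actual pipeline:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(src_emb, pool_emb, threshold=0.8):
    """For each source embedding, return the index of its best match
    in the candidate pool, or None when even the best match scores
    below the threshold (treated as 'no translation found')."""
    pairs = []
    for s in src_emb:
        scores = [cosine(s, c) for c in pool_emb]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append(best if scores[best] >= threshold else None)
    return pairs
```

At CommonCrawl scale the exhaustive scoring loop would be replaced by an approximate nearest-neighbor index, but the accept/reject logic stays the same.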