LaBSE, a new language-agnostic embedding model, supports 109 languages with SOTA accuracy
Aug 19, 2020
The fields of natural language processing (NLP) and natural language generation (NLG) have benefited greatly from the inception of the transformer architecture. Transformer models like BERT and its derivatives have been applied to a range of domains including sentiment analysis and classification.
In recent years, significant effort has gone into making these models even more robust, particularly by extending masked language model (MLM) pre-training and combining it with translation language modeling (TLM) to make the models language-agnostic. While this combination of MLM and TLM has proved helpful for fine-tuning on downstream tasks, it has thus far not directly produced multilingual sentence embeddings, which are critical for translation tasks.
With this in mind, researchers at Google introduced a multilingual BERT embedding model called "Language-agnostic BERT Sentence Embedding", or LaBSE for short, which produces language-agnostic cross-lingual sentence embeddings for 109 languages in a single model.
Succinctly, LaBSE combines MLM and TLM pre-training on a 12-layer transformer with a 500,000-token vocabulary, followed by a translation ranking task using bi-directional dual encoders.
LaBSE's dual-encoder architecture. Image via Google AI
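The translation ranking objective above can be sketched in a few lines: each sentence and its translation are embedded by the two (weight-shared) encoders, and every other target sentence in the batch acts as a negative. The following is a minimal pure-Python illustration of an in-batch ranking loss with an additive margin on the positive pair; the function names and the margin/scale values are illustrative choices, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def translation_ranking_loss(src_emb, tgt_emb, margin=0.3, scale=10.0):
    """In-batch translation ranking loss with an additive margin.

    src_emb[i] and tgt_emb[i] are embeddings of a translation pair.
    Every other target in the batch serves as a negative. Subtracting
    the margin from the true pair's score before the softmax pushes
    translations to score higher than non-translations by at least
    that margin.
    """
    n = len(src_emb)
    loss = 0.0
    for i in range(n):
        scores = [scale * cosine(src_emb[i], t) for t in tgt_emb]
        scores[i] -= scale * margin  # additive margin on the positive pair
        log_z = math.log(sum(math.exp(s) for s in scores))
        loss += log_z - scores[i]    # -log softmax of the true translation
    return loss / n
```

As a sanity check, embeddings where translation pairs are aligned should yield a lower loss than the same embeddings with the targets shuffled.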
To train the model, the researchers used 17 billion monolingual sentences and 6 billion bilingual sentence pairs. Once trained, LaBSE was evaluated on the Tatoeba corpus, where the model was tasked with finding the nearest-neighbor translation for a given sentence using cosine distance.
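This evaluation reduces to a simple retrieval check: for each source sentence, find the closest target embedding by cosine distance and count it as correct when it is the reference translation. A small sketch (with the convention, assumed here, that the reference translation sits at the same index in the target pool):

```python
import math

def cosine_distance(u, v):
    """Cosine distance: 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def retrieval_accuracy(src_emb, tgt_emb):
    """Fraction of source sentences whose nearest neighbor in the
    target pool (by cosine distance) is the reference translation,
    i.e. the embedding at the same index."""
    correct = 0
    for i, s in enumerate(src_emb):
        nearest = min(range(len(tgt_emb)),
                      key=lambda j: cosine_distance(s, tgt_emb[j]))
        if nearest == i:
            correct += 1
    return correct / len(src_emb)
```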
As a result, the model proved effective even on low-resource languages for which no data was available during training. In addition, LaBSE established a new state of the art (SOTA) on multiple parallel-text (bitext) retrieval tasks. Specifically, as the number of languages increased, traditional models such as m~USE and LASER showed a sharper decline in average accuracy than LaBSE.
The pre-trained model is now available on TensorFlow Hub.
It can be used out of the box or fine-tuned on a dataset of your choosing. If you are interested in further details, you can study the original paper.
The potential applications of LaBSE include mining parallel text from the web. The researchers applied it to CommonCrawl, finding potential translations from a pool of 7.7 billion English sentences pre-processed and encoded by LaBSE. Translation models trained on the mined pairs reached BLEU scores of 35.7 and 27.2, which "is only a few points away from current state-of-art-models trained on high-quality parallel data," Google wrote. Google also noted that the "reduction in accuracy from the LaBSE model with increasing numbers of languages is much less significant, outperforming LASER significantly, particularly when the full distribution of 112 languages is included (83.7% accuracy vs. 65.5%)."
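Mining parallel text from a web-scale pool works much like the retrieval evaluation, except that there is no guarantee a translation exists, so the best match is kept only when its similarity clears a threshold. A minimal sketch of that filtering step; the threshold value and function names here are illustrative, not the researchers' actual pipeline:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mine_pairs(src_emb, pool_emb, threshold=0.8):
    """For each source embedding, return the index of its best match
    in the candidate pool, or None when even the best match scores
    below the threshold (treated as 'no translation found')."""
    pairs = []
    for s in src_emb:
        scores = [cosine(s, c) for c in pool_emb]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append(best if scores[best] >= threshold else None)
    return pairs
```

At CommonCrawl scale the exhaustive scoring loop would be replaced by an approximate nearest-neighbor index, but the accept/reject logic stays the same.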