
Facebook Developing Single Encoder-Based Technology to Translate 93 Languages

Bipasha Mandal


Facebook researchers recently published a paper, building on Schwenk (2018a), that proposes an architecture for learning joint multilingual sentence representations for 93 languages using a single BiLSTM encoder and a BPE vocabulary shared by all languages. There has been other research in this area, but its performance has been limited, largely because each language gets its own model and no connection is learned between languages.

Facebook researchers are interested in sentence representations that are universal with respect to both the input language and the NLP task. The aim of the research is to help languages with limited resources, to enable zero-shot transfer of NLP models, and to handle code-switching. What sets this work apart is that it studies joint sentence representations across 93 different languages, whereas most prior NLP work considers at most two languages.

Figure 1: 75 of the 93 languages used to train the proposed model.

The study covers 93 languages from 34 language families written in 28 different scripts. The model is evaluated on zero-shot cross-lingual natural language inference (the XNLI dataset), document classification (MLDoc), bitext mining (BUCC), and multilingual similarity search (Tatoeba). The researchers also derived a new test set of aligned sentences from the Tatoeba corpus, which provides baseline results for 122 languages.
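To make the similarity-search evaluation concrete, here is a minimal sketch of a Tatoeba-style check: every source sentence is embedded, its nearest neighbour among the target-language embeddings is retrieved by cosine similarity, and a retrieval counts as an error when it is not the aligned translation. The `embed` function in the usage comment is a hypothetical stand-in for the multilingual encoder, not part of the paper's released code.

```python
import numpy as np

def normalize(mat):
    """L2-normalize rows so dot products become cosine similarities."""
    return mat / np.linalg.norm(mat, axis=1, keepdims=True)

def similarity_search_error(src_emb, tgt_emb):
    """Tatoeba-style evaluation: for each source sentence, retrieve the
    nearest target sentence by cosine similarity and check whether it is
    the aligned translation (row i should match row i)."""
    src = normalize(src_emb)
    tgt = normalize(tgt_emb)
    nearest = (src @ tgt.T).argmax(axis=1)            # index of closest target
    return (nearest != np.arange(len(src))).mean()    # error rate

# Hypothetical usage, with `embed` standing in for the trained encoder:
# err = similarity_search_error(embed(french_sentences), embed(english_sentences))
```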

The system follows an encoder-decoder design. Once a sentence is embedded, the embedding is passed through a linear transformation to initialize the LSTM decoder. There is a single encoder and a single decoder shared by all languages, and a joint byte-pair encoding (BPE) vocabulary is used, which pushes the encoder towards language-independent representations. The encoder is a BiLSTM with one to five layers of 512 dimensions each, while the decoder, which is told which language to generate via a language ID, has a 2048-dimensional hidden layer.

Figure 2: Architecture of the system for learning multilingual sentence embeddings.
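The sketch below shows what such a shared encoder could look like in PyTorch: one BPE embedding table for all languages, a BiLSTM whose per-token outputs are max-pooled into a fixed-size sentence embedding, and a linear layer that maps that embedding to the decoder's initial state. The class name, vocabulary size, and embedding width are illustrative assumptions, not the authors' released implementation.

```python
import torch.nn as nn

class MultilingualSentenceEncoder(nn.Module):
    """Illustrative BiLSTM sentence encoder with a BPE vocabulary shared
    across all languages (sizes are assumptions, not the paper's code)."""

    def __init__(self, bpe_vocab_size=50000, embed_dim=320,
                 hidden_dim=512, num_layers=5):
        super().__init__()
        # One embedding table for the joint BPE vocabulary of all languages.
        self.embed = nn.Embedding(bpe_vocab_size, embed_dim)
        # BiLSTM: each direction is hidden_dim wide, so the concatenated
        # per-token output is 2 * hidden_dim dimensions.
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                              bidirectional=True, batch_first=True)
        # Linear transformation used to initialize the 2048-dim LSTM decoder.
        self.to_decoder_init = nn.Linear(2 * hidden_dim, 2048)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) BPE indices; no language ID is needed,
        # since the encoder is shared by all 93 languages.
        outputs, _ = self.bilstm(self.embed(token_ids))
        # Max-pooling over time yields a fixed-size sentence embedding.
        sentence_emb, _ = outputs.max(dim=1)
        return sentence_emb, self.to_decoder_init(sentence_emb)
```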

Pre-processing relies on the Moses statistical machine translation toolkit, except for Chinese and Japanese texts, which are segmented with Jieba and MeCab respectively.
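A rough stand-in for that pipeline is sketched below, assuming the sacremoses package (a Python port of the Moses normalization and tokenization scripts) and the jieba package for Chinese segmentation; the paper itself uses the original Moses scripts, and Japanese would go through MeCab instead.

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer  # Python port of Moses tools
import jieba  # Chinese word segmentation

def preprocess(sentence, lang):
    """Approximate pre-processing: Moses-style normalization and tokenization
    for most languages, Jieba segmentation for Chinese (Japanese would use
    MeCab). A joint BPE encoding would then be applied before the encoder."""
    if lang == "zh":
        return " ".join(jieba.cut(sentence))
    normalized = MosesPunctNormalizer(lang=lang).normalize(sentence)
    return " ".join(MosesTokenizer(lang=lang).tokenize(normalized))

print(preprocess("Hello, world!", "en"))      # Moses tokenization
print(preprocess("我爱自然语言处理", "zh"))      # Jieba segmentation
```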
