- Jul 30, 2021
Facebook researchers recently published a paper, building on Schwenk (2018a), that proposes an architecture for learning joint multilingual sentence representations in 93 languages using a single BiLSTM encoder and a BPE vocabulary shared by all languages. Earlier work in this area has been limited in performance, primarily because it trains a separate model for each language, which rules out any cross-lingual connection between languages.
The researchers are interested in sentence vector representations that are general with respect to both the input language and the NLP task. The aim is to help low-resource languages, to enable zero-shot transfer of NLP models from one language to another, and to handle code-switching. What sets this work apart from other multilingual NLP research is that it studies joint sentence representations across 93 languages, whereas such work typically focuses on two languages at most.
The study covers 93 languages, which belong to 34 language families and use 28 different writing systems. The approach is evaluated on zero-shot cross-lingual natural language inference (XNLI dataset), classification (MLDoc dataset), bitext mining (BUCC dataset), and multilingual similarity search (Tatoeba dataset). The researchers also introduce a new test set based on the Tatoeba corpus, which provides baseline results for 122 languages.
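To make the multilingual similarity search task concrete, here is a minimal sketch in plain Python. It assumes that sentences are compared by cosine similarity and scored by error rate (the share of sentences whose nearest neighbour in the other language is not the aligned translation); the article does not spell out either choice. The function names and the toy three-dimensional vectors are invented for illustration and are not model output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, candidates):
    """Index of the candidate most similar to the query vector."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))


def error_rate(src_vecs, tgt_vecs):
    """Fraction of source sentences whose nearest target is not the aligned one.

    Assumes src_vecs[i] and tgt_vecs[i] embed the same sentence in two languages.
    """
    wrong = sum(1 for i, vec in enumerate(src_vecs) if nearest(vec, tgt_vecs) != i)
    return wrong / len(src_vecs)


# Toy embeddings, made up for illustration (hypothetical values).
english = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
german = [[0.8, 0.2, 0.1], [0.2, 0.9, 0.0], [0.7, 0.1, 0.3]]

print(error_rate(english, german))
```

If the encoder really maps translations to nearby points regardless of language, this error rate stays low even for language pairs that were never paired during training.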
The architecture follows an encoder-decoder design. Once a sentence is embedded, the embedding is linearly transformed to initialize the LSTM decoder. The system has a single encoder and a single decoder, and the researchers use a joint byte-pair encoding (BPE) vocabulary, which encourages the encoder to learn language-independent representations. The encoder has one to five layers, each 512-dimensional. The decoder takes a language ID that specifies which language to generate and has one 2048-dimensional layer.
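The idea of a joint BPE vocabulary can be illustrated with a short sketch. The code below is a simplified version of the standard BPE merge-learning procedure, run on the concatenation of sentences from several languages with no language labels. The article does not describe how the researchers built their vocabulary, so the procedure, the function names, and the three-sentence corpus are all assumptions made for illustration.

```python
import collections


def merge_pair(pair, symbols):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge operations from sentences in any mix of languages."""
    word_freq = collections.Counter(
        word for sentence in sentences for word in sentence.lower().split()
    )
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freq.items()}
    merges = []
    for _ in range(num_merges):
        pairs = collections.Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        vocab = {merge_pair(best, symbols): freq for symbols, freq in vocab.items()}
        merges.append(best)
    return merges


def segment(word, merges):
    """Split a word into subword units by replaying the learned merges in order."""
    symbols = tuple(word.lower()) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(pair, symbols)
    return list(symbols)


# Hypothetical joint corpus: English, German and French mixed together.
corpus = [
    "the international station",
    "die internationale station",
    "la station internationale",
]
merges = learn_bpe(corpus, 10)
print(segment("international", merges))
```

Because the merges are learned from all languages at once, subword units shared across languages (such as fragments of "station" here) end up as single vocabulary entries, which is one way a shared encoder can reuse the same input symbols across languages.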
The Moses statistical machine translation toolkit is used for pre-processing, except for Chinese and Japanese texts, which are segmented with Jieba and MeCab respectively.