Multilingual models can produce token and, by extension, sentence embeddings for multiple languages at once. While this capability broadens the possible use cases, it comes with a caveat: there is no guarantee that the vector spaces of different languages are aligned. In other words, the same word or sentence, translated into different languages and processed by the model, could be assigned vector representations that are neither similar nor close in the embedding space. This prevents us from performing tasks such as information retrieval, clustering, and semantic textual similarity across languages.
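To make the alignment problem concrete, here is a minimal sketch of the kind of check that reveals it. The vectors below are made-up, low-dimensional stand-ins for real model outputs; the point is only that a sentence and its translation can end up far apart under cosine similarity when the spaces are not aligned.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of "Hello, world" and its German
# translation "Hallo, Welt" from a multilingual model whose
# vector spaces are NOT aligned across languages:
emb_en = [0.9, 0.1, 0.0]
emb_de = [0.1, 0.2, 0.95]

# Low similarity despite identical meaning.
print(round(cosine_similarity(emb_en, emb_de), 2))  # → 0.12
```

With an aligned model, the same comparison would yield a high similarity, which is exactly what the cross-lingual tasks above depend on.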
However, this is not to say that such tasks are impossible within a single language. Semantically meaningful sentence embeddings can be, and have been, generated successfully by models such as SentenceBERT (SBERT). If you haven't already read that paper, I recommend having a look at my paper summary, which covers the motivation, implementation, related work, and results achieved by the authors. In short, SBERT is trained to generate sentence embeddings that preserve the input sequence's semantics, mapping similar sentences close to each other and dissimilar ones further apart.
What if we could extend this capability of assigning semantically meaningful representations to work both within and across a wider set of languages? That would open up lots of interesting use cases. This, in fact, is exactly what Reimers and Gurevych study in Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. They transfer the capabilities of SBERT to multilingual models such as XLM-RoBERTa (XLM-R) through a novel knowledge-distillation training process.
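As I understand the paper's setup, the distillation objective trains the multilingual student to reproduce the teacher's (SBERT's) embedding of a source sentence for both the source sentence itself and its translation, typically via mean squared error over parallel pairs. A minimal sketch in plain Python, with made-up vectors standing in for real teacher and student outputs:

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher_src, student_src, student_tgt):
    """Sketch of the knowledge-distillation objective: the student
    should match the teacher's embedding of the source sentence
    for BOTH the source and its translation."""
    return mse(teacher_src, student_src) + mse(teacher_src, student_tgt)

# Hypothetical embeddings for one parallel pair (source, translation):
teacher_s = [0.2, 0.8, 0.4]    # teacher SBERT on the source sentence
student_s = [0.25, 0.75, 0.4]  # student (e.g. XLM-R) on the source
student_t = [0.3, 0.7, 0.5]    # student on the translation

loss = distillation_loss(teacher_s, student_s, student_t)
```

Minimizing this loss pulls the student's embeddings of a sentence and its translation toward the same point, which is precisely what aligns the vector spaces across languages.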