The Wiki40B-lm module includes language models trained on the newly published, cleaned-up Wiki-40B dataset available on TFDS. The training setup follows the paper “Wiki-40B: Multilingual Language Model Dataset”.
The models take an input text string and output text embeddings for each layer, along with the negative log-likelihood, from which perplexity can be computed. The models can also perform text generation. See our Colab for a full demo.
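As a quick illustration of the relationship between the negative log-likelihood output and perplexity, the perplexity of a sequence is the exponential of its mean per-token negative log-likelihood. The sketch below uses made-up NLL values, not actual model output:

```python
import math

def perplexity(neg_log_likelihoods):
    # Perplexity = exp(mean per-token negative log-likelihood).
    return math.exp(sum(neg_log_likelihoods) / len(neg_log_likelihoods))

# Illustrative per-token NLL values (hypothetical, not from a real model).
nlls = [2.1, 3.4, 1.8, 2.7]
print(perplexity(nlls))
```

In practice you would feed the per-token negative log-likelihoods returned by the model for your input string into a reduction like this one.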
The collection includes 41 monolingual models (en, ar, zh-cn, zh-tw, nl, fr, de, it, ja, ko, pl, pt, ru, es, th, tr, bg, ca, cs, da, el, et, fa, fi, he, hi, hr, hu, id, lt, lv, ms, no, ro, sk, sl, sr, sv, tl, uk, vi) and 2 multilingual models trained with 64k and 128k SentencePiece (SPM) vocabularies.
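The per-language models can be addressed programmatically by building their TF Hub handles from the language codes above. The handle pattern below is an assumption based on the collection's per-language naming; verify the exact handles on tfhub.dev before use:

```python
# Language codes for the 41 monolingual models in the collection.
LANGS = [
    "en", "ar", "zh-cn", "zh-tw", "nl", "fr", "de", "it", "ja", "ko",
    "pl", "pt", "ru", "es", "th", "tr", "bg", "ca", "cs", "da", "el",
    "et", "fa", "fi", "he", "hi", "hr", "hu", "id", "lt", "lv", "ms",
    "no", "ro", "sk", "sl", "sr", "sv", "tl", "uk", "vi",
]

def module_handle(lang, version=1):
    # Assumed handle pattern for the per-language modules; check
    # tfhub.dev for the authoritative URLs and versions.
    return f"https://tfhub.dev/google/wiki40b-lm-{lang}/{version}"

print(module_handle("en"))
```

A handle built this way can then be passed to `hub.Module` (TF1) to load the corresponding language model.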