Language Support¶
Rasa NLU can be used to understand any language, but some backends are restricted to specific languages.
The tensorflow_embedding pipeline can be used for any language, because it trains custom word embeddings for your domain.
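For example, selecting this pipeline for an otherwise unsupported language could look like the following sketch (assuming the JSON configuration format used elsewhere on this page; the language code `th` is just an example):

```json
{
  "pipeline": "tensorflow_embedding",
  "language": "th"
}
```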
Pre-trained Word Vectors¶
With the spaCy backend you can now load fastText vectors, which are available for hundreds of languages.
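As a rough sketch, converting downloaded fastText vectors into a loadable spaCy model could look like this (assuming spaCy 2.x, whose CLI provides the `init-model` and `link` commands; the Finnish vectors are just an example):

```bash
# Download pre-trained fastText vectors (here: Finnish)
wget https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.fi.vec
# Build a loadable spaCy model directory from the raw vectors
python -m spacy init-model fi /tmp/fi_model --vectors-loc wiki.fi.vec
# Link the directory under a shorthand so it can be loaded as "fi"
python -m spacy link /tmp/fi_model fi
```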
| backend | supported languages |
|---|---|
| spacy-sklearn | english (`en`), german (`de`), spanish (`es`), portuguese (`pt`), italian (`it`), dutch (`nl`), french (`fr`) |
| MITIE | english (`en`) |
| Jieba-MITIE | chinese (`zh`) * |
These languages can be set as part of the Server Configuration.
Adding a new language¶
We want to make the process of adding new languages as simple as possible to increase the number of supported languages. Nevertheless, to use a language you either need a pre-trained word representation or you need to train that representation yourself on a large corpus of text data in that language.
These are the steps necessary to add a new language:
spacy-sklearn¶
spaCy already provides a really good documentation page about Adding languages. This will help you train a tokenizer and vocabulary for a new language in spaCy.
As described in the documentation, you need to register your language using set_lang_class(), which allows Rasa NLU to load and use your new language by passing your language identifier as the language Server Configuration option.
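For illustration, a minimal sketch of such a registration (the language code `zz` and the class name are placeholders, not identifiers defined by spaCy or Rasa NLU):

```python
from spacy.language import Language
from spacy.util import set_lang_class


class MyLanguage(Language):
    # the identifier under which the language will be registered
    lang = "zz"


# register the class so that spaCy (and hence Rasa NLU) can resolve
# the identifier "zz" to it when loading the language
set_lang_class("zz", MyLanguage)
```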
MITIE¶
- Get a more or less clean language corpus (a Wikipedia dump works) as a set of text files
- Build and run the MITIE Wordrep Tool on your corpus (see the sketch after this list). This can take several hours/days depending on your dataset and your workstation. You'll need something like 128GB of RAM for wordrep to run - yes, that's a lot: try to extend your swap.
- Set the path of your new total_word_feature_extractor.dat as the value of the mitie_file parameter in config_mitie.json (see the configuration sketch below)
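As a sketch, building and running wordrep might look like this (assuming a CMake-based build of the MITIE repository; the corpus path is a placeholder):

```bash
# clone and build the MITIE Wordrep Tool
git clone https://github.com/mit-nlp/MITIE.git
cd MITIE/tools/wordrep
mkdir build && cd build
cmake ..
cmake --build . --config Release
# run wordrep on a folder containing only your cleaned text files;
# this produces total_word_feature_extractor.dat and may run for days
./wordrep -e /path/to/your/folder_of_text_files
```

The resulting file is then referenced from the configuration, e.g. (a hedged example; `mitie` here stands for a MITIE-based pipeline template and `xx` for your language code):

```json
{
  "pipeline": "mitie",
  "language": "xx",
  "mitie_file": "/path/to/total_word_feature_extractor.dat"
}
```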
Jieba-MITIE¶
Some notes about using the Jieba tokenizer together with MITIE on Chinese language data: to use it, you need a proper MITIE feature extractor, e.g. data/total_word_feature_extractor_zh.dat. It should be trained from a Chinese corpus using the MITIE Wordrep Tool described above (training takes 2-3 days). Note that the Chinese corpus must be tokenized before it is fed into the tool for training, and a closed-domain corpus that closely matches your use case works best.
Detailed instructions on how to train the model yourself, as well as a model trained from a Chinese Wikipedia dump and Baidu Baike, can be found in crownpku's blogpost.
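Put together, a Chinese pipeline configuration could look like the following sketch (assuming the JSON config format and the component names used by Rasa NLU releases of this era; verify them against your installed version):

```json
{
  "language": "zh",
  "pipeline": ["nlp_mitie", "tokenizer_jieba", "ner_mitie", "ner_synonyms",
               "intent_featurizer_mitie", "intent_classifier_sklearn"],
  "mitie_file": "data/total_word_feature_extractor_zh.dat"
}
```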