Choosing a Pipeline¶
Pre-trained or custom word vectors?¶
The two most important pipelines are tensorflow_embedding
and spacy_sklearn
.
The biggest difference between them is that the spacy_sklearn
pipeline uses pre-trained
word vectors from either GloVe or fastText. Instead, the tensorflow embedding pipeline
doesn’t use any pre-trained word vectors, but instead fits these specifically for your dataset.
The advantage of the spacy_sklearn
pipeline is that if you have a training example like:
“I want to buy apples”, and Rasa is asked to predict the intent for “I want to buy pears”, your model
already knows that the words “apples” and “pears” are very similar. This is especially useful
if you don’t have very much training data (< 500 labeled examples).
The advantage of the tensorflow_embedding
pipeline is that your word vectors will be customised
for your domain. For example, in general English, the word “balance” is closely related to “symmetry”,
but very different to the word “cash”. In a banking domain, “balance” and “cash” are closely related
and you’d like your model to capture that.
You can read more about this topic here .
As a rule of thumb, if there is a spaCy model for your language,
then the spacy_sklearn
pipeline is a good choice for getting started.
However once you have more training data (>500 sentences),
it is highly recommended that you try the tensorflow_embedding
pipeline.
There are also the mitie
and mitie_sklearn
pipelines, which use MITIE as a source of word vectors.
We do not recommend that you use these; they are likely to be deprecated in a future release.
Multiple Intents¶
If you want to split intents into multiple labels, e.g. for predicting multiple intents or for modeling hierarchical intent structure, you can only do this with the tensorflow pipeline. To do this, use these flags:
intent_tokenization_flag
iftrue
the algorithm will split the intent labels into tokens and use a bag-of-words representations for them;intent_split_symbol
sets the delimiter string to split the intent labels. Default_
Here’s an example configuration:
language: "en"
pipeline:
- name: "intent_featurizer_count_vectors"
- name: "intent_classifier_tensorflow_embedding"
intent_tokenization_flag: true
intent_split_symbol: "_"
Custom pipelines¶
Creating your own pipelines is possible by directly passing the names of the
components to Rasa NLU in the pipeline
configuration variable, e.g.
pipeline:
- name: "nlp_spacy"
- name: "ner_crf"
- name: "ner_synonyms"
This creates a pipeline that only does entity recognition, but no intent classification. Hence, the output will not contain any useful intents. You can find the details of each component in Pipeline and Component Configuration.
If you want to use custom components in your pipeline, see Custom Components.