Evaluating and Testing
Note
If you’re looking to evaluate both Rasa NLU and Rasa Core predictions combined, take a look at the section on end-to-end evaluation.
Evaluating a Trained Model
You can evaluate your trained model on a set of test stories by using the evaluate script:
$ python -m rasa_core.evaluate --core models/dialogue \
  --stories test_stories.md -o results
This will print the failed stories to results/failed_stories.md.
We count any story as failed if at least one of the actions
was predicted incorrectly.
In addition, this will save a confusion matrix to a file called
results/story_confmat.pdf. The confusion matrix shows, for each action in
your domain, how often that action was predicted, and how often an
incorrect action was predicted instead.
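The failure rule and the confusion matrix counts described above can be sketched in a few lines of Python. This is only an illustration: the `stories` dictionary, the function names, and the example actions are hypothetical and do not reflect Rasa Core's internal data structures.

```python
from collections import Counter


def failed_stories(stories):
    """Return the names of stories with at least one wrong action prediction.

    `stories` maps a story name to a list of (expected, predicted) action
    pairs. This structure is an assumption made for the example.
    """
    return [
        name
        for name, steps in stories.items()
        if any(expected != predicted for expected, predicted in steps)
    ]


def confusion_counts(stories):
    """Count how often each (expected, predicted) action pair occurs."""
    counts = Counter()
    for steps in stories.values():
        counts.update(steps)
    return counts


# Hypothetical evaluation results for two test stories.
example = {
    "story 1": [
        ("utter_greet", "utter_greet"),
        ("utter_ask_location", "utter_ask_price"),  # wrong prediction
    ],
    "story 2": [("utter_greet", "utter_greet")],
}

print(failed_stories(example))
print(confusion_counts(example))
```

A single wrong prediction is enough to mark "story 1" as failed, while the off-diagonal pair (`utter_ask_location`, `utter_ask_price`) is what would show up as a misprediction in the confusion matrix.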
For the full list of options, run the script with the --help flag.
End-to-end evaluation of Rasa NLU and Core
Say your bot uses a dialogue model in combination with a Rasa NLU model to
parse user messages, and you would like to evaluate how the two models
perform together on whole dialogues.
The evaluate script lets you evaluate dialogues end-to-end, combining
Rasa NLU intent predictions with Rasa Core action predictions.
You can activate this feature with the --e2e option in the
rasa_core.evaluate module.
The story format used for end-to-end evaluation is slightly different to
the standard Rasa Core stories, as you’ll have to include the user
messages in natural language instead of just their intent. The format for the
user messages is * <intent>:<Rasa NLU example>. The NLU part follows the
markdown syntax for Rasa NLU training data.
Here’s an example of what an end-to-end story file may look like:
## end-to-end story 1
* greet: hello
   - utter_ask_howcanhelp
* inform: show me [chinese](cuisine) restaurants
   - utter_ask_location
* inform: in [Paris](location)
   - utter_ask_price
## end-to-end story 2
...
If you’ve saved these stories under e2e_stories.md,
the full end-to-end evaluation command is this:
$ python -m rasa_core.evaluate default --core models/dialogue \
  --nlu models/nlu/current \
  --stories e2e_stories.md --e2e
Note
Make sure you specify an NLU model to load with the dialogue model using the
--nlu option of rasa_core.evaluate. If you do not specify an NLU
model, Rasa Core will load the default RegexInterpreter.
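To make the end-to-end user message format `* <intent>:<Rasa NLU example>` more concrete, here is a minimal sketch of how such a line could be split into intent, plain text, and entities. The regular expressions and the `parse_user_line` helper are assumptions written for this example; they are not Rasa Core's actual story reader.

```python
import re

# "* <intent>: <annotated text>" as used in end-to-end stories.
USER_LINE = re.compile(r"^\*\s*(?P<intent>[^:\s]+)\s*:\s*(?P<text>.+)$")
# Markdown entity annotation: [value](entity)
ENTITY = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")


def parse_user_line(line):
    """Parse one end-to-end user line; return None for any other line."""
    match = USER_LINE.match(line.strip())
    if match is None:
        return None
    annotated = match.group("text")
    entities = [
        (m.group("entity"), m.group("value")) for m in ENTITY.finditer(annotated)
    ]
    # Replace each [value](entity) annotation with just the value.
    plain_text = ENTITY.sub(lambda m: m.group("value"), annotated)
    return {
        "intent": match.group("intent"),
        "text": plain_text,
        "entities": entities,
    }


print(parse_user_line("* inform: show me [chinese](cuisine) restaurants"))
```

Action lines such as `- utter_ask_location` do not start with `*`, so the helper returns `None` for them.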
Comparing Policies
To choose a specific policy, or to choose hyperparameters for a specific policy, you want to measure how well Rasa Core will generalise to conversations which it hasn’t seen before. Especially in the beginning of a project, you do not have a lot of real conversations to use to train your bot, so you don’t just want to throw some away to use as a test set.
Rasa Core has some scripts to help you choose and fine-tune your policy.
Once you are happy with it, you can then train your final policy on your
full data set. To do this, you first have to train models for your different
policies. Create two (or more) policy config files of the policies you want to
compare (containing only one policy each), and then use the compare mode of
the train script to train your models:
$ python -m rasa_core.train compare -c policy_config1.yml policy_config2.yml \
  -d domain.yml -s stories_folder -o comparison_models --runs 3 --percentages \
  0 5 25 50 70 90 95
For each policy configuration provided, Rasa Core will be trained multiple times with 0, 5, 25, 50, 70, 90 and 95% of your training stories excluded from the training data. This is done for multiple runs, to ensure consistent results.
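To get a feel for how much data each model sees, the following sketch lists the training set size for every run and exclusion percentage. The `training_plan` helper and the figure of 200 stories are hypothetical, and rounding the excluded count down is an assumption; Rasa Core may round differently.

```python
def training_plan(num_stories, percentages, runs):
    """List (run, excluded percentage, stories kept) for a comparison.

    The excluded count is rounded down, which is an assumption made for
    this example.
    """
    plan = []
    for run in range(1, runs + 1):
        for pct in percentages:
            excluded = num_stories * pct // 100
            plan.append((run, pct, num_stories - excluded))
    return plan


# Same percentages and number of runs as in the command above,
# applied to a hypothetical set of 200 training stories.
plan = training_plan(200, [0, 5, 25, 50, 70, 90, 95], runs=3)
for run, pct, kept in plan:
    print("run {}: {}% excluded, {} stories kept".format(run, pct, kept))
```

With 7 percentages and 3 runs, this amounts to 21 training passes per policy configuration, which is why the comparison can take a long time.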
Once this script has finished, you can now use the evaluate script in compare mode to evaluate the models you just trained:
$ python -m rasa_core.evaluate compare --stories stories_folder \
  --core comparison_models \
  -o comparison_results
This will evaluate each of the models on the training set, and plot some graphs to show you which policy is best. By evaluating on the full set of stories, you can measure how well Rasa Core is predicting the held-out stories.
If you’re not sure which policies to compare, we’d recommend trying out the
EmbeddingPolicy and the KerasPolicy to see which one works better for
you.
Note
This training process can take a long time, so we’d suggest letting it run somewhere in the background where it can’t be interrupted.
Evaluating stories over HTTP
Rasa Core’s server lets you retrieve evaluations for the currently
loaded model. Say your Rasa Core server is running locally on port 5005,
and your story evaluation file is saved at eval_stories.md. The command
to post stories to the server for evaluation is this:
$ curl --data-binary @eval_stories.md "localhost:5005/evaluate" | python -m json.tool
If you would like to evaluate end-to-end stories (see the section on
end-to-end evaluation above), you may do so by adding the e2e=true query
parameter:
$ curl --data-binary @eval_stories.md "localhost:5005/evaluate?e2e=true" | python -m json.tool
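If you would rather call the endpoint from Python than from curl, a rough equivalent using only the standard library might look like the sketch below. The helper names `build_evaluate_request` and `evaluate_over_http` are made up for this example; the `/evaluate` path, the `e2e=true` parameter, and port 5005 are taken from the commands above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_evaluate_request(stories, base_url="http://localhost:5005", e2e=False):
    """Build a POST request that sends raw story bytes to /evaluate."""
    url = base_url.rstrip("/") + "/evaluate"
    if e2e:
        url += "?" + urlencode({"e2e": "true"})
    return Request(url, data=stories, method="POST")


def evaluate_over_http(stories_path, **kwargs):
    """Post a story file to a running server and return the parsed JSON."""
    with open(stories_path, "rb") as f:
        request = build_evaluate_request(f.read(), **kwargs)
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, `evaluate_over_http("eval_stories.md", e2e=True)` would mirror the second curl command.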
Have questions or feedback?
We have a very active support community on the Rasa Community Forum that is happy to help you with your questions. If you have any feedback for us or a specific suggestion for improving the docs, feel free to share it by creating an issue on the Rasa Core GitHub repository.