Toward the Systematic Evaluation of Chatbots


Currently, the best method for evaluating chatbots relies on humans, but the lack of standardization can make this difficult. Six people at the University of Pennsylvania are working to solve this problem. They have developed a unified system to evaluate chatbots that augments current tools and also provides an online hub for chatbot developers to test and show off their latest creations.

Evaluating chatbot performance

Even small variations in the training stage of a chatbot can lead to different model performance scores. The variety of evaluation techniques in use and the lack of a large community of chatbot developers make the problem of evaluating performance even more difficult. A look at the papers on chatbots shows that the many different methods for rating performance make it harder to pin down consistent and reproducible results.

To solve these problems, Joao Sedoc, Daphne Ippolito, Arun Kirubarajan, Jai Thirani, Lyle Ungar, and Chris Callison-Burch collaborated on ChatEval, a new way to evaluate chatbot performance. ChatEval consists of open-source code for automatic and human evaluation and a web-based community that offers access to chatbot code, including trained parameters and shared evaluation results. The more people who use ChatEval, the more useful it becomes.

Not surprisingly, Amazon and Facebook have their own chatbot evaluation systems. ChatEval is different in that it deals with the evaluation and warehousing of chatbot models using only text file output. There’s no need for code integration for the system to work, making it even easier to develop a consistent way to rate the performance of chatbots.

The ChatEval web interface

The ChatEval website is easy to use and has four pages. Besides the overview page, users can submit a model, view a model's profile, or compare different models.

  • Model submission – Researchers submit a model along with a description, model responses, and other pertinent information. The chatbot's responses to one or more of ChatEval’s evaluation datasets are checked manually by humans and also scored by automated evaluation.
  • Model profile – Each submitted chatbot has a profile with a description of the chatbot, its creators, and other relevant information such as a link to the chatbot. Its past performance under the various evaluations is also listed, including a visualization of the differences between human and automated evaluation.
  • Response comparison – The final feature of the website is simple yet powerful. It allows users to see a qualitative comparison of various chatbot models. Users can see each of the various prompts used by an evaluation method as well as the particular responses generated by each chatbot.

The best part is that once a chatbot is in the ChatEval database, it can be compared against future chatbot models submitted to the system.

ChatEval evaluation toolkit

A closer look at the ChatEval evaluation toolkit shows how useful it is to the chatbot developer community.

Automatic evaluation

Several different metrics are used for the automatic evaluation.

  • The unique n-grams in the chatbot’s responses divided by the total number of tokens.
  • The average cosine similarity between the mean word embedding of a chatbot response and the mean word embedding of the ground-truth response.
  • The average BLEU-2 score.
  • The perplexity of the responses.
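As a rough illustration of the first two metrics, here is a minimal Python sketch. It is not ChatEval's own code: the function names, the whitespace tokenization, and the toy embedding table are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def distinct_n(responses: Sequence[str], n: int = 1) -> float:
    """Unique n-grams across all responses divided by the total token count."""
    unique_ngrams = set()
    total_tokens = 0
    for response in responses:
        tokens = response.lower().split()  # assumption: whitespace tokenization
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            unique_ngrams.add(tuple(tokens[i:i + n]))
    return len(unique_ngrams) / total_tokens if total_tokens else 0.0


def mean_vector(text: str, embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embeddings of the words we know; zero vector if none are known."""
    known = [embeddings[t] for t in text.lower().split() if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_embedding_similarity(
    responses: Sequence[str],
    references: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> float:
    """Mean cosine similarity between each response and its ground-truth reference."""
    scores = [
        cosine(mean_vector(r, embeddings, dim), mean_vector(g, embeddings, dim))
        for r, g in zip(responses, references)
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy two-dimensional embeddings, purely illustrative.
toy_embeddings = {"hello": [1.0, 0.0], "goodbye": [0.0, 1.0]}
print(distinct_n(["a b", "a c"], n=1))
print(average_embedding_similarity(["hello"], ["hello"], toy_embeddings, dim=2))
```

In practice the embedding table would come from pretrained word vectors, and BLEU-2 and perplexity would normally be computed with an existing library and a language model rather than by hand.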

The automatic evaluation code used by ChatEval is modular, so further evaluation metrics can be added over time.
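One common way to get this kind of modularity is a metric registry, sketched below in Python. This is only an illustration of the idea and not ChatEval's actual design; `register_metric`, `METRICS`, `evaluate`, and the two example metrics are hypothetical names chosen for the sketch.

```python
from typing import Callable, Dict, List

# A metric takes the list of chatbot responses and returns a single score.
Metric = Callable[[List[str]], float]

# Hypothetical registry: metric name -> scoring function.
METRICS: Dict[str, Metric] = {}


def register_metric(name: str) -> Callable[[Metric], Metric]:
    """Decorator that adds a scoring function to the registry under `name`."""
    def decorator(fn: Metric) -> Metric:
        METRICS[name] = fn
        return fn
    return decorator


@register_metric("avg_length")
def average_length(responses: List[str]) -> float:
    """Placeholder metric: mean number of whitespace-separated tokens."""
    if not responses:
        return 0.0
    return sum(len(r.split()) for r in responses) / len(responses)


@register_metric("distinct_1")
def distinct_1(responses: List[str]) -> float:
    """Unique unigrams divided by total tokens."""
    tokens = [t for r in responses for t in r.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def evaluate(responses: List[str]) -> Dict[str, float]:
    """Run every registered metric over the same set of responses."""
    return {name: fn(responses) for name, fn in METRICS.items()}


print(evaluate(["a b", "a c"]))
```

With this layout, adding a new metric means writing one decorated function; the `evaluate` loop does not change.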

Human evaluation

In addition to automated evaluation of chatbot responses, ChatEval uses human evaluation. This consists of A/B comparison tests in which a human is shown responses from two different chatbots and either rates one as better or declares a tie. With this method, the overall inter-annotator agreement (IAA) varies depending on several factors.
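To make the A/B procedure concrete, the Python sketch below tallies annotator votes per prompt and reports how often each chatbot wins. The label names ("A", "B", "tie"), the majority-vote rule, and the function names are assumptions for illustration; the article does not specify how ChatEval aggregates its ratings.

```python
from collections import Counter
from typing import Dict, List


def majority_label(votes: List[str]) -> str:
    """Return the label most annotators chose, or 'tie' if no label leads outright."""
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"  # the top two labels have the same number of votes
    return ranked[0][0]


def summarize(judgments: Dict[str, List[str]]) -> Counter:
    """Count, over all prompts, how often each outcome was the majority label."""
    return Counter(majority_label(votes) for votes in judgments.values())


# Hypothetical votes: three annotators per prompt.
judgments = {
    "prompt-1": ["A", "A", "B"],
    "prompt-2": ["B", "B", "tie"],
    "prompt-3": ["A", "B", "tie"],
}
print(summarize(judgments))
```

An agreement statistic such as Cohen's or Fleiss' kappa could be computed from the same vote lists to quantify IAA.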

ChatEval employs crowdsourced workers to evaluate the responses. Workers are paid $0.01 per evaluation. Each chatbot response is rated three times in this way. For example, if there are a total of 200 chatbot responses, 600 evaluations will need to be conducted for a total price of six dollars. Currently, chatbot developers are asked to pay this fee.
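The cost arithmetic above can be checked with a few lines of Python. The function and parameter names are hypothetical; the defaults simply restate the figures in the paragraph (three ratings per response at $0.01 each).

```python
from decimal import Decimal
from typing import Tuple


def evaluation_cost(
    num_responses: int,
    ratings_per_response: int = 3,
    price_per_rating: Decimal = Decimal("0.01"),
) -> Tuple[int, Decimal]:
    """Return (total number of ratings, total price in dollars)."""
    total_ratings = num_responses * ratings_per_response
    # Decimal avoids binary floating-point rounding when handling cents.
    return total_ratings, total_ratings * price_per_rating


ratings, cost = evaluation_cost(200)
print(ratings, cost)  # 600 6.00
```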

Evaluation datasets

ChatEval treats evaluation datasets as essential and uses ones based on the Dialogue Breakdown Detection Challenge (DBDC) dataset. In the DBDC conversations, participants, who knew they were chatting with a bot, talked with the TickTock, Iris, and CIC chatbots. The best human interactions in these tests were selected for use as the evaluation dataset.

Future evaluation of chatbots

In time, more effort will be put into the tangible benefits of chatbots. How much does the chatbot upsell? Does using a particular chatbot increase customer satisfaction and loyalty? As social media chatbot technology improves in the years ahead, these questions and more will become the focal point for chatbot evaluation. For now, tools like ChatEval offer a basic understanding of how close to humans a chatbot can be.

