Sunday 1 April 2018

Chatbots and Text Classification

Practical advice on intents



The problem of recognizing intents in dialogues is, after all, a text classification problem. This article on Medium, apart from reintroducing the concept, gives some practical advice on how to deal with this problem.

Assuming we choose the machine-learning approach, and our goal is first of all to identify intents, these are the things we should take into account:


  • Intent count: an app should have about 5–10 intents on average. Fewer intents make the bot too simplistic, while more intents hurt accuracy.
  • Data magnitude and quality: as in any machine learning task, the more data we have, and the closer it is to the queries seen at inference time, the better the results.
  • Transfer learning possibility: transfer learning, in other words using a model pretrained on a similar problem, can be very helpful if one is available.
  • Inference input size: users don’t tend to be concise when querying our bot. Text summarization tools, along with cues for users to be short and to the point, will therefore help our app achieve better accuracy.
Few intents, lots of data, short user sentences, and a reused pretrained model. Got it?


The Classifier - the core of the Chatbot

In the end, the problem of finding intents for queries is a text classification problem. You might use standard scikit-learn pipelines, combined with text preprocessing, to create such classifiers. The approaches used to tackle problems such as the 20 Newsgroups dataset still work.
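To make this concrete, here is a minimal sketch of such a pipeline in scikit-learn. The utterances and intent names (`hours`, `booking`, `cancel`) are invented for illustration, and TF-IDF plus logistic regression is only one reasonable choice among many, not something the article prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training utterances, each labelled with an invented intent.
TRAINING_DATA = [
    ("what are your opening hours", "hours"),
    ("when do you open tomorrow", "hours"),
    ("are you open on sunday", "hours"),
    ("I want to book a table", "booking"),
    ("can I reserve a table for two", "booking"),
    ("book a table for tonight", "booking"),
    ("cancel my reservation", "cancel"),
    ("I need to cancel my booking", "cancel"),
    ("please cancel the table I reserved", "cancel"),
]


def train_intent_classifier(examples):
    """Fit a TF-IDF + logistic regression pipeline on (text, intent) pairs."""
    texts = [text for text, _ in examples]
    labels = [label for _, label in examples]
    model = make_pipeline(
        # Word unigrams and bigrams, lowercased by the vectorizer.
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(texts, labels)
    return model


classifier = train_intent_classifier(TRAINING_DATA)
print(classifier.predict(["cancel my reservation"])[0])
```

Nine examples are of course far too few for a real bot; the point is only the shape of the pipeline, which stays the same as the data grows.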

However, we may not have anywhere near 20,000 texts to train our classifier. We might then start with the "WeChat" approach: begin with keywords and then move on to machine learning, leveraging the data that we gather from users.
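A rough sketch of what the keyword-first stage could look like is shown here. The keyword lists, the intent names and the `collected` log are all hypothetical; the idea is simply that every query is stored together with the intent the keywords produced, so that it can be reviewed and later reused as training data.

```python
import re

# Hypothetical keyword sets, one per invented intent.
INTENT_KEYWORDS = {
    "hours": {"open", "opening", "hours", "close"},
    "booking": {"book", "reserve", "table"},
    "cancel": {"cancel"},
}

# Queries seen so far, kept as (query, matched intent) pairs for later labelling.
collected = []


def keyword_intent(query, keywords=INTENT_KEYWORDS):
    """Return the intent with the most keyword hits, or None if nothing matches."""
    tokens = set(re.findall(r"[a-z']+", query.lower()))
    scores = {intent: len(tokens & words) for intent, words in keywords.items()}
    best = max(scores, key=scores.get)  # ties go to the first intent listed
    return best if scores[best] > 0 else None


def handle(query):
    """Classify a query with keywords and log it as future training data."""
    intent = keyword_intent(query)
    collected.append((query, intent))
    return intent
```

Queries that come back as `None` are probably the most valuable ones to look at, since they show what the keyword lists are missing.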


What Pipeline to use


OK, sure, this is all well and good. However, not all text classification problems are born equal. First and foremost, you have to think about how to clean up your text.
This article suggests a natural pipeline for chatbots. This sums up the preprocessing part:

  • Spellcheck 
  • Split into sentences 
  • Split into words 
  • POS tagging 
  • Lemmatize words  
  • Entity recognition 
  • Find concepts / Synonyms  
Once you have normalized your sentences, you can move on to the next step.
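As a toy illustration of part of this pipeline, the sketch below covers only sentence splitting, word splitting, lemmatization and synonym mapping, using plain Python. The `LEMMAS` and `SYNONYMS` tables are hypothetical stand-ins; spellchecking, POS tagging and entity recognition are left out because they really need a proper NLP library.

```python
import re

# Hypothetical lookup tables standing in for a real lemmatizer and thesaurus.
LEMMAS = {"tables": "table", "booked": "book", "booking": "book", "reserved": "reserve"}
SYNONYMS = {"reserve": "book"}


def split_sentences(text):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_words(sentence):
    """Lowercase the sentence and keep runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def normalize(text):
    """Return one list of normalized words per sentence."""
    normalized = []
    for sentence in split_sentences(text):
        words = split_words(sentence)
        words = [LEMMAS.get(w, w) for w in words]    # lemmatize
        words = [SYNONYMS.get(w, w) for w in words]  # map synonyms to one concept
        normalized.append(words)
    return normalized


print(normalize("Hi there! I reserved two tables."))
```

The order of the steps matters here: synonyms are looked up after lemmatization, so "reserved" first becomes "reserve" and only then "book".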

Rule-Based or Machine Learning?

The above-mentioned article goes on to suggest defining patterns or DSLs to identify intents from the simplified sentence. That is a rule-based approach, pretty much the pattern-matching approach we already mentioned.
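A small sketch of what such patterns might look like, using plain regular expressions rather than a dedicated DSL. The rules, intent names and the `party_size` slot are made up for illustration, and the input is assumed to be a sentence already normalized to lowercase lemmas.

```python
import re

# Hypothetical rules, tried in order: (intent name, pattern over a normalized sentence).
RULES = [
    ("cancel", re.compile(r"\bcancel\b.*\b(book|table)\b")),
    ("booking", re.compile(r"\bbook\b.*\btable\b(?: for (?P<party_size>\d+))?")),
    ("hours", re.compile(r"\b(open|close)\b")),
]


def match_intent(normalized_sentence):
    """Return (intent, slots) for the first rule that matches, else (None, {})."""
    for intent, pattern in RULES:
        match = pattern.search(normalized_sentence)
        if match:
            # Keep only the named groups that actually captured something.
            slots = {k: v for k, v in match.groupdict().items() if v is not None}
            return intent, slots
    return None, {}


print(match_intent("i want to book a table for 4"))
```

Rule order is a design decision: the `cancel` rule comes first so that "cancel my book" is not mistaken for a new booking.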

Machine-learning approaches may give better results when more data is available.
