Understanding Conversations in Depth through Synergistic Human/Machine Interaction


Every day, billions of people communicate via email, chat, text, social media, and more. Every day, people are communicating their desires, concerns, challenges and victories. And every day, organizations struggle to understand these conversations so they can better serve their customers.

Consider a few examples:

  • A communication system enables a famous politician or star to communicate with thousands or millions of constituents or fans

  • A product or service review system like Yelp gathers free form reviews from millions of people

  • An email system automatically conducts conversations with people after they fill out a form, stop by a booth, or otherwise indicate interest

  • An insurance company records millions of audio transcripts of conversations regarding a claim

  • A trend prediction system scans social media conversations to predict the next flavor food companies should plan for — in the past it was pomegranate, what will it be in 6 months?

In each of these cases, there is a need to automatically understand what is being said. Understanding a direct message rapidly can allow a system to elevate its priority, compose a suggested reply, or even reply automatically on someone’s behalf. Understanding a large number of messages can allow a politician to make sense of their massive inbox so they can better understand their constituency’s perspective on a given day or topic.

Understanding a large number of reviews can enable a surgeon to easily understand exactly what they are doing right and where they should improve, or help a product manager understand which aspects of their products are well received and which are problematic.

Understanding the conversation begins with understanding one document. Once we can teach a machine to understand everything in a single document, we can project this understanding up to a collection, thread or larger corpus of documents to understand the broader conversation.

The anatomy of a single document is shown below. In it, we see a template for a document. A given document could be an email, a text or social media message, a blog post, a product review, etc. Typically a title or subject of some sort is present. Next, some document-level descriptive information is often present, like the author, the date of the document, or perhaps a case number if it is a legal document. Next we have the body of the document, usually paragraphs composed of multiple sentences. In addition to the document content shown below, documents usually exist in a context: an email can be in reply to another email, or a social media message can belong to a discussion thread. For simplicity, however, we’ll focus on a single document, and leave the inter-document discussion for later.

[Figure: the anatomy of a single document, with a title, document-level metadata like author and date, and a body of paragraphs composed of sentences]
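To make the template concrete, here is a minimal sketch of how such a document might be represented in code. The field names are hypothetical and would vary by document type (email, review, social media post, and so on).

```python
# Hypothetical sketch of the document template described above; a real schema
# would differ depending on the source (email, review, tweet, transcript, ...).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    title: Optional[str]             # subject line, review title, etc.
    author: Optional[str]            # document-level descriptive information
    date: Optional[str]
    case_number: Optional[str]       # e.g., present for a legal document
    body: str                        # unstructured text: paragraphs of sentences
    thread_id: Optional[str] = None  # surrounding context, e.g., a reply chain
```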

Typically, much of this information is accessible in machine-readable form, but the unstructured text is not easily understood without NLP (natural language processing) tailored AI, accelerated by tooling like that in our Mayetrix SDK. From an AI vantage, there are multiple types of information we can train a machine to extract:

  • Sentences are usually split by a model trained to do just that. We might use a sentence splitter trained on reasonably well composed text, like news, or we might train a custom sentence splitter for more informal discourse styles like those present in social media or SMS.

  • Individual key phrases or entities, like specific people, places or things, might be present inside a sentence. We have multiple options for automatically extracting phrases and entities, typically a combination of knowledge-based and example-trained machine learning models.

  • Sentence level insights might come in the form of categories a given sentence can be placed into, or in the form of grammatical clause level information (think back to seventh grade grammar class), such as a source > action > target structure like LeBron James [nba_player] > score > final shot.

  • Document level insights, often assisted by the more granular information extraction described above, might include the overall sentiment or a summarization of the document.
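As a concrete illustration of the first two extraction types, here is a minimal sketch using the open-source spaCy library. It is shown purely for illustration and is not the Mayetrix SDK.

```python
# Sketch: sentence splitting and entity extraction with spaCy.
# Assumes the small English model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")  # a general-purpose English pipeline

text = "LeBron James scored the final shot. The crowd in Los Angeles went wild."
doc = nlp(text)

# Sentence splitting: the model decides where one sentence ends and the next begins.
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Entity extraction: specific people, places or things mentioned in the text.
for ent in doc.ents:
    print("ENTITY:", ent.text, "->", ent.label_)
```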

So how do we build AI or machine learning models to extract each of these types of information?

Much like a toddler learns the word furniture by being shown examples like a chair, a sofa or a table, AI text analysis systems require examples.

Before we can gather examples, however, we need to decide what exactly we are going to label. This might be easy in some cases, like building a spam detector: every email message is either spam or not spam. But in other cases, the task is significantly more complex.


Consider, for example, the case where millions of constituents email thoughts and opinions to their congressional representatives. We can further presume the busy congressperson receiving thousands of emails a day wishes to understand the key perspectives worthy of a response. We might find through an early analysis that constituents are often expressing an emotion of some sort, offering an opinion on a topic or piece of legislation, requesting some specific type of action, or asking questions.

An initial task is to simply understand what the broader conversation consists of. In the chart below, we see that much of this conversation might consist of feedback, a question, or an emotional expression.

[Chart: breakdown of the conversation into high-level categories such as feedback, questions, and emotional expressions]

These broader high level categories might prove insufficient. We might ask what types of questions are being asked, and for some very important questions, we might want to know exactly which question is being asked, for example, “Where can I buy it?” One approach we regularly use at Xyonix is to employ hierarchical label structures, or a label taxonomy. For the political corpus referenced above, we might have a few entries like this:

  • suggestion/legislation_related_suggestion/healthcare_suggestion/include_public_option

  • question/legislation_related_question/healthcare_question/can_i_keep_my_doctor

  • feedback/performance_feedback/positive_feedback/you_are_doing_great

These hierarchical labels provide a few key advantages:

  • it can be easier to teach human annotators to label very granular categories

  • granular categories can be easily included under other taxonomical parents after labeling has commenced, thus preventing costly relabeling

  • more granularity can result in very specific corresponding actions, like a bot replying to a question

Generating labels is often best done in conjunction with AI model construction. If a model performs very badly at recognizing a particular label, for example, that is often a sign that the category is too broad or fuzzy. In this case, we may choose to tease out more granular and easily defined sub-categories.
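Because each label is just a slash-delimited path, granular labels can be rolled up to any ancestor level without touching the underlying annotations. A minimal Python sketch, using the example labels above:

```python
# Sketch: hierarchical labels as slash-delimited paths. Rolling granular labels
# up to a coarser ancestor level is a simple string operation, which is part of
# why costly relabeling is rarely needed as the taxonomy evolves.
from collections import Counter

labels = [
    "suggestion/legislation_related_suggestion/healthcare_suggestion/include_public_option",
    "question/legislation_related_question/healthcare_question/can_i_keep_my_doctor",
    "feedback/performance_feedback/positive_feedback/you_are_doing_great",
    "question/legislation_related_question/healthcare_question/can_i_keep_my_doctor",
]

def roll_up(label: str, depth: int) -> str:
    """Truncate a label path to its first `depth` levels."""
    return "/".join(label.split("/")[:depth])

# Count the conversation at the top level of the taxonomy.
top_level_counts = Counter(roll_up(label, 1) for label in labels)
print(top_level_counts)  # e.g. Counter({'question': 2, 'suggestion': 1, 'feedback': 1})
```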

In addition to defining labels, we also of course need actual examples that our AI models can learn from. The next question is how to select those examples, since human labeling is costly. Should we choose randomly, based on product or business priorities, or something more efficient? The reality is that not all examples are created equal. If a toddler is presented with hundreds of different types of chairs and told they are furniture, but never sees a table, then they’ll likely fail to identify a table as furniture. A similar thing happens with our models.

We need to present training examples that are illustrative of the target category but different from those the model has seen before.

This is why setting arbitrary numerical targets like “label 1 million randomly selected documents” is rarely optimal. One very powerful technique we use regularly at Xyonix, accelerated by our Mayetrix platform, is to create a tight feedback loop in which mistakes the current model makes are identified by our human annotators and labeled correctly. The next model then learns from its prior mistakes, and improves faster than if it were trained only on random examples. The models tell the humans what they “think”, and the humans tell the models when they are wrong. When our human annotators notice the machines making many of the same mistakes, they provide more examples in that area, much the way a teacher might tailor problem sets for a student. The overall result is a productive human/machine synergy. You can read about our data annotation platform or our annotation service if you wish to see how we label data at Xyonix.
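A minimal sketch of one common way to implement such a loop, shown here with scikit-learn purely for illustration (this is not the Mayetrix implementation): route the documents the current model is least confident about to human annotators first.

```python
# Sketch: prioritize annotation by model uncertainty rather than random sampling.
# The example texts and labels below are made up for illustration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts = [
    "where can I buy it?",
    "you are doing great",
    "please support the public option",
    "can I keep my doctor?",
]
labels = ["question", "feedback", "suggestion", "question"]

unlabeled_texts = [
    "is my doctor covered under the new plan?",
    "thanks for everything you do",
    "I am not sure how I feel about this bill",
]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(labeled_texts, labels)

# Confidence = probability of the most likely class; low confidence means the
# model is unsure, so a human label there teaches the next model the most.
probs = model.predict_proba(unlabeled_texts)
confidence = probs.max(axis=1)
for idx in np.argsort(confidence):  # least confident first
    print(f"{confidence[idx]:.2f}  {unlabeled_texts[idx]}")
```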

shutterstock_1018377352.jpg

Once we have sufficient training data, we can begin optimizing our AI models so they are more accurate. This requires a number of steps (a brief sketch of the first two follows the list), like:

  • efficacy assessment: comparing how well each of the tasks above performs on a set-aside test set (a set of examples the trained model has never seen)

  • model selection: selecting a model architecture, like a classical machine learning model such as an SVM, or a more powerful but harder-to-train deep learning based model

  • model optimization: optimizing model types, parameters and hyper-parameters, in essence, teaching the AI to build the best AI system.

  • transfer learning: bootstrapping the AI from other, larger training example sets beyond what you are gathering for just your problem. For example, learning word and phrase meanings from Wikipedia or large collections of Twitter tweets.
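As a minimal sketch of the efficacy assessment and model selection steps, here is a classical baseline built with scikit-learn, a TF-IDF representation feeding a linear SVM, evaluated on a set-aside test set. The texts and labels are tiny placeholders standing in for a real annotated corpus.

```python
# Sketch: train a classical baseline and assess it on examples it has never seen.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "where can I buy it?", "can I keep my doctor?", "what does the bill cost?",
    "is my clinic covered?", "you are doing great", "thanks for your service",
    "great job on the town hall", "keep up the good work",
]
labels = ["question"] * 4 + ["feedback"] * 4

# Hold out a test set so the efficacy numbers reflect unseen examples.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```

Swapping the linear SVM for a deep learning model, or tuning its hyper-parameters, slots into the same harness, which is what makes the set-aside evaluation so useful for comparing candidates.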

Finally, once models are built and deployed, there is the next step of aggregating insights from individual documents into a broader understanding of the overall conversation. At Xyonix, we typically employ a number of techniques, like aggregating and tracking mentions across time, across different users, or across various slices of the corpus. For example, in one project of ours, we built a system that measures the overall sentiment other surgeons express toward a surgeon who has submitted a recent surgery for review. Telling the surgeon that 44% of their reviews expressed negative sentiment is one thing, but telling them that their score is 15% below the mean of peer surgeons is another, more valuable insight. Surgeons didn’t get where they are by being average, let alone below average, so they are more likely to move to correct the specific issues mentioned.
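A small illustration of this kind of roll-up, from per-document sentiment to a peer comparison, follows. The reviews, surgeon names, and resulting numbers are made up for the sketch.

```python
# Sketch: aggregate per-document sentiment into a per-surgeon insight and
# compare each surgeon's negative-review rate against the peer mean.
import pandas as pd

reviews = pd.DataFrame({
    "surgeon":   ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
    "sentiment": ["neg", "pos", "neg", "pos", "pos", "neg", "pos", "pos", "pos"],
})

negative_rate = (
    reviews.assign(is_neg=reviews["sentiment"].eq("neg"))
    .groupby("surgeon")["is_neg"]
    .mean()
)
peer_mean = negative_rate.mean()

for surgeon, rate in negative_rate.items():
    delta = (rate - peer_mean) * 100  # percentage points above/below the peer mean
    print(f"surgeon {surgeon}: {rate:.0%} negative ({delta:+.0f} points vs. peer mean)")
```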

Understanding conversations in depth automatically is a significant endeavor. One key to success we’ve found is looking well beyond just AI model development: considering what the labels are, how they are structured, how they will be used, how you will improve them, how you will get training examples for the models, how the models’ weaknesses can be addressed, and perhaps most importantly, how you will do all of these things over a timeline, with the AI models and the products using them always improving.