
Measuring Customer Support

“What gets measured, gets managed.”

— Peter Drucker

Great products and great marketing attract customers. However, no matter how good a product is, the journey of the customer is often full of hiccups and misunderstandings.

Customer support must then help the user — to keep them using the product, to retain them as a paying customer, and to encourage them to become evangelists for the product. This is especially true for competitive markets, products with significant onboarding commitments, and customers with high lifetime values (LTVs).

Every company, then, must invest in developing a customer support team to protect and grow its revenue. But, like sales teams that are judged on whether they meet their quota, how might we measure the performance of a customer support or service team? Let’s first consider the factors at play.

Factors

When choosing metrics, there are several types we may want to track, such as:

  1. Operational metrics: Metrics that determine how efficient customer support is in processing requests.
  2. Customer satisfaction metrics: Metrics of customer engagement, churn, satisfaction, and promotion of the product.
  3. Support agent satisfaction metrics: An oft-neglected dimension, these metrics consider the happiness of the team at the core of the support operations.

Similar to tracking the performance of other systems, there are leading and lagging metrics. Lagging metrics are often the ultimate success metrics that we care about, but leading metrics are more easily and immediately tracked. Both types must be tracked and used to improve customer service, and we’ll consider each type in this post.

Yet another trade-off is between cost and quality. For example, a generic self-serve website may be cheap to run (despite a high initial investment) but insufficient and low-quality. A service where a customer is immediately connected to a support agent who is told to fulfill every request may be high quality, but not cost-effective. The trade-offs that make sense will vary for each company, depending on whether support requests tend to be short and transactional, or more full-service.

Finally, when setting metrics in general, it’s important to choose metrics that fulfill certain criteria. A popular mnemonic for metrics and goals is SMART (specific, measurable, assignable, realistic, and time-bound).

Let’s start by considering operational metrics.

Operational metrics

Operational metrics can be specified at the level of the agent and at the level of individual teams. For the most part, operational metrics are leading metrics. For more details, we encourage you to read this GlowTouch article, from which this operational metrics section borrows heavily.

We’ll try to divide operational metrics into those that allow the company to control costs, versus those that affect downstream customer satisfaction, though many are tied to both.

Cost

Operational metrics that can be used to manage costs include:

  • Ticket volume: The number of requests per hour or day will determine the size of the support team necessary. Unfortunately, volume is not constant, and surges and quiet spells will lead to inefficiencies. Thus it’s useful to examine volume by different time periods. It’s also important to look at volume by channel, to determine the specific channels and areas where more resources should be devoted.
  • Availability: Availability is the portion of the time that agents can be reached; unless agents are available 24/7, availability will be less than 100%. In addition to reducing costs, increasing availability during off-work hours is one of the reasons why many companies look to outsourced support teams, which may be the topic of a future post.
  • Occupancy and concurrency: Occupancy measures the fraction of the time that an agent is serving customers. Concurrency considers the number of customers an agent can service at one time.
  • Handle time (by activity and channel): The amount of time an agent spends interacting with a customer, including wait times. Segmenting activities into more fine-grained buckets can provide insights into which steps of the process need to be improved. For example, it may be the case that agents are spending a great deal of time clicking through the knowledge base, in which case a better query and retrieval tool should be developed, or it may be the case that agents spend much of their time composing messages, in which case a tool like Sapling may be helpful.

Using the above metrics, you can calculate the expected throughput of your support agents. Each can be used as a knob to adjust for cost while achieving satisfactory quality. Next, we’ll discuss measures of quality.
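To make these knobs concrete, here is a rough back-of-the-envelope sketch in Python. Every figure in it (shift length, availability, occupancy, concurrency, handle time, daily volume) is a hypothetical placeholder rather than an industry benchmark; substitute your own measurements.

```python
# Back-of-the-envelope staffing estimate from operational metrics.
# All numbers are hypothetical placeholders, not benchmarks.

hours_per_shift = 8
availability = 0.85         # fraction of the shift agents can be reached
occupancy = 0.75            # fraction of available time spent serving customers
concurrency = 2.0           # average simultaneous conversations per agent
avg_handle_time_min = 12    # agent minutes per ticket, including waits

productive_minutes = hours_per_shift * 60 * availability * occupancy
tickets_per_agent_shift = productive_minutes * concurrency / avg_handle_time_min

daily_ticket_volume = 900   # hypothetical expected tickets per day
agents_needed = daily_ticket_volume / tickets_per_agent_shift

print(f"Tickets per agent per shift: {tickets_per_agent_shift:.0f}")
print(f"Agents needed to cover volume: {agents_needed:.1f}")
```

Each input acts as a lever: raising concurrency or trimming handle time reduces the headcount needed for the same volume, which is exactly the cost-versus-quality trade-off discussed above.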

Quality

For most of these metrics, a simple way to report the metric is as an average. However, percentiles are more informative: if most customers receive a response within a few minutes but a handful wait half a day, the average will be skewed high even though service was satisfactory for most customers. A short calculation sketch after the list below illustrates the difference.

  • (First) Reply time (or time to engage): In order to prevent a customer from becoming frustrated waiting for support, it’s important to measure how long they must typically wait. One cost-effective solution to reduce the response time is to set up an automatic workflow (or, in the case of calls, an interactive voice system) to get a customer started. Once these have been implemented, however, you’ll want to consider response times after the first when trying to improve the quality of service. Note that reasonable response times will vary significantly between channels (e.g. between chat, phone, and email).
  • Service level: Similar to response time, service level specifies the percentiles of how long it takes for a customer to receive a response after being placed in a queue.
  • Number of interactions per ticket: While this could be viewed as a cost metric, a high number of interactions may make for a very dissatisfied customer, depending on the quality of service delivered. In general the number of interactions should be low.
  • Resolution time: Although we place resolution time in the quality section, note that at the other extreme, tickets being closed before a customer feels their issue has been given due consideration can seriously hurt customer satisfaction. As previously mentioned, it’s useful to segment this further into each of the steps until resolution.
  • First contact resolution rate (FCRR): First contact resolution rate measures the percentage of the time that an issue or ticket is resolved in the first interaction with the support agent. In many cases, FCR can be the metric most correlated to customer satisfaction. Make sure that the definition of a ticket being resolved handles cases where tickets are reopened or resumed in new tickets when measuring FCRR. Learn more about this important metric here.
  • Abandonment rate: This considers the fraction of the time that customers stop being responsive after contacting customer support. This could be because response time or service is too slow, or because the quality of service is poor.
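As a quick illustration of the percentile point above, here is a minimal Python sketch over made-up reply times; the five-minute service-level target is an arbitrary assumption, not a recommendation.

```python
# Reporting reply times as percentiles and a service level, not just an average.
# The reply times below are made-up data for illustration.
import statistics

reply_times_min = [2, 3, 2, 4, 3, 5, 2, 3, 4, 310]  # one ticket waited ~5 hours

mean = statistics.mean(reply_times_min)
cuts = statistics.quantiles(reply_times_min, n=100)  # cut points for p1..p99
p50, p90 = cuts[49], cuts[89]

# Service level: fraction of tickets answered within a (hypothetical) 5-minute target.
target_min = 5
service_level = sum(t <= target_min for t in reply_times_min) / len(reply_times_min)

print(f"Mean reply time: {mean:.1f} min (skewed high by the outlier)")
print(f"p50 reply time:  {p50:.1f} min")
print(f"p90 reply time:  {p90:.1f} min")
print(f"Answered within {target_min} min: {service_level:.0%}")
```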

Customer satisfaction metrics

We now consider the ultimate success metrics to evaluate the effect of support on customer relationships. Many of these metrics are lagging metrics of quality. As we previously mentioned, lagging metrics are often more difficult to measure, and a key reason is that they are often survey-based. It can be difficult both to get a representative sample and to get customers to respond to such surveys, with the result that it is often only the customers who are extremely unhappy with the product or service who respond.

In the previous sections we have considered each support case to be equally important; however, especially for companies selling to businesses and enterprises, this will not be the case, as accounts should be prioritized based on account value.

  • Customer satisfaction score (CSAT): Unfortunately, there is no universally agreed upon method for computing CSAT. One method is to survey users asking for their satisfaction with the product on a scale of 1 to 10, then taking the average over responses. Other companies such as Zendesk simply ask the customer if an interaction was good or bad (a binary choice). Due to the surveying issues we mentioned, the importance of how the question is phrased, and the non-specificity of the term “satisfaction”, CSAT can be difficult to get right. Maintaining high CSAT, however, is crucial for preventing churn, or the percentage of customers who discontinue their relationship with the business. Here’s a useful guide for surveying from SurveyMonkey: Smart Survey Design. It’s also useful to calibrate your satisfaction score against benchmarks in your industry.
  • Net promoter score (NPS): Net promoter score is computed by first asking the question: “How likely is it that you would recommend our company/product/service to a friend or colleague?” (link). Customers are divided into promoters (9, 10), passives (7, 8), and detractors (0–6), and the score is the percentage of promoters minus the percentage of detractors. Promoters encourage more users to use the product, while detractors discourage other users from trying it. (A short calculation sketch follows this list.)
  • Customer effort score (CES): CES is similar to CSAT in that it is survey-based, but it instead asks the customer how much effort their support transaction required from them. The assumption is that the amount of effort required from the customer in a support request is a more direct or actionable measure of the support experience, as well as the customer’s loyalty to the company.
  • Up-sells and cross-sells: The frequency with which customers buy higher-priced or related products from the company. This is an indirect but useful measure of satisfaction and how engaged a customer is with a product.
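To make the CSAT and NPS definitions above concrete, here is a small Python sketch over fabricated survey responses; the 1-to-10 CSAT convention used here is just one of the variants mentioned above.

```python
# Illustrative CSAT and NPS calculations over fabricated survey responses.

csat_responses = [9, 8, 10, 7, 6, 9, 10, 8]        # "How satisfied are you?" (1-10)
nps_responses = [10, 9, 7, 8, 6, 10, 3, 9, 9, 8]   # "How likely to recommend?" (0-10)

# CSAT here is simply the average rating (one of several conventions).
csat = sum(csat_responses) / len(csat_responses)

# NPS: percentage of promoters (9-10) minus percentage of detractors (0-6).
promoters = sum(r >= 9 for r in nps_responses)
detractors = sum(r <= 6 for r in nps_responses)
nps = 100 * (promoters - detractors) / len(nps_responses)

print(f"CSAT (average of 1-10 ratings): {csat:.1f}")
print(f"NPS (ranges from -100 to +100): {nps:+.0f}")
```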

Support agent satisfaction metrics

A crucial piece of the puzzle, but one that is sometimes neglected when choosing metrics, is the happiness of the agents who are providing the support. Happier, more motivated agents will provide better support. Further, retaining team members helps maintain support quality and avoids time-consuming onboarding of new members.

  • Agent satisfaction (ASAT): Like CSAT, ASAT is measured through a survey which asks support agents to rate their satisfaction on a scale. According to the Zendesk customer support metrics guide, it may also ask which aspects of their job they like and which could be improved, which adds actionable context to the rating. Surveys should be administered to everyone on the team.
  • Number of escalations: A high number of escalations relative to other members of the team may indicate that an agent has not been well trained to respond to support requests. Educating agents so they can be more productive is one step toward improving agent satisfaction.

Conclusion

There are metrics we have not discussed here, such as metrics that provide insights into which aspects of the product may be problematic, as well as metrics that track and respond to viral trends, such as on social media. These may be the topics of future posts.

Any feedback or other metrics that should have been included? What metrics are most important to you and your team? Please leave a comment below. ■



The Case for Human-in-the-Loop AI for Customer Conversations

Interactive Voice Response has existed since the 1970s. Autonomous cars using neural networks were first developed in the 1980s. When will conversational chatbots deliver? This post discusses old and recent AI developments and makes the case for human-in-the-loop AI for customer conversations from the perspectives of technology development, user experience, and business learnings.

Motivation/Context

In recent years, automation has been a trending topic. With advances in artificial intelligence, particularly in a subfield of machine learning termed deep learning, AI systems now appear capable of autonomous driving, holding conversations, and common back-office work.

There have been astounding advances in AI research, with applications in object detection, speech transcription, machine translation, and robotic control. Yet we see a lack of disciplined reasoning in both the development and the purchase of these systems. Many parallels to the AI wave can be drawn from prior technological breakthroughs: in most of them, it took time for the technology base to mature enough for wide adoption. Even though platforms such as the World Wide Web have accelerated the propagation and adoption of technological advances, we at Sapling believe that for tasks such as holding conversations with customers, a human-in-the-loop solution is best for all but the most transactional interactions.

This essay argues for human-in-the-loop conversations, starting from first principles. We draw upon analogies in other fields such as autonomous driving and automated call handling. While we first focus on the maturity of the technology, we later discuss the advantages of human-to-human conversations.

Why Now

“There are decades where nothing happens; and there are weeks where decades happen.”

― Vladimir Ilyich Ulyanov

Two changes prompted us to write this essay.

  1. The first is the rise of AI and machine learning, in particular the rapid adoption of AI technology by industry.
  2. The second, related change is the rise of automation, in particular with chat-based support and inside sales stretching customer-facing teams thin. The COVID-19 pandemic has further accelerated the need for automation at many businesses.

Examples from History

To begin, we provide some historical perspective by describing the adoption of technology for two other developments: interactive voice response and autonomous vehicles.

Interactive Voice Response

Interactive voice response (IVR) systems are familiar to almost everyone. You call a bank, or your cable provider, or a large health provider, and an automated responder collects some basic information from you and tries to route you to the right information (or just punts and gets you off the line). These systems extend as far back as the 1970s. It was only in the 2000s, however, that improvements in speech recognition and in CPU processing power made IVR more widely deployable.

Unfortunately, IVR never lived up to its promise. While IVR can handle simple decision trees and recognize customer responses from a heavily constrained set of possible responses (“say yes” or “say your account number”), much of the functionality may as well be transformed into a simple webpage form—no AI necessary.

Recent systems allow for the transcription of calls for later inspection—however, these tools are for quickly analyzing calls instead of responding to customers in real-time. Ultimately, a true IVR layers the complexity of speech recognition on top of chatbot technology. Until helpful customer-facing chatbot systems are developed, we see no reason to expect IVR’s capabilities to expand (this includes certain products under development by Fortune 500 companies that you may have heard of in recent years).

Autonomous Vehicles

While it seems autonomous driving has only become a hot topic in the last five years, with Google’s Waymo, Tesla, and upstarts such as Aurora, the history of machine learning for autonomous driving extends back well before 2015.

Carnegie Mellon University published results on an autonomous truck in 1989—the year of the fall of the Berlin Wall—and development had started five years prior. Not only did the vehicle use machine learning, but it in fact used neural networks—the machine learning model that later developed into deep learning—in order to predict steering wheel positions from images of the road ahead. All more than 30 years ago.

ALVINN, the neural network-powered autonomous car from 1989.

In 2004, the first DARPA Grand Challenge was held to determine whether a research team could design a vehicle to autonomously drive along a 240-km route in the Mojave Desert. While no vehicle completed the route in 2004, in the second Grand Challenge in 2005 five competitors completed a 212-km off-road course. It seemed autonomous vehicles were near.

Yet 15 years later, with tens of billions of dollars invested and some of the smartest people in the world working on the problem, autonomous vehicles are still restricted to a small number of riders along specific routes, or fall back on teleoperation. Reasons for this include the much greater difficulty of driving on crowded local roads, the high levels of safety required of such systems, and the many conditions such systems must handle.

Automating Conversations

Given the history of previous technology products that took some time to reach mainstream adoption, we now turn to automated conversational systems and human-in-the-loop conversational aides. We loosely refer to automated conversation systems as chatbots and to human-in-the-loop systems as conversational assistants. Building on our previous discussion of IVR, we treat speech-based assistants, such as Samantha from Her, as a wrapper around a core language-understanding system that relies purely on text. This is an imperfect assumption, since voice communicates emotions and other cues that pure text cannot, but we’ll adopt it to simplify the discussion.

FAANG and Existing Systems

To start, consider the companies best positioned to deploy a chatbot system: the FAANG-scale tech giants. Google, Facebook, Microsoft, Apple, and Amazon have orders of magnitude more data and orders of magnitude more money (and therefore compute) to train and deploy machine learning systems on that data, and, with their established research labs, more AI talent as well.

Besides having the resources, these companies are also incentivized to deploy these systems, as many of their core products are for communication: for Google, Gmail and Meet; for Facebook, Messenger and WhatsApp; for Microsoft, Outlook and LinkedIn; and for Amazon, of course, there’s Alexa, to sample just a few examples.

Google’s Smart Reply is limited to short, single sentence responses. Screenshot from the Google Blog.
While it has many capabilities, Alexa tends to be used for simple, single turn commands.

Considering these companies, however, what are the communication assistants that have been deployed? How many turns of conversation can they handle? What level of depth do the suggested responses handle? And in how many cases are the replies in fact fully automated?

(For more on why this is the case, you can find an explanation here: https://www.youtube.com/watch?v=Ihmm_tQGBeE&t=3m15s)

UX Perspective

From a user perspective, is chat actually the best option? It’s easy to say that natural language is an intuitive interface, but in many cases natural language is not efficient at all. Consider the following examples:

  • Purchasing a product: Imagine going to a retail website and not being able to search and scroll through products, but instead being forced to go back and forth with a chatbot to narrow down to the product you want.
  • Updating an account: It’s much easier to use browser autofill.
  • Getting clarifying information: It’d be faster if that information were instead in a FAQ.

The problem with chat (and of voice as well) is that it constrains the interaction to a single thread, while the web is designed to allow for parallel blocks of information to be displayed to the user. For highly transactional, self-service tasks, simple webviews are often the best solution. This then leaves tasks that require more back-and-forth and complex interaction—namely, tasks that require human intervention.

                    Simple Task     Complex Task
Faster Resolution   App Webview     Human Chat
Slower Resolution   IVR/Chatbots    Human Ticket/Call
Segmentation by resolution time and complexity of the task to be completed for the customer.

Chitchat vs. Tasks

A minor point: there exist chatbots in broad use that provide chitchat functionality. These chatbots serve to entertain and in some cases inform users without requiring precise understanding of the conversation or the ability to manage information from many turns ago. The discussion above is not about chitchat bots, but instead bots that act as assistants intended to help complete tasks—which entails precise understanding and management of longer-term information.

Business Implications

Assuming a company is looking to automate certain conversational sequences or convert them to self-service, it’s fair to assume that they feel they have a good grasp of all the possible variations of those conversations. Even in these cases, there can be significant drawbacks to full automation beyond the speed vs. quality tradeoff.

Trade-offs when considering chatbots vs. humans for customer interactions.

Conversational Insights

Customer conversations are a key source of feedback in order to improve product and other aspects of a business. Guiding customers down fixed paths with decision trees and click-to-accept options prunes away new and diverse feedback from customers—the more open-ended the interaction, the more opportunity there is to learn from the interaction, and this is especially true for the long tail. Automated pathways push users along the beaten paths of past interactions, while conversations should also provide insights from unvisited paths.

Example of new paths of learning that scripted conversation flows can remove.

Adaptability

With the increased expressive power of deep learning systems and their ability to improve with more data comes a price: deep neural networks are not as interpretable or controllable as traditional methods.

Thus, while modern deep learning systems often yield the best results on existing benchmarks, they are not well suited to rapid adaptation to new requirements. Humans, on the other hand, shine at rapidly changing their behavior and navigating fuzzy requirements.

Chatbots             Humans
Consistent           Personal
Immediate            Delayed
Simple scenarios     Simple and complex scenarios
Adapt infrequently   Adapt quickly
Advantages and disadvantages of chatbots vs. humans

Assisting Agents

Finally, assuming that a human-in-the-loop is desired, how can AI technology best empower agents while helping the business?

The Business Perspective

We consider two axes along which human-in-the-loop AI can help businesses. The first is by improving the efficiency of teams, and the second by improving the quality.

Ways by which AI tools can yield efficiency gains include retrieving information (for the agent as well as the customer), segmenting and routing customer groups to the right resources, and suggesting responses or partial responses to the agent. In contrast to fully automated approaches, these tasks can easily include a human approval step or can be overridden by the customer.

Task                   Example(s)
Retrieve information   Fetch knowledge base article that may address customer question.
Segmenting/routing     Identify common issues by segmenting tickets into buckets. Route a particular customer request to the right customer service department.
Suggesting responses   Chat assist where agents can simply click on the desired response.
Example ways in which AI can augment and assist customer-facing teams.
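As a rough sketch of the “retrieve information” row above, the snippet below ranks knowledge base articles against an incoming message using TF-IDF similarity from scikit-learn. The articles and the customer message are invented examples; a production system would work over the company’s own knowledge base and would likely use a stronger retrieval model.

```python
# Minimal knowledge base retrieval sketch using TF-IDF cosine similarity.
# Articles and the customer message are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

kb_articles = {
    "Resetting your password": "Go to account settings and choose 'reset password'...",
    "Updating billing details": "Billing information can be changed under the plan tab...",
    "Exporting your data": "Use the export button on the dashboard to download a CSV...",
}

customer_message = "Hi, I can't log in and I think I need to reset my password."

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(list(kb_articles.values()) + [customer_message])

# The last row is the customer message; score it against every article.
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
best_title, best_score = max(zip(kb_articles, scores), key=lambda pair: pair[1])
print(f"Suggested article: {best_title} (similarity {best_score:.2f})")
```

The same retrieved article can be shown to the agent as a suggestion or surfaced directly to the customer, with a human approval step in between.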

Besides enabling gains in efficiency, AI assistants can improve quality metrics such as CSAT (satisfaction) and CES (customer effort) as well. Searching through the knowledge base ensures that customers receive the information they need. Routing requests to the right department similarly helps improve resolution rates. Agents are faced with typing the same repetitive replies to inbound messages; suggested responses can help streamline their chat workflow. AI can further help sanity check messages for language quality and compliance.

Three ways in which AI can augment agent customer workflows.

The Agent Perspective

From our own time developing tools for agents, we’ve found a few key requirements for tools that deliver returns on investment:

  • The tool must help consistently. If it helps just 5% of the time, it gets used 0% of the time.
  • The tool must sit within existing workflows. It should integrate with whatever helpdesk or sales engagement platform the agents are already using; it’s too much to ask an agent to context switch to a separate tool in order to receive assistance.
  • The tool should rely on the agent or customer to make decisions that carry any significant uncertainty, augmenting the capabilities of humans instead of automating them away. This is the argument we’ve been making throughout. ■

About Sapling

At Sapling, we’re building the intelligence layer for chats, tickets, and emails. Our team has over a dozen years of experience in machine learning and deep learning at the Berkeley AI Research Lab, the Stanford AI Lab, and Google’s Brain Team. The Sapling product suite is used by teams supporting startups as well as several Fortune 500 companies.

The Sapling Blog describes our learnings from developing solutions for customer-facing teams using the latest AI technology.

If you want us to email you when we publish new essays, sign up for our newsletter below (we’ll ping you biweekly or monthly, no more than that).