One of the main drivers of the current excitement around artificial intelligence is its uncanny ability to act almost human. Unlike the stiff chatbots of the past, recent models can respond to a user’s emotions with nuance and apparent understanding. The effect is convincing enough that some people attribute complex human abilities such as empathy to these computerized companions.
Companies have already capitalized on this advance to deploy AI chatbots in sensitive lines of work—such as medical advice, therapy, and life or career coaching—traditionally performed by trained professionals. But is AI just producing statistically determined helpful responses, or can it actually recognize when a reply expresses empathy?
“There’s lots of evidence that computers can say or write a response such that someone feels validated, affirmed, and heard,” says Matthew Groh, assistant professor of management and organizations at the Kellogg School. “What’s less clear is whether it can recognize empathic communication when it sees it.”
In new research, Groh and a team of researchers evaluated how AI measures up to humans in recognizing the kind of empathic communication that’s critical for this type of high-stakes work. Specifically, they compared three large language models (LLMs)—Gemini 2.5 Pro, ChatGPT 4o, and Claude 3.7 Sonnet—with both experienced and inexperienced people on their ability to judge the nuances of empathy in text-based conversations.
Using several frameworks for measuring empathic communication, the researchers found that LLMs were nearly as good at recognizing empathy as experts—and far more reliable than nonexperts.
The team, which includes first author Aakriti Kumar, Nalin Poungpeth, and Bruce Lambert of Northwestern, Diyi Yang of Stanford, and Erina Farrell of Penn State, also found that evaluating AI models in this way could potentially teach humans something new about empathy—both how we measure it and how we apply it.
“Studying how experts and AI evaluate empathic communication forces us to be precise about what effective empathic responses look like in practice,” says Kumar, a postdoctoral researcher at Kellogg and the Northwestern Institute on Complex Systems (NICO). “If we can break empathy down into reliable components, we can give humans and AI clearer feedback on how to make others feel heard and understood.”
Do you know empathy when you see it?
To assess empathic communication, the researchers gathered 200 text conversations between a speaker sharing a personal problem and a second person providing support. Then they asked three LLMs, three experts, and hundreds of crowd workers to annotate those conversations based on four distinct frameworks used in psychology and natural-language-processing research: Empathic Dialogues, Perceived Empathy, EPITOME, and a new framework they developed called the Lend-an-Ear Pilot.
Each framework asks observers to judge a conversation based on characteristics such as “encouraging elaboration” and “demonstrating understanding,” or questions such as “Does the response make an attempt to explore the seeker’s experiences and feelings?”
In total, the researchers collected 3,150 LLM annotations, 3,150 expert annotations, and 2,844 crowd-worker annotations.
“We looked at four different [frameworks], or how four independent groups chose to assess empathic communication, to evaluate empathic communication across a diversity of perspectives,” Groh says.
Because there was no objective “right” answer to how much empathy a communication contained, the researchers were interested in inter-rater reliability—how consistently different observers scored the same conversation. For highly trained communication experts, you would expect the variation to be low, which is what the team observed. The annotations of amateur judges, conversely, should be all over the map, another prediction the team confirmed.
When the researchers compared the three AI models’ judgments with those of both groups, the models’ scores tracked the experts’ assessments much more closely than the crowd workers’. In other words, the LLMs were able to recognize the nuances of empathic communication almost as reliably as experts—and much more consistently than nonexperts.
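The article doesn’t specify which agreement statistic the team used. As an illustration only, here is a minimal Python sketch of one common way to quantify inter-rater reliability, mean pairwise Cohen’s kappa, applied to made-up empathy ratings from a hypothetical expert group, a hypothetical crowd-worker group, and a hypothetical LLM judge; the data and the choice of kappa are assumptions for demonstration, not the paper’s actual method.

```python
import numpy as np
from itertools import combinations

def cohens_kappa(r1, r2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    labels = np.union1d(r1, r2)
    observed = np.mean(r1 == r2)
    # Expected agreement if each rater assigned labels independently,
    # according to their own marginal label frequencies.
    p1 = np.array([np.mean(r1 == lab) for lab in labels])
    p2 = np.array([np.mean(r2 == lab) for lab in labels])
    expected = np.sum(p1 * p2)
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(ratings):
    """Average kappa over all pairs of raters (ratings: raters x items)."""
    pairs = combinations(range(len(ratings)), 2)
    return float(np.mean([cohens_kappa(ratings[i], ratings[j]) for i, j in pairs]))

# Hypothetical 1-to-5 empathy scores for six conversations.
experts = np.array([[4, 2, 5, 3, 1, 4],
                    [4, 2, 5, 3, 2, 4],
                    [4, 3, 5, 3, 1, 4]])
crowd = np.array([[5, 1, 3, 4, 2, 2],
                  [2, 4, 5, 1, 3, 5],
                  [3, 3, 2, 5, 1, 4]])
llm = np.array([4, 2, 5, 3, 1, 4])  # one model's scores for the same six items

print("expert agreement:", round(mean_pairwise_kappa(experts), 2))
print("crowd agreement: ", round(mean_pairwise_kappa(crowd), 2))
print("LLM vs. each expert:", [round(cohens_kappa(llm, e), 2) for e in experts])
```

In this toy setup, the experts’ scores cluster tightly (high kappa), the crowd workers’ scores scatter (low or negative kappa), and the LLM’s scores can be checked against each expert individually, mirroring the kind of comparison the study describes.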
“The fact that LLMs can evaluate empathic communication at a level approaching experts suggests promising opportunities to scale training for applications like therapy or customer service, where empathic skills are essential,” Kumar says.
Quality in, quality out
But the studies also found that the frameworks themselves mattered. Inter-rater reliability, even for experts alone, varied widely across the four frameworks and for different questions or measures within those frameworks.
The more comprehensive and reliable the framework was, the more reliable the annotations were, both for the LLMs and for the experts, according to Groh.
“The quality of the framework really matters,” Groh says. “When experts agree on what empathic communication looks like, LLMs can too. But when experts are inconsistent, the models struggle as well. LLMs as a judge are only as reliable as the framework is.”
The findings suggest that what empathic communication entails is not yet a fully settled question. Through rigorous evaluation and optimization using both human and AI judges, scientists can create stronger frameworks for identifying empathy in conversations—and help people get better at expressing it.
“By more precisely characterizing empathic communication, we can transform what used to be a ‘soft skill’ into a hard skill,” Groh says.
Applications for the real world
Researchers and businesses alike have paid too little attention to building the right frameworks for soft skills like empathy, according to Groh, partly because people did not realize those skills could be evaluated rigorously at scale. Advances in AI may help shift that thinking.
“LLMs have the potential to teach us about the nuances of empathic communication and help us, as humans, communicate to make others feel heard and validated,” Groh says.
For instance, therapists could lean on LLMs in training to improve their capacity to show empathy and, ultimately, better support their clients. Or customer-service teams could role-play with LLMs as part of their training, using improved empathic communication frameworks to evaluate their responses.
Improving these skills will be at least as critical for leaders as for any other group, because “leaders are in the business of decision-making, and empathy is core to decision-making,” Groh says.
“As every leader knows, there are often times when you have to make a decision where not everyone agrees with you,” Groh says. “If you can show people you’re listening—if you respond with empathic communication—you’re more likely to bring others along, even if they disagree with your decision.”
Yet while the research shows that LLMs are already nearly expert level in judging empathy, that doesn’t mean they feel it. And that means that your therapist doesn’t need to worry about AI replacing them—at least not yet.
“Just because AI can give you advice—and sometimes say it better than some people—doesn’t mean the human role disappears,” Groh says. “The human touch is still special.”