The suggestions and support offered by AI are helpful only if they’re relevant. On today’s episode, Walter Sun, senior vice president and global head of artificial intelligence at SAP, joins hosts Sam Ransbotham and Shervin Khodabandeh to share how his organization is helping employees get smarter about artificial intelligence through the company’s AI Days. Additionally, Walter gives specific examples of support that AI agents could provide to an end user, and makes the case that small language models (fine-tuned large language models) can be built to assist in decision-making.
Subscribe to Me, Myself, and AI on Apple Podcasts or Spotify.
Transcript
Shervin Khodabandeh: How is one enterprise resource planning company thinking about language models — both large and small — to benefit its employees and its customers? Find out on today’s episode.
Walter Sun: I’m Walter Sun from SAP, and you’re listening to Me, Myself, and AI.
Sam Ransbotham: Welcome to Me, Myself, and AI, a podcast on artificial intelligence in business. Each episode, we introduce you to someone innovating with AI. I’m Sam Ransbotham, professor of analytics at Boston College. I’m also the AI and business strategy guest editor at MIT Sloan Management Review.
Shervin Khodabandeh: And I’m Shervin Khodabandeh, senior partner with BCG and one of the leaders of our AI business. Together, MIT SMR and BCG have been researching and publishing on AI since 2017, interviewing hundreds of practitioners and surveying thousands of companies on what it takes to build and to deploy and scale AI capabilities, and really transform the way organizations operate.
Sam Ransbotham: Hi, everyone. Thanks for joining us. Today, Shervin and I are talking with Walter Sun. He’s a senior vice president and head of AI at SAP. Walter, great to have you.
Walter Sun: [It’s] great to be here. Thank you for having me.
Sam Ransbotham: Before we start, I want the listeners to know that we had a bit of a small-world moment when I met Walter. We’ll get a little bit into his background later, but we discovered that we’re both Georgia Tech grads in the same scholarship program. So I’m wearing a Georgia Tech shirt for our recording session to make Walter feel at home. I hope it helps.
Walter Sun: Excellent. Nice to see you representing the university.
Sam Ransbotham: All right, before we get into your background, let’s start now. SAP is a European multinational software company based in Germany, largely known for its enterprise resource planning, or ERP. I’m sure we’re going to use that acronym today. So, Walter, first tell us a bit about your current role at SAP.
Walter Sun: Thank you for asking. My role entails leading the engineering team that builds reusable technologies across the different lines of business at SAP. As you mentioned, we have ERP, but we also have many other areas of business applications, including human capital management with SuccessFactors. We have a CRM [customer relationship management] solution as well as [a] procurement [solution] with Ariba. We also have travel [applications, and] a lot of people are familiar with Concur.
My team builds technologies, AI technologies in particular, which are reusable by the lines of business. We build them for our developers as well, through our business technology platform, where we make more than 30 different large language models available.
We have this Joule consulting capability, which we released. It helps consulting partners work with our joint customers on things like implementations and cloud migrations. The idea is we effectively uploaded, with retrieval-augmented generation, our help documents [and] our consulting docs. As a consultant, you can ask it natural language questions and get responses. That feature is timely because we’re having our annual Sapphire conference, May 19-21. We’ll talk a bit more about that, as well as other capabilities [like] the agents, but that capability makes it easier for people to onboard to our SAP products.
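To make the retrieval-augmented generation idea concrete, here is a minimal sketch in Python of the pattern Walter describes: retrieve the most relevant help-doc passages, then ground the model’s prompt in them. The documents, the toy bag-of-words embedding, and the `call_llm` helper are all hypothetical stand-ins, not SAP’s implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: hash words into a fixed-size bag-of-words vector.
    # A real system would use a trained embedding model instead.
    vec = np.zeros(256)
    for word in text.lower().split():
        vec[hash(word) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Hypothetical help documents a consultant might search.
HELP_DOCS = [
    "To migrate an on-premise ERP system to the cloud, run the readiness check first.",
    "Expense policies are configured in the travel application's settings area.",
    "Job requisitions require an approved position before posting.",
]
DOC_VECTORS = [embed(d) for d in HELP_DOCS]

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a call to a hosted language model.
    return f"[answer grounded in a {len(prompt)}-character prompt]"

def retrieve(question: str, k: int = 2) -> list[str]:
    # Rank documents by cosine similarity to the question.
    q = embed(question)
    scores = [float(q @ d) for d in DOC_VECTORS]
    top = np.argsort(scores)[::-1][:k]
    return [HELP_DOCS[i] for i in top]

def answer(question: str) -> str:
    # Prepend retrieved documentation so the model answers from it,
    # not from whatever it happens to remember.
    context = "\n".join(retrieve(question))
    return call_llm(f"Answer using only this documentation:\n{context}\n\nQuestion: {question}")

print(answer("How do I start a cloud migration?"))
```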
Shervin Khodabandeh: It sounds like, Walter, you’re focused on two use cases for AI. There’s the internal use, and then there’s also the external use because you have so much presence within so many different organizations. Let’s pick one to talk about first. What are you most excited about?
Walter Sun: Thank you for asking. I think maybe we’ll start with the fact that we’re building access to large language models, but we’re not building our own LLM, because that area has been well researched and well developed. What we want to do at SAP is make the best of the technology and make our customers as productive as possible. What we have is basically a generative AI hub, which is in [the] business technology platform I mentioned.
The hub has access to more than 30 large language models, and our internal developers can access it, as well as external developers. We provide business benchmarks for different use cases and then recommend the best language model for each one.
Imagine if you tell me, “I have a Q&A scenario. It’s not super difficult, but it happens 100,000 times a day.” Maybe I say, “Hey, of the 35 language models we have, these 27 will likely give you the right answer.” So we can recommend one of those 27, or list all 27. We probably would sort them by price, to make them as inexpensive for you as possible.
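A sketch of that recommendation logic, with made-up model names, benchmark scores, and prices: keep every model that clears the accuracy bar for the customer’s scenario, then sort by price so the cheapest adequate model comes first.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    qa_accuracy: float   # benchmark score on this scenario type (illustrative)
    price_per_1k: float  # cost per 1,000 calls, arbitrary units (illustrative)

CATALOG = [
    Model("model-a", qa_accuracy=0.97, price_per_1k=4.00),
    Model("model-b", qa_accuracy=0.94, price_per_1k=0.80),
    Model("model-c", qa_accuracy=0.88, price_per_1k=0.15),
    Model("model-d", qa_accuracy=0.93, price_per_1k=0.40),
]

def recommend(min_accuracy: float) -> list[Model]:
    # Filter to models likely to give the right answer, cheapest first.
    adequate = [m for m in CATALOG if m.qa_accuracy >= min_accuracy]
    return sorted(adequate, key=lambda m: m.price_per_1k)

for m in recommend(min_accuracy=0.90):
    print(f"{m.name}: accuracy={m.qa_accuracy:.2f}, price/1k={m.price_per_1k:.2f}")
# model-d and model-b beat model-a on price while clearing the 0.90 bar.
```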
Sam Ransbotham: Internally, how has SAP thought about upskilling its workforce to use these technologies?
Walter Sun: We basically want to make sure our employees are equipped to handle these models. In 2023, we had what we called AI Days, [when] we delivered a bunch of sessions across different global units so people could have time to attend at reasonable hours. And we explained what large language models were. We had a sandbox internally, where people could access large language models in an enterprise way. [This] basically means, as you may be aware, if you use a public consumer LLM, it’s free because the trade-off is you’re allowing them to improve their models on your questions, which is fair for them, but it’s not good for a business because you may be asking about company secrets.
So we provided this enterprise-grade LLM. We had many people try it, again and again. [There were] millions of instances of access to [the] LLM. So people could try it out and understand what it’s all about.
I think it’s easy to read the news and say, “Oh, people are liking LLMs; [they can do] a lot.” But [the process also involves] getting your hands dirty, trying a large language model, changing the prompts, realizing the more detailed your prompt is, the better the response. We did training to help people understand that. And it went a long way because, as I mentioned from the beginning, my team’s job is to provide reusable AI to different product teams.
The more that they use LLMs, the more they know what LLMs can and can’t do. So they’ve asked us to become more targeted. … An example is job description creation, with a feature we shipped over a year ago, where you [use] large language models with natural language: “I want to hire a new tenure-track professor [from a] research university, [with a] Ph.D. and maybe a couple of years of work experience.” That description can be enough to create a three- or four-paragraph job description using a large language model.
The people in the SuccessFactors team — that’s the team that we built this with — know that large language models can be verbose and descriptive; they just need some prompting. So basically we built this feature saying, “We provide a few prompts, and from a few prompts, you can generate something that’s very descriptive.” It can also [incorporate] company or legal guidelines. A lot of companies have policies, you know, disclaimers and terms that need to be in the job description.
So using the short description, which is called the prompt, along with a meta prompt containing the company’s legal or compliance policies, you can create a very nice response using a language model. The AI Days allowed us to educate the lines of business, which has made our partnership a lot more productive in the time since.
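A minimal sketch of the prompt-plus-meta-prompt composition Walter describes. The policy text and the `call_llm` helper are hypothetical; a real deployment would pull the compliance language from the company’s own guidelines.

```python
# Company-wide "meta prompt": legal and compliance language that must
# shape every generated job description (illustrative text).
COMPANY_META_PROMPT = (
    "You write job descriptions in 3-4 paragraphs. Always include the "
    "disclaimer 'We are an equal opportunity employer.' Never promise "
    "specific compensation."
)

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a call to a hosted language model.
    return f"[generated description for request: {prompt[-60:]}]"

def job_description(short_prompt: str) -> str:
    # The user's short description rides along with the meta prompt.
    return call_llm(f"{COMPANY_META_PROMPT}\n\nRole request: {short_prompt}")

print(job_description(
    "Tenure-track professor at a research university; Ph.D. required; "
    "a couple of years of work experience preferred."
))
```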
Sam Ransbotham: You focused on large language models here. I don’t think we’ve talked a lot about small language models. Maybe for listeners, what’s the difference [between] small and large, so we can have some background, and then how does that change?
Walter Sun: That’s a great question. I think it’s a good segue to talk about agents. The size of a language model is the number of parameters it has. I tell people, when you’re in school, you learn how to do a best-fit line. A best-fit line, if you remember, is y equals ax plus b. Your two parameters are a and b. So you take 10 or 15 data points, and you fit a line through them. That’s a two-parameter model. As language models started getting bigger and bigger, they grew to hundreds of millions of parameters, like 400 million, and then got into the billions.
Now we’re looking at like 100 billion parameters, so that’s considered large. … That’s a lot of free parameters, so obviously, you need trillions and trillions of data points to fit that, to build that model. It’s expensive computation-wise. So the idea is we’re looking at building smaller models now, called small language models, which still can be in the hundreds of millions of parameters or so.
People are saying today it’s small, but you know, 10 years ago that would still be considered massively big. But these smaller models are tuned and focused on specific tasks. The value is that they get the job done with better context, and they’re cheaper. The whole idea of [AI] agents, which is a big topic today, is these models are tuned to do a task very well.
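Walter’s two-parameter example, worked out in code: fitting y = ax + b by least squares. A language model is the same idea scaled up, from two fitted parameters to billions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 15)                  # 10-15 data points, as in the example
y = 2.5 * x + 1.0 + rng.normal(0, 1, 15)    # noisy line with true a=2.5, b=1.0

a, b = np.polyfit(x, y, deg=1)              # solve for the two parameters
print(f"fitted a={a:.2f}, b={b:.2f}")       # lands near the true 2.5 and 1.0
```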
Let’s say you ask a large language model a question like, “Book me a flight from — I think, Sam, you’re in the Boston area [so] let’s say for argument’s sake — Lexington [Massachusetts] to Palo Alto [California].” … A large language model, two or three years ago, [or] a year ago, would just say, “OK, I’m booking you a flight leaving Lexington at 11 a.m. Eastern time, arriving in Palo Alto at 2:15 Pacific time.”
And the language model would’ve hallucinated that because [it] basically took the internet and said, “OK, there’s an airport in Lexington. It’s not a commercial airport, I think, but it’s a small airport, and there happens to be a private airport in Palo Alto.” So the travel time between those two airports is maybe six and a half hours, so that would work out. It’s theoretically possible. And if we had infinite wealth, we would just book [the] plane and say, “Let’s do that.”
Most people say that’s [a] hallucination, but the language model doesn’t know better. The language model says, “That’s possible. There are two airports. There’s a plane you can book without a high cost, and do it.” That’s not useful for most human beings. Most of us, like you and I, probably take mostly commercial flights. We actually know there’s a limited number of commercial airports in the world, something like 300 or so.
You can use a fine-tuned model that says, “Hey, don’t just know everything in the world. There are actually 300 airports in the world that are commercial, and people typically fly from these. So when you ask for flights from Lexington, Massachusetts, to Palo Alto, California, we’re looking at the nearest airports.” And so it would say, “OK, Boston Logan [Airport] on the source side.” And then, on the destination side, it’s interesting because there’s San Francisco [International] Airport, which is close, but San Jose is equally close to Palo Alto, and there’s also another airport, Oakland, which is somewhat close as well.
So in this case, these agents would … offer flights, but it could become a multi-turn or iterative conversation where it says, “Hey, I can give you a lot of choices, but I still need your input, because there are actually three airports you could land at. So maybe you can tell me: Are you happy with flying into SFO, which is possibly the closest? San Jose is fairly close. And Oakland is close as well.” One of the fun things about these models is that they’re smaller models. They’re focused and tuned, and they know a lot, but they can still communicate with the human and ask … for iterations.
It’s the same [as] if you had a travel assistant. They might come back to you and say, “Hey, Sam, I found you many flight options, but there are different airports you can fly to. Which of these would you like?” And that’s when the iteration happens.
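Here is a toy sketch of the grounded, multi-turn behavior Walter describes: constrain the agent to a table of commercial airports and have it ask for clarification instead of guessing when the destination is ambiguous. The airport table is illustrative, not a real routing database.

```python
# Nearest commercial airports per city (illustrative subset).
NEARBY_AIRPORTS = {
    "lexington, ma": ["BOS"],                # Boston Logan is the clear choice
    "palo alto, ca": ["SFO", "SJC", "OAK"],  # three plausible arrival options
}

def propose_flight(origin_city: str, dest_city: str) -> str:
    origins = NEARBY_AIRPORTS.get(origin_city.lower())
    dests = NEARBY_AIRPORTS.get(dest_city.lower())
    if not origins or not dests:
        return "I don't know of a commercial airport near that city."
    if len(dests) > 1:
        # Multi-turn behavior: don't hallucinate a choice; ask the user.
        return (f"I can book from {origins[0]}. "
                f"Which arrival airport would you like: {', '.join(dests)}?")
    return f"Booking {origins[0]} -> {dests[0]}."

print(propose_flight("Lexington, MA", "Palo Alto, CA"))
# -> "I can book from BOS. Which arrival airport would you like: SFO, SJC, OAK?"
```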
Sam Ransbotham: Actually, I think that’s a great example. I have to say I was worried when you started, because you started with Lexington and Palo Alto, and I didn’t realize how deeply you’d thought through that example to pull out that detail. I think it really brings out the idea that as we move to agentic AI, some of these probabilistic answers become difficult.
For example, I think a classic thing is for a student to come in and ask, “What is the theme of this paper?” or “What’s a business plan?” Those are very general sorts of questions.
There’s not a good reference point for checking those, but in your example, you could make up a flight that doesn’t exist. And that’s going to be a bit of a problem when we start thinking about agents.
Walter Sun: Yes. I think it’s a great point. The reason why I use that example is that people criticize language models, for good reason, for hallucinating a lot, but … think about a language model as a general receptionist or reference librarian at a library. You walk up to a librarian who happens to know a lot about all the books in the library. That’d be great. Ask them a question, and they can answer generally. “Hey, tell me about dinosaurs.” “Oh, dinosaurs are in this section.” “Tell me about apple.” “OK, well, there’s apple, the fruit, and there’s a company called Apple. I can disambiguate and send you to the right part of the library.”
But then if you ask that person about a flight from Lexington to Palo Alto and say, “On the spot, you need to give me an answer,” they might not give you the right answer.
And that’s where hallucinations come in, because there’s a lack of context. The background model knows a lot, believe it or not, but it just doesn’t know exactly what you want. So then it kind of gives you an answer. It doesn’t know your financial situation. It might think you can buy your own jet or whatever. So it gives you this answer that you can fly from Lexington to Palo Alto.
Believe it or not, a lot of the facts are correct, right? The duration, the time zones, the flyability, the fact that [these] airports exist, although many people may not know that. But the reality is once you add some context — “I’m a business traveler [who] travels commercial. I have to fly commercial flights. I want to just book a flight” — you suddenly say, “OK, I can actually take a small language model and focus it on flights, and answer that question.”
You don’t need to have a general language model serve each of these AI agents; rather, you have to tune each of these agents to be specialized and then let them communicate with one another. … The human analogy works well because each of these agents could be assisting a human persona. … There are travel agents [and] airline specialists. If you can imagine, you can ask a librarian a general question about booking a flight. He or she will not know exactly what you’re looking for. But if you ask a travel agent that question, they might not know where to find the books about dinosaurs or apples or whatever, but they’ll know about flights.
So what you do is you take these large models, you fine-tune them to make them small models, and create them for different agents. You can have an expert on history, a historian agent [that] can answer questions about history. You can have an agent that’s an expert on the apple, the fruit. … I’m in the Washington state area, so there are probably some [human] experts here. You can actually have agents learn off of them and build specialized agents [to] talk about apple, the fruit. And then you have travel agents, and you have hotel agents, and then you actually have them communicate with one another.
In this case, [with] your travel agent, you book your flight to Palo Alto, and then the next step is to actually find out more about where to stay in a hotel. So you can have the airline agent then communicate with the hotel agent, saying, “Sam is flying in on Tuesday, May 13, and then flying back Monday, May 19. He needs a hotel for six nights.” And that’s where you say, “OK, the hotel agent now is getting information from the airline agent, and communicating and making a full itinerary” … before the response comes back. … That’s kind of what we’re trying to do with our system, where Joule is our natural language copilot that communicates with customers or users.
Underneath the hood, the different AI agents, which are built on small language models, can communicate with one another, coordinate a plan, and return a result, hopefully with minimal hallucinations, by virtue of their being specialists instead of generalists. [Through] retrieval-augmented generation and context setting, [they could say,] “With your permission, if you upload your profile to the system, we can know where you live” [or] “We can know where you’re visiting in Palo Alto. Maybe you’re visiting something in north Palo Alto, so SFO is closer, or maybe you’re staying somewhere closer to Mountain View or San Jose, so San Jose’s airport is closer.” And you can sort of see that whole idea: The more context the language model gets, the better the response.
The same way that … we just met today, and the more we communicated, we learned we both went to Georgia Tech, so we can talk about scholarship programs or the football team or whatever. We have this context, and we can communicate better. Whereas, if I met you on a street somewhere and we started talking without knowing that context, the conversation wouldn’t be as rich.
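A minimal sketch of that agent-to-agent handoff: the flight agent passes a structured itinerary, and the hotel agent computes the nights without re-asking the traveler. The dates and fields are illustrative, echoing the May 13-19 example above.

```python
from datetime import date

def flight_agent(traveler: str) -> dict:
    # In a real system, this itinerary would come from the booking step.
    return {
        "traveler": traveler,
        "arrive": date(2025, 5, 13),   # Tuesday, May 13
        "depart": date(2025, 5, 19),   # Monday, May 19
        "city": "Palo Alto",
    }

def hotel_agent(itinerary: dict) -> str:
    # The hotel agent derives what it needs from the structured handoff.
    nights = (itinerary["depart"] - itinerary["arrive"]).days
    return f"Booking {nights} nights in {itinerary['city']} for {itinerary['traveler']}."

print(hotel_agent(flight_agent("Sam")))   # -> Booking 6 nights in Palo Alto for Sam.
```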
Shervin Khodabandeh: Your example of the librarian and the individual travel agents makes sense to me because I can imagine how much an individual human would know. When I think about this multiagent scenario, we’re not constrained by the limits of one individual’s knowledge, and a specialized person can go so much deeper than a general person. But isn’t some of the promise of the largest models that they incorporate both? Both the depth and the generality?
Walter Sun: I think it’s difficult. I mean, there are some mixture-of-experts models, which is to say, “We know everything, but when you start asking a question, we can find the right expert for the task.” So there are methodologies.
I think one argument for smaller agents — I know people can argue the opposite — is that you can build a smaller agent at a smaller computational cost. Then it can still perform at, let’s say, 95% of the full language model in its area of expertise, maybe even better, as my example was indicating.
But I think that you can actually specialize. The same way that humans, when we go through college, pick a major and specialize, and then [work at] a company where all of us play different roles, I think having AI agents play different personas’ roles allows them to be better. Believe it or not, this whole idea of reasoning, this test-time compute, is basically just a fancy technical term for [asking] a language model a question and, instead of saying, “Give me the answer right away,” saying, “Take a minute to think about it.”
By thinking, the models can actually reason: Do I need to get help from another AI agent? Do I need to actually iterate? Do I need to get clarification between different airports? Do I need to contact the hotel agent in that one case, or contact something else to see if there are other constraints? You might check [for] constraint optimizations. All of that helps to provide a better answer. A large language model in itself, a vanilla model … built out of the box, has a lot of information, and it becomes kind of like a needle-in-a-haystack problem where … there’s so much information. It can give you information, but it might overwhelm you with the response.
Whereas, if you simply have a question about hotels or about flights or about apple, the fruit, you don’t need to know everything else, right? The large language model, when you ask it, “When do you harvest apples?” might know that, but it knows a billion other things as well. It might get confused with Apple, the technology company, and start answering questions about when they release their products, right? “Harvesting” could be [read as] when they release their products in September or whatever.
That’s the whole idea that maybe knowing too much could be difficult in terms of giving you a clear and straight answer, if that makes sense.
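One way to picture “take a minute to think about it” is a small deliberation loop: instead of answering in one shot, the agent spends steps deciding whether to ask for clarification or call another agent. The hand-coded `decide` policy below is a stand-in for decisions a reasoning model would make itself.

```python
def decide(state: dict) -> str:
    # Stand-in policy: a reasoning model would produce these choices.
    if state["airport"] is None:
        return "clarify_airport"
    if state["hotel_nights"] is None:
        return "ask_hotel_agent"
    return "answer"

def run_agent(request: str, max_steps: int = 5) -> str:
    state = {"request": request, "airport": None, "hotel_nights": None}
    for _ in range(max_steps):
        action = decide(state)
        if action == "clarify_airport":
            state["airport"] = "SFO"      # pretend the user picked SFO
        elif action == "ask_hotel_agent":
            state["hotel_nights"] = 6     # pretend the hotel agent replied
        else:
            return (f"Booked a flight into {state['airport']} plus "
                    f"{state['hotel_nights']} hotel nights.")
    return "Ran out of thinking budget."

print(run_agent("Trip to Palo Alto"))
```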
Sam Ransbotham: That’s good. The other part I think about is small language models and the iteration and retraining time. When we talk about these large language models, we’re talking about literally months and tons of power to recompute them. Objectively speaking, small language models by definition have to be faster, even if less performant. So then it seems like you could iterate better.
Walter Sun: That’s right, absolutely right. You can improve, iterate, and also fine-tune. One thing I like about these models is each company has its own policy and preferred partners. So you can fine-tune these models with actual data from a company. Let’s say that we have this travel agent model, and then each company deploys this model, this agent, and then it’s doing reinforcement learning. It’s learning in terms of how to be a better travel agent for this company, maybe for an individual, longer term, if there are enough data samples.
So a company has preferred airlines, preferred hotels, policies against flying business or policies on not flying red-eyes or minimizing connections. The model will learn that and say, “OK, I have a general model, which tells you how to fly from airport A to airport B anytime you want. Over time, I’m learning this company’s employees always pick no red-eyes. If that’s the case, then I’ll lower the ranking of red-eye flights and prefer just daytime flights.” And so [those] types of things, like you said, you can iterate faster with a lower cost of retraining, and you can do it very quickly.
Whereas, these very large models, you can’t really just retrain them for one company without it being a very expensive cost.
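A sketch of that preference learning as a simple re-ranking step: start from a general ranking (say, by price) and add penalties for options, like red-eyes, that a company’s travelers keep rejecting. The flights and penalty weights are made up; a real system would learn them from booking feedback.

```python
FLIGHTS = [
    {"id": "F1", "departs_hour": 23, "connections": 0, "price": 350},  # red-eye
    {"id": "F2", "departs_hour": 9,  "connections": 1, "price": 310},
    {"id": "F3", "departs_hour": 11, "connections": 0, "price": 420},
]

# Penalties standing in for learned company preferences.
RED_EYE_PENALTY = 500      # this company's employees never pick red-eyes
CONNECTION_PENALTY = 80    # and they mildly prefer nonstop flights

def score(flight: dict) -> float:
    # Lower is better: price plus penalties for dispreferred attributes.
    s = float(flight["price"])
    if flight["departs_hour"] >= 22 or flight["departs_hour"] <= 5:
        s += RED_EYE_PENALTY
    s += CONNECTION_PENALTY * flight["connections"]
    return s

for f in sorted(FLIGHTS, key=score):
    print(f["id"], score(f))
# The daytime nonstop F3 now outranks the cheaper red-eye F1.
```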
Sam Ransbotham: [What’s] going to be a challenge we’ll all face in the future, as we use more agents, is figuring out what an agent’s objective criteria are.
Walter Sun: I think it’s interesting. The same issue exists with web pages, right? And websites and domains, search engines, everything we do. I think part of the reason [is] that online companies or software providers, even like SAP, need to develop … trust [with] our customers and our users. Over time, you build trust and say, “Hey, we don’t have an ulterior motive,” or “Our goal is to do what’s best for you.” And it takes time to build that trust. And [it’s the] same with web pages. Sometimes you go to websites and domains, and maybe they show certain sources of information and not others.
You have to learn over time, if I’m going to this domain, I have to take [the results] with a grain of salt. I’m getting data [that] is leading a certain way or the other way. Or if I’m looking at certain companies [that] have other partnerships, I go to a travel agency and they’re owned partially by another specific airline or something, then you have to take [those results] with a grain of salt.
Sam Ransbotham: It seems like, to tie this back to SAP a little more strongly: If I think about SAP, its strength is, honestly, a massive amount of tabular data.
Walter Sun: Yes. The more we do, the better we can get correct answers. I think there’s the other aspect you mentioned — predictive — which I like because language models, in fact, aren’t good at math. There are all these stories about the failures they’ve had. Of course, people will argue that these AI agents, which use reasoning, are now able to solve [math] Olympiad problems, and that’s true. But when language models are built, they learn from the internet, and they learn about language, so they’re very good at language, and they predict the next word in a sentence. So you say, “I want to go to the store and buy some blank” —
Sam Ransbotham: Apple.
Walter Sun: … and then you can write a story and say, “Buy milk or apples or whatever,” and it can go through this, and it’s all predictive. It says, “OK, what do people on the internet say they want when they say that?” But language models don’t have a lot of tabular data, [for] privacy reasons. “Here’s my bank account, here are my eight transactions,” or “I’ve made these 15 credit card purchases at the grocery store; [predict] the 16th.” This information isn’t generally available. So language models aren’t trained to do that.
And so what we’ve done at SAP, given the breadth of [our] applications and access to data, with [personally identifiable information] removed, is train off of a much larger data set. I’ll give an example where everyone understands the reason, and then we’ll talk about how this can work for more subtle cases. If I take a look at all retailers in the North American area and their transactions, I can come to the conclusion that from late November to mid- or late December, purchases are a lot higher.
There are a lot more purchases every year, around some late November date. It changes a little bit every year, but give or take within a seven-day window, it runs from late November into late December. That information doesn’t give away secrets from any one company; rather, it’s about the retail business in North America. So if every company opts in and we anonymize the data, we don’t need any individual company’s data, because this is a pattern [that] works across the industry. In fact, we don’t want individual companies’ data driving the patterns, because that’s not useful for all of our customers.
The example I’m giving obviously is Black Friday to Christmas, but you can imagine more subtle things. Say I noticed that on Thursdays in the South China Sea, shipments are always delayed. That can only be determined if you have a bunch of data from a bunch of different customers, right? If you have one shipper, that shipper might not work on Thursdays. But if you have 100 shippers, and all 100 shippers see [the same] delays in the South China Sea on Thursdays … I’m making this up, but it might be that there’s some reason. Maybe the government is approving imports and exports, or there are [other] reasons for the delays, and it’s good for a company to know.
The value of this tabular foundation model that we’re building is to help provide insights there. I mentioned earlier that we’re not building a large language model, but we are building a tabular foundation model to complement a large language model, using the same transformer-based technologies that LLMs use. This model is focused on business data and predictions.
I’ll go back to the last thought about predictions. Predictions about the future don’t have to be perfectly accurate. I understand the point that [for] facts about the past, we can’t make errors, but one thing we want to do with this model is also help you with forecasting: in this case, predicting that [on] future Thursdays, there may be some disruption in the South China Sea. Those things hopefully are useful for businesses, as long as they understand they’re still predictions and not facts.
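To make the cross-customer pattern idea concrete, here is a toy sketch: aggregate anonymized delay records from many shippers and flag the weekday that is reliably worse. The data is synthetic with the Thursday effect baked in; the tabular foundation model Walter describes would learn such patterns, and far subtler ones, from real aggregated business data.

```python
import numpy as np

rng = np.random.default_rng(1)
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# 100 shippers x 7 weekdays: average delay hours, with Thursday inflated.
delays = rng.normal(2.0, 0.5, size=(100, 7))
delays[:, 3] += 1.5   # the hypothetical South China Sea Thursday effect

# Aggregate across shippers; no single company's data is identifiable.
means = delays.mean(axis=0)

for day, m in zip(DAYS, means):
    flag = "  <- anomaly" if m > means.mean() + 2 * means.std() else ""
    print(f"{day}: {m:.2f}h avg delay{flag}")
```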
Sam Ransbotham: Our show is Me, Myself, and AI. And, obviously, we know that you’re brilliant from your Georgia Tech background, but tell us a bit more about the rest of your life between Georgia Tech and currently.
Walter Sun: My background: I wanted to get some experience before my Ph.D., so I did my master’s degree at MIT, and then I took a leave of absence for a few years to work and get some experience. And I did some work on Wall Street.
I worked at BlackRock doing quant work on fixed income. And then I went to the West Coast. I worked at Apple for a little bit, working on some of the early versions of QuickTime, which is now integrated into iTunes and other products.
And then I worked at a startup company called 2Wire, where we built one of the world’s first consumer broadband home routers. Then I went back to get my Ph.D. My Ph.D. focused on image processing and the stochastics, or uncertainties, behind it. You had talked about probabilities. I really like probabilities because the world is a very uncertain place. So being able to understand and model it as best as possible — maybe not perfectly, because we can’t predict the future perfectly — but modeling in a way that can prepare people, like supply chain agents or retailers, [to do] the right things. That’s valuable. So I spent a few years doing that. And then I joined Microsoft, almost 20 years ago now.
I worked on a variety of products, from the Windows codecs, which [are] basically encoders and decoders for video and audio, to later working on Bing for many years.
I actually started up a new product called Bing Predicts, which was to use big data to do predictive analytics. That area was very interesting to me, much like what we’re doing now. I mean, everyone knows that data can be used as fuel for AI. We were actually saying, “Hey, you can take a lot of searches and detect patterns.” I mentioned this example about the Thursday delays in the South China Sea, those types of things. We were actually predicting political affiliations based off of aggregated and anonymized web search traffic.
In much the same way, with the tabular model today … we can do it without looking at individual companies’ data. We can look at people’s preferences without looking at individual searches or individual data. That’s the kind of thing I was doing at Microsoft as well: looking at how to do predictive analytics with big data. And then I joined SAP a couple of years ago, and I’ve been working on building out generative AI for our customers and our products.
Sam Ransbotham: Well, it’s been a fascinating discussion. The thing that really struck me is the depth of your knowledge about how all these pieces can work together in an ecosystem of enterprise resource planning that has many customers and is balancing the difficulties of vendors and the changing technology. It’s a difficult place to be, but I’ve enjoyed talking to you. Thanks.
Walter Sun: Thank you. I really enjoyed it too. I think that part of the reason we are passionate at SAP about building these technologies is that, like you said, we have a wide base of different technologies that we provide for our customers. By virtue of having that base, we can try to understand across the board. We have a knowledge graph that connects different applications together, and we want to have this Joule AI agent connect with the AI agents across each line of business in order to provide the best answers for our users.
Shervin Khodabandeh: Thanks for listening, everyone. Next time, Sam and I speak with Josh Weiner, senior vice president of consumer engagement and analytics at CVS Health. Please join us.
Allison Ryder: Thanks for listening to Me, Myself, and AI. Our show is able to continue, in large part, due to listener support. Your streams and downloads make a big difference. If you have a moment, please consider leaving us an Apple Podcasts review or a rating on Spotify. And share our show with others you think might find it interesting and helpful.