On today’s episode of Me, Myself, and AI, host Sam Ransbotham talks with Alice Xiang, global head of AI governance at Sony and lead research scientist for AI ethics at Sony AI, about what it actually takes to put responsible artificial intelligence into practice at scale. Alice shares how Sony moved early on AI ethics and why governance, not just principles, is now the real challenge as AI spreads across products and workflows. The conversation dives into FHIBE, Sony’s publicly available and ethically sourced benchmark for evaluating bias in computer vision, and why measuring fairness is often harder than fixing it. Along the way, they tackle data consent, “data nihilism,” and the very real risks of deploying biased systems in everyday and high-stakes contexts.
Subscribe to Me, Myself, and AI on Apple Podcasts or Spotify.
Transcript
Allison Ryder: Why did one large company decide to create a data fairness tool that’s free and publicly available? Find out on today’s episode.
Alice Xiang: I’m Alice Xiang from Sony, and you’re listening to Me, Myself, and AI.
Sam Ransbotham: Welcome to Me, Myself, and AI, a podcast from MIT Sloan Management Review exploring the future of artificial intelligence. I’m Sam Ransbotham, professor of analytics at Boston College. I’ve been researching data, analytics, and AI at MIT SMR since 2014, with research articles, annual industry reports, case studies, and now 13 seasons of podcast episodes. In each episode, corporate leaders, cutting-edge researchers, and AI policy makers join us to break down what separates AI hype from AI success.
Hi, listeners. Thanks again to everyone for joining us. Today I’m talking with Alice Xiang, global head of AI governance at Sony and lead research scientist for AI ethics at Sony AI. She leads a team that guides the establishment of AI governance policies and frameworks across all of Sony’s business units. She’s been a research scientist and a whole bunch more. Alice, thanks for joining us.
Alice Xiang: Thank you so much for having me.
Sam Ransbotham: To start, I think we first talked a few years ago when you were at the Partnership on AI. I’m curious what you’re up to now. Can you tell us about Sony’s work on responsible AI ethics and governance?
Alice Xiang: Sure. Sony is a large multinational company headquartered in Japan with a diverse array of businesses around creative entertainment and technology. We have operating companies focused on music, motion pictures, video games, and electronics. AI became an early focus of ours back in 2018 when we first set up our AI ethics guidelines. We did that because, as a technology company, we wanted to ensure that this new and emerging technology was being used responsibly across our business units. Indeed, since then it’s only grown in importance for our company.
I have two hats on at Sony. One is as global head of AI governance. Over the past several years, we were one of the early companies to start investing in AI ethics. When I joined, I established our AI ethics office and our processes in terms of how different AI uses and AI integration into products and services [are] evaluated for responsible AI.
Now we’re at the point of not just thinking about AI ethics but also AI governance. So how do we establish these frameworks for ensuring the responsible evaluation of these technologies that are increasingly being integrated in every aspect of business?
That’s one hat that I have on, policy making, guidance setting, so on and so forth. Then the other is leading our AI ethics research team within Sony AI. A lot of the work of my team over the past several years has been looking at … the fundamental gaps and barriers that practitioners face in terms of being able to develop responsible AI in practice. One of the major areas there is lack of ethically sourced data, even for pretty basic things, like being able to check for bias in models. There [are] not really great fairness evaluation data sets in many areas like human-centric computer vision. So we’ve been doing a lot of work there, and that’s recently culminated in a publication in Nature of our Fair Human-centric Image Benchmark, also known as FHIBE. We really hope that our work can help enable the broader community, both within Sony and outside, to be able to move toward more trustworthy and responsible AI development.
Sam Ransbotham: I got pretty excited when I saw the FHIBE work. I think it’s pretty interesting. It’s definitely a problem. I’m glad there are a few people working on this. Give us some details, like what exactly is involved in FHIBE, and what are the pieces, and how do I use it tomorrow if I want to?
Alice Xiang: It’s so interesting, because when I first got into this AI ethics space around computer vision, I kind of assumed that a lot of my role would be helping people figure out how exactly to do fairness assessments. It’s a pretty difficult area in terms of: How do you measure fairness? What does fairness mean? What do you do when you have biases? But I realized the biggest initial barrier folks have is just being able to evaluate for bias. Theoretically … you want a data set that’s been ethically sourced. That means there’s been appropriate consent and compensation and sourcing throughout the process, so everyone who’s participated in that data collection process has been appropriately compensated and has consented to their data being used and has control over how that data is being used. Then also, for bias evaluation in particular, you want to have a very diverse, ideally globally diverse data set, because you don’t want to be checking for bias when all of your subjects have light skin, for example. In that case you really wouldn’t be able to tell whether your model performs well on darker skin tones.
So it was quite easy to say what would need to go into this, but when you actually looked at the different data sets that are available, it turned out that the standards in the field were quite low. Computer vision is a field that played a major role in the deep learning revolution, and part of that was these web-scraped data sets that were massive and relatively cheap to source, because they didn’t involve any of these considerations of consent and compensation.
Even though it’s been many years now of this field progressing, and the technology has just gotten better and better, that baseline of relying on problematically sourced data sets hasn’t necessarily changed a lot. A lot of our work in the past several years has been really thinking deeply about, “How do we actually do this in practice?” It’s very easy to say, “Please collect data from people around the world. Please ask them for consent. Please pay them. And then please make a rigorous benchmark that can be used to check a lot of different types of AI models.” But that’s much more difficult to do than it is to say, and that’s what a lot of our project has been about.
Sam Ransbotham: I think that’s always the case with this. I think no one comes out there and says, “Hey, I’d really like to be unethical, and let’s have our company have terrible governance.” It’s kind of like data cleaning in some way. Everybody knows that they want clean data, but it’s easy to say and hard to do.
Let’s start with measuring bias. How does that manifest itself in corporate America in terms of mistakes?
Alice Xiang: [That’s a] great question. For the most part, I think there’s a lot more awareness of bias than there [are] good practices of actually measuring and mitigating it. When folks don’t do this, what ends up happening is you have technologies that are released that maybe don’t work well for certain subpopulations and especially in areas like human-centric computer vision that are used in surveillance contexts, law enforcement contexts, but also just everyday contexts, like unlocking your phone, making payments on your phone, or going through border entry.
These are all different areas where if the technology doesn’t work well for you, at minimum, it’s an inconvenience. Maybe you have to look at your phone several times at different angles before it can recognize you. But this can also lead to much more problematic impacts in terms of anything from financial fraud to folks being wrongfully arrested. This is an area where I think everyone knows it’s a problem; everyone wants to make it better. No one actually wants these technologies to perform poorly on folks. But if you don’t have good ways to measure bias in the first place, then there’s no way you’re going to be able to then further try to mitigate that.
Sam Ransbotham: I like your gamut of possible badness. I wouldn’t be surprised if we added up the societal cost of all the tiny problems and found it was actually a big number in terms of … something silly, like total lost productivity from looking at your phone three times versus one time. … Maybe I’ll tell a story here. One of the things I do in class is I have a data set that has star-bellied Sneetches, if you remember the star-bellied Sneetches from Dr. Seuss. We’re not discriminating against stars or non-stars — what we do is correlate other data in that data set with the stars and show that even if you ignore those stars, you still end up with bias.
So help me with my class exercise. We end up making a bunch of biased models that inadvertently use stars on star bellies. What should students do from there? If you’re listening, students, don’t copy this.
Alice Xiang: In terms of how FHIBE can help in this situation, for example, there [are] two major components of FHIBE. One is the ethical sourcing component, which I talked [about] a little bit before, of consent and compensation. The other is making sure that you have a wide variety of diversity and labels for these different sorts of demographic attributes and other attributes as well. A few notable aspects of FHIBE were that we did use self-reported demographic information, both to ensure our labels were more accurate and to avoid relying on third-party annotators to guess, for example, is this person from this ancestry or that ancestry, this gender, that gender? It gets into a pretty dicey place pretty quickly.
Self-reported demographics were really key. And then we had also extensive annotations about the environment, other physical attributes of the person, the cameras being used, all sorts of things. This really allows you to kind of slice and dice a bit more [into] what are some of the relevant types of biases that you might be concerned about and also, on a more granular level, diagnose what might be some of the underlying causes. So, for example, when we talk to folks in computer vision, [for] something like skin tone, it might be the skin tone itself, it might be issues of contrast like with the background and stuff. There [are] many different ways in which someone’s, say, race or gender might manifest into visual artifacts that then can make the model perform better or worse. That’s really useful for folks then, to be able to figure out how to improve their models.
Using your example, we want to know which creatures have the stars on their bellies, which don’t, so we can see how they’re being treated differently. But not just that dimension. We want to see a lot of other dimensions as well and then be able to [say], “OK, where we see those differences in performance for the model, why is that? Why is the model having such a hard time on stars versus not stars?” And then go from there in terms of further improvements.
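To make the “slice and dice” idea concrete, here is a minimal, hypothetical sketch of a disaggregated evaluation in Python. The column names and numbers are illustrative only and are not FHIBE’s actual schema; the point is simply that an overall accuracy figure can hide gaps that only appear when results are broken down by attribute and by intersections of attributes.

```python
# A minimal, hypothetical sketch of a disaggregated ("slice and dice")
# fairness check; column names and values are illustrative only.
import pandas as pd

# One row per evaluation image: its annotations plus whether the model's
# prediction was correct on that image.
results = pd.DataFrame({
    "skin_tone": ["light", "light", "light", "dark", "dark", "dark"],
    "lighting":  ["bright", "bright", "dim", "bright", "dim", "dim"],
    "correct":   [1, 1, 1, 1, 0, 0],
})

# The overall number looks reasonable on its own ...
print("overall accuracy:", results["correct"].mean())

# ... but breaking it down by attribute, and by intersections of
# attributes, shows where the model struggles and hints at why
# (the skin tone itself, or contrast and lighting conditions?).
print(results.groupby("skin_tone")["correct"].mean())
print(results.groupby(["skin_tone", "lighting"])["correct"].mean())
```

In practice, a benchmark like FHIBE supplies many more annotations (environment, camera, other physical attributes), which is what makes this kind of granular diagnosis possible.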
Sam Ransbotham: I like that because it seems like you’re pointing out a couple of different things. I think that when we talk about these things, it tends to go down a path of, “Oh, get more data.” Certainly, there’s nothing wrong with getting more data. I think we all would like more data. It’s not always practical or possible to do that. You actually mentioned a whole different category of the ethical issues around the ways that you get them. And that’s true as well.
But one thing you’re pointing out there is that the modelers can actually make some technical improvements here. I don’t want us to say, “All right, well, we’ll give up. [We’ll] just take whatever data we have and just let the modelers figure it out.” That never is a good solution. But you are offering some hope here, and I guess that’s some of what FHIBE is about: giving modelers a better data set to work with, to do that. Have you seen people pick up and start using this?
Alice Xiang: Yeah, we’ve seen a lot of great pickup, even just in the first couple weeks of FHIBE’s release. There were over 60 different institutions downloading it — folks from academic institutions, industry, and government institutions. It was great to see the wide swath of folks who were interested in using FHIBE. Our hope is that this can be an industry benchmark that lifts standards around responsible data collection in general, regardless of whether future data sets are collected for training or evaluation, or for fairness-oriented purposes or not. Most immediately, it lets folks start checking their models, since I think fairness evaluations can be very empowering in that they open up a lot of different avenues for how you could make model improvements. Like you mentioned, there [are] possibilities in terms of collecting more data.
There [are] also possibilities in terms of trying to think about the loss function of the model, what is it being optimized for, how to train it to optimize a bit more for different groups, and thus have more balanced performance. There [are] also possibilities on the nontechnical side as well. For example, [if] your model does not perform very well in certain lighting conditions, then maybe you don’t use that particular model for certain downstream uses, or maybe you try to have mitigations, like if it’s on a device, maybe that device will have like a flashlight or something to illuminate before it does whatever task that [it’s] carrying out. So it’s always good on the mitigation front to think about these models as being embodied in real-life systems as well, because it’s not just a matter of the model itself. It’s everything around it that then impacts how ethical the use case is.
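As one illustration of the loss-function idea, here is a minimal sketch of inverse-frequency group reweighting during training. This is only one possible approach, not Sony’s method or anything FHIBE prescribes; the function name and group labels are hypothetical.

```python
# A hypothetical sketch of one mitigation: weight the training loss so
# under-represented groups count more, aiming for more balanced performance.
import torch
import torch.nn.functional as F

def group_weighted_loss(logits, targets, group_ids, num_groups):
    # Per-sample cross-entropy, not yet averaged.
    per_sample = F.cross_entropy(logits, targets, reduction="none")

    # Weight each sample inversely to its group's frequency in the batch,
    # so small groups are not drowned out by large ones.
    counts = torch.bincount(group_ids, minlength=num_groups).clamp(min=1)
    weights = (1.0 / counts.float())[group_ids]
    weights = weights / weights.sum()

    # Weighted average of the per-sample losses.
    return (weights * per_sample).sum()
```

Whether a technique like this actually improves subgroup performance has to be checked empirically, which is where an evaluation set like FHIBE comes back in.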
Sam Ransbotham: I like a lot about that because I think most of the world would love it if you would come out and say, “All right, here’s a magical answer to solve bias and to solve fairness. Just use this benchmark.” What you’re pointing out is a lot of different mitigation strategies that each one, I’m guessing, is imperfect. I mean a flashlight on a device will probably help but not completely solve. But you add enough of these things together and then we’re improving. It’s not “Oh, we’ve solved the problem,” but it’s made some steps toward that.
Actually, that makes me think about a phrase. I was reading something that you were talking about — data nihilism. Maybe talk briefly about what that means for listeners who didn’t read that article. And we can tie that to what you were just saying.
Alice Xiang: When we first started FHIBE, it was like an AI ethics problem. We put together all these ethical desiderata, including consent and compensation. And at the time, there wasn’t as much discussion, I would say, in the mainstream about ethical sourcing of data for AI models.
Nowadays, though, I think that has grown a lot, especially with the whole [generative] AI revolution. Now everyone’s very conscious of the fact that their data and their content is probably being ingested by AI models somewhere, somehow. I think that can lead to a sense of data nihilism where folks feel like, “OK, either we have these superpowerful models, which, maybe at this point, you know the cat’s out of the bag there, and we have to give up on all of our data rights. We have to give up on control there. Or we try to reverse things, but maybe that’s not possible.”
That’s what I mean by data nihilism, this sort of feeling of this dichotomy that we’re trapped in, where we really can’t have the technology and also have any sort of control over our data. I think what’s really notable about FHIBE is we really sought to show that you could source data in a more ethical way. Obviously, it’s more difficult, it’s more expensive, but FHIBE was kind of the proof of concept that, at least on some scale, you’re able to do this. There [are] so many brilliant minds right now working in the AI space that, if these practices are considered important, then we can figure out ways to try to scale this and change how these technologies are being developed, and preserve more data rights in the process.
Sam Ransbotham: I get that, that whole idea of “It’s out there. My images are out there. Everything is out there anyway.” But we can all make a bit of progress toward that. And if everybody does a little bit, then it can help. I like that overall message. How does that reflect back into Sony products?
Alice Xiang: FHIBE is being used across Sony as well. Even before the launch, we tested it out with a number of our business units that are developing computer-vision technologies. It has become an important part of enabling them to do more fairness assessment. That way, we are able to do that sort of diagnosis, [to] see [if there] are any failure modes of these models and then work with them on possible mitigation strategies as well. I think it’s been great to see that, because, again, the public release of FHIBE is meant to encourage this happening everywhere in the industry, and we hope it will put more attention on this issue and unblock folks to be able to see, “Yeah, there maybe are some things that could be improved in these models before they go out.” Hopefully, eventually, that becomes more of an industry standard as well. One thing that’s quite difficult now is … without requirements to always do these sorts of assessments, it’s really on individual business units or companies to decide that they care about this and want to assess for bias.
Sam Ransbotham: That last point is a huge point, because if you give someone a questionnaire [asking,] “Do you want to do the right thing?” everyone’s going to tick the “yes” box, but in the end, we all are constrained by time and resources. I admit I’m lazy. If there’s a shortcut, I’m going to say, “Well, why don’t I go down that path?” It certainly has been a shortcut within the community to go for this data that’s just sort of out there and lingering. I think one thing you’ve done there is moving toward making it easy. We’ve had a previous guest, Ziad Obermeyer. He was working with medical data. And it’s very hard for normal people to get a bunch of medical data to build models, and his thinking was that [his nonprofit] Nightingale Open Science would collect a bunch of data and let the work of solving problems with that data be separate from the work of sourcing it. That’s a lot of what you’ve done there.
You mentioned FHIBE’s largely [focused] around image information. What about things like voice, sound, other modalities? How should we be thinking about those?
Alice Xiang: We positioned FHIBE around image recognition because it is one of the most sensitive areas. There’s just so much biometric and other personally identifiable information available in images, and the [intellectual property] rights around images are particularly important as well. Other modalities, like voice, have similar challenges and similar considerations around the consent of the individuals being recorded, for example, and in certain contexts there might also be copyright considerations.
But I see it this way: FHIBE took on a bit more of a superset of the ethical issues that might come up, and in most other modalities you’ll see more of a subset of that. So there’s obviously more research that would need to be done to apply those, but some of the general concepts in terms of consent, compensation, privacy, IP, diversity, and fairness — all of those at a high level — would apply in these different contexts. So we hope that some parts of this can be recycled as well for folks [who] are collecting data in other modalities.
Sam Ransbotham: I like that. In principle, it’s just a stream of ones and zeros, and how we interpret that is largely up to us. Maybe I was self-motivated because I’ve never been able to get through a phone tree successfully because nothing recognizes my beautiful Southern accent, but that’s a problem I face. I guess we’ll have to wait for the next version of FHIBE to help out with that.
Alice Xiang: It’s a great example of how important diversity in these evaluation data sets is. You can imagine some of the challenges, too, in terms of how to identify folks with different accents and how to classify those, and how to source from … not just around the world but very specific regions as well in order to get that kind of diversity, and then all of the different languages as well layered on top.
It’s definitely also a very rich area of research, and we hope that FHIBE helps inspire more of this as well, because I think the reality of ethical data collection is [that] a lot of it is quite difficult, challenging work that requires this level of operationalization and thinking about real-world challenges, too, that often researchers shy away from. It’s not as glamorous as coming up with the new way for models to actually learn and such, and algorithmic improvements usually are much faster to develop and publish — or maybe not always faster to develop, but at least people get very excited about the technical improvements.
That said, when we talk about responsible AI, like what you mentioned, it’s really hard to make progress in these areas if we don’t actually think about the real-world question of, “How do you collect data from different folks?”
Sam Ransbotham: Actually, we had a guest on a few weeks ago from Wendy’s. He was working with their drive-through restaurants. That’s all voice information. I remember Will [Croushorn] was talking about how many different varieties of the ways that people speak there are, and they faced a lot of challenges in that process there. But going back to FHIBE, … it’s a good example [of] a process that you can follow there.
I want to switch here a little bit and talk about how you got there. When I last talked to you, you were at the Partnership on AI. Tell us a little bit about your career path and how you ended up at Sony, and how you got interested in these things. Take us on that path.
Alice Xiang: I feel like it’s kind of been a winding path. Overall, my background is very interdisciplinary. I started out more on the math and economics side and thinking I was going to work in economic policy. So that was my main target there. And then my first stint working at a tech company and developing my first [machine learning] model really was very illuminating to me. At the time, that’s when I first got interested in these issues of algorithmic bias, because I realized the model that I myself was developing was quite biased.
The data was skewed toward certain geographies. What it was learning made a lot of sense for people in those geographies but not elsewhere necessarily. And part of it was also kind of my personal experience as well. So the data I had access to was primarily from folks on the East Coast and West Coast, and this was job-related data. For myself, coming from Appalachia, I knew that it was quite different in terms of the economic opportunities that people had there, and any sort of model built on this data was not necessarily going to work the best for folks like those who I grew up with. And that concerned me a lot. I talked to a lot of folks in the company about this, but there was no such thing at the time as a field of algorithmic bias. This was like 12 years ago now.
But it got me really interested in this area and how we think about ethics and justice in the context of these new technologies that are learning from massive amounts of data. And that sort of steered my career more into the tech policy space, and the AI and algorithmic fairness space in particular. So, after finishing all of my graduate programs — I have graduate degrees in law, statistics, and economics — I ended up working a bit at a law firm and then going to the Partnership on AI since I wanted to really focus in on this issue of algorithmic bias. And there I started up a research lab focused on these areas. Pretty quickly these issues of data availability became quite salient.
Alice Xiang: If you look at the algorithmic-fairness literature, there’s a lot of emphasis on metrics, on how you measure bias and how you mitigate it, because those are kind of the really fun technical problems. But at PAI, we were doing these multi-stakeholder convenings where consistently, convening after convening, all of the companies were saying, “You know, we actually are struggling with that first step: If we want to collect data that has demographic information in it, our privacy team will just shut that down. But then if we don’t have any demographic data, it’s really hard to do any sort of fairness assessment.” So, going back to the star belly example, if we don’t even know who has stars on their bellies and who doesn’t, then how are you going to tell if your model works well for creatures with stars on their bellies? That became a major focus for me from a research perspective: What can we do in that context?
When I moved to Sony, I thought that this was a really great opportunity to not just report on this issue and try to impact this more on a policy level, but instead to create a solution to this problem. And that’s what inspired FHIBE, since it’s one thing to say, “Yeah, this is a problem. People need to do something about it.” It’s another thing to create a data set that fills the gap.
Sam Ransbotham: I like your bringing out the catch-22 because I think that’s something that we don’t always appreciate, that we have to collect a bunch of seemingly invasive information in order to assess whether or not we’re making decisions based on seemingly invasive information. This is an inherent problem.
One of the things that we do here is a quick segment where I like to ask some rapid-fire questions. Just answer the first thing that comes to your mind. What’s moving faster about artificial intelligence than you expected, and what’s moving slower?
Alice Xiang: That’s a good question. I guess, what’s moving faster is just how much it’s being integrated into everyday life, especially in companies. It’s been interesting to see just in the past couple of years how everyone’s KPIs have suddenly become about AI adoption. That would have been hard to predict maybe five to 10 years ago. And then in terms of what’s moving slower, in general, [there’s the] actual implementation of anything around AI ethics in model development. It still often feels very divorced from actual practitioners implementing these different techniques. Again, that’s kind of part of the motivation of FHIBE, to create that incentive there, but it has been something that I haven’t seen as much of across the field.
Sam Ransbotham: I have to say I’m not a bit surprised that was your answer about [what’s moving] slower, given your background. What’s the worst use of AI? How are people using this technology poorly?
Alice Xiang: Contexts where there isn’t sufficient human expertise or oversight, but there are very important decisions being made. I say it kind of abstractly in a way because I think oftentimes folks will point to specific types of use cases, like high-risk use cases like HR or health care or criminal justice. I think it’s a little bit more use-case specific than that. You can theoretically have a good use in those domains, if you have the proper human oversight, the proper training both for the individuals and proper oversight of the technology. But where it’s just kind of being used in a very autonomous fashion and there’s either no one providing oversight or the individuals providing oversight don’t have the knowledge or expertise to do so, that’s where I would be most worried.
Sam Ransbotham: How do you personally use these tools?
Alice Xiang: Not as much as maybe you’d expect. I feel like I spend a lot of time auditing them. That’s probably the most common use case for me.
Sam Ransbotham: I think it counts as a normal use case.
Alice Xiang: Time-wise, probably the most common is me trying to audit them and see where they might have failure modes, but I also do use them in my daily life, to check on the health of my plants, for example. I will say, they are quite useful in contexts where maybe you’re a novice and you want some starting point for diagnosing things.
Sam Ransbotham: One question we used to ask people was, “What was the first career you wanted?” And given the variety of your background, let me ask: What did you want to be when you grew up?
Alice Xiang: When I was a little kid, I wanted to be an artist. In a way, I feel like my current work is starting to nudge back in that direction. With the way AI development has gone, a lot of these AI ethics issues now coincide with artists’ rights, which was not at all the case when I first entered this field. It’s kind of nice, in that I have always been really interested in art as well.
Sam Ransbotham: I was not expecting that. Well, this has been fascinating. I think the kinds of efforts you’re making are about making it easier for people to do the right thing, and FHIBE is one example. As long as societally we make it easier for people to do the right thing, then people will be more likely to do the right thing. So I’m thrilled at the amount of effort and work that you all put into doing this. Thanks for taking the time to talk with us today.
Alice Xiang: Thank you so much for having me. Hopefully our work can help inspire others to also do the same.
Sam Ransbotham: Thanks for listening today. I hope you’ll download FHIBE and give it a try. On our next episode, I’m joined by Taylor Stockton, my former student, and chief innovation officer at the U.S. Department of Labor. Please join us.
Allison Ryder: Thanks for listening to Me, Myself, and AI. Our show is able to continue, in large part, due to listener support. Your streams and downloads make a big difference. If you have a moment, please consider leaving us an Apple Podcasts review or a rating on Spotify. And share our show with others you think might find it interesting and helpful.