2024 Excellence in Technology Reporting, Large Newsroom winner

Uncovering bias in two leading AI systems

About the Project

Bloomberg conducted a pair of months-long, data-driven investigations to systematically test two of the most widely used generative AI services for harmful biases: Stable Diffusion, an image generator, and OpenAI’s GPT, the technology underpinning ChatGPT.

Stable Diffusion

At the end of 2022, Bloomberg reporters found that more and more institutions were relying on generative AI to create images for advertising, entertainment and other use cases, even though there had been minimal external testing of how safe these systems were for the general public. Bloomberg investigated Stable Diffusion and found that it is prone to reinforcing racial and gender stereotypes. We prompted the AI model to create more than 5,000 images of workers in various jobs, as well as people associated with crime, and compared the results with US government data on race and gender. The analysis revealed that the text-to-image model does not just replicate real-world stereotypes; it amplifies them. For example, women are rarely doctors, according to Stable Diffusion, and men with dark skin are more likely to commit crimes. Through data visualizations and AI-generated photographs, the story explored how these systemic biases in generative AI have the frightening potential to exacerbate real-world inequities.
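The broad shape of such an audit can be sketched in a few lines of Python. The snippet below is an illustration rather than Bloomberg’s own pipeline: it uses the open-source diffusers library to generate images for a placeholder list of occupation prompts, and it omits the later step of classifying perceived skin tone and gender and comparing those counts against US government data.

```python
# Illustrative sketch only, not Bloomberg's pipeline: generate a small batch of
# Stable Diffusion images per occupation prompt. The occupation list, prompt
# wording and sample counts are placeholders; classifying the images and
# comparing them with labor-force demographics happens in a later step (not shown).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

occupations = ["doctor", "judge", "engineer", "housekeeper"]  # placeholder list
images_per_prompt = 10  # the real analysis generated thousands of images in total

for job in occupations:
    prompt = f"color photograph of a {job}, portrait"  # hypothetical prompt wording
    for i in range(images_per_prompt):
        image = pipe(prompt).images[0]
        image.save(f"{job}_{i:03d}.png")  # saved images are coded for demographics later
```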

GPT

Months after publishing the Stability AI investigation, Bloomberg reporters noticed another concerning trend: A cottage industry of services had emerged to help HR vendors and recruiters interview and screen job candidates using AI chatbots, despite concerns that generative AI can replicate and amplify the biases found in its training data. Over the course of five months, Bloomberg conducted a data-driven investigation and found that OpenAI’s GPT, the best-known large language model and one that powers some of these recruiting tools, discriminates on the basis of names when ranking resumes. We replicated the workflow of a classic hiring discrimination study: First, we assigned demographically distinct names derived from public records to equally qualified resumes. Then we asked GPT to rank the candidates against one of four job postings from Fortune 500 companies. Bloomberg found GPT’s answers displayed stark disparities: Names distinct to Black Americans were the least likely to be ranked as the top candidate for a financial analyst role, for example. Meanwhile, names distinct to Asian women were ranked as the top candidate for the analyst role more than twice as often as those associated with Black men. Importantly, Bloomberg did not find a single direction of bias across the four jobs we tested. Instead, the biases varied with the job posting used to evaluate candidates. For example, GPT preferred names distinct to women for an HR role, in line with occupational stereotypes. Bloomberg’s reporting shows that names alone are enough to produce racialized outcomes when generative AI is used in hiring.
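In code, that experiment can be compressed into a short sketch. The snippet below is illustrative only, not Bloomberg’s published methodology: the names, resume text, job posting and model choice are placeholders or assumptions, and a real audit would run many more trials with name lists drawn from public records.

```python
# Illustrative sketch only: attach names to an otherwise identical resume,
# ask a GPT model to pick the top candidate for a job posting, and tally
# how often each name wins. Names, resume and posting are placeholders.
import random
from collections import Counter
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

names = ["Name A", "Name B", "Name C", "Name D"]  # stand-ins for the study's name lists
base_resume = "Financial analyst, 5 years of experience, CFA Level II, SQL, Excel."
job_posting = "Seeking a financial analyst to build models and report to leadership."

top_counts = Counter()
for _ in range(100):           # repeat to estimate top-rank rates per name
    random.shuffle(names)      # vary ordering so list position doesn't drive results
    resumes = "\n\n".join(f"{name}: {base_resume}" for name in names)
    prompt = (
        f"Job posting:\n{job_posting}\n\nResumes:\n{resumes}\n\n"
        "Rank the candidates for this role and return only the best candidate's name."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model name is an assumption for this sketch
        messages=[{"role": "user", "content": prompt}],
    )
    top_counts[reply.choices[0].message.content.strip()] += 1

print(top_counts)  # a persistent skew across name groups suggests name-based bias
```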

Judges’ Comments

This investigation provides a social service, demonstrating bias in AI systems. It’s a complex topic to investigate, but extremely relevant because it offers the transparency that algorithms and LLMs don’t provide. The presentation is beautiful and clean, and the search for truth in the darkness is inspirational.