Measuring Machine Learning Bias By Dr. Supriya Ranjan Mitra, Director, IT at Schneider Electric

Dr. Supriya Ranjan Mitra, Director, IT at Schneider Electric | Monday, 10 February 2020, 06:28 IST

Supriya holds a B.Tech in Mechanical Engineering from I.I.T Madras and a PhD in SCM from Syracuse University, NY. He has authored publications in leading journals and textbooks.

Gartner estimates that by 2021, AI augmentation will generate $2.9 trillion in business value and recover 6.2 billion hours of worker productivity. Yet recent surveys and studies have revealed that fewer than 1 in 4 people trust AI to make significant life decisions on their behalf. The emergence and widespread use of Machine Learning (ML) systems in a wide variety of applications, ranging from recruitment decisions to pretrial risk assessment, has raised concerns about their potential unfairness towards certain groups of people. Anti-discrimination laws in various countries prohibit unfair treatment of individuals based on sensitive attributes such as gender and race. As a Guardian editorial from October 2016 put it: "Although neural networks might be said to write their own programs, they do so towards goals set by humans, using data collected for human purposes. If the data is skewed, even by accident, the computers will amplify injustice".

In a study of Google job ads, CMU professor Anupam Datta found that male job seekers were six times more likely than female job seekers to be shown ads for high-paying jobs. Amazon decided to scrap its ML-based recruitment engine in 2015, when it realized that the engine was not rating candidates for technical posts in a gender-neutral way. James Zou (from Microsoft Research) designed an algorithm to read and rank web page relevance. Surprisingly, the engine would rank information from female programmers as less relevant than that from their male counterparts.

Researchers are now writing fairness guidelines into machine-learning algorithms to ensure that predictions and misclassifications occur at equal rates for different groups. The first step in managing fairness is to measure it. We illustrate three types of bias measure with a hypothetical recruitment example below. In Table 1, gender is a sensitive attribute, whereas the other two attributes (relevant experience and relevant education) are non-sensitive. We assume 3 male and 3 female candidates, and a √ or X indicates whether the candidate met the relevant non-sensitive criterion or not. Table 2 indicates the interviewer's actual decision on candidate selection. Table 3 indicates the outcomes of three ML classifiers (C1, C2 and C3) on the same candidates. Table 4 records the presence or absence of the three types of classifier bias, as explained below.

Disparate treatment (DT) arises when the classifier provides different outputs for groups of people with similar values of non-sensitive features but different values of sensitive features. In the above example, candidates Male 1 and Female 1 (and likewise Male 2 and Female 2) have the same non-sensitive attribute values for experience and education. However, classifier C2 gives different predictions for Male 1 and Female 1, and classifier C3 gives different predictions for Male 2 and Female 2, so both classifiers are unfair due to disparate treatment.
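The DT check described above can be sketched in a few lines of Python. The candidate values and classifier outputs below are hypothetical: they are not copied from the article's tables, only chosen to be consistent with the statements made in the text (Male 1 and Female 1 share the same experience and education, as do Male 2 and Female 2). The function name `disparate_treatment_pairs` is likewise an illustrative assumption.

```python
# Hypothetical data (an assumption, not the article's tables):
# name -> (gender, relevant experience, relevant education), 1 = criterion met
candidates = {
    "Male 1": ("M", 1, 1),
    "Male 2": ("M", 1, 0),
    "Male 3": ("M", 0, 1),
    "Female 1": ("F", 1, 1),
    "Female 2": ("F", 1, 0),
    "Female 3": ("F", 0, 0),
}

# Hypothetical classifier outputs: 1 = hire, 0 = reject
predictions = {
    "C1": {"Male 1": 1, "Male 2": 1, "Male 3": 1,
           "Female 1": 1, "Female 2": 1, "Female 3": 0},
    "C2": {"Male 1": 1, "Male 2": 1, "Male 3": 0,
           "Female 1": 0, "Female 2": 1, "Female 3": 1},
    "C3": {"Male 1": 1, "Male 2": 0, "Male 3": 1,
           "Female 1": 1, "Female 2": 1, "Female 3": 0},
}


def disparate_treatment_pairs(candidates, preds):
    """Return pairs that differ in gender, match on every non-sensitive
    attribute, and still receive different predictions."""
    names = list(candidates)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            gender_a, *features_a = candidates[a]
            gender_b, *features_b = candidates[b]
            if (gender_a != gender_b
                    and features_a == features_b
                    and preds[a] != preds[b]):
                pairs.append((a, b))
    return pairs


# Classifier name -> list of candidate pairs treated differently
dt = {name: disparate_treatment_pairs(candidates, p)
      for name, p in predictions.items()}

for name, pairs in dt.items():
    print(name, "disparate treatment:", bool(pairs), pairs)
```

A classifier with an empty list shows no disparate treatment on this data. Exact matching of non-sensitive features is a simplification; with continuous features one would need some notion of "similar" rather than "identical".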

Disparate impact (DI) arises when the classifier provides outputs that benefit (or hurt) one group of people, defined by a sensitive attribute, more often than another. We deem classifier C1 unfair due to disparate impact because the fractions of males and females that were hired are different (1.0 and 0.66 respectively).
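As a rough illustration, DI can be measured by comparing the fraction of each group that receives the favourable outcome. The predictions below are hypothetical values, chosen only so that C1 reproduces the hire rates quoted above; they are not taken from the article's tables, and `selection_rate` is an illustrative helper name.

```python
# Hypothetical data (an assumption): one entry per candidate, in the order
# Male 1, Male 2, Male 3, Female 1, Female 2, Female 3.
gender = ["M", "M", "M", "F", "F", "F"]

# Hypothetical classifier outputs: 1 = hire, 0 = reject
predictions = {
    "C1": [1, 1, 1, 1, 1, 0],
    "C2": [1, 1, 0, 0, 1, 1],
    "C3": [1, 0, 1, 1, 1, 0],
}


def selection_rate(preds, gender, group):
    """Fraction of candidates in `group` that the classifier hires."""
    chosen = [p for p, g in zip(preds, gender) if g == group]
    return sum(chosen) / len(chosen)


# Classifier name -> (male hire rate, female hire rate)
rates = {name: (selection_rate(p, gender, "M"), selection_rate(p, gender, "F"))
         for name, p in predictions.items()}

# Flag disparate impact when the two rates differ (small float tolerance)
has_di = {name: abs(m - f) > 1e-9 for name, (m, f) in rates.items()}

for name, (m, f) in rates.items():
    print(f"{name}: male={m:.2f} female={f:.2f} disparate impact={has_di[name]}")
```

Note that the ground-truth hiring decisions never appear in this calculation. In practice a tolerance or a ratio threshold is often used instead of strict equality; the strict comparison here simply mirrors the definition in the text.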

Disparate mistreatment (DM) arises when classifier outputs have different misclassification rates for groups of people having different values of the sensitive attribute. In our example, C1 and C2 are unfair because the rates of erroneous decisions for males and females are different: C1 has different false negative rates for males and females (0.0 and 0.5 respectively), whereas C2 has different false positive rates (0.0 and 1.0) as well as different false negative rates (0.0 and 0.5) for males and females.
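A minimal sketch of the DM calculation follows: it computes the false positive rate (rejected by the interviewer but hired by the classifier) and the false negative rate (hired by the interviewer but rejected by the classifier) separately for each gender. The actual labels and predictions are hypothetical, chosen only to reproduce the rates quoted above, and `error_rates` is an illustrative helper name.

```python
# Hypothetical data (an assumption): one entry per candidate, in the order
# Male 1, Male 2, Male 3, Female 1, Female 2, Female 3.
gender = ["M", "M", "M", "F", "F", "F"]
actual = [1, 1, 0, 1, 0, 1]  # interviewer's decision: 1 = hired

# Hypothetical classifier outputs: 1 = hire, 0 = reject
predictions = {
    "C1": [1, 1, 1, 1, 1, 0],
    "C2": [1, 1, 0, 0, 1, 1],
    "C3": [1, 0, 1, 1, 1, 0],
}


def error_rates(preds, actual, gender, group):
    """Return (false positive rate, false negative rate) for one group.

    Assumes the group contains at least one actual positive and one
    actual negative; otherwise a rate would be undefined.
    """
    fp = fn = negatives = positives = 0
    for p, y, g in zip(preds, actual, gender):
        if g != group:
            continue
        if y == 1:
            positives += 1
            fn += (p == 0)  # qualified candidate rejected
        else:
            negatives += 1
            fp += (p == 1)  # unqualified candidate hired
    return fp / negatives, fn / positives


# Classifier name -> {group: (FPR, FNR)}
dm_rates = {name: {grp: error_rates(p, actual, gender, grp) for grp in ("M", "F")}
            for name, p in predictions.items()}

# Disparate mistreatment: any error rate differs between the two groups
has_dm = {name: r["M"] != r["F"] for name, r in dm_rates.items()}

for name, r in dm_rates.items():
    print(name, "male (FPR, FNR):", r["M"],
          "female (FPR, FNR):", r["F"], "DM:", has_dm[name])
```

Unlike the DT and DI checks, this one depends on the actual labels, so its result is only as trustworthy as those labels.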

As can be observed, neither DT nor DI depends on the Actual Labels; hence, they are appropriate where historical decisions are not reliable or trustworthy (for example, in recruitment decisions). DM may be the preferred measure when the ground-truth rationale for the Actual Label decisions is explicable. An example would be pre-trial re-offence risk assessments for criminals (such as the COMPAS classification used by the State of Florida), where the reliability of past sentencing terms can be checked against re-offence data from the same criminals.

Despite growing concerns about ML bias, efforts to curb it are still insignificant. According to Nathan Srebro, computer scientist at the University of Chicago: “I’m not aware of any system either identifying or resolving discrimination that’s actively deployed in any application. Right now, it’s mostly trying to figure things out.”