Your Bias Checklist!

Dipika Jiandani
7 min read · Jan 24, 2022

In recent times, data leaders have placed immense importance on acknowledging the existence of bias in any data solution, detecting it, understanding the negative social, economic, and financial impact of overlooking such errors, and proactively taking measures to rectify them. Bias is a prejudice toward a specific outcome in a decision-making process. Bias not only pushes results away from the expected output but can also lead to racial, gender, and political discrimination. At this point, a good number of bias types have been identified and studied, and the list keeps growing.

Reading real-life case studies about the types of biases and their impacts on each one of us has made me careful not to inject biases into the ML projects and solutions that I’m working on. The book “97 Things About Ethics Everyone in Data Science Should Know” (edited by Bill Franks) is a collection of contributions from well-established data leaders, tech leads, managers, and researchers who share their experiences and real-world best practices for dealing with bias. This book is a must-read for all data enthusiasts and greatly inspired me to read research papers about the types of biases. As a result of all that research, I’ve put together this quick overview of biases, along with real-world examples, that you can refer back to while building your data science solutions.

In a data science lifecycle, bias can be introduced at multiple phases. Some of those phases are -

  • Research phase
  • Data cleaning and mining phase
  • Data visualization phase
  • Schema modeling phase
  • Model building phase
  • Tuning phase

Some of the well-known biases are as follows -

1. Group Attribution Bias:

When we generalize or judge an individual of a group based on the attributes and features of that entire group.

Seen in recruiting, where students from Ivy League schools are given preference over other graduates. The underlying assumption is that all students who graduate from an Ivy League school would be a great fit for the job.

2. Implicit Bias/Cognitive Bias/Societal Bias:

When actions or decisions are based on the unconscious and unsupported assumptions that people make about social attributes like age, gender, race/ethnicity, etc. These biases are among the most difficult to deal with, because they can cause significant harm when they go unrecognized during the decision-making process.

There are many subdivisions of implicit bias -

  • race bias
  • age bias
  • gender bias
  • ability bias
  • affinity bias
  • name bias
Cognitive Bias Examples (https://commons.wikimedia.org/wiki/File:Common_Cognitive_Biases.png)

Seen in one of the most common examples: assuming that a profession includes only a specific gender, such as associating nursing with women. A simple Google search for the word ‘nurse’ displays a dominating share of images of women. Additionally, Amazon’s recruiting tool learned that women were less qualified for a given job than men with similar qualifications. This happened because the ML model was trained on a dataset benchmarked against Amazon’s predominantly male workforce over the previous 10 years, rather than on candidates’ skill sets, and it led to gender bias in the recruiting process.

Google search result for the word ‘nurse’: https://bit.ly/3fKcp8z

3. Reporting Bias:

When the frequency with which events are published or reported does not represent their real-world distribution. It can also mean that the reported outcome is deliberately skewed in a favorable direction, ignoring what the actual data shows.

Seen in the medical publishing industry. One well-known reporting bias case occurred with Pfizer in 2004: the company heavily promoted off-label use of gabapentin for 11 different indications, including migraines and alcohol-withdrawal seizures, whereas the drug was actually approved only for partial seizures and postherpetic neuralgia.

News article: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC416587/

4. Selection Bias/Sample Bias/Population Bias:

When certain individuals, groups, or types of data are underrepresented or overrepresented in the subgroups selected for research.

Seen in a 2017 study of football players’ brains, in which chronic traumatic encephalopathy (CTE) was detected in 99% of the deceased players’ brains examined. Over a dozen news articles highlighted this result, but digging deeper into the study’s methodology shows that the sample contained 202 donated brains, more than half of them from former NFL players. The articles failed to mention that over a million people play football and that the sample was not representative of that general population. The research itself was not wrong or irrelevant, but reporting and interpreting it for the general public without stating the assumptions behind the sample selection was a mistake.

News article:https://bit.ly/3nN1TlA
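To make the effect concrete, here is a minimal, purely illustrative simulation. All of the numbers below (population size, true prevalence, donation probabilities) are assumptions invented for the sketch, not figures from the actual study; the point is only to show how a self-selected sample can wildly inflate an estimated rate.

```python
import numpy as np

# Purely illustrative numbers -- not the actual study data.
rng = np.random.default_rng(0)

n_players = 1_000_000        # hypothetical population of football players
true_cte_rate = 0.05         # assumed true prevalence in that population

has_cte = rng.random(n_players) < true_cte_rate

# Brains are far more likely to be donated for examination when the player
# showed symptoms (families suspect CTE), which skews the sample.
donation_prob = np.where(has_cte, 0.02, 0.0002)
donated = rng.random(n_players) < donation_prob

print(f"True prevalence:                 {has_cte.mean():.1%}")
print(f"Prevalence among donated brains: {has_cte[donated].mean():.1%}")
# The second number is dramatically inflated because the sample selection
# depends on the very condition being measured.
```

Under these assumed numbers, the prevalence measured in the donated brains ends up far above the true rate, which is exactly the trap the news coverage fell into.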

5. Confirmation Bias/Observer Bias/Exclusion Bias:

(https://www.picpedia.org/financial-09/c/confirmation-bias.html)

When the observer tries to influence the outcome of an experiment or model in accordance with their own beliefs and expectations. This includes ignoring certain outcomes (exclusion bias) and reinforcing others based on one’s beliefs.

Seen when Elizabeth Holmes’ Theranos claimed to have developed a device that sped up blood tests while using only a few drops of blood, as opposed to the standard 2–3 ml. She aggressively steered the pilot test results toward her own ideas, beliefs, and research, to the extent of faking positive outcomes in order to sell the devices. She was recently found guilty of multiple charges related to Theranos.

News article: https://nyti.ms/3Ao0uH5

6. Measurement Bias:

When the data variables collected for a study have incorrect values due to an inaccurate measurement technique.

Seen in predictive policing, an algorithmic approach that uses past anonymized arrest data and mined social-media data to predict which locations are most susceptible to crime. The results are then used to deploy police forces in the most susceptible zones and to assign each offender a risk score; an individual living in a crime-susceptible zone is given a higher risk of reoffending. Because arrests are only a proxy for actual crime, and policing intensity varies across neighborhoods, this way of measuring risk is a clear measurement bias that has led to racial discrimination.

News article: https://bit.ly/3rDqCcT
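As a small, hedged illustration of that proxy problem, the sketch below uses entirely made-up numbers: two zones with the same assumed true offense rate but very different patrol intensity, where the arrest counts end up measuring policing rather than crime.

```python
import numpy as np

# Purely illustrative numbers -- not real crime or policing data.
rng = np.random.default_rng(1)

n_people_per_zone = 10_000
true_offense_rate = 0.03            # assumed identical in both zones
patrol_intensity = {"zone_A": 0.6,  # heavily policed zone
                    "zone_B": 0.1}  # lightly policed zone

for zone, detection_prob in patrol_intensity.items():
    offenses = rng.random(n_people_per_zone) < true_offense_rate
    # Arrests only happen when an offense is both committed AND detected,
    # so arrest counts reflect policing intensity as much as crime.
    arrests = offenses & (rng.random(n_people_per_zone) < detection_prob)
    print(f"{zone}: true offense rate {offenses.mean():.1%}, "
          f"measured arrest rate {arrests.mean():.1%}")

# A model trained on arrest rates would score zone_A as far riskier,
# even though the underlying offense rate is the same in both zones.
```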

7. Aggregation Bias:

When we assume that the trends learned by a model trained on an entire dataset apply equally to every subgroup within it. This can result in misleading correlations between variables.

Seen in a regression model built to predict cryptocurrency prices. A model trained on all cryptocurrencies together will not reflect the trend of each coin separately; predictions for Ethereum, for example, cannot simply be read off the aggregated model and could be misleading.
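A minimal sketch of this effect, using synthetic price series for two hypothetical coins (the numbers are invented for illustration, not market data): the slope of a model fit on the pooled data matches neither coin’s individual trend.

```python
import numpy as np

# Synthetic, hypothetical price series for two "coins" -- illustration only.
rng = np.random.default_rng(2)
days = np.arange(100)

coin_a = 2000 + 5.0 * days + rng.normal(0, 20, 100)   # trending up
coin_b = 300 - 1.5 * days + rng.normal(0, 10, 100)    # trending down

# Per-coin trends: slope of a simple linear fit, in price units per day.
slope_a = np.polyfit(days, coin_a, 1)[0]
slope_b = np.polyfit(days, coin_b, 1)[0]

# One aggregated model fit on the pooled data.
pooled_days = np.concatenate([days, days])
pooled_price = np.concatenate([coin_a, coin_b])
slope_pooled = np.polyfit(pooled_days, pooled_price, 1)[0]

print(f"coin A slope: {slope_a:+.2f}")
print(f"coin B slope: {slope_b:+.2f}")
print(f"pooled slope: {slope_pooled:+.2f}")
# The pooled slope describes neither coin: using it to forecast
# either coin individually would be misleading.
```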

8. Historical Bias:

When the training data fed into the model contains data or trends representative of the past rather than of present reality. If this bias is not identified, it can recreate and sustain historical societal biases.

Seen in Amazon’s recruiting tool, which was gender-biased because its training data was benchmarked on historical trends that no longer represented current reality. Over the last decade, the ratio of women to men in the workplace has increased considerably, yet the tool rejected more applications from women than from men, leading to biased results.

News article: https://bit.ly/3KwEU7S
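One simple guardrail is to compare the distribution of the historical training data against a recent reference slice before training. The sketch below assumes a hypothetical applicants table with year and gender columns; the column names and values are invented for illustration, not Amazon’s data.

```python
import pandas as pd

# Hypothetical applicant records -- column names and values are assumptions.
applicants = pd.DataFrame({
    "year":   [2009, 2010, 2011, 2012, 2020, 2021, 2021, 2022],
    "gender": ["M", "M", "M", "F", "F", "M", "F", "F"],
})

historical = applicants[applicants["year"] <= 2015]
recent = applicants[applicants["year"] > 2015]

# Compare the distribution the model would be trained on
# against the distribution it will actually be applied to.
print("Training (historical) gender mix:")
print(historical["gender"].value_counts(normalize=True))
print("\nCurrent (recent) gender mix:")
print(recent["gender"].value_counts(normalize=True))
# A large gap between the two distributions is a warning sign that a model
# trained on the historical slice will encode an outdated reality.
```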

9. Labeling Bias:

When the labels created during the ML process reflect the labeler’s beliefs and perspectives. This can also include unintentionally assigning a wrong label.

Seen in the ImageNet dataset, widely used for image recognition research, which contains 14 million manually labeled images. A project built on this dataset, ‘ImageNet Roulette’, showed that many images carried biased and incorrect labels; for instance, a child wearing sunglasses was labeled a ‘loser’ and a ‘failure’.

News article: https://bit.ly/3FXybR6

10. Outlier Bias:

When one or a few data points that do not represent the actual trend in the dataset disproportionately influence the output of the ML model.

Seen when calculating average income. If a dataset of incomes from the general population also includes the richest person’s income, that single value will heavily skew the result, and the mean will no longer be a true representation of a typical income.
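A quick worked example with hypothetical income figures shows how a single extreme value drags the mean while the median stays close to a typical value.

```python
import numpy as np

# Hypothetical annual incomes (in USD) for a small sample of people.
incomes = np.array([42_000, 48_000, 51_000, 55_000, 60_000, 62_000, 65_000])

# Add a single extreme outlier, e.g. one billionaire's income.
incomes_with_outlier = np.append(incomes, 1_000_000_000)

print(f"Mean without outlier: {np.mean(incomes):,.0f}")
print(f"Mean with outlier:    {np.mean(incomes_with_outlier):,.0f}")
print(f"Median with outlier:  {np.median(incomes_with_outlier):,.0f}")
# One extreme point drags the mean far above every ordinary income,
# while the median remains a sensible summary of a typical income.
```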

Accepting that bias can exist in one’s own solution, and being educated about the different bias types, is the first step toward detecting and eliminating bias in the data science life cycle.

I’d recommend the book “97 Things About Ethics Everyone in Data Science Should Know” to better understand bias definition, detection, impact, and prevention from the perspectives of top data leaders.

Here’s the Amazon link to the book:

If you found this article useful or have any feedback, let me know what you think! I’d love to connect on Medium, LinkedIn, or Instagram!
Thank you!

Instagram: https://www.instagram.com/dipikajiandani/
