Achieving Fairness in Machine Learning-Powered Credit Scoring Models

Tamon Tananilgul
14 min read · Oct 26, 2024


About this series

This is the first in a series of posts based on the summative work I completed during my time at the University of Oxford. Specifically, this piece comes from my “Fairness, Accountability, and Transparency in Machine Learning” course, an elective I took in my second term (Hilary Term) in 2022.

I learned a great deal from the process, so I'm excited to share the insights I gained in a more condensed format here on Medium.

Summary

As our society becomes increasingly networked, digitized, and “datafied” (Mayer-Schönberger & Cukier, 2013), the volume of available data, particularly personal data, has surged. At the same time, algorithmic decision-making is on the rise, with machine learning algorithms increasingly used to augment or replace human judgment (Aggarwal, 2021).

Credit providers have embraced this trend, using advanced ML techniques and new types of data to assess creditworthiness, a process known as algorithmic credit scoring (O’Dwyer, 2018). Deep learning models, in particular, excel at capturing nonlinear relationships within data (Sirignano, Sadhwani, & Giesecke, 2018), enabling the identification of patterns that traditional credit scoring models might overlook. This allows for more accurate predictions of a borrower’s creditworthiness by accounting for complex interactions between individuals and their environments.

Research has shown that alternative credit scoring can promote financial inclusion by expanding the range of customers who can access credit (Widiyasari & Widjaja, 2021). Those who historically lacked access to formal financial institutions because of limited credit histories can now obtain credit (Jaiswal & Akhilesh, 2020).

This post, however, argues that these advances are not sufficient on their own: consumers still require regulatory protection to ensure that the companies employing these algorithms extend credit fairly and inclusively.

Structure

The structure of this post is as follows. We will first explore different methods to assess creditworthiness — both traditional and alternative approaches. Then, we will dive deeper into how machine learning, particularly supervised machine learning, is used to assess individual credit scores. Once we understand how machine learning is applied to score individuals, we will identify the problems and pitfalls at each step of the process that can lead to potential unfairness in the resulting credit scores. The following section will introduce existing regulations that ensure credit is extended responsibly. Finally, the post will recommend further policies needed to ensure credit scoring by ML models is fair and inclusive.

Methods to assess creditworthiness

There are two broad approaches to scoring an individual's creditworthiness: the traditional method and the alternative method. Traditionally, credit scoring has relied on linear statistical models and a limited set of fixed data points (Thomas, 2009). These include basic statistical methods such as linear discriminant analysis and logistic regression. In these conventional approaches, the most critical factor is typically an individual's repayment history (Equifax, 2020).

However, this traditional method excludes a significant portion of the population with no established credit history. To address this gap, alternative credit scoring methods were developed, leveraging the growing availability of diverse data sources. This approach became feasible with the explosion of personal data, the widespread adoption of the Internet, advances in machine learning from the mid-2000s, and the contraction in bank lending after the 2008 global financial crisis (BoE-FCA, 2019). Fintech firms such as Zopa and Wonga capitalised on these shifts, becoming pioneers of algorithmic credit scoring and extending credit to the underbanked and unbanked populations underserved by traditional banks (Aggarwal, 2021).

Though alternative data is less structured than conventional credit data, it is often more feature-rich and high-dimensional. These data can be collected directly by fintech firms or acquired from third-party data brokers like Acxiom and Experian. Machine learning models then analyse these large, complex datasets to identify patterns and features relevant to predicting an individual’s creditworthiness (Hurley & Adebayo, n.d.). The key differences in the types of data used in traditional and alternative credit scoring methods are summarised in the figure below.

Figure 1: Data used to evaluate creditworthiness from The World Bank (2020)

Use of machine learning in alternative credit scoring

Credit scoring methods have become increasingly sophisticated over the years, evolving from traditional statistical techniques to cutting-edge methods like machine learning algorithms. Machine learning is a class of algorithms that learn from training data to optimise a given objective without human intervention (SAS, 2019). They are particularly effective in identifying complex patterns hidden within large datasets, making them ideal for credit scoring.

There are three main categories of machine learning algorithms used in the credit scoring process: supervised, unsupervised, and semi-supervised learning. Supervised learning algorithms, such as random forests, gradient boosting, and deep neural networks, are commonly used to develop credit scores (Knutson, 2020). Meanwhile, unsupervised and semi-supervised methods are often applied in other credit-related processes, such as credit collection.

The following section will focus on the end-to-end process of implementing alternative credit scoring with supervised machine learning. The process can be broadly divided into four major steps: problem specification, data collection and transformation, model training and validation, and model production. While individual companies may include additional steps like data verification or testing, this overview aims to outline how these models are typically trained, so that we can identify the potential pitfalls at each stage.

Step 1: Problem Specification
The first step is to clearly define the problem that the algorithm is designed to solve — in this case, assessing an individual’s creditworthiness. The data scientist must then specify a target variable that represents the desired outcome. A common target variable in credit scoring is past borrowing behaviour. For instance, by analysing historical data on borrowers, the model can be trained to recognise patterns that distinguish those who repay their loans from those who do not. This helps guide the model in predicting creditworthiness based on the input data.
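
To make this concrete, below is a minimal sketch of how such a target variable might be derived from repayment records. The column name `days_past_due` and the 90-day cutoff are illustrative assumptions, not any particular lender's definition.

```python
import pandas as pd

# Hypothetical loan-level repayment records; real schemas will differ.
loans = pd.DataFrame({
    "borrower_id": [101, 102, 103, 104],
    "days_past_due": [0, 120, 15, 95],
})

# One common convention: label a borrower "bad" (1) if they ever went
# 90+ days past due, "good" (0) otherwise. This labelling choice is itself
# a modelling decision with fairness consequences, as discussed later.
loans["default"] = (loans["days_past_due"] >= 90).astype(int)
print(loans)
```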

Step 2: Data Collection and Transformation
Once the problem is defined, the next step is gathering and preparing the data. The raw data must be accessed, cleaned, and transformed so it can be used by the model. This process, known as feature engineering, involves identifying and creating relevant features from the data to improve the model’s performance. Proper feature selection is critical to capturing the most relevant insights from the data for an accurate model.
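
As a rough illustration, the sketch below derives two typical engineered features, a debt-to-income ratio and an account age, from hypothetical raw columns; the column names and the median-imputation choice are assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw applicant data, including a missing income value.
raw = pd.DataFrame({
    "monthly_income": [3200.0, None, 5100.0],
    "monthly_debt_payments": [800.0, 650.0, 2300.0],
    "account_open_date": pd.to_datetime(["2015-03-01", "2021-07-15", "2018-11-30"]),
})

features = pd.DataFrame(index=raw.index)

# Impute missing income with the median rather than dropping the applicant.
income = raw["monthly_income"].fillna(raw["monthly_income"].median())

# Debt-to-income ratio: a classic engineered feature in credit models.
features["dti"] = raw["monthly_debt_payments"] / income

# Account age in years, derived from a raw timestamp.
snapshot = pd.Timestamp("2024-10-26")
features["account_age_years"] = (snapshot - raw["account_open_date"]).dt.days / 365.25

print(features)
```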

Step 3: Model Training and Validation
In this step, the prepared data is fed into the machine learning model for training. After the model is trained, test data is used to assess its accuracy. The goal of training is for the model to learn general patterns from the data, enabling it to predict the behaviour of new, unseen data. The model's performance on test data is evaluated to ensure it is not overfitting — i.e., learning patterns too specific to the training data that do not generalise well to new data. If overfitting is detected, the data scientist will iterate between refining the model and reworking the data to improve its performance. Alongside this technical validation, the business team verifies the model's key features to ensure they align with business logic. For example, it would not be reasonable for a model to deny credit to higher-income individuals under otherwise equal circumstances. Accuracy and interpretability are the most important considerations at this stage.
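
The sketch below shows a minimal version of this train-and-validate loop using scikit-learn's gradient boosting on synthetic data; the dataset, model choice, and metric are stand-ins for whatever a real lender would use.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an engineered credit dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Compare train vs. test AUC: a large gap suggests overfitting and would
# send the data scientist back to the refine-and-rework loop above.
auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"train AUC = {auc_train:.3f}, test AUC = {auc_test:.3f}")
```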

Step 4: Model Production and Monitoring
When both the data scientist and business team are satisfied with the model’s accuracy and interpretability, the model is ready for production. At this stage, a feedback loop or data pipeline is established to keep the model up to date with new data. Regular updates help maintain the model’s relevance and accuracy over time. Figure 2 summarises the steps required to build an alternative credit scoring model.

Figure 2: Steps involved in building an alternative credit scoring model
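
As a concrete instance of the monitoring in Step 4, one widely used check is the Population Stability Index (PSI), which compares the score distribution seen at development time with the distribution in production; a PSI above roughly 0.25 is commonly read as a signal to retrain. The sketch below is a generic illustration on synthetic scores, not any specific vendor's pipeline.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between two score samples."""
    # Bin edges come from the development-time (expected) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Guard against empty bins before taking logs.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
dev_scores = rng.beta(2, 5, 10_000)     # scores at development time
prod_scores = rng.beta(2.5, 5, 10_000)  # scores after the population drifts
print(f"PSI = {psi(dev_scores, prod_scores):.3f}")
```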

Problem — pitfalls in using machine learning for alternative credit scoring

This section will explore the potential risks associated with each step of creating an alternative credit scoring model.

Step 1: Problem Specification
One of the primary risks in the problem specification stage is the potential for implicit bias in the model’s target variable. Since credit scoring algorithms produce a single score, this score may mask discriminatory lending practices. For example, if the target variable is based on borrowers with existing FICO scores, the model may systematically exclude populations historically underrepresented in the credit market. By using access to prior credit as a baseline, the model fails to capture whether individuals without a credit history are responsible borrowers, leading to their exclusion from lending decisions.
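
A simple coverage check makes this exclusion visible: if the label definition requires an existing score, applicants without one never enter the training data at all. The column names below are hypothetical.

```python
import pandas as pd

# Hypothetical applicant pool; None marks applicants with no credit file.
applicants = pd.DataFrame({
    "applicant_id": range(6),
    "fico_score": [710, None, 655, None, None, 780],
})

# Share of applicants who would be silently dropped by a FICO-based label.
excluded_share = applicants["fico_score"].isna().mean()
print(f"{excluded_share:.0%} of applicants have no score and never enter training")
```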

Step 2: Data Collection and Transformation
Data collection poses several risks, including inaccurate data, subjective labelling, and creditworthiness by association. Alternative data is sourced from various external providers, so its quality and accuracy need to be carefully verified. When multiple datasets are combined, errors in data representation can lead to inaccurate predictions. Furthermore, some of this data is collected without the consumer's knowledge, raising concerns about privacy and transparency.

Subjective labelling is another risk. While objective labelling might determine whether a person has a credit history (a clear yes or no), subjective labelling could involve opinions, such as an employer's assessment of an individual's suitability for a job. Human judgment in these subjective labels can introduce bias into the model (Hurley & Adebayo, n.d.).

A third concern is creditworthiness by association. Automated machine learning systems might systematically exclude certain groups based on associations like family background, region, or preferences. For instance, according to Experian, Baby Boomers tend to have higher credit scores (731 on average) than Millennials (668 on average), not necessarily because they are better borrowers, but because they've had more time to build credit histories (Stolba, 2019). This association unfairly skews creditworthiness by age, which is unrelated to an individual's actual repayment ability.
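
One rough way to surface such associations is a correlation audit between candidate features and attributes like age. The synthetic example below shows how a credit-history-length feature mechanically tracks age while a behavioural feature need not; a real audit would run over the lender's actual feature set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.uniform(21, 75, 2000)

df = pd.DataFrame({
    "age": age,
    # History length grows almost mechanically with age...
    "credit_history_years": np.clip(age - 18 - rng.exponential(5, 2000), 0, None),
    # ...while a behavioural feature such as card utilisation need not.
    "utilisation": rng.uniform(0, 1, 2000),
})

# Features strongly correlated with age act as age proxies in the model.
print(df.corr(numeric_only=True)["age"].drop("age"))
```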

Step 3: Model Training and Validation
When refining machine learning models, fairness issues need to be carefully managed as well. In “black-box” models like deep neural networks, it can be difficult to discern whether the algorithm is relying on proxies for sensitive attributes such as race or gender. While these attributes may not be explicitly part of the data, the model might infer them from ostensibly neutral data points, leading to unintended discrimination. Without careful examination, the model might inadvertently target vulnerable populations and deny them access to credit.
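
One common diagnostic, sketched below on synthetic data, is a proxy probe: train an auxiliary classifier to predict the sensitive attribute from the model's input features. If the probe recovers it well above chance, the feature set encodes that attribute even though it was never supplied explicitly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 4000
group = rng.integers(0, 2, n)  # sensitive attribute, held out of the model

# "Neutral" features that nonetheless leak group membership,
# e.g. postcode-derived variables.
X = rng.normal(0, 1, (n, 5))
X[:, 0] += 1.5 * group

probe = LogisticRegression(max_iter=1000)
acc = cross_val_score(probe, X, group, cv=5, scoring="accuracy").mean()
print(f"probe accuracy = {acc:.2f} (0.50 would indicate no leakage)")
```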

Step 4: Model Production and Monitoring
The feedback loops built into production credit-scoring systems can further compound these problems. Once a group is assigned a lower score, its members are likely to be denied credit, preventing them from building the credit history needed to improve their scores. This creates a feedback loop that perpetuates historical inequalities, limiting access to wealth-building opportunities for marginalised groups.
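
A toy simulation illustrates the dynamic: applicants denied credit accumulate no repayment history, so their scores stagnate while approved borrowers pull further ahead. The numbers here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(600, 60, 1000)  # initial credit scores
threshold = 620                     # approval cutoff

for year in range(5):
    approved = scores >= threshold
    # Approved borrowers build repayment history and tend to gain points;
    # denied applicants have no new record with which to improve.
    scores = scores + np.where(approved, rng.normal(10, 5, 1000), 0.0)
    gap = scores[approved].mean() - scores[~approved].mean()
    print(f"year {year}: approval rate {approved.mean():.0%}, score gap {gap:.0f} pts")
```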

Conclusion

Even though alternative credit scoring can broaden financial inclusion by reaching the underbanked, relying on alternative data without addressing underlying biases may perpetuate societal inequalities. A study by Bono, Croxson, and Giles (2021), which analysed data for 800,000 UK borrowers, found that machine learning does not eliminate fairness issues. Many individuals are still denied fair access to credit, and borrowers have little control over how they are scored or the ability to contest biased or inaccurate assessments.

Therefore, it is crucial for companies to carefully consider the assumptions behind the data, labelling, and models when developing alternative credit scoring systems (Binns, 2020). Additionally, the governing authority should also establish regulations to guarantee that credit scoring algorithms operate fairly and inclusively. Cowgill and Tucker (2019) suggested that effective regulations should not only focus on the technology and engineering practices but also on the outcomes, ensuring that lending decisions are fair and equitable.

Existing solutions and further policy recommendations

Existing Solutions

This part of the post analyses whether existing institutional arrangements under consumer credit and data protection regulations are adequate to address the potential harms algorithmic credit scoring can bring; policy recommendations to enhance algorithmic fairness then follow. The rules and legislation currently available to ensure that credit is extended responsibly consist of the FCRA (Fair Credit Reporting Act) (Federal Trade Commission, 2022) and the ECOA (Equal Credit Opportunity Act) (U.S. Government Publishing Office, 2011).

The FCRA regulates what individual information credit reporting agencies can collect, access, use, and share (Federal Trade Commission, 2022). It also helps consumers understand what they can do with their credit information. However, the FCRA only establishes basic accuracy requirements for the data used in different credit assessment tools. Furthermore, under the FCRA, the burden falls on consumers to identify and dispute any inaccuracies they find in their credit records.

The ECOA, on the other hand, prohibits creditors from discriminating against credit applicants on the basis of sensitive characteristics such as race, sex, religion, national origin, or marital status (U.S. Government Publishing Office, 2011). Because the ECOA applies a single framework to both traditional and machine learning-based credit scoring methods, borrowers may find it harder to make a discrimination case against the machine learning method. When a lender argues that a decision was based on a sophisticated algorithm using thousands of data points, disparate treatment of particular groups is far harder to demonstrate. Hence, neither the FCRA nor the ECOA places limits on the potential pitfalls of modern algorithmic scoring.

This shows that existing laws are insufficient to ensure that credit-scoring systems are fair, transparent, and accurate. Further policy improvement is required.

Recommended Solutions

This section offers policy suggestions in six areas: requirements for increased transparency, shifting the burden of accuracy away from consumers, prevention of discrimination by proxy, accessible review mechanisms, protection for vulnerable groups, and privacy and autonomy protection.

  1. Increased Transparency
    To address implicit bias in credit-scoring models, regulators must understand the data being used, how target variables are defined, and which features drive credit decisions. However, there is limited information about how credit-scoring companies, such as ZestFinance, implement these models (Merrill et al., 2015). Developers of credit-scoring tools should be required to regularly disclose their data sources, target variables, and algorithms. Additionally, the data collection process — especially when data is bought from third parties — must be transparent, with a clear audit trail detailing how data is used. Transparency ensures that both consumers and regulators can verify that credit scores are appropriate and are based on accurate data.
  2. Shifting the Burden of Accuracy
    The current system places an unfair burden on consumers to ensure their credit data is accurate. Given the complexity of these models, consumers often struggle to challenge unfair credit decisions or improve their scores. Instead, credit scorers should be held responsible for verifying the accuracy of the data they use, including externally sourced data. Credit scorers must ensure that all data points are traceable to the individual consumer and that lending decisions can be clearly explained. Penalties should be introduced for inaccurate or unverifiable data, empowering both consumers and regulators to hold companies accountable.
  3. Preventing Discrimination by Proxy
    Algorithmic models often use thousands of data points, making it difficult to detect when sensitive characteristics, such as race or gender, are being used as proxies for creditworthiness. The burden of proving discrimination currently rests on the consumer, but this responsibility should be shifted to the developers. Models that rely on data highly correlated with sensitive characteristics should be prohibited. Furthermore, these models must be trained on representative data to avoid favouring specific groups. Regulators could test the model on unseen cases to ensure fairness and compliance with anti-discrimination laws; a sketch of one such outcome-level check appears after this list.
  4. Creating Accessible Review Mechanisms
    Consumers need an easy and straightforward way to request a review of their credit decisions, particularly when automated processes are involved. Even with transparency in data use and decision-making logic, consumers may struggle to challenge unfair decisions if there is no formal avenue for review. Consumers with the lowest credit scores, who are often the most vulnerable, may lack the time or resources to navigate complex bureaucratic processes. Simplifying the review process would allow individuals to contest unfair outcomes more effectively.
  5. Protecting Vulnerable Consumers
    There is a risk that companies could exploit vulnerable individuals through predatory scoring techniques that lead them into debt traps, where repayment becomes impossible due to high interest rates. The FCRA and ECOA do little to prevent the use of these predatory scoring techniques. People who previously lacked access to credit may be less familiar with how debt traps work. Therefore, before lending to these vulnerable consumers, companies should also consider the impact of the loan on that specific consumer's future financial stability. Assessments of creditworthiness should reflect not only the consumer's ability to repay a loan but also whether the loan can be repaid without harming the consumer's financial stability.
  6. Enhancing Privacy and Autonomy
    Many consumers are unaware that non-traditional data, such as online behaviour, is being collected and used to assess their creditworthiness. Regular privacy impact assessments should be mandatory to ensure that consumer data is not being used without consent. Consumers should have the right to correct inaccuracies, object to specific uses of their data, and request the deletion of erroneous information. This would give consumers greater autonomy over their data and ensure that privacy concerns are addressed in the credit-scoring process.
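
To make the outcome-level testing in recommendation 3 concrete, the sketch below computes a disparate impact ratio on held-out decisions, using the "four-fifths rule" (a ratio below 0.80) as a common red flag. The groups and approval rates are synthetic assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
group = rng.integers(0, 2, 5000)  # 0 = reference group, 1 = protected group

# Simulated lending decisions with different approval rates per group.
approved = rng.random(5000) < np.where(group == 0, 0.55, 0.38)

rate_ref = approved[group == 0].mean()
rate_prot = approved[group == 1].mean()
ratio = rate_prot / rate_ref
print(f"disparate impact ratio = {ratio:.2f} (below 0.80 is a common red flag)")
```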

As data increasingly drives the financial system, regulators such as the Financial Conduct Authority (FCA) in the UK and the Securities and Exchange Commission (SEC) in the US must prioritise both data protection and algorithmic fairness to safeguard consumer interests (FCA, 2021).

Conclusion

The adoption of alternative data has the potential to significantly expand credit access for individuals who have traditionally been excluded from formal lending. While it may not create complete equality, it can enhance financial inclusion and improve the accuracy of credit assessments for millions. However, the widespread use of alternative credit scoring requires thoughtful regulation to ensure that lending practices remain fair, transparent, and accountable.

As consumer finance increasingly relies on data-driven algorithmic decision-making, there are growing concerns about the risks associated with using alternative data and algorithmic credit scoring. These algorithms may unintentionally reinforce societal inequalities embedded in the data they rely on. Therefore, issues like data privacy, fairness, potential discrimination against minority groups, and model interpretability must be carefully addressed in any new policies governing algorithmic credit scoring. It is crucial that models trained on historical data are handled with caution to avoid perpetuating historical biases.

References

  • Aggarwal, N. (2021). The norms of algorithmic credit scoring. The Cambridge Law Journal, 80(1), 42–73.
  • Binns, R. (2020). On the apparent conflict between individual and group fairness. In Conference on fairness, accountability, and transparency.
  • BoE-FCA. (2019, October). Machine learning in UK financial services. Bank of England.
  • Bono, T., Croxson, K., & Giles, A. (2021). Algorithmic fairness in credit scoring. Oxford Review of Economic Policy, 37(3), 585–617.
  • Cowgill, B., & Tucker, C. (2019). Economics, fairness and algorithmic bias. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3361280
  • Equifax. (2020, November). How are credit scores calculated? Retrieved from https://www.equifax.com/personal/education/credit/score/how-is-credit-score-calculated/.
  • FCA. (2021). Financial services and markets act 2000. https://www.legislation.gov.uk/ukpga/2000/8/section/1C
  • Federal Trade Commission. (2022). Fair credit reporting act. https://www.ecfr.gov/cgi-bin/text-idx.
  • Hurley, M., & Adebayo, J. (n.d.). Credit scoring in the era of big data. Yale J.L. & Tech.
  • Jaiswal, A. K., & Akhilesh, K. B. (2020). Smart technologies. In (chap. Tomorrow’s AI-Enabled Banking). Springer, Singapore.
  • Knutson, M. L. (2020, February). Credit scoring approaches guidelines. The World Bank.
  • Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work and think. London: John Murray.
  • Merrill, D. C., Merrill, J. W., Budde, S. M., Gu, L., & McGuire, J. P. (2015). System and method for building and validating a credit scoring function. Google Patents.
  • O’Dwyer, R. (2018, July). Are you creditworthy? the algorithm will decide. Retrieved 2022, from https://undark.org/2018/05/07/algorithmic-credit-scoring-machine-learning/
  • SAS. (2019). Artificial intelligence: What it is and why it matters. https://www.sas.com/en_us/insights/analytics/what-is-artificial-intelligence.html.
  • Sirignano, J., Sadhwani, A., & Giesecke, K. (2018, March). Deep learning for mortgage risk. arXiv.
  • Stolba, S. L. (2019, November). A look at highest credit limits among generations and states. Experian.
  • Thomas, L. C. (2009). Consumer credit models: Pricing, profit and portfolios. Oxford: Oxford University Press.
  • U.S. Government Publishing Office. (2011). The equal credit opportunity act. https://www.govinfo.gov/content/pkg/USCODE-2011-title15/html/USCODE-2011-title15-chap41-subchapIV.htm.
  • Widiyasari, V., & Widjaja, H. (2021, January). This new approach to credit scoring is accelerating financial inclusion in emerging economies. World Economic Forum.
