Explainability Challenges Are a Growing Concern for Bank Governance of AI
4/24/2024
Over the years, AI has transformed from a futuristic science fiction concept into a mature technology in everyday use with countless applications. Banks have engaged in AI-related practices for decades, but the dawn of generative AI, highlighted by ChatGPT and other large language models (LLMs) since November 2022, creates new challenges for the industry. This article outlines the history of AI in banking, how the increasing sophistication and power of AI are affecting the ability of model risk managers to explain the technology, and what that means for risk management and regulatory compliance.
Banks’ AI applications range from the straightforward use of automated systems and regression models, beginning in the late 1980s, to more sophisticated uses: advanced machine learning (ML) algorithms, service operation optimization such as cloud computing, contact center automation, and, in the last few years, AI-based product enhancement and development.
Banks and regulators have acknowledged the significant benefits AI applications can bring, and they are also aware of the novel risks those applications pose. The rapid embrace of AI prompted the Office of the Comptroller of the Currency (OCC) and other members of the Federal Financial Institutions Examination Council to issue a Request for Information (RFI) on Financial Institutions’ Use of Artificial Intelligence, Including Machine Learning, in 2021. The RFI highlighted risks from broader or more intensive data processing and usage, including:
- Cybersecurity risk
- Dynamic updating
- Community banks’ use of AI
- Oversight of third-party use of AI
- Fair lending
But the concern that topped this list—and which this article focuses on—is explainability. Or, increasingly, the inability to explain how AI works.
AI has not always been so enigmatic. Over time, the relevance of explainability has shifted as the concept of AI, its applications and technologies, adoption practices, and governance have all transformed. The financial industry’s use of AI can be divided into three phases over the last four decades. Looking at these periods can help us understand how we got where we are and put today’s challenges in context.
Phase I – The Proto AI Period (1970s to mid-2010s)
The earliest literature on explainability traces back to the 1970s and ’80s, when automated systems were mainly knowledge-based expert systems or rule-based models. Expert systems were designed to mimic human decision-making, were heuristic in nature, and did not possess broader human capabilities. Because the rules and the knowledge were defined and programmed by humans with expertise in specific fields, these systems and their results were easy for developers and users to understand and interpret. The banking industry has used expert systems for decades to solve problems related to loan approval, cross-selling, risk analysis, treasury operations, and more. Even today, the majority of automated underwriting systems and technology platforms that host scorecard models and strategies fall into this category. Rather than replacing decision makers, expert systems assist them. More importantly, they typically are not considered models and do not require validation from independent parties.
Another common Proto AI period theme was the development and implementation of simple regression models classified as supervised learning. Supervised learning methods rooted in statistical and econometric regression analysis have dominated risk and financial modeling in credit applications, fraud detection, and stock price prediction. Supervised learning is a type of machine learning: by programming computers (the machine) to optimize regression parameters (the learning), the underlying algorithms not only transform inputs into outputs but also construct relationships between predictors and the target variable. Simple regression models are therefore highly explainable.
Explaining how regression models work generally consists of explaining data processing, data representation, the statistical properties of the chosen methodology, and a direct quantification of each input’s impact on the output. Model-specific explanation is typically done through theoretical justification and evaluation of statistical fit. While statistical fit focuses on checking the mathematical properties of the chosen methodology, theoretical justification adds common sense to dry figures and constructs causal relationships between explanatory and target variables. Linear regression, logistic regression, generalized linear models (GLMs), generalized additive models (GAMs), and decision trees such as CART (classification and regression trees) are commonly considered interpretable models. Regression models are always categorized as models and are subject to the SR 11-7 Guidance on Model Risk Management at all banks regardless of size. In practice, model developers, users, owners, and validators share responsibility for explaining how a given model works.
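As a rough illustration of this direct impact quantification, here is a minimal sketch, assuming synthetic application data and hypothetical feature names (debt_to_income, utilization, delinquencies) invented purely for illustration, of how a logistic regression’s coefficients can be read off as input effects:

```python
# Minimal sketch: why simple regression models are considered interpretable.
# Synthetic data and hypothetical feature names; not drawn from the article.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.normal(0.35, 0.10, n),   # debt_to_income
    rng.uniform(0.0, 1.0, n),    # utilization
    rng.poisson(0.5, n),         # delinquencies
])

# Simulate default outcomes from a known logistic relationship.
true_logit = -4.0 + 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.8 * X[:, 2]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

names = ["const", "debt_to_income", "utilization", "delinquencies"]
model = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
print(model.summary(xname=names))

# Each coefficient is a log-odds effect: exp(coef) is the multiplicative change
# in the odds of default for a one-unit increase in that input.
print(dict(zip(names, np.round(np.exp(model.params), 3))))
```

Because the model is linear in the log-odds, a developer or validator can read the direction and magnitude of each input’s effect directly from the fitted coefficients, which is precisely the kind of explanation more complex algorithms do not offer.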
Phase II – The Surge of ML Algorithms (Mid-2010s to November 2022)
The development and advancement of boosting, a supervised learning approach, have brought computer science (CS) and statistics closer while making a significant impact on both fields. Computer scientists Michael Kearns and Leslie Valiant raised the question in the late 1980s of whether a “weak” learning algorithm could be “boosted” into a “strong” learning algorithm. Robert Schapire offered an affirmative answer in 1990, and roughly a decade later, following analyses of AdaBoost’s training and generalization errors, gradient boosting machines (GBM) were introduced by Jerome H. Friedman, a leading researcher in statistics. A more powerful algorithm, XGBoost, was introduced in 2016 by computer science researchers and has since gained popularity for classification and regression tasks in banking.
Two types of ML explainability techniques, global explanation and local explanation, are commonly applied to supervised learning algorithms. Global explanation methods include PDPs (partial dependence plots) and permutation feature importance. Local explanation methods include ICE (individual conditional expectation), LIME (local interpretable model-agnostic explanations), SHAP (Shapley additive explanations), and counterfactual explanations. With deep learning algorithms achieving state-of-the-art results in computer vision, speech recognition, and natural language processing (NLP), feature visualization and pixel attribution have been developed specifically for neural networks to uncover how the hidden layers work. The most common architectures include convolutional neural networks (CNNs), recurrent neural networks (RNNs), and artificial neural networks (ANNs) more broadly.
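To ground one global and one local technique from the lists above, here is a minimal sketch, assuming a gradient-boosted classifier and synthetic data chosen only for illustration, that computes permutation feature importance and ICE curves with scikit-learn:

```python
# Minimal sketch: one global and one local explanation technique applied to a
# gradient-boosted classifier on synthetic data (illustrative assumptions only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence, permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Global explanation: permutation feature importance measures how much held-out
# accuracy drops when each feature is shuffled, breaking its link to the target.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in np.argsort(perm.importances_mean)[::-1]:
    print(f"feature {idx}: mean importance {perm.importances_mean[idx]:.4f}")

# Local explanation: ICE traces how each individual observation's predicted
# probability changes as feature 0 is varied while the others are held fixed.
ice = partial_dependence(model, X_test[:50], features=[0], kind="individual")
print(ice["individual"].shape)  # (n_outputs, n_samples, n_grid_points)
```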
The more complex an ML algorithm is, the more challenging it is to explain, even with the help of established metrics. The difficulty of explaining modern deep learning opened a new line of inquiry for the research community and the AI industry, and the literature on explainable artificial intelligence (XAI) and similar topics has grown rapidly since 2012.
The vast majority of advancements in AI and ML techniques in the last 20 years were led by computer scientists and electrical engineers. Domain knowledge, in addition to mathematics, is required to understand and explain complex ML algorithms. For example, explaining deep neural networks (DNNs) may require an explanation of deep neural architectures. Furthermore, deep learning algorithms are nonparametric and nonlinear in nature, which substantially limits the ability to use the magnitude and sign of coefficients to quantify the impact of inputs on the output. Visualization methods and NLP techniques, for instance, do not provide a clear quantification of how much interpretability they actually add. These facts point to an often-ignored reality: explainability methodologies have shifted from statistics-driven practices grounded in common sense to mathematics-centric systems that demand domain knowledge and offer little causal inference. Model risk and SR 11-7 remain relevant to ML explainability. However, when traditional validation approaches are applied to deep learning models, difficulties are likely to arise without help from computer science professionals.
Phase III – Dawn of Generative AI (Post-November 2022)
Before consensus could be achieved on key issues regarding ML explainability, the late 2022 launch of OpenAI’s ChatGPT marked an inflection point.
ChatGPT and other LLM-based chatbot applications were developed to interact with users and create humanlike conversations. The degree of complexity they introduced is something no organization, including banks, had seen before. The intricacy of these applications is manifold. GPT-3, the model series underpinning the original ChatGPT, was trained on roughly 45 terabytes of public online text data and has around 175 billion trainable parameters. While financial data is mostly numeric and structured, and can therefore be fed directly into mathematical models to produce estimates, processing text data at this scale is different and typically involves transduction, or transductive learning.
Although it is generally acknowledged that GPTs are multimodal LLMs that meet the broad model definition prescribed in SR 11-7, explaining those LLMs is not easy. ChatGPT involves the development of three sequential models and the application of reinforcement learning from human feedback (RLHF) to train the language models to better align model output with user intent. The first step is to build a supervised training dataset of prompts and human-written responses and use it to train a supervised fine-tuning (SFT) model. The next step is to train a reward model on human rankings of candidate responses; the reward model takes a prompt and a response as input and outputs a scalar score. The final step is to use reinforcement learning, specifically Proximal Policy Optimization (PPO), to fine-tune the SFT policy against the reward model. At each stage of a conversation, the engineered prompt and prior exchanges are taken into account as context. It also takes many iterations of the entire process before the GPT model is finalized.
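To make the three-stage flow easier to follow, here is a structural sketch in Python. Every class and function in it (LanguageModel, RewardModel, train_sft, train_reward_model, ppo_finetune) is a hypothetical stub introduced purely to show the data flow; it is not a working trainer and does not reflect OpenAI’s actual implementation.

```python
# Structural sketch of the three-stage RLHF pipeline; all components are stubs.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LanguageModel:
    name: str
    def generate(self, prompt: str) -> str:
        return f"[response from {self.name} to: {prompt}]"  # stub

@dataclass
class RewardModel:
    def score(self, prompt: str, response: str) -> float:
        return float(len(response) % 7)  # stub scalar reward

def train_sft(base: LanguageModel,
              demonstrations: List[Tuple[str, str]]) -> LanguageModel:
    """Stage 1: supervised fine-tuning on (prompt, human-written response) pairs."""
    return LanguageModel(name=base.name + "-sft")

def train_reward_model(sft: LanguageModel,
                       ranked: List[Tuple[str, List[str]]]) -> RewardModel:
    """Stage 2: fit a reward model on human rankings of candidate responses,
    so it maps (prompt, response) to a scalar score."""
    return RewardModel()

def ppo_finetune(policy: LanguageModel, reward_model: RewardModel,
                 prompts: List[str], iterations: int = 3) -> LanguageModel:
    """Stage 3: fine-tune the SFT policy with PPO against the reward model."""
    for i in range(iterations):
        rewards = [reward_model.score(p, policy.generate(p)) for p in prompts]
        # A real trainer would update the policy weights here using PPO's
        # clipped objective plus a KL penalty toward the SFT model.
        print(f"iteration {i}: mean reward {sum(rewards) / len(rewards):.2f}")
    return LanguageModel(name=policy.name + "-aligned")

base = LanguageModel("base-llm")
sft = train_sft(base, [("What is APR?", "APR stands for annual percentage rate...")])
rm = train_reward_model(sft, [("What is APR?", ["best answer", "weaker answer"])])
aligned = ppo_finetune(sft, rm, ["What is APR?"])
print(aligned.name)
```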
LLMs differ fundamentally from financial models. They are not designed to assist decisions related to core banking operations such as loan origination, loss forecasting, valuation, and asset and liability management. Aside from the difference in data (numeric vs. text), developing, implementing, and recalibrating LLMs is also far more complex than working with simpler ML models. Several challenges are inevitable for LLM validation.
The first challenge relates to data and methodology evaluation, and it applies to both open source LLMs and private LLMs. For open source LLMs, it is impossible for model validators to conduct standard activities such as data quality and completeness checks, evaluation of alternative modeling approaches, and model replication exercises. The only meaningful assessment a traditional MRM team can perform is an evaluation of model accuracy for a given use case.
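As a concrete, deliberately simplified picture of that assessment, here is a sketch of a use-case accuracy check; query_llm() and the reference question-and-answer set are hypothetical placeholders, and a real review would use more robust scoring than keyword containment, such as human grading or semantic similarity.

```python
# Sketch of a use-case accuracy evaluation for an LLM the validator cannot
# inspect or replicate; the model call and reference set are hypothetical.
from typing import Callable, List, Tuple

def evaluate_use_case(query_llm: Callable[[str], str],
                      reference_set: List[Tuple[str, str]]) -> float:
    """Score the LLM against a bank-curated (question, expected answer) set
    for one use case, using a simple keyword-containment check."""
    hits = sum(
        1 for question, expected in reference_set
        if expected.lower() in query_llm(question).lower()
    )
    return hits / len(reference_set)

# Illustrative reference set for a hypothetical deposit-products use case.
reference_set = [
    ("What is the minimum opening deposit for a savings account?", "$100"),
    ("Is there a monthly fee on the basic checking account?", "no monthly fee"),
]

def query_llm(question: str) -> str:
    # Stand-in for the real API call to the hosted or open source LLM.
    return "The minimum opening deposit is $100 and there is no monthly fee."

print(f"use-case accuracy: {evaluate_use_case(query_llm, reference_set):.0%}")
```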
Financial institutions are interested in using LLMs to automate frequently executed tasks; for example, LLMs may help respond to customer calls in an accurate, swift, and secure way. One way to address this need is to construct a comprehensive “prompt architecture.” The other is to fine-tune an existing LLM by retraining a portion of it on a proprietary dataset. Prompt engineering leverages an existing LLM without modifying the underlying model or its training data. In contrast, fine-tuning modifies and tailors the underlying foundational LLM with domain-specific knowledge. Fine-tuning is better suited to banks with stringent data privacy requirements because it allows them to maintain control of the training process, of the portions of data used for training, and of recalibration within their own environment for their exclusive use.

For internally developed or customized LLMs, MRM may leverage approaches introduced by academic researchers to aid LLM validation. One is the Holistic Evaluation of Language Models (HELM), developed at Stanford to improve the transparency of language models through a multi-metric approach. Another is to use an LLM to validate an LLM, an approach presented by Carnegie Mellon using its self-developed ReLM, a Regular Expression engine for Language Models. However, these approaches are fairly new, do not cover the full spectrum of model validation processes established by SR 11-7, and have not been fully tested in the financial sector.
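Returning to the prompt-architecture option mentioned above, here is a minimal sketch of how such an approach might assemble a controlled prompt from a system policy, retrieved bank content, and the customer question without touching the underlying model; retrieve_policy_snippets() and call_llm() are hypothetical placeholders rather than real APIs.

```python
# Sketch of a "prompt architecture": leverage an existing LLM as-is by shaping
# its input. The retrieval step and model call are hypothetical placeholders.
from typing import List

SYSTEM_PROMPT = (
    "You are a contact-center assistant for a bank. Answer only from the "
    "provided policy excerpts. If the excerpts do not cover the question, "
    "say you will route the customer to an agent."
)

def retrieve_policy_snippets(question: str) -> List[str]:
    # Stand-in for retrieval over the bank's approved knowledge base.
    return ["Savings accounts require a $100 minimum opening deposit."]

def build_prompt(question: str) -> str:
    context = "\n".join(retrieve_policy_snippets(question))
    return (f"{SYSTEM_PROMPT}\n\nPolicy excerpts:\n{context}\n\n"
            f"Customer question: {question}")

def call_llm(prompt: str) -> str:
    # Placeholder for the actual call to the hosted foundation model.
    return "[model response constrained by the prompt above]"

print(call_llm(build_prompt("How much do I need to open a savings account?")))
```

Fine-tuning, by contrast, changes the underlying weights themselves, which is why it raises the data-control and training-process considerations described above.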
The second challenge is that LLMs are prone to hallucination and can generate nonsensical or unfaithful responses, a drawback that seems unlikely to be resolved any time soon. This creates significant compliance concerns. Business lines need to explore LLMs and leverage user experience and feedback to drive meaningful enhancements that minimize hallucination. There is also no agreement in the AI sector on what constitutes an acceptable accuracy rate.
The third challenge is talent. To perform a robust validation of LLMs, MRM must employ computer science expertise to complete the review, and many financial institutions may find it impossible to augment their MRM teams due to limited resources. Additionally, explaining generative AI mechanisms makes little sense without taking other risk evaluations into account, including but not limited to data privacy, cybersecurity, fairness, and compliance. Put differently, explainability still matters, but in the grand scheme of AI risk it is no longer the focal concern.
Closing Thoughts
Explaining supervised learning algorithms continues to be extremely relevant when those models take the form of conventional statistical and econometric regressions. Explaining non-parametric supervised learning models is also important when those techniques are applied to assist core banking activities such as loan origination decisions. Explaining automated AI systems, including recently developed generative AI solutions, is necessary but must be balanced with other risks to form a comprehensive assessment.
Over the last few decades, financial institutions have taken advantage of continued AI/ML improvements. AI/ML use cases have expanded from the straightforward use of automated systems, to the development of regression models, to solutions that employ advanced ML algorithms and LLMs. To stay competitive, banks will continue to leverage a combination of all three types of AI as well as future advancements. As technology innovations become more powerful, they greatly improve operational efficiency and banks’ ability to manage a variety of operations. At the same time, increased sophistication makes it harder for users to understand and explain how AI tools and systems work. To put it in Spider-Man terms: “With great sophistication come great explainability requirements.” An AI governance framework that speaks to the relevance of explainability will enable proper risk assessment without stifling innovation.
Disclaimer: The views presented in this research are solely those of the author and do not necessarily represent those of Ally Financial Inc. (AFI) or any subsidiaries of AFI.