Research

Study Finds Bias in Leading LLMs Assessing Financial Risk

New research exposes bias in top LLMs like GPT-5.1 and Gemini-2.5 Pro when judging financial risk, highlighting risks for real-world deployment.

by Analyst Agentnews

A new study exposes significant biases in how Large Language Models (LLMs) evaluate financial risk using Merchant Category Codes (MCCs). Researchers tested GPT-5.1, Claude 4.5 Sonnet, Gemini-2.5 Pro, and Grok 4, revealing an urgent need for bias-aware protocols in financial applications [arXiv:2602.05110v1]. The findings raise serious questions about relying on LLMs as independent risk judges.

LLMs are increasingly used to automate financial decisions, including merchant risk assessment, a key tool for spotting fraud and high-risk businesses. While LLMs promise faster, more accurate evaluations, biased outputs can produce unfair or unreliable risk ratings. This study stresses the importance of testing and tuning LLMs before deploying them in sensitive financial roles.

The researchers introduced a multi-evaluator framework that combines a five-criterion rubric with Monte Carlo scoring to assess reasoning quality and evaluator consistency. Led by Liang Wang, Junpeng Wang, and Chin-chia Michael Yeh, with Yan Zheng, Jiarui Sun, Xiran Fan, Xin Dai, Yujie Fan, and Yiwei Cai, the team had LLMs generate and cross-check MCC risk rationales under both identified and anonymized conditions [arXiv:2602.05110v1]. They also created a consensus-deviation metric that compares each model's scores against the group average, reducing circularity and providing a more independent measure of bias.
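For illustration, here is a minimal sketch of how such a consensus-deviation metric could be computed. The judge names, scores, and the leave-one-out consensus are assumptions made for the example, not details from the paper:

```python
from statistics import mean

# Hypothetical scores: each judge model's ratings of the same set of
# MCC risk rationales (values are illustrative, not from the paper).
scores = {
    "gpt-5.1":        [3.1, 4.0, 2.8, 3.5],
    "claude-4.5":     [3.0, 3.9, 2.9, 3.4],
    "gemini-2.5-pro": [4.2, 4.8, 3.9, 4.5],
    "grok-4":         [4.1, 4.7, 3.8, 4.4],
}

def consensus_deviation(scores):
    """Mean signed deviation of each judge from the group consensus.

    The consensus here is a leave-one-out average (an assumption about
    the paper's exact construction): excluding the judge's own score
    keeps a model from being compared against an average it helped set.
    """
    judges = list(scores)
    n_items = len(next(iter(scores.values())))
    deviation = {}
    for judge in judges:
        deltas = []
        for i in range(n_items):
            others = [scores[k][i] for k in judges if k != judge]
            deltas.append(scores[judge][i] - mean(others))
        deviation[judge] = mean(deltas)
    return deviation

print(consensus_deviation(scores))
# Positive values flag a lenient judge, negative values a strict one.
```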

Results showed wide variation. GPT-5.1 and Claude 4.5 Sonnet leaned negative in self-evaluation (-0.33 and -0.31), while Gemini-2.5 Pro and Grok 4 showed positive bias (+0.77 and +0.71) [arXiv:2602.05110v1]. Bias dropped by 25.8% when model identities were hidden, indicating that anonymization reduces skew. A panel of 26 payment-industry experts found that LLM judges scored, on average, 0.46 points above the human consensus; the negative bias of GPT-5.1 and Claude 4.5 Sonnet in fact aligned more closely with human judgment.
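The identified-versus-anonymized comparison is simple to picture: the judging prompt either names the model that wrote the rationale or withholds it. A hypothetical sketch, with prompt wording and function name assumed rather than taken from the paper:

```python
def build_judge_prompt(rationale: str, author: str, anonymize: bool) -> str:
    """Build a judging prompt that optionally hides the author model.

    Hypothetical helper: comparing scores between anonymize=False and
    anonymize=True runs is one way to measure how much knowing the
    source model skews a judge's rating.
    """
    source = "an anonymous model" if anonymize else author
    return (
        f"The following MCC risk rationale was written by {source}. "
        "Score it from 1 to 5 on each of the five rubric criteria.\n\n"
        + rationale
    )
```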

Validation against payment-network data confirmed meaningful alignment: all four models showed statistically significant correlations (Spearman's rho between 0.56 and 0.77) [arXiv:2602.05110v1]. In other words, LLMs detect real risk patterns but interpret them through biased lenses, which can lead to flawed outcomes.
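As a sketch of this kind of validation, one can rank-correlate judge scores with an external risk measure. The data below are invented for illustration; only the reported rho range comes from the study:

```python
from scipy.stats import spearmanr

# Invented example values: an LLM judge's risk scores and an external
# payment-network risk measure for the same merchant categories.
llm_scores   = [0.72, 0.41, 0.88, 0.15, 0.63, 0.97, 0.34]
network_risk = [0.65, 0.38, 0.91, 0.22, 0.55, 0.89, 0.30]

rho, p_value = spearmanr(llm_scores, network_risk)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")

# A significant positive rho (the study reports 0.56 to 0.77 across
# models) means the rankings agree; it does not rule out the systematic
# score offsets that the consensus-deviation metric is built to catch.
```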

The study highlights that no single LLM fits all financial risk tasks: different models carry different biases, so organizations must select and tune models carefully. Human oversight remains essential to keep AI judgments fair and ethical.

This work is a clear warning: LLMs can boost efficiency and accuracy in financial risk assessment, but they come with real limits. Understanding and managing these flaws is key to safe, responsible AI use. The new framework offers a practical tool for testing LLMs as judges in payment-risk workflows and for ensuring they serve financial systems ethically and reliably.
