Do LLMs Possess Intrinsic Investment Biases?




  • Large Language Models (LLMs) have demonstrated remarkable capability in interpreting unstructured and qualitative financial information, and they are increasingly being adopted in real-world investment decision-support systems. However, if these models harbor intrinsic investment biases, their outputs may deviate from investor intent, leading to distorted and unreliable investment judgments.

  • To systematically uncover these hidden investment biases, we designed an experimental framework that analyzes bias across multiple models and presents the comparative results in the form of a public leaderboard.

  • The leaderboard is based on a recently published paper; details are provided in the GitHub repository.

Acknowledgements

Rank  Model              Bias Index  Cost   Latency
1     GPT-4.1            68          6.69   366
2     Mistral-Small-24B  166         2.29   151
3     Qwen3-235B         280         0.84   654
4     Gemini-2.5-Flash   287         1.67   252
5     Claude-Sonnet-4.5  310         15.96  827
6     LLaMA-4-Scout      377         0.65   308
7     Grok-4-Fast        400         10.06  424
8     GPT-5 (low)        403         19.98  1091
9     Deepseek-V3        666         3.61   483

  • Bias Index: The Bias Index quantifies a model's bias magnitude and inconsistency simultaneously, calculated as (absolute mean bias score) × (standard deviation across groups). High values signal strong average bias and large cross-group differences; values near zero indicate neutrality or perfect consistency.

  • Cost: The total cost incurred to calculate the bias score.

  • Latency: The total processing time required to calculate the bias score.
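
As a rough illustration of the Bias Index definition above, the following Python sketch multiplies the absolute mean of per-group bias scores by their standard deviation. The function name `bias_index`, the use of the population standard deviation, and the sample scores are assumptions made for illustration; the page does not state which form of standard deviation or which set of groups feeds the published values, so this sketch does not attempt to reproduce the leaderboard numbers.

```python
from statistics import fmean, pstdev


def bias_index(group_scores):
    """Return |mean bias score| x standard deviation across groups.

    Assumption: population standard deviation (pstdev). The source does
    not say whether the population or the sample form is used.
    """
    scores = list(group_scores)
    if not scores:
        raise ValueError("need at least one group score")
    return abs(fmean(scores)) * pstdev(scores)


# Hypothetical per-group bias scores, for illustration only.
strong_and_uneven = [60.0, 20.0]   # large mean, large spread
strong_but_uniform = [40.0, 40.0]  # large mean, zero spread
balanced_on_average = [-30.0, 30.0]  # zero mean, large spread

print(bias_index(strong_and_uneven))
print(bias_index(strong_but_uniform))
print(bias_index(balanced_on_average))
```

The last two cases show why a value near zero can mean either neutrality (zero mean) or perfect consistency (zero spread): either factor alone drives the product to zero.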


Experimental Overview

  • This study was designed to minimize the possibility of hallucination, based on prior work indicating that LLMs are significantly less likely to generate false information when prompted about topics they have sufficiently encountered during training.

  • Accordingly, the experiment focused on 427 major stocks that have been continuously included in the S&P 500 index over the past five years. These companies are highly visible in public disclosures and media coverage, increasing the likelihood that they are well represented in the models’ training data. Thus, the experiment aims to observe decisions driven primarily by the model’s internal parametric knowledge, rather than by speculative generation.

Bias Induction and Measurement Procedure

  1. Balanced Prompt Input: For each stock, a balanced prompt was constructed containing an equal number of buy and sell arguments and presented to the model.

  2. Repetitive Evaluation: The model was asked to make ten repeated decisions (N = 10) based on the same prompt, choosing either “buy” or “sell” in each trial.

  3. Decision Recording: For every stock, the number of buy and sell decisions was recorded to capture the model’s overall tendency.

  4. Bias Assessment: The ratio between buy and sell choices was analyzed to compute a bias score, representing the direction and magnitude of the model’s preference. A higher score indicates a buy bias, while a lower score indicates a sell bias.


The bias score is computed under identical conditions that present equal evidence for buy and sell. The range is -100 to 100; higher values indicate a greater share of buy selections, while lower values indicate a greater share of sell selections. Numbers in parentheses denote the standard deviation across repeated runs.
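
One way to turn the recorded decisions into a score on the stated -100 to 100 range is sketched below. The formula 100 × (buys − sells) / (buys + sells) is an assumption that matches the described range and direction; the page does not give the exact expression, and the function name `bias_score` and the sample trials are hypothetical.

```python
from collections import Counter


def bias_score(decisions):
    """Map a list of "buy"/"sell" decisions to a score in [-100, 100].

    Assumed formula: 100 * (buys - sells) / (buys + sells).
    """
    counts = Counter(d.strip().lower() for d in decisions)
    unknown = set(counts) - {"buy", "sell"}
    if unknown:
        raise ValueError(f"unexpected decisions: {sorted(unknown)}")
    total = counts["buy"] + counts["sell"]
    if total == 0:
        raise ValueError("no decisions recorded")
    return 100.0 * (counts["buy"] - counts["sell"]) / total


# Hypothetical outcome of N = 10 repeated trials for one stock.
trials = ["buy"] * 7 + ["sell"] * 3
print(bias_score(trials))  # 40.0
```

Under this assumed formula, ten buys give 100, ten sells give -100, and an even split gives 0.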


Through this procedure, we systematically analyzed how LLMs exhibit inherent biases toward key financial factors such as sector, size, and momentum.

Sector Bias

LLM Model          Technology  Energy     Healthcare  Communication Services  Industrials  Utilities  Real Estate  Basic Materials  Consumer Cyclical  Financial Services  Consumer Defensive
GPT-4.1            13(6.00)    4(13.00)   -3(7.00)    1(10.00)                -10(4.00)    -5(3.00)   3(7.00)      -12(5.00)        -17(3.00)          -14(2.00)           -23(4.00)
Mistral-24B        38(2.00)    27(5.00)   30(5.00)    22(5.00)                14(3.00)     8(5.00)    11(4.00)     7(5.00)          4(1.00)            5(2.00)             2(2.00)
Qwen3-235B         50(2.00)    37(7.00)   31(7.00)    32(4.00)                25(7.00)     30(2.00)   23(2.00)     23(11.00)        18(1.00)           17(3.00)            16(3.00)
Gemini-2.5-Flash   52(4.00)    51(5.00)   41(2.00)    37(6.00)                39(3.00)     37(10.00)  26(10.00)    30(11.00)        26(3.00)           23(4.00)            27(8.00)
Claude-Sonnet-4.5  -5(5.00)    -3(10.00)  -22(1.00)   -14(4.00)               -25(0.00)    -39(3.00)  -32(3.00)    -25(5.00)        -30(2.00)          -28(3.00)           -52(2.00)
Llama4-Scout       91(1.00)    93(5.00)   88(2.00)    89(7.00)                88(2.00)     87(1.00)   90(2.00)     83(5.00)         81(2.00)           79(2.00)            74(1.00)
Grok-4-Fast        75(1.00)    72(6.00)   64(4.00)    65(11.00)               60(2.00)     69(3.00)   63(1.00)     50(11.00)        48(8.00)           48(8.00)            54(2.00)
GPT-5              -7(4.00)    5(10.00)   -22(1.00)   -20(0.00)               -32(6.00)    -41(3.00)  -25(8.00)    -30(7.00)        -39(4.00)          -44(4.00)           -51(5.00)
DeepSeek-V3        92(2.00)    80(6.00)   80(0.00)    79(6.00)                78(0.00)     73(3.00)   72(2.00)     69(3.00)         69(1.00)           64(2.00)            65(2.00)

Scale: Bias Score, from 100 (Buy) to -100 (Sell).

Key Takeaways

  • Sector bias is pronounced, and differences in bias scores are statistically significant. Across many models, Technology and Energy show relatively higher bias scores, while Financial Services and Consumer Defensive show relatively lower bias scores.

  • A consistent preference for certain sectors, particularly Technology, could lead the model to systematically overvalue assets within that sector, irrespective of individual stock fundamentals or market conditions. This poses a risk of over-concentrating investment portfolios, hindering diversification, and causing missed opportunities in other promising areas.

Size Bias

LLM Model          Q1         Q2         Q3         Q4
GPT-4.1            10(4.00)   -4(5.00)   -15(1.00)  -16(4.00)
Mistral-24B        28(2.00)   17(3.00)   9(2.00)    10(2.00)
Qwen3-235B         41(4.00)   27(3.00)   25(1.00)   17(4.00)
Gemini-2.5-Flash   44(2.00)   37(3.00)   31(5.00)   30(7.00)
Claude-Sonnet-4.5  -13(2.00)  -20(2.00)  -30(4.00)  -37(4.00)
Llama4-Scout       89(2.00)   87(2.00)   82(1.00)   84(2.00)
Grok-4-Fast        65(6.00)   61(2.00)   59(6.00)   56(5.00)
GPT-5              -13(2.00)  -28(1.00)  -34(4.00)  -42(2.00)
DeepSeek-V3        87(2.00)   79(1.00)   69(1.00)   66(3.00)

Scale: Bias Score, from 100 (Buy) to -100 (Sell).

Key Takeaways

  • Market-cap quartiles are based on the average market capitalization over the past five years, with Q1 representing the largest market caps and Q4 the smallest.

  • Size bias is observed across all models, with a common pattern of higher bias scores in Q1 that decline toward Q4. The differences in bias scores are statistically significant.

  • A distinct preference for large-cap stocks could lead the model to positively assess a company merely due to its size, irrespective of its intrinsic growth potential or valuation. This entails the risk of overlooking high-growth opportunities in innovative small and mid-cap stocks and could lead to skewed investment recommendations that heavily favor dominant large-cap corporations.
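
The quartile grouping described above could be sketched as follows. This assumes stocks are ranked by average market capitalization in descending order and split into four groups of near-equal size; the exact cut points and tie handling used in the study are not given on this page, and the function name, tickers, and market-cap figures are hypothetical.

```python
def assign_size_quartiles(avg_market_cap):
    """Return {ticker: "Q1".."Q4"}, with Q1 holding the largest caps.

    Assumption: rank by average market cap (descending) and split the
    ranking into four near-equal groups.
    """
    ranked = sorted(avg_market_cap, key=avg_market_cap.get, reverse=True)
    n = len(ranked)
    return {
        ticker: f"Q{(rank * 4) // n + 1}"
        for rank, ticker in enumerate(ranked)
    }


# Hypothetical tickers with five-year average market caps (arbitrary units).
caps = {"AAA": 800, "BBB": 400, "CCC": 200, "DDD": 100,
        "EEE": 50, "FFF": 25, "GGG": 12, "HHH": 6}
for ticker, quartile in assign_size_quartiles(caps).items():
    print(ticker, quartile)
```

Per-quartile bias scores would then be averages of the per-stock scores within each group.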

Momentum Bias

[Figure: win rate (0.0 to 1.0) of the Contrarian versus Momentum perspective for each model: DeepSeek-V3, GPT-4.1, Gemini-2.5-Flash, Llama-4-Scout, Mistral-Small-24B, Qwen3-235B, Claude-Sonnet-4.5, GPT-5, Grok-4-Fast.]

Key Takeaways

  • We measured momentum bias by constructing a buy argument from one investment perspective (e.g., momentum) and a sell argument from the opposing perspective (e.g., contrarian). The model’s final decision determined the winning perspective. By repeating this process, we calculated the win rate for each viewpoint, thereby quantifying the model’s bias.

  • Our analysis revealed a consistent and statistically significant preference for the contrarian perspective across most models. Qwen3-235B exhibited the strongest contrarian preference. Although the margin was narrow for Gemini-2.5-Flash, its preference was still statistically significant. GPT-5 was the only model to display a discernible preference for the momentum-based viewpoint.

  • This consistent bias toward a specific investment style presents a potential risk. It could lead to distorted investment decisions that favor the model’s preferred perspective, even when market signals strongly support an opposing strategy.
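
The win-rate calculation described above, together with one possible significance check, can be sketched in Python. The page does not say which statistical test was used; the exact two-sided binomial test against a 50/50 split shown here is an assumption, and the function names and counts are hypothetical.

```python
from math import comb


def win_rate(wins, trials):
    """Share of trials in which a given perspective won."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    return wins / trials


def binomial_two_sided_p(wins, trials):
    """Exact two-sided p-value under a fair 50/50 null hypothesis.

    Sums the probability of every outcome that is no more likely than
    the observed one. With p = 0.5 each outcome k has probability
    comb(trials, k) / 2**trials, so integer counts can be compared.
    """
    observed = comb(trials, wins)
    tail = sum(
        comb(trials, k)
        for k in range(trials + 1)
        if comb(trials, k) <= observed
    )
    return tail / 2 ** trials


# Hypothetical counts: the contrarian argument wins 8 of 10 decisions.
print(win_rate(8, 10))              # 0.8
print(binomial_two_sided_p(8, 10))  # 0.109375
```

With so few trials, an 80% win rate is not significant at the 5% level under this test, which illustrates why many repeated decisions are needed before a narrow margin can be called significant.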