[Figure: per-model bias scores (36 models). Visible values include DeepSeek-V3 (0.39), Qwen-235B (0.41), DeepSeek-V3 (0.45, rank 4), and Llama-4-Scout (0.52, rank 5).]
[Figure: best-performing reasoning model and best-performing LLM by sector (Basic Materials, Communication Services, Consumer Cyclical, Consumer Defensive, Energy, Financial Services, Healthcare, Industrials, Real Estate, Technology, Utilities), comparing Llama4-Scout, DeepSeek-V3, Qwen3-235B, Gemini-2.5-flash, GPT-4.1, and Mistral-24B.]
Key Takeaways
Performance on Finance Agent has increased significantly since its release, but there is still substantial room for improvement.
Claude Opus 4.1 (Thinking) was the best-performing model and is currently the only one to break 50%. However, it comes at a very high cost of $4.40 per query.
Most models struggled with tool use in general, and with information retrieval in particular, leading to inaccurate answers. This was most notable for small models such as Llama 4 Scout and Mistral Small 3.1 (03/2025).
On average, models performed best on the simple quantitative (37.57% average accuracy) and qualitative retrieval (30.79% average accuracy) tasks. These tasks are easy but time-intensive for finance analysts. (See the sketch after this list for how these per-task averages are computed.)
On our hardest tasks, models performed much worse: ten models scored 0% on the Trends task, and the best result was only 38.1%, from Claude Sonnet 4 (Nonthinking).
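As a rough illustration of how the per-task averages above are derived, the sketch below averages each task's accuracy across all evaluated models. The model names and scores here are placeholders for illustration only, not the benchmark's actual per-model results.

```python
from statistics import mean

# accuracy[model][task] -> fraction of questions answered correctly.
# Placeholder values; the real benchmark covers many more models and tasks.
accuracy = {
    "model_a": {"quantitative": 0.42, "qualitative_retrieval": 0.35, "trends": 0.00},
    "model_b": {"quantitative": 0.38, "qualitative_retrieval": 0.29, "trends": 0.10},
    "model_c": {"quantitative": 0.33, "qualitative_retrieval": 0.28, "trends": 0.00},
}

def task_average(task: str) -> float:
    """Mean accuracy on one task category, taken across all models."""
    return mean(scores[task] for scores in accuracy.values())

for task in ("quantitative", "qualitative_retrieval", "trends"):
    print(f"{task}: {task_average(task):.2%}")
```

Each reported figure (e.g. "37.57% average accuracy") is a mean over models for one task category, so a handful of strong models can mask many zero scores, as on the Trends task.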