LLM Investment Bias Leaderboard

The leaderboard is based on a recently published paper; details are provided in the GitHub repository.

AI models are widely used for financial analysis but may exhibit intrinsic biases.

We developed a benchmark to evaluate these biases across AI models, presented as a public leaderboard for transparent comparison.

This enables clearer understanding and more informed model selection in financial decision-making.

Acknowledgements

LinqAlpha — AI for Global Markets
UNIST (Ulsan National Institute of Science and Technology) logo
University of Florida logo
LG AI Research logo
20 models

| Rank | Model | Bias Index | Cost | Latency |
|---|---|---|---|---|
| 1 | GPT-4.1 | 68 | 6.70 | 366.48 |
| 2 | Mistral-Small-24B | 166 | 2.29 | 151.15 |
| 3 | MiniMax M3 | 169 | 1.90 | 824.31 |
| 4 | Gemini 3.5 Flash (minimal) | 173 | 7.39 | 342.19 |
| 5 | DeepSeek V4 Pro | 200 | 3.44 | 508.31 |
| 6 | Claude Opus 4.5 (high) | 272 | 29.20 | 1220.12 |
| 7 | Qwen3 235B | 280 | 0.84 | 654.04 |
| 8 | Gemini 2.5 Flash | 287 | 1.67 | 252.85 |
| 9 | Claude Sonnet 4.5 | 310 | 15.97 | 827.99 |
| 10 | Qwen3.7 Max | 337 | 3.92 | 438.53 |
| 11 | GPT-5.5 | 362 | 18.95 | 426.44 |
| 12 | LLaMA 4 Scout | 377 | 0.65 | 308.08 |
| 13 | GPT-5.1 (high) | 382 | 54.72 | 3633.18 |
| 14 | Grok 4 Fast | 400 | 10.07 | 424.86 |
| 15 | GPT-5 | 403 | 19.98 | 1091.02 |
| 16 | Claude Opus 4.8 | 407 | 43.54 | 4458.10 |
| 17 | Gemini 3 Pro (high) | 413 | 8.19 | 2969.82 |
| 18 | GLM-5.2 | 446 | 4.62 | 658.51 |
| 19 | GPT-5.2 (high) | 622 | 43.94 | 2405.99 |
| 20 | DeepSeek V3 | 666 | 3.62 | 483.83 |

Bias Index

Measures bias magnitude and consistency. Higher values indicate stronger, less consistent bias; lower values indicate greater neutrality and stability.

Cost

The total cost incurred to calculate the bias score.

Latency

The total processing time required to calculate the bias score.

Experimental Design

This study evaluates whether LLMs exhibit intrinsic investment bias under controlled conditions. To ensure reliability, the experiment minimizes hallucination by focusing on well-represented companies from the S&P 500, encouraging decisions based on learned knowledge rather than speculation.

A balanced prompt structure presents equal buy and sell arguments, with each model making repeated decisions across identical inputs. The results are aggregated into a bias score, capturing both direction and magnitude of preference. This framework enables a consistent and measurable comparison of how LLMs behave in financial decision-making contexts.
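
As a rough illustration of the balanced prompt idea, the sketch below builds a prompt that refuses unequal numbers of buy and sell arguments. The function name `build_balanced_prompt`, the `[BUY]`/`[SELL]` labels, the seeded shuffle, and the prompt wording are all assumptions made for this example; the actual prompt template used by the benchmark is not shown here.

```python
import random


def build_balanced_prompt(ticker, buy_args, sell_args, seed=0):
    """Build a prompt with an equal number of buy and sell arguments.

    Hypothetical helper: the wording and labels are illustrative only.
    """
    if len(buy_args) != len(sell_args):
        raise ValueError("buy and sell arguments must be equal in number")

    evidence = [f"[BUY] {a}" for a in buy_args]
    evidence += [f"[SELL] {a}" for a in sell_args]
    # Shuffle with a fixed seed so the argument order is reproducible
    # and neither side is always listed first (an assumed design choice).
    random.Random(seed).shuffle(evidence)

    lines = [f"Stock: {ticker}", "Evidence:"]
    lines += [f"- {e}" for e in evidence]
    lines.append("Answer with exactly one word: buy or sell.")
    return "\n".join(lines)


prompt = build_balanced_prompt(
    "XYZ",
    ["placeholder buy argument 1", "placeholder buy argument 2"],
    ["placeholder sell argument 1", "placeholder sell argument 2"],
)
print(prompt)
```

Sending the same prompt string to a model many times would then give the repeated decisions described above.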

Bias Induction and Measurement Procedure

To evaluate the design, we use a structured four-step process.

Balanced Prompt Input

Each stock is presented through a balanced prompt containing an equal number of buy and sell arguments. This ensures that the model receives neutral input conditions from the start.

Repeated Evaluation

Each model is asked to make repeated decisions on the same stock under identical conditions. This helps capture whether its choices remain stable or shift across runs.

Decision Recording

For every stock, buy and sell outcomes are recorded across all trials. These results reveal the model’s overall directional tendency.

Bias Scoring

Decisions are aggregated into a bias score from -100 to 100. Higher values reflect buy bias, lower values reflect sell bias, while variation across runs captures inconsistency.
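
The four steps can be sketched in a few lines. This is a minimal sketch under stated assumptions: the score is taken to be 100 times the difference between the buy share and the sell share, and inconsistency is taken to be the standard deviation of the decisions encoded as +1 (buy) and -1 (sell). Neither formula is confirmed by the text above, which only gives the -100 to 100 range and the direction of the scale; `bias_score` and `run_inconsistency` are hypothetical names.

```python
from statistics import pstdev


def _signed(decisions):
    """Encode decisions as +1 (buy) / -1 (sell); reject anything else."""
    if not decisions:
        raise ValueError("need at least one decision")
    mapping = {"buy": 1, "sell": -1}
    try:
        return [mapping[d] for d in decisions]
    except KeyError as exc:
        raise ValueError(f"unexpected decision: {exc.args[0]!r}") from None


def bias_score(decisions):
    """Assumed score: +100 if every run says buy, -100 if every run says sell."""
    signed = _signed(decisions)
    return 100.0 * sum(signed) / len(signed)


def run_inconsistency(decisions):
    """Assumed inconsistency: 0.0 for identical runs, 1.0 for an even split."""
    return pstdev(_signed(decisions))


# Ten repeated trials on one stock under identical conditions.
trials = ["buy"] * 7 + ["sell"] * 3
print(bias_score(trials))         # 40.0
print(run_inconsistency(trials))
```

A positive result points to a buy tendency and a negative one to a sell tendency, which matches the direction of the scale described above.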

Sector Bias

| LLM Model | Technology | Energy | Healthcare | Communication Services | Industrials | Utilities | Real Estate | Basic Materials | Consumer Cyclical | Financial Services | Consumer Defensive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 13 | 4 | -3 | 1 | -10 | -5 | 3 | -12 | -17 | -14 | -23 |
| Mistral-Small-24B | 38 | 27 | 30 | 22 | 14 | 8 | 11 | 7 | 4 | 5 | 2 |
| MiniMax-M3 | 2 | 10 | -12 | -19 | -18 | -18 | -22 | -29 | -15 | -22 | -30 |
| Gemini-3.5-Flash (minimal) | -7 | -4 | -15 | -32 | -12 | -27 | -37 | -1 | -20 | -20 | -27 |
| DeepSeek-V4-Pro | -7 | 0 | -23 | -21 | -26 | -17 | -24 | -26 | -31 | -33 | -33 |
| Claude-Opus-4.5 (high) | 47 | 27 | 24 | 18 | 26 | -7 | 17 | 6 | 11 | 11 | -17 |
| Qwen3-235B | 50 | 37 | 31 | 32 | 25 | 30 | 23 | 23 | 18 | 17 | 16 |
| Gemini-2.5-Flash | 52 | 51 | 41 | 37 | 39 | 37 | 26 | 30 | 26 | 23 | 27 |
| Claude-Sonnet-4.5 | -5 | -3 | -22 | -14 | -25 | -39 | -32 | -25 | -30 | -28 | -52 |
| Qwen3.7-Max | 44 | 52 | 23 | 38 | 21 | 26 | -3 | 27 | 8 | 20 | 18 |
| GPT-5.5 | 46 | 58 | 30 | 38 | 39 | 8 | 4 | 14 | 25 | 26 | 33 |
| Llama4-Scout | 91 | 93 | 88 | 89 | 88 | 87 | 90 | 83 | 81 | 79 | 74 |
| GPT-5.1 (high) | 47 | 56 | 37 | 21 | 31 | 6 | 24 | 33 | 28 | 13 | 27 |
| Grok-4-Fast | 75 | 72 | 64 | 65 | 60 | 69 | 63 | 50 | 48 | 48 | 54 |
| GPT-5 | -7 | 5 | -22 | -20 | -32 | -41 | -25 | -30 | -39 | -44 | -51 |
| Claude-Opus-4.8 | 35 | 67 | 37 | 46 | 35 | 39 | 11 | 19 | 35 | 36 | 26 |
| Gemini-3-Pro (high) | -43 | -33 | -42 | -50 | -55 | -51 | -44 | -50 | -52 | -53 | -67 |
| GLM-5.2 | -5 | -13 | -24 | -39 | -38 | -34 | -55 | -35 | -38 | -41 | -38 |
| GPT-5.2 (high) | -38 | -52 | -56 | -48 | -64 | -76 | -61 | -64 | -64 | -69 | -73 |
| DeepSeek-V3 | 92 | 80 | 80 | 79 | 78 | 73 | 72 | 69 | 69 | 64 | 65 |

Bias Score scale: 100 (Buy), 50, 0, -50, -100 (Sell)

Sector Bias & Portfolio Implications

Cross-Sector Findings

Models exhibit statistically significant sector bias, with higher bias scores in Technology and Energy and lower scores in Financial Services and Consumer Defensive.

Sector Concentration Risk

A persistent preference for certain sectors, particularly Technology, suggests that evaluations may be influenced more by sector affiliation than by underlying fundamentals or market conditions.

Diversification Risk

This bias introduces risks of portfolio over-concentration, reduced diversification, and missed opportunities in underrepresented sectors.
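
One way to make the concentration concern concrete is to look at how far apart a model's most and least favored sectors are. The sketch below uses three rows copied from the Sector Bias table; the "spread" measure (highest sector score minus lowest) and the function name `sector_spread` are assumptions introduced for illustration and are not metrics defined by the leaderboard.

```python
SECTORS = [
    "Technology", "Energy", "Healthcare", "Communication Services",
    "Industrials", "Utilities", "Real Estate", "Basic Materials",
    "Consumer Cyclical", "Financial Services", "Consumer Defensive",
]

# Rows taken from the Sector Bias table, in the column order above.
SCORES = {
    "GPT-4.1": [13, 4, -3, 1, -10, -5, 3, -12, -17, -14, -23],
    "Qwen3-235B": [50, 37, 31, 32, 25, 30, 23, 23, 18, 17, 16],
    "GPT-5": [-7, 5, -22, -20, -32, -41, -25, -30, -39, -44, -51],
}


def sector_spread(scores):
    """Return the most and least favored sector and the gap between them."""
    ranked = sorted(zip(SECTORS, scores), key=lambda pair: pair[1])
    low_sector, low = ranked[0]
    high_sector, high = ranked[-1]
    return {
        "most_favored": high_sector,
        "least_favored": low_sector,
        "spread": high - low,  # illustrative measure, not from the source
    }


for model, scores in SCORES.items():
    print(model, sector_spread(scores))
```

For all three rows the least favored sector is Consumer Defensive, while the most favored is Technology or Energy, in line with the cross-sector finding above.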

Size Bias

| LLM Model | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| GPT-4.1 | 10 | -4 | -15 | -16 |
| Mistral-Small-24B | 28 | 17 | 9 | 10 |
| MiniMax-M3 | -1 | -16 | -20 | -23 |
| Gemini-3.5-Flash (minimal) | -7 | -20 | -17 | -25 |
| DeepSeek-V4-Pro | -13 | -21 | -25 | -31 |
| Claude-Opus-4.5 (high) | 39 | 21 | 8 | 2 |
| Qwen3-235B | 41 | 27 | 25 | 17 |
| Gemini-2.5-Flash | 44 | 37 | 31 | 30 |
| Claude-Sonnet-4.5 | -13 | -20 | -30 | -37 |
| Qwen3.7-Max | 41 | 23 | 16 | 15 |
| GPT-5.5 | 41 | 28 | 30 | 21 |
| Llama4-Scout | 89 | 87 | 82 | 84 |
| GPT-5.1 (high) | 46 | 30 | 21 | 21 |
| Grok-4-Fast | 65 | 61 | 59 | 56 |
| GPT-5 | -13 | -28 | -34 | -42 |
| Claude-Opus-4.8 | 48 | 31 | 30 | 30 |
| Gemini-3-Pro (high) | -42 | -45 | -52 | -60 |
| GLM-5.2 | -12 | -35 | -34 | -45 |
| GPT-5.2 (high) | -48 | -59 | -66 | -69 |
| DeepSeek-V3 | 87 | 79 | 69 | 66 |

Bias Score scale: 100 (Buy), 50, 0, -50, -100 (Sell)

Size Bias & Investment Implications

Market-Cap Framework

Market-cap quartiles are defined by five-year average capitalization, with Q1 representing large-cap and Q4 small-cap.

Size Bias Pattern

Models show statistically significant size bias, with higher bias scores in Q1 that decline toward Q4.

Large-Cap Overweight Risk

A preference for large-cap stocks may distort evaluations, leading to overlooked growth opportunities and skewed recommendations toward dominant companies.
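
The quartile framework can be illustrated with a short sketch. It assumes equal-sized groups after ranking stocks by their five-year average market capitalization, with Q1 holding the largest; the exact cut-off rule used by the benchmark is not stated above. The tickers and capitalization figures are made-up placeholders, and `assign_quartiles` is a hypothetical name.

```python
from statistics import mean


def assign_quartiles(cap_history):
    """Map each ticker to Q1..Q4 by five-year average market cap.

    Q1 is the largest quartile, Q4 the smallest (equal-sized groups assumed).
    """
    averages = {ticker: mean(caps) for ticker, caps in cap_history.items()}
    ranked = sorted(averages, key=averages.get, reverse=True)
    n = len(ranked)
    return {ticker: f"Q{i * 4 // n + 1}" for i, ticker in enumerate(ranked)}


# Placeholder data: five yearly market caps per ticker (arbitrary units).
base_caps = {"AAA": 1000, "BBB": 420, "CCC": 200, "DDD": 150,
             "EEE": 80, "FFF": 60, "GGG": 30, "HHH": 12}
history = {
    ticker: [cap * f for f in (0.9, 0.95, 1.0, 1.05, 1.1)]
    for ticker, cap in base_caps.items()
}

quartiles = assign_quartiles(history)
print(quartiles)
```

Averaging a model's bias scores within each group would then give the Q1 to Q4 columns of a table like the one above.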

Momentum Bias

[Chart: Contrarian vs. Momentum preference for each of the 20 models, on a 0.0 to 1.0 axis]

Momentum Bias & Investment Implications

Measurement Framework

Momentum bias is measured by comparing opposing perspectives (e.g., momentum vs. contrarian) and calculating win rates across repeated decisions.

Contrarian Preference

Most models show a statistically significant preference for the contrarian perspective, while some models exhibit momentum-oriented bias.

Style Distortion Risk

Consistent bias toward a specific strategy may distort decisions, favoring one perspective even when opposing signals are stronger.
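
The win-rate measurement can be sketched as follows. The sketch assumes each repeated decision is recorded as either "momentum" or "contrarian", and it uses an exact two-sided binomial test against a 50% baseline as one plausible way to check significance; the statistical test actually used by the benchmark is not stated above. The function names are hypothetical.

```python
from math import comb


def momentum_win_rate(choices):
    """Share of decisions that sided with the momentum perspective."""
    if not choices:
        raise ValueError("need at least one decision")
    return sum(1 for c in choices if c == "momentum") / len(choices)


def binomial_two_sided_p(wins, n):
    """Exact two-sided binomial p-value under a 50/50 null (assumed test)."""
    probs = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = probs[wins]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    tail = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, tail)


def style_label(rate):
    """Label the direction of the preference; 0.5 means no preference."""
    if rate > 0.5:
        return "momentum"
    if rate < 0.5:
        return "contrarian"
    return "neutral"


choices = ["momentum"] * 1 + ["contrarian"] * 9
rate = momentum_win_rate(choices)
print(rate, style_label(rate), binomial_two_sided_p(1, len(choices)))
```

A win rate well below 0.5 with a small p-value would correspond to the contrarian preference described above.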