[Figure: per-model bias scores (36 models). Visible values include DeepSeek-V3 (0.39), Qwen-235B (0.41), DeepSeek-V3 (0.45, rank 4), and Llama-4-Scout (0.52, rank 5).]
[Figure: best-performing reasoning model and best-performing LLM by sector (Basic Materials, Communication Services, Consumer Cyclical, Consumer Defensive, Energy, Financial Services, Healthcare, Industrials, Real Estate, Technology, Utilities), comparing Llama4-Scout, DeepSeek-V3, Qwen3-235B, Gemini-2.5-flash, GPT-4.1, and Mistral-24B.]
Key Takeaways
Performance on Finance Agent has increased significantly since its release, but there is still substantial room for improvement.
Claude Opus 4.1 (Thinking) was the best-performing model and is currently the only one to break 50%. However, it comes at a very high cost of $4.40 per query.
Most models struggled with tool use in general, and with information retrieval in particular, leading to inaccurate answers. This was most notable for small models such as Llama 4 Scout and Mistral Small 3.1 (03/2025).
On average, models performed best on the simple quantitative (37.57% average accuracy) and qualitative retrieval (30.79% average accuracy) tasks. These tasks are easy but time-intensive for finance analysts. (See the sketch after this list for how these per-task averages are computed.)
On our hardest tasks, models performed much worse: ten models scored 0% on the Trends task, and the best result was only 38.1%, from Claude Sonnet 4 (Nonthinking).
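As a rough illustration of how the per-task averages above are derived, the sketch below averages each task's accuracy across all evaluated models. The model names and scores here are placeholders for illustration only, not the benchmark's actual per-model results.

```python
from statistics import mean

# accuracy[model][task] -> fraction of questions answered correctly.
# Placeholder values; the real benchmark covers many more models and tasks.
accuracy = {
    "model_a": {"quantitative": 0.42, "qualitative_retrieval": 0.35, "trends": 0.00},
    "model_b": {"quantitative": 0.38, "qualitative_retrieval": 0.29, "trends": 0.10},
    "model_c": {"quantitative": 0.33, "qualitative_retrieval": 0.28, "trends": 0.00},
}

def task_average(task: str) -> float:
    """Mean accuracy on one task category, taken across all models."""
    return mean(scores[task] for scores in accuracy.values())

for task in ("quantitative", "qualitative_retrieval", "trends"):
    print(f"{task}: {task_average(task):.2%}")
```

Each reported figure (e.g. "37.57% average accuracy") is a mean over models for one task category, so a handful of strong models can mask many zero scores, as on the Trends task.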