Rank | Model (36) | Bias score
1 | GPT-4.1 | 0.37
2 | DeepSeek-V3 | 0.39
3 | Qwen-235B | 0.41
4 | DeepSeek-V3 | 0.45
5 | Llama-4-Scout | 0.52

[Chart: Best Performing Reasoning Model and Best Performing LLM Model by sector. Sectors: Basic Materials, Communication Services, Consumer Cyclical, Consumer Defensive, Energy, Financial Services, Healthcare, Industrials, Real Estate, Technology, Utilities. Models compared: Llama4-Scout, DeepSeek-V3, Qwen3-235B, Gemini-2.5-flash, GPT-4.1, Mistral-24B.]

Key Takeaways

  • Performance on Finance Agent has increased significantly since its release, but there is still substantial room for improvement.

  • Claude Opus 4.1 (Thinking) was the best-performing model and is currently the only model to break 50%. However, it comes at a high cost of $4.40 per query.

  • Most models struggled with tool use in general, and with information retrieval in particular, leading to inaccurate answers; this was most notable for smaller models such as Llama 4 Scout and Mistral Small 3.1 (03/2025).

  • On average, models performed best on the simple quantitative (37.57% average accuracy) and qualitative retrieval (30.79% average accuracy) tasks. These tasks are easy but time-intensive for finance analysts.

  • On our hardest tasks, models performed much worse. Ten models scored 0% on the Trends task, and the best result on this task was only 38.1%, from Claude Sonnet 4 (Nonthinking).
