At this year’s AI for Finance Symposium, speakers aligned on the new foundations of reliable AI: evaluation discipline, randomness control, and context engineering
At this year’s AI for Finance Symposium - the 2nd Workshop on LLMs and Generative AI for Finance (ACM ICAIF ’25) - the focus was squarely on turning AI from prototypes into reliable workflow tools. Speakers from BlackRock, Fidelity, Lion Global, CLSA, IBM, and MIT Sloan converged on three themes: evaluation discipline, randomness control, and context engineering. Together, these themes captured how buy-side, sell-side, and research teams are tackling the practical challenges of bringing AI into production.
(1) Evaluation is becoming the core of AI deployment
Across buy- and sell-side teams, the focus has shifted from testing individual models to evaluating entire workflows. Speakers from Fidelity and Lion Global emphasized that answers must stay tied to evidence, behave consistently across multi-turn questions, and be testable with point-in-time data. Many firms described evaluation frameworks that combine reproducibility checks, backtestable retrieval, and clearer documentation of assumptions. The takeaway was consistent: the ability to evaluate models rigorously - not new model capability - determines whether AI reaches production in the investment process.
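The kind of workflow-level checks described above - answers tied to evidence, consistent across repeated runs - can be sketched in a few lines. Everything here is hypothetical (the `ask` callable stands in for a retrieval-plus-LLM pipeline; the evidence ids are invented), a minimal illustration rather than any firm's actual framework:

```python
# Minimal sketch of a workflow-level evaluation harness (all names hypothetical):
# checks that (a) answers cite retrieved evidence and (b) repeated runs agree.

def cites_evidence(answer: str, evidence_ids: set[str]) -> bool:
    """An answer passes only if it cites at least one retrieved evidence id."""
    return any(eid in answer for eid in evidence_ids)

def evaluate_workflow(ask, query: str, evidence_ids: set[str], runs: int = 3) -> dict:
    """Run the same query several times; score grounding and reproducibility."""
    answers = [ask(query) for _ in range(runs)]
    return {
        "grounded": all(cites_evidence(a, evidence_ids) for a in answers),
        "reproducible": len(set(answers)) == 1,
    }

# Toy stand-in for a pipeline, deterministic by construction.
def toy_ask(query: str) -> str:
    return f"Revenue rose 12% [DOC-10K-2023] for query: {query}"

result = evaluate_workflow(toy_ask, "What drove revenue growth?", {"DOC-10K-2023"})
```

A real harness would add point-in-time data snapshots so retrieval is backtestable, but the gating logic - grounded and reproducible, or the workflow does not ship - is the part the speakers emphasized.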
(2) Randomness control is now a gating requirement
A recurring theme, highlighted by IBM’s Raffi Khatchadourian and Rolando Franco, was the need to control nondeterminism. Investment workflows cannot rely on outputs that shift from run to run. Output drift - where a model gives different answers to the same question - showed up even in well-tuned systems. IBM demonstrated that smaller 7–8B models can deliver fully deterministic outputs, while larger models tend to “out-reason themselves” and lose stability. To manage this, teams are adding cross-provider validation, agreement tests across multiple models, drift monitoring, and multi-turn reproducibility checks. Firms like BlackRock stressed that stability and predictability matter more than raw capability. Deterministic behavior is becoming essential for compliance-heavy workflows, even if hybrid probabilistic workflows remain useful in research.
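Two of the controls mentioned - drift monitoring over repeated runs, and agreement tests across multiple models - reduce to simple metrics. The sketch below is illustrative only (the run outputs are toy strings, not real model answers):

```python
from collections import Counter

def drift_rate(outputs: list[str]) -> float:
    """Fraction of runs disagreeing with the modal answer (0.0 = fully stable)."""
    modal_count = Counter(outputs).most_common(1)[0][1]
    return 1.0 - modal_count / len(outputs)

def models_agree(outputs_by_model: dict[str, str]) -> bool:
    """Cross-provider agreement gate: every model must return the same answer."""
    return len(set(outputs_by_model.values())) == 1

# Four runs of one model on one question; one run drifted.
runs = ["A", "A", "A", "B"]
rate = drift_rate(runs)  # 0.25

# Cross-provider validation on the same question (provider names invented).
gate = models_agree({"provider_1": "A", "provider_2": "A"})
```

In a compliance-heavy workflow, a nonzero drift rate or a failed agreement gate would block the output from reaching the user rather than merely being logged.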
(3) Context engineering drives real workflow quality
Speakers across BlackRock, Bernstein, and CLSA showed that performance improves dramatically when AI systems are fed structured internal data and firm-specific frameworks. Teams are investing in cleaner retrieval pipelines, consistent formatting of filings and broker research, and deterministic extraction layers that reduce noise. Several firms demonstrated layered agents that separate retrieval, reasoning, memory, and validation. Adding lightweight memory systems often increased correctness from ~30% to ~85% by helping agents follow firm-specific processes rather than relying on one-shot prompts. The strongest deployments made clear that context, not prompting, drives reliability inside the investment workflow.
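The layered pattern described above - separate retrieval, reasoning, and validation stages, with a lightweight memory of firm-specific rules injected into the context - can be sketched as follows. Every name here is hypothetical, and the `reason` function is a deterministic stand-in for an LLM call, not any firm's actual system:

```python
# Hedged sketch of a layered agent: retrieval, reasoning, and validation are
# separate stages, and firm-specific "memory" rules are injected into the
# reasoning context rather than relying on a one-shot prompt.

FIRM_MEMORY = [
    "Always cite the filing section relied on.",
    "Express growth figures year-over-year.",
]

def retrieve(query: str, corpus: dict[str, str]) -> list[tuple[str, str]]:
    """Deterministic keyword retrieval over an internal document store."""
    return [(doc_id, text) for doc_id, text in corpus.items()
            if any(word in text.lower() for word in query.lower().split())]

def reason(query: str, evidence, memory) -> str:
    """Stand-in for the LLM call: builds an answer from memory + evidence."""
    sources = ", ".join(doc_id for doc_id, _ in evidence)
    return f"Answer to '{query}' [sources: {sources}] ({len(memory)} rules applied)"

def validate(answer: str, evidence) -> bool:
    """Validation layer: reject answers that cite no retrieved source."""
    return any(doc_id in answer for doc_id, _ in evidence)

corpus = {"10K-2023": "revenue grew 12% year-over-year"}
evidence = retrieve("revenue growth", corpus)
answer = reason("revenue growth", evidence, FIRM_MEMORY)
ok = validate(answer, evidence)
```

The design point is that each layer can be tested and swapped independently; the memory list is what lets the agent follow a firm's process across turns instead of rediscovering it per prompt.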
All of this shows that adoption is already underway across research, sales, and investment teams. The frontier is shifting from what AI can do to how reliably it can do it - and whether firms can build the evaluation discipline, randomness controls, and context pipelines needed to use AI consistently inside the investment process.
