Large language models (LLMs) are getting surprisingly good at the basics of trading. The latest models can price simple scenarios, reason through rules and even outline basic strategies.
That gave trader Quintin Brown an idea. For his first project after moving to Optiver’s Applied AI team, Quintin decided to give the LLMs the exam we give to our interns. To push their limits, Quintin also ran the models through a mock assessment designed to test traders after six months on the job.
The questions were simple: How good at trading is the current generation of LLMs? And where is there room for improvement?
After a battery of tests and experiments, the answer was nuanced. LLMs are already highly useful in some areas, but still have limitations in others.
Recent LLM models showed improvement over older models on the exam. The median Optiver intern scored 61%. Source: Optiver
Good at the basics

On theory-style questions, the models performed admirably well – good enough to give our interns a run for their money. They could handle most mathematical problems and consistently demonstrated a solid understanding of trading concepts.
In fact, the performance of the models was broadly comparable to our human interns in many areas, with errors concentrated in more complex edge cases, such as in event pricing (making markets around elections or data releases, for instance), or when we challenged them with more nuanced market structure questions.
We also tested the models in simulated trading environments. These market-making games are designed to help our interns start thinking like traders. Here too, the models held up well. They could calculate fair value, identify situations when they might be adversely selected, and they were able to describe the role of a market maker as someone aiming to capture the bid/ask spread while managing risk.
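The market-maker role described in these games can be sketched in a few lines. The toy below is my own illustration, not Optiver's actual logic, and every parameter is made up: quote a bid and ask around an estimated fair value, skew the quotes to shed unwanted inventory, and pull a side once the position limit is hit.

```python
def make_quotes(fair_value, half_spread, inventory, max_inventory,
                skew_per_unit=0.01):
    """Quote around fair value, skewing prices to manage inventory risk.

    Toy behaviour: a long position lowers both quotes (encouraging sells),
    and a side is pulled once inventory reaches the risk limit.
    """
    skew = -inventory * skew_per_unit
    bid = fair_value + skew - half_spread
    ask = fair_value + skew + half_spread
    if inventory >= max_inventory:
        bid = None   # too long: stop buying
    if inventory <= -max_inventory:
        ask = None   # too short: stop selling
    return bid, ask

# Slightly long (3 units), so both quotes sit a touch below fair value.
bid, ask = make_quotes(fair_value=100.0, half_spread=0.05,
                       inventory=3, max_inventory=10)
```

The skew term is the simplest possible risk control; the point is that capturing the spread and managing inventory live in the same quoting decision.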
In short, after the initial batch of tests, we concluded the LLMs were already very capable at a number of core trading tasks. In many areas, they were able to perform at the level we expect of our interns.
So far, so good.
Where things break down

Trading is an interesting challenge because it requires you to continuously update your view of the world. It’s not a one-and-done exercise. It’s a constant process of updating beliefs, reacting to others and making decisions under uncertainty. The ability to do this is one of the key skills that we select for and develop in our traders.
It’s also where the current models show the most limitations.
Across multiple scenarios, some of the most common mistakes we observed the models making were:

- Updating correctly on an initial signal, then failing to update again when further information arrived
- Losing the thread across multi-step reasoning chains
- Producing decisions that were directionally correct but poorly sequenced
For example, here’s a behavior we observed repeatedly: A model would correctly adjust its pricing after ingesting new information. It would then use that information to execute a trade. But when additional signals appeared, it would fail to update its view again. This kind of breakdown in multi-step reasoning happened pretty frequently.
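This failure pattern is easiest to see against a correct baseline. Here is a minimal sketch of sequential Bayesian updating (my own illustration, not our test harness): each signal is treated as a likelihood ratio that multiplies the odds, and pricing should track the belief after every signal, not just the first.

```python
def update_belief(p_up, likelihood_ratio):
    """One Bayesian update of P(market moves up).

    likelihood_ratio > 1 means the signal favours 'up'; < 1 favours 'down'.
    """
    odds = p_up / (1.0 - p_up)
    odds *= likelihood_ratio
    return odds / (1.0 + odds)

p = 0.5                          # start with no directional view
for lr in [2.0, 1.5, 0.4]:       # three successive, illustrative signals
    p = update_belief(p, lr)     # the third signal partly reverses the view
```

The key property is in the loop: the third signal pulls the belief back down, which is exactly the re-update step the models tended to skip.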
The result was strategies that were often directionally correct but fell short in sequential decision-making.
More recent LLM models showed improvements in a trading simulation designed to test traders after six months, but fell short of actual Optiver traders. Source: Optiver
The EV problem

Another consistent issue we encountered with the models was the way they approached expected value (EV). In trading, EV determines whether a quote or trade is worth making. Our traders are trained to weigh outcomes by probability and to act when EV is positive.
The models definitely understand this concept. They can explain that profit comes from identifying and trading when you think you have an edge. But a limitation we observed was that in practice they often defaulted to conservative or heuristic-driven decisions that left money on the table.
Instead of fully committing to positive-EV trades, the models tended toward overly conservative strategies: trading smaller than optimal, or prioritizing hedging over pursuing the opportunity.
This highlights a gap between understanding what drives profitability and consistently executing on it. In testing, EV maximization was one of the weakest areas of performance for the models. In other words, the intuition is there, but the precision needed to execute optimally is still developing.
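To make the gap concrete: the EV arithmetic itself is simple, and the hard part is committing size when it is positive. The numbers below are entirely hypothetical, and the sizing rule is a deliberately crude stand-in for real position sizing.

```python
def expected_value(price, p_up, payoff_up, payoff_down):
    """EV per unit of buying at `price` under a two-outcome model."""
    return p_up * payoff_up + (1.0 - p_up) * payoff_down - price

# 60% chance the asset is worth 15, 40% chance it is worth 5; offered at 10.
ev = expected_value(price=10.0, p_up=0.6, payoff_up=15.0, payoff_down=5.0)

# Toy sizing rule: trade nothing on negative EV, scale up with the edge.
size = max(0.0, ev) * 10.0
```

The models could produce the first function reliably; it was the second step, turning a positive edge into appropriately sized action, where they left money on the table.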
Adverse selection

The models were reasonably good at identifying adverse selection. That’s the term for the situation in trading where an information asymmetry exists between you and your counterparty.
Identifying “informed flow” is one of the most common and interesting challenges traders face. The models, to their credit, recognized the risks of adverse selection and could accurately describe how to adjust their pricing in response.
They didn’t always act on that insight consistently, though. For instance, even after correctly identifying an informed counterparty, they still chose to trade with that counterparty at levels that implied negative EV. Or they didn’t account for how their own actions influenced the behavior of others in the market.
That ties into a broader limitation we observed among the models. The LLMs were pretty good at categorizing other market participants (such as informed traders, position-driven traders and liquidity providers). But they struggled to simulate how those participants would react over time. Instead, they often relied on simplified or optimistic assumptions about how they would behave, rather than the kind of realistic, adversarial thinking we expect of our traders.
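Acting on that insight can be reduced to a simple rule of thumb. The sketch below is my own simplification, not a production model: widen the quoted half-spread by the expected cost of trading against informed flow, and pull the quote once that cost makes any spread uneconomical.

```python
def adjust_half_spread(base_half_spread, p_informed, informed_cost,
                       max_half_spread=0.50):
    """Widen the half-spread by the expected adverse-selection cost.

    Returns None (pull the quote) when the required spread exceeds the cap.
    """
    required = base_half_spread + p_informed * informed_cost
    if required > max_half_spread:
        return None
    return required

# Mild suspicion of informed flow: widen a little.
mild = adjust_half_spread(0.05, p_informed=0.1, informed_cost=0.2)

# Strong suspicion: no quotable spread covers the cost, so stop quoting.
pulled = adjust_half_spread(0.05, p_informed=0.9, informed_cost=0.6)
```

The models could describe this adjustment; the failure was continuing to trade at the old levels even after estimating `p_informed` to be high.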
Sensible, but still developing depth

When asked to outline trading strategies, the models all followed the same structure: observe → infer → act. But while sensible, we often found the resulting strategies to be lacking in depth. For instance, they rarely demonstrated the more nuanced aspects of how our traders think and make decisions, such as:

- How aggressively to quote and how large to trade
- When to hedge and when to carry risk for a positive-EV opportunity
- How other participants are likely to react to their actions
This led the LLMs to produce strategies that sounded reasonable on paper but often lacked the precision needed for real execution. Their plans were coherent but underspecified.
What our experiments tell us

Today’s models are good at:

- Explaining core trading concepts and answering theory-style questions
- Handling most of the underlying mathematics, including calculating fair value
- Recognizing risks such as adverse selection and classifying other market participants
They’re currently less reliable at:

- Continuously updating beliefs as new information arrives
- Committing to and correctly sizing positive-EV trades
- Acting consistently on adverse selection once it’s identified
- Anticipating how other participants will react over time
Despite these shortcomings, the current systems are already widely used within Optiver for analysis, prototyping and education. And their role continues to expand.
At the same time, our experiments highlight where human judgment still matters most today. Trading isn’t just about knowing the right answer. It’s about updating one’s probabilistic model as the world changes, anticipating others’ behavior and committing to decisions under uncertainty.
That combination of sequential reasoning, probabilistic thinking and adversarial awareness remains difficult to replicate.
For now at least, traders still have the edge. But the trajectory is clear, and the opportunities to combine human expertise with AI systems grow more significant every day.