Abstract
Do language models make decisions under uncertainty like humans do? And if so,
what role does extended reasoning play in the underlying decision process? We address these questions by
introducing an active probabilistic reasoning task that cleanly separates sampling (actively acquiring
evidence) from inference (integrating evidence towards a decision). Benchmarking humans and a broad set of
contemporary LLMs against optimal reference policies reveals a consistent pattern: extended reasoning is the
key determinant of strong performance, driving large gains in inference, while yielding only modest
improvements in active sampling. To explain these differences, we fit a behavioral model that captures
systematic deviations from optimal Bayesian behavior through interpretable parameter families, placing
humans and models in a shared low-dimensional cognitive space. The resulting fits show how reasoning shifts
models toward human-like regimes of evidence accumulation and belief-to-choice mapping, and yield testable
predictions about the latent dynamics that might drive each decision. Probing residual-stream activations of
an open-weight reasoning model, we find that the geometry of internal representations tracks these predicted
dynamics, linking behavior to its representational correlates.