Benchmarking Agentic Financial Analysis
How well can frontier AI models perform real financial work? We evaluate leading models across in-house simulated RL environments spanning financial reasoning, SEC filing analysis, and complex Excel modeling tasks.
Leaderboard
Overall scores across all environments. Higher is better (0–1 scale).
Environments
Eleven environments testing financial knowledge, numerical reasoning, coding, and spreadsheet modeling.
Methodology
All evaluations are run using the verifiers framework. We evaluate 18 frontier models across 11 financial environments spanning knowledge, reasoning, and modeling.
Sampling & evaluation size
- QA benchmarks — We take the first N questions from each dataset: 100 for FAMMA, FinQA, and TAT-QA; 50 for FinanceBench, ConvFinQA, and FinQA-Code; 20 for Finance MCQ (the full set). Every model is evaluated on the same question subset for comparability.
- Excel modeling — Each environment (LBO, DCF KCorp SOTP, Precedent Transactions, PE Waterfall) contains a single expert-authored task. Every model receives the same template workbook and prompt.
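The subset selection described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual loader: the subset sizes come from the text, while the `SUBSET_SIZES` mapping and the `first_n` helper are hypothetical names introduced here.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")

# Subset sizes as stated in the text; the dictionary itself is illustrative.
SUBSET_SIZES = {
    "FAMMA": 100,
    "FinQA": 100,
    "TAT-QA": 100,
    "FinanceBench": 50,
    "ConvFinQA": 50,
    "FinQA-Code": 50,
    "Finance MCQ": 20,
}


def first_n(questions: Sequence[T], dataset: str) -> list[T]:
    """Return the first N questions for `dataset` (hypothetical helper).

    Taking a fixed prefix instead of a random draw means every model
    sees exactly the same questions without needing a shared seed.
    """
    n = SUBSET_SIZES[dataset]
    return list(questions[:n])
```

Because the selection is a deterministic prefix, re-running the evaluation for a new model reuses the same subset automatically.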
Environment types
- Single-turn QA (FAMMA, Finance MCQ) — No tools. The model answers a financial knowledge question directly.
- Tool-augmented QA (FinQA, TAT-QA, FinanceBench, ConvFinQA) — The model is given a calculate tool for arithmetic and can call it as many times as needed.
- Code execution QA (FinQA-Code) — Same questions as FinQA, but the model must write and execute Python code via an execute_python tool to compute its answer.
- Excel financial modeling (LBO, DCF, Precedent Txns, Waterfall) — The agent receives a template .xlsx workbook and builds a complete financial model using Python (openpyxl). Available tools: execute_python, list_files, read_file, and read_excel_sheet.
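To make the tool-augmented setting concrete, here is one way an arithmetic tool like calculate could be implemented. The page does not describe the tool's interface, so this sketch assumes it accepts a single arithmetic expression string and returns a number. It walks the parsed syntax tree rather than calling eval, so only plain arithmetic is accepted.

```python
import ast
import operator

# Operators the sketch accepts; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def calculate(expression: str) -> float:
    """Evaluate a plain arithmetic expression (assumed tool interface)."""

    def _eval(node: ast.AST) -> float:
        # Numeric literals only; booleans are excluded explicitly.
        if isinstance(node, ast.Constant):
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
            raise ValueError("only numeric literals are allowed")
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    tree = ast.parse(expression, mode="eval")
    return float(_eval(tree.body))


# Example: a year-over-year growth rate computed by a single tool call.
growth = calculate("(120 - 100) / 100")  # 0.2
```

Restricting the tool to arithmetic keeps it distinct from the execute_python tool used in the code execution setting, where the model writes full programs.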
Scoring
- QA tasks — Graded by numeric accuracy (2% tolerance), exact match, or fuzzy text comparison depending on the expected answer type. Score is 0 (incorrect) or 1 (correct).
- Excel tasks — Graded by cell-level comparison against a gold-standard solution, checking values, formulas, formatting, and sheet structure. Score is continuous from 0 to 1.
- Aggregation — Environment score = mean score across all evaluated questions. Overall model score = mean across all environments (equal weight). Pass rate = fraction of tasks scoring > 0.5.
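The scoring and aggregation rules above can be written out as a short sketch. It makes two assumptions the page does not confirm: the 2% tolerance is treated as relative to the expected answer, and an expected value of zero falls back to an absolute tolerance. All function names are hypothetical.

```python
from statistics import mean


def numeric_match(predicted: float, expected: float, rel_tol: float = 0.02) -> bool:
    """Check a numeric answer against a 2% tolerance (assumed to be relative)."""
    if expected == 0:
        # Assumption: use an absolute tolerance when the expected answer is zero.
        return abs(predicted) <= rel_tol
    return abs(predicted - expected) <= rel_tol * abs(expected)


def environment_score(question_scores: list[float]) -> float:
    """Environment score = mean score across all evaluated questions."""
    return mean(question_scores)


def overall_score(environment_scores: dict[str, float]) -> float:
    """Overall model score = equal-weight mean across environments."""
    return mean(environment_scores.values())


def pass_rate(task_scores: list[float]) -> float:
    """Pass rate = fraction of tasks scoring strictly above 0.5."""
    return sum(score > 0.5 for score in task_scores) / len(task_scores)


# Example: 101.5 is within 2% of 100, while 103 is not.
within = numeric_match(101.5, 100.0)   # True
outside = numeric_match(103.0, 100.0)  # False
```

Equal weighting means a single-task Excel environment counts as much toward the overall score as a 100-question QA environment, which is worth keeping in mind when comparing models with close overall scores.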
Reproducibility
- All model calls use temperature 0 (greedy decoding) where supported.
- Full agent trajectories (reasoning, tool calls, results) are logged and browsable above.