IDP Core Bench

v1.0

Built by Nanonets. ~2,000 invoices, receipts, forms, and handwritten docs. Four tasks: field extraction from structured documents (KIE), OCR on printed and handwritten text, table cell extraction and structure parsing, and answering questions about document content (VQA). Overall = mean of all four.

- Models evaluated: 22
- Dataset size: ~2,000 documents
- Metrics: 4
- Source: View on GitHub

Overall Score = Average of KIE, OCR, Table, and VQA scores
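The overall score is a plain unweighted mean of the four task scores. A minimal sketch (the function name and rounding are illustrative, not taken from the benchmark code):

```python
# Sketch of the stated aggregation: Overall = mean of KIE, OCR, Table, VQA.
# Rounding to one decimal matches the precision shown in the leaderboard.
def overall_score(kie: float, ocr: float, table: float, vqa: float) -> float:
    """Unweighted mean of the four per-task scores, one decimal place."""
    return round((kie + ocr + table + vqa) / 4, 1)

print(overall_score(90.0, 80.0, 70.0, 60.0))  # -> 75.0
```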

Rankings

| # | Model | Provider | Overall | KIE | OCR | Table | VQA |
|---|-------|----------|---------|-----|-----|-------|-----|
| 1 | Gemini 3.1 Pro | Google | 89.6 | 86.8 | 82.8 | 96.4 | 85.0 |
| 2 | GPT-5.4 | OpenAI | 84.4 | 85.7 | 69.1 | 94.8 | 78.2 |
| 3 | Gemini-3-Pro | Google | 81.8 | 85.7 | 81.8 | 95.8 | 64.1 |
| 4 | Claude Sonnet 4.6 | Anthropic | 81.2 | 89.5 | 73.7 | 96.3 | 65.2 |
| 5 | Claude Opus 4.6 | Anthropic | 81.1 | 89.8 | 74.0 | 96.0 | 64.4 |
| 6 | Gemini-3-Flash | Google | 80.5 | 91.1 | 81.7 | 85.6 | 63.5 |
| 7 | GPT-5.2 | OpenAI | 77.4 | 87.5 | 72.8 | 86.0 | 63.5 |
| 8 | Qwen3.5-9B | Alibaba | 76.2 | 86.5 | 65.5 | 76.6 | 79.5 |
| 9 | GPT-4.1 | OpenAI | 74.7 | 87.1 | 75.6 | 73.1 | 63.0 |
| 10 | Qwen3.5-4B | Alibaba | 74.5 | 86.0 | 64.7 | 76.7 | 72.4 |
| 11 | Nanonets OCR2+ | Nanonets | 73.8 | 86.4 | 64.0 | 79.7 | 65.1 |
| 12 | GPT-5-Mini | OpenAI | 73.3 | 85.7 | 73.0 | 69.5 | 65.0 |
| 13 | Claude Haiku 4.5 | Anthropic | 72.9 | 85.6 | 65.0 | 81.7 | 59.2 |
| 14 | Ministral-8B | Mistral AI | 71.7 | 85.7 | 67.8 | 75.9 | 57.4 |
| 15 | Mistral Small 4 | Mistral AI | 68.5 | 78.3 | 57.4 | 67.6 | 77.9 |
| 16 | Qwen3.5-2B | Alibaba | 67.1 | 78.5 | 56.2 | 72.4 | 59.8 |
| 17 | GPT-5-Nano | OpenAI | 65.8 | 84.7 | 69.6 | 45.3 | 63.5 |
| 18 | Qwen3.5-0.8B | Alibaba | 61.2 | 75.9 | 62.9 | 59.9 | 52.4 |
| 19 | Pixtral-12B | Mistral AI | 59.0 | 76.2 | 54.8 | 47.5 | 57.5 |
| 20 | Llama-3.2-Vision-11B | Meta | 58.6 | 76.1 | 65.8 | 41.1 | 51.5 |
| 21 | GLM-OCR | Zhipu AI | 54.9 | 83.5 | 66.7 | 24.5 | 44.9 |
| 22 | Gemma-3-12B-IT | Google | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Metrics

KIE (higher is better)

Key Information Extraction accuracy on invoices, receipts, and forms using exact-match and fuzzy-match metrics.
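A common way to combine exact and fuzzy matching for field extraction is to award full credit for a normalized exact match and fall back to a character-similarity score above a threshold. This is an illustrative sketch only; the benchmark's actual scoring code and threshold are not shown here:

```python
from difflib import SequenceMatcher

# Hypothetical KIE field scorer: exact match after case/whitespace
# normalization, with a fuzzy fallback based on character similarity.
def kie_field_score(pred: str, gold: str, fuzzy_threshold: float = 0.9) -> float:
    """1.0 for a normalized exact match; the similarity ratio if it
    clears the threshold; otherwise 0.0."""
    norm = lambda s: " ".join(s.lower().split())
    if norm(pred) == norm(gold):
        return 1.0
    ratio = SequenceMatcher(None, norm(pred), norm(gold)).ratio()
    return ratio if ratio >= fuzzy_threshold else 0.0

print(kie_field_score("Invoice  #1234", "invoice #1234"))  # -> 1.0
```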

OCR (higher is better)

OCR accuracy on mixed handwritten and printed text documents.
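OCR accuracy is typically derived from edit distance between the predicted and reference transcripts; whether this benchmark uses character error rate (CER), word error rate, or another formulation is an assumption in the sketch below:

```python
# Minimal character-error-rate (CER) sketch: accuracy = 1 - CER.
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ocr_accuracy(pred: str, gold: str) -> float:
    """1 - CER, clipped at 0, as a percentage of the reference length."""
    if not gold:
        return 100.0 if not pred else 0.0
    cer = edit_distance(pred, gold) / len(gold)
    return max(0.0, 1.0 - cer) * 100
```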

Table (higher is better)

Table understanding including cell-level extraction and structural parsing.
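Cell-level table extraction can be scored by treating each cell as a (row, col, text) triple and computing F1 over exact matches. This is a hypothetical sketch; the benchmark may use a structure-aware metric such as tree edit distance instead:

```python
# Hypothetical cell-level F1: each cell is a (row, col, text) triple,
# and a predicted cell counts only if it matches position and text exactly.
def cell_f1(pred_cells: set, gold_cells: set) -> float:
    """F1 over exact (row, col, text) cell matches."""
    if not pred_cells or not gold_cells:
        return 0.0
    tp = len(pred_cells & gold_cells)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_cells)
    recall = tp / len(gold_cells)
    return 2 * precision * recall / (precision + recall)

gold = {(0, 0, "Item"), (0, 1, "Qty"), (1, 0, "Widget"), (1, 1, "3")}
pred = {(0, 0, "Item"), (0, 1, "Qty"), (1, 0, "Widget"), (1, 1, "8")}
print(round(cell_f1(pred, gold), 2))  # 3 of 4 cells match -> 0.75
```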

VQA (higher is better)

Visual Question Answering requiring reasoning over document layout and content.
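Document VQA answers are usually scored by normalized exact match or a fuzzy variant such as ANLS. A minimal normalized-match sketch, assuming a simple lowercase/punctuation-stripping normalization (the benchmark's actual scorer may differ):

```python
import re

# Illustrative VQA answer matcher: case-insensitive comparison after
# stripping punctuation and collapsing whitespace into tokens.
def vqa_match(pred: str, gold: str) -> bool:
    """True if the normalized token sequences are identical."""
    norm = lambda s: re.sub(r"[^\w\s]", "", s.lower()).split()
    return norm(pred) == norm(gold)

print(vqa_match("March 5, 2024", "march 5 2024"))  # -> True
```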