IDP Core Bench

v1.0

Built by Nanonets. ~2,000 invoices, receipts, forms, and handwritten docs. Four tasks: field extraction from structured documents (KIE), OCR on printed and handwritten text, table cell extraction and structure parsing, and answering questions about document content (VQA). Overall = mean of all four.

- Models evaluated: 22
- Dataset size: ~2,000 documents
- Metrics: 4
- Source: View on GitHub

Overall Score = Average of KIE, OCR, Table, and VQA scores
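The overall score is a plain unweighted mean of the four task scores. A minimal sketch (the function name and rounding are illustrative, not taken from the benchmark code):

```python
# Sketch of the stated aggregation: Overall = mean of KIE, OCR, Table, VQA.
# Rounding to one decimal matches the precision shown in the leaderboard.
def overall_score(kie: float, ocr: float, table: float, vqa: float) -> float:
    """Unweighted mean of the four per-task scores, one decimal place."""
    return round((kie + ocr + table + vqa) / 4, 1)

print(overall_score(90.0, 80.0, 70.0, 60.0))  # -> 75.0
```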

Rankings

| # | Model | Provider | Overall | KIE | OCR | Table | VQA |
|---|-------|----------|---------|-----|-----|-------|-----|
| 1 | Gemini 3.1 Pro | Google | 89.6 | 86.8 | 82.8 | 96.4 | 85.0 |
| 2 | GPT-5.4 | OpenAI | 84.4 | 85.7 | 69.1 | 94.8 | 78.2 |
| 3 | Gemini-3-Pro | Google | 81.8 | 85.7 | 81.8 | 95.8 | 64.1 |
| 4 | Claude Sonnet 4.6 | Anthropic | 81.2 | 89.5 | 73.7 | 96.3 | 65.2 |
| 5 | Claude Opus 4.6 | Anthropic | 81.1 | 89.8 | 74.0 | 96.0 | 64.4 |
| 6 | Gemini-3-Flash | Google | 80.5 | 91.1 | 81.7 | 85.6 | 63.5 |
| 7 | GPT-5.2 | OpenAI | 77.4 | 87.5 | 72.8 | 86.0 | 63.5 |
| 8 | Qwen3.5-9B | Alibaba | 76.2 | 86.5 | 65.5 | 76.6 | 79.5 |
| 9 | GPT-4.1 | OpenAI | 74.7 | 87.1 | 75.6 | 73.1 | 63.0 |
| 10 | Qwen3.5-4B | Alibaba | 74.5 | 86.0 | 64.7 | 76.7 | 72.4 |
| 11 | Nanonets OCR2+ | Nanonets | 73.8 | 86.4 | 64.0 | 79.7 | 65.1 |
| 12 | GPT-5-Mini | OpenAI | 73.3 | 85.7 | 73.0 | 69.5 | 65.0 |
| 13 | Claude Haiku 4.5 | Anthropic | 72.9 | 85.6 | 65.0 | 81.7 | 59.2 |
| 14 | Ministral-8B | Mistral AI | 71.7 | 85.7 | 67.8 | 75.9 | 57.4 |
| 15 | Mistral Small 4 | Mistral AI | 68.5 | 78.3 | 57.4 | 67.6 | 77.9 |
| 16 | Qwen3.5-2B | Alibaba | 67.1 | 78.5 | 56.2 | 72.4 | 59.8 |
| 17 | GPT-5-Nano | OpenAI | 65.8 | 84.7 | 69.6 | 45.3 | 63.5 |
| 18 | Qwen3.5-0.8B | Alibaba | 61.2 | 75.9 | 62.9 | 59.9 | 52.4 |
| 19 | Pixtral-12B | Mistral AI | 59.0 | 76.2 | 54.8 | 47.5 | 57.5 |
| 20 | Llama-3.2-Vision-11B | Meta | 58.6 | 76.1 | 65.8 | 41.1 | 51.5 |
| 21 | GLM-OCR | Zhipu AI | 54.9 | 83.5 | 66.7 | 24.5 | 44.9 |
| 22 | Gemma-3-12B-IT | Google | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Metrics

KIE (higher is better)

Key Information Extraction accuracy on invoices, receipts, and forms using exact-match and fuzzy-match metrics.
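A common way to combine exact and fuzzy matching for field extraction is to award full credit for a normalized exact match and fall back to a character-similarity score above a threshold. This is an illustrative sketch only; the benchmark's actual scoring code and threshold are not shown here:

```python
from difflib import SequenceMatcher

# Hypothetical KIE field scorer: exact match after case/whitespace
# normalization, with a fuzzy fallback based on character similarity.
def kie_field_score(pred: str, gold: str, fuzzy_threshold: float = 0.9) -> float:
    """1.0 for a normalized exact match; the similarity ratio if it
    clears the threshold; otherwise 0.0."""
    norm = lambda s: " ".join(s.lower().split())
    if norm(pred) == norm(gold):
        return 1.0
    ratio = SequenceMatcher(None, norm(pred), norm(gold)).ratio()
    return ratio if ratio >= fuzzy_threshold else 0.0

print(kie_field_score("Invoice  #1234", "invoice #1234"))  # -> 1.0
```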

OCR (higher is better)

OCR accuracy on mixed handwritten and printed text documents.
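OCR accuracy is typically derived from edit distance between the predicted and reference transcripts; whether this benchmark uses character error rate (CER), word error rate, or another formulation is an assumption in the sketch below:

```python
# Minimal character-error-rate (CER) sketch: accuracy = 1 - CER.
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def ocr_accuracy(pred: str, gold: str) -> float:
    """1 - CER, clipped at 0, as a percentage of the reference length."""
    if not gold:
        return 100.0 if not pred else 0.0
    cer = edit_distance(pred, gold) / len(gold)
    return max(0.0, 1.0 - cer) * 100
```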

Table (higher is better)

Table understanding including cell-level extraction and structural parsing.
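Cell-level table extraction can be scored by treating each cell as a (row, col, text) triple and computing F1 over exact matches. This is a hypothetical sketch; the benchmark may use a structure-aware metric such as tree edit distance instead:

```python
# Hypothetical cell-level F1: each cell is a (row, col, text) triple,
# and a predicted cell counts only if it matches position and text exactly.
def cell_f1(pred_cells: set, gold_cells: set) -> float:
    """F1 over exact (row, col, text) cell matches."""
    if not pred_cells or not gold_cells:
        return 0.0
    tp = len(pred_cells & gold_cells)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_cells)
    recall = tp / len(gold_cells)
    return 2 * precision * recall / (precision + recall)

gold = {(0, 0, "Item"), (0, 1, "Qty"), (1, 0, "Widget"), (1, 1, "3")}
pred = {(0, 0, "Item"), (0, 1, "Qty"), (1, 0, "Widget"), (1, 1, "8")}
print(round(cell_f1(pred, gold), 2))  # 3 of 4 cells match -> 0.75
```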

VQA (higher is better)

Visual Question Answering requiring reasoning over document layout and content.
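Document VQA answers are usually scored by normalized exact match or a fuzzy variant such as ANLS. A minimal normalized-match sketch, assuming a simple lowercase/punctuation-stripping normalization (the benchmark's actual scorer may differ):

```python
import re

# Illustrative VQA answer matcher: case-insensitive comparison after
# stripping punctuation and collapsing whitespace into tokens.
def vqa_match(pred: str, gold: str) -> bool:
    """True if the normalized token sequences are identical."""
    norm = lambda s: re.sub(r"[^\w\s]", "", s.lower()).split()
    return norm(pred) == norm(gold)

print(vqa_match("March 5, 2024", "march 5 2024"))  # -> True
```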