I received CVs today for a project job interview. I tried to make LLMs help me. all had to run locally for obvious reasons. Here are the results
Vision Model Benchmark — Document Extraction
Vision Model Benchmark — Document Extraction
Task: Evaluate local LLMs for extracting structured data from scanned (image-only) applicant document PDFs for the NIH candidate-screening project.
Date: 2026-06-16
1. Objective
For the candidate-screening SPA, each applicant’s CV and certificate documents must be read and converted into structured data (name, education with graduation/postgrad years, work experience, skills) plus an exact text transcription.
A significant portion of the documents are scanned images embedded in PDFs (no selectable text layer), so the chosen model must have reliable vision capability — not just accept an image, but actually read and transcribe it.
This benchmark measures extraction speed and vision quality across the available models to pick the optimal one for an ~810-document batch run.
2. Method
- Test set: 5 documents — all confirmed scanned/image-only PDFs (0 extractable text characters), single page each. This stresses vision capability uniformly.
- Controlled input: Each PDF page was rendered to PNG (150 DPI) once, and the same image + identical prompt were sent to every model. Differences are therefore due to the model, not rendering or input.
- Prompt: A single-shot instruction asking the model to return a JSON object with
doc_type,full_name,education,graduation_year,postgrad_year,work_experience,skills,summary. - Success metric: Did the model extract the correct person name and structured fields? (A model that returns
full_name = nulland 0 skills has failed to read the image.) - Token budget: IMPORTANT. The Qwen3.6 and several Gemma-4 models are reasoning models — they emit a
<think>/reasoning_contentblock before answering, and the server bills this against the samemax_tokenscompletion budget. An initial run capped at 700 output tokens truncated them mid-answer (finish_reason = length), producing a false “blind” result. The reasoning models were therefore re-tested at 2500 tokens, and think vs. answer tokens are reported separately.
3. Results — Per-Document
Time to complete one document. ✅ = name correctly extracted (image successfully read); ❌ = blank/empty output.
| Model | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 |
|---|---|---|---|---|---|
| gemma-4-26b-a4b-it | 10.9s ✅ | ERR* | 10.0s ✅ | 10.9s ✅ | 11.5s ✅ |
| qwen/qwen3.6-35b-a3b (2500 tok) | 32.1s ✅ | 29.4s ✅ | 14.6s ✅ | 16.9s ✅ | 22.0s ✅ |
| google/gemma-4-e4b (2500 tok) | 27.5s ✅ | 37.6s ✅ | 23.8s ✅ | 23.9s ✅ | 27.2s ✅ |
| gemma-4-31b-it | 64.8s ✅ | 61.0s ✅ | 64.5s ✅ | 73.3s ✅ | 71.7s ✅ |
| google/gemma-4-12b (2500 tok) | 69.3s ❌ | 85.0s ❌† | 62.7s ✅ | 63.4s ❌ | 86.1s ❌† |
| google/gemma-4-26b-a4b-qat | 16.3s ❌ | 10.4s ❌ | 10.3s ❌ | 10.2s ❌ | 10.3s ❌ |
| sarvam-translate-i1 | 10.0s ❌ | — | — | 9.7s ❌ | 9.9s ❌ |
*Doc 2 error for gemma-4-26b-a4b-it was a JSON formatting glitch (Expecting ',' delimiter), not a vision failure — the model read the image correctly but emitted slightly malformed JSON. This is recoverable with a JSON-repair step.
†gemma-4-12b hit finish_reason=length on Docs 2 & 5 — its reasoning step consumed the entire token budget before producing an answer. It also misread names on Docs 1 & 4 (“Saurav Das”, “GUID DEV”), so it is genuinely inaccurate, not merely token-starved.
4. Results — Summary
Only models that accept image input are listed below. Several models reject images outright (HTTP 400) and are noted separately.
| Model | Avg | Min | Max | Names extracted | Reads images? | Verdict |
|---|---|---|---|---|---|---|
| gemma-4-26b-a4b-it | 10.8s | 10.0s | 11.5s | 4 / 5 | ✅ Yes | 🏆 Fastest + accurate |
| qwen/qwen3.6-35b-a3b | 23.0s | 14.6s | 32.1s | 5 / 5 | ✅ Yes | 🏆 Most accurate |
| google/gemma-4-e4b | 28.0s | 23.8s | 37.6s | 5 / 5 | ✅ Yes | accurate (reasoning) |
| gemma-4-31b-it | 67.1s | 61.0s | 73.3s | 5 / 5 | ✅ Yes | accurate but slow |
| google/gemma-4-12b | 73.3s | 62.7s | 86.1s | 1 / 5 | ⚠️ Reads but inaccurate | reasoning-starved + misreads |
| google/gemma-4-26b-a4b-qat | 11.5s | 10.2s | 16.3s | 0 / 5 | ❌ No (blind) | unusable |
| sarvam-translate-i1 | 9.9s | 9.7s | 10.0s | 0 / 5 | ❌ No | translation model, not fit |
Models that reject image input entirely (HTTP 400 — “does not support images”):qwen3.6-35b-a3b-nsc-ace-saber-mtplx-optimized-speed, liquid/lfm2-24b-a2b, sarvam-30b-uncensored-i1, medgemma-27b-text-it-mlx.
Inconclusive: qwen3.6-35b-a3b-mtp — re-run at raised token budget exceeded the 900s timeout (reasoning model, very slow). Behavior likely matches qwen/qwen3.6-35b-a3b but could not be confirmed.
5. Findings
Correction to the first pass
An initial run used a 700-token output cap. The Qwen3.6 and Gemma-4 reasoning models hit finish_reason=length and appeared “blind” (empty output). On closer inspection the server exposes the <think> step in a separate reasoning_content field (billed against the same completion budget) — the model had in fact begun reasoning correctly but was truncated before producing an answer. Re-testing qwen/qwen3.6-35b-a3b and google/gemma-4-e4b at 2500 tokens yielded 5/5 correct for both. They were never blind, only token-starved. Lesson: for reasoning models, set a high max_tokens (≥2500) and read the reasoning_content / completion_tokens_details.reasoning_tokens fields separately from the answer.
Three strong vision models
gemma-4-26b-a4b-it— fastest by a wide margin (10.8 s, very consistent 10.0–11.5 s), reads images correctly (4/5; the miss was a recoverable JSON glitch). Best throughput.qwen/qwen3.6-35b-a3b— most accurate (5/5), ~2× slower (23 s avg) due to its reasoning step.google/gemma-4-e4b— also 5/5 accurate, ~28 s avg (reasoning). A solid mid option.
Genuinely vision-blind model
google/gemma-4-26b-a4b-qat accepts the image, completes normally (finish=stop), but returns empty structured output (full_name = null, 0 skills). It cannot read scanned documents — confirmed blind, not token-starved. Its “optimized” naming is misleading for this task.
Inaccurate reasoning model
google/gemma-4-12b does read images but is both slow (73 s avg, heavy thinking — 9119 think tokens) and inaccurate (misreads names on 2 docs, truncated to finish=length on 2 others). Not usable.
Others
gemma-4-31b-it is accurate (5/5) but impractically slow (~67 s/doc). sarvam-translate-i1 is a translation model and returns empty. Several models simply reject image input.
6. Recommendation
There is a clear speed vs. accuracy trade-off between the two viable models:
| Priority | Model | Why |
|---|---|---|
| Max throughput (810-doc batch) | gemma-4-26b-a4b-it | ~10.8 s/doc, consistent; 1-in-5 JSON glitch is fixable |
| Max extraction accuracy | qwen/qwen3.6-35b-a3b | 5/5 correct, but ~23 s/doc (requires 2500-token budget) |
Suggested approach for the full pipeline:
- Primary model:
gemma-4-26b-a4b-itfor all documents — fast enough to complete ~810 docs in reasonable time, with genuinely good vision. - Add a JSON-repair fallback so the occasional malformed-JSON output (the 1/5 glitch) never discards a whole document.
- Optional accuracy pass: for the difficult scanned documents (illegible, dense mark sheets), run
olmocr-2-7b-1025-mlxorqwen/qwen3.6-35b-a3bas a second-opinion extractor where the primary failed or looks incomplete. - Hybrid fast path: for the ~523 PDFs that already have a text layer, skip rendering and send the extracted text directly (~3–5 s); reserve vision for the ~287 scanned/image-only documents.
Estimated full-batch cost
With gemma-4-26b-a4b-it + the hybrid fast path:
- ~523 text-layer docs × ~4 s ≈ 35 min
- ~287 scanned/image docs × ~11 s ≈ 53 min
- Total ≈ 90 min for all ~810 documents (cache is resume-safe, so the run is fully interruptible and re-runnable).
Note: if Qwen is used for the scanned subset instead, that portion rises to ~287 × 23 s ≈ 110 min (higher accuracy, slower).
