Epidemiology & Technology

LLM based Text extraction from PDFs

I received CVs today for a project job interview. I tried to make LLMs help me. all had to run locally for obvious reasons. Here are the results

Vision Model Benchmark — Document Extraction

Vision Model Benchmark — Document Extraction

Task: Evaluate local LLMs for extracting structured data from scanned (image-only) applicant document PDFs for the NIH candidate-screening project.

Date: 2026-06-16


1. Objective

For the candidate-screening SPA, each applicant’s CV and certificate documents must be read and converted into structured data (name, education with graduation/postgrad years, work experience, skills) plus an exact text transcription.

A significant portion of the documents are scanned images embedded in PDFs (no selectable text layer), so the chosen model must have reliable vision capability — not just accept an image, but actually read and transcribe it.

This benchmark measures extraction speed and vision quality across the available models to pick the optimal one for an ~810-document batch run.


2. Method

  • Test set: 5 documents — all confirmed scanned/image-only PDFs (0 extractable text characters), single page each. This stresses vision capability uniformly.
  • Controlled input: Each PDF page was rendered to PNG (150 DPI) once, and the same image + identical prompt were sent to every model. Differences are therefore due to the model, not rendering or input.
  • Prompt: A single-shot instruction asking the model to return a JSON object with doc_type, full_name, education, graduation_year, postgrad_year, work_experience, skills, summary.
  • Success metric: Did the model extract the correct person name and structured fields? (A model that returns full_name = null and 0 skills has failed to read the image.)
  • Token budget: IMPORTANT. The Qwen3.6 and several Gemma-4 models are reasoning models — they emit a <think> / reasoning_content block before answering, and the server bills this against the same max_tokens completion budget. An initial run capped at 700 output tokens truncated them mid-answer (finish_reason = length), producing a false “blind” result. The reasoning models were therefore re-tested at 2500 tokens, and think vs. answer tokens are reported separately.

3. Results — Per-Document

Time to complete one document. ✅ = name correctly extracted (image successfully read); ❌ = blank/empty output.

ModelDoc 1Doc 2Doc 3Doc 4Doc 5
gemma-4-26b-a4b-it10.9s ✅ERR*10.0s ✅10.9s ✅11.5s ✅
qwen/qwen3.6-35b-a3b (2500 tok)32.1s ✅29.4s ✅14.6s ✅16.9s ✅22.0s ✅
google/gemma-4-e4b (2500 tok)27.5s ✅37.6s ✅23.8s ✅23.9s ✅27.2s ✅
gemma-4-31b-it64.8s ✅61.0s ✅64.5s ✅73.3s ✅71.7s ✅
google/gemma-4-12b (2500 tok)69.3s ❌85.0s ❌†62.7s ✅63.4s ❌86.1s ❌†
google/gemma-4-26b-a4b-qat16.3s ❌10.4s ❌10.3s ❌10.2s ❌10.3s ❌
sarvam-translate-i110.0s ❌9.7s ❌9.9s ❌

*Doc 2 error for gemma-4-26b-a4b-it was a JSON formatting glitch (Expecting ',' delimiter), not a vision failure — the model read the image correctly but emitted slightly malformed JSON. This is recoverable with a JSON-repair step.

gemma-4-12b hit finish_reason=length on Docs 2 & 5 — its reasoning step consumed the entire token budget before producing an answer. It also misread names on Docs 1 & 4 (“Saurav Das”, “GUID DEV”), so it is genuinely inaccurate, not merely token-starved.


4. Results — Summary

Only models that accept image input are listed below. Several models reject images outright (HTTP 400) and are noted separately.

ModelAvgMinMaxNames extractedReads images?Verdict
gemma-4-26b-a4b-it10.8s10.0s11.5s4 / 5✅ Yes🏆 Fastest + accurate
qwen/qwen3.6-35b-a3b23.0s14.6s32.1s5 / 5✅ Yes🏆 Most accurate
google/gemma-4-e4b28.0s23.8s37.6s5 / 5✅ Yesaccurate (reasoning)
gemma-4-31b-it67.1s61.0s73.3s5 / 5✅ Yesaccurate but slow
google/gemma-4-12b73.3s62.7s86.1s1 / 5⚠️ Reads but inaccuratereasoning-starved + misreads
google/gemma-4-26b-a4b-qat11.5s10.2s16.3s0 / 5❌ No (blind)unusable
sarvam-translate-i19.9s9.7s10.0s0 / 5❌ Notranslation model, not fit

Models that reject image input entirely (HTTP 400 — “does not support images”):
qwen3.6-35b-a3b-nsc-ace-saber-mtplx-optimized-speed, liquid/lfm2-24b-a2b, sarvam-30b-uncensored-i1, medgemma-27b-text-it-mlx.

Inconclusive: qwen3.6-35b-a3b-mtp — re-run at raised token budget exceeded the 900s timeout (reasoning model, very slow). Behavior likely matches qwen/qwen3.6-35b-a3b but could not be confirmed.


5. Findings

Correction to the first pass

An initial run used a 700-token output cap. The Qwen3.6 and Gemma-4 reasoning models hit finish_reason=length and appeared “blind” (empty output). On closer inspection the server exposes the <think> step in a separate reasoning_content field (billed against the same completion budget) — the model had in fact begun reasoning correctly but was truncated before producing an answer. Re-testing qwen/qwen3.6-35b-a3b and google/gemma-4-e4b at 2500 tokens yielded 5/5 correct for both. They were never blind, only token-starved. Lesson: for reasoning models, set a high max_tokens (≥2500) and read the reasoning_content / completion_tokens_details.reasoning_tokens fields separately from the answer.

Three strong vision models

  • gemma-4-26b-a4b-it — fastest by a wide margin (10.8 s, very consistent 10.0–11.5 s), reads images correctly (4/5; the miss was a recoverable JSON glitch). Best throughput.
  • qwen/qwen3.6-35b-a3b — most accurate (5/5), ~2× slower (23 s avg) due to its reasoning step.
  • google/gemma-4-e4b — also 5/5 accurate, ~28 s avg (reasoning). A solid mid option.

Genuinely vision-blind model

google/gemma-4-26b-a4b-qat accepts the image, completes normally (finish=stop), but returns empty structured output (full_name = null, 0 skills). It cannot read scanned documents — confirmed blind, not token-starved. Its “optimized” naming is misleading for this task.

Inaccurate reasoning model

google/gemma-4-12b does read images but is both slow (73 s avg, heavy thinking — 9119 think tokens) and inaccurate (misreads names on 2 docs, truncated to finish=length on 2 others). Not usable.

Others

gemma-4-31b-it is accurate (5/5) but impractically slow (~67 s/doc). sarvam-translate-i1 is a translation model and returns empty. Several models simply reject image input.


6. Recommendation

There is a clear speed vs. accuracy trade-off between the two viable models:

PriorityModelWhy
Max throughput (810-doc batch)gemma-4-26b-a4b-it~10.8 s/doc, consistent; 1-in-5 JSON glitch is fixable
Max extraction accuracyqwen/qwen3.6-35b-a3b5/5 correct, but ~23 s/doc (requires 2500-token budget)

Suggested approach for the full pipeline:

  1. Primary model: gemma-4-26b-a4b-it for all documents — fast enough to complete ~810 docs in reasonable time, with genuinely good vision.
  2. Add a JSON-repair fallback so the occasional malformed-JSON output (the 1/5 glitch) never discards a whole document.
  3. Optional accuracy pass: for the difficult scanned documents (illegible, dense mark sheets), run olmocr-2-7b-1025-mlx or qwen/qwen3.6-35b-a3b as a second-opinion extractor where the primary failed or looks incomplete.
  4. Hybrid fast path: for the ~523 PDFs that already have a text layer, skip rendering and send the extracted text directly (~3–5 s); reserve vision for the ~287 scanned/image-only documents.

Estimated full-batch cost

With gemma-4-26b-a4b-it + the hybrid fast path:

  • ~523 text-layer docs × ~4 s ≈ 35 min
  • ~287 scanned/image docs × ~11 s ≈ 53 min
  • Total ≈ 90 min for all ~810 documents (cache is resume-safe, so the run is fully interruptible and re-runnable).

Note: if Qwen is used for the scanned subset instead, that portion rises to ~287 × 23 s ≈ 110 min (higher accuracy, slower).

Related Posts