LLM based Text extraction from PDFs

Vivek GuptaJune 16, 2026

I received CVs today for a project job interview. I tried to make LLMs help me. all had to run locally for obvious reasons. Here are the results

Vision Model Benchmark — Document Extraction

Task: Evaluate local LLMs for extracting structured data from scanned (image-only) applicant document PDFs for the NIH candidate-screening project.

Date: 2026-06-16

1. Objective

For the candidate-screening SPA, each applicant’s CV and certificate documents must be read and converted into structured data (name, education with graduation/postgrad years, work experience, skills) plus an exact text transcription.

A significant portion of the documents are scanned images embedded in PDFs (no selectable text layer), so the chosen model must have reliable vision capability — not just accept an image, but actually read and transcribe it.

This benchmark measures extraction speed and vision quality across the available models to pick the optimal one for an ~810-document batch run.

2. Method

Test set: 5 documents — all confirmed scanned/image-only PDFs (0 extractable text characters), single page each. This stresses vision capability uniformly.
Controlled input: Each PDF page was rendered to PNG (150 DPI) once, and the same image + identical prompt were sent to every model. Differences are therefore due to the model, not rendering or input.
Prompt: A single-shot instruction asking the model to return a JSON object with doc_type, full_name, education, graduation_year, postgrad_year, work_experience, skills, summary.
Success metric: Did the model extract the correct person name and structured fields? (A model that returns full_name = null and 0 skills has failed to read the image.)
Token budget: IMPORTANT. The Qwen3.6 and several Gemma-4 models are reasoning models — they emit a <think> / reasoning_content block before answering, and the server bills this against the same max_tokens completion budget. An initial run capped at 700 output tokens truncated them mid-answer (finish_reason = length), producing a false “blind” result. The reasoning models were therefore re-tested at 2500 tokens, and think vs. answer tokens are reported separately.

3. Results — Per-Document

Time to complete one document. ✅ = name correctly extracted (image successfully read); ❌ = blank/empty output.

Model	Doc 1	Doc 2	Doc 3	Doc 4	Doc 5
gemma-4-26b-a4b-it	10.9s ✅	ERR*	10.0s ✅	10.9s ✅	11.5s ✅
qwen/qwen3.6-35b-a3b (2500 tok)	32.1s ✅	29.4s ✅	14.6s ✅	16.9s ✅	22.0s ✅
google/gemma-4-e4b (2500 tok)	27.5s ✅	37.6s ✅	23.8s ✅	23.9s ✅	27.2s ✅
gemma-4-31b-it	64.8s ✅	61.0s ✅	64.5s ✅	73.3s ✅	71.7s ✅
google/gemma-4-12b (2500 tok)	69.3s ❌	85.0s ❌†	62.7s ✅	63.4s ❌	86.1s ❌†
google/gemma-4-26b-a4b-qat	16.3s ❌	10.4s ❌	10.3s ❌	10.2s ❌	10.3s ❌
sarvam-translate-i1	10.0s ❌	—	—	9.7s ❌	9.9s ❌

*Doc 2 error for gemma-4-26b-a4b-it was a JSON formatting glitch (Expecting ',' delimiter), not a vision failure — the model read the image correctly but emitted slightly malformed JSON. This is recoverable with a JSON-repair step.

†gemma-4-12b hit finish_reason=length on Docs 2 & 5 — its reasoning step consumed the entire token budget before producing an answer. It also misread names on Docs 1 & 4 (“Saurav Das”, “GUID DEV”), so it is genuinely inaccurate, not merely token-starved.

4. Results — Summary

Only models that accept image input are listed below. Several models reject images outright (HTTP 400) and are noted separately.

Model	Avg	Min	Max	Names extracted	Reads images?	Verdict
gemma-4-26b-a4b-it	10.8s	10.0s	11.5s	4 / 5	✅ Yes	🏆 Fastest + accurate
qwen/qwen3.6-35b-a3b	23.0s	14.6s	32.1s	5 / 5	✅ Yes	🏆 Most accurate
google/gemma-4-e4b	28.0s	23.8s	37.6s	5 / 5	✅ Yes	accurate (reasoning)
gemma-4-31b-it	67.1s	61.0s	73.3s	5 / 5	✅ Yes	accurate but slow
google/gemma-4-12b	73.3s	62.7s	86.1s	1 / 5	⚠️ Reads but inaccurate	reasoning-starved + misreads
google/gemma-4-26b-a4b-qat	11.5s	10.2s	16.3s	0 / 5	❌ No (blind)	unusable
sarvam-translate-i1	9.9s	9.7s	10.0s	0 / 5	❌ No	translation model, not fit

Models that reject image input entirely (HTTP 400 — “does not support images”):
qwen3.6-35b-a3b-nsc-ace-saber-mtplx-optimized-speed, liquid/lfm2-24b-a2b, sarvam-30b-uncensored-i1, medgemma-27b-text-it-mlx.

Inconclusive: qwen3.6-35b-a3b-mtp — re-run at raised token budget exceeded the 900s timeout (reasoning model, very slow). Behavior likely matches qwen/qwen3.6-35b-a3b but could not be confirmed.

5. Findings

Correction to the first pass

An initial run used a 700-token output cap. The Qwen3.6 and Gemma-4 reasoning models hit finish_reason=length and appeared “blind” (empty output). On closer inspection the server exposes the <think> step in a separate reasoning_content field (billed against the same completion budget) — the model had in fact begun reasoning correctly but was truncated before producing an answer. Re-testing qwen/qwen3.6-35b-a3b and google/gemma-4-e4b at 2500 tokens yielded 5/5 correct for both. They were never blind, only token-starved. Lesson: for reasoning models, set a high max_tokens (≥2500) and read the reasoning_content / completion_tokens_details.reasoning_tokens fields separately from the answer.

Three strong vision models

gemma-4-26b-a4b-it — fastest by a wide margin (10.8 s, very consistent 10.0–11.5 s), reads images correctly (4/5; the miss was a recoverable JSON glitch). Best throughput.
qwen/qwen3.6-35b-a3b — most accurate (5/5), ~2× slower (23 s avg) due to its reasoning step.
google/gemma-4-e4b — also 5/5 accurate, ~28 s avg (reasoning). A solid mid option.

Genuinely vision-blind model

google/gemma-4-26b-a4b-qat accepts the image, completes normally (finish=stop), but returns empty structured output (full_name = null, 0 skills). It cannot read scanned documents — confirmed blind, not token-starved. Its “optimized” naming is misleading for this task.

Inaccurate reasoning model

google/gemma-4-12b does read images but is both slow (73 s avg, heavy thinking — 9119 think tokens) and inaccurate (misreads names on 2 docs, truncated to finish=length on 2 others). Not usable.

Others

gemma-4-31b-it is accurate (5/5) but impractically slow (~67 s/doc). sarvam-translate-i1 is a translation model and returns empty. Several models simply reject image input.

6. Recommendation

There is a clear speed vs. accuracy trade-off between the two viable models:

Priority	Model	Why
Max throughput (810-doc batch)	gemma-4-26b-a4b-it	~10.8 s/doc, consistent; 1-in-5 JSON glitch is fixable
Max extraction accuracy	qwen/qwen3.6-35b-a3b	5/5 correct, but ~23 s/doc (requires 2500-token budget)

Suggested approach for the full pipeline:

Primary model: gemma-4-26b-a4b-it for all documents — fast enough to complete ~810 docs in reasonable time, with genuinely good vision.
Add a JSON-repair fallback so the occasional malformed-JSON output (the 1/5 glitch) never discards a whole document.
Optional accuracy pass: for the difficult scanned documents (illegible, dense mark sheets), run olmocr-2-7b-1025-mlx or qwen/qwen3.6-35b-a3b as a second-opinion extractor where the primary failed or looks incomplete.
Hybrid fast path: for the ~523 PDFs that already have a text layer, skip rendering and send the extracted text directly (~3–5 s); reserve vision for the ~287 scanned/image-only documents.

Estimated full-batch cost

With gemma-4-26b-a4b-it + the hybrid fast path:

~523 text-layer docs × ~4 s ≈ 35 min
~287 scanned/image docs × ~11 s ≈ 53 min
Total ≈ 90 min for all ~810 documents (cache is resume-safe, so the run is fully interruptible and re-runnable).

Note: if Qwen is used for the scanned subset instead, that portion rises to ~287 × 23 s ≈ 110 min (higher accuracy, slower).

Epidemiology & Technology

Search site