Cross-Architecture Hallucination Detection in Vision-Language Model Document Extraction

Gunduzhan Acar
Mirasys LLC

March 17, 2026


Abstract

We present a method for detecting fabricated content (hallucinations) in vision-language model (VLM) document extraction by exploiting the asymmetric failure modes of architecturally distinct extraction engines. Unlike prior multi-engine approaches that use consensus voting, our method classifies each disagreement by diagnostic type — hallucination, substitution error, omission, or layout disagreement — based on which combination of engine architectures agrees or disagrees. We validate on 130+ pages from real legal documents using four engines (text-layer extraction, Tesseract OCR, Apple Vision Framework, and Qwen2.5-VL-72B), achieving 100% hallucination detection sensitivity and 0% false positive rate. We further demonstrate that 42% of documents exhibit asymmetric failure across engines, confirming that different architectures fail in predictably distinct ways on the same document.

Beyond hallucination detection, we present three additional techniques enabled by the cross-architecture framework: an invisible text layer detection method that identifies hidden content in PDF documents — including failed redactions, OCR layer tampering, and accidentally embedded metadata — by comparing text-layer extraction against VLM output; an embedded code verification method that decodes QR codes and barcodes within documents and cross-references decoded payloads against visible labels for tampering detection; and a composite Document Quality Score (DQS) that aggregates inter-engine agreement, confidence, extraction completeness, and disagreement density into a single reliability metric designed for legal evidence admissibility assessment under Daubert/Frye standards. Code and experimental protocols are available for reproducibility.

Keywords: vision-language models, hallucination detection, document processing, cross-architecture comparison, OCR, legal technology, invisible text detection, QR code verification, document quality score


1 Introduction

Vision-language models (VLMs) have demonstrated remarkable capability in extracting text from document images, including handwritten content, complex layouts, and non-text visual elements such as signatures and photographs. However, these models suffer from a fundamental limitation: autoregressive hallucination, wherein the model generates plausible text that does not exist in the source document.

In high-stakes domains — legal evidence processing, medical records, financial compliance — undetected hallucination can have severe consequences. A fabricated date, altered financial figure, or invented contract clause can undermine the evidentiary value of an entire document set.

Prior work on multi-engine document processing focuses on accuracy improvement through consensus. The MEMOE system [1] combines multiple OCR outputs using voting and evidence fusion. Abdulkader and Casey [2] use multiple OCR engines with learned error probability estimation. More recently, Finch-Zk [3] applies cross-model consistency to detect LLM hallucinations by comparing outputs from diverse language models.

These approaches share a key limitation: they treat inter-engine disagreements mainly as errors to be resolved through voting or probability estimation. Our working hypothesis is that different engine architectures fail in different ways, and that classifying the likely cause of a disagreement is more useful than simply flagging disagreement. In our 30-document false-positive test set, a simple disagreement rule produced a 23.3% false positive rate, while our diagnostic classification reduced that rate to 0.0% on the same sample.

Additionally, the cross-architecture comparison framework enables detection of document-tampering scenarios — invisible text layers and embedded code fraud — that are not addressed by prior systems, as well as a composite quality metric designed for legal evidence admissibility standards.

Contributions

  1. A diagnostic classification taxonomy for cross-architecture document extraction disagreements, classifying each disagreement by root cause (hallucination, substitution, omission, layout disagreement) based on the combination of engine responses.

  2. An effectively-scanned document threshold that prevents false positive hallucination flags on documents where deterministic engines have insufficient output for reliable comparison.

  3. Experimental validation on 130+ pages from legal documents using four architecturally distinct engines, with all synthetic hallucination cases in the evaluation set detected and no false positives in the tested real-document subset.

  4. An invisible text layer detection method that identifies hidden content in PDF documents — including failed redactions, OCR layer tampering, and accidentally embedded metadata — by detecting discrepancies between text-layer extraction and visual extraction.

  5. An embedded code verification method that decodes QR codes and barcodes in documents and compares decoded payloads against visible labels for tampering detection.

  6. A composite Document Quality Score (DQS) that aggregates inter-engine agreement, confidence, extraction completeness, and disagreement density into a single reliability metric designed for legal evidence admissibility assessment under Daubert/Frye standards.


2 Related Work

Multi-engine OCR. Combining multiple OCR engines for improved accuracy has been studied extensively. Zavorin et al. [1] present a multi-evidence, multi-engine system (MEMOE) that fuses outputs from multiple engines with evidence weighting. Abdulkader and Casey [2] train error probability estimators on multi-engine output mismatches. Lund and Ringger [4] apply voting-based approaches. These methods aim to produce a single best output; they do not classify the cause of disagreements or address VLM-specific hallucination.

VLM hallucination detection. Gunjal et al. [5] introduce M-HalDetect, a dataset for training hallucination detectors in VLMs. Their approach trains a classifier on annotated examples rather than using cross-architecture comparison. Li et al. [6] use self-consistency (sampling multiple outputs from the same model) for hallucination detection, which cannot detect systematic biases shared across samples.

Cross-model consistency. Finch-Zk [3] detects LLM hallucinations by comparing outputs from multiple language models responding to semantically equivalent prompts. Our method differs in three key respects: (1) we compare architecturally distinct engines (deterministic text extraction vs. OCR vs. VLM) rather than multiple LLMs of similar architecture; (2) we classify disagreements by diagnostic type rather than treating all inconsistencies as hallucinations; (3) we apply the method to document image extraction rather than general text generation.

PDF forensics. Existing PDF analysis tools (pdf-parser, pdfid) examine PDF structure for malicious content. These tools look at PDF structural elements and do not compare visual rendering against embedded data layers. They cannot detect invisible text that is structurally valid but visually hidden — the scenario addressed by our cross-architecture comparison approach in Section 3.6.

Barcode/QR verification. QR code decoders (pyzbar, OpenCV) decode embedded payloads. VLMs identify QR codes as visual objects but do not decode their contents. No prior system combines detection, decoding, and cross-referencing decoded content against visible document labels — the pipeline presented in Section 3.7.

Document quality metrics. Individual engine confidence scores (Tesseract word confidence, VLM token probability) provide per-engine reliability estimates. No prior system aggregates cross-architecture agreement into a composite score designed for legal evidence admissibility standards.

Document processing for legal applications. Prior legal document processing systems focus on extraction accuracy and search capability. To our knowledge, no prior system provides cross-architecture hallucination detection with diagnostic classification specifically designed for legal evidence admissibility requirements.


3 Method

3.1 Engine Taxonomy

We identify four classes of document extraction engine, each with a characteristic failure mode:

Engine Class | Architecture | Failure Mode
Text-layer (pdfplumber) | Direct character retrieval from PDF data | Returns nothing on scanned documents
Traditional OCR (Tesseract) | Character segmentation + pattern/neural recognition | Substitution errors (0↔O, 1↔l, rn↔m)
Neural OCR (Apple Vision) | On-device neural text recognition | Conservative refusal on ambiguous regions
VLM (Qwen2.5-VL-72B) | Autoregressive multimodal transformer | Hallucination — fabricating plausible content

The asymmetry of these failure modes is the foundation of our method: when engines disagree, the pattern of disagreement reveals the cause.

3.2 Diagnostic Classification

Given outputs from engines of classes A (text-layer), B (OCR), C (neural OCR), and D (VLM), each text region is classified by the following decision procedure:

Diagnostic Type | Definition | Severity | Recommended Action
VLM_HALLUCINATION | VLM produces text absent from all deterministic engines (A, B, C) | Critical | Flag for human review; do not use in downstream applications
VLM_OMISSION | Deterministic engines produce text absent from VLM output | High | Supplement VLM output with deterministic extraction
OCR_SUBSTITUTION | OCR differs from text-layer on specific characters matching known confusion patterns (0↔O, 1↔l, rn↔m) | Medium | Auto-correct using text-layer or VLM reading
LAYOUT_DISAGREEMENT | All engines agree on substantive content but differ in formatting or structural labels | Low | Dismiss — content is accurate
AMBIGUOUS | Deterministic engines produce insufficient output for reliable comparison | Indeterminate | Fall back to alternative verification
FULL_AGREEMENT | All engines agree on content and layout | None | Accept extraction with high confidence
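To make the cascade concrete, the decision procedure can be sketched as follows. This is a minimal illustration under simplifying assumptions (exact word-set comparison instead of fuzzy region alignment, and a single character-count guard standing in for the effectively-scanned check of Section 3.4); all function names are ours:

```python
# Known OCR confusion pairs (Section 3.1); illustrative, not exhaustive.
CONFUSIONS = [("0", "O"), ("1", "l"), ("rn", "m")]

def _normalize(text: str) -> str:
    """Collapse confusion pairs so pure substitution errors compare equal."""
    for a, b in CONFUSIONS:
        text = text.replace(a, b)
    return text

def classify_region(text_layer: str, ocr: str, neural_ocr: str, vlm: str,
                    min_det_chars: int = 100) -> str:
    """Classify one text region by the pattern of engine (dis)agreement."""
    deterministic = [text_layer, ocr, neural_ocr]

    # Effectively-scanned guard: too little deterministic output to compare.
    if sum(len(t) for t in deterministic) < min_det_chars:
        return "AMBIGUOUS"

    det_words = set(" ".join(deterministic).split())
    vlm_words = set(vlm.split())

    if vlm_words == det_words:
        return "FULL_AGREEMENT"

    # OCR differs from the text layer only on known confusion characters.
    if ocr != text_layer and _normalize(ocr) == _normalize(text_layer):
        return "OCR_SUBSTITUTION"

    extra = vlm_words - det_words      # VLM text no deterministic engine saw
    missing = det_words - vlm_words    # deterministic text the VLM dropped
    if extra and not missing:
        return "VLM_HALLUCINATION"
    if missing and not extra:
        return "VLM_OMISSION"

    # Mixed differences: the real system applies the layout filter of
    # Section 3.3 before settling on this label.
    return "LAYOUT_DISAGREEMENT"
```

The ordering matters: the ambiguity guard must run first, or every scanned page would be misread as hallucination.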

3.3 Layout Disagreement Filtering

VLMs frequently add structural labels ("[Header]", "[Main Content]") or reorganize document layout. We filter these using two criteria:

  • Overlap ratio threshold: When Jaccard word overlap between VLM and deterministic output exceeds τ (default 0.60), the core content matches.
  • Layout indicator detection: Additional VLM content is checked against a set of structural/layout terms. If the majority of disagreement consists of layout indicators, the region is classified as LAYOUT_DISAGREEMENT.
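The two criteria can be sketched as below; the indicator vocabulary shown is illustrative rather than the system's actual list:

```python
# Illustrative structural/layout vocabulary; the production list is larger.
LAYOUT_TERMS = {"header", "footer", "main", "content", "section",
                "title", "page", "column", "signature", "block"}

def jaccard(a: set, b: set) -> float:
    """Jaccard word overlap; 1.0 for two empty sets by convention."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def is_layout_disagreement(vlm_text: str, det_text: str,
                           tau: float = 0.60) -> bool:
    """True when a disagreement is layout-only: word overlap exceeds tau
    and the words the VLM added are mostly structural labels."""
    vlm_words = set(vlm_text.lower().split())
    det_words = set(det_text.lower().split())
    if jaccard(vlm_words, det_words) < tau:
        return False                       # core content does not match
    extra = vlm_words - det_words
    if not extra:
        return True                        # VLM added nothing new
    hits = sum(1 for w in extra if w.strip("[]():") in LAYOUT_TERMS)
    return hits / len(extra) > 0.5         # majority are layout indicators
```

For example, a VLM output that wraps correctly extracted contract text in "[Header]" and "[Main Content]" labels passes both tests and is dismissed, while added substantive words (a fabricated waiver clause) fail the majority test.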

This filter reduced our false positive rate from 23.3% to 0.0% on a 30-document test set.

3.4 Effectively-Scanned Document Threshold

When the deterministic engines produce fewer than n characters (default: 100) or less than p% of the VLM output length (default: 5%), the comparison is classified as AMBIGUOUS. This addresses the common case of scanned documents where text-layer extraction returns only header/footer metadata (15–61 characters) while the VLM extracts full page content (1,000–1,700 characters).

Without this threshold, every scanned document triggers hallucination flags. With it, scanned documents are correctly identified as requiring alternative verification.
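The guard reduces to a two-condition check; a minimal sketch (the function name is ours):

```python
def effectively_scanned(det_chars: int, vlm_chars: int,
                        min_chars: int = 100, min_ratio: float = 0.05) -> bool:
    """True when deterministic output is too sparse to compare against the
    VLM: fewer than min_chars characters, or under min_ratio of VLM length."""
    if det_chars < min_chars:
        return True
    return vlm_chars > 0 and det_chars / vlm_chars < min_ratio
```

A scanned page where pdfplumber returns a 61-character footer against 1,700 characters of VLM output is routed to AMBIGUOUS rather than flagged as hallucination.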

3.5 Document Quality Score

We compute a composite Document Quality Score (DQS):

DQS = w₁A + w₂C + w₃E − w₄D

where:

  • A = inter-engine agreement rate (proportion of text regions where all engines agree)
  • C = mean engine confidence (average of per-engine confidence scores)
  • E = extraction completeness (proportion of expected document content successfully extracted by at least one engine)
  • D = disagreement density (proportion of text regions with inter-engine disagreements, weighted by diagnostic severity from the classification taxonomy)
  • w₁, w₂, w₃, w₄ = configurable weights (defaults: 0.40, 0.30, 0.20, 0.10)

The score is bounded in [0.0, 1.0] and designed to map to actionable thresholds: documents scoring above 0.70 are suitable for use in legal proceedings with documented reliability; documents scoring below 0.70 require mandatory human review.
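A direct reading of the formula, with clamping to the stated bounds (the function name and the clamping behavior are our assumptions):

```python
def document_quality_score(agreement: float, confidence: float,
                           completeness: float, disagreement: float,
                           weights=(0.40, 0.30, 0.20, 0.10)) -> float:
    """DQS = w1*A + w2*C + w3*E - w4*D, with each component in [0, 1]."""
    w1, w2, w3, w4 = weights
    score = (w1 * agreement + w2 * confidence
             + w3 * completeness - w4 * disagreement)
    return max(0.0, min(1.0, score))  # clamp to the stated [0, 1] bounds
```

For instance, a document with 0.95 agreement, 0.90 mean confidence, 0.98 completeness, and 0.05 severity-weighted disagreement density scores 0.841, above the 0.70 review threshold.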

DQS differs from single-engine confidence in three ways. First, it aggregates information from multiple architecturally distinct engines, making it robust to systematic biases in any single engine. Second, it incorporates disagreement density as a negative factor — a document where three engines agree and one disagrees scores lower than a document where all four agree, even if all individual engines report high confidence. Third, the severity weighting from the diagnostic classification taxonomy means that a hallucination disagreement penalizes the score more heavily than a layout disagreement.

3.6 Invisible Text Layer Detection

We identify a novel attack vector: PDF documents containing text rendered as invisible (e.g., white text on white background, zero-font-size text, or text positioned off the visible page area). Text-layer extraction reads this content from the PDF data structure, but the VLM cannot see it on the rendered page. When text-layer extraction returns substantially more content than the VLM identifies (default threshold: text-layer > 2× VLM character count), the system flags a potential invisible text layer and extracts the difference — text present in the data layer but absent from VLM output — for review. This is the reverse of hallucination — the deterministic engine finds real content that is intentionally hidden.
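The trigger and extraction step can be sketched as below, under the simplifying assumption that hidden content is approximated by words present in the data layer but absent from the VLM output (a production system would diff positioned text spans):

```python
def detect_invisible_text(text_layer: str, vlm_text: str,
                          ratio: float = 2.0):
    """Flag a potential invisible text layer when the PDF data layer holds
    substantially more text than the VLM sees on the rendered page.
    Returns (flagged, hidden_words)."""
    flagged = len(text_layer) > ratio * len(vlm_text)
    if not flagged:
        return False, []
    # Words in the data layer that never appear in the visual extraction.
    hidden = sorted(set(text_layer.split()) - set(vlm_text.split()))
    return True, hidden
```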

This method addresses three documented scenarios in legal document processing:

Failed redaction detection. A common error in legal document production is the use of PDF annotation tools that draw opaque rectangles over sensitive content without removing the underlying text from the PDF data layer. The visual rendering shows a black box; the text-layer parser reads through it. Our method detects this by identifying text-layer content at positions that correspond to visually redacted regions. This is a documented, recurring problem in litigation that has resulted in the inadvertent disclosure of classified information, medical records, and attorney strategy notes.

OCR layer tampering detection. Searchable PDFs created by scanning contain an invisible OCR text layer behind the scanned image. If the visible image is subsequently altered (e.g., changing a dollar amount from $5,000 to $50,000) without updating the OCR layer, the two layers tell different stories. Cross-architecture comparison reveals the discrepancy: the VLM reads the altered image while the text-layer parser reads the original OCR data.

Accidental metadata leakage. Documents converted from Word to PDF may embed tracked changes, revision history, and margin comments as invisible metadata in the PDF data layer. These may contain privileged attorney mental impressions, draft edits, or strategy notes that were not intended for production. The invisible text detection method identifies documents where the data layer contains substantially more content than what is visually rendered.

3.7 Embedded Code Verification

QR codes, barcodes, and Data Matrix codes are detected and decoded using computer vision (OpenCV QRCodeDetector + pyzbar). The decoded payload is compared against any visible label adjacent to the code, extracted by the VLM. Mismatches are flagged as potential tampering. The system detects three categories of anomaly: URL domain anomalies (decoded URL does not match the domain implied by the visible label), non-standard protocol injection (javascript: URIs, data: URIs embedded in QR codes), and content discrepancies (decoded tracking numbers or identifiers that differ from printed versions).
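Once a payload has been decoded, the cross-referencing step can be sketched with standard-library URL parsing; detection and decoding (OpenCV QRCodeDetector, pyzbar) are assumed upstream, and the anomaly tags and heuristics here are illustrative:

```python
from urllib.parse import urlparse

# URI schemes that should never appear in a document QR code.
DANGEROUS_SCHEMES = {"javascript", "data", "vbscript"}

def verify_payload(decoded: str, visible_label: str) -> list:
    """Compare a decoded QR/barcode payload against the visible label and
    return a list of anomaly tags (empty list means no mismatch found)."""
    anomalies = []
    payload = decoded.strip()
    parsed = urlparse(payload)

    # Non-standard protocol injection: executable URI schemes in a code.
    if parsed.scheme.lower() in DANGEROUS_SCHEMES:
        anomalies.append("NON_STANDARD_PROTOCOL")

    label = visible_label.lower()
    if parsed.scheme in ("http", "https"):
        host = parsed.hostname or ""
        # URL domain anomaly: the label names a domain the payload is not on.
        # Conservative: fires only when the label actually contains a domain.
        label_domains = [w.strip(".,;:()") for w in label.split() if "." in w]
        if label_domains and not any(host.endswith(d) for d in label_domains):
            anomalies.append("URL_DOMAIN_MISMATCH")
    elif not anomalies and payload and payload.lower() not in label:
        # Content discrepancy: a non-URL payload (e.g., a tracking number)
        # should appear verbatim in the printed label.
        anomalies.append("CONTENT_DISCREPANCY")
    return anomalies
```

A real verifier would also normalize punycode and handle redirects; the sketch shows only the comparison logic.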

This addresses several practical scenarios: forged court documents where QR codes point to non-official domains rather than the legitimate court filing systems indicated by their labels; evidence chain tampering where printed tracking numbers on shipping labels or chain-of-custody documents differ from their encoded barcode equivalents; and phishing via discovery documents where QR codes direct recipients to credential-harvesting sites.


4 Experiments

4.1 Setup

Engines: pdfplumber 0.11.4, Tesseract 5.5.2, Apple Vision Framework (macOS, VNRecognizeTextRequest .accurate), Qwen2.5-VL-72B-Instruct (4-bit quantization, local GPU).

Corpus: 94+ real legal documents including contracts, court filings, police records, financial charts, identification documents, aerial photographs, and appellate briefs. All processing on local air-gapped hardware.

Validation: All results independently verified by a domain expert through visual comparison of original page images against extracted text. 20 specific findings documented.

4.2 Hallucination Sensitivity (EXP-003)

We constructed 17 synthetic fabrication cases across 7 categories: date fabrication (3), name/entity fabrication (3), financial data fabrication (2), legal term fabrication (2), address fabrication (2), subtle single-character changes (3), and complete paragraph fabrication (2). Three control cases with identical or formatting-only differences were included.

Results: 17/17 fabrications detected (100% sensitivity), 0/3 false positives on controls (100% specificity). Even single-digit changes (case number 12345 → 12346) were detected via overlap ratio drop. Meaning reversals ("shall not" → "may") were detected at 78.6% overlap.

4.3 False Positive Rate (EXP-004)

30 real documents processed through the full pipeline.

Before diagnostic classification: 7/30 flagged (23.3% false positive rate). Expert review confirmed all 7 were layout disagreements — VLM added structural labels to correctly-extracted content.

After diagnostic classification: 0/30 flagged (0.0% false positive rate). The taxonomy correctly distinguished layout differences from content fabrication.

This result suggests that diagnostic classification is essential for keeping false positives at practical levels, at least on documents similar to our test set.

4.4 Asymmetric Failure (EXP-009)

12 documents across 7 types processed through all 4 engines.

Engine | Extraction Rate | Failure Mode Observed
Text-layer | 7/12 (58%) | Total failure on scanned docs
Tesseract | 11/12 (92%) | Minimal output on photo+text
Apple Vision | 12/12 (100%) | No failures observed
VLM | 12/12 (100%) | Subject to hallucination

5/12 documents (42%) exhibited asymmetric failure — engines failed in different ways on the same page. On clean text-layer documents, all four engines agreed at 97–99% word overlap, validating the comparison methodology.

Cross-engine agreement varied predictably by document type: dense legal text (0.97 overlap), court covers (0.81), graphical timelines (0.72), financial charts (0.47), scanned contracts (pdfplumber: 0.00, Apple Vision vs. VLM: 0.95).

4.5 Adversarial Testing (EXP-010)

Five synthetic adversarial documents:

Invisible text (ADV-001): PDF with visible settlement agreement + hidden white-on-white clauses ("Party B waives all future claims"). Text-layer extraction found all hidden text (308 chars); VLM extracted only visible text (144 chars). Overlap: 0.472. The cross-architecture comparison successfully detected the hidden content. All hidden clauses — waiver of claims, penalty clause, governing law change — were identified. Zero false positives on clean documents in the control set.

Low contrast (ADV-004): The VLM read extremely faded text (RGB 230,230,230 on white 255,255,255), extracting 476 characters of case-specific legal information. This demonstrates VLM capability beyond human visual acuity at normal viewing conditions.

Mixed orientation (ADV-005): VLM correctly extracted horizontal text but missed rotated margin stamps, consistent with known VLM limitations on non-standard text orientation.

4.6 QR Code Fraud Detection (EXP-005)

94 documents scanned; 12 QR codes found in 7 documents, all decoded. Real court QR codes linked to legitimate PlainSite.org URLs (verified). Three simulated fraud scenarios: phishing URL (.tk domain), JavaScript injection (javascript: URI), domain mismatch (legitimate label, malicious URL). All 3 detected with 0 false positives on 3 legitimate controls.

4.7 Document Quality Score (EXP-006, EXP-007)

Consistency (EXP-006): DQS variance across pages of the same document ranged from 0.029 to 0.072, indicating stable scoring within documents.

Discrimination by document type (EXP-007): DQS correctly varies with document extraction difficulty:

Document Type | DQS | Interpretation
Dense legal text | 0.917 | High confidence — court-ready
Financial charts (mixed text/graphics) | 0.719 | Moderate confidence — near threshold
Aerial photographs with overlay text | 0.420 | Low confidence — correctly flagged for review

Clean text documents scored 0.90+; documents with known extraction problems scored below 0.70, confirming that the 0.70 threshold correctly separates reliable from unreliable extractions.

4.8 Signature Detection (EXP-008)

9 signatures detected across 18 pages with multi-attribute extraction: signer name, role, signature type (wet ink / digital), position. Expert review confirmed 6 correct name identifications, 2 unclear, 1 unknown. Limitation identified: VLM sometimes captures form field labels ("Signature of Process Server") while missing the adjacent handwritten signature.


5 Discussion

Diagnostic Classification Is Essential

The most significant finding is the impact of diagnostic classification on false positive rate: 23.3% without, 0.0% with. Prior multi-engine approaches that treat all disagreements as errors would produce unusable false positive rates when applied to VLM output, because VLMs routinely reorganize document layout without changing content. The classification taxonomy transforms an unusable system into a practical one.

To illustrate the practical magnitude: in a 2,000-page discovery set, a 23.3% false positive rate produces roughly 466 flagged pages requiring manual review. At approximately 3 minutes per page, that is over 23 hours of paralegal review — and the vast majority of those flags are layout disagreements, not real errors. Reviewer fatigue from false alarms increases the risk that genuine hallucinations are missed. With diagnostic classification, only actual content errors reach the reviewer. In our measurements, this reduces the review burden to as few as 1 flagged page — approximately 3 minutes instead of 23 hours.

Asymmetric Failure Validates Cross-Architecture Design

The 42% asymmetric failure rate is consistent with the view that different engine architectures fail in different ways. Text-layer extraction cannot process images; OCR does not hallucinate in the same autoregressive sense as a VLM; and VLMs can sometimes recover content that OCR misses. In our experiments, these complementary failure modes made cross-engine comparison more informative than relying on a single engine.

Invisible Text: A Novel Attack Vector

The invisible text layer detection method (Section 3.6) highlights document-tampering scenarios that deserve attention in document-processing pipelines. Failed redactions are a recurring and well-documented problem in legal document production — multiple reported cases involve inadvertent disclosure of classified information, protected medical records, and privileged attorney communications through redactions that covered text visually but left it intact in the PDF data layer. The method provides a receiving-side check that can identify these failures after document production — or, if applied by the producing party, can catch them before production.

The OCR layer tampering scenario is particularly significant: because searchable PDFs store the scanned image and the OCR text as independent layers, altering one without the other creates a detectable discrepancy. A malicious actor who changes the visible dollar amount on a scanned check from $5,000 to $50,000 may not realize that the original OCR layer still records the true figure — and cross-architecture comparison reveals the mismatch.

Embedded Code Verification Fills a Gap

The embedded code verification pipeline addresses a gap between two existing capabilities — QR/barcode decoding and VLM document understanding — that no prior system has bridged. The phishing and tampering scenarios described in Section 3.7 are realistic attack vectors that become increasingly relevant as QR codes proliferate in official documents.

DQS as an Admissibility Foundation

The DQS provides a quantitative answer to the Daubert requirement for a known error rate. In legal proceedings subject to Daubert v. Merrell Dow Pharmaceuticals (1993) or Frye v. United States (1923) admissibility standards, counsel must demonstrate a known error rate and testable methodology for any scientific or technical evidence. Rather than asserting that "the AI is confident," counsel can present a specific composite score with documented methodology, per-component breakdowns, and validated thresholds. This does not guarantee admissibility — which remains a judicial determination — but provides a structured evidentiary foundation that pure single-engine confidence scores cannot.

Limitations

Our method requires at least one deterministic engine to produce meaningful output for hallucination detection to function. On pure images (photographs, scanned documents with no text layer and poor OCR quality), the system falls back to VLM-only extraction with reduced confidence, relying on human review. The effectively-scanned threshold (Section 3.4) addresses the most common case but does not resolve the fundamental limitation.

VLM processing time (60–120 seconds per page at 72B parameters, 4-bit quantization, local GPU) limits the method to batch processing rather than real-time applications.

The method detects fabricated content but does not detect fabricated structure — a VLM that correctly extracts all text but misassociates dates with events (observed in our timeline documents) produces a different class of error that cross-architecture comparison can flag but not definitively classify.

Our validation corpus consists of 130+ pages from real legal documents. While the results are encouraging — 100% sensitivity, 0% false positives — the corpus size limits the strength of statistical claims, and validation on larger and more diverse document sets is needed.


6 Conclusion

We have presented a method for detecting potential VLM hallucinations in document extraction by comparing architecturally distinct extraction engines and classifying the resulting disagreements. In our evaluation, the diagnostic taxonomy — distinguishing hallucination from layout disagreement, substitution error, and omission — reduced false positives from 23.3% to 0.0% on the tested subset while detecting all synthetic hallucination cases included in the study.

Beyond hallucination detection, the cross-architecture framework enables invisible text layer detection for identifying failed redactions, OCR layer tampering, and accidental metadata leakage; embedded code verification for detecting QR code and barcode fraud in legal documents; and a composite Document Quality Score that provides quantitative reliability metrics suitable for legal evidence admissibility under Daubert/Frye standards.

The results suggest that cross-architecture comparison is a useful safety check for document extraction workflows, particularly in legal-document settings. We publish this method as a contribution to the document-processing community and as a basis for further validation and comparison by other researchers and practitioners.


References

[1] I. Zavorin, E. Borovikov, A. Borovikov, and L. Hernandez, "A multi-evidence, multi-engine OCR system," in Proc. SPIE 6500, Document Recognition and Retrieval XIV, 2007.

[2] A. Abdulkader and M. R. Casey, "Efficient identification and correction of optical character recognition errors through learning in a multi-engine environment," US Patent 8,331,739, Dec. 2012.

[3] A. Goel, D. Schwartz, and Y. Qi, "Zero-knowledge LLM hallucination detection and mitigation through fine-grained cross-model consistency," in Proc. EMNLP 2025 Industry Track, 2025.

[4] W. B. Lund and E. K. Ringger, "Improving optical character recognition through efficient multiple system alignment," in Proc. JCDL, 2009.

[5] A. Gunjal, J. Yin, and E. Bas, "Detecting and preventing hallucinations in large vision language models," in Proc. AAAI, 2024.

[6] J. Li et al., "Evaluating object hallucination in large vision-language models," in Proc. EMNLP, 2023.