AI Document Processing Pipeline — Case Study

The challenge.

A design procurement team was receiving hundreds of documents per week — invoices, purchase orders, supplier spec sheets, fabric swatch confirmations, and shipping manifests — each in a different format. PDFs, scanned images, Excel exports, Word docs, and emailed HTML tables all arrived with no consistent schema.

The team manually read each document, extracted key fields (vendor name, PO number, line items, quantities, unit prices, lead times, shipping terms), and re-entered them into a procurement database. This consumed roughly 40 hours per week across three staff members, with error rates averaging 12% on line-item quantities and unit prices.

The specific problems

No single document format — PDF, Word, Excel, HTML, and scanned images all arrived with different structures
Handwritten annotations on printed PDFs and spec sheets had to be interpreted and entered manually
Extraction errors on line-item quantities and unit prices created downstream order discrepancies and rework
No audit trail existed — when an order quantity was wrong, the team couldn't trace back to the source document

What was built.

A multi-format document ingestion and processing pipeline with AI classification, field extraction, human review, and a full audit trail.

Document Ingestion & Classification

Built a Laravel-based ingestion API that accepts uploads in 6 formats (PDF, PNG/JPG, DOCX, XLSX, HTML, plain text). Each document passes through an AI classifier that determines its type (invoice, PO, spec sheet, shipping manifest, or unknown) and routes it to the appropriate extraction pipeline. Implemented SQS-backed async processing with configurable concurrency, per-document retry limits, and dead-letter queues for malformed files.
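The routing and retry behaviour described above can be sketched roughly as follows. This is an illustration in TypeScript, not the project's code: the type names, pipeline labels, the `manual-triage` route for unknown documents, and the limit of 3 attempts are all assumptions. In a real SQS setup, the retry limit and dead-letter hand-off are usually configured through the queue's redrive policy (`maxReceiveCount`) rather than written in application code.

```typescript
// Hypothetical names throughout; the source does not show its real types.
type DocType = "invoice" | "purchase_order" | "spec_sheet" | "shipping_manifest" | "unknown";

interface Job {
  documentId: string;
  docType: DocType; // result of the AI classifier
  attempts: number; // how many times this job has already failed
}

type Outcome =
  | { status: "done"; pipeline: string }
  | { status: "retry"; attempts: number }
  | { status: "dead_letter"; reason: string };

// Maps each classified type to an extraction pipeline (illustrative labels).
const PIPELINES: Record<DocType, string> = {
  invoice: "invoice-extractor",
  purchase_order: "po-extractor",
  spec_sheet: "spec-sheet-extractor",
  shipping_manifest: "manifest-extractor",
  unknown: "manual-triage", // assumption: unknown documents go to a person
};

const MAX_ATTEMPTS = 3; // assumed per-document retry limit

// Runs one job: route by type, then either finish, schedule a retry,
// or give up and dead-letter the document once the limit is reached.
function processJob(
  job: Job,
  extract: (pipeline: string, documentId: string) => void,
): Outcome {
  const pipeline = PIPELINES[job.docType];
  try {
    extract(pipeline, job.documentId);
    return { status: "done", pipeline };
  } catch (err) {
    const attempts = job.attempts + 1;
    if (attempts >= MAX_ATTEMPTS) {
      const reason = err instanceof Error ? err.message : String(err);
      return { status: "dead_letter", reason };
    }
    return { status: "retry", attempts };
  }
}
```

Returning an outcome instead of throwing keeps the decision (finish, retry, dead-letter) visible to the caller, which is also where each step could be logged.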

AI Field Extraction

Designed a multi-stage extraction engine: OCR pre-processing for scanned images and PDFs (Tesseract + layout analysis), then OpenAI structured-output prompts tuned per document type. Each prompt targets 12–18 key fields depending on document category — vendor info, PO number, ship-to address, line items with quantities and unit prices, lead times, shipping terms, and totals. Added confidence scoring per field and a fallback to a rules-based regex parser for predictably formatted fields such as PO numbers and dates.
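One way to read the regex fallback is sketched below: when the model's confidence for a field is low and that field has a known pattern, take the value from the OCR text instead. The field names, the `PO-` number format, the ISO date format, the 0.8 threshold, and the choice to treat a regex hit as fully confident are all assumptions for illustration. The source does not describe its actual patterns or scoring.

```typescript
interface ExtractedField {
  name: string;
  value: string | null;
  confidence: number; // 0..1, as reported by the extraction stage
  source: "model" | "regex";
}

// Assumed formats: PO numbers like "PO-20481", ISO dates like "2024-03-15".
const FALLBACK_PATTERNS: Record<string, RegExp> = {
  po_number: /\bPO-\d{4,10}\b/,
  order_date: /\b\d{4}-\d{2}-\d{2}\b/,
};

const FALLBACK_THRESHOLD = 0.8; // assumed: below this, try the regex parser

// Returns the field unchanged unless it is low-confidence, has a known
// pattern, and that pattern actually matches the OCR text.
function applyRegexFallback(field: ExtractedField, ocrText: string): ExtractedField {
  const pattern = FALLBACK_PATTERNS[field.name];
  if (!pattern || field.confidence >= FALLBACK_THRESHOLD) return field;
  const match = ocrText.match(pattern);
  if (!match) return field;
  // Assumption: a pattern match is treated as certain.
  return { ...field, value: match[0], confidence: 1, source: "regex" };
}
```

Keeping `source` on each field means the review screen and the audit trail can show whether a value came from the model or from the rules-based parser.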

Human Review Queue

Built a web-based review dashboard where procurement staff can inspect extracted fields side-by-side with the source document (rendered as PDF or image). Fields below a configurable confidence threshold are highlighted for mandatory review. Reviewers can approve, edit, or flag fields, and every action is logged with a timestamp and user ID. Implemented a bulk-approve workflow for batch uploads where extraction confidence exceeds 98% across all fields.
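The two review rules above (flag fields below a configurable threshold, allow bulk approval only when every field exceeds 98%) come down to a pair of small checks. The sketch below uses hypothetical type and function names, and assumes confidence is stored as a number between 0 and 1.

```typescript
interface FieldResult {
  name: string;
  confidence: number; // 0..1
}

const BULK_APPROVE_THRESHOLD = 0.98; // from the text: confidence must exceed 98%

// Names of the fields a reviewer must look at for one document.
function fieldsNeedingReview(fields: FieldResult[], threshold: number): string[] {
  return fields.filter((f) => f.confidence < threshold).map((f) => f.name);
}

// A batch qualifies for bulk approval only if every field of every
// document is above the bulk threshold. Empty batches and documents with
// no extracted fields never qualify (an assumption, chosen to fail safe).
function canBulkApprove(batch: FieldResult[][]): boolean {
  return (
    batch.length > 0 &&
    batch.every(
      (doc) => doc.length > 0 && doc.every((f) => f.confidence > BULK_APPROVE_THRESHOLD),
    )
  );
}
```

For example, a document with `po_number` at 0.99 and `unit_price` at 0.72, reviewed at a threshold of 0.85, would have only `unit_price` flagged, and a batch containing it would not be eligible for bulk approval.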

Audit Trail & Export

Designed a complete audit system that captures every stage of the pipeline — upload timestamp, AI classification result, extracted fields with confidence scores, each human review action, and final export. All audit events are stored in a dedicated PostgreSQL events table and exposed through a read-only API for downstream procurement systems. Added structured logging with document ID and pipeline stage for debugging and SLA monitoring.
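As a rough picture of what such an events table might hold, the sketch below models one audit event per pipeline stage and an append-only log that can return the full trail for a document. The field names and the in-memory store are assumptions for illustration. The source only states that events are kept in a PostgreSQL table and exposed through a read-only API.

```typescript
type Stage = "upload" | "classification" | "extraction" | "review" | "export";

interface AuditEvent {
  documentId: string;
  stage: Stage;
  occurredAt: string; // ISO 8601 timestamp
  actor: string; // "system" or a reviewer's user ID
  payload: Record<string, unknown>; // e.g. classification result, edited field
}

// In-memory stand-in for the events table: events can be appended and
// read, but there is no update or delete.
class AuditLog {
  private readonly events: Readonly<AuditEvent>[] = [];

  append(event: AuditEvent): void {
    // Store a frozen copy so later changes by the caller cannot alter history.
    this.events.push(Object.freeze({ ...event }));
  }

  // All events for one document, in the order they were recorded.
  trail(documentId: string): Readonly<AuditEvent>[] {
    return this.events.filter((e) => e.documentId === documentId);
  }
}
```

With a structure like this, tracing a wrong order quantity back to its source means reading `trail(documentId)` from the first `upload` event through the last `review` or `export` event.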

What shipped.

97% field extraction accuracy on primary fields (vendor, PO, line items, totals)

<10s median end-to-end processing time per document, from upload to structured output

40→8 weekly person-hours, an 80% drop in manual processing time

6 supported input formats: PDF, PNG/JPG, DOCX, XLSX, HTML, plain text

100% audit trail capture on every document, from upload through review to export

<1% downstream order discrepancy rate after pipeline implementation (was 12%)

Laravel 11, AWS SQS, OpenAI, Tesseract OCR, PostgreSQL, React 18, PHP, TypeScript, Docker

The developer.

Alexander Dudnik

AI & Full-Stack Engineer

10+ years of commercial experience in backend development, system architecture, and building scalable applications. Specialises in PHP, Node.js, React, PostgreSQL, and message queue architectures. Experienced in technical leadership: task decomposition, estimation, database design, and core system architecture across high-load environments.

Upload any document. Get structured data back in seconds.

