If you Googled "extract data from PDF" you are probably in one of two camps. Either you have one document you need in a spreadsheet right now, or you have a recurring pile of PDFs every week and a person retyping the numbers. The honest answer is different for each, and the wrong answer for any volume above a handful per week is "a person and Excel."
The phrase covers four different jobs, and the right tool depends entirely on which one you have. Mixing them up is how UK SMEs end up paying for enterprise software they will never use, or worse, paying a person to do work AI does in seconds.
One PDF, you want it as an Excel sheet, today. Open it in Adobe Acrobat, choose "Export to Excel", done. Free alternatives: Smallpdf, iLovePDF, or the table-extraction feature in newer versions of Microsoft Excel itself. Accuracy is usually fine for clean PDFs with proper tables. Stop reading this guide, do the conversion, get on with your week.
Every week, the same three suppliers send invoices in the same layout. Or every month, your bank sends statements you need to reconcile. This is where the SaaS layer earns its £30 to £300 monthly fee. Tools like Docparser, Klippa, Mindee and Veryfi learn a template once, then process every future document automatically.
Hundreds of invoices a month, fifty different supplier layouts, and you need the line items, not just the totals. This is where modern AI extraction earns its place. Rossum, AWS Textract, Google Document AI and Microsoft Azure Form Recognizer (since renamed Azure AI Document Intelligence) handle layout variation that defeats template-based tools. Costs scale with volume, typically £100 to £2,000 per month at SME scale.
Extraction is only half the value. The other half is the data landing in Xero, in your CRM, in your custom database, with the right tags, without a person touching it. This is where a built solution makes sense, usually £4,000 to £9,000 as a fixed-quote project that pays back in three to nine months on saved data-entry time.
In most UK SMEs, PDFs feed into the same manual workflow. A person, usually in finance or operations, opens the PDF on one screen. They open Excel on the other. They retype the numbers. They check the totals. They paste into the accounting system at month end. They file the PDF in a SharePoint folder named after the year.
The spreadsheet has columns for supplier name, invoice number, date, net, VAT, gross, GL code, paid date, and notes. It looks tidy. It is the operational backbone of the function. And it breaks in four predictable ways.
Any of these failure modes is a signal that the spreadsheet has outlived its useful life. The replacement is not a bigger spreadsheet. It is an extraction system that lands clean structured data in the place it needs to go, with the PDF retained and searchable as evidence.
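As a minimal sketch of what "clean structured data" could look like, the spreadsheet columns described above map onto a typed record with a built-in totals check. The class name, the field names, the `source_pdf` field and the one-penny tolerance are illustrative assumptions, not a schema from any particular accounting system.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Optional


@dataclass
class InvoiceRecord:
    """One extracted invoice, mirroring the spreadsheet columns."""
    supplier_name: str
    invoice_number: str
    invoice_date: date
    net: Decimal
    vat: Decimal
    gross: Decimal
    gl_code: str
    paid_date: Optional[date] = None
    notes: str = ""
    source_pdf: str = ""  # hypothetical: path or ID of the retained PDF

    def totals_reconcile(self, tolerance: Decimal = Decimal("0.01")) -> bool:
        """True when net + VAT matches gross within the tolerance."""
        return abs(self.net + self.vat - self.gross) <= tolerance


record = InvoiceRecord(
    supplier_name="Example Supplies Ltd",  # made-up sample data
    invoice_number="INV-0001",
    invoice_date=date(2024, 1, 31),
    net=Decimal("100.00"),
    vat=Decimal("20.00"),
    gross=Decimal("120.00"),
    gl_code="5000",
)
print(record.totals_reconcile())
```

Using `Decimal` rather than floats keeps pence exact, and a record that fails `totals_reconcile()` is a natural candidate for human review before it reaches the ledger.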
These are illustrative shapes drawn from UK SME workflow patterns. The fixed-quote bands are the real ranges we quote for builds of this shape; the workflows described are the patterns that fit each context.
Profile: an independent practice serving around 200 SME clients, handling roughly 6,000 supplier invoices a month across the book. The bookkeeping team, often two to four people, spend the bulk of their week on data entry. The shape that fits: a targeted build that reads invoices and receipts straight into Xero, with the practice's own coding rules applied automatically. Typical band: £6,000 to £12,000 fixed quote. Expected outcome: typically saves 60% to 75% of data-entry time, leaving the team free for the judgement calls, which is what your bookkeepers should have been doing all along.
Profile: case bundles arriving as 200 to 2,000 page PDFs. Paralegals spend hours indexing them, building chronologies, extracting key dates and parties. The shape that fits: modern AI extraction that produces a first-pass chronology, named-entity index and key-document summary in minutes, then a paralegal reviews and corrects. Typical band: £8,000 to £18,000 fixed quote. Expected outcome: the economics flip from "junior staff are the bottleneck" to "junior staff are the quality gate."
Profile: policy schedules from Aviva, AXA, RSA, Allianz and twenty other insurers arriving in twenty different layouts. Account handlers retype the cover details into the broker's own system to produce client summaries. The shape that fits: a schedule-extraction build that reads the carrier PDF, maps the fields to the broker's data model, and produces the client-facing summary in seconds. Typical band: £7,000 to £15,000 fixed quote. Expected outcome: client summaries produced in seconds rather than the hour each currently consumes.
Profile: CVs arriving as PDFs and Word documents in every conceivable layout, at agencies handling 200+ CVs per consultant per month. CV parsing tools have existed for years (Daxtra, Sovren, Textkernel). The shape that fits: the better recent option, a custom build that pulls candidate data into your own ATS or Bullhorn instance with your agency's own taxonomy. Typical band: £5,000 to £10,000 fixed quote. Expected outcome: candidate data lands in the ATS with the agency's taxonomy applied, ready for search and shortlisting.
Profile: supplier price lists arriving as PDFs every quarter. The buying team retypes prices into the merchant's stock system, with predictable errors and a two-week lag before new prices are reflected on the trade counter. The shape that fits: a price-list extraction build, with a human approval step on flagged changes. Typical band: £5,000 to £10,000 fixed quote. Expected outcome: prices live within hours of the PDF landing in the inbox.
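The "human approval step on flagged changes" can be sketched as a simple comparison between current and extracted prices. This is an illustration only: the function name, the SKU keys and the 10% threshold are assumptions, and a real build would take its rule from the buying team.

```python
from decimal import Decimal


def flag_price_changes(current, incoming, max_change_pct=Decimal("10")):
    """Split extracted prices into auto-approved and flagged-for-review.

    current / incoming: dicts mapping SKU -> Decimal price.
    max_change_pct: hypothetical threshold; larger moves need a human.
    """
    approved, flagged = {}, {}
    for sku, new_price in incoming.items():
        old_price = current.get(sku)
        if old_price is None or old_price == 0:
            # New or previously unpriced product: always send to review.
            flagged[sku] = new_price
            continue
        change_pct = abs(new_price - old_price) / old_price * 100
        if change_pct > max_change_pct:
            flagged[sku] = new_price
        else:
            approved[sku] = new_price
    return approved, flagged


current_prices = {"TIMBER-01": Decimal("10.00"), "SCREW-02": Decimal("4.00")}
extracted = {
    "TIMBER-01": Decimal("10.20"),  # small rise, goes straight through
    "SCREW-02": Decimal("5.00"),    # large rise, held for approval
    "PAINT-03": Decimal("18.50"),   # unknown SKU, held for approval
}
approved, flagged = flag_price_changes(current_prices, extracted)
print(sorted(approved), sorted(flagged))
```

Flagging unknown SKUs as well as large movements means an extraction error that shifts a decimal point is caught before it reaches the trade counter.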
| Option | Best for | Real cost |
|---|---|---|
| Docparser, Klippa | 1 to 5 document templates, moderate volume, no integration complexity | £30 to £200 / month |
| Mindee, Veryfi | Invoices and receipts specifically, pre-trained models, decent APIs | £50 to £500 / month |
| Rossum | High volume, varied layouts, enterprise-grade approval workflow | £600 to £5,000+ / month |
| AWS Textract, Google Document AI, Azure Form Recognizer | You have engineering capacity; raw building blocks | Pay per page (£0.001 to £0.05) |
| Custom build with AI extraction layer | Multi-document, downstream system integration, owned outcome | £4,000 to £25,000 one-off |
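To make the pay-per-page row comparable with the monthly figures, here is a rough cost sketch. It assumes 6,000 documents a month (the accountancy profile above) and two pages per document, which is a guess for illustration; the per-page prices are the two ends of the range in the table, and the function name is hypothetical.

```python
from decimal import Decimal


def monthly_page_cost(documents_per_month: int,
                      pages_per_document: int,
                      price_per_page: Decimal) -> Decimal:
    """Monthly extraction cost for a pay-per-page cloud service."""
    total_pages = documents_per_month * pages_per_document
    return Decimal(total_pages) * price_per_page


# Assumed volume: 6,000 documents a month at 2 pages each.
low = monthly_page_cost(6000, 2, Decimal("0.001"))
high = monthly_page_cost(6000, 2, Decimal("0.05"))
print(f"£{low:.2f} to £{high:.2f} per month")
```

At that assumed volume the raw page charges come to between £12 and £600 a month. The figure excludes the engineering time needed to turn the building blocks into a working pipeline, which is usually the larger cost.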
The mistake to avoid: picking the enterprise tool (Rossum) because it looks the most impressive, when your volume would be served by Docparser for one tenth of the cost. The opposite mistake: trying to stretch a template tool to handle every supplier when the layouts vary too much, ending up with a brittle system the team stops trusting.
The pattern that works: start with the SaaS layer for the templated 80%, custom build only the integration into your downstream system, keep a human approval gate on anything that creates a financial transaction. That combination is usually £4,000 to £9,000 one-time plus £50 to £300 per month ongoing, and it pays back inside a year on saved data-entry time.
Do not buy enterprise extraction software for SME volume. If you process under 200 documents a month, Rossum, ABBYY FlexiCapture and Hyland are over-built for you. The licence fee will not pay back. Start with the lighter SaaS layer and graduate later if volume justifies it.
Do not buy a tool with no plan for the integration. The extraction is the easy bit. Getting clean structured data into your accounting system, CRM or stock system is where time and money disappear. If you have not scoped the integration before buying the extraction tool, you have bought half a solution.
Do not skip the human approval gate. Even 99% accuracy means one in every hundred fields is wrong. If those fields feed financial transactions or compliance records, you need a queue where a person reviews flagged or low-confidence extractions. The good tools build this in. The cheap tools do not, and the consequences land on your finance team six months later.
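A review queue of this kind can be sketched in a few lines. This assumes the extraction layer reports a confidence score between 0 and 1 for each field, which varies by product; the `ExtractedField` class, the `route_fields` function and the 0.95 threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed scale: 0.0 to 1.0


def route_fields(fields: List[ExtractedField],
                 threshold: float = 0.95
                 ) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Return (auto_accepted, needs_review) based on confidence."""
    auto_accepted, needs_review = [], []
    for field in fields:
        if field.confidence >= threshold:
            auto_accepted.append(field)
        else:
            needs_review.append(field)
    return auto_accepted, needs_review


sample = [
    ExtractedField("invoice_number", "INV-0001", 0.99),
    ExtractedField("gross", "120.00", 0.97),
    ExtractedField("vat", "20.00", 0.81),  # low confidence, goes to a person
]
accepted, review_queue = route_fields(sample)
print([f.name for f in review_queue])
```

The threshold is a business decision rather than a technical one: lowering it shrinks the review queue but lets more wrong fields through unreviewed.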
We start with the document. You send us samples of the PDFs you want extracted, with examples of how the data needs to land in your downstream system. We come back inside a week with a fixed quote and a delivery timeline, usually three to four weeks for a focused build, six to ten for a multi-document workflow.
The build itself sits on whichever AI extraction layer fits the volume and the layouts. Sometimes Mindee for receipts, sometimes Azure Form Recognizer for invoices, sometimes a direct Anthropic or OpenAI integration for unstructured documents. The choice is technical and we make it on your behalf. What you get is the working extraction-to-system pipeline, on your environment, with a human approval workflow for anything flagged as low-confidence.
The same shape underpins our AI implementation work and the ongoing tech partnership where we run, maintain and improve the system as your volume grows.
If you want to go deeper on the underlying patterns, the PDF to Excel guide covers the one-off conversion path and the email automation guide covers the closely related pattern of automated response over received documents.
For one PDF, Adobe Acrobat or a free converter like Smallpdf will do. For a recurring workflow (10+ similar PDFs per week), the SaaS layer (Docparser, Mindee, Klippa) gets you 80% of the way for £30 to £300 a month. For a custom integration into your accounting or CRM system, a built solution runs £4,000 to £9,000 as a fixed-quote project. The wrong answer for any volume above a handful of documents per week is a person retyping the data into Excel.
For standard structured documents (invoices, receipts, forms with known layouts) modern AI extraction runs at 95% to 99% field-level accuracy. The remaining 1% to 5% is the reason every production system needs a human approval gate on anything that matters. For unstructured documents (free-text contracts, scanned handwritten forms, multi-page case bundles) accuracy drops sharply and you need both a more capable model and a workflow that surfaces low-confidence fields to a person.
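Field-level accuracy understates how often a whole document needs attention. As a back-of-envelope sketch, assuming field errors are independent (a simplification) and using a hypothetical function name, the chance that a document contains at least one wrong field is one minus the accuracy raised to the number of fields:

```python
def document_error_rate(field_accuracy: float, fields_per_document: int) -> float:
    """Probability that at least one field in a document is wrong.

    Assumes each field is extracted independently with the same accuracy.
    """
    if not 0.0 <= field_accuracy <= 1.0:
        raise ValueError("field_accuracy must be between 0 and 1")
    if fields_per_document < 0:
        raise ValueError("fields_per_document must not be negative")
    return 1.0 - field_accuracy ** fields_per_document


# Ten fields per invoice is an assumed figure for illustration.
for accuracy in (0.99, 0.95):
    rate = document_error_rate(accuracy, 10)
    print(f"{accuracy:.0%} field accuracy -> {rate:.1%} of documents with an error")
```

With ten fields per document, 99% field accuracy still leaves just under 10% of documents with at least one error, and 95% leaves roughly 40%. That is the arithmetic behind the approval gate.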
Docparser and Klippa are good for moderate-volume, single-format work (one supplier's invoices, one form layout). Rossum scales to higher volumes and varied layouts but is enterprise-priced. AWS Textract and Google Document AI are the cloud building blocks if you have engineering capacity; powerful but raw. Custom builds make sense when the SaaS layer cannot reach into your downstream system, or when you process more than 500 documents per month and the monthly fees overtake a one-time build cost in under 18 months.
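The 18-month test is simple arithmetic. A sketch, with a hypothetical function name and illustrative figures taken from the ranges quoted in this guide (a £9,000 build, a £600 monthly SaaS fee, £100 a month to run the build):

```python
from typing import Optional


def breakeven_months(build_cost: float,
                     saas_monthly_fee: float,
                     build_monthly_running_cost: float = 0.0) -> Optional[float]:
    """Months until cumulative SaaS fees overtake a one-time build.

    Returns None when the build never breaks even, i.e. its running
    cost is at least as high as the SaaS fee.
    """
    monthly_saving = saas_monthly_fee - build_monthly_running_cost
    if monthly_saving <= 0:
        return None
    return build_cost / monthly_saving


months = breakeven_months(build_cost=9000, saas_monthly_fee=600,
                          build_monthly_running_cost=100)
print(months)
```

With those figures the build breaks even at 18 months, right on the threshold. A higher SaaS fee or a cheaper build shortens it, and a SaaS fee near the low end of the range means the build may never pay back on fees alone.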
Yes. Modern extraction tools combine OCR (turning the image into text) and AI (understanding what the text means in context). Quality depends on the scan: clean printed text at 300 DPI runs at 98% plus accuracy; faxed receipts and faint thermal-printer dockets drop to 85% to 90% and need human review. Handwritten data is the hardest, around 70% to 85% on neat handwriting, and effectively requires a human-in-the-loop step.
A focused build, one document type, one downstream system, runs £4,000 to £9,000 as a fixed-quote project and ships in three to four weeks. A multi-document, multi-system build (invoices plus delivery notes plus statements, into Xero plus your stock system) runs £10,000 to £25,000 over six to ten weeks. Ongoing running costs are typically £40 to £400 per month depending on volume.
Yes, that is usually the point. We integrate with Xero, QuickBooks, Sage, HubSpot, Salesforce, custom databases and bespoke systems via their APIs or, where no API exists, via robotic process automation. The extraction is only half the value. The other half is that the data lands in the right place without a person retyping it.
Send us a handful of sample PDFs and tell us where the data needs to land. We will come back with a fixed quote and a delivery timeline, usually inside a week. No charge for the conversation.
Email oliver@digitalsignet.com