PDF to AI-Ready Text: How to Choose the Right Data Extraction Tool
PDF data extraction quality directly determines AI accuracy. When building BetterRegulation’s document processing system, we found that naive extraction wastes 40-60% of context windows on PDF artifacts. After evaluating ChatGPT API, traditional Python libraries, and Unstructured.io, we achieved 30% token reduction and significantly improved document categorization. Here’s what we learned.
When we started building BetterRegulation’s document processing system, we faced an immediate reality: PDFs are everywhere. Contracts, reports, specifications, regulations, research papers—every regulatory document we needed to process came as a PDF.
But we immediately identified a problem: PDFs are terrible for AI processing.
PDFs were designed for printing and visual display, not for text extraction and machine readability. They encode positioning, fonts, colors, and layout—not semantic meaning.
When you naively extract text from a PDF, the results are messy and unusable. Headers and footers appear repeatedly on every page, page numbers get embedded in the middle of sentences, and watermarks mix with actual content. Multi-column text reads left-to-right instead of following each column properly, footnotes interrupt paragraphs at random points, and table data becomes unstructured gibberish. On top of that, formatting markers and PDF metadata clutter the output, and line breaks appear in completely random places, breaking sentences mid-word.
Example of naive PDF extraction:
Enterprise Risk Assessment Q3 2024 Page 1 of 45
Executive Sum- The following report Confidential
mary provides a compre-
hensive overview of
Risk Factors enterprise risk ex- Risk Category
Financial Risk posures identifi High
ed during Q3 2024
Operational Risk audit procedures. Medium
Compliance Risk High
Enterprise Risk Assessment Q3 2024 Page 2 of 45
[continues with more broken text...]

This is what we were initially feeding our AI. No wonder it was getting confused and making categorization errors.
What we learned AI actually needs:
Executive Summary
The following report provides a comprehensive overview of enterprise
risk exposures identified during Q3 2024 audit procedures.
Risk Factors:
- Financial Risk: High
- Operational Risk: Medium
- Compliance Risk: High

Clean, structured, readable text that accurately represents the document’s semantic content.
This article shares how we solved this challenge—and the lessons we learned along the way.
Read also: AI document processing in Drupal: Technical case study with 95% accuracy
Why are PDFs so difficult for AI?
Understanding why PDFs cause problems for AI helps explain why extraction tool choice matters so much. Here are the four main challenges we encountered.
1. Formatting markers and metadata
PDFs contain positioning information, font specifications, and layout instructions that aren’t part of the actual content:
/F1 12 Tf % Font size 12
(Executive Summary) Tj
72 650 Td % Position at coordinates
/F2 10 Tf % Font size 10
(The following report...) Tj

These markers can consume 30-50% of your AI context window with non-content information.
2. Complex layouts
Multi-column layouts, text boxes, sidebars—PDFs encode these as separate text objects with coordinates, not as a logical reading order:
[Column 1 text] [Column 2 text]
[More Col 1] [More Col 2]

Naive extraction reads left-to-right: “Column 1 text Column 2 text More Col 1 More Col 2”

Correct reading order: “Column 1 text More Col 1” then “Column 2 text More Col 2”
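To make the reading-order problem concrete, here is a minimal sketch of coordinate-based column recovery: text fragments carry (x, y) positions, get bucketed into columns by x, and are then read column by column, top to bottom. The function name, the fragment tuples, and the column-gap threshold are all illustrative assumptions, not a real PDF parser's API; layout-aware tools like Unstructured.io do a far more robust version of this.

```python
def column_reading_order(fragments, column_gap=100):
    """fragments: list of (x, y, text) tuples; y grows downward here."""
    # Bucket fragments into columns by their x position.
    columns = {}
    for x, y, text in fragments:
        key = x // column_gap  # crude column index
        columns.setdefault(key, []).append((y, text))
    # Read columns left to right, each column top to bottom.
    ordered = []
    for key in sorted(columns):
        for _, text in sorted(columns[key]):
            ordered.append(text)
    return " ".join(ordered)

fragments = [
    (0, 0, "Column 1 text"), (200, 0, "Column 2 text"),
    (0, 20, "More Col 1"),   (200, 20, "More Col 2"),
]
print(column_reading_order(fragments))
# → Column 1 text More Col 1 Column 2 text More Col 2
```

Naive left-to-right extraction is equivalent to sorting these fragments by y first and x second, which is exactly how the interleaved gibberish above is produced.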
3. Embedded content
Images, charts, tables, headers, footers, page numbers—all embedded as separate objects. Naive extraction either includes everything (noise) or skips important content (data loss).
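One classic symptom of this is the repeated header/footer problem from the risk-report example above. As an illustrative sketch (the helper name and the 80% threshold are our assumptions, not a library API; Unstructured.io classifies these elements properly), you can detect boilerplate by finding lines that recur on most pages:

```python
from collections import Counter
import re

def strip_repeated_lines(pages, threshold=0.8):
    """pages: list of per-page texts. Drop lines seen on >= threshold of pages."""
    counts = Counter()
    for page in pages:
        counts.update(set(page.splitlines()))
    cutoff = threshold * len(pages)
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines()
                if counts[line] < cutoff  # repeated header/footer
                and not re.match(r'^Page \d+ of \d+$', line.strip())]
        cleaned.append("\n".join(kept))
    return cleaned

pages = [
    "Enterprise Risk Assessment Q3 2024\nExecutive Summary\nPage 1 of 45",
    "Enterprise Risk Assessment Q3 2024\nRisk Factors\nPage 2 of 45",
]
print(strip_repeated_lines(pages))
# → ['Executive Summary', 'Risk Factors']
```

Even this toy version shows why naive tools struggle: the decision of what to keep depends on cross-page frequency and element type, not on any single page in isolation.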
4. Variable structure
No two PDFs structure content the same way. What works for simple reports completely fails for legal documents with complex footnotes and citations, technical specifications with embedded tables and diagrams, scanned documents requiring OCR, or structured forms with specific field layouts. Each document type requires different extraction strategies.
How does poor extraction impact AI?
The consequences of poor PDF data extraction are severe. PDF artifacts alone can consume 40-60% of your context window, leaving that much less room for actual content. The AI becomes confused trying to interpret page numbers, headers, and formatting markers as if they were meaningful information. This leads to errors like multi-column confusion, broken sentences, missing context, and incorrect document categorization. Ultimately, you’re paying to process PDF noise rather than the content that actually matters.
How do PDF data extraction approaches compare?
We evaluated three main approaches before choosing our solution.
1. Direct PDF to ChatGPT API
How it works: send PDF directly to ChatGPT Vision API, let OpenAI handle extraction.
This was our first approach—just one API call for a simple initial implementation. It’s fast to implement and requires no additional infrastructure, making it attractive for quick prototypes. However, the simplicity comes with significant trade-offs. You have no control over how OpenAI extracts the text, PDF artifacts often remain in the context, and debugging extraction issues becomes nearly impossible. It’s also more expensive than self-hosted alternatives and locks you into OpenAI’s models. This approach works best for simple PDFs, low-volume processing, or initial prototypes where speed matters more than cost or control.
2. Traditional PDF libraries (PyPDF2, pdfplumber, etc.)
How it works: Python libraries that parse PDF structure and extract text.
We tested these next, evaluating whether the open-source route would provide better control.
Example:
import PyPDF2

with open("document.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    text = ""
    for page in reader.pages:
        text += page.extract_text()

These libraries are free, open-source, and work completely offline, making them appealing for simple use cases. The implementation is straightforward for basic needs—just a few lines of Python code and you’re extracting text. However, the extraction quality is basic at best. These libraries have significant limitations with complex layouts, offer no automatic cleaning of PDF artifacts, and require manual handling of multi-column text, tables, and other structural elements. Significant post-processing is needed to get usable results. They work well for simple, single-column PDFs with minimal formatting, but anything more complex requires a better solution.
3. Unstructured.io (our final choice)
How it works: advanced PDF processing library with layout analysis, OCR, and intelligent text extraction.
After running into the limitations of the first two options, we moved to Unstructured.io, which met all of our requirements.
Example:
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf("document.pdf")
clean_text = "\n\n".join([el.text for el in elements])

Unstructured.io delivers excellent extraction quality and handles complex layouts that other tools can’t manage. It automatically cleans PDF artifacts and preserves document structure, includes OCR for scanned documents, and remains open-source with commercial support available. As a Python library, it can be installed via pip and used directly in your code. For production systems needing API-based processing at scale, you can optionally deploy it as a self-hosted service using Docker/Kubernetes. The main considerations are system dependencies for advanced features (like OCR) and learning the library’s configuration options. For complex PDFs, high-volume processing, and production systems, these considerations are well worth it. BetterRegulation chose Unstructured.io for precisely these reasons.
Read also: AI Automators in Drupal: How to orchestrate multi-step AI workflows
How does Unstructured.io work?
Since we chose Unstructured.io and it became the foundation of our extraction pipeline, let me share what we learned about how it works.
How it works
Unstructured.io combines multiple sophisticated techniques to extract clean, structured text from PDFs. It starts with layout analysis, identifying columns, headers, footers, and sidebars, then determines the logical reading order to separate main content from ancillary elements. During text extraction, it preserves document structure while maintaining paragraph and section boundaries, and correctly handles multi-column layouts that confuse simpler tools.
The library classifies each text element as Title, NarrativeText, ListItem, Table, or other types, enabling selective extraction where you can choose to process only main content while filtering out noise. Its cleaning pipeline removes headers and footers, filters page numbers, cleans up excessive whitespace, and normalizes line breaks to produce readable text. When processing scanned documents, Unstructured.io automatically detects when OCR is needed and applies it seamlessly, even handling mixed documents that contain both digital text and scanned images.
Self-hosting setup (optional API server)
For basic usage, simply install the Python library with pip install unstructured. The Docker/Kubernetes setup below is only needed if you want to run Unstructured.io as an API server for your application to call remotely.
Docker Compose (local development):
version: '3'
services:
  unstructured-api:
    image: downloads.unstructured.io/unstructured-io/unstructured-api:latest
    ports:
      - "8000:8000"

Start with: docker-compose up
Kubernetes (production):
BetterRegulation runs Unstructured.io as a Kubernetes pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: unstructured-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: unstructured-api
  template:
    metadata:
      labels:
        app: unstructured-api
    spec:
      containers:
        - name: unstructured-api
          image: downloads.unstructured.io/unstructured-io/unstructured-api:latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "2000m"

Infrastructure requirements:
- 2-4GB RAM per instance
- 1-2 CPU cores per instance
- Scale horizontally for volume
Configuration for legal documents
Here’s the configuration we settled on for our complex legal PDFs after considerable experimentation:
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="document.pdf",
    strategy="hi_res",            # High-resolution analysis
    include_page_breaks=False,    # Don't include page break markers
    infer_table_structure=True,   # Detect and preserve tables
    ocr_languages=["eng"],        # OCR if needed
    extract_images_in_pdf=False,  # Skip images (not needed)
    model_name="yolox",           # Layout detection model
)

# Filter to main content only
main_content = [el for el in elements if el.category in [
    "Title",
    "NarrativeText",
    "ListItem",
    "Table",
]]

# Join with appropriate spacing
clean_text = "\n\n".join([el.text for el in main_content])

Key parameters explained:
The strategy="hi_res" setting uses the highest quality analysis, which is slower but significantly more accurate for complex documents. Setting include_page_breaks=False removes page break markers that would clutter the output. With infer_table_structure=True, the library detects and preserves table formatting instead of outputting unstructured table data. Finally, extract_images_in_pdf=False skips image extraction when you only need text processing, improving performance.
Filtering and cleaning
Remove specific elements:
# Filter out headers, footers, page numbers
filtered = [el for el in elements if el.category not in [
    "Header",
    "Footer",
    "PageNumber",
    "PageBreak",
]]

# Remove short elements (likely noise)
filtered = [el for el in filtered if len(el.text) > 10]

# Remove elements that are just page numbers or dates
import re
filtered = [el for el in filtered if not re.match(r'^Page \d+$', el.text)]
filtered = [el for el in filtered if not re.match(r'^\d{1,2}/\d{1,2}/\d{4}$', el.text)]

This level of control was game-changing for us—we could fine-tune exactly what content reached our AI models.
Integration with AI pipelines
Here’s how we integrated Unstructured.io into our processing workflow:
# Step 1: Extract clean text
from unstructured.partition.pdf import partition_pdf

def extract_pdf_text(pdf_file):
    elements = partition_pdf(
        filename=pdf_file,
        strategy="hi_res",
        include_page_breaks=False,
        infer_table_structure=True,
    )
    # Filter to main content
    main_content = [el for el in elements if el.category in [
        "Title",
        "NarrativeText",
        "ListItem",
    ]]
    return "\n\n".join([el.text for el in main_content])

# Step 2: Send to AI
from openai import OpenAI

client = OpenAI()

def categorize_document(pdf_file):
    clean_text = extract_pdf_text(pdf_file)
    prompt = build_categorization_prompt(clean_text)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ],
    )
    return parse_ai_response(response)

This clean separation between extraction and AI processing made debugging much easier when we hit issues.
Read also: Prompt engineering for data extraction: How to achieve 95% accuracy in legal documents
Text output optimization
The token efficiency gains we achieved were substantial:
Before cleaning (with PDF artifacts):
- 75-page document: ~65,000 tokens
- Includes: headers, footers, page numbers, formatting markers
After cleaning (Unstructured.io):
- Same document: ~45,000 tokens
- 30% token reduction = 30% cost savings
Context window management:
GPT-4o-mini’s 128K token window appeared sufficient initially. However, processing a 350-page statute with naive extraction exceeded this limit. After implementing Unstructured.io’s cleaning, even our largest documents fit comfortably within the context window.
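A cheap pre-flight check catches this class of failure before a request is even sent. The sketch below uses the common chars-divided-by-4 approximation for English text rather than a real tokenizer (swap in a library like tiktoken for exact counts); the helper name, the reserved headroom, and the figures are illustrative assumptions:

```python
def fits_context_window(text, context_limit=128_000, reserved=8_000):
    """Estimate token count and leave headroom for the prompt and the reply."""
    estimated_tokens = len(text) // 4  # rough heuristic: ~4 chars per token
    return estimated_tokens <= context_limit - reserved

# A ~45,000-token cleaned document (~180,000 characters) fits easily:
print(fits_context_window("x" * 180_000))  # → True
# A naive extraction bloated well past the window does not:
print(fits_context_window("x" * 600_000))  # → False
```

Failing this check early is far cheaper than discovering the overflow as an API error, or worse, as silently truncated input.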
What are the performance and cost trade-offs?
Beyond extraction quality, choosing a PDF processing tool involves practical trade-offs around speed, cost, and infrastructure. Here’s what our production deployment revealed.
How fast is PDF extraction?
Here’s what we measured in production:
| Document size | Extraction time | Total processing |
|---|---|---|
| Tiny (2-3 pages) | ~2 seconds | ~10 seconds |
| Small (10-20 pages) | ~5 seconds | ~15-20 seconds |
| Medium (50-75 pages) | ~15 seconds | ~30-45 seconds |
| Large (100-150 pages) | ~30 seconds | ~1 minute |
| Very large (200-350 pages) | ~45-60 seconds | ~1.5-2 minutes |
We found that extraction accounts for roughly 30-40% of total processing time, with AI analysis taking the rest.
Read also: Intelligent routing for chatbot questions: How we cut AI API costs by 95%
How much does PDF extraction cost?
SaaS Unstructured.io:
- $0.10-0.20 per document
- No infrastructure costs
- Pay as you go
Self-hosted Unstructured.io:
- Infrastructure: ~$50-100/month (Kubernetes pod)
- Processing: no per-document fees
- Break-even: ~250-500 documents/month
For our volume (200+ docs/month): self-hosting broke even quickly and now saves us money.
For smaller volumes (<100 docs/month): SaaS would be more cost-effective.
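The break-even point above is simple to compute for your own numbers. A back-of-the-envelope sketch (the function name is ours, the figures are the rough ranges quoted above, and prices are in cents so integer ceiling division avoids float rounding surprises):

```python
def break_even_docs(monthly_infra_cents, saas_cents_per_doc):
    """Documents per month at which self-hosting becomes cheaper than SaaS."""
    # Ceiling division with integers: -(-a // b)
    return -(-monthly_infra_cents // saas_cents_per_doc)

# ~$75/month infrastructure vs ~$0.15 (15 cents) per document SaaS:
print(break_even_docs(7500, 15))   # → 500
# At the pricier end, $100/month infrastructure vs $0.20/document:
print(break_even_docs(10000, 20))  # → 500
```

Below the break-even volume, every document you process is cheaper on SaaS; above it, the fixed infrastructure cost amortizes in your favor.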
Infrastructure costs
Here’s our self-hosted infrastructure:
- 2 Kubernetes pods (redundancy)
- 2GB RAM each
- 1 CPU core each
- Total cost: ~£50-70/month
Alternative (AWS Lambda):
- Serverless Unstructured.io processing
- Pay per invocation
- No idle costs
- Good for variable/intermittent volumes
What results did we achieve in production?
To validate our choice, we benchmarked all three approaches against real documents from our production workload. The differences were significant.
How does extraction quality compare?
We tested all three approaches against representative documents from our production corpus:
PyPDF2 (naive extraction):
- Multi-column layouts frequently read incorrectly
- Text broken mid-sentence
- Headers, footers, and page numbers mixed with content
- Required extensive manual post-processing
ChatGPT direct:
- Better than PyPDF2 but inconsistent
- PDF artifacts still present in extracted text
- No control over what gets included or filtered
Unstructured.io:
- Clean, logically ordered text
- Proper handling of complex layouts
- Headers and footers automatically filtered
- Minimal post-processing needed
How does extraction affect AI categorization?
The extraction quality directly impacted our AI’s categorization performance:
With poor extraction (PyPDF2):
- Frequent categorization errors due to broken or missing context
- Multi-column confusion led to wrong document types being assigned
- Required manual review and correction of most documents
With good extraction (Unstructured.io):
- Significantly improved categorization accuracy
- Most errors came from genuine document ambiguity rather than extraction problems
- Manual review needed only for edge cases
The lesson was clear: better extraction directly translates to better AI accuracy.
Read also: How we improved RAG chatbot accuracy by 40% with document grading
Alternative PDF data extraction tools
While we chose Unstructured.io, several other tools are worth considering depending on your specific requirements and constraints.
Adobe PDF Services API
Adobe’s commercial offering delivers high-quality extraction with full enterprise support, and it handles complex PDFs well. However, it’s expensive at $0.05-0.30 per page, can’t be self-hosted, and locks you into Adobe’s ecosystem. Consider this option when you have budget for premium services and need enterprise-level support contracts.
AWS Textract
Amazon’s document analysis service provides OCR and layout analysis, excelling at forms and tables with seamless AWS integration. However, it becomes expensive at scale, is optimized specifically for forms rather than general documents, and requires cloud infrastructure. It’s a good fit if you’re already operating on AWS and primarily processing forms or invoices.
Google Document AI
Google Cloud’s document processing leverages advanced ML models that handle varied document types well, with native GCP integration. The downsides are high costs, complex pricing structures that make budgeting difficult, and cloud-only deployment. Choose this if you’re already invested in GCP infrastructure and need Google’s advanced processing features.
Apache Tika
This open-source framework is free and handles many document formats beyond just PDFs, making it useful in Java ecosystems. However, extraction quality is basic, it requires Java infrastructure, and layout analysis capabilities are limited. Consider Tika when you need multi-format document support and are already working in a Java environment.
Read also: LangChain vs LangGraph vs Raw OpenAI: How to choose your RAG stack
When to use which PDF extraction tool
There’s no single best tool for every scenario. Here’s a practical decision framework based on document type and processing needs.
Simple PDFs → PyPDF2 or pdfplumber
For simple, single-column PDFs with standard layouts, minimal formatting, and text-only content—like basic reports, memos, or letters—PyPDF2 or pdfplumber are perfectly adequate. These free, simple libraries handle straightforward documents without requiring complex infrastructure.
Complex PDFs → Unstructured.io
When you’re dealing with multi-column layouts, tables and charts, headers and footers that need filtering, mixed formatting, or situations where high accuracy is critical, Unstructured.io is the clear choice. Legal documents, technical specifications, and research papers all fall into this category, and Unstructured.io delivers the best extraction quality for these complex cases.
Scanned documents → Unstructured.io with OCR
Image-based PDFs that require OCR processing and have variable quality—such as scanned contracts or historical documents—need Unstructured.io’s built-in OCR capabilities. The library automatically detects scanned content and applies OCR without manual intervention.
Forms and invoices → AWS Textract
Structured forms with key-value pairs and tabular data, like invoices, applications, or standardized forms, are where AWS Textract excels. The service is specifically optimized for form processing and delivers excellent results for this document type.
Prototypes → ChatGPT direct
For quick wins with low volume and relatively simple documents, sending PDFs directly to ChatGPT API is the fastest path forward. It’s the simplest implementation and perfect for prototyping before investing in more sophisticated extraction infrastructure.
Want to improve your PDF processing pipeline?
This case study is based on our real production implementation for BetterRegulation, where we built a complete PDF extraction pipeline using Unstructured.io to achieve 30% token reduction and significantly improved AI categorization accuracy. The system processes 200+ documents monthly in production, delivering consistent results.
Interested in building a similar solution for your platform? Our team specializes in creating production-grade AI document pipelines that balance extraction quality, cost efficiency, and scalability. We handle everything from Unstructured.io setup and Kubernetes deployment to custom extraction pipelines and AI integration. Visit our generative AI development services to discover how we can help you.