Screen displaying a CMS panel with a “Generate with AI” option for automatically filling in document information.

AI-Powered Document Categorization: How BetterRegulation Saved 50% of Editorial Time

We built an AI-driven document categorization engine for BetterRegulation that cuts editorial effort in half by automating legal document field extraction and classification. Powered by Drupal Automators and AI, the system transforms hours of manual tagging into a fast, reliable workflow that keeps complex legal content perfectly structured and searchable.

Key results

  • 50% time saved: document processing time reduced by half.
  • 1 FTE equivalent saved: editorial capacity freed for growth.
  • 10s-2min processing time: AI processing versus multiple minutes or hours manually.
  • 2-350 pages handled: handles documents of all sizes.
  • Very high accuracy: minimal corrections needed, <5% of fields require adjustment.
  • Reduction in cognitive load: AI removes the need for full-document reading.

About BetterRegulation

BetterRegulation.com is a comprehensive compliance intelligence platform delivering consolidated, up-to-date legislative, regulatory, and guidance materials for the UK and Ireland. Since 2004, it has supported legal professionals, compliance teams, and financial sector experts who rely on accurate, current, and fully traceable legal information.

With a London-based editorial team, BetterRegulation processes a high volume of legal documents monthly, each requiring careful categorization across multiple fields and taxonomies to maintain their platform's quality and usability.

 

The platform brings together:

  • Consolidated primary and secondary legislation with full amendment history.
  • EU law and international regulatory materials relevant to UK and Irish markets.
  • Accounting standards and industry guidance from regulatory bodies.
  • Commentary and analysis from leading law firms and accountants.
  • Historical amendment tracking with powerful comparison tools.


 

Challenge: the editorial bottleneck

A legal professional using the BetterRegulation platform to review legal documents.

Manual document processing at scale

Processing legal documents for a compliance platform requires meticulous attention to detail. Each statute, regulation, or guidance document needed to be read, categorized, and tagged across approximately 15 different fields – from document type and jurisdiction to year of enactment and organization.

For BetterRegulation's editorial team, this manual process created a significant operational bottleneck.

A legal professional sitting at a computer and reviewing content on the BetterRegulation platform.

The two-step manual workflow

Before AI implementation, every document followed a labor-intensive process:

Step 1: Document reading and categorization

The editor:

  • receives a new legal document (ranging from 2 to 350 pages),
  • carefully reads through the entire document to understand its content,
  • manually extracts key information: document type, title, year, jurisdiction, organization,
  • cross-references with existing taxonomy systems to ensure consistent categorization,
  • fills approximately 15 different fields in the content management system.

Time requirement: 15 minutes to multiple hours per document, depending on length and complexity.


Step 2: Quality verification

The second editor:

  • reviews all categorizations,
  • verifies accuracy of field assignments,
  • checks consistency with platform standards.

The pain points

Time-intensive process

Reading and categorizing legal documents required substantial editorial time. For longer documents exceeding 100 pages, a single document could take multiple hours to process completely. This time investment was necessary but didn't leverage the editorial team's higher-level skills and legal expertise.

Resource allocation challenge

Approximately one full-time equivalent (1 FTE) was dedicated primarily to document reading and initial categorization. This represented a significant resource investment in a repetitive task that, while essential, prevented the team from focusing on more strategic work.

Cognitive load and fatigue

Maintaining focus while reading lengthy legal texts with complex terminology led to editor fatigue. This cognitive burden not only slowed the process but also increased the risk of categorization errors, especially during high-volume periods or at the end of long reading sessions.

Scalability constraints

The manual process limited how many documents could be processed. Growth in document volume would require proportional increases in editorial staff, which is expensive and slow to scale. This created a hard ceiling on the platform's ability to expand coverage.

Document complexity challenges

The documents themselves presented additional complexities:

  • Variable sizes: from brief 2-3 page guidance notes to comprehensive 350-page statutes.
  • Complex legal language: requiring careful reading and interpretation.
  • Multiple taxonomy mappings: each document needed accurate assignment to numerous interconnected taxonomies (organization, jurisdiction, document type, legislation area, etc.).
  • Varied formats: historical documents came in different structural formats, making text extraction challenging.
  • Frequent updates: legislative amendments required reprocessing of existing documents.

The business impact factors

This operational bottleneck had direct business implications:

  • Limited ability to scale content coverage.
  • High operational costs relative to output.
  • Delayed time-to-publish for new documents.
  • Editorial team unable to focus on higher-value activities.

BetterRegulation needed a solution that could maintain their high standards of accuracy while dramatically improving processing efficiency. The goal was to free their editorial team from the tedious work of document reading and initial categorization, allowing them to focus on verification, quality control, and strategic editorial decisions.

Solution: AI-augmented editorial workflow

A laptop screen showing the Drupal admin panel with an AI generation option for BetterRegulation editors.

The approach: augment, don't replace

Droptica approached this challenge by exploring how AI could augment – not replace – the editorial workflow. Through collaborative discovery sessions with BetterRegulation, we focused on a clear objective: eliminate the time-consuming manual reading and data entry tasks while keeping human editors in control of quality and decision-making.

The philosophy was straightforward: let AI handle the tedious, repetitive work, and let humans handle judgment, verification, and quality control.

Discovery and testing phase

Rather than jumping directly to implementation, the Droptica team conducted thorough testing of different approaches to ensure the solution would be reliable, accurate, and production-ready.

View of the AI Automator module settings, which form the technical foundation of the “Generate with AI” functionality.

PDF processing methods

Legal PDFs are notoriously complex. They often contain:

  • multiple columns and complex layouts,
  • headers, footers, and page numbers throughout,
  • embedded images and graphics,
  • tables and structured data,
  • various fonts and formatting styles.

We evaluated multiple methods for extracting clean, usable text:

  1. Direct PDF to ChatGPT API - revealed limitations with complex formatting and file size restrictions.
  2. Traditional PDF parsing libraries - struggled with inconsistent document structures and produced noisy output.
  3. Unstructured.io - emerged as the clear winner.

The choice of Unstructured.io proved crucial. The team has no control over how the source PDFs are built, and legal documents often contain formatting markers and metadata that fill the context window and confuse the model. With Unstructured.io, these can be filtered out during the extraction phase. In testing, it also delivered better accuracy and faster processing than the other methods.
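The filtering step can be illustrated with a minimal sketch. It assumes the extractor returns a list of elements, each with a `type` and a `text` key, which is the general shape of Unstructured's JSON output. The set of types treated as noise and the sample data are assumptions made for illustration; the source does not say which element types the production pipeline discards.

```python
# Element types treated as noise. This set is an assumption for illustration,
# not the filter list used in production.
NOISE_TYPES = {"Header", "Footer", "PageNumber", "PageBreak", "Image"}


def clean_text(elements, noise_types=NOISE_TYPES):
    """Join the text of every element whose type is not in noise_types."""
    kept = []
    for element in elements:
        if element.get("type") in noise_types:
            continue  # skip layout artifacts such as running headers
        text = (element.get("text") or "").strip()
        if text:
            kept.append(text)
    return "\n".join(kept)


# Made-up sample elements standing in for one extracted page.
sample_elements = [
    {"type": "Header", "text": "Official Journal"},
    {"type": "Title", "text": "Example Regulation 2020"},
    {"type": "NarrativeText", "text": "This regulation sets out reporting duties."},
    {"type": "Footer", "text": "Page 1 of 350"},
]

cleaned = clean_text(sample_elements)
```

Only the title and the narrative text survive, so the header and footer repeated on every page never reach the model's context window.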

Prompt configuration panel defining instructions for generating information with AI.

Language model selection

The team tested multiple large language models, evaluating them on three key criteria:

  1. Accuracy: could the model correctly identify and categorize document information?
  2. Speed: how quickly could it process documents ranging from 2 to 350 pages?
  3. Cost: what was the token cost per document at expected volumes?

After extensive testing with real BetterRegulation documents, we selected GPT-4o-mini as the model best suited to current needs. It provides the right combination of speed, accuracy, and a context window (128K tokens) large enough to handle even the longest documents. The choice is not fixed: if future models offer a better balance of quality, performance, or capabilities, the underlying model may be changed.
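The cost and context-window criteria come down to simple arithmetic, sketched below. The four-characters-per-token ratio is a rough heuristic for English text, and the prices passed in are placeholders rather than actual OpenAI rates. Only the 128K-token window comes from the text above.

```python
import math

CONTEXT_WINDOW_TOKENS = 128_000  # context window stated in the article
CHARS_PER_TOKEN = 4              # rough heuristic for English text (assumption)


def estimate_tokens(text):
    """Approximate the token count of a text from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Cost of one request, given prices per million tokens."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Stand-in for a document of about 40 pages at 2,000 characters per page (assumed).
sample_text = "x" * 80_000
input_tokens = estimate_tokens(sample_text)

# Placeholder prices: 0.50 per million input tokens, 1.50 per million output tokens.
cost = estimate_cost(input_tokens, 500, 0.50, 1.50)
fits = input_tokens <= CONTEXT_WINDOW_TOKENS
```

Running the same estimate over a representative sample of documents gives a per-document cost that can be multiplied by the expected monthly volume.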
 

Prompt engineering

Significant effort went into crafting prompts that would reliably extract and categorize information. This iterative process included:

  • Defining clear, unambiguous instructions for field extraction.
  • Providing complete taxonomy lists within the prompt context.
  • Specifying exact JSON output formats for consistent parsing.
  • Adding validation rules and edge case handling.
  • Testing with hundreds of real documents to refine accuracy.
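The first three points above can be sketched as a small prompt builder. This is an illustration only: the instruction wording, the field names, and the term IDs are hypothetical, and the production prompts are configured in Drupal rather than in Python.

```python
import json


def build_prompt(document_text, taxonomies, text_fields):
    """Assemble an extraction prompt.

    taxonomies maps a field name to a dict of {term_id: term_name}.
    text_fields lists the free-text fields to extract.
    """
    lines = [
        "Extract metadata from the legal document below.",
        "For taxonomy fields, choose only from the listed terms and return their IDs.",
        "If a value cannot be determined, return null or an empty list.",
        "",
    ]
    # Inject the complete list of allowed terms for every taxonomy field.
    for field, terms in taxonomies.items():
        lines.append(f"Allowed terms for '{field}':")
        for term_id, name in sorted(terms.items()):
            lines.append(f"  {term_id}: {name}")

    # Spell out the exact JSON shape expected in the response.
    output_shape = {field: "string or null" for field in text_fields}
    output_shape.update({field: "list of term IDs" for field in taxonomies})
    lines += [
        "",
        "Respond with JSON only, in exactly this shape:",
        json.dumps(output_shape, indent=2),
        "",
        "DOCUMENT:",
        document_text,
    ]
    return "\n".join(lines)


# Hypothetical taxonomy and document used only to show the output.
prompt = build_prompt(
    "Example text.",
    {"document_type": {12: "Statute", 14: "Guidance"}},
    ["title"],
)
```

Keeping the term lists and the output shape in data rather than in hand-written prose means the prompt stays in sync when the taxonomy changes.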

How it works

Editor’s screen view where the AI content generation process can be triggered with a single click.

The auto-fill feature

The solution integrates seamlessly into BetterRegulation's existing Drupal 11 editorial workflow.

From an editor's perspective:

  1. Upload PDF: editor creates a new document entry and uploads the PDF to the "Original Document" field.
  2. Click "Generate with AI": single button click initiates the AI processing.
  3. Wait briefly: 10 seconds to 2 minutes depending on document size (no page refresh needed).
  4. Adjust if needed: editor can modify any field before saving.
  5. Save and publish: document is ready for the platform.

The experience transformation:

  • Before: 15 minutes to multiple hours of reading and manual data entry.
  • After: click a button, wait briefly, review the result, and send it for review.

Editors remain in control. They can modify any auto-populated field before saving. This preserves quality standards while eliminating the tedious work.

Editor panel showing document sections that are automatically populated by AI.

Fields automatically filled by AI

The system populates approximately 15 fields, including:

Text fields:

  • Title - extracted and cleaned document title.
  • Body/Summary - key content extraction from document.

Taxonomy references:

  • Document type - statute, regulation, guidance, code, etc.
  • Organization - issuing body or regulatory authority.
  • Document area - subject matter classification.
  • Document legislation - related legislative framework.

Entity references:

  • Jurisdiction - UK, Ireland, EU, etc. (can be multiple).

Date fields:

  • Year - when the document was enacted or published.

URL fields:

  • Source URL - official publication location.

The system also fills additional metadata fields specific to BetterRegulation's content model.

Technical configuration settings for the BetterRegulation platform enabling proper AI support.

The technical innovation: intelligent taxonomy mapping

A key technical achievement is how the system handles Drupal's taxonomy references. The AI doesn't just extract text; it intelligently maps extracted information to existing taxonomy terms in the Drupal database.

Here's how it works:

  1. Context injection: the system includes the complete list of available taxonomy terms for each field in the prompt sent to the AI.
  2. Semantic matching: the AI analyzes the document content and matches it against these terms based on meaning, not just keywords.
  3. ID return: it returns not just the matched term names, but their specific Drupal entity IDs.
  4. Entity reference creation: the Drupal Automators module then creates proper entity references using these IDs.

This approach ensures:

  • Seamless integration with BetterRegulation's existing content architecture.
  • No "orphaned" terms or data inconsistencies.
  • Proper relationships between documents and taxonomies.
  • Maintainable data structure as taxonomies evolve.
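The "no orphaned terms" guarantee depends on checking every returned ID against the terms that actually exist. The sketch below shows one way to do that check. It assumes the model replies with a JSON object whose taxonomy fields hold lists of integer IDs; the function name, field name, and IDs are hypothetical, and in the real system the references are created by the Drupal Automators module.

```python
import json


def resolve_term_ids(raw_response, field, known_terms):
    """Split the IDs returned for one taxonomy field into valid and rejected.

    known_terms maps existing term IDs to their names.
    """
    data = json.loads(raw_response)
    candidates = data.get(field) or []
    valid, rejected = [], []
    for value in candidates:
        is_id = isinstance(value, int) and not isinstance(value, bool)
        if is_id and value in known_terms:
            if value not in valid:  # ignore duplicates
                valid.append(value)
        else:
            rejected.append(value)  # unknown or malformed, never saved
    return valid, rejected


# Hypothetical term IDs for a jurisdiction field.
jurisdictions = {3: "UK", 7: "Ireland", 9: "EU"}
valid_ids, rejected_ids = resolve_term_ids(
    '{"jurisdiction": [3, 7, 999, 3]}', "jurisdiction", jurisdictions
)
```

Only IDs that exist in the taxonomy become entity references; anything else can be logged and left for the editor to fill in.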

Technical architecture

The solution is built on a robust, production-ready architecture designed for reliability and scalability.

Technology stack:

  • Drupal 11 - content management platform.
  • Drupal Automators (contrib module) - orchestrates the AI workflows and manages processing logic.
  • Unstructured.io (Extracture) - PDF text extraction and cleaning, self-hosted for control.
  • GPT (OpenAI) - language model for text analysis and categorization.
  • RabbitMQ - message queue for background processing (used for summary feature).
  • Watchdog - comprehensive logging and error monitoring.

Processing flow:

Visual representation of the AI-driven content processing workflow on the BetterRegulation platform.

Key technical decisions

Challenge | Solution | Rationale
Complex PDF formatting | Unstructured.io | Superior filtering of PDF artifacts, better handling of tables and multi-column layouts, higher extraction accuracy.
Model selection | GPT | Optimal speed/accuracy/cost balance, large context window (128K tokens) handles longest documents.
Output format | Structured JSON with schema | Ensures consistent, parseable responses; validates against expected field types.
Taxonomy matching | Include full taxonomy lists in prompt | AI can match semantically rather than by exact keywords; returns proper entity IDs.
User experience | Synchronous on-demand processing | Editors see immediate results; can verify before saving; no waiting for background jobs.
Large documents | Graceful degradation | Documents exceeding token limits flagged for manual review with clear error messages.
Reliability | Comprehensive error logging | All failures logged to Watchdog with context; admin dashboard shows processing status.
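The "Structured JSON with schema" decision can be illustrated with a minimal type check on the model's reply. The field names and expected types below are assumptions chosen to mirror the fields listed earlier; the actual schema is not given in the source.

```python
import json

# Hypothetical schema: field name mapped to the Python type expected after parsing.
EXPECTED_TYPES = {"title": str, "year": int, "jurisdiction": list, "source_url": str}


def _matches(value, expected_type):
    """Type check that does not accept booleans where an integer is expected."""
    if expected_type is int and isinstance(value, bool):
        return False
    return isinstance(value, expected_type)


def validate_response(raw, expected=EXPECTED_TYPES):
    """Parse a JSON reply and return (data, errors). None values are allowed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]
    errors = []
    for field, expected_type in expected.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif data[field] is not None and not _matches(data[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    return data, errors


good = '{"title": "Example Regulation", "year": 2019, "jurisdiction": [3], "source_url": null}'
bad = '{"title": "Example Regulation", "year": "2019", "jurisdiction": [3], "source_url": null}'

_, good_errors = validate_response(good)
_, bad_errors = validate_response(bad)
```

A reply that fails validation can be logged and retried instead of writing a wrongly typed value into a Drupal field.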

Handling edge cases

Large documents (>350 pages or exceeding token limits):

When documents approach or exceed context window limits:

  • The system attempts processing with the full document.
  • If token limits are exceeded, processing is gracefully terminated.
  • The document is flagged in an admin "manual review" queue.
  • Editors are notified with a clear error message.
  • Editors can use the "Admin Created PDF" field to upload a condensed version or key excerpts.
  • This alternative PDF can then be processed successfully.
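A simplified sketch of the flagging logic follows. Note the difference from the flow described above: the production system attempts processing and stops when the limit is hit, whereas this sketch estimates the size up front. The token heuristic, the reserved budget, and the queue structure are all assumptions made for illustration.

```python
import math

CONTEXT_WINDOW_TOKENS = 128_000  # context window stated in the article
RESERVED_TOKENS = 8_000          # assumed room for instructions, term lists, and the reply
CHARS_PER_TOKEN = 4              # rough heuristic for English text (assumption)


class DocumentTooLarge(Exception):
    """Raised when a document is unlikely to fit in the context window."""


def check_fits(text, limit=CONTEXT_WINDOW_TOKENS, reserved=RESERVED_TOKENS):
    """Return the estimated token count, or raise DocumentTooLarge."""
    estimated = math.ceil(len(text) / CHARS_PER_TOKEN)
    budget = limit - reserved
    if estimated > budget:
        raise DocumentTooLarge(
            f"estimated {estimated} tokens exceeds budget of {budget}"
        )
    return estimated


def process_or_flag(doc_id, text, review_queue):
    """Return True if the document can be processed, else queue it for manual review."""
    try:
        check_fits(text)
    except DocumentTooLarge as exc:
        review_queue.append({"doc_id": doc_id, "reason": str(exc)})
        return False
    return True


review_queue = []
small_ok = process_or_flag(1, "short guidance note", review_queue)
large_ok = process_or_flag(2, "x" * 600_000, review_queue)
```

The queued reason string doubles as the clear error message shown to the editor, who can then upload a condensed PDF.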

Failed processing:

  • All errors are comprehensively logged to Drupal's Watchdog.
  • Admin dashboard shows processing status for all documents.
  • Failed documents can be manually reprocessed with a single click.
  • Detailed error messages help diagnose issues (API errors, malformed PDFs, etc.).
  • Retry logic handles transient failures automatically.
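Retry logic for transient failures is commonly implemented with exponential backoff, sketched below. The source does not describe the actual retry policy, so the attempt count, the delays, and the `TransientError` class are hypothetical.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as timeouts or rate limits."""


def call_with_retry(operation, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on TransientError with exponential backoff.

    The last failure is re-raised so it can be logged and shown to an admin.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...


# Demonstration with a stand-in operation that fails twice, then succeeds.
delays = []
calls = {"count": 0}


def flaky_extraction():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated timeout")
    return "ok"


# Passing delays.append as the sleep function records the waits without pausing.
result = call_with_retry(flaky_extraction, sleep=delays.append)
```

Errors that are not transient, such as a malformed PDF, are not caught here and surface immediately, which matches the manual reprocessing path described above.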

Quality control layers:

  1. AI processing - initial extraction and categorization.
  2. Editor review - human verification and adjustment of all fields.
  3. QA editor - second human review before final publication.
  4. Ongoing monitoring - track accuracy rates and common correction patterns.

The AI assists but doesn't replace human judgment. This multi-layered approach ensures that BetterRegulation's high standards are maintained while gaining significant efficiency benefits.

Results: transformative efficiency gains

50% time savings in document processing

The most significant and immediately measurable result is the dramatic reduction in time required to process documents.

Before AI implementation:

  • from 15 minutes to multiple hours per document for initial reading and categorization,
  • highly variable depending on document length and complexity,
  • full, sustained attention required from editor during entire process,
  • average processing capacity: 3-8 documents per day per editor for complex documents.

After AI implementation:

  • 10 seconds to 2 minutes for AI processing (depending on document size),
  • predictable, consistent processing time regardless of document complexity,
  • additional 5 minutes for editor review and verification (focused, lower cognitive load),
  • average processing capacity: 6 documents per hour (up to 8x faster).

BetterRegulation achieves 50% overall time savings across the full document ingestion, categorization, review, and publication process.

1 FTE equivalent capacity freed

What used to be a full day's work for one editor is now completed in an hour. The AI handles the tedious part – reading and extracting information – while editors focus on verification and quality control.

This represents approximately one full-time equivalent (1 FTE) of editorial capacity that has been freed up for higher-value work.

Editorial staff benefits:

  • Reallocated to strategic tasks: document analysis, quality improvement initiatives, user feedback incorporation.
  • Focus shift: from manual data entry to quality verification and editorial judgment.
  • Increased job satisfaction: editors report significantly less fatigue and higher engagement.
  • Skill utilization: legal expertise now applied to verification and improvement, not just reading.
  • Career development: editors can take on more complex, challenging work.

Business benefits:

  • Increased capacity: can process approximately 2x the document volume without additional staff.
  • No additional hiring costs: equivalent to ~£30-50k annually (1 FTE) in cost avoidance.
  • Better prepared for growth: platform can scale document coverage without proportional headcount increases.
  • Faster response to changes: can quickly process and publish new regulatory changes.
  • More consistent output quality: less variation due to fatigue or workload pressure.

Scalability without headcount growth

Perhaps most importantly for BetterRegulation's business, the AI solution provides scalability that would have previously required proportional increases in staff.

Operational flexibility:

  • Can process 2x the document volume without additional editors.
  • Quick adaptation to regulatory changes that temporarily increase document flow.
  • Handles seasonal spikes (e.g., end-of-year legislative sessions) without overtime or temporary staff.
  • Maintains consistent quality regardless of volume.

Business growth enablement:

  • Ability to expand coverage to additional jurisdictions without proportional cost increases.
  • Can take on more comprehensive document types without workflow bottlenecks.
  • Platform evolution not constrained by editorial capacity.
  • Competitive advantage through more comprehensive, current content.

Cost efficiency:

  • Reduced training time for new editors (focus on verification rather than full reading).
  • Lower operational costs per document processed.
  • Better resource allocation across the business.
  • Improved ROI on editorial team investment.

Reliability metrics:

  • Success rate: >95% of documents processed without errors.
  • Accuracy rate: very high - <5% of fields require editor correction.
  • Availability: 99%+ uptime for processing service.
  • Error recovery: automatic retry handles transient failures.

Technical innovation: Drupal + AI success story

This project showcases the power of modern Drupal for sophisticated AI integration.

Why this architecture works

Seamless Drupal integration

Unlike bolt-on AI solutions, this implementation is built into the Drupal editorial experience itself:

  • native Drupal forms with AI-powered features,
  • full integration with Drupal's entity and field system,
  • respects Drupal permissions and editorial workflows,
  • works with existing content types and taxonomies,
  • no separate interfaces or context switching for editors.

Drupal Automators module

The Drupal Automators contrib module proved instrumental:

  • provides clean abstraction for AI workflows,
  • handles orchestration of multi-step processing,
  • manages connection to external AI services,
  • offers admin UI for configuration and monitoring,
  • supports complex prompt engineering and response parsing.

Production-ready from day one

This isn't a prototype. It's a production system handling business-critical workflows:

  • comprehensive error handling and logging,
  • graceful degradation for edge cases,
  • full monitoring and visibility for administrators,
  • retry logic and fault tolerance,
  • security considerations (API key management, data privacy).

Extensible architecture

The technical architecture is designed to be extensible:

  • modular prompt design allows easy updates and improvements,
  • processing pipeline can be adapted for additional document types,
  • clean separation between AI processing and Drupal integration,
  • foundation for future AI features (we've already built on it with document summaries).
