Why I Skipped LLMs to Extract Data From 100,000 Wills: A System Design Story


An engineer details a deterministic, high-accuracy document processing pipeline for legal wills using Azure's Content Understanding model, rejecting LLMs due to hallucination risk and cost. A masterclass in pragmatic AI system design.



In an era dominated by LLM hype, a practical engineering case study emerges, demonstrating that the most advanced model isn't always the right tool for the job. This is the story of building a production-ready, automated pipeline to extract structured data from 100,000 messy legal wills, where accuracy is non-negotiable and "guessing" constitutes a legal failure.

The Problem: Precision Over Hype

The task was a Proof of Concept (PoC) to perform deep data analysis on 1,000 wills, scaling to 100,000. The dataset was a nightmare for standard automation:

  • Old scans and JPEGs: Blurry, tilted, and noisy.
  • Handwritten annotations: Notes and signatures overlapping typed text.
  • Inconsistent templates: Every law firm's "special" way of drafting a will.

The required schema was complex: extracting names, National Insurance numbers, specific gift clauses, funeral wishes, and applying business rules like flagging agency executors or wills aged 10-35 years.
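The business rules mentioned above can be sketched as plain predicate functions. This is a minimal sketch, not the article's code: the field names (`executor_type`, `will_date`) and the exact thresholds' interpretation are assumptions based on the description.

```python
from datetime import date

# Hypothetical rule checks sketching the business rules described above;
# field names are illustrative, not the article's actual schema.
def flag_agency_executor(executor_type: str) -> bool:
    """Flag wills whose executor is an agency rather than an individual."""
    return executor_type.strip().lower() == "agency"

def flag_will_age(will_date: date, today: date) -> bool:
    """Flag wills aged 10-35 years, per the business rule in the schema."""
    age_years = (today - will_date).days / 365.25
    return 10 <= age_years <= 35
```

Keeping rules like these as pure functions makes them easy to unit-test independently of the extraction model.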

Why LLMs Failed the Legal Stress Test

The author tested the standard generative AI route first. It failed for two critical reasons:

  1. The Hallucination Factor: On fuzzy scans, an LLM would "helpfully" guess a beneficiary's name. In legal document processing, an incorrect guess isn't a bug—it's a potential legal disaster.
  2. Prohibitive Cost: Feeding 100,000 multi-page documents through a high-end LLM API was financially untenable for a scalable, deterministic pipeline.

Standard OCR and document intelligence services (like Azure AI Document Intelligence) also hit a wall. They excel on fixed-format documents like invoices but cannot adapt to the wildly variable structure of legal wills.

The Breakthrough: Azure's Content Understanding Model

The solution was a pivot to Azure's Content Understanding model. This was the "Aha!" moment: unlike generic OCR or LLMs, this model allows for a custom, schema-first approach.

The process:

  1. Define the Schema: The engineer uploaded a custom JSON schema, explicitly telling the AI: "Here is exactly what I am looking for."
  2. Iterative Training: Starting manually, they uploaded sample wills, observed the model's low-confidence outputs, and used the built-in training feature to manually "teach" the AI where to look and how to interpret specific legal phrases (e.g., "I bequeath," "executor," "guardian").
  3. Achieve Deterministic Accuracy: Through this feedback loop, the model's confidence scores soared from "okay" to a consistent 95%—a level of reliability LLMs could not guarantee for this task.
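To make the schema-first idea concrete, here is an illustrative field schema for step 1. The article does not reproduce its actual JSON schema, and the exact Azure Content Understanding schema syntax differs from this; the field names and types below are assumptions drawn from the fields the article lists.

```python
# Illustrative "here is exactly what I am looking for" schema; names and
# types are assumptions, not the article's (or Azure's) exact format.
will_schema = {
    "fields": {
        "ClientName":        {"type": "string"},
        "NationalInsurance": {"type": "string"},
        "WillDate":          {"type": "date"},
        "ExecutorName":      {"type": "string"},
        "GiftClauses":       {"type": "array", "items": "string"},
        "FuneralWishes":     {"type": "string"},
    }
}
```

Declaring the target fields up front is what lets the model return "low confidence" instead of guessing, which is the behavior the iterative training loop depends on.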

The Production Architecture: A Three-Container Pipeline

With a trained model as the "brain," the author built an automated "body" using Azure cloud services. The goal: drop a folder of 10,000 wills and let the system work unattended.

The architecture centers on three Azure Blob Storage containers:

  1. raw-wills: The ingestion inbox for JPEGs and PDFs.
  2. extracted-json: Holds the raw, messy JSON output from the AI model.
  3. curated-json: A clean room where logic refines the raw data (fixing date formats, standardizing names) to match the final database schema perfectly.
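The curation stage's cleaning logic might look like the following sketch. The article only names the categories of cleaning (date formats, name standardization), so the specific formats and rules here are assumptions.

```python
from datetime import datetime

# Minimal curation sketch for the curated-json stage; the accepted date
# formats and the name rule are assumptions, not the article's code.
def curate(raw: dict) -> dict:
    curated = dict(raw)
    # Normalize dates like "3 March 1998" or "03/03/1998" to ISO 8601.
    for fmt in ("%d %B %Y", "%d/%m/%Y", "%Y-%m-%d"):
        try:
            curated["WillDate"] = (
                datetime.strptime(raw["WillDate"], fmt).date().isoformat()
            )
            break
        except ValueError:
            continue
    # Standardize names: collapse whitespace, title-case.
    curated["ClientName"] = " ".join(raw["ClientName"].split()).title()
    return curated
```

Because this stage is deterministic code rather than a model, its output can match the database schema exactly, every time.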

The Digital Warehouse: Making Data Actionable

A critical, often-skipped step was moving from extracted JSON to a structured, queryable database. The author provisioned an Azure SQL Server and built a dedicated database with columns for every field in the schema (ClientName, WillDate, ExecutorName, etc.).

This transformation made the data "permanent" and "searchable." A business question like "How many wills mention a charity beneficiary?" could be answered with a simple SELECT query in seconds, rather than by manually reviewing thousands of PDFs.
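As a toy stand-in for the Azure SQL warehouse, the stdlib `sqlite3` module can demonstrate how that business question collapses into one query. The column names follow the article's examples; the `CharityBeneficiary` flag is an assumption for illustration.

```python
import sqlite3

# sqlite3 stands in for Azure SQL here; CharityBeneficiary is a
# hypothetical column, not one the article names.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Wills (
        ClientName TEXT,
        WillDate TEXT,
        ExecutorName TEXT,
        CharityBeneficiary INTEGER  -- 1 if a gift clause names a charity
    )
""")
conn.executemany(
    "INSERT INTO Wills VALUES (?, ?, ?, ?)",
    [("Jane Doe", "1998-03-03", "John Smith", 1),
     ("Amir Khan", "2005-07-12", "Acme Trustees Ltd", 0)],
)
charity_count = conn.execute(
    "SELECT COUNT(*) FROM Wills WHERE CharityBeneficiary = 1"
).fetchone()[0]
```

Once the data is relational, every new business question is a query, not another pass over thousands of PDFs.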

Automation Glue: Azure Functions

The entire flow is orchestrated by Azure Functions, a serverless "trigger" that runs 24/7. The automated sequence:

  1. Trigger: A file lands in the raw-wills container.
  2. AI Call: The function sends the document to the trained Content Understanding model.
  3. Extraction: It saves the raw JSON result to extracted-json.
  4. Curation: It applies cleaning logic and moves the refined data to curated-json.
  5. SQL Push: It maps the curated JSON to the SQL database table and inserts the record.

The entire process runs in seconds, fully automated. The provided Python code snippet illustrates the core integration logic, handling authentication (via API key or Managed Identity), calling the Content Understanding REST API, polling for results, and managing the data flow.
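The five-step flow can be sketched structurally as follows. This is not the article's snippet: the Azure calls are injected as plain callables so the orchestration logic is visible (and testable) on its own, and all function names are illustrative.

```python
# Structural sketch of the five-step flow; read_blob, analyze, save_json,
# curate, and insert_row are injected stand-ins for the real Azure calls.
def process_will(blob_name, read_blob, analyze, save_json, curate, insert_row):
    document = read_blob("raw-wills", blob_name)      # 1. triggered by upload
    raw_json = analyze(document)                      # 2. call trained model
    save_json("extracted-json", blob_name, raw_json)  # 3. store raw output
    clean = curate(raw_json)                          # 4. apply cleaning logic
    save_json("curated-json", blob_name, clean)
    insert_row(clean)                                 # 5. push to SQL
    return clean
```

In production, `read_blob` and `save_json` would wrap Blob Storage bindings, `analyze` the Content Understanding REST call (with its polling loop), and `insert_row` the SQL client; the dependency injection is purely for clarity here.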

The Takeaway: A Blueprint for Deterministic AI

This project is a masterclass in pragmatic AI system design. It underscores a vital principle for enterprise applications: The best solution is often a specialized tool integrated into a robust, automated pipeline, not the most hyped general-purpose model. For tasks demanding perfect accuracy, consistent structure, and predictable cost, a trained, deterministic model can outperform a powerful but stochastic LLM.

AI Analysis

For retail and luxury AI practitioners, this case study is a critical reminder to resist solution-first thinking. The immediate instinct for extracting data from documents—product descriptions, supplier contracts, customer service emails, or handwritten client notes—might be to "plug it into GPT-4." This narrative provides a powerful counterpoint.

The core lesson is about **fit-for-purpose architecture**. In retail, numerous processes demand deterministic accuracy where hallucination is unacceptable: extracting line items from a legacy supplier invoice, parsing specific product specifications from a technical data sheet, or identifying precise clauses in a licensing agreement. An LLM might creatively infer missing data, but a custom-trained document understanding model, like the one described, will reliably extract only what it's been taught to find. The cost and latency arguments are equally compelling for high-volume tasks like processing daily shipment manifests or thousands of customer feedback forms.

Implementing a similar pipeline in a retail context is highly feasible. The technological components (cloud storage, serverless functions, specialized AI services, SQL databases) are all commodity services. The real investment is in the **schema definition and training phase**. A team would need to manually label a representative corpus of documents—be they purchase orders, product certificates, or return forms—to teach the model the specific lexicon and layout of their business. This upfront effort trades the flexibility of an LLM for the precision and operational control required in mission-critical, scalable back-office automation. It's a trade-off every technical leader must evaluate consciously.
Original source: pub.towardsai.net
