
How to Govern Unstructured Data
Your governance stack only covers structured data. Here is how to close the gap.

Every enterprise has a data governance stack. Collibra, Alation, Atlan, Informatica - pick your flavor. These platforms are excellent at what they do: catalog data, enforce policies, track lineage, manage quality. But they all share the same blind spot.
They assume the data is already structured.
At most large companies, an estimated 80 to 90 percent of data is unstructured: PDFs, scanned documents, handwritten forms, invoices in twelve different layouts. None of that content fits into a catalog as-is. And you cannot govern what you cannot structure.
This post breaks down how the major governance platforms handle unstructured data today, where they fall short, and what the architecture looks like when you close the gap with a document intelligence layer.
The governance landscape today
Data governance has matured rapidly. The leading platforms each bring something distinct to the table.
Collibra
Collibra is the enterprise governance leader. Their Data Intelligence Cloud provides a searchable catalog, automated lineage, policy management, and data quality scoring. In 2025, they acquired Deasy Labs and launched Collibra Unstructured AI, which uses LLMs to tag and classify unstructured documents at the chunk level with confidence scores.
This is the most aggressive unstructured play in the governance space - but it classifies documents. It does not extract structured content from them. Knowing a document is a financial contract containing PII is useful. Extracting the principal amount, interest rate, and maturity date into validated JSON fields is a different problem entirely.
Alation
Alation leads in data discovery and collaborative governance. Their catalog supports natural-language search across data assets with trust signals, and they have over 120 connectors for active metadata harvesting. Their new agentic platform automates classification and metadata gap detection.
Alation can catalog unstructured data sources - it knows they exist and can tag them. But extracting structured fields from a PDF is outside its scope.
Atlan
Atlan is the platform native to the modern data stack. It is built API-first, with over 100 certified connectors, column-level lineage, and a custom metadata builder. They ship an MCP server so AI agents can query lineage graphs for full provenance chains. Deployment is fast: 4 to 6 weeks, versus months for legacy platforms.
Like the others, their unstructured data support is at the metadata and tagging layer. They can govern the container. Not the content inside.
Informatica
Informatica is the heavyweight in enterprise data management. Their Cloud Data Governance and Catalog offers classification, quality scoring, AI-enhanced lineage, and policy pushdown to Databricks, Redshift, and Microsoft Fabric. In late 2025, they introduced unstructured data governance in private preview - again, classification-focused.
Dremio
Dremio sits in the data layer as a lakehouse platform rather than a governance tool. Their Open Catalog (Apache Polaris) governs Iceberg tables, and they have built-in AI functions that can do basic extraction from unstructured content. Enterprises often pair Dremio with one of the governance platforms above for full catalog and policy management.
What these platforms actually do with unstructured data
Every vendor above has a story around unstructured data. But when you dig into what they actually do, a pattern emerges.
They operate at the metadata layer. They can discover that unstructured files exist. They can classify them - "this is a financial contract containing PII." They can tag them, apply access policies, track who owns them.
What none of them do is extract structured content from documents. They can tell you a file is an insurance loss run. They cannot tell you the loss run contains a claim for $450,000 filed on March 15 with a confidence score of 97.2%.
The governance platforms govern the container. They do not govern the content inside.
Why this matters for compliance
Consider the compliance workflow at a large financial institution. The data governance team has spent years building lineage maps, quality rules, policy frameworks. They can trace any structured data point back to its source system.
Then someone asks: where did the data from that scanned trust document end up? What confidence level do we have on the extracted beneficiary name? Which model version processed it?
The governance stack goes silent. That data lives somewhere in an S3 bucket as a PDF. Maybe someone manually keyed it into a spreadsheet. There is no lineage. There is no confidence score. There is no audit trail.
This is 80 to 90 percent of enterprise data. Ungoverned. Unaudited. Unknown.
The missing layer: document intelligence
The fix is not better governance tooling. It is a structuring layer that makes unstructured data governable in the first place.
A document intelligence platform sits between raw documents and your governance stack. It does three things:
Schema-validated extraction. Take a 22-page scanned trust document and extract every field into typed, schema-validated JSON. Not a bag of text. Structured output that matches the exact shape your downstream systems expect.
Field-level confidence scoring. Every extracted field carries a confidence score. The beneficiary name was extracted at 98.3%. The trust date at 94.1%. The account number at 99.7%. This is per-field precision that maps directly to data quality frameworks in your governance platform.
Full provenance and audit trail. Every extraction is traced: which document, which page, which model version, which schema version, when it was processed, who triggered it, what corrections were applied. This is the lineage data your governance stack needs.
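To make the three outputs above concrete, here is a minimal sketch of what one extraction record could look like. The class names, field names, and sample values are illustrative assumptions, not bem's actual response schema; only the confidence figures echo the example in the text.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0, one score per field
    page: int          # page of the source document the value came from


@dataclass
class Provenance:
    document_uri: str
    model_version: str
    schema_version: str
    processed_at: str  # ISO 8601 timestamp
    triggered_by: str


@dataclass
class ExtractionRecord:
    document_type: str
    fields: list[ExtractedField]
    provenance: Provenance

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists.
        return json.dumps(asdict(self), indent=2)


# Placeholder values for illustration only.
record = ExtractionRecord(
    document_type="trust_document",
    fields=[
        ExtractedField("beneficiary_name", "Jane Doe", 0.983, page=2),
        ExtractedField("trust_date", "2019-06-30", 0.941, page=1),
        ExtractedField("account_number", "000123456", 0.997, page=4),
    ],
    provenance=Provenance(
        document_uri="s3://example-bucket/trust-document.pdf",
        model_version="2.4",
        schema_version="1.0",
        processed_at="2026-04-14T12:00:00+00:00",
        triggered_by="example-user",
    ),
)

print(record.to_json())
```

Keeping values, confidence, and provenance in one record means a governance platform can ingest quality and lineage data from the same payload that feeds the warehouse.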
How document intelligence integrates with governance
The architecture is straightforward. The document intelligence layer produces four types of output that feed directly into your governance platform.
Metadata for your catalog
Every processed document generates metadata: document type, schema version, extraction timestamp, source location, field count, overall confidence. This gets pushed to your catalog via REST APIs as custom asset metadata.
Collibra's Catalog API accepts external asset types with custom attributes. Alation's Relational Integration API supports batch metadata uploads. Atlan's custom metadata builder lets you define extraction-specific fields that attach to any asset. Informatica's CDGC REST API supports both initial and incremental metadata syncs.
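As a rough sketch, the catalog metadata described above can be assembled as a plain dictionary before it is handed to whichever catalog API you use. The function name and key names are assumptions for illustration, and "overall confidence" is taken here to be the mean of the per-field scores, which is one possible definition rather than a documented one. No vendor endpoint is called.

```python
from datetime import datetime, timezone
from statistics import mean
from typing import Optional


def build_catalog_metadata(
    document_type: str,
    schema_version: str,
    source_location: str,
    field_confidences: dict[str, float],
    extracted_at: Optional[datetime] = None,
) -> dict:
    """Build vendor-neutral catalog metadata for one processed document."""
    extracted_at = extracted_at or datetime.now(timezone.utc)
    scores = list(field_confidences.values())
    return {
        "document_type": document_type,
        "schema_version": schema_version,
        "extraction_timestamp": extracted_at.isoformat(),
        "source_location": source_location,
        "field_count": len(scores),
        # Assumption: overall confidence = mean of field scores (None if no fields).
        "overall_confidence": round(mean(scores), 4) if scores else None,
    }


metadata = build_catalog_metadata(
    document_type="trust_document",
    schema_version="1.0",
    source_location="s3://example-bucket/trust-document.pdf",
    field_confidences={
        "beneficiary_name": 0.983,
        "trust_date": 0.941,
        "account_number": 0.997,
    },
    extracted_at=datetime(2026, 4, 14, 12, 0, tzinfo=timezone.utc),
)
print(metadata)
```

A thin adapter per governance platform can then translate this dictionary into that platform's custom attribute format.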
Lineage for compliance
The extraction creates a clear lineage chain: Source Document (PDF in S3) → Extraction Process (API call, model version, schema version) → Structured Output (JSON rows in your warehouse).
This maps to column-level lineage in every major platform. "The interest_rate field in the output table was extracted from page 3, section 2.1 of document X, by model version 2.4, at 98.7% confidence."
For regulated industries - financial services, healthcare, insurance - this is the difference between "we think this data is right" and "we can prove exactly where every field came from."
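The lineage chain above can be modeled as two edges per field: document to extraction process, and extraction process to output column. The sketch below is one hypothetical way to represent that; the class name, attributes, and sample values are assumptions, and the node labels are not any platform's lineage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldLineage:
    output_table: str
    output_column: str
    source_document: str
    page: int
    section: str
    model_version: str
    schema_version: str
    confidence: float  # 0.0 to 1.0

    def edges(self) -> list[tuple[str, str]]:
        """Return (upstream, downstream) pairs for the three-node chain."""
        process = (
            f"extraction(model={self.model_version}, "
            f"schema={self.schema_version})"
        )
        target = f"{self.output_table}.{self.output_column}"
        return [(self.source_document, process), (process, target)]

    def describe(self) -> str:
        """Render the human-readable provenance statement for audits."""
        return (
            f"{self.output_table}.{self.output_column} was extracted from "
            f"page {self.page}, section {self.section} of "
            f"{self.source_document}, by model version {self.model_version}, "
            f"at {self.confidence:.1%} confidence."
        )


lineage = FieldLineage(
    output_table="loans",
    output_column="interest_rate",
    source_document="s3://example-bucket/loan-agreement.pdf",
    page=3,
    section="2.1",
    model_version="2.4",
    schema_version="1.0",
    confidence=0.987,
)

for upstream, downstream in lineage.edges():
    print(f"{upstream} -> {downstream}")
print(lineage.describe())
```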
Quality scores for trust
Per-field confidence scores map directly to data quality dimensions in your governance platform. Collibra and Informatica both have native quality frameworks that accept external scores. Atlan's trust signals can incorporate extraction confidence.
Set a rule: flag any extraction where confidence drops below 0.85 for human review. Your governance platform enforces it automatically.
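The rule above is simple enough to express in a few lines. This is a minimal sketch, assuming confidences arrive as a mapping of field name to a score between 0 and 1; the function name and sample data are hypothetical.

```python
REVIEW_THRESHOLD = 0.85  # threshold from the rule described above


def fields_needing_review(
    confidences: dict[str, float],
    threshold: float = REVIEW_THRESHOLD,
) -> list[str]:
    """Return the names of fields whose confidence is below the threshold."""
    return sorted(
        name for name, score in confidences.items() if score < threshold
    )


sample = {
    "beneficiary_name": 0.983,
    "trust_date": 0.84,      # below threshold, should be flagged
    "account_number": 0.85,  # exactly at threshold, not flagged
}
flagged = fields_needing_review(sample)
print(flagged)
```

Whether a score exactly at the threshold counts as passing is a policy choice; this sketch treats it as passing because the rule says "drops below".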
Webhooks for real-time governance
The extraction layer pushes events via webhooks as documents are processed. Your governance platform picks up new assets in real time, applies policies, triggers compliance workflows. No batch jobs. No overnight syncs. A document lands, gets extracted, gets governed - in seconds.
The architecture
Here is what the full stack looks like when you add a document intelligence layer to your governance infrastructure:
Source Documents (S3 / SharePoint / Email / Google Drive)
        |
        v
Document Intelligence Layer (bem)
  - Schema-validated extraction
  - Per-field confidence scoring
  - Bounding box localization
  - Human-in-the-loop corrections
  - Full execution trace and provenance
        |
        +--> Structured Data (Warehouse / Iceberg / Systems of record)
        |         |
        |         v
        |    Query Layer (Dremio / Snowflake / Databricks)
        |
        +--> Metadata + Lineage + Quality Scores
                  |
                  v
             Governance Platform (Collibra / Alation / Atlan / Informatica)
               - Catalog entry per extraction
               - Lineage: document -> extraction -> output
               - Quality scores per field
               - Policy enforcement
               - Compliance reporting
This is not a rip-and-replace. Your governance investment stays intact. The document intelligence layer extends it to cover the 80 to 90 percent of data that was previously invisible.
How bem fits into this stack
We built bem to be that structuring layer. Our V3 API takes any document - PDF, image, email, spreadsheet, audio - and returns schema-validated JSON with field-level confidence scores and full execution traces.
Workflow DAGs with full tracing. Documents flow through configurable workflows: split multi-page documents, extract fields, route by document type, join outputs, reshape payloads. Every step in the DAG is traced - all queryable via API.
Webhooks with HMAC signing. Every extraction event can trigger a signed webhook to your governance platform. Payloads are signed with HMAC-SHA256 for authenticity verification. Four automatic retries with exponential backoff.
Human-in-the-loop corrections. When a field is corrected, the correction is tracked with full provenance. The system computes accuracy, precision, recall, and F1 against the corrected values. Confidence scores improve over time.
Subscriptions for downstream sync. Configure subscriptions to push extraction results to S3, Google Drive, or any webhook endpoint. Your governance platform gets notified the moment new structured data lands.
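On the receiving side, the HMAC-SHA256 signing mentioned above is typically checked by recomputing the digest over the raw request body and comparing it in constant time. The sketch below assumes the signature is a hex digest of the raw body computed with a shared secret; how the signature is actually transmitted (header name, encoding) is not specified here, so check the webhook documentation before relying on this.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute the hex-encoded HMAC-SHA256 of a raw webhook body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received: str) -> bool:
    """Return True only if the received signature matches the body."""
    expected = sign(secret, body)
    # compare_digest avoids leaking match position through timing.
    return hmac.compare_digest(expected.encode(), received.encode())


# Placeholder secret and payload for illustration.
secret = b"example-shared-secret"
body = b'{"event": "extraction.completed"}'
signature = sign(secret, body)

print(verify_signature(secret, body, signature))         # genuine payload
print(verify_signature(secret, body + b" ", signature))  # tampered payload
```

Because deliveries are retried, the receiver should also be idempotent, so that a repeated event does not create a duplicate catalog entry.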
The API call is simple. Send a document, get structured JSON:
curl -X POST https://api.bem.ai/v3/workflows/invoice-extraction/call \
  -H "x-api-key: YOUR_API_KEY" \
  -F "file=@invoice.pdf" \
  -F "wait=true"
The response includes extracted fields, confidence scores per field, bounding boxes showing where on the page each field was found, and a trace URL for the full execution DAG.
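For teams calling the API from code rather than the shell, the same request can be built with Python's standard library. This sketch reuses only the endpoint, header, and form fields from the curl call; the function name is hypothetical, and the response shape is not assumed, so the request is built but not sent or parsed here.

```python
import mimetypes
import urllib.request
import uuid

ENDPOINT = "https://api.bem.ai/v3/workflows/invoice-extraction/call"


def build_request(
    api_key: str, filename: str, file_bytes: bytes, wait: bool = True
) -> urllib.request.Request:
    """Build a multipart/form-data POST with 'file' and 'wait' form fields."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    wait_value = "true" if wait else "false"

    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode()
    wait_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="wait"\r\n\r\n'
        f"{wait_value}\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()

    body = file_header + file_bytes + b"\r\n" + wait_part + closing
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        method="POST",
        headers={
            "x-api-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# To send the request (requires a real API key and network access):
#   with open("invoice.pdf", "rb") as f:
#       req = build_request("YOUR_API_KEY", "invoice.pdf", f.read())
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode())
```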
The bottom line
Data governance platforms are very good at governing structured data. The gap is not in the tooling. The gap is in the data.
Eighty to ninety percent of enterprise data is unstructured. Until you can reliably structure it - with confidence scores, provenance, and audit trails - your governance stack only covers a fraction of your data estate.
The companies solving this are not buying more governance. They are investing in the structuring layer that makes governance complete.
If your team is evaluating how to bring unstructured data into your governance framework, reach out at bem.ai.

Written by
Antonio Bustamante
Apr 14, 2026 · Perspectives

