Isaac Sacolick
Contributing Writer

Addressing the challenges of unstructured data governance for AI

analysis
Apr 21, 2026
10 mins

As technology and regulations evolve, enterprises need to address data governance throughout pipelines, models, and AI agents. AI can help.

Black and white computer keyboard keys, mostly numeric with M and L (machine learning) keys in foreground. Concept of unstructured big data for data science and deep learning.
Credit: tookitook / Shutterstock

Large enterprises in regulated industries, especially in data-rich financial services and insurance, have invested significantly in data governance programs. Other businesses have been catching up as part of their efforts to become more data-driven organizations. Data governance often starts with defining policies, classifying data sources, establishing data catalogs, and communicating non-negotiables.

But look a little closer at the implementations, and you’ll see much of the focus has been on governing data warehouses, relational data, and other structured data sources. AI has elevated the importance of implementing data governance and establishing guardrails on unstructured data sources used to train language models and provide context to AI agents.   

“Unstructured data now makes up the vast majority of enterprise information, and AI is redefining how organizations bring control, accessibility, and security to it,” says Ashish Mohindroo, general manager and senior vice president of Nutanix Database Service platform. “Leaders should ask themselves, ‘Who needs daily access to this data?’ and ‘How can we keep data safe from unauthorized access or accidental loss?’ ” Those are two key questions to address for all data sources, but they have historically been harder to answer for unstructured ones. I consulted with several experts on these complexities and on how AI can ease unstructured data governance challenges.

Context as important as content

Joanne Friedman, CEO of ReilAI, says that organizations must ensure safety through governed autonomy, which requires shifting from static access control to contract-based safety. “Routing messages is not the same as reasoning about them, connecting assets is not the same as understanding them, and reactive telemetry is not the same as choreographed intelligence,” says Friedman.

Structured data sources are a mix of transactional and relational data, supported by mature technologies to improve data quality and manage metadata. Document stores and other NoSQL databases provided better data management and search capabilities for unstructured data, but it wasn’t until vector databases and large language models (LLMs) emerged that we had tools to derive meaning from documents at scale.

“When I look at unstructured documents, I focus on the risk that lives inside the content because sensitive details hide in places people never review,” says Amanda Levay, CEO of Redactable. “I expect controls that stop those documents from entering unsafe workflows because exposure often happens before anyone knows the risk exists. I also push for systems that flag when a file carries information that shouldn’t move forward, so teams catch the problem at the moment it matters most.”
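To make the idea of flagging a file before it moves forward concrete, here is a minimal sketch of a content gate. It is an illustration only, not Redactable's method: the pattern table, the `find_sensitive` and `gate_document` names, and the "quarantine" and "allow" outcomes are all assumptions, and two regular expressions are nowhere near enough coverage for production use.

```python
import re

# Hypothetical detectors for illustration; real systems need far broader coverage.
SENSITIVE_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_sensitive(text: str) -> dict[str, int]:
    """Return a match count per pattern name, omitting patterns with no hits."""
    hits = {}
    for name, pattern in SENSITIVE_PATTERNS.items():
        count = len(pattern.findall(text))
        if count:
            hits[name] = count
    return hits


def gate_document(text: str) -> str:
    """Return 'quarantine' if any detector fires, otherwise 'allow'."""
    return "quarantine" if find_sensitive(text) else "allow"


if __name__ == "__main__":
    sample = "Contact jane@example.com, SSN 123-45-6789"
    print(find_sensitive(sample), gate_document(sample))
```

The design point is that the check runs at ingestion, before a document reaches a pipeline, model, or agent, rather than as a later audit.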

It’s a lot easier to define controls for accessing rows of structured financial transactions and customer records than to define rules for unstructured documents, such as contracts and health records. Friedman points out that the rules for unstructured documents are more dynamic, while Levay notes the scale and real-time complexities in evaluating documents.

Governance across the life cycle

Where should one begin implementing governance policies? There are many considerations for data pipelines, source data sets, consuming applications, AI models, and AI agents. Stéphan Donzé, founder and CEO of AODocs, says organizations need strong plumbing. He recommends a governed system that can perform the following tasks:

  • Route content to the right models
  • Enforce granular permissions
  • Map relationships between extracted entities and other taxonomies
  • Track implicit versions
  • Call in humans when the stakes are high

“Without these capabilities, AI becomes another black box. With them, you unlock an auditable, secure, explainable insight layer for data governance, risk, compliance, and mission-critical decisions at enterprise scale,” says Donzé.

Policies need to be implemented consistently across the full data lineage from source through consumption, including the creation of derivative data.

“One of the biggest security challenges with unstructured data is the lack of visibility and lineage as information moves across systems, clouds, and teams,” says Jack Berkowitz, chief data officer at Securiti. “When organizations cannot track where data originated, how it has changed—even what version is active or whether it is still relevant—they increase the risk of exposing sensitive or inaccurate data through genAI applications.”
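One way to picture lineage for unstructured data is a log that records where each document or derivative came from. The sketch below is a simplified assumption, not a description of any vendor's product: the `LineageRecord` and `LineageLog` classes, their fields, and the example system names are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


def sha256_of(text: str) -> str:
    """Fingerprint content so later changes can be detected."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class LineageRecord:
    doc_id: str
    fingerprint: str
    source_system: str
    derived_from: list[str] = field(default_factory=list)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class LineageLog:
    """In-memory lineage log; a real system would persist this."""

    def __init__(self) -> None:
        self._records: dict[str, LineageRecord] = {}

    def register(self, doc_id, text, source_system, derived_from=()):
        record = LineageRecord(
            doc_id, sha256_of(text), source_system, list(derived_from)
        )
        self._records[doc_id] = record
        return record

    def ancestors(self, doc_id: str) -> list[str]:
        """Walk derived_from links back toward the original sources."""
        seen: list[str] = []
        stack = list(self._records[doc_id].derived_from)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.append(parent)
                if parent in self._records:
                    stack.extend(self._records[parent].derived_from)
        return seen
```

With records like these, a chunk surfaced by a genAI application can be traced back to the summary and the source contract it was derived from.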

Using AI to classify and categorize

Extracting knowledge from documents, categorizing them, and then classifying them for user entitlements is complex enough. Add the fact that documents are roll-ups of sections and subsections, each of which needs independent analysis and must then be related back to the full document’s context.

Consider building construction specifications, which often follow the CSI MasterFormat document standard. CSI MasterFormat has 50 divisions, such as general specifications, electrical, and plumbing. Now consider access controls for this document, given that security is covered in two separate divisions and may require different classifications than other sections, such as equipment. But even that’s not sufficient context, as a general contractor should have different policies for accessing the specifications for a nuclear power plant than for a small office building.   
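The specification example can be sketched as a policy where the clearance required for a section depends on both the section and the project. Everything in this sketch is hypothetical: the division labels are shorthand rather than real MasterFormat division numbers, and the sensitivity scores, project adjustments, and role clearances are invented for illustration.

```python
# Hypothetical base sensitivity per section label (1 = low, 3 = high).
BASE_SENSITIVITY = {"general": 1, "plumbing": 1, "electrical": 2, "security": 3}

# Hypothetical adjustment for the kind of facility being built.
PROJECT_UPLIFT = {"small_office": 0, "nuclear_plant": 2}

# Hypothetical clearance level granted to each role.
ROLE_CLEARANCE = {"subcontractor": 1, "general_contractor": 3, "security_lead": 5}


def required_clearance(division: str, project_type: str) -> int:
    """Section sensitivity plus the project-level adjustment."""
    return BASE_SENSITIVITY[division] + PROJECT_UPLIFT[project_type]


def visible_divisions(role: str, project_type: str, divisions: list[str]) -> list[str]:
    """Return the sections this role may read for this type of project."""
    clearance = ROLE_CLEARANCE[role]
    return [
        d for d in divisions
        if required_clearance(d, project_type) <= clearance
    ]


if __name__ == "__main__":
    sections = ["general", "electrical", "plumbing", "security"]
    for project in ("small_office", "nuclear_plant"):
        print(project, visible_divisions("general_contractor", project, sections))
```

The same role sees every section of the office building specification but only part of the power plant specification, which is the kind of context-dependent rule that is hard to express with static, document-level access controls.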

Complex classification challenges are being addressed with AI and advanced algorithms. “Enterprises are shifting toward commodity-driven, API-driven governance accelerators, especially in areas like classification, taxonomy management, and domain-specific labeling,” says Nandakumar Sivaraman, senior vice president and chief architect of enterprise data at Bridgenext. “Instead of manually applying categories, rules, and policies across thousands of assets, companies are now using AI-driven classification APIs to auto-tag and categorize data. They use machine learning–based pattern detection to assign taxonomies, product hierarchies, or entity domains, and implement lightweight governance microservices for real-time classification in ingestion pipelines.”

Another approach uses vision language models (VLMs) to analyze the document’s visual structure for additional contextual clues. Harpreet Sahota, hacker-in-residence at Voxel51, says VLMs can classify documents without training data, but the bigger issue is that most organizations don’t have consistent taxonomies to begin with. “A first step is to treat documents as images rather than just extracting text, which preserves layout information that is important for understanding structure,” recommends Sahota.

Managing versions and duplicates

Documents can have hundreds of versions and derivatives scattered across SharePoint sites, cloud storage areas, SaaS platforms, and email attachments. One of the more significant unstructured data governance challenges is identifying the latest, accurate versions to include in AI models, retrieval-augmented generation (RAG) systems, and AI agents. 

“To improve document versioning, measure the semantic similarity between files and cluster documents that are likely versions of the same document,” says Reece Griffiths, field CTO for Collibra. “Once grouped, apply additional signals, such as last-modified date, metadata, or even title patterns to infer which document in each cluster is the most recent version.”
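The cluster-then-pick approach Griffiths describes can be sketched in a few lines. Note the substitution: this example uses lexical similarity from Python's standard `difflib.SequenceMatcher` as a stand-in for semantic similarity, which in practice would more likely come from embeddings. The `Doc` class, the greedy clustering, the 0.8 threshold, and the file names are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class Doc:
    name: str
    text: str
    modified: date


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; a stand-in for embedding similarity."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_versions(docs: list[Doc], threshold: float = 0.8) -> list[list[Doc]]:
    """Greedily group each doc with the first cluster whose seed it resembles."""
    clusters: list[list[Doc]] = []
    for doc in docs:
        for cluster in clusters:
            if similarity(doc.text, cluster[0].text) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    return clusters


def latest_in_each(clusters: list[list[Doc]]) -> list[Doc]:
    """Use last-modified date as the signal for the current version."""
    return [max(cluster, key=lambda d: d.modified) for cluster in clusters]
```

Last-modified date is only one of the signals mentioned above; metadata and title patterns could be added as tie-breakers in `latest_in_each`.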

Determining document versions once relied on rules-based systems with controls for data owners and tools for handling exceptions. Modern systems now incorporate AI to automate or recommend the latest, most accurate documents and suggest which ones to archive.

“Agents excel at processing unstructured data, reading and analyzing the contents of presentations, videos, emails, and chat logs at scale,” says Dr. Michael Wu, chief AI strategist at PROS. “To manage versions, we must combine search and genAI to enhance the practice of ‘search first, search often’ with ‘read all before creating.’ This fosters continuous document evolution, where outdated or incorrect content is naturally updated or flagged for deprecation.”

Document retention policies

Even after duplication is addressed, a key data governance question remains: How should organizations implement document retention policies? “Most organizations have well-defined retention rules for structured data, but applying those same rules to unstructured content has historically been very difficult,” says Griffiths of Collibra. “By performing AI-based tagging of every document according to a retention taxonomy, including record types and subtypes, companies can then query and manage unstructured data with the same precision they apply to structured data sets.”
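Once documents carry a retention tag, finding the ones past their retention period becomes a simple query. The sketch below assumes the tagging has already happened upstream. The record types and retention periods are hypothetical placeholders, not legal guidance, and the `TaggedDoc` class and function names are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical retention taxonomy: record type -> retention period in years.
RETENTION_YEARS = {"contract": 7, "invoice": 5, "meeting_notes": 1}


@dataclass
class TaggedDoc:
    name: str
    record_type: str  # assumed to be assigned by an upstream classifier
    created: date


def expiry_date(doc: TaggedDoc) -> date:
    """Creation date plus the retention period for the doc's record type."""
    target_year = doc.created.year + RETENTION_YEARS[doc.record_type]
    try:
        return doc.created.replace(year=target_year)
    except ValueError:
        # Created on Feb 29 and the target year is not a leap year.
        return doc.created.replace(year=target_year, day=28)


def past_retention(docs: list[TaggedDoc], today: date) -> list[str]:
    """Names of documents whose retention period has ended."""
    return [d.name for d in docs if expiry_date(d) <= today]
```

The hard part in practice is the tagging accuracy, not the query, which is why the quote above puts the emphasis on classifying every document against the taxonomy.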

Retention policies tend to follow legal guidelines with specific rules. A more difficult challenge is recognizing outdated information in documents that should no longer be used with AI models and agents.

“AI can age documents the way our minds naturally let older memories fade by noticing declining relevance signals, reduced connections to current work, and changing patterns of use,” says Jason Williamson, CEO of MythWorx. “Instead of a hard cutoff, it adapts continuously, helping organizations surface what’s still meaningful while gently retiring what no longer fits the present.”
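A gradual alternative to a hard cutoff can be illustrated with a decaying relevance score. This is a toy model, not MythWorx's approach: the half-life, the way link counts are weighted, the thresholds, and the tier names are all assumptions chosen to show the shape of the idea.

```python
def relevance(days_since_last_use: float, inbound_links: int,
              half_life_days: float = 180.0) -> float:
    """Score in (0, 1]: halves every half_life_days, boosted by connections."""
    recency = 0.5 ** (days_since_last_use / half_life_days)
    connectivity = inbound_links / (inbound_links + 1)  # approaches 1 as links grow
    return recency * (0.5 + 0.5 * connectivity)


def tier(score: float) -> str:
    """Map a score to a hypothetical handling tier."""
    if score >= 0.4:
        return "active"
    if score >= 0.1:
        return "review"
    return "retire_candidate"


if __name__ == "__main__":
    for days, links in [(0, 3), (180, 0), (720, 0)]:
        score = relevance(days, links)
        print(days, links, round(score, 3), tier(score))
```

Because the score changes continuously, a document drifts from active use to review to a retirement candidate instead of disappearing on a fixed date, and renewed use or new links pull it back up.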

Data security from start to finish

Three data disciplines are related: data governance protects the business, data privacy protects people, and data security protects the data. Implementing data security must first consider how people create and manage documents.  

“When you’re dealing with documents at scale, security and governance can’t be separate workflows with handoffs between teams; they become the same integrated workflow, with discovery, classification, and enforcement happening as one coordinated response,” says Rohan Sathe, cofounder and CEO at Nightfall. “Modern platforms need to quarantine inappropriately shared messages, emails, and files the moment they’re detected. They need to revoke over-permissioned access to sensitive documents, prevent unauthorized cloud sync operations, block risky CLI commands, and stop file uploads to unsanctioned destinations—all in real time.”

Since documents feed AI models and AI agents, a second data security consideration is which documents to include and how to protect the data embedded in AI. “The primary risk with AI isn’t just a traditional breach; it’s contextual leakage,” says Nico Dupont, founder and CEO of Cyborg. “Once you ground a model in your enterprise data, that model becomes a potential vector for surfacing sensitive information to unauthorized users, and you cannot rely on the model to be its own gatekeeper. True data security requires inference time governance and treating AI as a new tier of infrastructure where the security is built into the architecture and is as automated as the data cleaning itself.”

A third consideration is how data is protected as people interact with LLMs and AI agents. These must adhere to the user’s access policies and the usage context. “The primary security risk in AI document management is inference exposure, where an AI might correctly answer a question by accessing a sensitive document that the user technically shouldn’t see,” says James Urquhart, field CTO and developer evangelist at Kamiwaza AI. “To mitigate this risk, organizations must understand the relationships between different entities in their business ontologies and implement permission-aware indexing that ensures that AI and agentic systems respect the same access controls that a human would be subject to.”
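Permission-aware indexing can be sketched as a retrieval step that filters on the user's entitlements before ranking, so restricted content never reaches the model's context. This is a minimal illustration, not Kamiwaza's implementation: the `Chunk` class, the group-based access model, and the keyword-overlap scoring (a stand-in for vector search) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # access groups copied from the source document


def retrieve(query: str, index: list[Chunk], user_groups: set, limit: int = 3) -> list[Chunk]:
    """Filter by the user's groups first, then rank by keyword overlap."""
    terms = set(query.lower().split())
    permitted = [c for c in index if c.allowed_groups & user_groups]
    scored = [(len(terms & set(c.text.lower().split())), c) for c in permitted]
    ranked = sorted(
        (pair for pair in scored if pair[0] > 0),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [chunk for _, chunk in ranked[:limit]]
```

Filtering before ranking matters: if the restricted chunk were retrieved and then merely hidden from the citation list, its content could still shape the generated answer.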

One of the most challenging aspects of unstructured data governance is that regulations are evolving and AI capabilities are improving. Policies must evolve as businesses add more data sets, increase AI literacy across their employee base, and expand their AI use cases. Addressing the challenges of unstructured data governance will generate a growing backlog of work for the foreseeable future. 


Isaac Sacolick, President of StarCIO, a digital transformation learning company, guides leaders on adopting the practices needed to lead transformational change in their organizations. He is the author of Digital Trailblazer and the Amazon bestseller Driving Digital and speaks about agile planning, devops, data science, product management, and other digital transformation best practices. Sacolick is a recognized top social CIO, a digital transformation influencer, and has over 900 articles published at InfoWorld, CIO.com, his blog Social, Agile, and Transformation, and other sites.

Isaac's opinions are his own.
