
According to IDC, 80% of the world’s data is unstructured, but unlocking the value of that data has been one of the most difficult technological problems in big data. All of this has changed with the advent of Artificial Intelligence (AI), which makes unstructured data more usable than ever before. In fact, much of the impact that AI is having on enterprises comes down to how completely it has changed the landscape for unstructured data and its use.
Do you support unstructured data in your AI architecture?
This raises a question: how do you add support for unstructured data to your existing architecture, now that AI has made that data usable for the first time?
In this article, we’ll explore why ingesting unstructured data has historically been challenging, and the data architecture you need to turn this raw data into an invaluable business asset.
Intrigued? Let’s jump in.
Accessing unstructured data vs. structured data
Before AI, accessible data in the enterprise was largely synonymous with structured or semi-structured data. These data types adhere to predefined schemas. For example, structured data is organized into tables with rows and columns, while semi-structured data follows hierarchical or tag-based structures such as JSON or XML. In both cases, this kind of data is more predictable, easier to index, and more readily queried than unstructured data.
How to use unstructured data
Unstructured data, on the other hand, includes content like text documents, images, videos, and audio files. This type of data does not follow a fixed schema, which makes it harder to process with traditional tools. To make it useful, unstructured data must often be interpreted and mapped into a more structured format. This transformation has historically been a barrier to using unstructured data in enterprise settings because it requires significant computation and sophisticated tooling.
To understand this more, let’s define the three different types of data clearly:
- Structured data: Data in a machine-readable table/column format (fixed schema) in a relational database, data warehouse, or even spreadsheets.
- Semi-structured data: Data in a human-consumable but semi-structured, machine-parsable format that an automated process can partially ingest. Think JSON, XML, HTML web pages, markup files, CSV files, log files, etc. These text files may contain machine-readable portions (e.g., dates and integers) and non-machine-readable blobs (text, image data, etc.).
- Unstructured data: Human-consumable data. Examples of unstructured data include text documents, PDF documents, social media posts, and multimedia files (video files, audio files, and images).
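The contrast between these three types can be sketched in a few lines of Python. The records and text below are invented purely for illustration; the point is that only the first two offer predictable, programmatic field access.

```python
import csv
import io
import json

# Structured: fixed schema, every row has the same columns.
structured = list(csv.DictReader(io.StringIO(
    "order_id,amount\n1001,49.99\n1002,15.00\n"
)))

# Semi-structured: machine-parsable, but fields can vary per record.
semi_structured = json.loads(
    '{"user": "ada", "note": "Call back Tuesday", "tags": ["sales"]}'
)

# Unstructured: raw human content with no schema to parse against.
unstructured = ("Hi team, the client sounded hesitant about pricing; "
                "let's loop in finance before Friday.")

print(structured[0]["amount"])     # predictable field access
print(semi_structured["tags"][0])  # partially predictable access
# There is no accessor for the unstructured text: extracting
# "client is hesitant about pricing" requires interpretation.
```

Notice that even the semi-structured record mixes machine-readable fields (`tags`) with a free-text blob (`note`) that is itself unstructured.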
How AI makes using unstructured data easier
Artificial intelligence has transformed how organizations work with unstructured data. Large Language Models (LLMs) and other AI systems can now analyze raw data such as PDFs, emails, images, or recordings and extract meaningful structure from them. This makes it possible to use information that was previously difficult to access. By enabling systems to understand and work with human-generated content, AI bridges the gap between technical systems and the way people naturally communicate and store knowledge.
Why unstructured data is human data
It’s worth taking a moment to consider why we have unstructured data in the first place.
Unstructured data is the closest digital reflection of how humans communicate, think, and interact. Unlike structured data, which follows rigid formats, unstructured data captures the richness of language, tone, visuals, and expression.
Until recently, only humans could reliably interpret this kind of data. Our existing workloads have primarily focused on structured and semi-structured data, consumed through SQL-based analytics or classic machine learning models. But the true value of unstructured data—whether stand-alone or embedded within semi-structured sources—has remained largely untapped.
AI now offers the ability to unlock this value at scale. This comes with certain key benefits.
Unstructured data is non-programmatic
The difficulty in processing unstructured data stems from its unpredictability. In a way, this is also its value. Unstructured data is non-programmatic. This means that it does not have a predefined format and does not fit any predefined data model. There’s no predefined way to parse its data structure because there is no data structure. Even if you do manage to parse it once, there’s no guarantee the data won’t change shape tomorrow, breaking any attempts at data pipeline automation.
Structured and semi-structured data can break pipelines, too. However, these formats usually include a schema. This schema serves as a data contract. While field values may change, the underlying schema provides a level of consistency. With unstructured data, there are no such contracts and no built-in guarantees.
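The idea of a schema as a data contract can be made concrete with a minimal sketch. The schema and records here are hypothetical; the takeaway is that violations of a contract are detectable, whereas free-form content has no contract to violate.

```python
# Hypothetical minimal "data contract" for a semi-structured record.
EXPECTED_SCHEMA = {"order_id": int, "amount": float}

def validate(record: dict) -> list:
    """Return a list of contract violations for one record."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors

print(validate({"order_id": 1001, "amount": 49.99}))  # no violations
print(validate({"order_id": "1001"}))  # drift is caught before the pipeline runs

# An email body or a PDF has no equivalent check: with no schema to
# validate against, shape changes surface only after something breaks.
```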
The attempt to structure unstructured data
Deriving value from unstructured data is difficult, but that hasn’t stopped data teams from attempting to unlock its value. Machine learning workloads, for example, required hiring entire teams to manually collect, annotate, and synthesize data, restructure it, and enrich its metadata. In essence, they created a data model from unstructured data.
While this approach achieved results in some instances, it lacked scalability. For example, a Databricks survey found that, before the rise of Generative AI, most organizations considered these efforts unviable due to the complexity of data curation. Only large enterprises had the resources to support AI projects, and even then, only in a limited number of high-impact use cases.
How AI unlocked unstructured data for data analysis
The era of inaccessible unstructured data is over. AI has significantly expanded the kinds of data that can be included in analytics workflows.
Generative AI has lowered the barrier. Today, companies can quickly analyze almost any kind of unstructured data using pre-trained AI models, making this data accessible for search, classification, and downstream analytics.
How GenAI systems ingest unstructured data
How does it work? GenAI systems ingest unstructured data by converting it into a numerical format that neural networks can process. For language models, this begins with tokenization, where text is broken into discrete tokens based on subword units. Each token is mapped to a dense vector using a learned embedding space that preserves semantic relationships.
These vectors are passed through multiple layers of the model, which identify patterns, context, and meaning across long sequences. The model’s probabilistic nature means it does not rely on fixed schemas but instead learns statistical correlations across vast and varied input data.
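The tokenization and embedding steps above can be sketched with a toy example. Real tokenizers (e.g., BPE) learn their subword inventory from data, and real embedding tables are learned during training; the vocabulary and vector values below are placeholders invented for illustration.

```python
import random

# Toy subword vocabulary: "unstructured" has no entry of its own,
# so it will be split into the pieces "un" + "structured".
vocab = {"un": 0, "structured": 1, "data": 2, "is": 3, "valuable": 4}

# Placeholder for a learned embedding table: one dense vector per token id.
random.seed(0)
EMBEDDING_DIM = 4
embeddings = [[random.uniform(-1, 1) for _ in range(EMBEDDING_DIM)]
              for _ in vocab]

def tokenize(text: str) -> list:
    """Greedy longest-match subword split against the toy vocabulary."""
    ids = []
    for word in text.lower().split():
        while word:
            for piece in sorted(vocab, key=len, reverse=True):
                if word.startswith(piece):
                    ids.append(vocab[piece])
                    word = word[len(piece):]
                    break
            else:
                word = word[1:]  # skip a character with no known piece
    return ids

token_ids = tokenize("Unstructured data is valuable")
vectors = [embeddings[i] for i in token_ids]  # dense input to the model
print(token_ids)  # "unstructured" became two subword tokens
```

Everything downstream of this point operates on the dense vectors, which is why the model never needs the input to conform to a schema.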
Because this architecture is indifferent to traditional structure, GenAI systems can effectively ingest messy, free-form inputs like raw text, complex documents, or transcripts, making this data accessible to analytical systems for the first time.
The AI revolution is really a revolution in unstructured data
In other words, GenAI excels at finding meaning in unstructured data. Even if your data format changes (say, a set of unstructured text documents or PDFs suddenly adopts a new layout), GenAI can detect the change and adapt automatically.
This intelligence enables a number of new and emerging use cases, including:
- Ingesting image data and making it available to analytics for the first time.
- Digesting and summarizing sales call transcripts, then making “next action” recommendations based on that and other data about the client.
- Ingesting and analyzing the meaning of video data.
- Digesting PDF data and making the conclusions and numeric data available alongside structured and semi-structured data.
This isn’t a “new kind of analytics.” Rather, it’s making data that was previously inaccessible available for analysis. That’s why we say that AI unlocks the value of your data like never before.
Accessing all your data
Data access is at the heart of the unstructured data problem. Traditional data architectures have supported structured and semi-structured data for analytics and operational applications. However, enabling AI use cases requires broader access to large volumes of high-quality data, including the unstructured data scattered across your organization.
Bringing this unstructured data into AI workflows while maintaining strong governance, access control, and compliance creates a significant architectural challenge.
Unlocking unstructured data silos
Until recently, using unstructured data was uneconomical. This means that most of the unstructured data you own is currently lying dormant. This so-called “dark data” likely makes up the majority of data in your enterprise.
This data isn’t generating any revenue for your company. Worse, it costs money to process and store.
Retrieving this data isn’t straightforward, as it’s typically scattered across various data storage systems, including databases, data warehouses, data lakes, and data lakehouses. These repositories are distributed across cloud-based systems and on-premises hardware. There’s rarely a uniform way to access them; a single team may hold the keys to a treasure trove of previously untapped unstructured data.
Unlocking the dark data in these data silos is key to activating data for AI. The question is, what’s the best way to do that?
Centralization vs. decentralization
The traditional answer to the data access issue has been “centralize everything.” Companies undertake huge, often years-long projects aimed at bringing all of the company’s data into a single data lake or data lakehouse.
The problem is that indiscriminate centralization almost always fails. Teams spend months moving data around before they can accomplish anything. That slows AI projects, wasting both time and money.
In the AI world, the centralization approach amounts to “bringing your data to your AI.” Often, it doesn’t work. We have a better idea: bring AI to your data. Taking this approach means:
- Ensuring centralized access to all data sources via a common set of data connectors and query tools; and
- Enabling centralization of critical workloads that require further performance optimization.
This architectural approach breaks down data silos, enabling immediate access to data wherever it lives in the organization. Developers can begin building and experimenting now, not six months from now. It also saves time and money by not forcing teams to centralize data unless it makes sense to do so. This puts choice back in your hands.
Starburst: Making data for AI easy
This is where Starburst comes in. Since before the advent of AI, we’ve dedicated ourselves to facilitating data access, whether that data is spread across your organization or locked away in data silos.
Unstructured data is the last major frontier of siloed data. AI has made it accessible for the first time, enabling you to analyze and convert it into a structured format that you can use alongside your other, conventional data. This brings all your data—structured, semi-structured, and unstructured—under one roof.
With the Starburst Icehouse architecture, you can easily unlock your siloed unstructured data and make it available for GenAI workloads. The Icehouse adds two pivotal components to your existing data architecture:
- Ingestion of streaming or bulk unstructured data, in real time or via file upload. This data can then be converted to a structured format, typically by extracting and organizing text from sources such as images, videos, and PDFs, then transforming it into structured output using AI models such as LLMs.
- Storage of data in Apache Iceberg, a modern table format built for speed and security.
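The extraction step in the first component can be sketched as follows. A regex stub stands in for the LLM call here so the example is runnable; in practice you would prompt a model to emit JSON matching a target schema, and all names and values below are hypothetical.

```python
import json
import re

def extract_structured(document_text: str) -> dict:
    """Stand-in for an LLM extraction call. In production, a model
    would be prompted to return JSON matching a target schema; here
    a simple regex plays the model's role to keep the sketch runnable."""
    amount = re.search(r"\$([\d,]+)", document_text)
    client = re.search(r"with ([A-Z]\w+)", document_text)
    return {
        "client": client.group(1) if client else None,
        "deal_value_usd": (int(amount.group(1).replace(",", ""))
                           if amount else None),
    }

transcript = "Spoke with Acme about renewal; they budgeted $12,000 for next year."
record = extract_structured(transcript)
print(json.dumps(record))  # a structured row, ready to land in a table
```

Once the output conforms to a schema, it can be written to an Iceberg table and queried alongside your existing structured data.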
All of this puts choice and access at the heart of your AI journey. With more than 50 data connectors, Starburst connects to your data wherever it lives. That means both you and your AI agents can easily discover and leverage all of your data for AI.