Gradually reducing data warehouse costs using an Icehouse architecture
Daniel Abadi
Computer Science Professor
University of Maryland, College Park


Virtually every database and data warehouse vendor uses a pricing model that scales with the amount of data the product manages: the fancier the product, the higher the per-gigabyte cost, and the more data you have, the larger the bill.
The largest “Big Tech” companies in the world (Google, Facebook/Meta, etc.) quickly realized that, given the extraordinary amount of data they generate and need to keep, it makes no financial sense to buy data management software from a vendor and pay the astronomical bills those vendors would charge to manage data at that scale. Instead, it made far more sense to build their own solutions in-house. Although building such systems is a complex and expensive endeavor, it was still cheaper for these companies than buying from a vendor.
The emergence of free and open source solutions
To help reduce the expense of building these complex software systems, many of these companies open-sourced their systems. This enabled external developers to contribute to the code and continuously improve it. It also served as a recruiting vehicle for these large companies — many top software developers are far more excited to build open source software than closed-source proprietary software. The result is a “win-win” for everyone involved: the community benefits by getting free access to complex software systems, and the tech companies from which they originate benefit by getting free software contributions, robust “in production” testing, and bug fixes from community members.
Examples of open source projects that originated in this way include Hadoop and HDFS (Yahoo), Presto and Trino (Facebook/Meta), Cassandra (Facebook/Meta), TensorFlow (Google), and many others.
The Icehouse architecture
Over time, entire software stacks emerged that consist entirely of these free and open source systems and that together provide critical data management functionality. For example, instead of storing data tables in high-end database systems, many organizations store their data as Apache Parquet files within Apache Iceberg tables, kept in the HDFS distributed file system (part of Apache Hadoop) or in cloud object storage.
These free and open source stacks provide enormous cost benefits relative to storing tables in high-end database systems or cloud data warehouses. The only cost is the hardware to physically store the data, or the rental of S3-style cloud object storage. All of the complex data management software is free and takes care of data replication, compression, optimized (columnar) layout on disk, metadata management, and data integrity. The data sits ready to be accessed by any data processing tool compatible with Iceberg, including Trino, Starburst, and Spark.
This architecture, in which tables are managed by Apache Iceberg and consist of data files stored in open formats (such as Parquet or ORC) sitting in cloud object storage or in HDFS, is often called the “Icehouse architecture”. When paired with a data processing engine, it completely replaces the full set of functionality provided by modern data warehouses, both on-premises and in the cloud, yet at a small fraction of the cost. The appeal is obvious, not only for the largest tech companies, but for any company looking to avoid the 5, 6, 7, or 8-digit price tags of data warehouse vendors.
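To make this concrete, here is a minimal sketch of the storage layer of an Icehouse using PySpark (Spark is one of the Iceberg-compatible engines mentioned above). It registers a file-system-backed Iceberg catalog and converts a directory of raw Parquet files into an Iceberg table. The bucket, the catalog name ("lake"), and the table name are illustrative assumptions, and the matching Iceberg Spark runtime package is assumed to be available to Spark.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar for this Spark version is on the classpath.
spark = (
    SparkSession.builder.appName("icehouse-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog named "lake" backed by a plain storage path
    # (no proprietary metastore required).
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")  # or an hdfs:// path
    .getOrCreate()
)

# Read raw Parquet files and persist them as an Iceberg table; Iceberg handles
# the table metadata, snapshots, and schema tracking from here on.
events = spark.read.parquet("s3a://my-bucket/raw/customer_events/")
events.writeTo("lake.analytics.customer_events").using("iceberg").createOrReplace()
```

From this point, any Iceberg-compatible engine (Trino, Starburst, Spark, and others) can query the table without the data ever touching a proprietary warehouse.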
How to experiment with Icehouses
Given the cost appeal of such architectures, many organizations running existing data warehouses or other data management solutions from traditional vendors will want to experiment with Icehouses to evaluate whether they really can replace their existing solutions. If the experiment goes well, they may create a plan to switch over to the Icehouse architecture. However, in many cases a complete switch is not possible in the short term, and perhaps not even in the medium-to-long term, if mission-critical applications running on the existing infrastructure are too risky to move.
Experimentation without abandoning entrenched solutions
Even if switching to a new solution is not possible in the short or medium term, it is still important to understand the costs of sticking with the existing solution vs. the benefits that the Icehouse architecture can bring. This is best done via hands-on experimentation. There are several practical questions that come up in this context:
(1) What is the best way to experiment with Icehouses to get a sense of their performance, capabilities, and ability to replace existing expensive solutions?
(2) Is it possible to partially switch to an Icehouse while keeping some workloads running on the existing data warehouse? In this context, what if data is spread out across both types of stores and a query comes along that needs access to data in both locations?
(3) What is the best long-term plan for switching over to an Icehouse architecture?
Data architecture best practices
Many of these questions are not specific to Icehouse architectures. In truth, any time a new technology comes along that claims to be superior to existing solutions along any dimension (price, performance, features, etc.), it is best practice not to get caught up in the hype, but rather to rigorously experiment with the new technology for a given use case and transition to it slowly and carefully. The only real difference specific to Icehouses is that, since their core is based on free and open source technologies, it is very easy to get started with these experiments without having to contact a vendor first.
Data virtualization is an important tool to help with these experiments
In this context, data virtualization is an extremely helpful tool for managing these types of experiments, transitions, and multi-solution deployments. To understand why, let’s dive into a fairly common example of how many organizations get started with Icehouses.
Exploring the Icehouse: How data virtualization helps
An organization, let’s call it O, has an existing data warehouse containing hundreds of tables, with dozens of applications that access data from the warehouse in various ways. Resources have been budgeted for its current petabyte size. Nothing is broken, and nothing needs to be fixed, except that the existing solution consumes large amounts of money each year.
Trying out an Icehouse with a single “starter” dataset
Over time, a large dataset emerges as being strategically important to O’s business operations. Perhaps it contains important raw data for new machine learning initiatives, or perhaps it contains useful information about customer behavior and sentiments that can add value when combined with existing data in the data warehouse. Either way, in an ideal world, it would be added to the data warehouse. The only problem is that due to its size, adding it to the warehouse will cause warehouse costs to rocket beyond existing budget constraints.
This new dataset is obviously a great opportunity to try out Icehouse-style access and find out what value it can bring to O’s operations. Instead of letting it cause the existing warehouse’s costs to skyrocket, it is stored as Parquet files in HDFS or cheap cloud storage such as S3, and managed by Iceberg. This allows the dataset to be introduced to data scientists, analysts, and analytical software immediately with minimal upfront costs.
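As a hedged illustration of how quickly the new dataset becomes usable, the sketch below reads the Iceberg table directly into pandas using PyIceberg. The catalog endpoint and table name are assumptions; a Hive, Glue, or SQL catalog would be configured analogously.

```python
from pyiceberg.catalog import load_catalog

# Assumes an Iceberg REST catalog is reachable at this hypothetical endpoint.
catalog = load_catalog("lake", **{"type": "rest", "uri": "https://catalog.example.com"})

table = catalog.load_table("analytics.customer_events")

# Project a few columns and materialize the result as a pandas DataFrame;
# no warehouse licenses or loading jobs stand between the files and the analyst.
df = table.scan(
    selected_fields=("customer_id", "event_type", "event_date")
).to_pandas()
print(df.head())
```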
The immediate issue that comes up, however, is that now the organization has data in more than one place. One of the major selling points of data warehouses is that all the data is stored in a single location, managed by a single piece of software, and carefully integrated such that it is easy to combine different datasets managed by the warehouse in a single analysis task. But now O has most of its data in the warehouse, and a new strategically important dataset stored elsewhere, potentially located on different physical hardware, and managed by different software. This isn’t so bad if all analytical jobs need access to data either in the warehouse or the new dataset, but not both at the same time. But if some jobs need to combine data across systems, what should be done?
Data virtualization as the solution
This is where data virtualization technologies are so valuable. Data virtualization provides a single interface through which data from multiple underlying systems can be accessed within the same request or query. When it works properly, it is irrelevant whether data is stored in a data warehouse, an Icehouse, or any other type of system: the user gets the impression that all data is stored in a single location and can be accessed together. In our running example, O’s data scientists and analysts send their requests to the data virtualization system directly, and the data virtualization system handles the complexity of directing those requests to the appropriate underlying systems and, where necessary, combining, joining, or aggregating results across the systems that contain data relevant to a particular request.
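As a concrete, hedged sketch, the query below uses the Trino Python client as the virtualization layer (Trino and Starburst were already mentioned above; any federation engine with a SQL front end would look similar). The host name and the two catalog names, "warehouse" for the legacy warehouse connector and "iceberg" for the Icehouse, are assumptions.

```python
import trino

# One connection to the virtualization layer, regardless of where tables live.
conn = trino.dbapi.connect(
    host="query.example.com", port=443, user="analyst", http_scheme="https"
)
cur = conn.cursor()

# A single query that spans both systems: the engine fetches from each source
# and performs the join, so the analyst never needs to know where the data sits.
cur.execute("""
    SELECT c.region, count(*) AS events
    FROM warehouse.sales.customers AS c
    JOIN iceberg.analytics.customer_events AS e
      ON c.customer_id = e.customer_id
    GROUP BY c.region
""")
print(cur.fetchall())
```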
Data virtualization is thus an important tool whenever experimenting with any new data system technology. Assuming the data virtualization software being used supports the new technology, data can be stored and managed by the new technology and yet become immediately available through the same data virtualization interface used to access data in all of the organization’s other existing systems. If accessing data in the new system yields performance and capabilities comparable to accessing data in the existing systems, the experiment can be considered a success and the new dataset can stay where it is. Otherwise, other options can be tried, or the data can be migrated back to the older systems.
Data virtualization can also help with long-term data migrations
Perhaps more important than the initial experiment, data virtualization also helps to implement a gradual migration to the new system. Initially the new system may only contain a seed or experimental dataset; over time, however, the plan may be to move data from the old system to the new one. It is usually impossible to do this in a single step, since data movement can be disruptive to ongoing critical business processes. Data virtualization allows data to be moved over to the new system one dataset or table at a time. The interface to the client or end-user application stays the same regardless of where the data is stored. The data virtualization system is in charge of hiding which underlying systems are currently managing any particular dataset: it simply accepts requests from clients or end-user applications, routes them to the relevant set of underlying systems, and aggregates the results.
Reviewing the example
In our example, O started with a new, strategically important dataset in a new system (an Icehouse) and used data virtualization to experiment with accessing that dataset alongside data stored inside an existing data warehouse. After this initial successful experimental phase, O decided that it would be significantly more cost-effective to move more data out of the data warehouse and into the much cheaper Icehouse. Unfortunately, many business-critical applications were set up to access data in the data warehouse directly, without communicating with the data virtualization system. Redirecting them to the new system or to the data virtualization system would require code changes in these applications, which in turn require testing and a careful rollout and are not something that can be done immediately.
The best option in this scenario would be for O to make these code changes one application at a time. There are two steps to this process:
Step 1: Redirect requests to the data virtualization system
This first step typically requires very few code changes, since it involves simply redirecting requests to the data virtualization system instead of the old data warehouse. As long as the two systems expose a similar API (e.g., a SQL interface), this change can be made quickly. Initially, nothing else changes: the data virtualization system will either redirect the request back to the old data warehouse or will pull the data necessary to fulfill the request from the old data warehouse on the fly. For most data virtualization systems, performance will not be significantly impacted (at least not negatively) by the additional level of indirection.
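The sketch below illustrates how small this change can be when both the old warehouse and the virtualization layer speak a SQL interface through DBAPI-style Python clients. The host names, the hypothetical legacy driver, and the catalog/schema names are assumptions.

```python
# Before: the application connected to the warehouse directly, e.g.
#   import legacy_warehouse_driver                    # hypothetical vendor driver
#   conn = legacy_warehouse_driver.connect(host="dw.example.com", user="app")

# After: the same SQL is sent to the data virtualization layer instead.
import trino

conn = trino.dbapi.connect(
    host="query.example.com", port=443, user="app", http_scheme="https",
    catalog="warehouse", schema="sales",   # initially still backed by the old warehouse
)

# Everything below this point is unchanged application code; the virtualization
# layer simply forwards the request to the old warehouse for now.
cur = conn.cursor()
cur.execute("SELECT count(*) FROM orders WHERE order_date = current_date")
print(cur.fetchone())
```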
Step 2: Copy data to the new system
Once the code has been changed to direct requests to the data virtualization system, the physical data used to fulfill those requests can be copied to the new underlying system (the Icehouse) as a background task. Furthermore, any systems that append to or modify the dataset being migrated must also be updated to write those changes to the new system. This too may require code changes and is not something that happens quickly. While these changes are ongoing, the data virtualization system continues to direct requests over these datasets to the old system. Once the changes are complete, and the data virtualization system has been notified that the new system holds the correct version of the datasets (the exact mechanism for doing this varies across systems), it can direct all requests over these datasets to the new system. Once this switch-over is successful, the datasets can be safely removed from the old system (as long as no other applications still use them by connecting directly to the old system), and the organization realizes the associated cost benefits of having a smaller data footprint in the old system.
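Under the same assumptions as the earlier sketches, the background copy itself can be driven through the federation engine with a cross-catalog CREATE TABLE AS SELECT. Catalog, schema, and table names are illustrative, and a real migration would also validate the copy (row counts, checksums) and coordinate the cut-over with any ongoing writers.

```python
import trino

conn = trino.dbapi.connect(
    host="query.example.com", port=443, user="migration", http_scheme="https"
)
cur = conn.cursor()

# Copy the historical data from the legacy warehouse into the Icehouse.
cur.execute("""
    CREATE TABLE IF NOT EXISTS iceberg.sales.orders AS
    SELECT * FROM warehouse.sales.orders
""")
cur.fetchall()  # the client executes lazily; consume results to run to completion

# A simple sanity check before telling the virtualization layer to route
# "orders" requests to the new system.
cur.execute("""
    SELECT (SELECT count(*) FROM iceberg.sales.orders) =
           (SELECT count(*) FROM warehouse.sales.orders)
""")
print(cur.fetchone())
```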
The key benefit of using data virtualization during this migration
It is important to remember that several applications may use the same dataset and that an application may access multiple datasets. The above approach of moving one dataset at a time to the new system, and migrating one application at a time to direct its data requests to the data virtualization system, is very likely to produce intermediate states in which applications need to access data in more than one system. For example, a retail application may need to join information about orders and customers (e.g., to get the address of the customer that placed each order). If the orders table is moved to the new system before the customers table (or vice versa), there will be an intermediate state in which these requests must join data across systems. This is why using a data virtualization system is so important: these systems are designed for exactly this purpose, allowing requests to span many different types of systems and bringing data together on the fly while processing them.
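Continuing the same hypothetical setup, here is what that intermediate state looks like in practice: the orders table has already moved to the Icehouse while customers is still in the legacy warehouse, yet a single query joins them. Catalog and column names are assumptions.

```python
import trino

conn = trino.dbapi.connect(
    host="query.example.com", port=443, user="analyst", http_scheme="https"
)
cur = conn.cursor()

# The join spans both systems; the virtualization layer brings the rows
# together on the fly.
cur.execute("""
    SELECT o.order_id, c.customer_name, c.shipping_address
    FROM iceberg.sales.orders AS o          -- already migrated to the Icehouse
    JOIN warehouse.sales.customers AS c     -- still in the legacy warehouse
      ON o.customer_id = c.customer_id
    WHERE o.order_date >= current_date - INTERVAL '7' DAY
""")
print(cur.fetchmany(5))
```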
What type of data virtualization should be used, and what are some important performance pitfalls?
In this post, we have spoken in general terms about how Icehouses can be used to save on data warehouse costs, and how data virtualization is a key tool that enables experimentation with Icehouses and safe, long-term migrations to them. However, not all data virtualization systems are designed for this type of use case, and some performance pitfalls emerge if the wrong type of data virtualization system is used for these kinds of workloads. In our next post, we will dive deeper into how data virtualization systems work, the performance pitfalls that come up in this context, and which architectures are best suited for these experimentation and migration scenarios.
In the meantime, readers are welcome to download and read my O’Reilly book on this subject.