
Despite the investment and effort poured into next-generation data storage systems, data warehouses and data lakes have failed to give data engineers, data analysts, and data leaders the trustworthy, agile business insights they need to make intelligent business decisions. The answer is Data Mesh: a decentralized, distributed approach to enterprise data management.
Zhamak Dehghani, the founder of Data Mesh, defines it as “a sociotechnical approach to share, access and manage analytical data in complex and large-scale environments – within or across organizations.” She has authored the O’Reilly book Data Mesh: Delivering Data-Driven Value at Scale, and Starburst, the ‘Analytics Engine for Data Mesh,’ is the sole sponsor. In addition to providing a complimentary copy of the book, we’re sharing chapter summaries so we can read along together and educate our readers about this (r)evolutionary paradigm.
Data as a Product
We’ve finally arrived at one of the most crucial principles of Data Mesh: data as a product, where organizations apply product thinking to domain-oriented data. Why is this significant? Put simply, when this principle is executed well, the business is poised to unlock an enormous amount of value from its data.
To recap how we got here: the first generation of data platforms was the data warehouse, where ownership resided with the warehouse team. That limited access, usability, and value for the analysts actually using data to create organizational value. Next came data lakes, which moved ownership to the data users, resulting in 45% of data scientists’ time being devoted to data cleansing and organization.
Now, Data Mesh shifts the responsibility as close to the source of the data as possible. This approach eliminates friction in access and usability, and it improves the overall experience of data users: data scientists, data analysts, data explorers, and everyone in between. With frictionless access to data and the agility to respond to external and internal organizational changes, the resulting faster time-to-insight has a significant impact on the business’s bottom line.
This approach isn’t unique to Data Mesh. Over the last decade we’ve seen an industry-wide shift toward the view that addressing problems is cheaper and more effective when it’s done as close to the source as possible. Zhamak reminds us, “Data is not what gets shared on a mesh, it is only a data product that can be worthy of sharing on the mesh.”
What Successful Data Products Should Embody
Successful products should have these three common characteristics: feasibility, value, and usability. This chapter primarily focuses on the usability and value of a data product.
For a data product to be usable, there are baseline usability characteristics that every data product must exhibit. It must be discoverable, understandable, addressable, secure, interoperable/composable, trustworthy, natively accessible, and valuable on its own. We highlight a few standouts below; you can read about the rest in the book.
Discoverable
Traditionally, discoverability with centralized data happens as a catalog listing, with available datasets, owners, location, sample data, etc. In contrast, Data Mesh embraces a source-oriented solution with data product discoverability, where information is intentionally provided by the data product itself. This information may include “source of origin, owners, run-time information such as timeliness, quality metrics, sample datasets, and most importantly, information contributed by their consumers such as the top use cases and applications enabled by their data.” With this information, data consumers or users can easily explore the available data products, search and find the needed datasets, and gain confidence in cultivating a data-driven mindset.
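To make the idea of a self-describing data product more concrete, here is a minimal sketch in Python. It is not taken from the book: the `DiscoveryInfo` class, its field names, the `search` helper, and the two sample products are all hypothetical, chosen only to mirror the kinds of information listed above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DiscoveryInfo:
    """Hypothetical self-description published by a data product."""
    name: str
    source_of_origin: str
    owners: list[str]
    timeliness_minutes: int                 # run-time information
    quality_metrics: dict[str, float]
    sample_rows: list[dict]
    top_use_cases: list[str] = field(default_factory=list)  # contributed by consumers


def search(products: list[DiscoveryInfo], keyword: str) -> list[str]:
    """Return the names of products whose name, origin, or use cases mention the keyword."""
    keyword = keyword.lower()
    hits = []
    for product in products:
        haystack = " ".join(
            [product.name, product.source_of_origin, *product.top_use_cases]
        ).lower()
        if keyword in haystack:
            hits.append(product.name)
    return hits


# Illustrative, made-up data products.
orders = DiscoveryInfo(
    name="orders",
    source_of_origin="checkout service",
    owners=["sales-domain-team"],
    timeliness_minutes=15,
    quality_metrics={"completeness": 0.98},
    sample_rows=[{"order_id": 1, "amount": 120.0}],
    top_use_cases=["revenue reporting", "churn prediction"],
)
shipments = DiscoveryInfo(
    name="shipments",
    source_of_origin="warehouse service",
    owners=["logistics-domain-team"],
    timeliness_minutes=60,
    quality_metrics={"completeness": 0.91},
    sample_rows=[{"shipment_id": 7, "status": "delivered"}],
    top_use_cases=["delivery time dashboards"],
)

print(search([orders, shipments], "churn"))  # ['orders']
```

The point of the sketch is only that each product carries its own discovery information, so a search can be assembled from what the products publish rather than from a separately maintained listing.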
Understandable
After the data consumer discovers a data product, the next step is to understand it. That means getting to know the semantics of the data, as well as the syntax in which the datasets are presented to the data user. Data consumers also need to understand exactly how the data is presented to them (e.g. data serialization, what SQL queries to execute, etc.) as well as the data schema (the underlying representation of the data). Well-described data schemas let consumers understand the data product in a self-serve manner. Ultimately, understanding a data product and creating value from it should no longer require “hand holding” of the end user; that self-serve understandability is a baseline data usability quality.
Trustworthy and Truthful
While understandability and discoverability close the gap between what the data consumer knows and doesn’t know about the data, it takes far more to trust the data. It’s crucial that the data represents the business accurately in terms of the events or transactions that have occurred, and in terms of the probability of truthfulness of the aggregations and projections that the business has created. To reduce uncertainty surrounding the data, a service level agreement helps. These agreements may include details around interval of change (how often changes in the data are reflected), timeliness (the time between when a business fact occurs and when it is served to data users), completeness (availability of necessary information), statistical shape of data (distribution, range, volume), lineage (the data’s journey from source to now), precision and accuracy over time (degree of business truthfulness as time passes), and operational qualities (freshness, general availability, performance).
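As an illustration of how two of these guarantees, timeliness and completeness, could be checked in practice, consider the sketch below. It is an assumption-laden example rather than a prescribed implementation: the `ServiceLevelObjectives` class, the threshold values, and the sample rows are all hypothetical, and completeness is simplified here to the share of rows with every required field filled in.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ServiceLevelObjectives:
    """Hypothetical targets a data product owner might commit to."""
    max_timeliness: timedelta   # allowed delay between a business fact and its availability
    min_completeness: float     # required share of fully populated rows, 0.0 to 1.0


def completeness(rows: list[dict], required_fields: list[str]) -> float:
    """Share of rows in which every required field is present and not None."""
    if not rows:
        return 0.0
    complete = sum(
        1 for row in rows if all(row.get(f) is not None for f in required_fields)
    )
    return complete / len(rows)


def check(rows, required_fields, occurred_at, served_at, slo) -> dict[str, bool]:
    """Compare measured timeliness and completeness against the objectives."""
    return {
        "timeliness": (served_at - occurred_at) <= slo.max_timeliness,
        "completeness": completeness(rows, required_fields) >= slo.min_completeness,
    }


# Made-up sample: one of four rows is missing its amount.
rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": 2, "amount": 80.0},
    {"order_id": 3, "amount": None},
    {"order_id": 4, "amount": 40.0},
]
slo = ServiceLevelObjectives(max_timeliness=timedelta(minutes=15), min_completeness=0.9)

report = check(
    rows,
    required_fields=["order_id", "amount"],
    occurred_at=datetime(2024, 1, 1, 12, 0),
    served_at=datetime(2024, 1, 1, 12, 10),
    slo=slo,
)
print(report)  # {'timeliness': True, 'completeness': False}
```

Here the data arrived within the agreed window, but only 75% of the rows are complete, so the product would fall short of its completeness objective. Publishing a report like this alongside the data is one way a product could communicate how far it can be trusted.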
Natively Accessible
The usability of a data product often hinges on how easily data consumers can access it with their native tools. As a nod to empathetic design, provide the same data to data analysts and data scientists, but in a way that aligns with their skill sets and tools. For instance, some data analysts are only comfortable using SQL to generate data visualizations or reports. Meanwhile, data scientists who curate and structure the data to train their models typically expect file-based data, while analytical application developers might expect a real-time stream of events.
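The sketch below shows one way the same records could be exposed in three access modes: a SQL table, a file, and a stream of events. It uses only the Python standard library as a stand-in for real infrastructure; the function names, the `orders` table, and the `order_placed` event type are hypothetical and exist purely for illustration.

```python
import csv
import io
import sqlite3

# Made-up records served by a single data product.
ROWS = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 40.0},
]


def as_sql_table(rows):
    """For analysts: load the rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (:order_id, :region, :amount)", rows)
    return conn


def as_csv_file(rows):
    """For data scientists: serialize the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["order_id", "region", "amount"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def as_event_stream(rows):
    """For application developers: yield the rows one event at a time."""
    for row in rows:
        yield {"type": "order_placed", "payload": row}


conn = as_sql_table(ROWS)
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('EU', 160.0), ('US', 80.0)]

csv_text = as_csv_file(ROWS)
events = list(as_event_stream(ROWS))
```

Each consumer works with the interface that matches their tools, while the underlying records stay the same.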
Valuable on Its Own
Data products must be inherently valuable to their data users. A data product should carry a dataset that is independently valuable and meaningful; if the data product owner can’t identify any value in a prospective data product, it’s best not to create it.