Who created the Data Mesh Concept?
Zhamak Dehghani created the data mesh concept; if you are new to it, she is the person to follow.
What’s Data Mesh?
Data is an asset. Over the last decades we have been extracting more value and more insights from it, and every year your data multiplies, either in volume or in number of data sets. When that happens, we face a lot of problems: delays in getting analytics on time, and a data team under extreme pressure, with big backlogs, trying to deliver on business needs. This is, for me, the major problem, and that's where Data Mesh comes in, aiming to solve it along with others such as data ownership and data quality.
So what’s data mesh?
The data mesh architectural paradigm shift is all about moving analytical data away from a monolithic data warehouse or data lake into a distributed architecture, allowing data to be shared for analytical purposes in real-time, right at the point of origin.
At its core, it makes data highly available, easily discoverable, secure, and interoperable with the applications that need access to it. It is not centralized and monolithic.
It brings the concept of breaking down data lakes and silos into smaller, more decentralized portions.
“Much like the shift from monolithic applications toward microservices architectures in the world of software development, Data Mesh can be described as a data-centric version of microservices,” as Bhavesh Furia puts it.
The Four Principles:
The data mesh paradigm is founded on four principles:
1. Domain-oriented ownership / Data Ownership by Domain
2. Data as a product
3. Data available everywhere in a self-serve data infrastructure
4. Data standardization governance / Data is governed wherever it is
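The first three principles can be made a little more concrete in code. Below is a minimal sketch, not from any real data mesh framework (all names here are hypothetical): a domain team publishing "data as a product" describes its output with an explicit contract, so ownership, schema, and freshness expectations are discoverable through a self-serve catalog instead of buried in a central team's backlog.

```python
from dataclasses import dataclass

# Hypothetical descriptor for a "data product": each domain team owns
# and publishes its data with explicit metadata, rather than dumping
# raw tables into a central lake.
@dataclass
class DataProduct:
    name: str                   # discoverable identifier
    domain: str                 # owning domain (principle 1)
    owner: str                  # accountable team or person
    schema: dict                # column name -> type: the public contract
    freshness_sla_minutes: int  # how stale consumers may expect data to be

    def describe(self) -> str:
        cols = ", ".join(f"{c}: {t}" for c, t in self.schema.items())
        return f"{self.domain}/{self.name} owned by {self.owner} ({cols})"

# A registry stands in for the self-serve catalog consumers search
# (principle 3).
registry: dict = {}

def publish(product: DataProduct) -> None:
    registry[f"{product.domain}/{product.name}"] = product

publish(DataProduct(
    name="orders",
    domain="sales",
    owner="sales-data-team",
    schema={"order_id": "int", "amount": "float", "placed_at": "timestamp"},
    freshness_sla_minutes=15,
))

print(registry["sales/orders"].describe())
```

The point of the sketch is that the contract travels with the data: a consumer in another domain finds the product, its owner, and its schema in one lookup, without asking a central team.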
What does it look like from an architecture perspective?
John Mallinder from Microsoft made an amazing slide that summarizes it very well under a “harmonized mesh topology”:
If you want to know more about Microsoft's Data Mesh approaches/topologies, you can find all three - Governed mesh topology, Harmonized mesh topology, Highly federated mesh topology - in this article (if you know more, please share them in the comments).
Is this for your company?
Depending on the complexity of your data infrastructure and how demanding it is, you may not need it, you may start implementing some data mesh practices and concepts to ease up on a later migration, or you can be in the sweet spot and should join the revolution.
No solution fits all. Data mesh sounds great at this point, but for this approach to succeed you need to rethink your company's goals around data (basically, a clear data strategy) and consider adopting a modern data stack that includes a good data virtualization tool, which is mandatory in the data mesh concept.
Data mesh and Data Fabric
Data fabric is technology-centric, while data mesh is focused on organizational change: data mesh is about people and process, whereas data fabric is a purely architectural approach that tackles the complexity of data and metadata in a smart way.
Both work well together!
Is data mesh good for all “data pipelines”?
Many people misunderstand data mesh, taking it to mean the end of data engineers as we know them and the end of traditional data warehouses. In my opinion, of course not. You will still need to incorporate a data warehouse into the process and build some more complex pipelines to solve problems such as:
• Will my operational system's performance be affected?
• Can my operational system handle the reporting speed?
• If my operational system is down, is my reporting down too?
• If my operational system changes, do my reports break?
• Where will data be cleaned?
• What if my operational system doesn't track historical data?
Is this story familiar? If you have been in the data space for some time, you can relate to these problems and the pain, but keep in mind that the paradigm has changed a lot if you already have modern technology that can scale.
Modern Data Technology Stack
Let's start with the amazing image of The 2021 Machine Learning, AI and Data (MAD) Landscape by mattturck.com and FirstMark:
Insane, right? Insanely hot!
Is choosing the right tech stack very important?
Yes and no! What's important is that you create the conditions that allow you to move between stacks easily and with little effort.
If you are not happy with one tool, or you are paying too much, you just change it. For that, you will need to build your data platform the right way, and that's why you see so many quadrants and why we started seeing a lot of companies focused on solving one specific problem. It's the end of one technology that solves it all.
What to look at when building a stack like this?
• Manage Compute and Storage separately.
• Everything as code: any change is a merge request.
∘ All code/everything must be versioned.
∘ Allows us to implement good engineering practices (automation, collaboration, testing, code tracking).
• Good observability (logging + monitoring + lineage).
• Open ecosystem
• Data Lake for infinite and universal ingestion.
• Data Warehouse with columnar storage for fast analytics.
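On the last point, here is a toy sketch in pure Python (hypothetical data, no real warehouse involved) of why a columnar layout helps analytics: aggregating one column only touches that column's values, instead of scanning every field of every row the way a row-oriented operational store would.

```python
# Row-oriented layout: one record per dict, as an OLTP system stores it.
rows = [
    {"order_id": 1, "customer": "a", "amount": 10.0},
    {"order_id": 2, "customer": "b", "amount": 25.5},
    {"order_id": 3, "customer": "a", "amount": 4.5},
]

# Column-oriented layout: one contiguous list per column, as a
# columnar warehouse stores it for fast scans.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [10.0, 25.5, 4.5],
}

# Row layout: the aggregate walks every record and picks out one field.
total_row = sum(r["amount"] for r in rows)

# Columnar layout: the aggregate reads a single contiguous list, which
# is also what enables compression and vectorized execution at scale.
total_col = sum(columns["amount"])

assert total_row == total_col == 40.0
```

At three rows the difference is invisible, but at billions of rows the columnar scan reads a fraction of the bytes, which is why analytical warehouses are built this way.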
Dream Stack - Is there a perfect stack for analytics?
Each stack needs to be chosen by looking at the business use cases; no solution fits all (sorry for repeating this). The points to look out for are:
• Cloud data lakes offer the least expensive way to store data today. There is no need to spend time or resources transforming data to store or analyze it.
• Companies can easily scale their use of the technology, benefiting from the true separation of compute and storage.
• Customers aren't locked in with a single vendor who can set prices and terms. They can take advantage of the best-in-class or highest-value options for specific use cases. Most tools are open source or SaaS and thus easy to connect and operate.
• Customers can use their choice of processing engines (Spark, Dask, Starburst, etc.) and store data in any format they want. This is critical for enterprises that have dated, on-premises storage systems that are difficult and costly to move entirely to the cloud.
• Anyone can access a company's data through their preferred framework, without having to use a specific tool or format. That means data analysts, data scientists, application developers, and others can efficiently make the most of the data.
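The engine-choice point can be sketched with the Python standard library alone. In the snippet below, CSV and SQLite are toy stand-ins for open formats like Parquet and engines like Spark, Dask, or Starburst: the data is stored once in an engine-neutral format, and two different "engines" query the same copy.

```python
import csv
import io
import sqlite3

# One copy of the data in an open, engine-neutral format (CSV here,
# standing in for Parquet in a real lake).
raw = "customer,amount\na,10.0\nb,25.5\na,4.5\n"

# "Engine" 1: a plain Python scan over the raw file.
python_total = sum(
    float(row["amount"]) for row in csv.DictReader(io.StringIO(raw))
)

# "Engine" 2: SQL over the same data, loaded into in-memory SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(r["customer"], float(r["amount"]))
     for r in csv.DictReader(io.StringIO(raw))],
)
(sql_total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()

# Both engines agree, because neither one owns the storage format.
assert python_total == sql_total == 40.0
```

Because no single engine owns the data, you can retire one and adopt another without rewriting storage, which is exactly the no-lock-in property described above.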
Is this too much tech stack?
Indeed, this raises some concerns: How many people will be needed to manage all this? Do I have the necessary stack skills? But remember that the concept is data as a product, and most of the tech is self-managed with awesome uptimes and support.