Semantics, sources and wasted time
It’s widely acknowledged that in any data warehousing and analytics project, around 80% of the effort is spent on ETL (Extract, Transform, Load) and other data preparation tasks. This isn’t just a random statistic; it’s a reality many professionals face. I’ve experienced this first-hand while working with the SSIS team at Microsoft. We saw a golden opportunity to make data analytics more accessible, user-friendly, and influential in the business world by reducing the time spent on data preparation. We certainly shook up the ETL market, but we didn’t change the dynamics of data analysis for business analysts.
Even today, it seems as though data preparation and ETL still make up roughly 80% of the effort involved in analytics projects. But instead of asking why this work takes so long, we should be asking why this work is needed in the first place. Why does data need so much preparation and transformation?
One reason is that when we extract data from its source system, we strip it of its semantic context. In the source system, data is created within a specific business context and stored alongside various supporting data that gives meaning and value to it. An invoice is not just a number and some line items: it is an element in a complex business process extending through marketing, sales, CRM, support, manufacturing, supply chain, shipping and much more. But when we extract that data, we lose all the rich operational context the source system provided. As a consequence, we have to put in a ton of work to rebuild as much of the original context as we need by enriching the newly extracted data. Yet the data was already semantically rich in the source system.
I often say that extracting data for data warehouses and analytic systems is like uprooting a sapling from a forest to replant it in our garden. For it to thrive, we have to modify and enrich the soil, water and feed the young tree carefully, and meticulously tend to it. But in its native forest, the sapling didn’t need any of this assistance. It grew in a wonderfully complex, supportive ecosystem quite without our help.
So it is with business data. ETL and data transformation are too often destructive processes, stripping away meaning and context. And then we have to try to recreate that context all over again.
What is to be done? One straightforward solution is to create analytic systems that are virtual or hybrid, keeping the data where it’s semantically most comfortable – in its original application. With the advancements in modern virtualization technologies, this has become a much more viable option than in the past. We can now query, aggregate, and model data efficiently and effectively without having to construct new systems. By leveraging virtualization, we can access and analyze the data in its original context, preserving its semantic richness. This approach significantly reduces the need for ETL and data preparation, as the data is not being moved or transformed as extensively. The result is a more streamlined process that allows for quicker and more accurate data analysis.
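As a deliberately tiny sketch of that idea, the following uses SQLite's `ATTACH` as a stand-in for a real virtualization engine. The database files, table schemas, and business names are all invented for illustration; the point is only that the analytic query joins two "source systems" in place, with no extract step and no copied tables.

```python
import os
import sqlite3
import tempfile

# Two "source systems", each a separate database file standing in for a
# real operational store. (Schemas and data invented for illustration.)
tmpdir = tempfile.mkdtemp()
sales_path = os.path.join(tmpdir, "sales.db")
crm_path = os.path.join(tmpdir, "crm.db")

with sqlite3.connect(sales_path) as sales:
    sales.execute("CREATE TABLE invoices (id INTEGER, customer_id INTEGER, amount REAL)")
    sales.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                      [(1, 10, 250.0), (2, 11, 99.5), (3, 10, 40.0)])

with sqlite3.connect(crm_path) as crm:
    crm.execute("CREATE TABLE customers (id INTEGER, name TEXT, segment TEXT)")
    crm.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                    [(10, "Acme", "enterprise"), (11, "Zenith", "smb")])

# The "virtual" analytic layer: attach both sources and query across them.
# The data never leaves its home database.
hub = sqlite3.connect(":memory:")
hub.execute("ATTACH DATABASE ? AS sales", (sales_path,))
hub.execute("ATTACH DATABASE ? AS crm", (crm_path,))
rows = hub.execute("""
    SELECT c.name, SUM(i.amount) AS revenue
    FROM sales.invoices AS i
    JOIN crm.customers AS c ON c.id = i.customer_id
    GROUP BY c.name
    ORDER BY revenue DESC
""").fetchall()
# rows == [('Acme', 290.0), ('Zenith', 99.5)]
```

A production virtualization or federation engine does far more than this (pushdown, caching, cross-dialect translation), but the shape is the same: one query surface over data that stays where it lives.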
This is one of the reasons why semantic layers have become so popular. A semantic layer is an intermediate level in a data analytics system that serves as a bridge between the raw data stored in databases and the end-users who interact with the data through reports, dashboards, or data visualization tools. The purpose of the semantic layer is to provide a consistent, user-friendly, and business-oriented representation of the data, making it easier for non-technical users to understand and work with the information.
The semantic layer helps with our problem of context by providing a consistent, business-centric view of the information which can be as rich as necessary, without moving data. This also simplifies data access: with a semantic layer in place, end-users often don’t need to know SQL or other query languages to access and analyze the data. They can use simple, intuitive interfaces to interact with the data, making it more accessible and reducing the need for extensive data preparation.
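To make that concrete, here is a toy sketch of the mechanism. Every metric name, column, and SQL fragment below is invented for illustration; real semantic layers are far richer, but the core move is the same: business-friendly names are mapped once, centrally, to the physical SQL that implements them, so the end-user never writes the query.

```python
# A toy semantic layer: central, shared definitions of business terms.
# All metric, dimension, and table names are hypothetical.
METRICS = {
    "revenue": "SUM(invoices.amount)",
    "invoice count": "COUNT(invoices.id)",
}
DIMENSIONS = {
    "customer": "customers.name",
    "segment": "customers.segment",
}
BASE_QUERY = "FROM invoices JOIN customers ON customers.id = invoices.customer_id"

def build_query(metric: str, by: str) -> str:
    """Translate a business-level request like ('revenue', by='segment')
    into the SQL the user would otherwise have to write by hand."""
    expr = METRICS[metric]
    dim = DIMENSIONS[by]
    return (
        f"SELECT {dim} AS {by.replace(' ', '_')}, "
        f"{expr} AS {metric.replace(' ', '_')} "
        f"{BASE_QUERY} GROUP BY {dim}"
    )

sql = build_query("revenue", by="segment")
```

Because "revenue" is defined in exactly one place, every report and dashboard that asks for it gets the same calculation, which is where the consistency benefit discussed below comes from.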
It’s also easier to ensure consistency with a semantic layer. I am bored of the over-simplified aim of a single version of the truth. Life and business are not so straightforward. But by establishing a single, unified view of the data, the semantic layer ensures that all users share a consistent understanding of the information. This helps prevent the discrepancies and misunderstandings that arise from divergent interpretations.
A well-designed semantic layer works best in conjunction with other strategies, such as virtualization or hybrid analytics systems, to provide a comprehensive approach to managing and analyzing data with minimal preparation and transformation.
Of course there are challenges: security, accessibility, performance and synchronisation all require their own work. But virtual and hybrid analytic systems offer a promising solution to the age-old problem of extensive data preparation and ETL. By allowing data to remain in its original context, we can preserve its semantic richness and reduce the time and effort spent on data preparation.
Semantic layers are not an answer to all your problems of data modelling, accessibility and consistency: of course not. But you will not spend 80% of your time finding that out.