Carmel Eve

Talk

Ever wondered what exactly a Data Lakehouse is and why it's revolutionising how we interact with data?

After returning from a career break in 2021, Carmel Eve discovered the data landscape had transformed beyond recognition. In this video, Carmel will guide you through the evolution of data architecture and explain why Lakehouses might be the solution you've been looking for.

🎯 What You'll Learn:

  • The fundamental differences between Data Warehouses, Data Lakes, and Data Lakehouses
  • Why ACID compliance matters for your data architecture
  • How Microsoft Fabric leverages Delta Lake for optimised querying
  • The real-world trade-offs between storage costs and performance
  • Whether a Lakehouse architecture is right for your organisation

📚 In This Video:

  • 00:00 Introduction to Data Lakehouse
  • 01:06 Evolution of Data Storage: From Warehouses to Lakes
  • 03:10 Challenges with Data Warehouses
  • 04:11 Rise of Data Lakes
  • 05:43 Limitations of Data Lakes
  • 08:00 Introduction to Data Lakehouse Architecture
  • 09:11 Microsoft Fabric and Future Prospects
  • 10:58 Conclusion and Contact Information

💡 Key Takeaways:

Data Lakehouses combine the best of both worlds - the low-cost, flexible storage of data lakes with the management features and performance of data warehouses. They support ACID transactions, schema enforcement, and direct BI querying whilst handling structured, unstructured, and semi-structured data.

Transcript

Hello. Welcome to this video in which I'm going to try and answer the question: what exactly is a data lakehouse? A lot has changed in the data space in the last few years, and I am acutely aware of this, having left for a career break in mid-2021 and returning to a data landscape that is almost unrecognizable.

One of the main developments has been the wider adoption of data lakehouses. In this video, I'm going to give a brief introduction to what they are, their functionality, and what they might mean for you. So let's dive in.

The Evolution of Data Architecture

In this constantly changing world of data, it can be difficult to keep track of the latest developments and trends and how they differ from what came before. The data lakehouse architecture is built on the back of years of developments in the data space to create a new centralized approach that I think could revolutionize how we interact with data. I often find myself scrolling past the history sections in technical articles, but in this case, I believe that it is integral to understanding lakehouses and the huge amount of value that they can bring to a data environment.

Data Warehouses: The Foundation

Originally, when organizations first started exploring their data and the value they could extract from it, they opted for a data warehouse architecture. This meant that all of their operational data was moved from live, often on-premises, databases into a relational (usually SQL-based) store, which was then queried directly to create reports and extract insights.

The main advantages of this architecture were, and still are: some level of data consistency is ensured by tables with defined schemas, and reporting queries can be designed to be performant and optimized, leveraging the database's tabular nature (though it is worth noting that even denormalized star schemas can require a lot of joins).

ACID Compliance in Data Warehouses

They also support ACID-compliant operations. ACID compliance means that the database has the following characteristics: Atomicity means that each transaction happens as a single unit. This means that if you have an action that includes multiple transformations, then either the entire action will succeed or the whole thing will fail, and the database will be completely unchanged.

This ensures that your database will never get into a state where, for example, in a data copy operation, the data is removed from the original location but not yet added to the new one. Consistency means that the tables can only be modified according to well-defined rules. For example, added data must follow the correct schema, or a certain value can't be negative at the end of a transaction.

Isolation means that even if users are concurrently modifying a database, their transactions won't interfere with one another. In practice, this often means that even though two things happen concurrently, the database will act as though one of them happened first. One way this is achieved is by temporary locks and queues.

Durability means that if a transaction succeeds, its effects will be saved, even in the case of unexpected full system failure.
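To make atomicity concrete, here is a minimal sketch using SQLite as a stand-in for any ACID-compliant database; the table names and data are purely illustrative.

```python
import sqlite3

# Illustrative "data copy" operation: move rows from a staging table to a
# warehouse table as a single atomic transaction.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE TABLE warehouse (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("INSERT INTO staging VALUES (1, 'order-1'), (2, 'order-2')")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on any exception
        conn.execute("INSERT INTO warehouse SELECT id, value FROM staging")
        conn.execute("DELETE FROM staging")
        # If anything fails before this point, neither statement takes effect,
        # so the data can never be "removed but not yet added".
except sqlite3.Error:
    pass  # the database is left exactly as it was before the transaction started

print(conn.execute("SELECT COUNT(*) FROM warehouse").fetchone())  # (2,)
```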

Data Warehouse Limitations

However, there were also some major disadvantages to a data warehouse. The storage and compute are tightly coupled, which means that the architecture is really inflexible. Data storage is often expensive, which means that as data volumes increase, storage costs can skyrocket.

It is also often difficult to scale storage and compute independently, so you're forced to scale up compute at the same time as storage and vice versa. You often have to provision for the peak load, meaning that you always have to pay for the maximum compute even when you're not using it. They're also generally unsuitable for storing unstructured data like images or JSON.

And you also generally have to pay for the data warehouse software, which tends to be licensed per CPU, and this can end up very expensive.

The Rise of Data Lakes

As the world of data started to change, data volumes massively increased and unstructured data became much more the norm. People started to turn to a new storage option: the data lake. Data lakes are able to cope with storage of structured, unstructured, and semi-structured data all in the same place. The first widely used data lake was HDFS, the Hadoop Distributed File System, but there are now multiple offerings to choose from: Amazon S3, Azure Data Lake Storage, and many more.

Data Lake Advantages

Data lakes are built to allow low-cost storage of large volumes of mixed data types. The storage itself is usually lower cost than a data warehouse, but additionally, the storage and compute are decoupled. This means that you only pay a standing charge for the storage consumed, and it means that if your storage needs increase, you don't also need to pay for increased compute power.

And likewise, if you need more processing power, you don't have to scale up storage capacity at the same time. They also have the advantage of intrinsic support for open file formats—formats that follow a standard specification and can be implemented by anyone, such as Parquet. These are often used in machine learning models and data science.

Through frameworks such as Spark, data lakes also support a familiar file-type API, which allows data to be queried like a local file system, with a folder structure and a corresponding hierarchy of files.
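As a minimal sketch of that file-type API (assuming a Spark environment; the path and the "region" column are hypothetical):

```python
from pyspark.sql import SparkSession

# Spark treats a folder hierarchy of Parquet files much like a local file system.
# The path below is illustrative; in practice it would be an abfss://, s3://,
# or similar URI pointing at the lake.
spark = SparkSession.builder.appName("lake-sketch").getOrCreate()

# Read every Parquet file under the sales/ folder, including partition subfolders.
sales = spark.read.parquet("/lake/raw/sales/")

sales.printSchema()                     # schema comes from the Parquet file footers
sales.groupBy("region").count().show()  # "region" is an assumed column name
```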

Data Lake Challenges

Because of the nature of a data lake and the support for unstructured data, architectures adopt a "schema on read" approach. This means that the structure of the data isn't defined until the data is read. The implication of this is that there is no built-in way to enforce data integrity or quality, which means that keeping the data consistent can, to be honest, be an absolute nightmare.
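A small sketch of what schema on read means in practice, again with hypothetical file paths and contents:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# "Schema on read": nothing validates these files when they land in the lake.
# Suppose two hypothetical producers have written incompatible records to the
# same folder:
#   /lake/raw/events/a.json -> {"user_id": 1,   "amount": 9.99}
#   /lake/raw/events/b.json -> {"user_id": "x", "amnt": "ten"}

events = spark.read.json("/lake/raw/events/")  # the schema is only inferred now, at read time

# Both files load without error; the mismatched column name and types simply show
# up as extra nullable columns, which is exactly the consistency problem described above.
events.printSchema()
```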

Another issue with data lakes is that though they are great for certain types of querying, like ML models and data science, they aren't optimized for standard business intelligence querying. Many don't intrinsically support SQL language querying, which is used for most business reporting, and those that do often can't offer ACID transactions or meet the performance of a data warehouse architecture.

The Two-Tier Architecture Approach

To combat some of these issues and support more optimized queries and concurrency, a lot of people adopted a two-tier architecture: the operational data, which would previously have gone straight into a data warehouse, was first ingested into a data lake where large volumes of data could be stored. This data would then be transformed, and the resulting projections put into a data warehouse, which was used to power business analytics platforms. This was known as an ELT (Extract, Load, and Transform) approach, where you load the data into the system as-is and then transform it to your needs.

Two-Tier Architecture Trade-offs

This did have the advantage of introducing some level of consistency into the platform, and it achieved a key aim of being able to support business insights in a performant way. However, it also had its disadvantages. The two-tier data architecture is complex, which means significant, costly engineering effort.

There is constant involvement needed to keep the system up to date and running smoothly, and a high chance of introducing bugs. Even when nothing changed, the extra moving parts introduced further points of failure into the system which needed to be monitored and kept healthy. Keeping the data consistent between the two (and sometimes three) systems is difficult and expensive, and it leads to additional data staleness, with analytics often being carried out over older, out-of-date data. And there is still limited support for data science workloads: people and organizations need to choose whether to run over the data warehouse, which is not optimized for these kinds of use cases, or over the data lake, losing data management features such as ACID transactions, versioning, and indexing.

And finally, most of your processing time and compute cost is spent dealing with the movement of data.

The Data Lakehouse Solution

In response to these challenges, a new architecture began to emerge: the data lakehouse. This can be thought of as a combination of a data warehouse and a data lake. With a lakehouse, you get the low-cost support for mixed data types while still being able to access the data management features of a data warehouse.

Core Features of Data Lakehouses

At the core of lakehouses are the following:

  • ACID transactions
  • Schema enforcement and governance, including audit and discoverability
  • BI support directly over the data, to reduce data staleness
  • Scalability and concurrency
  • Support for open data formats, which can support machine learning and data science workloads
  • Support for structured, unstructured, and semi-structured data types
  • Support for a variety of workloads, like BI, ML, data science, and data streaming, which allows the processing of real-time data

And at the core of all of this is the need for these features to be supported without losing the performance capabilities of a classic data warehouse approach. With the invention of Delta Lake and other open table formats such as Hudi and Iceberg, this dream is closer than ever.
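As a rough illustration of what an open table format adds on top of a plain lake, here is a minimal Delta Lake sketch. It assumes a Spark environment with the Delta Lake (delta-spark) package on the classpath; the paths, table, and column names are illustrative.

```python
from pyspark.sql import SparkSession

# Enable Delta Lake support on the Spark session (requires the delta-spark
# package / jars to be available).
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame([(1, "GBP", 9.99)], ["id", "currency", "amount"])

# Each write is an ACID transaction over ordinary Parquet files plus a transaction log.
orders.write.format("delta").mode("overwrite").save("/lake/tables/orders")

# Schema enforcement: appending a frame whose columns don't match the table's
# schema is rejected, unlike a plain lake folder where anything can land.
bad = spark.createDataFrame([(2, "oops")], ["id", "note"])
try:
    bad.write.format("delta").mode("append").save("/lake/tables/orders")
except Exception as e:
    print("Rejected by schema enforcement:", type(e).__name__)

# BI-style querying directly over the same files, with no separate warehouse copy.
spark.read.format("delta").load("/lake/tables/orders").groupBy("currency").sum("amount").show()
```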

Microsoft Fabric and Modern Implementation

Microsoft Fabric was announced in 2023 and provides intrinsic support for lakehouses, leveraging Delta Lake to support optimized querying for both BI and advanced analytics workloads. Since its announcement, and its general availability in November 2023, features have been continually added that open up new and exciting possibilities for users to unlock the power of their data.

Continuous performance improvements are being made to bring query speed over these mixed data types ever closer to that of a relational database, but we are not quite there yet. So if you do have a relatively small amount of purely structured data which is fuelling reporting, a data warehouse remains a good storage option.

Current Challenges and Considerations

And as with any new idea or architecture, there is also a big learning curve. When thinking about migration, it can be hard to work out if or how to leverage new technologies such as Fabric to best assist your organizational needs. An issue that we at Endjin have found with Microsoft Fabric specifically is that it recouples your storage and compute costs.

This is not so much an issue with the technology, but more with the commercial model that Microsoft has chosen. We hope to see support in the future for a serverless, pay-as-you-go compute model as these offerings progress. This would allow small- and medium-enterprise organizations to have a risk-free jumping-off point for exploration.

Also, Microsoft Fabric is relatively new, which means that users may find that some of the tooling they're used to is not quite ready for general use. A good example of this at the time of recording is the Git integration support. However, the Microsoft team is working hard and listening to user feedback to continually improve the functionality.

And even though there are some limitations, there are already more than enough features to begin exploring.

Conclusion

If you're interested in having a longer conversation about how these technologies, including Microsoft Fabric, can help you and your organization, why not drop us a line at hello@endjin.com?

Thanks for listening.