How does Delta Lake work?
In this video, Carmel Eve explores the game-changing technology that's finally made Data Lakehouses a practical reality for organisations worldwide.
Following on from her introduction to Data Lakehouses, she dives deep into how open table formats like Delta Lake, Apache Iceberg, and Apache Hudi have solved the performance challenges that previously limited adoption.
What You'll Learn
Carmel demonstrates how these innovative metadata layers bridge the gap between traditional data lakes and data warehouses, enabling the capabilities below (illustrated with a short code sketch after the list):
- ACID transactions across multiple files - essential for data consistency
- Schema validation and enforcement - reject non-compliant data automatically
- Time travel and data versioning - create repeatable audit trails
- Unified batch and stream processing - support diverse workload patterns
- SQL-like querying performance - rival traditional databases whilst handling mixed data types
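The sketch below is a minimal illustration of a few of these capabilities using the open source delta-spark Python package; the table path, column names, and session configuration are hypothetical and may differ in your environment.

```python
# Minimal sketch of ACID writes, schema enforcement, and time travel with Delta Lake,
# assuming the delta-spark package and a hypothetical /tmp/events table location.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/events"  # hypothetical table location for this sketch

# ACID write across multiple files: the commit either fully succeeds or is rolled back.
spark.createDataFrame([(1, "signup"), (2, "purchase")], ["id", "event"]) \
    .write.format("delta").mode("append").save(path)

# Schema enforcement: an append with an incompatible schema is rejected.
try:
    spark.createDataFrame([("oops",)], ["wrong_column"]) \
        .write.format("delta").mode("append").save(path)
except Exception as err:
    print(f"Rejected by schema enforcement: {err}")

# Time travel and audit trail: read an earlier version and inspect the table history.
first_version = spark.read.format("delta").option("versionAsOf", 0).load(path)
DeltaTable.forPath(spark, path).history().show()
```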
Key Technical Insights
Discover how open table formats achieve massive performance improvements through:
- Intelligent indexing strategies that eliminate costly table scans
- Multi-tier caching mechanisms for frequently accessed data
- Statistical metadata collection for query optimization
- Z-ordering for multi-dimensional data clustering
- Predictive optimization capabilities in platforms like Databricks Unity Catalog
Why This Matters
For years, organisations have struggled to support both business analytics and data science workloads on the same platform. Carmel explains how this metadata revolution finally enables true convergence - allowing teams to work smarter, not harder, with their data infrastructure. Whether you're architecting a new data platform or optimising an existing one, understanding these open table formats is crucial for modern data engineering success.
Get in Touch
Interested in implementing a Data Lakehouse architecture? Drop us a line at hello@endjin.com to discuss how these technologies can transform your data strategy.
Chapters
- 00:00 Introduction to Data Lakehouses
- 00:37 Challenges with Traditional Data Lakes
- 01:33 Open Table Formats: A Game Changer
- 03:14 Performance Enhancements in Delta Lake
- 04:57 Advanced Data Management Techniques
- 06:10 Conclusion and Future Outlook
Transcript
Hello and welcome. In my last video, I gave a bit of an introduction to Data Lakehouses and how they came to be. In this video, I'm going to focus a bit more on the technology that has led to the wider adoption of this data design pattern. The idea of Data Lakehouses has been around for a while, but they faced huge performance challenges that largely limited the number of people using them.
But with new open table formats such as Hudi, Delta Lake, and Iceberg, that has started to change. One of the main aims of data lakehouses is to be able to serve both business analytics reporting and data science and machine learning workloads in a performant way. One of the issues that data lakes have always faced is that the data that is needed to support analytics can often be spread across loads of different files in different formats, which can be hard to manage.
When faced with complex querying logic in a relational database, it is relatively straightforward to join information across tables. But this is less true when you are storing large volumes of mixed structured, semi-structured, and unstructured data. This means that data lakehouses, when supporting SQL-like querying, face huge performance challenges compared to a relational database.
ACID transactions are also a challenge: if multiple files need to be accessed and modified as part of a single operation, you need to set up complex logging systems. Ideally, we want to be able to treat this non-relational data as though it is relational, while still being able to support performant querying. To combat these challenges, open table formats like Delta Lake were developed to provide a metadata layer between the data itself and whatever query engine you choose to use. You can see here that the metadata layer and APIs sit between the storage and the query APIs. This layer is used to define and optimize how data access is managed.
This allows you to implement ACID transactions, as you can easily tag and group files which need to be changed together. You can apply schema validation and enforcement, rejecting data that doesn't follow a set of rules, which improves data consistency within the system. You can implement data versioning and create repeatable audit trails.
You can enable both stream and batch processing, and you can provide more performant querying over mixed data. Delta Lake collects this metadata automatically when you write to a table, and in this way, lakehouses can support well-managed and performant querying while still being able to use SQL-like business analytics APIs.
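As a rough sketch of the batch and stream point, the example below reuses the Delta-enabled SparkSession (spark) and the hypothetical /tmp/events table from the earlier sketch; the checkpoint path is also hypothetical.

```python
# Batch: query the Delta table with ordinary SQL-style analytics.
spark.read.format("delta").load("/tmp/events").createOrReplaceTempView("events")
spark.sql("SELECT event, COUNT(*) AS occurrences FROM events GROUP BY event").show()

# Stream: the same table can be read incrementally as a streaming source,
# because the transaction log records exactly which files each commit added.
query = (
    spark.readStream.format("delta")
    .load("/tmp/events")
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/events_checkpoint")  # hypothetical path
    .start()
)
```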
These open file formats also allow us to better support optimizations for the declarative APIs that are used for machine learning and data science workloads like Spark. These declarative APIs are quite different from classical imperative APIs where you just provide a list of commands that the system will execute.
Instead, you provide a state that you want the system to create, and you're not concerned about the exact steps involved in achieving that state. For example, I want to combine two datasets within this range by matching these values and removing any records for which a certain statement is true. In this case, the optimization layer is able to decide how best to execute this query in a performant way.
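The sketch below shows what that declarative style looks like in Spark's DataFrame API; the orders and customers tables, their columns, and the date range are hypothetical, and it assumes the same Delta-enabled SparkSession as the earlier sketches.

```python
from pyspark.sql import functions as F

# Hypothetical Delta tables used purely for illustration.
orders = spark.read.format("delta").load("/tmp/orders")
customers = spark.read.format("delta").load("/tmp/customers")

# Declare the result we want: orders in a date range, matched to customers,
# with cancelled records removed. We don't spell out the execution steps;
# Spark's optimiser chooses the join strategy, ordering, and file pruning.
result = (
    orders
    .filter(F.col("order_date").between("2024-01-01", "2024-06-30"))
    .join(customers, on="customer_id", how="inner")
    .filter(F.col("status") != "cancelled")
)

result.explain()  # inspect the physical plan the optimiser settled on
```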
The performance enhancements that the metadata provides are achieved using multiple techniques. The first of these is indexing. Enabling indexing of data can have a massive impact on performance as it allows you to build a data structure which supports faster access. In Delta Lake, this means creating a reference table, which allows lookup of data based on specific columns.
This more targeted approach to data retrieval can have a massive effect on performance, especially compared to the table scan approach, which is often used in data lakes, especially when talking about large data volumes. Next, we have caching. In Delta Lake, for example, data is cached on disk by the metadata layer, which greatly speeds up read access.
Accessed files are automatically cached in the storage of the local processing node, and copies are made of any files that are retrieved from a remote location. It also automatically monitors for changes, meaning that file consistency can still be maintained across the copies.
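The automatic disk caching described here is a feature of the Databricks runtime rather than of the open source Delta Lake format; as a sketch, on Databricks it can be toggled with the Spark configuration setting below (the table path is hypothetical).

```python
# On Databricks Runtime, enable the local disk cache so that remote Delta/Parquet
# files are copied to the worker's local storage the first time they are read.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

df = spark.read.format("delta").load("/tmp/events")  # hypothetical table
df.count()  # first read populates the local cache
df.count()  # subsequent reads are served from the cached copies
```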
As well as caching the data itself, supplementary metadata is often used to speed up data access. For example, if you store statistics about the range of values in certain columns, then when data within a range needs to be accessed, like a date range or payments above a certain value, the relevant data can be located quickly.
It can also let you skip large amounts of data if values are known not to be present. For example, if you're performing a merge on values between one and ten, and you store metadata about the range of values in each file, you can discount entire files that fall outside that range without needing to process them.
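As a sketch of this data skipping in practice, assuming the same Delta-enabled session and a hypothetical payments table: Delta records per-file column statistics on write, and a selective filter lets whole files be discarded before any rows are read. The table property shown for tuning how many columns get statistics is delta.dataSkippingNumIndexedCols; the value used here is illustrative.

```python
# Hypothetical payments table; statistics were collected when it was written.
payments = spark.read.format("delta").load("/tmp/payments")

# Files whose recorded min/max for amount or payment_date cannot contain
# matching rows are skipped entirely rather than scanned.
payments.filter("amount > 10000 AND payment_date >= '2024-01-01'").show()

# Statistics are collected for the first 32 columns by default; this table
# property changes how many columns are indexed (the value is illustrative).
spark.sql("""
    ALTER TABLE delta.`/tmp/payments`
    SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '8')
""")
```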
The physical location of data within your storage can also have a huge impact on performance. In most cases, you will have regularly accessed, or hot, data and less frequently accessed, cold, data. Open table formats increase performance by grouping data with similar access patterns together. And even beyond that, they also group data that is regularly accessed at the same time, including across multiple dimensions.
A lot of this metadata is collected automatically when you write to the data lake, but it is configurable should you choose. A good example of this is defining a specific Z-ordering for columns. This is useful because if you have data that is often grouped by more than one column, like age and gender, smarter grouping is needed to support performant querying.
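A minimal sketch of configuring that explicitly, assuming Delta Lake 2.0+ with the Python DeltaTable API, the same SparkSession as before, and a hypothetical people table with age and gender columns:

```python
from delta.tables import DeltaTable

people = DeltaTable.forPath(spark, "/tmp/people")  # hypothetical table

# Rewrite the data files so that rows with similar (age, gender) values are
# co-located, letting queries that filter on both columns skip more files.
people.optimize().executeZOrderBy("age", "gender")
```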
Additionally, there are account-wide settings such as predictive optimization for Unity Catalog in Databricks, which means that instead of just collecting statistics for the first 32 columns in your schema, these statistics will be collected more intelligently. Overall, Delta Lake and other open table formats provide powerful metadata layers that sit in between your storage and query APIs.
This gives you a huge amount of management functionality and performance improvement compared to a traditional data lake, even when you're just using out-of-the-box functionality. These performance improvements have enabled greater uptake of lakehouses in general and brought this data design pattern to the point where it is being adopted throughout many data-driven organizations.
This exciting step towards a unified data strategy is one that we've been waiting a long time for here at endjin. If you want to talk to us about how these new ideas and technologies can help you, why not drop us a line at hello@endjin.com? Thanks for listening.