DuckLake in Practice: Hands-On Tutorial and Core Features

TL;DR:
This hands-on tutorial demonstrates DuckLake's practical capabilities through step-by-step examples. We'll install DuckLake, create schemas and tables, perform CRUD operations, explore file organization, and use advanced features like snapshots and time travel. The tutorial showcases how DuckLake's database-backed metadata approach simplifies lakehouse operations while providing enterprise-grade capabilities like ACID transactions and automatic conflict resolution.
From Theory to Practice
In Part 1, we explored DuckLake's architectural philosophy and its elegant solution to lakehouse complexity. Now it's time to get our hands dirty and see how these concepts translate into practice.
This tutorial will walk you through DuckLake's core functionality, from installation to advanced features like time travel. Whether you're evaluating DuckLake for your organization or simply curious about how it works, this hands-on exploration will give you the practical foundation you need.
Code Availability: All code examples in this tutorial are available in the example_3_ducklake_basics.ipynb
Jupyter notebook in our endjin-duckdb-examples GitHub repository.
Prerequisites and Setup
Before we begin, you'll need:
- Python
- DuckDB Python package (version 1.3.0 or later; a quick way to check your installed version is shown below)
- Basic familiarity with SQL
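If you're unsure which DuckDB version you have, you can check it from Python before continuing:
import duckdb
# DuckLake needs DuckDB 1.3.0 or later
print(duckdb.__version__)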
Getting Started: Installation and Basic Setup
Let's start by importing the packages we need and deciding where we want to persist the metadata and data files.
import duckdb
from pathlib import Path
# Define our working directories
DUCKLAKE_FOLDER = Path("../ducklake_basic")
ducklake_metadata = DUCKLAKE_FOLDER / "metadata"
ducklake_files = DUCKLAKE_FOLDER / "data_files"
Installing the DuckLake Extension
The first step is installing the DuckLake extension. DuckDB's extension system makes this remarkably simple:
duckdb.sql("INSTALL ducklake")
That's it! The extension is now available for use. This single command gives you access to DuckLake's full functionality.
Understanding DuckLake's Architecture Choices
DuckLake requires you to make two fundamental technology choices:
Metadata Backend Options:
- DuckDB (for local development and small-scale deployments)
- SQLite (lightweight, single-file database)
- PostgreSQL (enterprise-scale with full ACID compliance)
- MySQL (widely deployed, mature ecosystem)
Storage Backend Options:
- Local filesystem (development and small deployments)
- Amazon S3 (cloud-scale object storage)
- Azure Blob Storage (Microsoft cloud ecosystem)
- Google Cloud Storage (Google cloud ecosystem)
For this tutorial we're using DuckDB as our metadata backend and the local filesystem for data storage — the simplest possible configuration that still demonstrates all of DuckLake's capabilities.
Creating Your First DuckLake Instance
Now let's create and connect to our DuckLake instance:
duckdb.sql(f"""
ATTACH 'ducklake:{ducklake_metadata}' AS ducklake_basic_db (DATA_PATH '{ducklake_files}');
USE ducklake_basic_db;
""")
This command does several important things:
- Creates a DuckLake instance with metadata stored at the path we specified (a DuckDB database file)
- Sets up the data files location for Parquet storage
- Attaches DuckLake as a database called ducklake_basic_db
- Switches our context to use this database
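A quick sanity check at this point (standard DuckDB, nothing DuckLake-specific) is to list the attached databases and confirm ducklake_basic_db is among them:
# ducklake_basic_db should appear alongside the default in-memory database
duckdb.sql("SHOW DATABASES;")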
Building Your First Schema
DuckLake supports multiple schemas within a single database, enabling logical organization of related tables. Let's create a schema for a simple retail analytics use case:
duckdb.sql("""
CREATE SCHEMA IF NOT EXISTS retail_sales;
USE retail_sales;
""")
This creates a schema called retail_sales and switches our context to work within it.
Creating Tables and Understanding Constraints
Let's create our first table. DuckLake currently has some limitations around constraints that are worth understanding:
Constraint Support: DuckLake currently supports only NOT NULL constraints. PRIMARY KEY, FOREIGN KEY, UNIQUE, and CHECK constraints are not yet implemented, though this may change as the format matures.
duckdb.sql("""
CREATE TABLE IF NOT EXISTS customer (
customer_id INTEGER NOT NULL,
first_name VARCHAR NOT NULL,
last_name VARCHAR NOT NULL,
date_joined DATE NOT NULL
);
""")
Basic Data Operations
Now let's perform some basic CRUD (Create, Read, Update, Delete) operations to see how DuckLake handles data changes.
Inserting Data
duckdb.sql("""
INSERT INTO customer (customer_id, first_name, last_name, date_joined) VALUES
(1, 'Jane', 'Dunbar', '2023-01-11'),
(2, 'Jimmy', 'Smith', '2024-08-26'),
(3, 'Alice', 'Johnston', '2023-05-05');
""")
Reading Data
duckdb.sql("""
SELECT * FROM customer;
""")
┌─────────────┬────────────┬───────────┬─────────────┐
│ customer_id │ first_name │ last_name │ date_joined │
│ int32 │ varchar │ varchar │ date │
├─────────────┼────────────┼───────────┼─────────────┤
│ 1 │ Jane │ Dunbar │ 2023-01-11 │
│ 2 │ Jimmy │ Smith │ 2024-08-26 │
│ 3 │ Alice │ Johnston │ 2023-05-05 │
└─────────────┴────────────┴───────────┴─────────────┘
Exploring File Organization
One of DuckLake's distinctive features is its file organization approach. Let's examine how it stores data behind the scenes:
duckdb.sql(f"""
FROM glob('{ducklake_files}/*');
""")
┌────────────────────────────────────────────────────────────────────────────────────┐
│ file │
│ varchar │
├────────────────────────────────────────────────────────────────────────────────────┤
│ ../ducklake_basic/data_files/ducklake-01978495-e183-7763-bf40-d9a85988eed9.parquet │
└────────────────────────────────────────────────────────────────────────────────────┘
Notice the flat file structure with UUID-based naming. This is quite different from hierarchical folder structures used by other lakehouse formats. Let's examine the actual content:
# Extract the parquet file name
parquet_file = duckdb.sql(f"""
SELECT file FROM glob('{ducklake_files}/*.parquet') LIMIT 1;
""").fetchone()[0]
# Inspect the file contents
duckdb.sql(f"""
SELECT * FROM read_parquet('{parquet_file}');
""")
┌─────────────┬────────────┬───────────┬─────────────┐
│ customer_id │ first_name │ last_name │ date_joined │
│ int32 │ varchar │ varchar │ date │
├─────────────┼────────────┼───────────┼─────────────┤
│ 1 │ Jane │ Dunbar │ 2023-01-11 │
│ 2 │ Jimmy │ Smith │ 2024-08-26 │
│ 3 │ Alice │ Johnston │ 2023-05-05 │
└─────────────┴────────────┴───────────┴─────────────┘
The data is stored as standard Parquet files, making it fully compatible with other tools and systems that can read Parquet format.
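To illustrate the point, the same file can be read outside DuckDB entirely, for example with pyarrow (assuming the pyarrow package is installed; any Parquet reader would do):
import pyarrow.parquet as pq
# The same data file read with a completely independent Parquet implementation
table = pq.read_table(parquet_file)
print(table.schema)
print(table.num_rows, "rows")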
Understanding DuckLake Snapshots
Every operation in DuckLake creates a snapshot — a point-in-time view of the database state. Let's explore how this works:
duckdb.sql("""
SELECT * FROM ducklake_snapshots('ducklake_basic_db');
""")
┌─────────────┬────────────────────────────┬────────────────┬─────────────────────────────────────────────────────────────────────────┐
│ snapshot_id │ snapshot_time │ schema_version │ changes │
│ int64 │ timestamp with time zone │ int64 │ map(varchar, varchar[]) │
├─────────────┼────────────────────────────┼────────────────┼─────────────────────────────────────────────────────────────────────────┤
│ 0 │ 2025-06-18 19:48:24.41+00 │ 0 │ {schemas_created=[main]} │
│ 1 │ 2025-06-18 19:48:24.489+00 │ 1 │ {schemas_created=[retail_sales]} │
│ 2 │ 2025-06-18 19:48:24.533+00 │ 2 │ {tables_created=[retail_sales.customer]} │
│ 3 │ 2025-06-18 19:48:24.554+00 │ 2 │ {tables_inserted_into=[2]} │
└─────────────┴────────────────────────────┴────────────────┴─────────────────────────────────────────────────────────────────────────┘
This shows the complete history of our DuckLake instance:
- Snapshot 0: Default main schema created
- Snapshot 1: Our retail_sales schema created
- Snapshot 2: Customer table created
- Snapshot 3: Data inserted into the table
Data Modification and Delete Operations
Let's explore how DuckLake handles updates and deletes:
Updating Data
duckdb.sql("""
UPDATE customer SET first_name = 'Alice', last_name = 'Fraser' WHERE customer_id = 3;
""")
Deleting Data
duckdb.sql("""
DELETE FROM customer WHERE customer_id = 2;
""")
duckdb.sql("""
SELECT * FROM customer;
""")
┌─────────────┬────────────┬───────────┬─────────────┐
│ customer_id │ first_name │ last_name │ date_joined │
│ int32 │ varchar │ varchar │ date │
├─────────────┼────────────┼───────────┼─────────────┤
│ 1 │ Jane │ Dunbar │ 2023-01-11 │
│ 3 │ Alice │ Fraser │ 2023-05-05 │
└─────────────┴────────────┴───────────┴─────────────┘
How DuckLake Handles Deletes
DuckLake uses a sophisticated approach for delete operations. Let's examine how deletes are stored:
duckdb.sql(f"""
FROM glob('{ducklake_files}/*-delete.parquet');
""")
┌───────────────────────────────────────────────────────────────────────────────────────────┐
│ file │
│ varchar │
├───────────────────────────────────────────────────────────────────────────────────────────┤
│ ../ducklake_basic/data_files/ducklake-01978495-e274-7dde-9587-ea01a9697457-delete.parquet │
└───────────────────────────────────────────────────────────────────────────────────────────┘
DuckLake creates separate "delete files" that reference the original data files and specify which rows should be considered deleted:
# Extract the delete file name
delete_file = duckdb.sql(f"""
SELECT file FROM glob('{ducklake_files}/*delete.parquet') LIMIT 1;
""").fetchone()[0]
# Inspect the delete file contents
duckdb.sql(f"""
SELECT * FROM read_parquet('{delete_file}');
""")
┌────────────────────────────────────────────────────────────────────────────────────┬───────┐
│ file_path │ pos │
│ varchar │ int64 │
├────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ ../ducklake_basic/data_files/ducklake-01978495-e183-7763-bf40-d9a85988eed9.parquet │ 1 │
└────────────────────────────────────────────────────────────────────────────────────┴───────┘
This shows that row position 1 (Jimmy Smith) in the original Parquet file should be considered deleted. This approach allows for efficient deletes without rewriting large data files.
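If you're curious how a reader stitches the two files together, here's a rough sketch of the merge-on-read idea in plain DuckDB SQL, using read_parquet's file_row_number option to expose row positions. This is illustrative only; DuckLake applies delete files for you automatically on every query.
# Illustrative only: read the data file with row positions,
# then drop any position listed in the delete file
duckdb.sql(f"""
SELECT customer_id, first_name, last_name, date_joined
FROM read_parquet('{parquet_file}', file_row_number = true)
WHERE file_row_number NOT IN (
    SELECT pos FROM read_parquet('{delete_file}')
);
""")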
Time Travel: Querying Historical Data
One of DuckLake's most powerful features is time travel — the ability to query your data as it existed at any previous point in time.
Let's get our current snapshot ID and then query historical states:
# Get the current maximum snapshot ID
max_snapshot_id = duckdb.sql("""
SELECT MAX(snapshot_id) FROM ducklake_snapshots('ducklake_basic_db');
""").fetchone()[0]
print(f"Current snapshot ID: {max_snapshot_id}")
Current snapshot ID: 5
Now let's query the table as it existed before we deleted the customer:
duckdb.sql(f"""
SELECT * FROM customer AT (VERSION => {max_snapshot_id - 1});
""")
┌─────────────┬────────────┬───────────┬─────────────┐
│ customer_id │ first_name │ last_name │ date_joined │
│ int32 │ varchar │ varchar │ date │
├─────────────┼────────────┼───────────┼─────────────┤
│ 1 │ Jane │ Dunbar │ 2023-01-11 │
│ 3 │ Alice │ Fraser │ 2023-05-05 │
└─────────────┴────────────┴───────────┴─────────────┘
We can go even further back to see the table before our update:
duckdb.sql(f"""
SELECT * FROM customer AT (VERSION => {max_snapshot_id - 2});
""")
┌─────────────┬────────────┬───────────┬─────────────┐
│ customer_id │ first_name │ last_name │ date_joined │
│ int32 │ varchar │ varchar │ date │
├─────────────┼────────────┼───────────┼─────────────┤
│ 1 │ Jane │ Dunbar │ 2023-01-11 │
│ 2 │ Jimmy │ Smith │ 2024-08-26 │
│ 3 │ Alice │ Johnston │ 2023-05-05 │
└─────────────┴────────────┴───────────┴─────────────┘
This time travel capability is invaluable for debugging, auditing, and understanding how your data has evolved over time.
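Version numbers aren't the only way in: DuckLake also supports timestamp-based time travel with AT (TIMESTAMP => ...). The timestamp needs to fall within your snapshot history, so here's a sketch that reuses the snapshot_time column we saw in ducklake_snapshots:
# Look up when an earlier snapshot was taken...
snapshot_time = duckdb.sql(f"""
SELECT snapshot_time FROM ducklake_snapshots('ducklake_basic_db')
WHERE snapshot_id = {max_snapshot_id - 2};
""").fetchone()[0]
# ...and query the table as it existed at that moment
duckdb.sql(f"""
SELECT * FROM customer AT (TIMESTAMP => TIMESTAMPTZ '{snapshot_time}');
""")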
Working with Multiple Tables
Let's create a second table to demonstrate multi-table operations:
duckdb.sql("""
CREATE TABLE IF NOT EXISTS orders (
order_id INTEGER NOT NULL,
customer_id INTEGER NOT NULL,
order_date DATE NOT NULL,
product_id INTEGER NOT NULL,
product_name VARCHAR NOT NULL,
amount DECIMAL(10, 2) NOT NULL
);
""")
duckdb.sql("""
INSERT INTO orders (order_id, customer_id, product_id, product_name, order_date, amount) VALUES
(1, 1, 101, 'Widget A', '2023-01-15', 19.50),
(2, 1, 102, 'Widget B', '2023-01-20', 29.99),
(3, 3, 103, 'Widget A', '2023-02-10', 19.50);
""")
Atomic Transactions Across Multiple Tables
One of DuckLake's key advantages over table-level formats is true multi-table transactions. Let's demonstrate this by adding a new customer and their orders in a single atomic operation:
duckdb.sql("""
BEGIN TRANSACTION;
INSERT INTO customer (customer_id, first_name, last_name, date_joined) VALUES
(4, 'Bob', 'Brown', '2023-03-01');
INSERT INTO orders (order_id, customer_id, product_id, product_name, order_date, amount) VALUES
(4, 4, 104, 'Widget B', '2023-03-05', 29.99),
(5, 4, 105, 'Widget C', '2023-02-15', 59.99),
(6, 4, 106, 'Widget A', '2023-01-25', 19.50);
COMMIT;
""")
This entire operation — adding a customer and three orders — appears as a single snapshot in our change history. If any part of the transaction had failed, the entire operation would have been rolled back, ensuring data consistency.
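The flip side is just as important: if any statement fails, or you issue an explicit ROLLBACK, nothing is written and no new snapshot appears. A quick sketch:
duckdb.sql("""
BEGIN TRANSACTION;
INSERT INTO customer (customer_id, first_name, last_name, date_joined) VALUES
    (99, 'Temp', 'Person', '2023-04-01');
ROLLBACK;
""")
# Customer 99 was discarded along with the rolled-back transaction
duckdb.sql("SELECT * FROM customer WHERE customer_id = 99;")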
Change Tracking and Incremental Analysis
DuckLake provides powerful capabilities for tracking changes between snapshots. This is particularly valuable for incremental data processing and change data capture scenarios:
# Refresh the latest snapshot ID to pick up the transaction we just committed
max_snapshot_id = duckdb.sql("""
SELECT MAX(snapshot_id) FROM ducklake_snapshots('ducklake_basic_db');
""").fetchone()[0]
# Get changes to the customer table in the latest snapshot
duckdb.sql(f"""
SELECT * FROM ducklake_table_changes('ducklake_basic_db', 'retail_sales', 'customer', {max_snapshot_id}, {max_snapshot_id})
ORDER BY snapshot_id;
""")
┌─────────────┬───────┬─────────────┬─────────────┬────────────┬───────────┬─────────────┬─────────┐
│ snapshot_id │ rowid │ change_type │ customer_id │ first_name │ last_name │ date_joined │ email │
│ int64 │ int64 │ varchar │ int32 │ varchar │ varchar │ date │ varchar │
├─────────────┼───────┼─────────────┼─────────────┼────────────┼───────────┼─────────────┼─────────┤
│ 9 │ 6 │ insert │ 4 │ Bob │ Brown │ 2023-03-01 │ NULL │
└─────────────┴───────┴─────────────┴─────────────┴────────────┴───────────┴─────────────┴─────────┘
This shows exactly what changed in our latest transaction — a new customer (Bob Brown) was inserted.
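Because ducklake_table_changes takes a start and an end snapshot, the same function also answers "what has happened to this table since snapshot N?", which is the building block for incremental pipelines. For example, replaying the customer table's full history:
# Replay every recorded change to the customer table, oldest first
duckdb.sql(f"""
SELECT snapshot_id, rowid, change_type, customer_id, first_name, last_name
FROM ducklake_table_changes('ducklake_basic_db', 'retail_sales', 'customer', 0, {max_snapshot_id})
ORDER BY snapshot_id, rowid;
""")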
Flexibility in Deployment
While we've used a local setup for this tutorial, DuckLake's architecture allows seamless scaling. The same operations we've performed locally would work identically with:
- PostgreSQL as the metadata backend for enterprise-scale deployments
- S3 or Azure Blob Storage for cloud-native data storage
- Multiple concurrent users across different geographic locations
- A mix of personal devices, edge computing, and cloud infrastructure
The transition between these deployment models requires only changing connection strings — your application code remains unchanged.
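To make that concrete, here's roughly what the attach step looks like with PostgreSQL holding the metadata and S3 holding the Parquet files; everything else in this tutorial stays the same. The host, database and bucket names below are placeholders, and in practice you'd also install the postgres and httpfs extensions and configure an S3 secret first:
# Placeholder configuration: PostgreSQL catalog + S3 object storage.
# Assumes the postgres and httpfs extensions are installed and an S3 secret is configured.
duckdb.sql("""
ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=pg.example.internal' AS ducklake_prod
    (DATA_PATH 's3://example-lakehouse-bucket/data_files/');
USE ducklake_prod;
""")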
Key Takeaways from Our Hands-On Experience
Through this practical exploration, several key advantages of DuckLake become apparent:
Simplicity: Installation and setup require minimal configuration. The API is standard SQL with no format-specific extensions to learn.
Transparency: You can directly inspect both the metadata (standard database tables) and data files (standard Parquet), providing complete visibility into your lakehouse.
Consistency: Multi-table transactions ensure your data maintains integrity across related tables, something that's challenging or impossible with table-level formats.
Performance: Operations that require multiple round trips to object storage in other formats (like checking table existence or coordinating transactions) reduce to single queries against the database-backed metadata store.
Flexibility: The same code works across deployment scales, from laptop to enterprise, with only configuration changes.
Current Limitations to Consider
Our hands-on experience also reveals some current limitations:
Experimental Status: As we noted, DuckLake is still experimental. Some advanced SQL features may not yet be supported.
Constraint Support: The current limitation to NOT NULL constraints may require adjustments to how you approach schema enforcement, but we also expect this to be addressed in future versions of DuckLake.
Tooling Maturity: While the core functionality is solid, ecosystem tools (monitoring, management interfaces, DataOps, etc.) are still developing.
These limitations are important to consider for production deployments, though they don't prevent experimentation and proof-of-concept work.
Looking Ahead
This hands-on exploration demonstrates how DuckLake translates its architectural philosophy into practical benefits. The simplicity of setup, the transparency of operations, and the power of features like time travel and multi-table transactions show why DuckLake represents such a compelling approach to lakehouse architecture.
But we've only scratched the surface of DuckLake's capabilities. In Part 3, we'll dive deeper into DuckLake's advanced capabilities, including its built-in encryption system that enables zero-trust data lakes, its sophisticated conflict resolution mechanisms, and the new business models it enables. We'll also examine DuckLake's competitive positioning against established formats and explore the strategic implications for organizations considering their future data architecture.
Whether you're convinced by what you've seen so far or still have questions about DuckLake's readiness for enterprise deployment, Part 3 will provide the strategic perspective you need to make informed decisions about this emerging technology.