DuckDB in Practice: Enterprise Integration and Architectural Patterns

TL;DR:
DuckDB shines as a building block within larger enterprise data solutions, bridging SQL and DataFrame paradigms while significantly reducing total cost of ownership. Its MIT license and seamless integration make it ideal for embedding in ETL platforms, dashboards, and application backends. When integrated with Microsoft Fabric's Python notebooks, DuckDB provides a smooth path from local development to enterprise deployment. It enables new architectural patterns like modular data processing, embedded analytics, and fits naturally into medallion architecture for data transformations, representing a valuable addition to the modern data engineer's toolkit.
Series Recap
In the first two parts of this series, we've explored the revolutionary potential of in-process analytics and DuckDB:
- Part 1 introduced the concept of the "data singularity" and how DuckDB embodies a shift toward powerful single-node analytics
- Part 2 dove into the technical details that make DuckDB fast and user-friendly, including its columnar architecture, vectorized execution, and SQL enhancements
Now, in this final article, we'll focus on practical applications—how DuckDB fits into real-world enterprise environments, the architectural patterns it enables, and how it integrates with Microsoft Fabric.
Bridging SQL and DataFrame Worlds in a Polyglot Environment
For SQL experts, DuckDB provides all the power and expressiveness of SQL with thoughtful extensions that make it even more productive. For DataFrame users, the seamless integration with DataFrame libraries such as Polars allows them to move between tabular processing paradigms without friction. This versatility makes DuckDB an excellent choice for modern data teams that need the ability to switch between these paradigms as their work demands.
Here's a practical example of this integration, where a set of CSV files is loaded and queried with SQL using DuckDB, then converted into a Polars DataFrame (through a simple .pl() call) for visualization:
import duckdb
import polars as pl
import plotly.express as px

# Use SQL to load data and apply a complex aggregation
regional_summary = duckdb.query("""
    SELECT
        region,
        SUM(sales) AS total_sales,
        COUNT(DISTINCT customer_id) AS customer_count,
        SUM(sales) / COUNT(DISTINCT customer_id) AS sales_per_customer
    FROM read_csv('sales_data*.csv')
    WHERE sale_date >= '2024-01-01'
    GROUP BY region
    ORDER BY total_sales DESC
""").pl()

# Use the summarised Polars DataFrame for visualization
fig = px.bar(
    regional_summary,
    x="region",
    y="sales_per_customer",
)
fig.show()
This hybrid approach allows you to use the right tool for each specific task: DuckDB's powerful CSV parsing and loading, SQL for complex aggregations and joins, and Polars (here with Plotly) for visualization, further feature engineering, and data science workflows. The transitions between paradigms are seamless, with negligible performance overhead thanks to DuckDB's zero-copy integration.
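The integration also runs in the other direction: DuckDB can reference a Polars (or pandas) DataFrame that is already in scope simply by using its variable name in SQL. A minimal sketch, reusing the regional_summary frame from above (the threshold is purely illustrative):
# Query the in-memory Polars DataFrame by name; no loading or copying required
top_regions = duckdb.sql("""
    SELECT region, sales_per_customer
    FROM regional_summary
    WHERE sales_per_customer > 100
    ORDER BY sales_per_customer DESC
""").pl()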
In fact, the rise of specialized in-process tools is creating an interesting ecosystem where practitioners can choose the interface that best matches the problem at hand.
A Valuable Addition to Your Data Engineering Toolkit
DuckDB should not be considered a complete end-to-end solution for all data challenges. It excels at specific tasks and workloads so effectively that it would be remiss not to consider it when those scenarios arise. As data engineers, our toolkit is becoming increasingly capable, with new tools like DuckDB offering more opportunities to create architectures that better fit both functional and non-functional requirements.
DuckDB should be considered as another powerful tool to add to your arsenal — one that will appeal to people from the SQL camp, to whom it should feel both familiar and liberating when applied to solve the right problems. For those comfortable with SQL but frustrated by the overhead of traditional database systems, DuckDB offers a refreshing alternative that marries SQL's expressiveness with modern performance and convenience.
Core to this proposition is Total Cost of Ownership (TCO), which encompasses all costs associated with our data work: from licenses and cloud consumption charges to the effort we invest building, testing, and maintaining data solutions. DuckDB presents a compelling opportunity to reduce TCO and shorten time to value by:
- Eliminating infrastructure setup and maintenance costs
- Reducing cloud compute expenses for appropriate workloads
- Shortening development cycles through faster iteration
- Simplifying testing and deployment processes
- Lowering the barrier to entry for new team members
We encourage you to look at DuckDB not as a replacement for your existing tools, but as a complementary addition that can significantly enhance your capabilities in specific scenarios. By gaining hands-on experience with it, you'll develop a better understanding of how it fits into your wider data platform strategy and architecture, potentially unlocking new efficiencies and approaches that weren't previously feasible.
The Corporate Component: DuckDB as a Building Block
While DuckDB is excellent for individual data analysis, its fastest-growing area of adoption is actually as an embedded "building block" within larger corporate solutions.
This "building block" approach is possible because of DuckDB's MIT license, which allows companies to integrate it freely into their products. For example, Fivetran (a popular ETL solution) embeds DuckDB to power data transformations. Similarly, data visualization tools, ETL platforms, and analytics services increasingly leverage DuckDB's processing capabilities without exposing it as a separate component.
Real-world examples of DuckDB as a building block include:
- Streamlit and other dashboard tools: Running DuckDB in WebAssembly to power interactive visualizations directly in the browser
- ETL platforms: Using DuckDB as an efficient transformation engine between data sources and destinations
- Application backends: Replacing thousands of lines of inefficient parsing and processing code with concise SQL operations
One application engineering team reported replacing 5,000 lines of Java code with just 1,600 lines of DuckDB-powered code, significantly improving performance while reducing the complexity of their codebase.
What makes DuckDB particularly valuable in these scenarios is its seamless integration. Because it runs inside your application environment (such as a Python process), there are no client-server barriers, no passwords, and no connection strings. DuckDB has direct access to all the context and objects in the Python session, enabling a frictionless experience that feels like a natural extension of the environment rather than a separate system.
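As a hedged sketch of what this looks like inside an application backend, the hypothetical function below replaces hand-written CSV parsing and aggregation with a single embedded query (the file layout and column names are assumptions, not taken from a real system):
import duckdb

def monthly_order_totals(csv_path: str) -> list[dict]:
    # One in-process SQL statement stands in for manual parsing,
    # type coercion and group-by logic in application code.
    rows = duckdb.sql(f"""
        SELECT date_trunc('month', order_date) AS month,
               SUM(amount) AS total
        FROM read_csv('{csv_path}')
        GROUP BY ALL
        ORDER BY month
    """).fetchall()
    return [{"month": month, "total": total} for month, total in rows]
There is no connection to open and nothing to configure: the function simply runs SQL against a file path it is handed.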
Rethinking Modern Data Architectures
DuckDB challenges our fundamental concept of what a database is and how it should behave. Traditionally, databases were distinct systems that ran separately from application code, requiring specific setup and maintenance. DuckDB blurs these lines by bringing database capabilities directly into the application process.
This shift enables new architectural patterns:
Modular Data Processing: DuckDB can serve as an engine to solve specific "units" of data wrangling functionality within larger pipelines, handling the steps where its strengths in SQL processing and efficient joins are most valuable.
Embedded Analytics: Applications can include sophisticated analytical capabilities without external dependencies.
New Data Packaging Models: By combining data, schema definitions, constraints, and views (processing logic) in a single file, DuckDB offers new ways to package and distribute analytical assets.
Edge Computing: Analytical processing can be pushed closer to data sources, reducing the need to transfer large volumes to central servers.
Medallion Architecture Integration: DuckDB fits naturally into medallion architecture patterns (Bronze/Silver/Gold or Raw/Enriched/Curated). It is well-suited for the L and T stages of ETL/ELT processes, where its efficient SQL processing and data transformation capabilities shine in data refining processes: for example to process raw data in Bronze to Silver, or to create projections from Silver into Gold.
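To make the medallion example concrete, here is a minimal sketch of a Bronze-to-Silver step, with hypothetical paths and column names, that enforces types, de-duplicates and persists the result as Parquet:
import duckdb

con = duckdb.connect()

# Bronze -> Silver: enforce types, drop duplicates and persist as Parquet
con.sql("""
    COPY (
        SELECT DISTINCT
            CAST(sale_id AS BIGINT) AS sale_id,
            CAST(sale_date AS DATE) AS sale_date,
            upper(trim(region)) AS region,
            CAST(sales AS DOUBLE) AS sales
        FROM read_csv('bronze/sales/*.csv')
        WHERE sale_id IS NOT NULL
    ) TO 'silver/sales.parquet' (FORMAT parquet)
""")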
There are clearly some parallels to draw between a DuckDB database persisted to the file system and a Power BI PBIX (or, more recently, PBIP) file. They each focus on different stages of the end-to-end analytics value chain, but there is some crossover to think about. MotherDuck has developed an open-source connector for Power Query that allows you to push transformation tasks you might normally perform in Power Query upstream to DuckDB. It will be interesting to see whether there are further developments in the connectivity between Power BI and DuckDB in the future!
As we continue exploring modern data architectures, DuckDB will enable us to re-think our approach to integrating data processing into our applications and workflows.
Bridging Local Development and Enterprise Deployment
While DuckDB excels at enabling fast, friction-free development locally on your laptop, this introduces a new challenge: how do you transition from local development to enterprise-grade production environments?
Enterprise deployments require:
- Scheduled automation of processing pipelines
- Monitoring and alerting capabilities
- Secure handling of confidential data
- Integration with downstream systems (reports, dashboards, APIs)
- DTAP (Development, Testing, Acceptance, Production) environments
- Source control and version management
- Scalability beyond a single node when truly needed
Interestingly, DuckDB and Microsoft Fabric's SQL On-Demand endpoint share some similar "schema on read" capabilities for working with files in data lakes. However, they use different SQL dialects: SQL On-Demand follows T-SQL standards, while DuckDB has its own SQL dialect with specific extensions. This difference means porting code directly between the two engines would require some work.
Fortunately, Microsoft seem to have recognised the "data singularity" we introduced in Part 1, listened to the community, and recently introduced new Python Notebooks in Fabric that come pre-installed with DuckDB. This provides a more direct path to production by allowing you to develop locally using DuckDB and then deploy the same DuckDB-powered Python code to a Fabric Python Notebook for scheduled execution, monitoring, and integration with enterprise systems.
This was possible prior to the introduction of the new Python Notebooks by using Fabric's original PySpark notebook experience, but running DuckDB on a Spark cluster only makes use of the compute on the head node, so spinning up a Spark cluster seems anathema to everything DuckDB is trying to achieve. The investment by Microsoft in providing the new Fabric Python Notebooks pre-installed with DuckDB reinforces the growing realisation that Spark isn't a great platform for many "small data" tasks.
This is where platforms like Microsoft Fabric and OneLake aim to provide a bridge. Fabric allows you to leverage the benefits of tools like DuckDB during development and exploration while offering a path to proper industrialization of your solutions. It provides the infrastructure, security, and governance required for enterprise deployment without forcing you to completely rewrite your code when moving from experimentation to production.
Querying files in OneLake from a local DuckDB session
One powerful feature of DuckDB is the ability to develop locally while querying data that is hosted remotely in OneLake.
In this example, we are using a Python environment with the DuckDB package installed. We tend to use Poetry for dependency management:
poetry add duckdb
You will also need the Azure CLI installed in the environment where you are running DuckDB as a Python package. It can be installed on Windows, macOS and Linux, and it can also be run in a Docker container, which is our preferred approach to developing Python apps.
Once Azure CLI is installed, you can use the az login command to establish credentials in your environment that will enable you to connect to OneLake:
az login
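If you want to double-check that the CLI session is in place (and which tenant and subscription it is pointing at) before continuing, you can run:
az account show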
In an interactive Python session such as a Jupyter notebook, you can now follow a few simple steps to set up and then query files on OneLake using DuckDB.
First, establish the Azure Blob File System (ABFS) path for the file(s) you want to query. The following code builds this from known components:
import duckdb
ACCOUNT_NAME = "onelake"
WORKSPACE_NAME = "<Workspace Name>"
LAKEHOUSE_NAME = "<Lakehouse Name>"
FOLDER = "land_registry"
FILE = "pp-2023.csv"
abfss_path = f"abfss://{WORKSPACE_NAME}@{ACCOUNT_NAME}.dfs.fabric.microsoft.com/{LAKEHOUSE_NAME}.Lakehouse/Files/{FOLDER}/{FILE}"
Notes:
- Replace spaces in Workspace, Lakehouse, Folder and File names with %20.
- The FILE variable can be a glob pattern, for example pp-*.csv to query all years in the folder.
- Be aware of the size of the data you are querying; depending on your Internet connection, anything up to 100 MB tends to work well.
Next we set up an "in memory" DuckDB database. You could also choose to persist the data to disk by specifying a local file path in the database parameter.
db = duckdb.connect(database=":memory:", read_only=False)
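The abfss:// support comes from DuckDB's azure extension. Recent versions will normally autoload it on first use, but if your environment does not, you can install and load it explicitly:
db.sql("INSTALL azure;")
db.sql("LOAD azure;")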
We set up a PERSISTENT secret in the database, which tells DuckDB to use the Azure CLI credentials we established in the earlier step:
db.sql("CREATE OR REPLACE PERSISTENT SECRET onelake (TYPE azure, PROVIDER credential_chain, CHAIN 'cli', ACCOUNT_NAME 'onelake');")
For a Linux environment, including Dev Containers, you need to set the following DuckDB configuration option:
db.sql("SET azure_transport_option_type = 'curl'")
Finally, assuming the files are in a OneLake location that you have permissions to access, you can now query the data:
db.sql(
    f"""
    SELECT
        column00 AS 'id',
        column01 AS 'price',
        column02 AS 'date',
        column03 AS 'postcode',
        column04 AS 'property_type',
        column05 AS 'old_new',
        column06 AS 'duration',
        column07 AS 'paon',
        column08 AS 'saon',
        column09 AS 'street',
        column10 AS 'locale',
        column11 AS 'town_city',
        column12 AS 'district',
        column13 AS 'county',
        column14 AS 'ppd_category',
        column15 AS 'record_type',
        year(date) AS 'year_of_sale',
        month(date) AS 'month_of_sale'
    FROM read_csv(
        '{abfss_path}',
        header=False
    );
    """
)
This code sample is available in the endjin-duckdb-examples GitHub repo.
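If you expect to iterate over the same remote data repeatedly, one option (a sketch, with an illustrative local path) is to materialise it once as a local Parquet file in the same session and query that copy from then on:
# Pull the remote CSV down once and cache it locally as Parquet
db.sql(f"COPY (SELECT * FROM read_csv('{abfss_path}', header=False)) TO 'pp-2023.parquet' (FORMAT parquet)")

# Subsequent queries hit the local Parquet file rather than OneLake
db.sql("SELECT COUNT(*) FROM 'pp-2023.parquet'").show()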
Using DuckDB in a Python Notebook on Microsoft Fabric
The same query above could be run in a Microsoft Fabric Notebook. There is no need to take the steps above to install the DuckDB package or authenticate. All of this is taken care of in the background by Fabric. Simply run the code and it works!
This allows you to deploy your DuckDB-based analytics onto Microsoft Fabric and orchestrate that Notebook as part of your end-to-end analytics workloads using Fabric pipelines.
The experience is made even simpler in Fabric thanks to the concept of "pinning" the relevant Lakehouse to your Notebook as its Default Lakehouse. This allows you to use relative paths to your data, for example:
import duckdb

duckdb.sql(
    """
    SELECT year(column02) AS 'year_of_sale', COUNT(*)
    FROM read_csv(
        '/lakehouse/default/Files/land_registry/pp-*.csv',
        header=False
    )
    GROUP BY ALL
    ORDER BY year_of_sale;
    """
)
The query above summarises ~30M rows across 25 CSV files (about 4 GB of data) in ~30 seconds on the current Python Notebook configuration, which is sized at 16 GB RAM and 2 cores. Hopefully, Microsoft will offer the option to use higher-specification machines in the future to meet your performance/price-point sweet spot!
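On a node this size it can also be worth telling DuckDB explicitly how much of the machine to use; a hedged example using DuckDB's standard configuration settings (the exact limits are a judgement call):
# Leave headroom for the notebook kernel on the 16 GB node
duckdb.sql("SET memory_limit = '12GB'")
# Match DuckDB's worker threads to the 2 available cores
duckdb.sql("SET threads = 2")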
As well as raw data such as CSV files, you can also query tables in the Lakehouse:
duckdb.sql(
    """
    SELECT year_of_sale, COUNT(*)
    FROM delta_scan('/lakehouse/default/Tables/land_registry/house_sales')
    GROUP BY ALL
    """
)
This allows you to mix and match DuckDB with other technologies in your Fabric solutions. For example, you may have used a Pipeline connector to extract data from an operational platform and load it into a set of Tables in a Fabric Lakehouse. You could then use DuckDB as the engine to explore and analyse this data, automating the end-to-end process in a Pipeline.
Notes:
- ABFS paths are also supported.
- At the time of writing, writing back to ADLS Gen2 / OneLake using the DuckDB COPY TO command is not implemented. We have raised a GitHub issue requesting this be implemented; if you agree, please vote for the issue!
- At the time of writing, the DuckDB Delta extension supports reading Delta, but it does not support writing.
Conclusion: The Evolution of Data Processing
As we wrap up this examination of DuckDB and the in-process analytics movement, it's worth reflecting on the broader trend we're witnessing: the democratization of data processing. Tools like DuckDB (for the SQL-minded) and Polars (for DataFrame enthusiasts) are removing barriers that previously required specialized infrastructure and expertise, enabling more teams to work efficiently with larger datasets.
The question for many organizations is no longer "How do we scale up to handle our data?" but rather "Do we actually need to scale up at all?" For a surprisingly large percentage of analytical workloads, the answer might be that your modern laptop — equipped with the right tools — is more than capable of handling the job. The real challenge becomes not processing power, but moving these solutions into robust, governed enterprise environments.
DuckDB represents a valuable addition to the modern data toolkit. Not as a replacement for all existing systems, but as a powerful complement that excels at specific tasks. By understanding its strengths, limitations, and integration patterns, you can leverage DuckDB to build more efficient data solutions that reduce complexity and cost while improving developer productivity.
What's your experience with analytical workloads? Have you hit limits with traditional tools that forced you onto more complex infrastructure? Or have you found ways to simplify your architecture using tools like DuckDB? We'd love to hear your thoughts in the comments.