Browse our archives by topic…
Big Data
Scaling API Ingestion with the Queue-of-Work Pattern
The queue-of-work pattern enables massive parallelism for API ingestion by breaking large jobs into thousands of independent work items processed by concurrent workers. For our use case it cut data ingestion time from 15 hours to under 2, with automatic retries and fault tolerance built in, at a fraction of the cost of traditional orchestration tools.
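As a flavour of the pattern, here is a minimal sketch only, not the architecture from the post: the worker count, retry limit, and fetch_page stand-in are all illustrative.

```python
import queue
import threading

# A shared queue of independent work items drained by concurrent workers,
# with a simple retry on failure.
MAX_RETRIES = 3
work: queue.Queue = queue.Queue()

def fetch_page(page: int) -> None:
    """Stand-in for a single API call; raise to simulate a transient failure."""
    print(f"ingested page {page}")

def worker() -> None:
    while True:
        try:
            page, attempts = work.get(block=False)
        except queue.Empty:
            return  # queue drained, this worker is done
        try:
            fetch_page(page)
        except Exception:
            if attempts + 1 < MAX_RETRIES:
                work.put((page, attempts + 1))  # re-queue for another attempt
        finally:
            work.task_done()

# Break one large job into thousands of independent work items...
for page in range(10_000):
    work.put((page, 0))

# ...then drain the queue with a pool of concurrent workers.
threads = [threading.Thread(target=worker) for _ in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```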
Polars Workloads on Microsoft Fabric
Polars now ships inside Microsoft Fabric by default. Here's how to use it alongside Fabric's other analytics tools and what that means for your data workflows.
Practical Polars: Code Examples for Everyday Data Tasks
Unlock Python Polars with this hands-on guide featuring practical code examples for data loading, cleaning, transformation, aggregation, and advanced operations that you can apply to your own data analysis projects.
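For a taste of the kind of examples the guide covers, a hypothetical sales.csv and its columns are assumed here:

```python
import polars as pl

# Everyday tasks in one chain; assumes a sales.csv with region,
# amount, and date columns.
df = pl.read_csv("sales.csv", try_parse_dates=True)     # loading

summary = (
    df.drop_nulls(subset=["amount"])                    # cleaning
      .with_columns(pl.col("amount").cast(pl.Float64))  # transformation
      .group_by("region")                               # aggregation
      .agg(pl.col("amount").sum().alias("total_sales"))
      .sort("total_sales", descending=True)
)
print(summary)
```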
Under the Hood: What Makes Polars So Scalable and Fast?
Polars gets its speed from a strict type system, lazy evaluation, and automatic parallelism. Here's how each piece works under the hood.
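The lazy evaluation piece in miniature (the file and columns are hypothetical):

```python
import polars as pl

# scan_csv builds a query plan instead of reading the file; the optimizer
# pushes the filter down into the scan, and collect() executes the plan
# in parallel across all cores.
lazy = (
    pl.scan_csv("sales.csv")
      .filter(pl.col("amount") > 100)
      .group_by("region")
      .agg(pl.col("amount").mean().alias("avg_amount"))
)
print(lazy.explain())    # inspect the optimised plan before running it
result = lazy.collect()  # nothing is read from disk until this call
```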
Polars: Faster Pipelines, Simpler Infrastructure, Happier Engineers
We've migrated our own IP and several customers from Pandas and Spark to Polars. The benefits go beyond raw speed: faster test suites, lower platform costs, and an API developers actually enjoy using.
Top Features of Notebooks in Microsoft Fabric
Discover the key features of notebooks in Microsoft Fabric.
DuckLake in Perspective: Advanced Features and Future Implications
Explore DuckLake's advanced capabilities including built-in encryption, sophisticated conflict resolution, and the strategic implications for future data architecture. Understand how DuckLake enables new business models and positions itself against established lakehouse formats.
DuckLake in Practice: Hands-On Tutorial and Core Features
Get hands-on with DuckLake through a comprehensive tutorial covering installation, basic operations, file organization, snapshots, and time travel functionality. Learn how DuckLake's database-backed metadata management works in practice.
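A condensed taste of the tutorial's first steps; the file names are placeholders and the syntax reflects the DuckLake extension's initial release, so check the current docs before copying:

```python
import duckdb

# Install the extension and attach a DuckLake catalog.
con = duckdb.connect()
con.sql("INSTALL ducklake")
con.sql("ATTACH 'ducklake:metadata.ducklake' AS lake (DATA_PATH 'lake_files/')")

# Tables behave like ordinary SQL tables: metadata goes into the catalog
# database, data is written as Parquet files under DATA_PATH.
con.sql("CREATE TABLE lake.events AS SELECT 42 AS id, 'created' AS status")
con.sql("UPDATE lake.events SET status = 'updated'")

# Every commit is a snapshot, so earlier versions stay queryable
# (time travel); version numbers depend on the commit history.
print(con.sql("SELECT * FROM lake.events AT (VERSION => 1)"))
```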
Introducing DuckLake: Lakehouse Architecture Reimagined for the Modern Era
DuckDB Labs introduces DuckLake, a revolutionary approach to lakehouse architecture that solves fundamental problems with existing formats by bringing database principles back to data lake metadata management.
What is a Data Lakehouse?
What exactly is a Data Lakehouse? This blog gives a general introduction to its history, its functionality, and what it might mean for you!
DuckDB in Practice: Enterprise Integration and Architectural Patterns
DuckDB comes pre-installed in Microsoft Fabric Python notebooks, so code developed locally deploys straight to production with enterprise monitoring, governance, and OneLake integration.
DuckDB in Depth: How It Works and What Makes It Fast
Dive deep into the technical details of DuckDB, exploring its columnar architecture, vectorized execution, SQL enhancements, and the performance optimizations that make it exceptionally fast on a single machine.
DuckDB: the Rise of In-Process Analytics and Data Singularity
Modern laptops can now handle datasets up to a billion rows, yet 94% of query spending goes on big-data compute that isn't needed. DuckDB brings analytical SQL directly into your process.
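The in-process idea in a few lines (taxi.parquet is a stand-in file):

```python
import duckdb

# No server, no cluster: a library call queries a local file directly.
duckdb.sql("""
    SELECT passenger_count, AVG(fare_amount) AS avg_fare
    FROM 'taxi.parquet'
    GROUP BY passenger_count
    ORDER BY passenger_count
""").show()
```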
Creating Quality Gates in the Medallion Architecture with Pandera
This blog explores how to implement robust validation strategies within the medallion architecture using Pandera, helping you catch issues early and maintain clean, trustworthy data.
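A minimal sketch of such a quality gate; the schema, column names, and layer naming are illustrative, not the post's exact code:

```python
import pandas as pd
import pandera as pa

# A bronze-to-silver gate in miniature: rows are only promoted to the
# silver layer if they pass the schema, so bad data is caught early.
silver_schema = pa.DataFrameSchema({
    "order_id": pa.Column(int, unique=True),
    "amount": pa.Column(float, pa.Check.ge(0)),
    "country": pa.Column(str, pa.Check.isin(["UK", "US", "NL"])),
})

bronze = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.0, 25.5, 3.2],
    "country": ["UK", "NL", "US"],
})

silver = silver_schema.validate(bronze)  # raises SchemaError on violations
print(silver)
```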
Carbon Optimised Data Pipelines - minimise CO2 emissions through intelligent scheduling (Next Steps)
Intelligently scheduling cloud data pipelines based on carbon impact can optimise both environmental sustainability and operational efficiency.
Carbon Optimised Data Pipelines - minimise CO2 emissions through intelligent scheduling (Pipeline Definition)
Intelligently scheduling cloud data pipelines based on carbon impact can optimise both environmental sustainability and operational efficiency.
Carbon Optimised Data Pipelines - minimise CO2 emissions through intelligent scheduling (Architecture Overview)
Intelligently scheduling cloud data pipelines based on carbon impact can optimise both environmental sustainability and operational efficiency.
Carbon Optimised Data Pipelines - minimise CO2 emissions through intelligent scheduling (Introduction)
Intelligently scheduling cloud data pipelines based on carbon impact can optimise both environmental sustainability and operational efficiency.
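To make the series' core idea concrete, a toy scheduler; the forecast numbers and deadline are invented for illustration:

```python
# Given an hourly forecast of grid carbon intensity (gCO2/kWh), start the
# pipeline in the cleanest window before its deadline instead of running
# immediately.
forecast = {0: 310, 1: 280, 2: 190, 3: 170, 4: 220, 5: 340}

def best_start_hour(forecast: dict[int, float], deadline_hour: int) -> int:
    """Pick the hour with the lowest forecast intensity that meets the deadline."""
    eligible = {hour: co2 for hour, co2 in forecast.items() if hour <= deadline_hour}
    return min(eligible, key=eligible.get)

print(best_start_hour(forecast, deadline_hour=5))  # -> 3
```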