Browse our archives by topic…
Data Engineering
Fabric Performance Benchmarking - Spark versus Python Notebooks
Benchmarking Pandas, PySpark, Polars, and DuckDB on Microsoft Fabric: in-process Python engines run 4-5x cheaper and faster than Spark for common workloads.
Scaling API Ingestion with the Queue-of-Work Pattern
The queue-of-work pattern enables massive parallelism for API ingestion by breaking large jobs into thousands of independent work items processed by concurrent workers. This approach reduced data ingestion time for our use case from 15 hours to under 2 hours while providing automatic retry handling and fault tolerance at a fraction of the cost of traditional orchestration tools.
Polars Workloads on Microsoft Fabric
Polars now ships inside Microsoft Fabric by default. Here's how to use it alongside Fabric's other analytics tools and what that means for your data workflows.
Practical Polars: Code Examples for Everyday Data Tasks
Unlock Python Polars with this hands-on guide featuring practical code examples for data loading, cleaning, transformation, aggregation, and advanced operations that you can apply to your own data analysis projects.
Under the Hood: What Makes Polars So Scalable and Fast?
Polars gets its speed from a strict type system, lazy evaluation, and automatic parallelism. Here's how each piece works under the hood.
Polars: Faster Pipelines, Simpler Infrastructure, Happier Engineers
We've migrated our own IP and several customers from Pandas and Spark to Polars. The benefits go beyond raw speed: faster test suites, lower platform costs, and an API developers actually enjoy using.
Building data quality into Microsoft Fabric
Data quality issues are one of the biggest silent killers of analytics initiatives. This post explores how to build data quality into Microsoft Fabric from the ground up.
FabCon Vienna 2025: Day 3
FabCon is a conference dedicated to everything Microsoft Fabric. Day 3's sessions included migration, Databricks, Spark optimisation, and more.
FabCon Vienna 2025: Day 2
FabCon is a conference dedicated to everything Microsoft Fabric. Day 2 featured deep dives into OneLake, Maps in Fabric, and multi-agent AI systems.
Batch Processing Triggered Pipeline Runs in Azure Synapse
This post describes a pattern for batch processing triggered pipeline runs in Azure Synapse.
Reliably refreshing a Semantic Model from Azure Data Factory or Synapse Pipelines
This post describes a pattern for reliably refreshing Power BI semantic models from Azure Data Factory or Azure Synapse Pipelines.
Reliably refreshing a Semantic Model from Microsoft Fabric Pipelines
This post describes a pattern for reliably refreshing Power BI semantic models from Microsoft Fabric Pipelines.
FabCon Vienna 2025: Day 1
FabCon is a conference dedicated to everything Microsoft Fabric. Day 1 was mostly focused on the hundreds of new feature announcements.
Supercharge Your Dev Containers on Windows
Running VS Code Dev Containers on Windows? Clone repos inside the WSL filesystem to eliminate I/O bottlenecks and dramatically boost performance.
DuckLake in Perspective: Advanced Features and Future Implications
Explore DuckLake's advanced capabilities including built-in encryption, sophisticated conflict resolution, and the strategic implications for future data architecture. Understand how DuckLake enables new business models and positions itself against established lakehouse formats.
DuckLake in Practice: Hands-On Tutorial and Core Features
Get hands-on with DuckLake through a comprehensive tutorial covering installation, basic operations, file organization, snapshots, and time travel functionality. Learn how DuckLake's database-backed metadata management works in practice.
Introducing DuckLake: Lakehouse Architecture Reimagined for the Modern Era
DuckDB Labs introduces DuckLake, a revolutionary approach to lakehouse architecture that solves fundamental problems with existing formats by bringing database principles back to data lake metadata management.
What is a Data Lakehouse?
What exactly is a Data Lakehouse? This blog gives a general introduction to their history, functionality, and what they might mean for you!
DuckDB in Practice: Enterprise Integration and Architectural Patterns
DuckDB comes pre-installed in Microsoft Fabric Python notebooks, so code developed locally deploys straight to production with enterprise monitoring, governance, and OneLake integration.
DuckDB in Depth: How It Works and What Makes It Fast
Dive deep into the technical details of DuckDB, exploring its columnar architecture, vectorized execution, SQL enhancements, and the performance optimizations that make it exceptionally fast on a single machine.
DuckDB: the Rise of In-Process Analytics and Data Singularity
Modern laptops can now handle datasets up to a billion rows, yet 94% of query spending goes on big-data compute that isn't needed. DuckDB brings analytical SQL directly into your process.
Creating Quality Gates in the Medallion Architecture with Pandera
This blog explores how to implement robust validation strategies within the medallion architecture using Pandera, helping you catch issues early and maintain clean, trustworthy data.
Working locally with Spark dev containers
Running Spark locally in a dev container can significantly improve development feedback loops. This first article explains why, and the rest of the series will show how.
Per-Property Rows from JSON in Spark on Microsoft Fabric
Spark doesn't always interpret JSON how we'd like. For example, if each key/value pair in a JSON object is conceptually one item, Spark won't give you a row per item by default. This article shows how to nudge Spark in the right direction.
Introduction to Python Logging in Synapse Notebooks
The first step on the road to implementing observability in your Python notebooks is basic logging. In this post, we look at how you can use Python's built-in logging inside a Synapse notebook.
Star Schemas are fundamental to unleashing value from data in Microsoft Fabric
Ralph Kimball's 1996 Star Schema principles still underpin modern cloud-native analytics — why dimensional modelling unlocks value from data in Microsoft Fabric.
Adopt A Product Mindset To Maximise Value From Microsoft Fabric
Treating data as a product turns data teams from order takers into innovation engines. A product mindset gives you a framework to fail fast, build user empathy, and focus resources on high-value work.
Exploring Strategies Enabled By Microsoft Fabric
Explore building situational awareness and leveraging strategic opportunities with Microsoft Fabric in this concise overview.
Developing a Data Mesh Inspired Vision Using Microsoft Fabric
Explore how Microsoft Fabric can support a data-driven strategy inspired by Data Mesh, and learn how to approach a Data Mesh vision using the platform.
How Does Microsoft Fabric Measure Up To Data Mesh?
Explore Data Mesh's influence on Microsoft Fabric, addressing gaps in data product marketplace, standards, master data management, and governance.
Microsoft Fabric Is A Socio-Technical Endeavour
Creating a successful organisation-wide data and analytics platform isn't just about architecture, schemas and semantic models. It's also about culture, organisational design and people. This blog explores the socio-technical nature of data and analytics and how this should influence your approach to adoption of Microsoft Fabric.
Azure Synapse Analytics versus Microsoft Fabric: A Side by Side Comparison
In this Microsoft Fabric vs Synapse comparison we examine how features map from Azure Synapse to Fabric.
Data validation in Python: a look into Pandera and Great Expectations
Implement Python data validation with Pandera & Great Expectations in this comparison of their features and use cases.
How to set up Python, PyEnv & Poetry on Windows
Explore using Python virtual environments & Poetry on Windows for smoother workflows, with a script & guide to enhance your dependency management experience.
How To Implement Continuous Deployment of Python Packages with GitHub Actions
Discover using GitHub Actions for auto-updates to Python packages on PyPI, assessing its role in Continuous Deployment.
Customizing Lake Databases in Azure Synapse Analytics
Explore Custom Objects in Lake Databases for user-friendly column names, calculated columns, and pre-defined queries in Azure Synapse Analytics.
How to create a semantic model using Synapse Analytics Database Templates
Explore Azure Synapse Analytics Database Templates and learn to create semantic models in this second blog of the series.
Continuous Integration with GitHub Actions
This post gives an overview of Continuous Integration and shows how you can implement it with GitHub Actions, with an accompanying example Python project.
How to apply behaviour driven development to data and analytics projects
In this blog we demonstrate how the Gherkin specification can be adapted to enable BDD to be applied to data engineering use cases.
What is the Shared Metadata Model in Azure Synapse Analytics, and why should I use it?
Explore Azure Synapse's 'Shared Metadata Model' feature. Learn how it syncs Spark tables with SQL Serverless, its benefits, and tradeoffs.
Extract insights from tag lists using Python Pandas and Power BI
Discover how to extract insights from spreadsheets and CSV files using Pandas and Power BI in this blog post.
Introduction to Containers and Docker
Explore containerisation & Docker for app development & deployment. Learn to create containerised applications with examples in this intro guide.
How to test Azure Synapse notebooks
Explore data with Azure Synapse's interactive Spark notebooks, integrated with Pipelines & monitoring tools. Learn how to add tests for business rule validation.
How Azure Synapse unifies your development experience
Modern analytics requires a multi-faceted approach, which can cause integration headaches. Azure Synapse's Swiss army knife approach can remove a lot of friction.
How to use SQL Notebooks to access Azure Synapse SQL Pools & SQL on demand
Wishing Azure Synapse Analytics had support for SQL notebooks? Fear not, it's easy to take advantage of rich interactive notebooks for SQL Pools and SQL on Demand.
Azure Synapse for C# Developers: 5 things you need to know
Did you know that Azure Synapse has great support for .NET and C#? Learning new languages is often a barrier to digital transformation; being able to use existing people, skills, tools and engineering disciplines can be a massive advantage.
Import and export notebooks in Databricks
Learn to import/export notebooks in Databricks workspaces manually or programmatically, and transfer content between workspaces efficiently.
Azure Databricks CLI "Error: JSONDecodeError: Expecting property name enclosed in double quotes:..."
Explore solutions for JSONDecodeError in Databricks CLI & Clusters. Learn how pre-built CLIs/SDKs simplify requests & authentication in REST APIs.
Using Databricks Notebooks to run an ETL process
Explore data analysis & ETL with Databricks Notebooks on Azure. Utilize Spark's unified analytics engine for big data & ML, and integrate with ADF pipelines.
Using Python inside SQL Server
Learn to use SQL Server's Python integration for efficient data handling. Eliminate clunky transfers and easily operationalize Python models/scripts.