Microsoft Fabric ML Tutorial Part 2: Data Validation with Great Expectations
A deep dive into data validation using Microsoft Fabric and Great Expectations, for a Predictive Analytics use case with the Kaggle Titanic data set.
About this talk
In part 2 of this course Barry Smart, Director of Data and AI, walks through a demo showing how you can use Microsoft Fabric to set up a "data contract" that establishes minimum data quality standards for data that is being processed by a data pipeline.
He deliberately passes bad data into the pipeline to show how the process can be set up to "fail elegantly" by dropping the bad rows and continuing with only the good rows. Finally, he uses the new Teams pipeline activity in Fabric to send a message to the data stewards responsible for the data set, informing them that validation has failed and itemising, in the body of the message, the specific rows that failed and the validation errors that were generated.
The demo uses the popular Titanic data set to show features of the data engineering experience in Fabric, including Notebooks, Pipelines and the Lakehouse. It uses the popular Great Expectations Python package to establish the data contract, and Microsoft's mssparkutils Python package to pass the exit value of the Notebook back to the Pipeline that triggered it.
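The "data contract" pattern described above can be sketched as follows. Great Expectations' API differs across versions, so this sketch uses plain pandas predicates to stand in for the expectation checks; the column names and rules are illustrative assumptions based on the Titanic data set, not the exact contract used in the demo.

```python
import pandas as pd

# Hypothetical "data contract": each rule maps a column to a predicate that
# every row must satisfy (an assumption, not the demo's exact expectations).
CONTRACT = {
    "PassengerId": lambda s: s.notna(),
    "Survived": lambda s: s.isin([0, 1]),
    "Age": lambda s: s.isna() | s.between(0, 120),
}

def validate(df: pd.DataFrame):
    """Apply the contract; return (clean_rows, bad_rows, failure_messages)."""
    passed = pd.Series(True, index=df.index)
    failures = []
    for column, rule in CONTRACT.items():
        ok = rule(df[column])
        for idx in df.index[~ok]:
            failures.append(f"Row {idx}: {column}={df.at[idx, column]!r} failed contract")
        passed &= ok
    return df[passed], df[~passed], failures

# Deliberately include a bad row, as in the demo, to show "failing elegantly":
# the bad row is dropped and processing continues with the good rows.
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 5],      # 5 violates the contract
    "Age": [22.0, None, 30.0],
})
clean, bad, messages = validate(raw)
print(len(clean), len(bad))  # 2 1
```

The clean rows would then be written to the Silver layer in Delta format, while the failure messages feed the notification step.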
Barry begins the video by explaining the architecture that is being adopted in the demo including Medallion Architecture and DataOps practices. He explains how these patterns have been applied to create a data product that provides Diagnostic Analytics of the Titanic data set. This forms part of an end to end demo of Microsoft Fabric that we will be providing as a series of videos over the coming weeks.
Chapters:
- 00:12 Overview of the architecture
- 00:36 The focus for this video is processing data to Silver
- 00:55 The DataOps principles of data validation and alerting will be applied
- 02:19 Tour of the artefacts in the Microsoft Fabric workspace
- 02:56 Open the "Validation Location" notebook and view its contents
- 03:30 Inspect the reference data that is going to be validated by the notebook
- 05:14 Overview of the key stages in the notebook
- 05:39 Set up the notebook, using %run to establish utility functions
- 06:21 Set up a "data contract" using great expectations package
- 07:45 Load the data from the Bronze area of the lake
- 08:18 Validate the data by applying the "data contract" to it
- 08:36 Remove any bad records to create a clean data set
- 09:04 Write the clean data to the lakehouse in Delta format
- 09:52 Exit the notebook using mssparkutils to pass back validation results
- 10:53 Lineage is used to discover the pipeline that triggers the notebook
- 11:01 Exploring the "Process to Silver" pipeline
- 11:35 An "If Condition" is configured to process the notebook exit value
- 11:56 A Teams pipeline activity is set up to notify users
- 12:51 Title and body of Teams message are populated with dynamic information
- 13:08 Word of caution about exposing sensitive information
- 13:28 What's in the next episode?
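The exit-value handoff at 09:52 can be sketched as follows. mssparkutils.notebook.exit passes a single string back to the triggering pipeline, so validation results are typically serialised to JSON for the pipeline's "If Condition" to inspect; the payload fields below are illustrative assumptions, not the exact schema used in the demo.

```python
import json

def build_exit_payload(total_rows: int, failed_rows: list, errors: list) -> str:
    """Serialise validation results into the single string that
    mssparkutils.notebook.exit can return to the pipeline."""
    return json.dumps({
        "validation_passed": len(failed_rows) == 0,  # drives the If Condition
        "total_rows": total_rows,
        "rows_dropped": len(failed_rows),
        "failed_rows": failed_rows,
        "errors": errors,
    })

payload = build_exit_payload(
    total_rows=891,
    failed_rows=[3],
    errors=["Row 3: Survived=5 failed contract"],
)
# In the Fabric notebook this payload would then be returned with:
#   from notebookutils import mssparkutils
#   mssparkutils.notebook.exit(payload)
print(payload)
```

The Teams activity can then pull the row details and error messages out of this payload to populate the title and body of the alert message.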
From Descriptive to Predictive Analytics with Microsoft Fabric:
- Part 1 - Overview
- Part 2 - Data Validation with Great Expectations
- Part 3 - Testing Notebooks
- Part 4 - Task Flows
- Part 5 - Observability
Microsoft Fabric End to End Demo Series:
- Part 1 - Lakehouse & Medallion Architecture
- Part 2 - Plan and Architect a Data Project
- Part 3 - Ingest Data
- Part 4 - Creating a shortcut to ADLS Gen2 in Fabric
- Part 5 - Local OneLake Tools
- Part 6 - Role of the Silver Layer in the Medallion Architecture
- Part 7 - Processing Bronze to Silver using Fabric Notebooks
- Part 8 - Good Notebook Development Practices
Microsoft Fabric First Impressions:
Decision Maker's Guide to Microsoft Fabric