Introduction to the Medallion Architecture
In this video, Carmel Eve provides a comprehensive introduction to the Medallion Architecture, a widely-adopted data design pattern for modern data platforms.
What You'll Learn
This video covers the three-tier architecture pattern that helps organizations structure their data lakes and lakehouses effectively:
- Bronze Tier: Raw data ingestion and historical archiving
- Silver Tier: Data transformation, validation, and cleaning
- Gold Tier: Business-ready data optimized for specific use cases
Key Topics Covered
- The difference between ETL and ELT approaches
- How to transform various data formats (JSON, XML, CSV) into tabular structures
- Schema validation and data quality considerations
- When to combine data sources vs. keeping them separate
- Creating multiple data projections for different reporting needs
- Best practices for organizing data by domain and use case
Who This Video Is For
This video is ideal for:
- Data engineers building modern data platforms
- Data architects designing lakehouse architectures
- IT professionals working with Azure, Databricks, or similar platforms
- Anyone interested in understanding data design patterns
Chapters
- 00:00 Introduction to Medallion Architecture
- 00:50 Understanding the Bronze Tier
- 01:46 Transforming Data in the Silver Tier
- 04:51 Creating the Gold Layer
- 06:43 Conclusion and Next Steps
Transcript
Welcome to this introduction to the medallion architecture. Here I'll be continuing my exploration of the new data landscape with this important concept. I'm personally quite familiar with design patterns due to my background in C#, and we can think of the medallion architecture in much the same way: as a data design pattern.
It is a way to support good practices and introduce consistency and durability into the system. The medallion architecture consists of three tiers: bronze, silver, and gold. Data moves through these tiers from bronze to silver to gold, being refined and restructured until you arrive at a useful data product.
The idea of data products is to treat data assets as fully supported and maintained products. I have a link below to a blog on the subject if you want to know more.
The Bronze Tier
So what is the bronze tier? Generally, operational data is ingested directly into the bronze tier with little or no cleaning or validation. Preserving data in this state is important, as it provides a historical archive and a way to replay all of the data.
The ingestion of data directly into the bronze tier follows the ELT (not ETL) process, where data is extracted, loaded, and then transformed, rather than being transformed before it's loaded. Usually the bronze tier will be organized by source and ingestion time, giving you a record of when data arrived and where it came from.
The advantage of this pattern is that it provides a historical audit trail of exactly what was ingested. You have each version of a file saved, allowing for complete playback and reprocessing of the data from first principles if you need it.
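To make that concrete, here's a minimal sketch of landing a raw file in the bronze tier, organized by source and ingestion time. The paths, the source system name, and the use of a local file system are all hypothetical; a real platform would typically write to cloud storage (ADLS, S3, or similar) instead.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

def ingest_to_bronze(local_file: str, source_system: str,
                     bronze_root: str = "/lake/bronze") -> Path:
    """Land a raw file unchanged, organized by source and ingestion time."""
    now = datetime.now(timezone.utc)
    target_dir = (Path(bronze_root) / source_system /
                  f"year={now:%Y}" / f"month={now:%m}" / f"day={now:%d}")
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original bytes untouched; the path alone records provenance.
    target = target_dir / f"{now:%H%M%S}_{Path(local_file).name}"
    shutil.copy2(local_file, target)
    return target

# e.g. ingest_to_bronze("orders_2024-11.csv", source_system="erp")
```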
The Silver Layer
Once the raw data has been ingested into bronze, it is then processed into the silver layer. During this process, the data is transformed from whatever format it arrived in (it could be JSON, XML, CSV, Excel spreadsheets, or anything) into a tabular format, usually Parquet or Delta. For example, you could receive complex binary data that you need to transform into a readable format, or nested JSON that needs to be flattened into a tabular structure.
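As an illustration of that flattening step, here's a small PySpark sketch. It assumes a Spark session with Delta Lake available, and a hypothetical bronze feed where each JSON record holds a person plus a nested array of orders:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: {"id": 1, "name": "Ada", "orders": [{"sku": "A1", "qty": 2}]}
raw = spark.read.json("/lake/bronze/shop/year=2024/month=11/")

flat = (
    raw
    .withColumn("order", F.explode("orders"))  # one output row per nested order
    .select(
        F.col("id").alias("person_id"),
        "name",
        F.col("order.sku").alias("sku"),
        F.col("order.qty").alias("quantity"),
    )
)

flat.write.format("delta").mode("append").save("/lake/silver/shop/orders")
```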
Alongside this transformation, the data is validated. This usually consists of applying a schema, which ensures that all of the data has the expected shape: all of the expected columns are present, and the data itself follows certain rules. For example, you may have columns that shouldn't contain null values, and you can enforce that here. You might also want to do additional validation, like checking that you have the expected number of rows, or that known values are what they're expected to be.
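Here's a minimal sketch of that kind of validation in PySpark; the schema, the paths, and the row-count bounds are all hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical expected shape for a "people" feed
expected = StructType([
    StructField("person_id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=False),
    StructField("age", IntegerType(), nullable=True),
])

df = spark.read.schema(expected).option("header", True).csv("/lake/bronze/hr/people/")

# Rule checks: required columns must not contain nulls, and the
# batch should contain roughly the number of rows we expect.
missing_names = df.filter(df.name.isNull()).count()
if missing_names > 0:
    raise ValueError(f"{missing_names} rows are missing a name")
if not (1 <= df.count() <= 1_000_000):
    raise ValueError("row count outside the expected range")
```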
You might also do some transformation at this stage. You may want to unpivot data that has been entered in a pivoted state (which is often the case, as pivoted data is more human-readable) to make it easier to create reports or perform analysis over it; there's a small sketch of this below. However, the output should mimic the initial overall structure of the data, and separate files should only be combined in your output tables if they're logical continuations of the same table.
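Here's that unpivot sketch, using Spark SQL's stack function; the table path and column names are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical pivoted input: product | jan_sales | feb_sales | mar_sales
pivoted = spark.read.format("delta").load("/lake/silver/sales/monthly_wide")

# stack() turns the three month columns into (month, sales) rows
long_form = pivoted.selectExpr(
    "product",
    "stack(3, 'jan', jan_sales, 'feb', feb_sales, 'mar', mar_sales) AS (month, sales)",
)
```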
As an example of keeping sources separate: if you had two input files, one containing "people" (an XML file with people's names, ages, and genders) and one containing "contact info" (a CSV file with those people's email addresses, cities, phone numbers, and so on), these should be two individual tables in your silver layer, even if you end up combining them further down the line.
However, if you had, for example, two files with "orders from October" and "orders from November," then you might combine these in the silver layer, as they're logical continuations of the same data. The silver layer is also usually where you deal with upserts and modifications. For example, if you receive a new update to the orders from November, you would process it in the silver layer in order to create a complete, current view of the world.
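A minimal sketch of that upsert using the Delta Lake merge API, assuming a hypothetical order_id key and hypothetical paths:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

updates = spark.read.json("/lake/bronze/shop/orders_november_update.json")
orders = DeltaTable.forPath(spark, "/lake/silver/shop/orders")

# Upsert: update rows that already exist, insert the ones that don't,
# so the silver table remains a single current view of the world.
(
    orders.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```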
The exception to this merging of updates is if there's a requirement to track changes over time as part of your reporting or analytics use case. After the processing, you should be left with an unopinionated (that is, not shaped for a specific report or purpose) but cleaned and validated copy of your data, in a form that's easily consumable downstream, probably tabular.
This data will usually be organized by domain (what it will be used for) and by the entity it originated from (the original files).
Data Quality
It is often stated that data quality improves as data moves through the tiers of the medallion architecture; however, I think this is incorrect. Data in the silver layer will often be used for feeding ML or statistical models, for running simulations, or for performing data science experiments and ad hoc analysis. This means that at this point the data should already be high quality and reliable; otherwise you risk unpredictable and dangerously incorrect results.
The Gold Layer
Once you have cleaned and validated your data, it is time to create a gold layer. The gold layer serves data in the format that is most useful for specific use cases. Often this will be BI reporting, but it could also be for serving applications or driving analytics.
Transformations that you often see in this projection stage include the following (there's a small sketch of a couple of these after the list):
- Pivoting your data so you can better serve Power BI reports
- Aggregating data if you only need monthly or yearly granularity reporting
- Creating star or snowflake schemas (so splitting the data into fact and dimension tables)
- Combining related data from different sources
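Here's a rough sketch of two of these, combining sources and aggregating to monthly granularity, in PySpark; the table paths and column names are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

orders = spark.read.format("delta").load("/lake/silver/shop/orders")
people = spark.read.format("delta").load("/lake/silver/hr/people")

# Combine related silver tables, then aggregate to the monthly
# granularity the report actually needs.
monthly_sales = (
    orders.join(people, "person_id")
    .withColumn("month", F.date_trunc("month", "order_date"))
    .groupBy("month", "city")
    .agg(
        F.sum("amount").alias("total_sales"),
        F.count("*").alias("order_count"),
    )
)

monthly_sales.write.format("delta").mode("overwrite").save("/lake/gold/sales/monthly_by_city")
```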
You project your unopinionated tables into the format that best fits your needs. This might mean joining related tables, like bringing together all person-specific information. As I mentioned earlier, you could also encode your data (one-hot encoding, for example; if you want to know more about this, I've linked to a great article below), or create entirely new reference data sets, like a date table with additional information such as the day of the week.
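As a sketch, a date table like that can be generated rather than ingested. This assumes Spark SQL's sequence function and hypothetical paths:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Generate one row per day for 2024, then enrich with calendar attributes
dates = spark.sql(
    "SELECT explode(sequence(to_date('2024-01-01'), "
    "to_date('2024-12-31'), interval 1 day)) AS date"
)
date_dim = (
    dates
    .withColumn("day_of_week", F.date_format("date", "EEEE"))  # e.g. 'Monday'
    .withColumn("month", F.month("date"))
    .withColumn("year", F.year("date"))
)

date_dim.write.format("delta").mode("overwrite").save("/lake/gold/shared/dim_date")
```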
You might have multiple projections of the same silver data in your gold layer, with each one forming the basis of a semantic model for a different report. For example, you could have one report that requires all of the personally identifiable information from the person table, and another for which that's not necessary. In this case, you could create one projection that includes this information and one that doesn't, which lets you apply much more fine-grained security, with each report only accessing the information that it actually needs.
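A minimal sketch of those two projections, assuming a hypothetical person table and hypothetical PII column names; the access controls themselves would be applied by the platform on top of these tables:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

person = spark.read.format("delta").load("/lake/silver/hr/people")

# Hypothetical PII columns; each report's semantic model reads only
# the projection it actually needs.
PII_COLUMNS = ["name", "email", "phone"]

# Full projection for the report that needs PII (restricted access)
person.write.format("delta").mode("overwrite").save("/lake/gold/hr/person_full")

# Masked projection for everything else (broad access)
person.drop(*PII_COLUMNS).write.format("delta") \
    .mode("overwrite").save("/lake/gold/hr/person_masked")
```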
At this point, your data should be organized by domain and use case. Sometimes your gold layer might be exactly the same as your silver, but often some adjustments need to be made in order to better serve your reporting needs.
Watch out for my next video, in which I'll talk about the connection between the gold layer and the semantic layer, and at what point your data is production-ready.