Carmel Eve

Talk

In this episode, Carmel Eve delves deeper into the connection between the medallion architecture and the semantic layer, exploring when data becomes truly production ready.

Carmel recaps the three tiers of the medallion architecture: raw bronze, cleaned silver, and opinionated gold layers. The focus is placed on the semantic layer, which adds context and meaning to the gold layer data, making it useful for humans, machines, and AI. This semantic layer includes human or machine-friendly names, advanced data types, metadata for governance, and relationships between data objects.

Carmel also discusses the environments for software development—development, testing, and production—and how this multi-environment system applies to the medallion architecture. Finally, she highlights the advantages of using this pattern, such as supporting multiple use cases, enhancing data lineage, and ensuring data's flexibility and security.

Chapters

  • 00:00 Introduction and Recap
  • 00:06 Understanding the Semantic Layer
  • 01:21 Defining the Semantic Layer
  • 03:02 Data Production Readiness
  • 03:13 Development, Testing, and Production Environments
  • 04:43 Applying the Medallion Architecture
  • 09:30 Conclusion and Key Takeaways

Transcript

Welcome back. In our previous video, we covered an introduction to the medallion architecture. Now let's understand how it connects to the semantic layer, plus we'll discuss when data becomes truly production-ready.

As we saw last time, after passing through the three tiers of the medallion architecture, we arrived at an opinionated or projected view of the data in our gold layer. This gold layer forms the basis of a semantic layer, but the semantic layer adds additional context and meaning to the tables in the gold layer, which enables humans and, more so these days, machines and AI to understand and engage with data.

So this could include human or machine-friendly names, descriptions, and synonyms of objects like tables and columns. It could include assigning more advanced data types over the primitive types used in the gold layer – saying, for example, that this number is a percentage, or a latitude, or a currency, or that this string is a city. It could include adding metadata to help you enable effective governance of the gold layer, like data classification tags. And it could mean defining relationships between your objects.
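To make those kinds of annotations concrete, here's a minimal sketch of semantic-layer metadata as a plain data structure. The field names and the example table are illustrative only – they're not taken from Power BI, Purview, or any specific tool:

```python
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str                       # physical name in the gold table
    friendly_name: str              # human/machine-friendly display name
    semantic_type: str              # richer than the primitive type, e.g. "percentage"
    classification: str = "public"  # governance tag, e.g. "public", "pii"

@dataclass
class Table:
    name: str
    description: str
    columns: list[Column] = field(default_factory=list)
    # Relationships as (local column, foreign table, foreign column)
    relationships: list[tuple[str, str, str]] = field(default_factory=list)

# Hypothetical example: annotating a gold-layer sales table
sales = Table(
    name="gold_sales",
    description="Cleaned, projected sales transactions",
    columns=[
        Column("amt", "Sale Amount", "currency"),
        Column("disc", "Discount", "percentage"),
        Column("cust_city", "Customer City", "city", classification="pii"),
    ],
    relationships=[("cust_id", "gold_customers", "id")],
)
```

The point is that each element the transcript mentions – friendly names, advanced types, governance tags, relationships – is just extra metadata layered over the physical gold table.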

Often the idea of the gold layer is conflated with the semantic layer, but without defining your objects' meaning and relationships, you can't use that data in any meaningful way. The semantic layer usually sits outside of the lakehouse, and its exact form will be dependent on your use case.

A good example of this is Power BI. So when you're creating a report, you import the data from the gold layer, and this forms a part of the model. But you then update column names and types to make the report more readable, and enrich the data, and use the modeling tab to define the relationships between your tables and objects. And only at this point is your semantic model fully defined.

Power BI is a useful example because the Semantic Model is a built-in concept, but your semantic layer could also be defined using other tools like Microsoft Purview or Databricks Unity Catalog. Whatever your use case – be it reporting, analytics, or application development – you're likely to define a semantic layer in order to describe your data and give it meaning.

Each output from the semantic layer is a data product – a valuable asset that should be maintained, versioned, and treated as a fully contained product. You may have multiple data products or versions of each, which are consumed by different use cases or used for different types of analytics.

So overall, we have: data is ingested from on-premises or cloud systems into the raw bronze layer, is then processed to the cleaned and validated silver layer, and projected to the opinionated gold layer. Data is then defined as it moves into the described semantic layer, where it goes on to serve actionable insights. The bronze, silver, and gold layers usually sit inside of the lakehouse, but the semantic layer often sits outside.
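That end-to-end flow can be sketched with plain Python records; a real lakehouse would of course use something like Spark and Delta tables, so treat this purely as an illustration of the shape of each step:

```python
# Bronze: raw data exactly as ingested – inconsistent and partly invalid
bronze = [
    {"id": "1", "city": " london ", "amount": "10.5"},
    {"id": "2", "city": "Paris",    "amount": "not-a-number"},
    {"id": "3", "city": "london",   "amount": "4.0"},
]

def to_silver(rows):
    """Clean and validate: normalise fields, drop rows that fail checks."""
    out = []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # quality gate: invalid rows never reach silver
        out.append({"id": int(r["id"]),
                    "city": r["city"].strip().title(),
                    "amount": amount})
    return out

def to_gold(rows):
    """Project an opinionated view: total sales per city."""
    totals = {}
    for r in rows:
        totals[r["city"]] = totals.get(r["city"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'London': 14.5}
```

Notice that each tier is derived from the one before it: the invalid row is dropped at the silver quality gate, and the gold projection is a specific, opinionated aggregation over the clean silver data.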

So now on to our next question: when is data production-ready?

In software development, we often talk about different environments, and at endjin we often use a three-tiered approach with development, testing, and production.

So in general, the development environment is where engineers are currently working and is therefore the most volatile. New code will be deployed here, and bugs will often be found during this first stage. Things will be changing rapidly – extremely rapidly if the project is actively being worked on. Nothing production-focused or client-facing should ever depend on the development environment, as it's purely a place for developers to make changes, trial solutions, and update things.

Once developers are happy with the changes made, and hopefully those changes have passed some kind of quality gate, they'll be deployed into the testing environment. Here, the code undergoes more rigorous testing in a more controlled environment. There might be additional tests, including integration tests, non-functional tests, and anything else that doesn't fit into the quick feedback loop needed during development. Hopefully at this point, any bugs that have slipped through during the development stage will be found.

Finally, once all the tests have passed and all the validation has been carried out, the code will be deployed into production. This is the live code that your wider solution depends on, and if, for example, you're hosting a client-facing web application, this code will drive your public app, and you would therefore hope that it is reliable and bug-free.

This is a generalized pattern that's applicable in loads of scenarios, but the number of environments and their purposes can vary. For example, you might require a pre-production environment for additional validation, or a specialized QA environment in order to meet regulatory requirements.

So how does this apply to our data design pattern? It's slightly confusing that we now have three tiers in both of these separate but related dimensions. In the medallion architecture, data moves from raw to clean to projected. But alongside this, we still want an environment in which engineers can experiment and change things about data – including the silver and gold tier – without this impacting anything public-facing (our development environment). We also want an environment in which changes can be validated (testing), and an environment in which we can be as sure as possible that data is reliable and can be depended upon (production).

So as such, we can design a system as follows: in each of our environments, we have a bronze, silver, and gold tier. It's clear that we don't want any production systems relying on our bronze raw data, as any new data that arrives needs to be cleaned and validated before it can be used. But as is the case for software development, anything that is end-user facing should be relying on the production environment.
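One way to picture that design is as an environment × tier grid, where every environment gets its own copy of all three tiers. The path convention below (and the `mylake` storage account name) is purely an illustrative assumption:

```python
ENVIRONMENTS = ["dev", "test", "prod"]
TIERS = ["bronze", "silver", "gold"]

def lake_path(env: str, tier: str) -> str:
    """Illustrative storage path for one tier within one environment."""
    assert env in ENVIRONMENTS and tier in TIERS
    return f"abfss://{env}@mylake.dfs.core.windows.net/{tier}"

# Each environment contains its own bronze, silver, and gold tier
layout = {env: [lake_path(env, t) for t in TIERS] for env in ENVIRONMENTS}
```

Separating the environments at the storage level like this is what lets engineers rework the dev silver and gold tiers without any risk of touching the data production systems depend on.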

The development environment is where engineers are currently working and is therefore volatile. These engineers will need to update things relating to the silver and gold tier and will need to do so without worrying about affecting these production systems. There could be necessary schema changes, column renames, bugs introduced accidentally in calculated columns, tables accidentally deleted, and much more.

It is worth noting that though the environment is functionally volatile, the data itself may be more tightly controlled. Data needs to be consistent in order to allow for development and testing, and test data might be created specifically to hit edge cases, or fake data used to restrict access to production data. Any changes that are made in the development environment are then validated in the testing environment, and once those quality gates have been passed, they're deployed in production.
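The idea of deliberately constructed, consistent development data can be sketched as a small generator. Everything here – the seed, the row shape, the particular edge case – is an assumption for illustration:

```python
import random

def make_test_rows(n: int, seed: int = 42):
    """Deterministic synthetic rows for the dev environment, including a
    deliberate edge case, so no production data needs to be exposed."""
    rng = random.Random(seed)  # fixed seed: dev/test data stays consistent
    rows = [{"id": i, "amount": round(rng.uniform(0, 100), 2)}
            for i in range(n)]
    rows.append({"id": n, "amount": "not-a-number"})  # edge case: bad value
    return rows
```

Because the generator is seeded, every run of the development and testing pipelines sees identical data, which is exactly the consistency the transcript calls for.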

And within the production environment, nothing should be depending on the data in the bronze tier, as this data is unvalidated and raw. But once we've cleaned and validated the data, both the silver and gold tiers should be production-ready. At this point, the data's cleaned, validated, and not subject to unpredictable changes as it is in the development and, to some extent, testing environments.

As I mentioned in my last video, it's often stated that data quality improves as you go through the medallion architecture. However, this statement is flawed. The data in the silver tier is no less production-ready than the gold tier. It is just an unopinionated representation of your data. It might be used to feed machine learning models and data science experiments, and the results of those may well be client-facing. Therefore, data in your silver tier in the production environment is very much production data and should be viewed as such.

So to answer our earlier question: when is data production-ready? When you're in your production environment, data is production-ready when it is in the silver or gold tier.
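That rule is simple enough to capture as a tiny predicate, shown here only to make the answer explicit:

```python
def is_production_ready(environment: str, tier: str) -> bool:
    """Data is production-ready only in the production environment,
    and only once it has moved past the raw bronze tier."""
    return environment == "prod" and tier in ("silver", "gold")

assert is_production_ready("prod", "silver")
assert is_production_ready("prod", "gold")
assert not is_production_ready("prod", "bronze")  # raw data is unvalidated
assert not is_production_ready("dev", "gold")     # dev is volatile
```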

Overall, the opinionation of our data increases as it moves through the medallion architecture. We start off with data in its completely unaltered raw form, and we end with data that is first cleaned and then structured for a specific use case.

Using medallion architecture, we can:

  • Land and then work with data in any format, whether it is structured, semi-structured, or unstructured
  • Impose gates that limit data quality issues
  • Support multiple use cases
  • Support different workload types, including reporting, machine learning, and data science
  • Support recreation of tables at any time
  • Support auditing and data versioning
  • Allow for historical playback, as you have each version of the raw data saved
  • Allow for greater agility where, for example, customer change requests can be dealt with by just updating the gold projection
  • Define data lineage and how data moves from source through processing into consumption
  • Allow for flexibility and security, with certain groups being given access to different projections with different levels of sensitivity
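The agility point in the list above – handling a customer change request by updating only the gold projection – can be sketched as follows. The silver data and the projection functions are hypothetical; the point is that silver stays untouched while the projection changes:

```python
# Silver: clean, unopinionated records (illustrative data)
silver = [
    {"city": "London", "amount": 10.5, "channel": "web"},
    {"city": "London", "amount": 4.0,  "channel": "store"},
    {"city": "Paris",  "amount": 7.0,  "channel": "web"},
]

def project_gold_v1(rows):
    """Original projection: totals per city."""
    out = {}
    for r in rows:
        out[r["city"]] = out.get(r["city"], 0.0) + r["amount"]
    return out

def project_gold_v2(rows):
    """After a change request: totals per (city, channel).
    Only the projection changed; the silver tier is untouched."""
    out = {}
    for r in rows:
        key = (r["city"], r["channel"])
        out[key] = out.get(key, 0.0) + r["amount"]
    return out
```

Because silver holds the unopinionated representation, new or revised gold projections can be rebuilt from it at any time – which is also what makes table recreation and historical playback possible.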

An important point is that there are no hard-and-fast rules for implementing this pattern. Though the medallion architecture provides us with useful guidance on how to structure our data solutions, there are a lot of nuanced decisions that need to be made. For example: where exactly to draw the lines between the different layers, how much validation is done at each stage, and how much processing is done in the gold storage layer versus in Power BI. Each of these requires a balancing of performance, flexibility, data copying and storage costs, data volumes, historical data support, security, regulatory requirements, and loads more.

Overall, the combination of this data design pattern and a multi-environment system in which data reliability increases as it's promoted through the environments provides a reliable and flexible architecture that can support many different scenarios.

That wraps up our look at the medallion architecture. We've covered the three tiers, the semantic layer, multi-environment deployment, and key advantages of this approach.

Thanks for watching. I hope you found this valuable.