(Original article on Medium)
The Medallion Architecture is a framework from Databricks that helps data teams manage expectations.

Close your eyes and imagine, for a moment, that you are the Head of Data at a biotech company. Yours is one of the most important positions in the company, especially in the data science and AI space. If the data is garbage, the models will be garbage. If the data is good, the models will be good, and the company will be successful. Your team makes the difference between the former case and the latter.
However, managing a data team is not an easy task. If things go well, you’ll be overwhelmed by users requesting new datasets, without the resources to deliver them all. If things do not go well, users will start downloading data onto their laptops, ignoring the pipelines you are building and bypassing all your efforts.
How do you deal with this situation as a data leader?
In my opinion, one of the first steps is to understand that two atavistic forces govern the needs of data users. The first is the need to make data available now, without making users wait for months while your team develops a data pipeline. If you spend too long developing the perfect pipeline, your users will start going around you, downloading what they need onto their laptops and forgetting about the data team. The second is the need for perfectly clean data and harmonized metadata. This is easier said than done, and it requires time and careful planning.
The Medallion Architecture from Databricks is a way to balance these two atavistic forces. Essentially, data is distributed across three catalogs, called Bronze, Silver and Gold. The Bronze catalog is the messy place where, by definition, data is made available quickly, but in raw form, without cleaning or curation. The Silver catalog is a mid-point between Bronze and Gold, where data is available in a reasonable form, after a reasonable investment of time from your team. The Gold catalog, by contrast, is the tidy place where clean, harmonized data products live, defined by user stories and use cases, and developed after months of investment.
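To make the three layers concrete, here is a minimal, illustrative Python sketch (not a Databricks API) of the medallion convention: each layer carries a different expectation, and tables are addressed by a three-part `layer.dataset.table` name. The layer descriptions and the `nature_seq_2024` dataset name are assumptions made for illustration.

```python
# Illustrative sketch of the medallion convention: each layer name
# carries an explicit expectation about how curated the data is.
LAYERS = {
    "bronze": "raw data, landed as-is; available fast, no curation",
    "silver": "lightly cleaned and aggregated; reasonable effort",
    "gold": "curated data products, built around use cases",
}

def table_path(layer: str, dataset: str, table: str) -> str:
    """Build a layer.dataset.table name following the medallion convention."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}.{dataset}.{table}"

print(table_path("bronze", "nature_seq_2024", "raw_counts"))
# bronze.nature_seq_2024.raw_counts
```

The point of the convention is that the layer name alone tells a user how much to trust the data, before they open a single table.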

Let’s look at an example.
A new paper is published in Nature, presenting a sequencing technique unlike anything seen before. Your data scientists are very excited and want to try the data as soon as possible. However, as the data team leader, you know it would take months to develop a perfect pipeline for storing this data in your data lake. What do you do?
The first step is to download the data from the paper and put it in a new folder in Bronze. You may add some minimal metadata to track where the data was downloaded from, but not much more than that. This way, the data scientists can access the data immediately, without having to wait for anything else. The data will not be curated, but your data scientists will not care at this point; they just want to try it out and use it for experimentation. If you are using Databricks, Unity Catalog will track any notebook and analysis that uses the data, so any result will still be reproducible. The important thing here is that the data scientists use the data from the catalog, instead of downloading it to a temporary location, keeping your data team in the loop.
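The Bronze landing step above can be sketched in a few lines. This is a hedged, stdlib-only illustration using the local filesystem in place of a data lake; the function name, file layout, and manifest fields are assumptions, not a Databricks API.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def land_in_bronze(raw_bytes: bytes, name: str, source_url: str,
                   root: str = "bronze") -> Path:
    """Drop a downloaded file into the Bronze area as-is, alongside a
    small provenance manifest recording where and when it came from."""
    folder = Path(root) / name
    folder.mkdir(parents=True, exist_ok=True)
    # The data itself: raw, uncleaned, exactly as downloaded.
    (folder / "data.raw").write_bytes(raw_bytes)
    # Minimal metadata: just enough to trace the dataset back to its source.
    manifest = {
        "source_url": source_url,
        "downloaded_at": datetime.now(timezone.utc).isoformat(),
    }
    (folder / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return folder
```

The deliberate lack of cleaning is the feature: the dataset is queryable within minutes, and the manifest is the one piece of curation that costs almost nothing now but saves hours later.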
After a few months, more papers are published using the same technique. So far, you’ve been downloading new data to Bronze, without curation. However, as more datasets accumulate, things become too messy. This is the right time to create some aggregate tables in Silver. For example, you could concatenate the gene counts (or any other fact data) from the raw Bronze datasets into one big table. You’ll likely want to add some metadata tables, for example a projects table containing the list of datasets, and others as relevant. Don’t spend too many resources on this; just build something sensible that makes your users happy.
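The Silver step above (concatenating fact data and adding a projects metadata table) can be sketched as follows. This is a toy illustration with plain Python structures standing in for tables; in practice this would be a Spark or SQL job, and the dataset and gene names are hypothetical.

```python
# Each Bronze dataset is a list of (gene, count) rows. Silver concatenates
# them into one fact table, tagging each row with its dataset of origin,
# and derives a small `projects` metadata table listing the datasets.

def build_silver(bronze: dict[str, list[tuple[str, int]]]):
    counts = [
        {"dataset": name, "gene": gene, "count": count}
        for name, rows in bronze.items()
        for gene, count in rows
    ]
    projects = [{"dataset": name, "n_rows": len(rows)}
                for name, rows in bronze.items()]
    return counts, projects

bronze = {
    "paper_a": [("TP53", 12), ("BRCA1", 7)],
    "paper_b": [("TP53", 30)],
}
counts, projects = build_silver(bronze)
```

Note what Silver does not do: no harmonization of gene identifiers across papers, no outlier handling. That level of curation is deliberately deferred to Gold, when a use case justifies the cost.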
After more time, a big initiative in your company focuses on using this new type of data for a specific goal. For example, the company wants to create a perfectly clean dataset for a drug discovery programme, integrating data from several modalities, including this one. A brave new Data Product Owner is assigned to the task, and they begin collecting user stories, designing data flows, outsourcing metadata curation services, and everything else a polished product needs. This product will live in Gold; it will take time to develop, but once ready, it will be perfect for its purpose.
This is what I like about the Medallion Architecture: it is a framework for managing expectations for data teams. Users will know that Bronze data is not curated, and they will be fine with that, because that is the definition of Bronze. On the other hand, users will know that if they want perfectly clean, harmonized data, they need to ask leaders to put resources into it, so that proper data products can be developed in Gold. Because these assumptions are intrinsic to the way the data is stored, data team leaders will face less pressure from the requests they receive.