MLFlow for Bioinformatics – the missing piece for reproducibility of results?

(Story originally published on Medium)

MLflow is a tool for managing the development of Machine Learning models, from early experiments to release in production. It is also a great tool for managing bioinformatics analyses, making it easier to test parameters and options and to reproduce results. However, in my opinion, not many Bioinformaticians are aware of this tool and its advantages. In this article, I will show an example of using MLflow to track a differential expression analysis and demonstrate how it can be the missing piece to ensure reproducibility.

Example of using MLflow for tracking a Differential Expression analysis. The last two columns show two parameters I tested: lfcShrink (whether to apply LFC Shrinkage or not), and refit_cooks (whether to filter out potential outliers, based on Cook’s distance). The “Metrics” columns track some metrics I computed for every analysis, such as the number of significant genes, the average LogFC, and so on. The first columns provide info on the input dataset, the notebook, and when it was executed.

What is ML Flow?

MLFlow is a platform for managing the life cycle of a Machine Learning model. It keeps track of all the experiments done while training a Model, as data scientists try different parameters and algorithms to improve their predictions.

Suppose you are a Machine Learning scientist, tasked with developing a model to predict a specific variable. There are so many tools you can try (random forest, XGBoost, etc.), and for each of these there are so many parameters to test. How do you keep track of all these options and choices?

The answer is MLflow. This tool lets us record information on parameters, inputs, and metrics every time we run an experiment.

The following screenshot is taken from the MLflow documentation page. We can see a list of TensorFlow models, trained using different values of the parameters lr and momentum (in the last two columns of the table). For each run, the metric test_rmse is computed, which makes it quick to determine which run has the best results.

A screenshot of MLflow from its documentation page. Each row represents an experiment using TensorFlow. The last two columns show the values of the parameters (lr and momentum) used for every run. The third-to-last column shows the test RMSE for each run, making it possible to determine which run has the best results.

MLflow is much more than a registry of parameters and metrics. It lets us manage the whole life cycle of a model: developing it, tagging specific versions, and pushing it to production. These are all valid applications in Bioinformatics as well, but for this article I’ll just explain the basic usage of the Registry.

How can ML Flow be used for Bioinformatics?

Despite being developed to track Machine Learning experiments, ML Flow can be used to track any computational analysis. We don’t need to use scikit-learn or train Machine Learning models to use it. We can track anything that has parameters, and produces results. Here, I’m going to show how to use it for a Differential Expression analysis made with the PyDESeq2 Python Package.

Initializing a ML Flow run

The first step is to import the library and initialize an experiment. The code below may seem verbose, but it can be copied and pasted into all notebooks. In short, we define the Experiment name and connect to it. If the experiment does not exist, we create it.

In the second cell, we start the ML Flow run. This will create a new row in the ML Flow registry for this experiment. The row will be empty, but we will add parameters and metrics later in the code.

Tracking Parameters

For this example, I am following the PyDESeq2 tutorial, almost verbatim.

I’ve added two parameters to show how to keep track of them using mlflow. The first parameter is called “refit_cooks”, and determines whether we want to deal with outliers, based on Cook’s distance. This is a common step in a Differential Expression analysis, to limit the influence of counts that may be unreliable for technical reasons. The second parameter is lfcShrink, which determines whether to apply LFC Shrinkage to the data: a technique that moderates noisy fold-change estimates in the results.

The mlflow code to track parameters is quite simple. We run mlflow.log_params, and provide a dictionary of parameters we want to track. Assuming we have executed mlflow.start_run() earlier in the code, these parameters will be added to the current run.

Tracking Parameters in ML Flow

Tracking Input Data

Another useful thing to track when running a Differential Expression analysis is the input data.

If the dataset is small, we can track it using the mlflow.log_input() function as shown below. Otherwise, if the data is bigger, we can store it in a table and provide the location. If the data is complex to represent as a table, we can also store it as an Artifact: a file that is stored in the run and can be accessed later. It all depends on how big the input data is, and how much space you want to reserve for the run data.

Tracking Input datasets in MLflow

Tracking Results and Metrics

I’m not going to paste the code for this Differential Expression analysis, because it is taken verbatim from the PyDESeq2 tutorial. Let’s assume we have completed our analysis, and obtained a dataframe of Genes that are differentially expressed in our comparison.

The results of our Differential Expression analysis, computed using the PyDESeq2 tutorial. The padj column tells us which genes are differentially expressed in cases vs control, and the log2FoldChange shows the magnitude of the change, and its sign.

Now that we have computed this table, we may want to keep track of some metrics, to determine whether our results are valid or not.

For a Machine Learning experiment, we could track accuracy, recall, MAE, RMSE, and other metrics.

In the case of a Differential Expression analysis, we don’t really have the equivalent of these metrics, but we can create others as we may see fit. For example, here I am computing the ones in the screenshot below:

Some of these metrics are self-explanatory: for example, n_significant_genes shows the number of genes that are significant in the results dataset.

Other metrics are based on the Biology. Here I’ve computed a metric called “is_gene3_significant”, to determine whether Gene3 is differentially expressed between cases and controls. This may reflect a hypothesis I have about the data: for example, this could be an experiment where Gene3 has been knocked out using gene editing or other means. As another example, we may know from the literature that Gene3 is always expressed in the group of patients we are testing, and we may want to check whether that is the case.

There are endless possibilities when tracking Metrics from an experiment. MLflow provides an efficient way to log these and navigate them across different runs.

The results

This screenshot shows what got recorded in MLflow when running this analysis.

The last two columns show the parameters used for every analysis. These are my original parameters lfcShrink and refit_cooks, which I defined earlier in my code.

The other columns on the right show the metrics from every run. For example, you can see that when lfcShrink is set to True, the average Log2FC is lower. This is expected, as LFC Shrinkage is designed exactly for that purpose.

The columns on the left show information on the runs, like the input data, the time when the experiment was run, and so on.

What’s next?

This article described how ML Flow can be used to keep track of Bioinformatics tasks, such as Differential Expression.

In the past, I’ve tried many approaches to ensure the reproducibility of my analyses, such as Jupyter notebooks stored in git. However, I think MLflow is the ultimate answer, especially if integrated with environments like Databricks, as in these examples.

There are countless ways to design parameters and metrics for tracking the results of a differential expression analysis, and of other analyses in Bioinformatics. Hopefully this article gives you a good primer, and helps you make your work more reproducible and efficient.
