
Quick summary: Machine Learning Operations (MLOps) ensures quality of machine learning systems over time and reduces lead time for moving a model into production.

Your data science team has collected thorough requirements for a pressing business problem, collected and cleaned all the required data, and fit an effective, validated model that is ready to provide real value to real business users.

Mission accomplished, right?

Not so fast. First, you need to figure out how to get the model into the hands of the business users and how to assure the quality of its performance over the long term. Your team needs to consider the following:

  • “How will my model perform on new, unseen data?”
  • “Is my model training process reproducible?”
  • “Can my model scale to the needs of the business?”
  • “If changes are required, how long will it take to make those changes in production?”

It turns out this is a difficult problem, and one that has not improved over time. Rexer Analytics, which regularly surveys the data science industry, reported in its 2017 survey that only 13 percent of data scientists say their models always get deployed, a figure that has not improved since 2009, when the question first appeared on the survey.

One contributor to this issue is that machine learning production systems have a variety of moving parts outside of pure modeling code. These include data collection and processing code, environment configuration, process management code, and monitoring code, to name a few. Complexity creates opportunity for technical debt and can increase lead time for both changes and deployment. Algorithmia reports that among companies that deploy models, half of respondents spend between 8 and 90 days deploying a model, and 18 percent take longer than 90 days. ML practitioners also named scaling up and versioning/reproducibility of their models as the two largest challenges their organizations face.

How can this be improved? By borrowing aspects of DevOps. There has been a recent rise in Machine Learning Operations (MLOps), a set of guiding principles focused on automation, collaboration, reproducibility, monitoring, and effective model scaling. The overall goal is ensuring quality of machine learning systems over time and reducing lead time for moving a model into production.

Automation (CI/CD)

Training and deploying a model is a multi-step process that is often treated as a one-off task. Instead, training and deployment should be contained in an automated pipeline that can be triggered not only by code changes but also to retrain on new data, either periodically or when a drift metric computed on recent data exceeds an established threshold (more on drift metrics below).
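
The three triggers described above (code change, schedule, drift) can be sketched as a small decision function. This is a minimal illustration rather than any particular CI/CD tool's API: `PipelineState`, `retrain_reason`, the 30-day maximum age, and the 0.2 drift threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PipelineState:
    """Hypothetical snapshot of what the pipeline knows at decision time."""
    code_changed: bool      # e.g. a new commit landed on the main branch
    last_trained: datetime  # when the production model was last trained
    drift_score: float      # latest drift metric computed on recent data


def retrain_reason(state: PipelineState,
                   now: datetime,
                   max_age: timedelta = timedelta(days=30),  # assumed schedule
                   drift_threshold: float = 0.2              # assumed threshold
                   ) -> Optional[str]:
    """Return the name of the trigger that fired, or None if none did."""
    if state.code_changed:
        return "code_change"
    if now - state.last_trained >= max_age:
        return "schedule"
    if state.drift_score > drift_threshold:
        return "drift"
    return None
```

In practice a scheduler or CI job would call something like `retrain_reason` and start the training pipeline whenever the result is not `None`, recording the reason alongside the resulting model.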

By introducing CI/CD and operationalizing model training and monitoring, MLOps provides repeatable, consistent mechanisms for moving models to the target environment, and it opens additional opportunities to incorporate automated integration testing and parity across multiple environments.

Reproducibility—Data and Model Versioning

It’s generally expected in software development environments that a version control system such as Git or SVN tracks code and configuration artifacts. Data science workflows add the complexity of producing reproducible results (it is data science, after all). This includes being able to retrieve a previous version of the training and test datasets to reproduce a model's results.

Data versioning

With the advent of cheap, persistent storage (such as AWS S3), versioning small to medium datasets becomes straightforward with modern tools such as Data Version Control, Metaflow, and SageMaker. In some workflows, such as SageMaker, this is done automatically alongside model training. For big data problems, a common design pattern is repeatable splitting based on a numeric hash of each record rather than on random value generation.
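
As a rough sketch of the hash-based splitting pattern, the function below assigns a record to the train or test set from a hash of a stable key, so the same record always lands in the same split regardless of row order or random seed. The function name `assign_split`, the choice of SHA-256, and the 20 percent test share are assumptions made for illustration.

```python
import hashlib


def assign_split(record_key: str, test_percent: int = 20) -> str:
    """Deterministically map a stable record key to "train" or "test"."""
    # Hash the key and reduce it to a bucket in the range 0..99.
    digest = hashlib.sha256(record_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    # Buckets below the cutoff go to the test set; the rest go to training.
    return "test" if bucket < test_percent else "train"
```

The key should be something that identifies the entity being modeled (a customer ID, for example) so that related rows do not leak across the split.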

Making the data assets used for model generation shareable and discoverable across a team has an invaluable cultural impact. It encourages an emphasis on reproducibility and lets the team reuse existing work rather than reinvent the wheel when developing new models.

Model versioning

Models trained through the ML pipeline should be saved into a model registry where experiments can be tracked over time. This metadata store should contain the pipeline parameter configuration and high-level metrics on experiment results. A common workflow for storing model artifacts and metadata makes it easier to decide which models should be moved into production environments.
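
To make the idea concrete, here is a minimal in-memory sketch of what a registry entry might hold and how it could support a promotion decision. It is not the API of any real registry product: `ModelRecord`, `ModelRegistry`, and their fields are hypothetical, and a real registry would persist records and artifacts rather than keep them in a list.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ModelRecord:
    version: int
    params: Dict[str, Any]     # pipeline parameter configuration
    metrics: Dict[str, float]  # high-level experiment metrics
    artifact_path: str         # where the serialized model is stored


@dataclass
class ModelRegistry:
    records: List[ModelRecord] = field(default_factory=list)

    def register(self, params: Dict[str, Any], metrics: Dict[str, float],
                 artifact_path: str) -> ModelRecord:
        """Store a new experiment under the next version number."""
        record = ModelRecord(len(self.records) + 1, dict(params),
                             dict(metrics), artifact_path)
        self.records.append(record)
        return record

    def best(self, metric: str) -> ModelRecord:
        """Return the record with the highest value for a metric."""
        return max(self.records, key=lambda r: r.metrics[metric])
```

A call such as `registry.best("auc")` then gives the team a single, shared answer to the question of which candidate to promote, assuming higher is better for the chosen metric.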

Data drift

The cross-validation workflow in machine learning is effective for getting an idea of how a model will generalize to unseen data. The complication with hosting a model in production is that it is impossible to know in advance the future and emerging trends in the data that can affect how the model behaves.

Data drift metrics such as the population stability index give insight into how recent periods of incoming data compare to the historical data used to train the original model. Model drift is also worth watching. Retraining on new data and comparing feature importance, for example with permutation importance, can help the team understand how the model reacts to new data, whether the key drivers in the original model are still relevant, and whether there is some indication of a systematic change in the data. Drift metrics like these can be used to trigger a model retraining workflow that produces candidates for new production models.
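
The population stability index for a single numeric feature can be sketched as follows. It sums (actual share - expected share) * ln(actual share / expected share) over bins. The quantile-based binning on the training data, the default of 10 bins, and the small `eps` floor that avoids division by zero are assumptions of this sketch, and the function name is hypothetical.

```python
import numpy as np


def population_stability_index(expected, actual, bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI between a baseline sample (expected) and a recent sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior cut points are quantiles of the baseline, so each bin holds
    # roughly the same share of the baseline data.
    cuts = np.quantile(expected, np.linspace(0, 1, bins + 1)[1:-1])
    # searchsorted maps each value to a bin index in the range 0..bins-1.
    e_counts = np.bincount(np.searchsorted(cuts, expected), minlength=bins)
    a_counts = np.bincount(np.searchsorted(cuts, actual), minlength=bins)
    # Floor the shares at eps so empty bins do not produce log(0).
    e_pct = np.clip(e_counts / expected.size, eps, None)
    a_pct = np.clip(a_counts / actual.size, eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A value near zero means the recent data is distributed much like the training data, and larger values indicate a bigger shift. The cutoff at which to trigger retraining is a choice each team has to make for its own data.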

Scaling

Pipeline design and deployment architecture need to include proper scaling considerations. Scaling issues can occur in data collection, preprocessing, model training, or deployment, and the available tool sets vary across IT organizations and cloud platforms. Modern cloud providers offer solutions for distributed training of compute-intensive deep neural nets and for scalable deployment, as either an API call or a batch process. There is a tradeoff between opting for a managed hosting solution and building one internally, and the many considerations involved in designing the architecture that works best for your solution are out of scope for this article.


Adopting machine learning comes with challenges that can make it difficult to implement ML solutions in a production environment. Fitting effective models is important, but investing in an MLOps strategy empowers data scientists to experiment and move models into production faster, knowing that the models they create will scale to meet the demands of the business.


Paul Lee

Alex Johnson is a Senior Developer in Logic20/20’s Advanced Analytics practice.