Machine Learning REPA:
Reproducibility, Experiments and Pipelines Automation

Our goal is to help participants apply good engineering practices in Machine Learning projects

Networking ML REPA # Day 1
September 19, 2020
Machine Learning REPA in Action /RUS
Mikhail Rozhkov, Creator ML REPA
Lessons learned while trying to implement ML REPA ideas in day-to-day work
Talk video
17:00 – 17:40 (Moscow time)
Software engineering life hacks for Data Science /RUS
Julia Antokhina, DS at Mobile TeleSystems
Cool stories about the engineering part of DS projects and simple how-tos to improve your engineering skills. Agenda: Clean Code (both Notebooks and Py), Linters 101, Refactoring and Patterns, Code Review - Practical Issues.
Target audience: junior/middle DS
Talk video
17:40 – 18:20 (Moscow time)
Reproducibility is not a pain anymore!! /RUS
Mikhail Maryufich, ML engineer at OK.ru
There's more to productionizing a model than packing it into a runnable container. Saving the entire training process is also a must. By keeping the model creation process reproducible, we guarantee that experimental results and knowledge won't be lost. At OK.ru, we developed a formal procedure whereby every model is trained in its own controlled and reproducible environment and then manually reviewed before being put into production.
Talk video
18:20 – 19:00 (Moscow time)
DataOps & ML automation with DVC /ENG
Dmitry Petrov, Creator of DVC - Data Version Control - Git for machine learning. Now co-founder & CEO of Iterative.ai. Ex-Data Scientist at Microsoft. PhD in Computer Science.
Recently, automation in software development has reached an unprecedented level of adoption. Infrastructure as code (IaC) principles, supported by tools like Terraform, Puppet, and Ansible, have become a mainstream approach to automating software building, testing, and deployment. In contrast, data science and machine learning projects frequently involve many manual steps, including data transfer and processing, model training and evaluation, and provisioning resources like cloud compute and storage. Each manual step lowers the overall reproducibility of a project and creates another hurdle to productionizing a project.
Talk video
19:00 – 19:40 (Moscow time)
Software engineering life hacks for Data Science /ENG
Julia Antokhina, DS at Mobile TeleSystems
Cool stories about the engineering part of DS projects and simple how-tos to improve your engineering skills. Agenda: Clean Code (both Notebooks and Py), Linters 101, Refactoring and Patterns, Code Review - Practical Issues.
Target audience: junior/middle DS
Talk video
19:40 – 20:20 (Moscow time)
Networking ML REPA # Day 2
September 20, 2020
Comparison of frameworks for pipelines automation: Airflow vs. Prefect
Roman Tezikov, Head of AI at Optia.ai & Co-creator ML REPA
Now that pipelines are everywhere, describing DAG computations as code or config is no longer surprising. A pipeline defined this way can be versioned and worked on as a team. But which framework should you choose? Historically Airflow has dominated, but is it really that good? What are the benefits of the more modern Prefect? We will analyze all this in the talk.
Talk video
17:20 – 18:00 (Moscow time)
Testing for Data Science Hands-on Guide /RUS
Julia Antokhina, DS at Mobile TeleSystems
We will go through the motivation for testing in DS and the common ways data scientists test their code. Then there will be a gentle introduction to the PyTest framework. While PyTest is enough for many cases, Hypothesis for property-based testing will also be mentioned.
Talk video
18:00 – 18:40 (Moscow time)
Model Lifecycle Management System /RUS
Ksenia Melnikova, Project Manager of AI Projects at Samsung Research Russia.
With the development of AI technology in all industries, the problem of reproducibility has become highly important: it sharply affects the timing and cost of development. Reproducibility also affects the quality and performance of the final AI model, which is supposed to be deployed to production. This problem arises from a lack of development process organization, unstructured workflow, and sometimes even poor information sharing. To find a solution, we will discuss the common Model Lifecycle by going through the AI development process step by step, define what a Model Lifecycle Management System is, and show how it can help a data scientist control the results and make a model reproducible.
Talk video
18:40 – 19:20 (Moscow time)
DataOps/MLOps with DVC /ENG
Marcel Ribeiro-Dantas, Researcher at Institut Curie and a Ph.D. Student at Sorbonne Université, EDITE
DVC is an open source Version Control System for Machine Learning Projects. Along with git, DVC allows you to version your source code, your datasets, and even your pipelines. The goal of this talk is to go through DVC showing how this tool and the workflow it brings can make your life as a Data Scientist or Machine Learning Engineer much easier
Talk video
19:20 – 20:00 (Moscow time)
Adapting continuous integration to machine learning /ENG
Elle O'Brien, Data Scientist at Iterative, Inc. (the team behind DVC)
Continuous integration is a key practice from DevOps that has been demonstrated to speed up development cycles, but hasn't yet found widespread adoption in data science and machine learning projects. Yet continuous integration has many potential applications to machine learning, particularly in model training and testing in production-like environments
Talk video
20:00 – 20:40 (Moscow time)
Networking ML REPA # Day 3
September 21, 2020
How I Stop Worrying and Love the Standardization /RUS
Sviridenko Nadezda, Senior Data Scientist, Raiffeisenbank
Sooner or later, every Data Science team faces a large pile of routine tasks. Typically, DS work involves building a range of similar models on the very same client data. Designing your first model is great fun, the third is just fun, and the twentieth makes you think about existential problems.

There is no doubt that several ready-made solutions for automating monotonous fit-predict work already exist; however, none of them meets all our demands completely. So we had no choice but to design and implement yet another framework (called Siegfried, since Raiffeisen is a Viennese bank) with Bamboo, Spark and Airflow. The framework lets us design, compare, test, and deploy ML models within two weeks and also increases the bus factor from 1 to 1.42.
Talk video
18:40 – 19:20 (Moscow time)
The day after deployment: how to set up your model monitoring /ENG
Emeli Dral, Co-founder and Chief Technology Officer at Evidently AI, a startup developing tools to analyse and monitor the performance of machine learning models
The experience of running production machine learning systems is still limited. Best practices are yet to be shaped, and many operational aspects become an afterthought. One of the crucial but often overlooked components is model monitoring
Talk video
19:20 – 20:00 (Moscow time)
Testing for Data Science Hands-on Guide /ENG
Julia Antokhina, DS at Mobile TeleSystems
We will go through the motivation for testing in DS and the common ways data scientists test their code. Then there will be a gentle introduction to the PyTest framework. While PyTest is enough for many cases, Hypothesis for property-based testing will also be mentioned.
Talk video
20:00 – 20:40 (Moscow time)
Model Lifecycle Management System /ENG
Ksenia Melnikova, Project Manager of AI Projects at Samsung Research Russia.
With the development of AI technology in all industries, the problem of reproducibility has become highly important: it sharply affects the timing and cost of development. Reproducibility also affects the quality and performance of the final AI model, which is supposed to be deployed to production. This problem arises from a lack of development process organization, unstructured workflow, and sometimes even poor information sharing. To find a solution, we will discuss the common Model Lifecycle by going through the AI development process step by step, define what a Model Lifecycle Management System is, and show how it can help a data scientist control the results and make a model reproducible.
Talk video
20:40 – 21:20 (Moscow time)
Registration
The detailed program, schedule, and registration are available on the Data Fest 2020 website
Register to participate
email: info@ml-repa.ru
telegram: t.me/mlrepa
See you at the ML REPA Fest!