Real world Software Engineering is an iterative process and one of its main objectives is to get changes all of types - including new features, configuration changes, bug fixes and experiments into production and into the hands of the users, safely, quickly and in a sustainable way. Continuous Delivery (CD), a software engineering discipline, with its principled approach allows you to solve this exact problem. The core idea of CD is to create a repeatable, reliable and incrementally improving process for taking software from concept to the end user. Like software development, building real world machine learning (ML) algorithms is an also an iterative process with a similar objective - How do I get my ML algorithms into production and in the hands of the users in a safe, quick and sustainable way. The current process of building models, testing and deploying them into production is at best an ad-hoc process in most companies. At Indix, while building the Google of Products, we have had some good success in combining the best practices of continuous delivery in building our machine learning pipelines using open source tools and frameworks. The talk will not focus on the theory of ML or about choosing the right ML algorithm but specifically on the last mile problem of taking models to production and the lessons learned while applying the concept of CD to ML.. Here are some of the key questions that the talk with try to answer. 1. ML Models Repository as analogous to Software Artifacts Repository - Similar to a software repository, what are the features of a Models Repository to aid traceability and reproducibility? Specifically, how do you manage models end to end - managing model metadata, visualization and lineage etc? 2. ML Pipelines to orchestrate and visualize the end to end flow - A typical ML workflow has multiple stages. How do you model your entire workflow as a pipeline (similar to Build Pipeline in CD) to automate the entire process and help visualize the entire end to end flow? 3. Model Quality Assurance - What quality gates and evaluation metrics, either manual and automated, should be used before exporting (promoting) models for serving in production? What happens when several different models are in play? How do you measure the models individually and then also in combination 4. Serving Models in Production - How do you serve and scale these models in production? What happens when these models are heterogenous (built using different languages - Scala, Python etc.)?
This talk is about why and how we built our internal data pipeline platform. At Indix we have data in different formats - html pages, thrift records, avro records and the usual culprits - CSVs and other plain text formats. We have data in TBs and in a few KBs and data consisting of billions of records and data consisting of a few hundred rows. And all this data - in one form or another - is consumed by the engineers, the product managers, the customer success team and even our CEO. Our biggest challenge was in knowing which data exists and where, and how to access it efficiently while balancing costs and productivity of the people involved. We had to make do with adhoc Scalding jobs. There was no single place where people can discover the different 'datasets' that we had, what format they were in, where they were stored and how frequently a new version was published. Running jobs was also not straightforward since things like finding a cluster to use were not trivial. In order to democratize the access to data and make it easy for anyone within the organization to work and play with the data we had, we went about building a data pipeline platform for our internal users. Leveraging the power of Spark, the platform allows the users to define datasets (along with their schema) and create pipelines to work with the datasets. The pipelines can be configured via a wizard based UI or a JSON config and all the jobs run on dedicated and auto scaled Spark clusters. Predefined transformations to filter, project, sample and even type in sql queries have made it powerful but simple to use for any type of user. Support for S3, Sftp and even Google sheets made it usable for different internal and customer use cases. The platform also enables us to load the same data and perform similar operations on them via notebooks with just couple of lines of client code. Today we run over 300 pipelines across over 100 datasets and thousands of versions of the datasets using this platform. The data pipeline platform has truly changed the way we ingest, manipulate, analyze and egress data across the organization, and is on course to be converted into a self-serve platform for our (external) customers too.
At Indix we collect and process lots of data. Most of our processing initially were done as MapReduce jobs but as our data grew in size we moved towards stream processing. We monitor the behaviour of our systems through collection of business metrics. It was relatively easy to write Stats jobs on our MapReduce output but things got tricky when we moved to Stream based processing.
Our key learnings over the years have been
We put all our learnings and built a system called Abel which solved this for us. It aggregates a million events in ~15 minutes on a single box.
Networking & Dinner
Geek Night is a monthly event to promote sharing of technical knowledge and increase collaboration between geeks in Chennai. It is organized by a passionate group of programmers and sponsored by ThoughtWorks.
It happens on the Last Thursday of every month, unless that's a public holiday or any other unavoidable cause, like an Alien invasion.
We love feedback! If you have any suggestions or cribs, feel free to fill out our feedback form. Don't worry, its completely anonymous.Geek Night Volunteers