Azure Machine Learning is a cloud platform for running machine learning workloads. But what exactly is it? Is it the same as Azure ML Studio (Classic)? And what can it offer?
Azure ML Studio (Classic) is the older tool. It offered some features similar to the new graphical model-training UI, but I only used it briefly, maybe once. Azure Machine Learning, by contrast, is a full-featured machine learning platform. It does just about everything you can fathom, and going through the development process I was pleasantly surprised by the *nice* features tucked inside. I've been working with Azure Machine Learning for the last six months and have documented the features I care about most.
Infrastructure We Don’t Want to Write
First and foremost, as a data scientist, I don't want to write infrastructure code, and I don't want my infrastructure or data engineering teams writing a bunch of hacked-together code either. When you run an AzureML pipeline, it runs in a pristine environment built just for you. It's like magic.
- Pipelines: run this code, then that code. Dependencies between steps are managed automatically, and you can add manual dependencies too. Set up input and output parameters (variables, datasets, etc.). Mix AzureML built-in steps with your own Python code.
- Docker container: creates a reusable image your code runs in. The image sticks around as long as your compute doesn't get recycled, so you don't have to recreate it for subsequent runs. Spinning up the Docker container from scratch can add 2-5 minutes to each run of your code.
- Package management: use pip, conda, or both for your packages. They get installed into your Docker container.
- Different Frameworks: Python, PySpark, TensorFlow, PyTorch
- Note: this makes it super easy to use Spark/PySpark for multi-node cluster jobs.
- Different Languages – Python or R
- Easily create new compute or switch the compute target in your pipeline – need 384GB of RAM for a couple of hours to see what kind of gains you get from not sampling your data? Go ahead and use it (it's only about $4.50 for an hour).
- Automatically spins it up and down as needed.
- Automatically scales up your nodes as needed.
- Different compute for different workloads (e.g., adjust RAM, storage capacity, etc. as needed)
- Logging: everything about your experiment is logged, and you can add custom logs. The logs are immediately available in the portal – you don't have to log into a VM and hunt for the log file (because no one documented where it is, or you don't have access, in this hypothetical situation where someone is running Python scripts on a Windows VM via the Windows Task Scheduler).
- Using Jupyter, the notebooks extension for the AzureML SDK lets you monitor a job from your notebook rather than having to go into the portal.
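As a rough sketch of how these pieces fit together, here's a minimal two-step pipeline in the v1 Python SDK (`azureml-sdk`): an environment with pip/conda packages baked into the Docker image, two script steps, and a submit. The compute name `cpu-cluster`, the scripts `prep.py`/`train.py`, and the experiment name are placeholders for your own, and this assumes a workspace `config.json` is present (it won't run without a live workspace).

```python
from azureml.core import Workspace, Environment, Experiment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()  # reads config.json for your workspace

# Package management: pip and conda deps get installed into the image.
env = Environment("training-env")
env.python.conda_dependencies = CondaDependencies.create(
    conda_packages=["scikit-learn"],
    pip_packages=["pandas"],
)

run_config = RunConfiguration()
run_config.environment = env

# Two steps; train runs after prep (a manual dependency).
prep = PythonScriptStep(name="prep", script_name="prep.py",
                        source_directory="./src",
                        compute_target="cpu-cluster",
                        runconfig=run_config)
train = PythonScriptStep(name="train", script_name="train.py",
                         source_directory="./src",
                         compute_target="cpu-cluster",
                         runconfig=run_config)
train.run_after(prep)

pipeline = Pipeline(workspace=ws, steps=[train])
run = Experiment(ws, "my-experiment").submit(pipeline)
```

The cluster spins up when the run starts and scales back down when it finishes, which is where the pay-for-what-you-use economics come from.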
Data & Model Versioning
In the simplest model-training pipeline, we'll at least register our training data and our model with AzureML.
- Since training data in Azure SQL or a blob gets truncated and reloaded (or maybe upserted), you'll never know what the data looked like when you trained a model without some versioning mechanism.
- Registering these items ties them to specific runs of your pipeline and to each other.
- At any time, you can go in and look at a specific version of a piece of data or download it to your machine to see what it looked like when it was used.
- Similarly, if you know a certain version of data is bad for some reason (say, a table used for multiple models), you can see which models are associated with it and trigger them to retrain once you fix the data.
- Just like the data, you can download a specific version of a model and test how it performs on some new data. You get the .pkl file, which you can download manually or consume via the SDK.
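Pulling a specific version back down via the v1 SDK looks roughly like this. The dataset name `training-data`, model name `churn-model`, and the version numbers are placeholders, and the snippet assumes a live workspace plus a tabular (not file) dataset.

```python
from azureml.core import Workspace, Dataset, Model

ws = Workspace.from_config()

# Load a specific version of a registered dataset to see exactly
# what the model saw at training time.
snapshot = Dataset.get_by_name(ws, name="training-data", version=3)
df = snapshot.to_pandas_dataframe()

# Download a specific model version (the .pkl) to test it locally.
model = Model(ws, name="churn-model", version=7)
model.download(target_dir="./downloaded", exist_ok=True)
```

Omitting `version=` gives you the latest registered version of either artifact.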
Model Training Logic
Some nice-to-have features:
- The visual editor – amazing for creating proofs of concept if you don't want to write code.
- A Python SDK – do everything in Python if you want. That's the route we took so we could stay semi-'platform agnostic', but it was a pretty big undertaking.
- Schedule a pipeline to run on any cadence (e.g., daily or weekly) without any external tools to orchestrate it.
- AutoML – an SDK module and visual-designer feature for trying out a bunch of hyperparameters, models, and ensembles to eke out the best possible performance.
- Can reuse steps. If a step didn't change (i.e., its inputs/outputs and script code), it won't re-run. That saves a bit of compute time, and dev time if you're debugging. It also keeps you from accumulating tons of different "versions" that are actually all the same, because it doesn't re-register models or data that are identical.
- With a few lines of code, you can output model explanations in your pipeline. The explanations are saved in the model's version history, so you can analyze how certain features interact across different versions of your model. These work for just about any model.
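The built-in scheduling bullet above can be sketched with the v1 SDK's `Schedule` API: publish a pipeline once, then attach a recurrence to it, no external orchestrator needed. The pipeline name `training-pipeline` and the other names are placeholders, and this assumes the pipeline was already published from a live workspace.

```python
from azureml.core import Workspace
from azureml.pipeline.core import PublishedPipeline, Schedule, ScheduleRecurrence

ws = Workspace.from_config()

# Assume a pipeline named "training-pipeline" has already been published.
published = next(p for p in PublishedPipeline.list(ws)
                 if p.name == "training-pipeline")

# Run it every day; AzureML handles the triggering itself.
recurrence = ScheduleRecurrence(frequency="Day", interval=1)
Schedule.create(ws, name="daily-training",
                pipeline_id=published.id,
                experiment_name="training",
                recurrence=recurrence)
```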
When a model is trained, there are a couple ways we might interact with it.
- Add a batch-prediction step to your pipeline, save the output to SQL or a blob, and consume the data from PowerBI, an analysis, or an app.
- Deploy your model as an API endpoint and securely get predictions from an app directly (e.g., a website or a desktop app).
- PowerBI Integration – if you're just consuming your model output in PowerBI, you don't even need to batch predict. You can connect PowerBI directly to the model and have it calculate the predicted value(s) automatically when your dataset refreshes every day.
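For the API-endpoint route, AzureML expects an entry script with an `init()`/`run()` contract: `init()` runs once when the service container starts, `run()` once per request. Here's a minimal sketch with a trivial stand-in scorer; in a real deployment you'd load your registered .pkl inside `init()` instead.

```python
import json

model = None  # populated once per container by init()

def init():
    """AzureML calls this once when the service container starts."""
    global model
    # A real deployment loads the registered model here, e.g.:
    #   model = joblib.load(os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model.pkl"))
    # Stubbed with a trivial row-sum "model" so the contract is visible.
    model = lambda rows: [sum(r) for r in rows]

def run(raw_data):
    """AzureML calls this once per scoring request; raw_data is the request body."""
    try:
        rows = json.loads(raw_data)["data"]
        return {"predictions": model(rows)}
    except Exception as exc:
        return {"error": str(exc)}
```

You hand this script to the deployment (via an inference configuration) and AzureML wraps it in the authenticated HTTP endpoint for you.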
Integrates with DevOps & Data Factory
- Trigger model training runs on a schedule using Data Factory.
- Trigger pipeline updates when code is checked in.
- Trigger something to happen when a new model is registered (like running a batch job or a PowerBI refresh).
- Register or Deploy a new model based on some other logic (like a model review task being completed).
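Under the hood, most of these integrations boil down to a published pipeline's REST endpoint: Data Factory or a DevOps release just POSTs to it. A hedged sketch of that call from Python (the endpoint URL is a placeholder copied from the pipeline's portal page; an unattended trigger would use a service principal rather than the interactive login shown here):

```python
import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Interactive login for illustration; automation would use
# ServicePrincipalAuthentication instead.
auth = InteractiveLoginAuthentication()
headers = auth.get_authentication_header()

# Placeholder: the REST endpoint shown on your published pipeline's page.
endpoint = "https://<region>.api.azureml.ms/pipelines/..."

resp = requests.post(endpoint, headers=headers,
                     json={"ExperimentName": "triggered-training"})
resp.raise_for_status()
print(resp.json().get("Id"))  # id of the run that was kicked off
```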
I have 1,500+ runs of my pipelines (from testing and actual development) in my instance, and it has never cost me more than $15/mo except when I left the Notebook VM running. There is a huge risk here – since the product is in preview, they could completely jack up the price and screw us later. Do I think it's more valuable than $15/mo? Yeah. Could I make a case to use it if it cost $1,000/mo? Probably. Would I want to? No.
A Few Limitations
Not everything is sunshine and rainbows; I hit some major annoyances while developing all of this.
- Pipelines can't run locally, so every time you submit your job, it can take 10-20 minutes to run.
- I was using Visual Studio Code to edit the pipeline, and the linter didn't work at all. I went through all the settings trying to make it work, and it just didn't. This made debugging extremely painful: I'd often submit a job, wait 20 minutes, and come back to find it had failed on a simple typo the linter should have caught.
- There's no built-in connector to on-premises data. We use the On-premises Data Gateway to connect Microsoft services – like PowerBI, Microsoft Flow, etc. – to our data, but this service doesn't support it. The suggested workaround was to use Azure Data Factory to move data and orchestrate the pipeline, which is just *another thing*. Also, I don't have access to create ADF resources in our Azure subscription, so there's the headache of getting the infrastructure team to add it – both getting their time in the first place and fielding all the questions: "why do you need this?" and "what's the disaster recovery plan?" and "what about geo-replication?" Blah, blah, blah...
- Moving data was a pain. I first tried Microsoft Power Automate (aka "Flow"). My training set for the project I was working on was only around 200,000 rows, and Flow couldn't move bulk data – it took about an hour to move 7,000 rows. Pretty insane. We haven't set up ADF yet, but hopefully it does a better job. In the meantime, I had to write a Python script to move the data manually.
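The core trick in that manual script was moving rows in bulk batches instead of one at a time (which is what made Flow so slow). Here's a stripped-down sketch; `bulk_insert` is a placeholder for whatever sink you use (e.g., a pyodbc `cursor.executemany` with `fast_executemany` enabled).

```python
def chunked(rows, size=5000):
    """Yield lists of up to `size` rows from any iterable source."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # leftover partial batch
        yield batch

def move(source_rows, bulk_insert, size=5000):
    """Push rows to the sink one batch at a time; returns rows moved."""
    moved = 0
    for batch in chunked(source_rows, size):
        bulk_insert(batch)  # e.g. cursor.executemany(INSERT_SQL, batch)
        moved += len(batch)
    return moved
```

One round trip per 5,000 rows instead of per row is the difference between minutes and hours at the 200,000-row scale mentioned above.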