This is the second episode of our exploration of “no-code” machine learning. In our first article, we laid out our problem set and discussed the data we’ll use to test whether a highly automated ML tool designed for business analysts can deliver cost-effective results approaching the quality of more code-intensive methods that involve a bit of human-driven data science.
If you haven’t read that article, go back and at least skim it. If you’re all set, let’s review what we’d do with our heart attack data under “normal” (i.e., the most code-intensive) machine learning conditions, and then throw all that out and hit the “easy” button.
As previously discussed, we’re working with a set of heart health data drawn from a study at the Cleveland Clinic and the Hungarian Institute of Cardiology in Budapest (plus data from other places that we’ve discarded for quality reasons). All of this data is available in a repository we created on GitHub, but its original form is part of a repository of data maintained for machine learning projects by the University of California-Irvine. We’re using two versions of the data set: a smaller, more complete one consisting of 303 patient records from the Cleveland Clinic, and a larger database (597 patients) that incorporates the Hungarian Institute data but is missing two types of data from the smaller set.
The two fields missing from the Hungarian data may prove important, but the Cleveland Clinic data on its own may be too small for some ML applications, so we’ll try to cover our bases.
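For anyone following along in code, here’s a minimal sketch of loading the smaller data set with pandas. The column names are the standard ones used in the processed UCI heart-disease files; the file path is a placeholder for wherever you’ve stored your own copy.

```python
import pandas as pd

# Column names from the processed UCI heart-disease files; the file path
# below is hypothetical -- substitute the location of your own copy.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num",
]

df = pd.read_csv("processed.cleveland.data", names=COLUMNS, na_values="?")

# The "num" field grades disease severity from 0 to 4; most published work
# on this data collapses it to a binary "heart disease present?" label.
df["target"] = (df["num"] > 0).astype(int)
df = df.drop(columns=["num"]).dropna()

print(df.shape)  # roughly (297, 14) once incomplete records are dropped
```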
The plan
With multiple data sets on hand for training and testing, it’s time to start grinding. If we were doing this the way data scientists usually do (and the way we tried last year), we would follow the steps below (sketched in code after the list):
- Divide the data into a training set and a test set
- Use the training data with an existing algorithm type to build a model
- Check the model with the test set to verify its accuracy
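Here’s what those three steps might look like in scikit-learn, picking up the `df` DataFrame built above. The random forest is just one reasonable stand-in, not the specific model from last year’s series; the point is the split/fit/score loop.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Step 1: divide the data into a training set and a test set
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: use the training data with an existing algorithm type
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Step 3: check the model against the held-out test set
print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```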
We could do all of this by coding it in a Jupyter notebook and tweaking the model until we achieve acceptable accuracy (as we did last year, in a perpetual cycle). But instead, we’re first going to try two different approaches:
- A “no-code” approach with AWS’s SageMaker Canvas: Canvas takes the data as a whole, automatically splits it into training and testing sets, and generates a predictive algorithm
- A “low-code” approach using SageMaker Studio’s Jumpstart and AutoML: AutoML is much of what’s behind Canvas; it evaluates the data and tries a number of different algorithm types to determine which works best (see the sketch after this list)
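The Canvas and Studio runs in this series happen in the console UI, but the same AutoML machinery can be driven from the SageMaker Python SDK. A rough sketch follows, assuming you’re running inside a SageMaker environment and have already uploaded training data to an S3 bucket of your own (the bucket path below is a placeholder).

```python
import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # assumes a SageMaker notebook/Studio session

automl = AutoML(
    role=role,
    target_attribute_name="target",           # the label column in our CSV
    problem_type="BinaryClassification",
    job_objective={"MetricName": "Accuracy"},
    max_candidates=10,                        # cap candidate models (and the bill)
    sagemaker_session=session,
)

# Kicks off an AutoML job against training data already uploaded to S3.
automl.fit(inputs="s3://your-bucket/heart-data/train.csv", wait=False)
```

Capping `max_candidates` is the main lever for keeping compute costs in check, which matters given how our last AutoML run chewed through credits.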
Once that’s done, we’ll take a swing with one of the many battle-tested machine learning methods that data scientists have already tried against this data set, some of which have claimed over 90 percent accuracy.
The end product of all these methods should be an algorithm we can use to run a predictive query on a given set of data points. But the real output will be a look at the trade-offs of each approach in terms of time to completion, accuracy, and cost of compute time. (In our last test, AutoML alone practically blew through our entire AWS account credit balance.)