What do you get excited about? If you have a life, the answer probably isn’t machine learning. It’s arcane, sciency and downright neckbeard-y. Worst of all, it can be hard to see why you should care. There is no 20-minute tech talk you can watch, drop into your favorite editor and be on your way to winning the Netflix Prize. The value proposition can be a tough sell.
As it turns out, it’s pretty bloody exciting to me. If you’re still here, maybe I’ve got a shot at convincing you. Read on, brave adventurer!
Greybeards and Robot Brains
When a wizened greybeard speaketh, you listen (if you know what’s good for you). Why? Because experience is wisdom’s currency – there’s no substitute. What does the experience acquisition process look like? Roughly: encounter a novel situation, make your best call, observe the outcome, and adjust accordingly – then repeat for a few decades.
Now, imagine that for a (very specific) problem, you can build a brain out of code that runs through this loop much, much faster than any human can. With enough novel situations, you can build a pretty formidable greybeard. In fact, this is usually the limiting factor on how good your model can get (almost always, the easiest way to improve the performance of your ML-based predictive models is to throw more data at them).
In practice, Machine Learning (from now on ML) is about building software that can make an educated guess about novel situations based on situations it has seen before. ML-based predictive models break down into two buckets, depending on your use case: classification and regression. The former is for when you need to assign cases to one of a set of discrete categories, e.g. chemically analysing wine to predict which grape variety it’s made from. The latter is for assigning a continuous value to a case, e.g. predicting the age of abalone from physical measurements.
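To make the two buckets concrete, here’s a toy sketch in Python: the same (entirely made-up) wine measurements fed to a hypothetical 1-nearest-neighbour predictor, once to get a class label and once to get a continuous value. Everything here – data, varieties, ages – is invented for illustration:

```python
# Classification vs. regression with a toy 1-nearest-neighbour predictor.
# All wine data below is invented for illustration.

def nearest(train, x):
    """Return the training row whose features are closest to x."""
    return min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))

# Each row: ((alcohol %, acidity), label)
classification_set = [((13.0, 3.1), "pinot noir"), ((11.5, 3.4), "riesling")]
regression_set = [((13.0, 3.1), 8.0), ((11.5, 3.4), 4.0)]  # label = age in years

print(nearest(classification_set, (12.8, 3.2))[1])  # a grape variety (a class)
print(nearest(regression_set, (12.8, 3.2))[1])      # an age (a continuous value)
```

Same machinery, different kind of answer – that’s really the whole distinction.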
In this sense, feature loosely translates as a primary or derived attribute of the system being modelled that is potentially useful in predicting the outcome we care about. For example, if you’re trying to predict the outcome of US Open tennis matches, a player’s first-serve ace percentage might make a good feature. Most ML prediction algorithms provide ways to score your features on how much they contribute to the final prediction, so the first step is usually to gather as many as possible and prune them later.
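One simple scheme for scoring features (a rough cousin of permutation importance) is to corrupt one feature column and watch how far accuracy falls. The sketch below stands in a deterministic rotation for the usual random shuffle so the example stays reproducible; the model and the tennis numbers are invented placeholders, not real US Open stats:

```python
# Score a feature by corrupting its column and measuring the accuracy drop.

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

def importance(model, rows, i):
    """Accuracy drop when feature i's values are rotated out of place."""
    col = [x[i] for x, _ in rows]
    rotated = col[1:] + col[:1]  # deterministic stand-in for a random shuffle
    corrupted = [(x[:i] + (v,) + x[i + 1:], y) for (x, y), v in zip(rows, rotated)]
    return accuracy(model, rows) - accuracy(model, corrupted)

# Hypothetical model: predict a win whenever first-serve ace % exceeds 10.
model = lambda x: x[0] > 10
# Rows: ((ace %, unforced errors), won match?)
rows = [((12, 30), True), ((8, 40), False), ((15, 20), True), ((5, 50), False)]
print(importance(model, rows, 0))  # ace % carries the prediction -> 1.0
print(importance(model, rows, 1))  # errors are ignored by this model -> 0.0
```

A feature the model leans on scores high; a feature it ignores scores zero – which is exactly the signal you want when pruning.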
As a piece of your problem-solving Swiss Army knife, ML orients your thinking towards finding a good enough approximation of the solution. This is the hidden reward at the center of the machine learning maze – you’ll have the tools to hunt for the good enough when the perfect is impractical/impossible.
The forest for the trees
Phew – enough of the preamble, let’s get down to business. A very simple kind of ML model is the Decision Tree. Suppose that every Monday since college, my friend James and I have gone out to eat. In the last few years, we’ve taken to ordering wine with dinner. I’m very indecisive about my wine choice, so before I make a selection, I ask James if he thinks I’ll like it. To help him out, I give him a list of wines I’ve liked in the past, including their vintage, grape varieties and countries of origin (i.e. a training set).
Taking these, James plays a 20-questions-style game with each potential selection, asking questions like ‘is it older than 5 years?’, ‘is it Italian?’, ‘does it use more than 2 grape varieties?’ etc. and comparing the candidate wine’s answers with the training set’s. For each wine we’re considering on the list, he then tells me how many wines I’ve liked in the past share its answers to these questions, with the closest match coming out on top. James has built a decision tree for my taste in wine.
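To make James’s game concrete, here’s a minimal sketch in Python. The wines, questions and thresholds are all invented, and strictly speaking this counts matching answers rather than growing a proper tree from the data – but it captures the question-asking idea:

```python
# James's 20-questions game, sketched with invented wines and questions.

# Each wine: (age in years, country, number of grape varieties), liked?
liked_wines = [
    ((7, "Italy", 1), True),
    ((3, "France", 2), True),
    ((6, "Italy", 3), False),
    ((2, "Spain", 1), False),
]

questions = [
    lambda w: w[0] > 5,         # older than 5 years?
    lambda w: w[1] == "Italy",  # Italian?
    lambda w: w[2] > 2,         # more than 2 grape varieties?
]

def answers(wine):
    """The wine's yes/no answers to every question -- its path down the tree."""
    return tuple(q(wine) for q in questions)

def score(candidate):
    """How many previously liked wines share the candidate's answers?"""
    return sum(
        liked and answers(wine) == answers(candidate)
        for wine, liked in liked_wines
    )

print(score((8, "Italy", 1)))  # matches the liked 7-year-old Italian -> 1
print(score((6, "Italy", 3)))  # matches only a disliked wine -> 0
```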
The problem with this approach is that it doesn’t generalize to new, potentially tastier wines that I just haven’t had the pleasure of sampling (i.e. it overfits). To counter this, I have recruited four other friends, and run the whole process again with each of them. I will select a wine if a majority of my five friends agree that it is the best choice. I don’t want them all to give me the same answer, so I give them each slightly different lists of previously enjoyed wines (including some and excluding others randomly). The fancy ML term for this is bootstrap aggregating, or bagging. My model is getting better!
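A minimal sketch of the bagging step, assuming a deliberately dumb placeholder learner (each “friend” just predicts the majority label in its bootstrap sample – a real implementation would train a tree here):

```python
# Bagging: train each friend on a bootstrap sample, then majority-vote.
import random
from collections import Counter

def bootstrap(train, rng):
    return [rng.choice(train) for _ in train]  # same size, drawn with replacement

def train_friend(sample):
    # Placeholder learner: memorize the majority label in the sample.
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda wine: majority  # this toy "model" ignores the wine entirely

def bagged_predict(friends, wine):
    return Counter(f(wine) for f in friends).most_common(1)[0][0]

rng = random.Random(42)
train = [("barolo", True), ("chianti", True), ("retsina", False)]
friends = [train_friend(bootstrap(train, rng)) for _ in range(5)]
print(bagged_predict(friends, "some new wine"))
```

The resampling is the whole trick: each friend sees a slightly different world, so their mistakes are slightly different, and the vote smooths them out.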
One more problem, though: who says that I know in advance which characteristics (i.e. features) of a wine contribute to me liking it? After all, if I had perfect knowledge of my own likes and dislikes, I wouldn’t need an army of friends to experiment on. It’s very possible that some features have nothing to do with whether or not I will like the wine, but just happen to be correlated to other features that do. Mathematically determining which features are dependent or wholly derivative of others is non-trivial (as well as being a real pain), so we cheat.
I recruit a lot more friends to this task, and assign each of them a random subset of the full list of questions which they may ask about the wine. The theory here is that, with enough decision trees independently making a call based on a limited subset of the data, the error built into each of them will cancel out and we’ll be left with something resembling the underlying wine-to-deliciousness function I’m trying to find.
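The random-question step might look something like this sketch (feature subsampling only – the bootstrap resampling described above would sit on top of it in a full random forest, and every wine and question here is invented):

```python
# Random subspaces: each friend may only ask a random subset of the questions.
import random

rng = random.Random(7)

questions = [
    lambda w: w[0] > 5,         # older than 5 years?
    lambda w: w[1] == "Italy",  # Italian?
    lambda w: w[2] > 2,         # more than 2 grape varieties?
]

liked = [((7, "Italy", 1), True), ((3, "France", 2), True), ((6, "Italy", 3), False)]

def make_friend():
    subset = rng.sample(range(len(questions)), 2)  # each friend sees 2 of 3 questions
    def predict(wine):
        key = lambda w: tuple(questions[j](w) for j in subset)
        # Vote "yes" if any liked wine answers this friend's questions the same way.
        return any(is_liked and key(w) == key(wine) for w, is_liked in liked)
    return predict

friends = [make_friend() for _ in range(9)]
votes = sum(f((7, "Italy", 2)) for f in friends)
print(votes > len(friends) / 2)  # majority verdict -> True
```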
Voilà! I now have an ensemble classifier built from my improbably large circle of friends. In our case, I’ve got a Random Decision Forest (RDF). In the next post in this series, we’ll look at building and using RDFs in R and Ruby – stay tuned!