What is Machine Learning?

Video • 8:17

In this video, Christopher Brooks, Associate Professor of Information, discusses the fundamentals of machine learning, including supervised learning (classification, regression), unsupervised learning (clustering), semi-supervised learning, and reinforcement learning.

Excerpt from the course Introduction to Machine Learning in Sports Analytics

Transcript

So what is machine learning? You can think of machine learning as a paradigm of building computational models based on historical data without you having to explicitly create rules about that data. We build these models in an iterative fashion. So we collect some data about a phenomenon. Maybe we watch a bunch of sports matches or athletic events. We then take statistical methods and apply them to the data, and the statistical methods organize it and find patterns inside of the data. And that creates a model which we can then use for prediction on new data. And you've already seen machine learning at work in this specialization through regression. So I want to break down the main branches of machine learning, and they differ depending on the task.
So, supervised learning. Now this class is actually going to focus really on supervised learning. So the task with supervised learning is to learn the relationship between this historical data and some labels which exist. Now, the labels are usually provided by humans, and the goal of the models that we create is to predict labels for new data, which we don't have the label for. And there are really two broad categories of supervised approaches. In regression approaches, the label is a numeric target value. So this might be the draft pick position of a player, which you might predict from their previous performance. In classification, the label is actually a class value, so this is categorical in nature. So this is like predicting the kind of activity that you might get out of sensor data, or predicting who's going to win a given match.
Unsupervised learning approaches, on the other hand, don't require this label. They take the historical data and they use this to identify the features from the data, which could be used to help us understand new data. Really the most common task is clustering of this data, and unsupervised approaches really tell us about the relationships between the different observations inside of the data.
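A minimal sketch of the clustering idea, in plain Python: k-means groups unlabeled observations around centroids. The player numbers below (goals per game, minutes per game) are made up for illustration, and `kmeans` and `players` are hypothetical names rather than anything from the course materials. k-means is one common clustering method; the lecture does not name a specific algorithm.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on tuples of floats; returns (centroids, assignments)."""
    # Deterministic start for this sketch: the first k points are the centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, assignments


# Made-up observations: (goals per game, minutes played per game). No labels.
players = [
    (0.9, 20.0), (0.0, 60.0), (0.8, 19.0),
    (0.0, 58.0), (1.0, 21.0), (0.1, 59.0),
]

centroids, assignments = kmeans(players, k=2)
for player, cluster in zip(players, assignments):
    print(player, "-> cluster", cluster)
print("centroids:", centroids)
```

No label is ever passed in; the grouping comes only from the features. With features on very different scales, as here, the larger-valued feature dominates the distance, so in practice features are usually standardized first.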
And this is sometimes done just statistically, so we're interested in understanding what kind of clusters exist and what a centroid looks like. So what does a good forward player look like, or what does a good goalie look like? But sometimes we do this for disambiguation and we do this with visualization in mind. And so we actually want to visually look at people and their relationships and their activities inside of sports events to see if there are different patterns. So, in sports analytics, there are actually numerous great examples: which players are similar based on their stats, which physical activities share similar sensor data, which teams have similar playing patterns.
So despite not having a label, human decision making is still an important part of the process, and this is really in determining what features we're going to feed into the model and that we're going to cluster on. So for instance, if you're clustering NHL players, if you fed in features such as goal scoring and location on ice, you're going to differentiate the players based on position. So forwards tend to score goals, and they tend to be all over the ice. A goalie tends not to score goals, although they have, and tends to stay in one place on the ice. But if instead you start feeding in features such as time on ice, salary and so forth, the models coming back are going to tell you more about the perceived value of players. So there might be a well paid forward, a well paid goalie, and a well paid defenseman all coming up in one cluster.
Semi-supervised learning. So the title here sort of gives it away. Semi-supervised learning involves a mixture of supervised and unsupervised learning approaches where the human labelling of the data is expensive or incomplete. For instance, imagine that you've scraped the web and you've got a collection of thousands of pictures of athletes and you want to analyze what pair of shoes they're wearing using computer vision and machine learning.
So individual shoe identification is actually a pretty difficult problem and it's quite error prone. And so what we might want to do instead is seed the system with a few classifications: show an expert a few of the pictures and they say, yeah, those are this brand, or they have this kind of logo on them, or they're this kind of trainer. And then the machine learning algorithm's going to identify features of the images, color or logo features, vector features of what the logo looks like, or lacing or something like this. Then it clusters a large set of the images into different clusters, and it'll have some goodness of fit for each of those clusters. And so it'll be able to identify those pieces that seem to be outliers, and then you can raise those up to humans and the humans can provide labels for those. Humans can also go to the clusters where you think you have it well thought out and spot check and provide new training data, thus adding the labels, and then you repeat the process.
Reinforcement learning. So reinforcement learning is a technique I find really exciting. The goal in reinforcement learning is you're going to train using a supervised method but the human doesn't actually provide the labels. Instead the machine itself, and in this case I mean the machine learning algorithm, can sense the environment and can actually see the labels in the environment. So most commonly this is done by providing some reward function. That function rewards the machine for correctly classifying data in real time without human intervention. So an example of this in sports analytics might be in amateur athlete training where some broad objective is known and could be measured. For instance, let's say there's a rehabilitation program and you want compliance with a training program because an athlete has been injured, and you can measure compliance from wearable data. Did they go out and walk? Did they go out and run?
Are they doing the activities that were prescribed to them, let's say, by a physiotherapist or a trainer? Then the machine is able to take some interventions and try and make this happen, try and change what the user activity is in the environment. So, for instance, sending an email, or phone app nudges, a little notification or a little reward, and so forth. And we experience these all the time right now, and some of these are based on machine learning. So when and how often to send these emails or other interventions is something we start to learn by watching the effect they have on compliance with the training program. So for instance, if emails don't seem to be working, maybe the machine has some other options, such as sending the nudges or pinging a physiotherapist and getting them involved to bring the player back into compliance. So this is very similar to A/B tests or randomized controlled trials, with the goal of trying to do some experimentation. But this is done with machine learning instead. And so we don't use equal proportions of subjects; we just start to learn which approaches are best, and then we start to heavily favor those.
So in this brief lecture we talked about the machine learning space, and there are really four main approaches to applying machine learning. Supervised approaches, like classification and regression, are probably the most common and well explored in many, many domains, including sports analytics, and that's where we're going to focus our discussions in this class. Unsupervised approaches, though, like clustering, are also heavily used, especially when you're trying to understand some relationships between players based on play style. Semi-supervised approaches are really an interesting blend of these two.
And reinforcement learning is an excellent way to start experimenting and understanding and changing behaviors when your machine or your machine learning method can actually start to sense the environment.
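The rehabilitation example, where the system learns which intervention works and then heavily favors it instead of splitting subjects equally, can be sketched as an epsilon-greedy multi-armed bandit. That specific technique is an assumption chosen for illustration; the lecture does not name an algorithm. The intervention names, the compliance probabilities, and the function `run_bandit` are all hypothetical, and the environment is simulated rather than measured from wearable data.

```python
import random

# Hypothetical interventions with made-up compliance probabilities.
# In a real program these are unknown; here they only drive the simulation.
TRUE_COMPLIANCE = {"email": 0.15, "app_nudge": 0.35, "physio_call": 0.75}


def run_bandit(rounds=3000, epsilon=0.1, seed=0):
    """Epsilon-greedy: mostly use the best-looking intervention, sometimes explore."""
    rng = random.Random(seed)
    counts = {name: 0 for name in TRUE_COMPLIANCE}
    estimates = {name: 0.0 for name in TRUE_COMPLIANCE}
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: try a random intervention.
            choice = rng.choice(list(TRUE_COMPLIANCE))
        else:
            # Exploit: use the intervention with the best estimate so far.
            choice = max(estimates, key=estimates.get)
        # Simulated reward: 1 if the athlete complied after the intervention.
        reward = 1.0 if rng.random() < TRUE_COMPLIANCE[choice] else 0.0
        counts[choice] += 1
        # Incremental update of the running mean reward for this intervention.
        estimates[choice] += (reward - estimates[choice]) / counts[choice]
    return counts, estimates


counts, estimates = run_bandit()
for name in TRUE_COMPLIANCE:
    print(f"{name}: used {counts[name]} times, "
          f"estimated compliance {estimates[name]:.2f}")
```

Unlike an A/B test with fixed, equal groups, the share of rounds given to each intervention shifts as the estimates improve, while the small `epsilon` keeps some experimentation going.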