
Artificial Intelligence

The Machine Learning Workflow

In this video, Christopher Brooks, Associate Professor of Information, outlines the machine learning workflow, including processing data (defining the machine learning problem, acquiring data, labeling data), creating models (choosing a model, partitioning your data, evaluating your models), and deploying models.

Transcript

So now that we know a little bit more about what machine learning is, let's talk about the machine learning workflow, or what you're going to do as a sports analyst or a data scientist to apply machine learning. The first step is processing data. So this is really determining the features that are likely to be of significance to the task that you've got in mind, then going out and acquiring and cleaning data to create these features, and then, in the case of supervised models, labeling the data. Then we get to creating models. Some people call this the fun part. This is the statistics part, if you're into that in particular. So you identify your model choice and your evaluation strategy. Then we think about how we're going to separate data into training, validation, and testing data sets. We train and we tune the models using the training and validation data, and then we finally evaluate the model performance on our testing data, or holdout data. Then we move on to deploying the model. So this is where we actually want to make predictions on unseen data and evaluate it in the wild. And this really is an iterative task that loops back to processing data again, creating models again, and continuing to tune this workflow for a given task.

So let's start with processing of data and defining your machine learning problem. It really starts with thinking about the problem. What is it that you want to model and what is it that you want to predict? Do you want game score? Do you want match outcomes? Do you want player salary or a movement result? Now, I was actually showing this to a colleague of mine and they said, I think game score and match outcome are redundant, but are they? Game score I think of as seven versus three; a match outcome is team A won. Game score, then, is a regression task, while a match outcome is a classification task, and we might want to start collecting different information to differentiate these. And the details matter in the prediction. Do you just care about the accuracy of the model? Or do you want to be able to look inside of the model and interpret which features inside of it are leading to higher or lower accuracy? How generalizable do you expect your model to be? Where are you actually going to use it? The generalizability of the model is really important when you want to get away from a toy problem and into a problem you can use in the real world. And these questions start to inform your data collection, modeling, and evaluation strategies.
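To make that distinction concrete, here is a minimal sketch in Python, assuming a hypothetical match data set with made-up column names: the same observations give you a continuous target for the game-score regression task and a discrete label for the match-outcome classification task.

```python
import pandas as pd

# Hypothetical historical matches; the column names are invented for illustration.
matches = pd.DataFrame({
    "team_a_shots": [12, 8, 15, 9],
    "team_b_shots": [7, 11, 6, 10],
    "team_a_score": [7, 2, 5, 3],
    "team_b_score": [3, 4, 1, 3],
})

features = matches[["team_a_shots", "team_b_shots"]]

# Regression task: predict the score line itself (continuous / count values).
regression_target = matches[["team_a_score", "team_b_score"]]

# Classification task: predict the outcome (a discrete label derived from the scores).
classification_target = (matches["team_a_score"] > matches["team_b_score"]).map(
    {True: "team_a_won", False: "team_a_did_not_win"}
)

print(regression_target)
print(classification_target)
```

Notice how the two framings would also push you towards different models and different evaluation metrics, which is exactly why the distinction matters at this early stage.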

So what are the ideal features, sometimes we call these attributes, that you think would be useful for a particular problem? Now break this down into a list of those that you might have high confidence in and those that you are less sure of. Lean towards breadth when doing this, be as explicit as possible, and be aware of potential scope creep throughout. So here's a fun time to maybe pause for a moment and think about this. Think about a problem you would be interested in applying machine learning to, and tell me what features you think would be useful in making this prediction. So next we go on to acquiring data, and this frankly is a common challenge. There are roughly three broad categories or ways in which you can acquire data. First, you can purchase the data from a third party. There are numerous data vendors set up specifically to provide sports outcomes data, largely with an eye towards gambling and risk management markets. And the pricing depends on a few aspects of the data. Do you want a lot of historical data? Does it have to be super accurate? What features do you want? Sometimes they come from multiple sources and you have to make a deal with all of those, or deal with an aggregator. And how frequently do you want the data? Do you want data coming in milliseconds, as if it were stock trading? Or do you want data only after matches are done?

Another common approach, especially in the amateur space, is web scraping. This is frankly a lot of fun. I think it's a complex and wonderful space, and it's this wonderful mixture of technical, ethical, and legal considerations around accessing data that's available on the web. Web scraping, though, can be a very fragile way to obtain data. Are you building just a proof of concept? If so, web scraping is maybe not so bad. Or are you building a longer-lived service, something that you want to use to make regular predictions? In that case web scraping becomes problematic. And then of course there's first-party data collection. This is especially common right now with wearable technologies and data scientists who are embedded in sports teams, working with those teams and the wearables that those athletes have on them in the field. And this can be really integral in some tasks, especially highly valuable and competitive ones. When it's a competitive task, the information that you have that other people can't get really gives you a leg up in the prediction.
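To give a feel for the web scraping option mentioned above, and for why it is fragile, here is a minimal sketch using the requests and BeautifulSoup libraries. The URL and the CSS selectors are entirely made up; a real site would have its own markup, terms of service, and rate limits, and any small change to the page's HTML would silently break the selectors.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical results page, purely for illustration.
URL = "https://example.com/league/results"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# This assumes each match sits in a <tr class="result-row"> with three cells;
# if the site changes its layout, this loop simply returns nothing.
results = []
for row in soup.select("tr.result-row"):
    cells = [cell.get_text(strip=True) for cell in row.select("td")]
    if len(cells) >= 3:
        results.append({"home": cells[0], "away": cells[1], "score": cells[2]})

print(results)
```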

Now we go on to labeling that data, and this is a crucial question: what is it that you're trying to predict? There are several different approaches. Sometimes there's a ground truth which can just be objectively observed, such as the score of a match or the outcome of a tournament. We know who won. It might already exist in the data, or it's a function of whether team A's score is bigger than team B's score. Sometimes, though, the label has to be added by a human expert because it isn't found in the data. For instance, the MVP or the stars of the game might be announced on TV, but you can't find this in the web-scraped data. And so you've got to have somebody, either yourself or somebody else, add those labels to the data. And sometimes you want to engage a group of experts to help label your data. This is commonly called crowd work, with the general idea being that you can speed up the labeling of data, collect diverse opinions, or achieve consensus on difficult tasks. And there are a lot of important considerations around accuracy when labeling data this way. For instance, suppose you have three trainers watch somebody do a series of reps, let's say barbell reps or bench press. Did they do it with a clean form or with a compromised form? Even if there are just these two labels, those three physiotherapists might differ in their opinions, and you have to think about the ways that you're going to integrate those opinions. Another thing to think about when classifying data is, is your data balanced among classes? And if not, are you able to collect more data for the minority classes? The minority classes are the ones that you have less data about. Sometimes this is actually a huge problem, and we'll see some examples of that throughout this course.
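Here is a minimal sketch, on made-up data, of both kinds of labeling: deriving a ground-truth outcome label as a simple function of the scores, and integrating three hypothetical raters' form judgments by majority vote, with Cohen's kappa as one quick way to check how much two raters actually agree.

```python
from collections import Counter
from sklearn.metrics import cohen_kappa_score

# Ground truth that can be computed directly from the data.
scores = [(7, 3), (2, 4), (5, 1), (3, 3)]
outcome_labels = ["team_a_won" if a > b else "team_a_did_not_win" for a, b in scores]

# Human labels from three hypothetical trainers watching the same five reps.
rater_1 = ["clean", "clean", "compromised", "clean", "compromised"]
rater_2 = ["clean", "compromised", "compromised", "clean", "compromised"]
rater_3 = ["clean", "clean", "compromised", "compromised", "compromised"]

# One simple way to integrate opinions: take the majority label for each rep.
majority = [Counter(votes).most_common(1)[0][0]
            for votes in zip(rater_1, rater_2, rater_3)]

# Chance-corrected agreement between a pair of raters.
print("kappa(rater 1, rater 2):", cohen_kappa_score(rater_1, rater_2))
print("outcome labels:", outcome_labels)
print("majority labels:", majority)
```

Majority voting is only one option; weighting raters by experience or modeling their reliability explicitly are also common, and the choice affects the quality of every model trained downstream.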

Okay, let's move on to the creating of models. Choosing the right modeling technique, I mean, it could be a course in itself, or probably a degree in itself. Some techniques result in a model which is more interpretable than others. I'm thinking here of decision trees, which we'll talk about in this course; they allow you to actually explore the choices that the machine learning method is going to use when predicting outcomes. Some techniques require large amounts of data to work well. Neural networks, for instance, require a lot of observations. If you have five or six games with ten features in each of those games, a neural network is probably not going to be very useful. And this leads into the fact that some techniques require significant computational resources, deep learning as an example. I think a good example of a place in the industry where this is a challenge right now is with wearable sensors like the ones you might have seen in Peep Audrey's class. Those sensors can create a lot of data points, 800 or 1,000 points per second on various metrics, and we want to make a predictive model out of them, especially if we want to use it in real time. That potentially requires a lot of computation, and it's an embedded system, so doing that on the device that's strapped to you, like your watch or chest strap or whatever it is, can be a bit of a challenge. But of course, in all of this, sometimes the method that you use will just work better for a particular problem based on the nature of that data. And so it's actually pretty common to try a couple of different approaches to get a sense as to how they might be segmenting the data and how you might want to tune your data more. So my advice is, start simple instead of going for the latest and greatest. In this course, we're going to demonstrate some specific fundamental models which have good results and can work well with moderate amounts of data: decision trees, support vector machines or SVMs, and regression trees. But I'm going to go a little bit further, and we're going to talk about how we can bring multiple different models together to improve accuracy through a process called ensembling.

You also need to think about partitioning your data. When you train your model, you want to ensure that it's generalizable to new data so that the predictive power is high. And I'd like to differentiate between an explanatory model, where the goal is post hoc to explain what happened, and a predictive model, which is really what we'll be talking about in this course, where the goal is to take what happened in those observations and build a model that will be able to predict on future data. To do this, we tend to partition our data into three sets. We have training data, and this is the data that the learning algorithm sees to learn from and create a model. We then have validation data. This is the data that you as the data scientist or sports analyst use to evaluate the quality of the model as you are training it. And then we have test data, or what we'll call holdout data. This is the data your client uses to understand how well your model actually performs. Now, conceptually, you and your client might be the same person. The key is, the more your learning algorithm can observe evidence from your validation and your test sets, the more likely it is to do what we call overtraining to that data. So when partitioning your data, it's common to use an 80/20 rule, but this is sometimes inappropriate, and we're going to go through some examples together in this class.
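Here is a minimal sketch of that partitioning using scikit-learn on synthetic data: a first split holds out 20% of the observations as test (holdout) data, a second split carves validation data out of what remains, a decision tree is tuned against the validation data only, and the holdout data is touched once at the very end.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for real match features and win/loss labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# First split: keep 20% aside as test (holdout) data for the "client".
X_build, X_test, y_build, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Second split: carve validation data out of the remaining 80% for tuning.
X_train, X_val, y_train, y_val = train_test_split(
    X_build, y_build, test_size=0.25, random_state=0)  # 0.25 of 80% = 20% of all data

# Try a few settings and compare them on the validation data only.
for depth in (2, 4, 8):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    print("max_depth =", depth, "validation accuracy =", model.score(X_val, y_val))

# Only the final, chosen model ever gets scored on the holdout data.
final_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("test accuracy:", final_model.score(X_test, y_test))
```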
And we'll dive in a little bit more into the importance of these data sets and the techniques that exist to deal with these issues. Next, evaluating your model. This is the place where your goal really matters. Do you want to predict who isn't going to win a tournament with high accuracy? That's actually really easy, because only one team is going to win the tournament, or one player if it's a one-on-one game. So just predict that nobody is going to win and you'll have 99-percent-or-higher accuracy. There are different metrics, and each of these metrics informs us about how the model performs within the context of a question. So let's say we're going to predict who's going to make the NCAA March Madness tournament. There are 68 teams who make it out of about 350 teams, so we could naively get an accuracy rate of around 80% just by predicting that no one's going to make it. The key is the word accuracy. We use this a lot in everyday English as we engage with one another, but inside of data science, inside of machine learning and the evaluation of models, it has a very specific meaning. It's not always an inappropriate measure, but there are often better measures depending on your goal. A few that we'll talk about are kappa, which is a chance-corrected accuracy; precision, where we take the true positives and divide them by the true positives and the false positives together; recall, which is the true positives divided by the true positives and the false negatives; and then the F1 score, which is pretty commonly used as a combination of the precision and the recall to get a sense of how sensitive your model is. You don't have to worry about memorizing these. I just want to give you a little bit of the language before I show you these in actual code.
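As a preview of those metrics, here is a minimal sketch that computes each of them with scikit-learn on made-up predictions for an imbalanced label, in the spirit of the tournament example above, where a lazy model that almost always predicts the majority class still looks good on accuracy alone.

```python
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             precision_score, recall_score, f1_score)

# Made-up imbalanced labels: 1 = made the tournament, 0 = did not.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A lazy model that almost always predicts "did not make it".
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))    # 0.9, despite missing half the positives
print("kappa    :", cohen_kappa_score(y_true, y_pred)) # chance-corrected agreement
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1       :", f1_score(y_true, y_pred))          # combines precision and recall
```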

Finally, we move on to deploying the model. Once you've built and evaluated it, you're ready to deploy it, and this is really where a lot of engineering work comes in, often software engineering. Pipelines of data and modeling, resulting in continually improving systems, are often what we're trying to build. The timeliness of the model and the predictions, especially if you're predicting in live, in-game settings, I think is really important. And so is the feasibility of your predictions. Do you need to apply models on these embedded devices? That requires a very different engineering skill set and very different trade-offs in your machine learning. One challenge is including hard-to-measure information in the process. When does a significant event, let's say COVID, for instance, invalidate expectations of generalizability? COVID actually provides a great example of this in most sports right now, which have been totally displaced by the pandemic, and thus a lot of rules and norms have changed. That really raises the question, are models that we would have trained in 2019 or 2018 going to be as useful in 2022 or 2023? But also, when do new data sources provide an opportunity to improve the model? We're seeing this in a lot of sports analytics too. In fact, just this year the NHL was playing with pucks that had new data being gathered inside of the puck. Now they've paused that a little bit because they didn't like the way the puck was playing on the surface, some of the differences in the puck, for the championships. But with that puck and that new data, all of a sudden you can start to integrate a lot more into your models than you were able to before, and that's an opportunity. And then, how can we integrate judgment from humans in a probabilistic and meaningful way? There's a whole space of Bayesian modeling that is actually really powerful and interesting for this, and it's really useful as a data scientist to be thinking about, especially in light of the previous two points. So deployment is very specific to the goals of solving your problem and making that a sustainable solution. We're not going to talk much more about deployment in this class; we're going to focus on the first two portions of the workflow instead.

So that's the general process of the machine learning workflow: we process the data, we create models, and then we deploy those models. Now, even though I said we're going to focus on those first two, we can't really go through all of these steps in depth. So I'm going to focus on a couple of sports contexts: consumer wearables, pro sports events and season outcomes, and then a bit of novel research with wearables. We'll actually replicate some work that's been done in some academic papers, and I want to highlight some of the challenges involved in the process and the decisions you need to make in order to carry out an analysis. Hopefully, by the end of this class, you'll feel empowered to start playing with real data and making real predictions of your own. One last thing I want to share is that there's a great reference for this course: Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron. It's an excellent, up-to-date introduction to machine learning in Python. It's affordable, it's very well written, and the models and the methods that we're going to be using can be found in this book. So I would encourage you to get a copy if you're interested in this topic beyond what you'll see in the course.