Artificial Intelligence

Key Concepts in Machine Learning

In this video, Kevyn Collins-Thompson, Associate Professor of Information and Electrical Engineering and Computer Science, talks about two main types of machine learning–supervised learning and unsupervised learning–as well as the steps to solving a problem with machine learning.


"Tech support oversees ai neural network" by DC Studio is available via Freepik.

Transcript

Machine learning tasks, including the ones I've described as examples such as credit card fraud detection, movie recommendations, speech recognition, and so on, can be categorized into two main types. The first type is known as supervised learning, and here our goal is to predict some output variable that's associated with each input item. If the output is a category (one of a finite number of possibilities, such as a fraudulent or not fraudulent prediction for a credit card transaction, or the English word associated with an audio signal for speech recognition), we call this a classification problem within supervised learning, and the function that we learn is called the classifier. If the output variable we want to predict is not a category but a real-valued number, like the amount of time in seconds it would likely take a car to accelerate from 0 to 100 kilometers per hour, we call that a regression problem, and we're learning something called a regression function. More formally, we typically denote a table of data items using the capital letter X, and there's one data item per row. The labels that we associate with each item are stored in the variable y. Our goal is to learn some function f that maps a data item in X to a label in y. To do this, the system's given a set of labeled training examples of inputs X sub i and outputs y sub i, and this set of labeled training examples is what's used to identify the function that best maps the input to the desired output. For example, if the supervised learning problem is image recognition, this involves building a classifier where the input X sub i could be a set of pixels that describe a single image and the desired label y sub i might be the label of the object in the image. Now there are many algorithms that scientists have developed that can do supervised learning and that could be used to estimate this function f from the training data, and we'll cover a number of those algorithms in the course.
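To make the X, y, and learned-function notation concrete, here is a minimal Python sketch of both kinds of supervised learning. It is an illustration only, not the method used in the course: the nearest-neighbor classifier, the least-squares line, the function names, and all data values are assumptions chosen to keep the example small.

```python
import math

def nearest_neighbor_classifier(X_train, y_train):
    """Learn a classifier f that gives an item the label of its closest training item."""
    def f(x):
        distances = [math.dist(x, x_i) for x_i in X_train]
        return y_train[distances.index(min(distances))]
    return f

def fit_line(xs, ys):
    """Learn a regression function f(x) = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

# Classification: each row of X is one transaction (amount, hour of day).
# The values and labels are made up for illustration.
X = [(12.0, 14.0), (8.5, 10.0), (950.0, 3.0), (1200.0, 4.0)]
y = ["not fraudulent", "not fraudulent", "fraudulent", "fraudulent"]
classify = nearest_neighbor_classifier(X, y)
predicted_label = classify((1000.0, 2.0))  # output is a category

# Regression: engine power in kW -> seconds to reach 100 km/h (made-up values).
power_kw = [60.0, 100.0, 140.0, 180.0]
seconds = [14.0, 11.0, 8.0, 5.0]
predict_time = fit_line(power_kw, seconds)
predicted_seconds = predict_time(120.0)  # output is a real-valued number
```

In both cases the training examples (X sub i, y sub i) are the only information used to pick the function; what differs is whether the output is a category or a number.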
Supervised learning needs to have this training set with labeled objects in order to make its predictions, but if the whole point is to predict these labels, where does this initial set of labeled items come from? The answer is that the training labels are typically provided by human judges. Obtaining labels for some problems can be easy or difficult, depending on how much labeled data is needed, the level of human expertise or expert knowledge that is needed to provide an accurate label, and the complexity of the labeling task, among other factors. Crowdsourcing platforms, like Amazon's Mechanical Turk, CrowdFlower, or others, have been a significant source of explicitly provided labels from human workers: customers with machine learning tasks that need labeling can connect with groups of workers who provide labels using human intelligence. So those are more explicitly obtained labels. We can also obtain implicit labels, such as when a search engine detects a user clicking on a result link and then sees no more activity for another minute or two before the user comes back to the search engine. The system might use that activity as a kind of implicit label for that page, suggesting that if the user took some time to visit the page, it was more likely that the page was relevant to their query.
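The implicit-label idea can be sketched in a few lines of Python. This is a hypothetical illustration: the log format, the function name, the example URLs, and the 60-second threshold are all assumptions, and short visits are marked "unknown" rather than "not relevant" because the transcript only says that a long visit suggests relevance.

```python
def implicit_relevance_labels(click_log, min_dwell_seconds=60.0):
    """Turn (query, url, dwell_seconds) click records into (query, url, label) triples."""
    labeled = []
    for query, url, dwell_seconds in click_log:
        # A long pause before the user returns is treated as weak evidence of relevance.
        label = "relevant" if dwell_seconds >= min_dwell_seconds else "unknown"
        labeled.append((query, url, label))
    return labeled

# Made-up click records: (query, clicked URL, seconds before the user came back).
click_log = [
    ("lemon tart recipe", "https://example.com/tart", 95.0),
    ("lemon tart recipe", "https://example.com/ads", 4.0),
]
labels = implicit_relevance_labels(click_log)
```

Labels gathered this way are noisier than labels from human judges, but they cost nothing extra to collect.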

The second major class of machine learning algorithms is called unsupervised learning. In many cases, we only have input data; we don't have any labels to go with the data. In those cases, the problems we can solve involve taking the input data and trying to find some kind of useful structure in it. By structure, we typically mean finding interesting clusters or groups within the data. Once we can discover this structure in the form of clusters, groups, or other interesting subsets, the structure can be used for tasks like producing a useful summary of the input data or visualizing the structure. For example, if you run an e-commerce site that sells products to customers, and you've got many thousands or even millions of customers, you might want to know if you can categorize or group the customers into different types. There might be power users who use more advanced features of the site, quick browser-type users who only look for a cheap discount and stay for very little time on the site itself, or careful researcher-type users who spend a lot of time comparing different items. If you could take your data of how people interact with the site and use unsupervised learning to discover these different groups, you can imagine then tailoring your site's offerings to each group to improve the chance that a user from that group would purchase a product or have a better experience using your site. You don't know how many groups there are out there ahead of time, or even what they look like, and you don't have any labeled examples. So this is a classic example of an unsupervised learning problem. Another type of unsupervised learning problem that is very important is flagging abnormal access to a web server. For security reasons, you might want to be notified if a website user is making requests that could be a cyberattack, or that are somehow very different from typical user behavior on your site. 
Since there can be many different types of hacking or intrusion attempts to break into a server or exploit it in some way, we don't have reliable training labels that we could use to train a classifier using supervised learning. Instead, we need an unsupervised approach that allows us to perform something called outlier detection that doesn't assume future attacks will be of the same form as previous attacks, but that does assume features of attacks on the site will look different somehow than the average user's behavior.
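Both unsupervised tasks described above, grouping customers and flagging unusual requests, can be sketched without any labels. The Python below is a rough illustration under stated assumptions: k-means and a z-score test are simply two common techniques chosen for brevity, not ones named in the transcript, and the function names, thresholds, and data are made up. Note that k-means needs the number of groups k up front, which in practice you would have to guess or tune.

```python
import math
import statistics

def k_means(points, k, iterations=20):
    """Group points into k clusters (Lloyd's algorithm); returns (centers, assignments)."""
    centers = list(points[:k])  # naive start: the first k points (assumption, keeps it deterministic)
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest center.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers, assignments

def find_outliers(values, threshold=2.0):
    """Return indices of values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mean) / spread > threshold]

# Customers described by (minutes per visit, items compared); no labels anywhere.
customers = [(1.0, 1.0), (2.0, 1.0), (30.0, 12.0), (28.0, 10.0), (1.5, 2.0), (32.0, 11.0)]
centers, assignments = k_means(customers, k=2)

# Requests per minute from seven users; the last one behaves very differently.
requests_per_minute = [12, 15, 11, 14, 13, 12, 400]
outlier_indices = find_outliers(requests_per_minute)
```

The outlier check makes no assumption about what an attack looks like, only that it differs from typical behavior, which matches the reasoning above.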

Okay, so suppose you have a situation where you think machine learning might be applicable, using either a supervised or an unsupervised approach. How would you apply machine learning to solve your problem? There are three basic steps, and I'm going to use classification as my typical machine learning scenario. I'll often just use the term "classifiers" as an example of a machine learning task, but what I’m about to describe applies to other forms of supervised learning, like regression that we'll cover later, or unsupervised learning like clustering as well. The first step in solving a problem with machine learning is you have to figure out how to represent the learning problem in terms of something the computer can understand. You need to be able to take your data, or even formulate a description of the object that you're interested in recognizing, in a way that you can use as input to an algorithm. You also need to decide what type of learning algorithm to apply to this data. For example, there are many different ways you could represent an image. Typically, it's represented as an array of colored pixels. There could also be metadata associated with the image. You could think about how you might represent a credit card transaction if you want to do fraud detection. That might be represented by the time, the place, and the amount of a transaction. So you need some representation of the data you have, and a choice of what kind of algorithm you want to apply to the data. The second thing you need to do is decide on an evaluation method that provides some type of quality or accuracy score for the predictions or the output coming out of the machine learning algorithm, which I'll typically call the classifier. If you have an evaluation method, this allows you to assess and compare the effectiveness of different classifiers. So you can tell which classifiers are doing well and which are doing badly for your particular problem. 
For example, a good classifier will have high accuracy and will make a prediction that matches the correct true label a high percentage of the time. The third thing you need to do in applying machine learning to solve a problem, once we've decided on how to represent the input data, the type of classifier, and the evaluation method, is to search for the optimal classifier that gives the best evaluation outcome for that problem. And we'll go into all three of these areas in this course, and we'll use Python in this course as well to solve some specific examples. And we're about to see a concrete example of how we use machine learning libraries in Python to solve a classification problem. So let's go through these three steps in a little more detail now. Let's first talk about what it means to convert the problem into a representation that a computer can deal with. This involves two things. First, you need to convert each input object, which we often call a sample, into a set of features that describe the object. Second, you need to pick a learning model, typically the type of classifier that you want the system to learn. So let's look more closely at what we mean by a feature representation for an object. Each data point in your dataset represents something, some object or situation or event: an entity that's being represented by a list of properties. For example, an email might be represented by a list of words that are in the message, a picture might be represented by a matrix of color values for the pixels that make up an image, and a piece of fruit, like an apple, could be represented by its color, its shape, its texture, and so forth. These attribute values for an object are called features. You can think of the input data containing this feature representation as the input to your function. 
You can visualize it easily as a table where each row of the table represents a single data instance and where the columns of the table represent the features of the object. There could be many different choices for how to represent an object using features. So, for example, if we take a lemon, it has a shape, a width, a height, and it also has mass, taste, and if you buy lemons in the store, they also come with a handy barcode that identifies the type of fruit that it is. So there are lots of different types of information we could gather about a thing in a form that a computer could understand. This problem of trying to figure out how to represent an object for a machine learning algorithm is a challenge and is known as feature engineering or feature extraction, and we’ll cover that in week four of the course. The other key part of representing a machine learning problem is choosing the type of classifier that's appropriate for the problem, and we'll cover many different types of classifiers in this course. They all have different trade-offs in terms of their accuracy, their interpretability, their speed, and so forth, and we'll cover those in the second week of this course. Often, the process of addressing a machine learning task is a cycle, an iterative process, as shown here, where we make an initial guess about what some good features are for the problem and the classifier that might be appropriate. We then train the system using our training data, produce an evaluation, and see how well the classifier works. Then, based on what worked and what didn't work (which examples get classified correctly or incorrectly), we can do a failure analysis to see where the system is still making mistakes. And then with the results of that failure analysis, we typically refine the set of features. We may discover that an important feature is missing that would help fix some of the mistakes, for example. 
So, in my experience, this iterative process is very common in fact, and it's very typical of solving problems with machine learning. Typically, you might want to go through this cycle several times, to continually refine the features and assess their effect on accuracy, or try different types of classifiers, based on the evaluation method that you've chosen in order to determine if you have the right approach for your problem.
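The three steps and the refinement cycle can be put together in one small Python sketch. Everything here is an assumption made for illustration: the fruit measurements are invented, a 1-nearest-neighbor classifier stands in for "the type of classifier", accuracy on a small held-out set is the evaluation method, and the search simply tries a few candidate feature sets.

```python
import math

# Step 1: represent each fruit as a table row of features (made-up measurements).
FEATURES = ("height_cm", "width_cm", "mass_g")
train_X = [(7.0, 7.5, 180.0), (7.5, 7.4, 175.0), (7.2, 5.8, 115.0), (7.6, 6.0, 120.0)]
train_y = ["apple", "apple", "lemon", "lemon"]
test_X = [(7.3, 7.6, 178.0), (7.4, 5.9, 118.0)]
test_y = ["apple", "lemon"]

def select(row, feature_names):
    """Keep only the named feature columns of one table row."""
    return tuple(row[FEATURES.index(name)] for name in feature_names)

def train_1nn(X, y):
    """Return a classifier that copies the label of the nearest training row."""
    def classify(x):
        nearest = min(range(len(X)), key=lambda i: math.dist(x, X[i]))
        return y[nearest]
    return classify

# Step 2: an evaluation method, here the fraction of predictions matching the true label.
def accuracy(classifier, X, y):
    correct = sum(1 for x_i, y_i in zip(X, y) if classifier(x_i) == y_i)
    return correct / len(y)

# Step 3: search over candidate feature sets and keep the best-scoring one.
candidates = [("height_cm",), ("width_cm",), FEATURES]
scores = {}
for names in candidates:
    clf = train_1nn([select(r, names) for r in train_X], train_y)
    scores[names] = accuracy(clf, [select(r, names) for r in test_X], test_y)
best_features = max(scores, key=scores.get)
```

With these invented numbers, height alone separates apples from lemons poorly while width does much better, which is the kind of finding a failure analysis would feed into the next pass through the cycle.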