Deep Neural Networks
In this video, VG Vinod Vydiswaran, Associate Professor of Learning Health Sciences and Associate Professor of Information, discusses deep neural networks, activation functions, multilayer neural networks, and neural network architectures: multilayer perceptrons, convolutional neural networks, recurrent neural networks (RNNs), Long Short-Term Memory (LSTM) RNNs, and bidirectional LSTM models.
"Machine Learning & Artificial Intelligence" by mikemacmarketing is licensed under CC BY 2.0.
Transcript
In this video, we are going to look into deep neural networks. We looked at what a perceptron is in the previous video; now we will see how to build from the perceptron to a neural network. A perceptron, if you recall, is a threshold logic unit. You have your input x_1 to x_n, which is weighted by a weight vector w_1 to w_n. If that weighted sum is above a particular threshold, the output is one; if not, the output is zero. We wrote this in a mathematical form where, instead of w_1 to w_n, we actually have w_0 to w_n, with w_0 being a bias term, so you also have an (n+1)-dimensional weight vector. If the dot product of x and w is greater than zero, then y is 1; if it is less than zero, the output is 0. Another way to think about this is that y is the sign of the dot product between x and w: if the sign is positive, the label y is positive; if the sign is negative, it is negative. In general, you can think of the output as some function f applied to the dot product. If you use any other function, not necessarily the sign function, then this more general function is called an activation function.
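As a rough sketch of that threshold logic unit, here is how it might look in Python; the input values and weights below are made up for illustration and are not from the lecture.

```python
import numpy as np

def perceptron(x, w):
    """Threshold logic unit: y = 1 if w . [1, x] > 0, else 0.
    w[0] is the bias term; w[1:] weight the inputs x_1..x_n."""
    x_with_bias = np.concatenate(([1.0], x))   # prepend the bias input of 1
    return 1 if np.dot(w, x_with_bias) > 0 else 0

# Illustrative example: two inputs, hand-picked weights
x = np.array([0.5, -1.0])
w = np.array([0.1, 0.8, 0.3])    # [w_0 (bias), w_1, w_2]
print(perceptron(x, w))          # 1, since 0.1 + 0.4 - 0.3 = 0.2 > 0
```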
Activation functions f are ones that help introduce non-linearity in the model. The function still operates on the dot product, but instead of just thresholding it (if the value is greater than something, call it one, otherwise call it zero), we can do many more interesting things: arbitrarily complex functions can be used instead of just a threshold function. This idea of using activation functions helps better approximate arbitrarily complex functions that decide when a node is activated. This is going to be important because in neural networks it is not the threshold function directly, but more complex functions that are used. Typically, activation functions work something like this. This is something we have seen already: you have x_1 to x_n as your input, you have the bias term 1 that is weighted by w_0, and then you have the other weights w_1 to w_n. In general, the activation function f runs on the dot product of x and w and then gives you the label y. What are the different kinds of activation functions? You could use a sigmoid function, 1 / (1 + e^(-x)), as your activation. You can use the unit called ReLU, or rectified linear unit, which is zero when x is negative and then increases linearly once x is positive. You can have Leaky ReLU, a maxout function, or other functions as activation functions. This is the critical step for going from a perceptron, which is a single node, to a multilayer neural network.
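As a minimal sketch of the activation functions just mentioned, here is how they might be written in Python with NumPy; the 0.01 slope used for Leaky ReLU is a common default chosen for illustration, not a value given in the lecture.

```python
import numpy as np

def sigmoid(x):
    # 1 / (1 + e^(-x)): squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Rectified linear unit: 0 for negative x, linear for positive x
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Like ReLU, but lets a small slope through when x is negative
    return np.where(x > 0, x, slope * x)

z = np.array([-2.0, 0.0, 3.0])   # e.g. dot products x . w at three nodes
print(sigmoid(z), relu(z), leaky_relu(z))
```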
In a multilayer neural network, you have an input layer, which again is the same x_1 to x_n input plus one as your bias term. The output need not be one node. We have seen examples with one node, but there could be multiple nodes in the output as well; in this particular case we are showing two output nodes, y_1 and y_2, for example if you have to make two decisions at the same time. Then you have the nodes in between. These are called the hidden layers. You have one hidden layer on top of the input, then a second layer built on the output from that first layer, and so on, until the final layer feeds into the output layer. Any given hidden node z has the same m inputs plus the bias term of one, so you can consider each of those as a node very similar to the perceptron. You have an (m+1)-dimensional input and, in the same way, a weight vector with m+1 weights; the dot product of the z values from the previous layer and the w vector gives you the output. Then you can have an activation function on that to see whether the input is sufficiently large to activate that node. This can be written exactly the same way as we wrote the perceptron function: z_{k,i}, the value at the ith node of the kth layer, is the bias weight w_{0,i} plus the sum over j of w_{j,i} times f(z_{k-1,j}), that is, the dot product of the weights with a function of the previous layer's outputs. How are these multilayer neural networks trained? You have an activation function, so you have to see whether there are any parameters corresponding to that activation function, and then you have an approach much like the one used in hidden Markov models. You have forward propagation, as in a feedforward neural network: each layer depends on the outputs of the previous layer, so you create one layer's output and use it as input to the next layer, and so on. Then you have backpropagation, a supervised machine learning technique used to optimize the parameters based on a specific loss function: the final output produces a loss, and to minimize that loss you update your model one layer at a time, going from the last layer to the first. This forward-backward technique is something we have seen earlier; if you recall, hidden Markov models are also trained with a forward-backward approach. A similar forward-then-backward pass is used in multilayer neural networks too.
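Here is a minimal sketch of forward propagation and one backpropagation step for a small fully connected network in Python, assuming sigmoid activations, a squared-error loss, and made-up layer sizes and learning rate; none of these specific choices come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny fully connected network: 3 inputs -> 4 hidden units -> 2 outputs.
# Each weight matrix has an extra column for the bias term (the "+1" input).
W1 = rng.normal(size=(4, 3 + 1)) * 0.1
W2 = rng.normal(size=(2, 4 + 1)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    """Forward propagation: each layer feeds the next."""
    a0 = np.concatenate(([1.0], x))             # input plus bias term
    z1 = W1 @ a0                                # dot products at the hidden layer
    a1 = np.concatenate(([1.0], sigmoid(z1)))   # hidden activations plus bias
    z2 = W2 @ a1                                # dot products at the output layer
    y = sigmoid(z2)
    return a0, a1, y

def backward(x, target, lr=0.1):
    """One backpropagation step for squared-error loss, last layer first."""
    global W1, W2
    a0, a1, y = forward(x)
    delta2 = (y - target) * y * (1 - y)                       # output-layer error
    delta1 = (W2[:, 1:].T @ delta2) * a1[1:] * (1 - a1[1:])   # hidden-layer error
    W2 -= lr * np.outer(delta2, a1)                           # update the last layer...
    W1 -= lr * np.outer(delta1, a0)                           # ...then the first layer

x, target = np.array([0.2, -0.5, 1.0]), np.array([1.0, 0.0])
for _ in range(100):
    backward(x, target)
print(forward(x)[2])   # outputs move toward [1, 0] as the loss shrinks
```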
In general, there are two big umbrellas of neural network architectures. You have the multilayer perceptron, which is a fully connected neural network with non-linear activation functions; that is something we have now seen in more detail. Then there are additional neural network models called convolutional neural networks. These convolutional neural network models use convolution, a dot product or component-wise product, instead of a general matrix multiplication operation to define each layer. That is, it is a neural network that uses a convolution function in place of matrix multiplication. As I mentioned a few slides back, the rectified linear unit, or ReLU, is commonly used as the activation function here. What are the applications of convolutional neural networks? They are used extensively in image and video processing, in medical image analysis, and now increasingly in natural language processing for semantic parsing, query retrieval, and text classification. After convolutional neural networks, another architecture is the recurrent neural network. A recurrent neural network has the same two layers, an input layer and an output layer, and then multiple hidden layers. There is feedback coming from a layer further out towards the output back to an earlier layer; that is what makes it a recurrent neural network. Neural networks are generally called deep neural networks if there is more than one hidden layer in between. That is what gives you the option to do this feedback, because you can go from a subsequent hidden layer back to a previous hidden layer. It really only happens if it is a deep neural network model.
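To make the two architectures concrete, here is a rough sketch of a small convolutional network and a small recurrent network for text classification, written with PyTorch as one possible toolkit; the class names, layer sizes, and vocabulary size are illustrative assumptions, not details from the lecture.

```python
import torch
import torch.nn as nn

# Minimal text-classification CNN: convolution over token embeddings,
# ReLU activation, max pooling over the sequence, then a linear output layer.
class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=50, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 32, kernel_size=3, padding=1)
        self.out = nn.Linear(32, num_classes)

    def forward(self, token_ids):                    # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)    # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))                 # convolution + ReLU
        x = x.max(dim=2).values                      # pool over the sequence
        return self.out(x)

# Minimal recurrent network: the hidden state feeds back at every time step.
class TextRNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=50, hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, token_ids):
        _, h_n = self.rnn(self.embed(token_ids))     # h_n: (1, batch, hidden)
        return self.out(h_n[-1])

tokens = torch.randint(0, 5000, (4, 20))             # fake batch of 4 sentences
print(TextCNN()(tokens).shape, TextRNN()(tokens).shape)
```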
The next neural network architecture is the long short-term memory recurrent neural network. Long short-term memory is abbreviated LSTM; it is basically short-term memory, but carried over multiple layers, which is why we call these long short-term memory models. These utilize memory of events that happened in previous iterations, so instead of only the previous layer giving you input, you can have a few layers back feeding into subsequent layers. Just as in hidden Markov models, where instead of one step of memory you can have multiple steps, in long short-term memory the input from one layer can influence not only the next layer but a few layers out. This model is used extensively in speech recognition, in text-to-speech synthesis, in handwriting recognition, and in a lot of image processing tasks. Finally, we have a more advanced neural network model that is used extensively in named entity recognition: bidirectional LSTM models. In addition to a cell influencing the subsequent cell, the influence goes two ways: it is a bidirectional layer, where one bidirectional cell influences the next bidirectional cell and so on, but the subsequent cell also influences the previous one. All of these can then feed into a conditional random field as a supervised approach, so for a sequence classification or sequence labeling task you can use bidirectional LSTM models. This is fundamentally what is used in a named entity recognition task. The key takeaway here is that deep learning enables learning models without explicit feature engineering. In a traditional machine learning model, we spent a lot of time and effort engineering features, and in fact it is more of an art than a really scientific way of doing it. In deep learning, that ambiguity is taken away because models can be learned without explicit feature engineering. Deep learning also provides state-of-the-art results in multiple domains. For example, deep learning approaches that use multiple hidden layers, meaning deep convolutional neural networks or long short-term memory models, are used for classification tasks in image processing, in text processing, in video, and so on. The combination of deep neural network models with conditional random fields is useful for sequence labeling tasks like the named entity recognition task that we have. Another big advantage is that there are quite a few open-source tools available. In the assignment this week, we are going to introduce you to some of these tools; you will be asked to configure deep learning models in multiple ways and learn from that exercise. There is code available to train deep neural network models and there is computational power available to do it, and that is why these have really become the state-of-the-art models. These toolkits can be used as building blocks for creating even newer models, to see whether some features can be incorporated into deep learning architectures and so on; that is an active area of research for a lot of medical natural language processing and even general natural language processing researchers.
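Here is a rough sketch of what a bidirectional LSTM tagger for sequence labeling might look like in PyTorch. In practice a conditional random field layer would sit on top of the per-token scores, as described above, but for brevity a plain linear layer stands in for it here; all names, sizes, and the nine-tag output are illustrative assumptions rather than details from the lecture.

```python
import torch
import torch.nn as nn

# Sketch of a bidirectional LSTM tagger for sequence labeling (NER-style).
class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=50, hidden=64, num_tags=9):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)       # forward + backward passes
        self.tag_scores = nn.Linear(2 * hidden, num_tags)

    def forward(self, token_ids):                     # (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, 2*hidden)
        return self.tag_scores(states)                # one tag score vector per token

tokens = torch.randint(0, 5000, (4, 20))              # fake batch of 4 sentences
print(BiLSTMTagger()(tokens).shape)                   # torch.Size([4, 20, 9])
```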
One challenge of deep learning is that it creates non-interpretable models, which still limits the wide applicability of these models in medical informatics, and there is a lot of work happening now on explainability, or interpretability, of deep learning models. You may have heard buzzwords like explainable AI, or XAI, where we are trying to learn how to build deep learning models that do well but can also be explained in plain language to clinicians, so that we can have higher uptake of these AI approaches in medical practice. This is an active area of research, and I look forward to talking more with you about it as you get interested in this domain.