Artificial intelligence, machine learning, neural networks: All buzzwords you hear on a daily basis if you’re reading this article. Well, there’s a reason for that.
Artificial intelligence has emerged as one of the hottest topics in computer science today, and the advancement of the field is nowhere near slowing down. Heck, even if you’re not a computer scientist, you’ve at least once in your lifetime thought of how AI systems can become smarter than humans, eventually taking over society as a whole.
An even hotter topic nowadays is deep learning (DL), a direct descendant of artificial intelligence and what is supposedly going to bring us the closest to full computer intelligence.
A lot to unpack in this definition, and that’s what we’ll aim to do in this new series of articles.
In this first piece, we’ll start off by describing some limitations of traditional machine learning algorithms, and how neural networks (NN) are used as an answer to such limits. From there, we’ll lay the foundation for future DL articles by explaining the basic structure of an NN, and how this basic NN can be further developed to create highly sophisticated DL algorithms.
I write this article under the assumption that you have an understanding of basic machine learning concepts, such as hypotheses, decision boundaries, and basic regression and classification algorithms.
If not, I'll refer you to the series I wrote on basic machine learning algorithms and concepts.
Let’s get right into it.
On the one hand, traditional machine learning algorithms are great when you're dealing with simple datasets. If you've studied algorithms like linear regression, logistic regression, or KNN, then you've had firsthand experience with how well these algorithms can be taught to solve complex problems, such as predicting a house's price or recommending items to users.
On the other hand, when working with more complex datasets, ones where the relationship between the features and the output is not so simple to understand, these algorithms can become highly inefficient. Let’s look at an example.
Say you were to use some classification algorithm to determine whether a tumor is benign or malignant. To do so, you decide to use two of the tumor's characteristics: its size ($x_1$) and stiffness ($x_2$). Here's an example of some of the training data plotted, with one marker for tumors labeled malignant and another for tumors labeled benign:
What kind of hypothesis can we use to fit this training data? Here are a few options, along with their approximate decision boundaries:
None of these really succeeds in properly classifying our training set, forcing us to come up with a more complex hypothesis, one of a higher order and with more terms. This, in turn, results in a more complicated decision boundary, which may be exactly what we need to distinguish a malignant tumor from a benign one. For example:
Better, but not so convincing. Can we come up with an even more detailed hypothesis? One of an even higher order, with even more terms? Probably. Should we? Well, how far are we willing to go? If a problem with only two features already leaves us with a complicated, many-term hypothesis, imagine what happens when we use 100 features to solve our problem: the tumor's size, its stiffness, the patient's gender, their age, and so on. In the worst-case scenario, we end up having to take all possible second-order terms to form our hypothesis, leaving us with a hypothesis of roughly 10,000 terms. In fact, for any hypothesis in which all the second-order terms are taken into account, we get on the order of $n^2$ terms, where $n$ is the number of features.
If we were to take all third-order terms, we'd end up with on the order of $n^3$ terms. That's a lot of terms.
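To make that growth concrete, here's a small, purely illustrative Python snippet (not from the article) that counts the distinct polynomial terms of a given order; the exact count of all second-order terms is $n(n+1)/2$, which still grows on the order of $n^2$:

```python
# Illustrative only: count how many distinct polynomial terms a hypothesis would need.
from itertools import combinations_with_replacement
from math import comb

def num_terms(n_features: int, order: int) -> int:
    """Distinct terms of exactly `order` (e.g. x1*x2, x3^2) built from n_features."""
    return comb(n_features + order - 1, order)

# Cross-check the closed form against brute-force enumeration for a small case
assert num_terms(5, 2) == len(list(combinations_with_replacement(range(5), 2)))

print(num_terms(100, 2))   # 5050    -> grows on the order of n^2
print(num_terms(100, 3))   # 171700  -> grows on the order of n^3
```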
Is this far-fetched? Are we ever going to need that many features or that complicated of a hypothesis? Let’s take one last example to really drive the point home.
One very common use of neural networks is object recognition. For example, given an arbitrary image, we wish for our algorithm to identify whether or not the image contains a soccer ball.
Working with images implies that our algorithm will take the pixel values as inputs. Assuming our image is grayscale and 800 × 801 pixels (the actual size of the image above), that's 640,800 features in our feature space. If we were to take all the second-order terms, that's more than 400 billion terms in the worst case. Imagine how that would perform in, say, an autonomous vehicle.
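As a quick sanity check on those numbers:

$$n = 800 \times 801 = 640{,}800, \qquad n^2 = 640{,}800^2 \approx 4.1 \times 10^{11}\ \text{terms}$$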
So what's the solution? Well, you've probably guessed it by now: neural networks.
Although popularized in the last decade or so, neural networks have been around for a while. The groundwork was laid by Warren McCulloch and Walter Pitts in the 1940s, with many others contributing to the field after that [2]. In 1958, Frank Rosenblatt created the perceptron, a supervised learning classification algorithm [3]. Although perceptrons are no longer as widely used as they once were, they're a good segue into the more modern networks used today, so we'll start off by explaining how they work.
A perceptron is a binary classification algorithm that takes several binary values as input and outputs a binary value:
Figure 5 is an example of the most basic perceptron. Let’s explain its different parts.
$x_1, x_2, \dots, x_n$ are the binary inputs, and they make up the input layer. $w_1, w_2, \dots, w_n$ are what Rosenblatt referred to as weights. These weights describe the impact of an input on the overall output: an input with a higher weight has more influence on the result than an input with a lower weight. Every circle that has arrows coming both in and out of it represents the idea of running the inputs through some function, which in turn produces some output; these circles are called neurons. Notice that the circles for the inputs have no arrows coming in, only arrows going out, so they don't run any function; they simply represent the inputs. In a perceptron, the function used in the neuron is the weighted sum of the inputs and the weights (equation 1):

$$\sum_{j} w_j x_j$$
The output is based on the following rule:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \leq \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}$$

The threshold is a pre-determined value that we choose. A larger threshold means the weighted sum has to be larger before the perceptron outputs a one; a smaller threshold makes it easier for the perceptron to output a one.
Our algebraic description of the perceptron can get a little messy, so let's simplify it. One thing we can do is collect the inputs and weights into vectors and take the dot product between the two, which gives exactly the same result as the summation notation in equation 1. We can also move the threshold to the other side of the inequality and assign it the variable $b$, known as the bias, such that $b = -\text{threshold}$:

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}$$
This is the description of perceptrons you’ll normally find, and it’s what we’ll use when we move on to more modern neural networks.
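To tie the notation together, here's a minimal Python sketch of a single perceptron; the inputs, weights, and bias below are made up purely for illustration:

```python
import numpy as np

def perceptron(x: np.ndarray, w: np.ndarray, b: float) -> int:
    """Output 1 if w·x + b > 0, else 0."""
    return int(np.dot(w, x) + b > 0)

# Toy example: three binary inputs with hand-picked weights and bias
x = np.array([1, 0, 1])
w = np.array([0.6, 0.2, 0.3])   # weights describe each input's influence
b = -0.5                        # b = -threshold
print(perceptron(x, w, b))      # 1, since 0.6 + 0.3 - 0.5 = 0.4 > 0
```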
We're also not limited to a single neuron. We can connect many perceptrons into a network, hence the name neural network:
We call the layer with the output neuron the output layer. Any layer in between the input and output layer is called a hidden layer.
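Here's how that layered structure might look in code, again as a rough sketch with arbitrarily chosen weights: a hidden layer of two perceptrons feeding a single output perceptron.

```python
import numpy as np

def perceptron_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a whole layer of perceptrons at once: one row of W (and one bias) per neuron."""
    return (W @ x + b > 0).astype(int)

x = np.array([1, 0, 1])                   # input layer (binary inputs)
W_hidden = np.array([[0.5, -0.2, 0.4],    # hidden layer: two perceptrons
                     [-0.3, 0.8, 0.1]])
b_hidden = np.array([-0.4, -0.2])
W_output = np.array([[0.7, 0.7]])         # output layer: one perceptron
b_output = np.array([-0.5])

hidden = perceptron_layer(x, W_hidden, b_hidden)       # outputs of the hidden layer
output = perceptron_layer(hidden, W_output, b_output)  # final output
print(hidden, output)                                  # [1 0] [1]
```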
Perceptrons are great for linearly separable data, i.e., data that can be separated using a line, a plane, or a hyperplane [5]. This means they don't solve our earlier problem, where we needed a way to create highly complicated decision boundaries without an immensely large hypothesis.
The output being limited to only zero and one also raises some limitations. Imagine we want to use the network in figure 6 to determine whether or not the number written on an image is a zero. We want an output of 1 if the image contains a zero, and an output of 0 otherwise. For some unknown reason, our network keeps labeling images of the number one as the number zero. What can we do? We can try slightly changing some weights. But a slight change in the weights can have a disproportionately large effect on the output node, as large as flipping the result at the output node from one to zero, or vice versa. So, even if we start correctly identifying that a one is not a zero, the network may now behave erroneously on other digits, say, nine.
The solution to all these limitations of perceptrons is the sigmoid neuron, the more modern and widely used type of neuron in NNs today.
The structure of a neural network using sigmoid neurons is exactly the same as one using perceptrons: inputs and weights are fed to neurons in a hidden layer, which in turn produce outputs that are fed to the neurons in the output layer, which finally produce the network's output.
However, sigmoid neurons differ from perceptrons in a few ways. One, the inputs and outputs are not limited to the values one and zero; they can be anything in between. This is a direct effect of the second difference: the function used is now the sigmoid function ($\sigma$), as opposed to what we had in equation 1:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
In our case, $z$ will always be the weighted sum from equation 1 (plus the bias). You can think of the sigmoid function as one that squashes any continuous value into a value between zero and one. Since the output of equation 1 is indeed a continuous value, this idea holds.
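As a minimal sketch (again, not the article's code), here's the same toy perceptron from earlier with the hard threshold swapped for the sigmoid:

```python
import numpy as np

def sigmoid(z: float) -> float:
    """Squashes any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x: np.ndarray, w: np.ndarray, b: float) -> float:
    """Same weighted sum as the perceptron (equation 1 plus the bias), passed through sigma."""
    return float(sigmoid(np.dot(w, x) + b))

# Same toy inputs, weights, and bias as the perceptron sketch above
x = np.array([1, 0, 1])
w = np.array([0.6, 0.2, 0.3])
b = -0.5
print(sigmoid_neuron(x, w, b))   # ~0.599: a graded value instead of a hard 0 or 1
```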
Networks of sigmoid neurons can also have more than one output, and are not limited to binary classification problems:
The best way of understanding how sigmoid NNs work is through the sigmoid curve:
Notice that there are horizontal asymptotes at $\sigma(z) = 0$ and $\sigma(z) = 1$, satisfying our requirement of getting values between zero and one. But does this fix the problem we mentioned, where a small change in weights causes a disproportionately large change in the output? To answer that question, we can compare figure 7 to the curve of a step function:
With perceptrons, we were dealing with the step function, where a slight change in the weighted sum (driven by a slight change in the weights) could cause a complete flip in the output. The sigmoid curve is nothing but a smoothed-out version of the step function. This new, smoother curve allows the change in the output to better reflect the amount of change in the weights.
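A tiny numerical illustration of that smoothing (the numbers are arbitrary): nudge the weighted sum slightly across the threshold and compare how the two functions respond.

```python
import numpy as np

def step(z: np.ndarray) -> np.ndarray:
    return (z > 0).astype(float)

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

# A weighted sum sitting just below zero, then nudged just above it
z_before, z_after = np.array([-0.01]), np.array([0.01])

print(step(z_before), step(z_after))        # [0.] [1.]            -> the tiny nudge flips the output entirely
print(sigmoid(z_before), sigmoid(z_after))  # ~[0.4975] ~[0.5025]  -> the output moves only slightly
```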
By now, you should have a better sense of the aforementioned definition of deep learning: you know what a neural network is and how it works. But why “deep”, and why “learning”?
“Learning” comes from the fact that the weight selections and updates are learned by our neural network autonomously, using what we call a learning algorithm. We’ll talk about learning algorithms in the next article.
“Deep” describes the particular structure of our neural networks. The NNs we saw today were very basic, with no more than two hidden layers and a few outputs. In reality, deep learning algorithms solve extremely complex problems by decomposing them into smaller subproblems. This decomposition involves building large neural networks out of smaller ones, layered on top of each other and working together to solve the larger, more complex problem.
This article served as the foundation for everything we’ll be discussing in the future.
We started off by studying the limitations of traditional machine learning algorithms, in an attempt to understand why we need neural networks. From there, we looked at one of the earliest NNs developed and studied how it can be used to solve linearly separable classification problems. Although useful, the perceptron falls short when the classes aren't linearly separable, hence the need for a more sophisticated approach.
By simply changing the function used at every neuron, we were able to come up with a neural network that can deal with more complicated datasets. Although perceptrons and sigmoid neurons share the same structure, the way a change in weights affects the output separates their abilities significantly.
In the next article, we’re going to discuss how neural networks learn on their own using a learning algorithm called gradient descent. Until then, I leave you with the following points to ponder on: