Tag: machine learning

  • Neural networks – ML meets biology?

    *This article was written by a former Software Developer at Lab08 – Sophia Peneva*

    Machine Learning is more than a buzzword to us at Lab08, as we have the great opportunity to work on products relying on AI and ML. This inevitably piques our employees' interest in diving deep into the specifics and learning more about the mechanics of ML. Sophie is no exception, and her curiosity has led her into the depths of the topic. After sharing her general thoughts on the Hows and Whys of Machine Learning, we now present you with vol. 2, dedicated to neural networks. Enjoy!

    I have never been all that great at biology, nor have I, to be completely sincere, taken any particular interest in anything biology-related. I will, however, make an attempt to describe something I am genuinely infatuated with through a brain analogy. I brought up the topic of ‘Machine learning’ in my previous article, but in it, I just scratched the surface. This time the aim is to dig a little deeper, so I’ll get my scalpel and let’s get started.

    First things first, how are machine learning and biology even remotely related? Well, to quote Wikipedia, “The understanding of the biological basis of learning, memory, behaviour, perception, and consciousness has been described by Eric Kandel as the “ultimate challenge” of the biological sciences.”, emphasis on learning. This is, to some extent, also Machine learning’s challenge. When we, as human beings, learn a new skill, our brains form neurons and connections between those neurons. These connections can be both strengthened and weakened. Practising a particular skill set, for example, can result in a strong connection between two neurons, meaning that firing one of them would most likely trigger the other one. This process has been the inspiration behind neural networks.

    Neural networks

    Last time, I described the process of classifying images using linear image classification. As a quick recap: in order to determine whether our image depicts a horse, a cat, a car, etc., we turn the image into an input vector containing its pixel data, multiply it by a matrix of weights and add a vector of biases to the result. This gives us a vector of scores, where each row represents the likelihood of our image belonging to a certain class (a horse, a cat, a dog, etc.). Long story short, Y = Wx + b. The weights and biases are what we train in order to get the correct output from the classifier.
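    The recap above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trained model – the shapes (a 4-element input, 3 classes) and the random values are made up purely for demonstration:

```python
import numpy as np

# A minimal sketch of the linear classifier Y = Wx + b described above.
# Shapes are illustrative: a 4-pixel "image" and 3 hypothetical classes.
rng = np.random.default_rng(0)

x = rng.random(4)          # input vector: flattened pixel data
W = rng.random((3, 4))     # weights: one row of weights per class
b = rng.random(3)          # biases: one per class

y = W @ x + b              # vector of class scores
predicted_class = int(np.argmax(y))  # the highest score wins
```

    In a real classifier, W and b would be learned during training rather than drawn at random.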

    So far so good, but what if we don’t have linearly separable data? What if we want better results? That’s where neural networks come in. Let’s take a look at neural networks’ architecture:

    Fig.1

    Our neural network consists of an input vector, an output vector and a hidden layer. This illustration is of a two-layer Neural network (notice that the input vector is not accounted for when counting the number of layers). The layers are ‘fully connected’ which means that between every two adjacent layers all neurons are pairwise connected (NB neurons from the same layer share no connection). 

    There are a couple of things to consider when it comes to deciding what the architecture of your neural network will be – how the layers will be connected and the size of the network. Let’s first take a look at how we connect the layers.

    Activation functions

    When we created a simple linear model, we got the output vector by multiplying the input vector with a weights matrix and adding a vector of biases:

    Y = Wx + b

    So what happens when we put an additional layer between the input and the output vector? How does that affect our equation? The easiest solution is to consider the following: if we get the hidden layer’s vector as HL = Wx + b, we can multiply it by another set of weights and add another set of biases to get the output layer:

    Y = W1(Wx + b) + b1

    Unfortunately, this is not correct. The whole point of neural networks is to represent more complex functions. Will we achieve that with this equation though? Since the product of two matrices is a matrix, if we open the brackets we will get:

    Y = W1(Wx + b) + b1

    Y = W1Wx + W1b + b1

    Y = W2x + b2 (where W2 = W1W and b2 = W1b + b1)

    Which is not much different from our linear model – the only difference is the values of the weights, which are supposed to be learned during training anyway. Essentially, we still have a linear model. What we have done here is a bit like taking the function y = x + 1, multiplying it by 2, adding 1, and expecting to get a nonlinear function. Obviously, that is not the case.
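    We can verify this collapse numerically. The sketch below (with arbitrary random shapes and values) stacks two linear layers with no activation and shows that a single layer with W2 = W1W and b2 = W1b + b1 produces the same output:

```python
import numpy as np

# Without an activation function, two linear layers collapse into one.
rng = np.random.default_rng(1)
x = rng.random(3)
W, b = rng.random((5, 3)), rng.random(5)     # input -> hidden
W1, b1 = rng.random((2, 5)), rng.random(2)   # hidden -> output

two_layer = W1 @ (W @ x + b) + b1            # "deep" model, no activation

W2 = W1 @ W                                  # equivalent single-layer weights
b2 = W1 @ b + b1                             # equivalent single-layer bias
one_layer = W2 @ x + b2

print(np.allclose(two_layer, one_layer))     # True: same linear model
```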

    So how do we model nonlinear functions? That’s where activation functions come in.

    Activation functions (nonlinear functions) take a number, perform mathematical operations on it and return a number. If we use such a function after every layer (except for the last one), we will achieve nonlinearity. So the way we generate our output should look something like this:

    Y = W1(f(Wx + b)) + b1 

    where f is our activation function. Naturally, the question arises of what we should use as an activation function. There are a few that are widely used in Machine Learning for neural networks.

    Fig.2

    All of the above have pros and cons, however in recent years ReLU has been the most popular choice. There is no correct answer to the question ‘Which one should I use for my model?’. In order to find out which one works best for your data, you could try experimenting with different activation functions.
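    As a rough sketch, here are three of the commonly used activation functions from Fig. 2, and one of them plugged into the two-layer equation Y = W1(f(Wx + b)) + b1 (shapes and values are illustrative, not a trained model):

```python
import numpy as np

# Three common activation functions, written out explicitly.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)          # zeroes out negative inputs

# Applying one between the layers gives the nonlinear model above:
rng = np.random.default_rng(2)
x = rng.random(3)
W, b = rng.random((5, 3)), rng.random(5)
W1, b1 = rng.random((2, 5)), rng.random(2)

y = W1 @ relu(W @ x + b) + b1          # Y = W1(f(Wx + b)) + b1
```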

    Neural network sizes

    In order to measure the size of a neural network, two things are taken into consideration: the number of neurons and the number of learnable parameters. In Fig. 1 we have 5+2=7 neurons (we are disregarding the neurons in the input vector), 3×5 + 5×2 = 25 weights and 5+2=7 biases, for a total of 25+7=32 learnable parameters. For comparison, modern Convolutional Networks contain on the order of 100 million parameters and around 10-20 layers (hence deep learning). This naturally brings up the question ‘So how big should my neural network be?’. Let’s first see what a change in the number of neurons means for our model. Here I have prepared 3 examples of ML models with different numbers of neurons.
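    The parameter count from Fig. 1 can be computed generically. For each pair of adjacent layers, the fully connected weights number (size of layer) × (size of next layer), plus one bias per non-input neuron:

```python
# Parameter count for the Fig. 1 network: 3 inputs, 5 hidden, 2 outputs.
layer_sizes = [3, 5, 2]

neurons = sum(layer_sizes[1:])  # the input vector is not counted
weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
biases = sum(layer_sizes[1:])   # one bias per non-input neuron

print(neurons, weights, biases, weights + biases)  # 7 25 7 32
```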

    At first glance, the last model which has the most neurons has the best predictions. Every red point is correctly classified as ‘red’, the same goes for every green point. Overall, the models seem to be getting better at predicting, the more neurons they have. Is that really the case, though? 

    Fig.3

    Despite the fact that the third model seems to represent the data perfectly, it has actually grossly overfitted it. What this means is that our model has become extremely good at classifying the points that it was trained on, but it probably won’t perform well out of sample – i.e. on new data points. The reason this happens is that the more neurons we have, the more complex functions our model can build to separate the classes in our data. The problem occurs because our initial data will most likely have some sort of ‘noise’ (pieces of information that bring nothing to the table – outliers that should be ignored). If we treat this noise as an important contributor to our data instead of as outliers, we run the risk of misclassifying a lot of other data points. In the example in Fig. 4, our model has created a pretty complex function in order to correctly classify two additional green points in our training dataset. What happens when we run the model on real data after that? We’ll potentially misclassify a lot of points.

    Fig.4

    Okay, so does that mean that we should aim for neural networks with fewer neurons? Well, no. Thankfully, there are other ways to deal with overfitting, such as regularization and dropout. You might be wondering why we should go through the additional effort of adding regularization into the mix if we can simply make our model with fewer neurons. The big problem with smaller neural networks, however, is that they are harder to train. Their loss functions have relatively few local minima, most of which are bad but easy to converge to. This means that we are more likely to end up at a local minimum with a high error rate. On the other hand, bigger neural networks have a lot more local minima, and most of them have relatively low errors. To sum up, by choosing a bigger neural network we stop depending so much on random chance in the initial weight initialization in order to get a good final result. So in the end, to decide how deep we want to go with our neural network, we should answer how deep we are willing to reach into our pockets, as larger neural networks need a lot of computational power.

    I mentioned regularization as a way of reducing overfitting. What is regularization though and how does it work?

    At its core, regularization is penalizing for complexity. As mentioned earlier, if our model is overfitting it works well on the training data but due to being too complex, it cannot generalize well and therefore performs poorly on new data. So how do we make our model less complex? When we are training our neural networks, we heavily rely on the loss function. You might remember from the previous article that after computing the loss, i.e. the error, we make changes to our weights. Let’s take a look at the Sigmoid activation function once again.

    Fig.5

    You might notice that for values of z close to 0, the function is pretty linear (Fig. 6).

    Fig.6

    So if we maintain our weights to have lower values, we will keep a relatively linear result function even after the activation. This means that with the help of regularization we can achieve a relatively simple function that will not overfit our data. So what we should aim to do is penalize the loss function in such a way that our weights become smaller.

    If we take the L2 regularization, for example, the cost there is computed as follows:

    Fig.7

    You can see here that it is the sum of the loss function and the so-called ‘regularization term’. If λ is equal to 0, then there is no difference in how we compute the cost. The idea behind the cost function is that we minimize it with every iteration during the training process. If we simply minimize the loss (i.e. the error), we will eventually overfit our data. However, if we minimize both the loss and the complexity, we will have a more balanced model which will neither overfit nor underfit. Obviously, the value that we choose for λ is very important, and playing around with it can have a dramatic effect on the resulting model.
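    In code, the L2-regularized cost from Fig. 7 is just the data loss plus λ times the sum of squared weights. The `data_loss` and weight values below are made up for illustration:

```python
import numpy as np

# Sketch of the L2-regularized cost: cost = loss + lambda * sum(w^2).
def l2_cost(data_loss, weights, lam):
    return data_loss + lam * np.sum(weights ** 2)

w = np.array([0.5, -1.0, 2.0])            # illustrative weights
print(l2_cost(1.0, w, lam=0.0))           # lambda = 0: just the plain loss
print(l2_cost(1.0, w, lam=0.1))           # larger lambda penalizes big weights
```

    Because the squared terms dominate for large weights, minimizing this cost pushes the weights toward smaller values, which is exactly the effect described above.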

    In order to illustrate the effect, here are 3 models that all have the same number of neurons but with different regularization strengths (going from smallest to largest). As you can see, the third model which has the largest regularization strength did not overfit the data despite having a lot of neurons. Now compare it to the first model which did the exact opposite.

    Fig.8

    You can play around with different neural network sizes and regularization strengths in the ConvNetJS demo.

    Everything we discussed so far was but a glimpse into what the topic of neural networks has to offer. There are so many more things to cover but it would be overkill to try and do it all in this article (trying to cover everything would lead to overfitting anyway). What I hopefully achieved was to give you a general understanding of what neural networks are and perhaps piqued your interest in them.

    Sources:

    [1] Overfitting and underfitting https://satishgunjal.com/underfitting_overfitting/

    [2] Regularization https://developers.google.com/machine-learning/crash-course/regularization-for-simplicity/l2-regularization

    [3] Regularization in Machine Learning https://towardsdatascience.com/regularization-in-machine-learning-76441ddcf99a

    Sophia Peneva used to work as a Software Developer at Lab08, contributing to the growth of the Usertribe (now GetWhy) platform. Previously, she has worked for SAP and is currently a part of Google’s team, starting as an intern, working on a Youtube algorithm project, while she was a part of Lab08. She has experience with C++, PHP, Python, Java, MySQL, MongoDB and others. When it comes to deep learning, Sophie now has Tensorflow experience in her pocket. Sophie has been interested in ML for quite some time, showing great passion and engagement in getting her hands dirty. We hope you’ve enjoyed the product of her enthusiasm!

    Be sure to follow us on social media to receive updates about other similar content!

  • The Hows and Whys of Machine Learning

    *This article was written by a former Software Developer at Lab08 – Sophia Peneva*

    Machine learning (ML) and Artificial intelligence (AI) are two terms that, due to their futuristic undertone, have earned themselves huge popularity. Both of them have the attractive quality of making the people who use them sound smart, and are thus widely used despite not being widely understood. What I hope to achieve with this article is to shed some light on ML while simultaneously avoiding the usage of Greek letters to the best of my abilities.

    So what is ML actually used for?

    Think about your YouTube recommendation section. The more videos you watch, the more suggestions it will give you that fit your taste. In other words, the more you interact with the platform, the more it LEARNS. This brings us to the key ingredient that makes machine learning possible – data. In addition to data, we also need a pattern and the absence of a mathematical formula that would solve our problem. We can arguably do without the latter two, whereas without structured and relevant information, learning, unfortunately, is not possible. Why is that?

    Well, let me put it this way: if there is no pattern, you can try to learn as much as you want, but the only knowledge you will acquire is that there is no pattern. On the other hand, if there is a magical mathematical formula that answers your question, machine learning will do the trick, but you would have been better off just using the formula. So ML can technically be applied in both cases, but in the case of the latter, it would quite frankly just be a waste of time. And lastly, what happens if we have no data? Well, that’s a dead end. How do we learn without having anything to learn from? See? It’s simply not possible.

    Okay, enough beating around the bush. Let’s assume we have data. What do we do with it? 

    Let’s go back to the YouTube example. You, as a user, can be represented as a vector of characteristics such as [0.4, 0.67, 0.25, … ], where each value in the vector represents how much you like something in percent (e.g. you only like 40% of commentary videos, 67% of funny videos, 25% of science-related videos and so on). Similarly, videos can be represented as vectors. For example, the vector [0.7, 0.9, 0.16, … ] would tell us that 70% of the video’s content is commentary, and so on. The match factor between the two vectors gives the user’s rating for the particular video. Sounds great, but that’s not learning. In this case, we already have knowledge about both the user’s likes and dislikes and the video’s content. But how did we get to that stage? We fill the vectors with completely random values, and then the machine tries to extract the real values using the ratings we already have.
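    One simple way to define the “match factor” between the two vectors is a dot product (an assumption for illustration – real recommenders are more elaborate). Using the example values above:

```python
import numpy as np

# Hypothetical match score between a user's taste vector and a video's
# content vector, computed as a dot product of the example values above.
user = np.array([0.4, 0.67, 0.25])    # likes: commentary, funny, science
video = np.array([0.7, 0.9, 0.16])    # the video's content mix

score = float(user @ video)           # higher score = better predicted match
print(round(score, 3))
```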

    Let’s go through how this ‘learning’ process happens with a model that classifies images.

    To illustrate the basics of linear image classification we will use the CIFAR-10 dataset as an example. The CIFAR-10 dataset consists of 60000 32×32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

    Sample image from the CIFAR-10 dataset

    What we are aiming for is to train a model that, upon receiving an image representing one of these 10 classes, would be able to correctly classify it. We can build the function f: X → Y, where X is the vector representation of the image and Y ∈ {0, …, 9} (the value of Y corresponds to one of the 10 classes in the dataset).

    First things first, how do we get the vector from an image?

    An image is just a grid of numbers between 0 and 255, e.g. 32×32×3 for the 3 RGB channels. This grid can then be reshaped into a 3072-dimensional vector. Since we are describing a linear classifier, as the name suggests, we are going to use some linear function to classify our images. So the function f: X → Y would look like this: Y = g(WX + b). The values in the matrix W are called weights, the vector b is the bias, and g is some sort of function.

    Multiplying this 1×3072 vector with a 3072×10 matrix and adding a 1×10 bias vector results in a 10-dimensional ‘scores’ vector. The goal is to train the 3072×10 matrix in such a way that the biggest value in the 1×10 vector is located at the index corresponding to the image’s class. So the function g simply picks the index of the maximum value from the scores vector.
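    The whole pipeline – flatten the 32×32×3 grid, multiply by W, add b, take the argmax – looks like this as a sketch. The random weights here merely stand in for trained ones:

```python
import numpy as np

# Scoring one CIFAR-10-sized image with an (untrained) linear classifier.
rng = np.random.default_rng(3)

image = rng.integers(0, 256, size=(32, 32, 3))  # raw 32x32x3 pixel grid
x = image.reshape(-1).astype(np.float64)        # flattened 3072-vector

W = rng.random((3072, 10))                      # weights: 3072x10
b = rng.random(10)                              # bias: 1x10

scores = x @ W + b                              # 10 class scores
predicted_class = int(np.argmax(scores))        # the role of the 'g' function
print(scores.shape, predicted_class)
```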

    But how do we produce the matrix and the bias that apparently hold the key to our linear classifier?

    As mentioned above, initially our matrices are created by generating random values which then have to be optimized. This, however, brings up another question.

    How do we grade how well our model performed in order for any optimization to take place? What we are looking for here is a loss function. A loss function maps decisions to their associated costs, and it can also be thought of as an error function.


    Let’s define the function for a single image as the multiclass hinge loss:

    Li = Σ j≠yi max(0, sj – syi + 1)

    where i is the index of the current image, j is the current class, s is the scores vector, and yi is the index of the correct class for the current image.

    Let’s look at an example where the vector representing a ship, multiplied with the weights, resulted in the scores vector [3.2, 5.1, -1.7]. This incorrectly classifies our image as a horse. Now let’s calculate the loss for this image.

    Loss = max(0, 5.1 – 3.2 + 1) + max(0, -1.7 – 3.2 + 1) = 2.9

    We do this for all images we want to classify.
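    The worked example above translates directly into code – a hinge loss with a margin of 1, summed over every class except the correct one:

```python
import numpy as np

# Multiclass hinge loss for one image, matching the worked example:
# a ship (correct class at index 0) scored as [3.2, 5.1, -1.7].
def hinge_loss(scores, correct_idx, margin=1.0):
    loss = 0.0
    for j, s in enumerate(scores):
        if j == correct_idx:
            continue  # the correct class contributes nothing
        loss += max(0.0, s - scores[correct_idx] + margin)
    return loss

scores = np.array([3.2, 5.1, -1.7])
print(round(hinge_loss(scores, correct_idx=0), 2))  # 2.9, as in the example
```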

    What we want to do is minimize the loss function. In order to understand how to do that, one can imagine being in the mountains. The peaks and valleys represent the local maxima and minima of our loss function, and in order to minimize our costs, we have to go down the slope to a local minimum. This is accomplished with the gradient descent algorithm. The gradient of the loss with respect to the weights is a vector of partial derivatives, and it points in the direction of the steepest slope. Therefore, taking steps in the direction of the negative gradient will minimize our loss function.

    It is important to note that, depending on the starting position and the step size, the global minimum of the loss function may never be reached. This is because our weights always move down the steepest slope, while the deepest valley may lie just behind the adjacent peak.
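    The “walking downhill” process can be sketched on a toy one-dimensional loss. Here I use L(w) = (w – 3)², a made-up function whose gradient 2(w – 3) we can write by hand, in place of a real classifier’s loss:

```python
# Gradient descent on the toy loss L(w) = (w - 3)^2, minimum at w = 3.
w = 10.0          # starting position in the "mountains"
step = 0.1        # step size (learning rate)

for _ in range(100):
    grad = 2.0 * (w - 3.0)   # gradient: direction of steepest ascent
    w -= step * grad         # step against the gradient, i.e. downhill

print(round(w, 4))  # converges near the minimum at w = 3
```

    With a convex bowl like this one there is only a single valley; the caveat above about getting stuck applies once the loss surface has many peaks and valleys.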

    Nowadays, image classifiers are widely used, with state-of-the-art accuracy of around 95%. Of course, a lot more goes into training a model in order to achieve such results. In our example, we based our output on only one set of weights, but in reality, convolutional neural networks (CNNs) have many layers, each one searching for some specific feature in the image. For example, one layer could check if there are any ears in the image, another one could look for a head with ears, and so on.

    A great example of an image classification use case can be found in UserTribe’s emotion recognition. UserTribe is a platform that helps businesses gather customer intel about their products. Lab08 has been working on UserTribe’s platform for almost 2 years now – developing new features and making sure this product stays ahead of its competitors. Through video-recorded product tests, companies can understand how their users feel about them and what they create. In order to create an automated report on someone’s opinion based on a video, UserTribe uses ML to extract sentiment from what is said, but also takes it a step further and analyses the customer’s facial expressions while they interact with the product, adding social anthropology experts into the mix.

    Similarly to our ten-class example from the CIFAR-10 dataset, UserTribe recognizes 7 different emotions. Their model has been trained on 30000 images and achieved the 95% SOTA accuracy when tested on the CK+ dataset. Every video is converted to 1 fps (frame per second), as detecting facial emotion at each frame would be both redundant and very inefficient. After that, every frame is classified by the model, which consists of 19 convolutional layers with a softmax activation function. Activation functions in CNNs are what transform our model into a non-linear classifier, essentially allowing it to perform more complex tasks. But this is a step too deep for the purposes of this article. So deep, in fact, that it enters the realms not of Machine learning but of Deep Learning, which is a whole new conversation.
