
The Hows and Whys of Machine Learning


Machine learning (ML) and artificial intelligence (AI) are two terms that, thanks to their futuristic undertone, have become hugely popular. Both have the attractive quality of making the people who use them sound smart, and are thus widely used despite not being widely understood. What I hope to achieve with this article is to shed some light on ML while avoiding Greek letters to the best of my abilities.

So what is ML actually used for?

Think about your YouTube recommendation section. The more videos you watch, the better its suggestions fit your taste. In other words, the more you interact with the platform, the more it LEARNS. This brings us to the key ingredient that makes machine learning possible: data. Machine learning is the right tool when three things hold: we have data, there is a pattern in it, and there is no mathematical formula that already solves our problem. We can do without the last two, but without structured and relevant data, learning is unfortunately not possible. Why is that?

Well, let me put it this way: if there is no pattern, you can try to learn as much as you want, but the only knowledge you will acquire is that there is no pattern. On the other hand, if there is a magical mathematical formula that gives an answer to your question, machine learning will do the trick, but you would have been better off just using the formula. So in both cases ML can still be applied, but in the latter it would quite frankly just be a waste of time. And lastly, what would happen if we have no data? Well, that’s a dead end. How do we learn without having anything to learn from? See? It’s simply not possible.

Okay, enough beating around the bush. Let’s assume we have data. What do we do with it? 

Let’s go back to the YouTube example. You, as a user, can be represented as a vector of characteristics such as [0.4, 0.67, 0.25, … ], where each value says how much you like a certain kind of content (e.g. you like 40% of commentary videos, 67% of funny videos, 25% of science-related videos and so on). Similarly, a video can be represented as a vector. For example, the vector [0.7, 0.9, 0.16, … ] would tell us that 70% of the video’s content is commentary, and so on. The match factor between the two vectors gives the user’s rating for that particular video. Sounds great, but that’s not learning: in this case we already know both the user’s likes and dislikes and the video’s content. So how do we get to that stage? We fill the vectors with completely random values, and the machine then tries to recover the real values using the ratings we already have.
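Here is a minimal sketch of that idea. It assumes the match factor is a plain dot product between the user vector and the video vector (the article does not say which measure is actually used), and that learning means nudging randomly initialised vectors until their match reproduces a rating we already have. The names match and learn_step, the learning rate and the rating of 0.9 are all illustrative.

```python
import random

def match(user, video):
    """Dot product of two equal-length vectors (assumed match factor)."""
    return sum(u * v for u, v in zip(user, video))

def learn_step(user, video, rating, lr=0.1):
    """One gradient step on the squared error (match - rating)**2 / 2."""
    error = match(user, video) - rating
    new_user = [u - lr * error * v for u, v in zip(user, video)]
    new_video = [v - lr * error * u for u, v in zip(user, video)]
    return new_user, new_video

random.seed(0)
user = [random.random() for _ in range(3)]    # starts as random guesses
video = [random.random() for _ in range(3)]   # starts as random guesses
known_rating = 0.9                            # hypothetical rating we already have

for _ in range(200):
    user, video = learn_step(user, video, known_rating)

predicted = match(user, video)
print(round(predicted, 3))
```

With a single rating there are many vectors that fit equally well. It is only when many users rate many videos that the values start to mean something.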

Let’s go through how this ‘learning’ process happens with a model that classifies images.

To illustrate the basics of linear image classification we will use the CIFAR-10 dataset as an example. The CIFAR-10 dataset consists of 60000 32×32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Sample image from the CIFAR-10 dataset

What we are aiming for is to train a model that, upon receiving an image belonging to one of these 10 classes, is able to classify it correctly. We can build a mapping f: X → Y, where X is the vector representation of the image and Y ∈ {0, 1, …, 9} (the value of Y corresponds to one of the 10 classes in the dataset).

First things first, how do we get the vector from an image?

An image is just a grid of numbers between 0 and 255, e.g. 32×32×3 for a 32×32 image with 3 RGB channels. This grid can be reshaped into a 3072-dimensional vector. Since we are describing a linear classifier, we are, as the name suggests, going to use a linear function to classify our images. So the mapping f: X → Y looks like this: Y = g(XW + b). The values in the matrix W are called weights, b is the bias vector, and g is a function we will define in a moment.
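The reshaping step can be sketched in a few lines. The image below is random stand-in data, not a real CIFAR-10 sample, and flatten is a hypothetical helper name.

```python
import random

HEIGHT, WIDTH, CHANNELS = 32, 32, 3

random.seed(0)
# Stand-in for one image: rows x columns x RGB, each value in 0..255.
image = [[[random.randint(0, 255) for _ in range(CHANNELS)]
          for _ in range(WIDTH)]
         for _ in range(HEIGHT)]

def flatten(img):
    """Unroll a height x width x channels grid into one flat list."""
    return [value for row in img for pixel in row for value in pixel]

x = flatten(image)
print(len(x))  # 3072
```

With NumPy arrays the same step is usually a single reshape call.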

Multiplying the 1×3072 image vector by a 3072×10 weight matrix and adding a 1×10 bias vector results in a 10-dimensional ‘scores’ vector. The goal is to choose the 3072×10 matrix in such a way that the biggest value in the scores vector is located at the index corresponding to the image’s class. The function g then simply returns the index of the maximum value in the scores vector.
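As a rough sketch, the scoring step and the function g could look like this. The weights are random, so the predicted class is meaningless at this point. Scaling the pixels to the range 0..1 and drawing small Gaussian weights are assumptions made for the example, not something the article prescribes.

```python
import random

NUM_PIXELS, NUM_CLASSES = 3072, 10

random.seed(1)
# Stand-in image vector, scaled to 0..1 (an assumption for this sketch).
x = [random.randint(0, 255) / 255 for _ in range(NUM_PIXELS)]   # 1 x 3072
W = [[random.gauss(0, 0.01) for _ in range(NUM_CLASSES)]
     for _ in range(NUM_PIXELS)]                                # 3072 x 10
b = [0.0] * NUM_CLASSES                                         # 1 x 10

def scores(x, W, b):
    """Compute xW + b: one score per class."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def g(s):
    """Index of the largest score, i.e. the predicted class."""
    return max(range(len(s)), key=lambda j: s[j])

s = scores(x, W, b)
predicted_class = g(s)
print(len(s), predicted_class)
```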

But how do we produce the matrix and the bias that apparently hold the key to our linear classifier?

As mentioned above, initially our matrices are created by generating random values which then have to be optimized. This, however, brings up another question.

How do we grade how well our model performed, so that any optimization can take place? What we are looking for here is a loss function. A loss function maps decisions to their associated costs; it can also be thought of as an error function.

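Let’s define the loss for a single image as:

$$L_i = \sum_{j \neq y_i} \max(0,\ s_j - s_{y_i} + 1)$$

where i is the index of the current image, j runs over the classes other than the correct one, s is the scores vector, and y_i is the index of the correct class for the current image.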

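Let’s look at an example with three classes, where a vector representing a ship, multiplied with the weights, resulted in the scores vector [3.2, 5.1, -1.7]. Here the first score (3.2) belongs to the correct class, ship, while the highest score (5.1) belongs to the class horse, so our image was incorrectly classified as a horse. Now let’s calculate the loss for this image.

Loss = max(0, 5.1 − 3.2 + 1) + max(0, −1.7 − 3.2 + 1) = max(0, 2.9) + max(0, −3.9) = 2.9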

We do this for all images we want to classify.
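The worked example can be checked with a short function. For every wrong class it adds up how far that class’s score comes within a margin of 1 of the correct class’s score. The name hinge_loss and the margin parameter are illustrative choices.

```python
def hinge_loss(scores, correct_index, margin=1.0):
    """Sum of max(0, s_j - s_correct + margin) over the wrong classes j."""
    correct_score = scores[correct_index]
    return sum(max(0.0, s - correct_score + margin)
               for j, s in enumerate(scores) if j != correct_index)

ship_scores = [3.2, 5.1, -1.7]   # the correct class (ship) is at index 0
loss = hinge_loss(ship_scores, correct_index=0)
print(round(loss, 2))  # 2.9
```

Had the ship score been at least 1 higher than every other score, the loss would have been 0.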

What we want to do is minimize the loss function. To understand how, imagine being in the mountains. The peaks and valleys represent the local maxima and minima of our loss function, and to minimize our cost we have to walk down the slope to a local minimum. This is accomplished with the gradient descent algorithm. The gradient of the loss is the vector of its partial derivatives with respect to the weights, and it points in the direction of the steepest upward slope. Taking small steps in the opposite direction, along the negative gradient, therefore reduces the loss.
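To make this concrete, here is a small sketch of gradient descent on a linear classifier with the loss described above. Everything about the setup is an assumption made for illustration: the toy data has 2 features and 3 classes instead of 3072 pixels and 10 classes, the loss is averaged over the examples, and the learning rate and number of steps are arbitrary.

```python
import random

def scores(x, W, b):
    """Compute xW + b for one example."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def hinge_loss(s, y):
    return sum(max(0.0, s[j] - s[y] + 1.0) for j in range(len(s)) if j != y)

def mean_loss(data, W, b):
    return sum(hinge_loss(scores(x, W, b), y) for x, y in data) / len(data)

def gradient_step(data, W, b, lr=0.1):
    """Move W and b one step against the gradient of the mean loss."""
    dW = [[0.0] * len(b) for _ in W]
    db = [0.0] * len(b)
    for x, y in data:
        s = scores(x, W, b)
        for j in range(len(b)):
            if j != y and s[j] - s[y] + 1.0 > 0:
                # Wrong class j is too close to the correct class y:
                # the gradient raises with s_j and falls with s_y.
                db[j] += 1.0
                db[y] -= 1.0
                for i, xi in enumerate(x):
                    dW[i][j] += xi
                    dW[i][y] -= xi
    n = len(data)
    new_W = [[W[i][j] - lr * dW[i][j] / n for j in range(len(b))]
             for i in range(len(W))]
    new_b = [b[j] - lr * db[j] / n for j in range(len(b))]
    return new_W, new_b

# Hypothetical toy data: (features, class index).
data = [([1.0, 0.0], 0), ([0.9, 0.1], 0),
        ([0.0, 1.0], 1), ([0.1, 0.9], 1),
        ([-1.0, -1.0], 2), ([-0.9, -1.1], 2)]

random.seed(0)
W = [[random.gauss(0, 0.01) for _ in range(3)] for _ in range(2)]  # random start
b = [0.0, 0.0, 0.0]

loss_before = mean_loss(data, W, b)
for _ in range(200):
    W, b = gradient_step(data, W, b)
loss_after = mean_loss(data, W, b)
print(loss_before > loss_after)
```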

It is important to note that, depending on the starting position and the step size, the global minimum of the loss function may never be reached. This is because our weights always move down the steepest nearby slope, while the deepest valley may be located just behind the adjacent peak.
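A one-dimensional toy example shows the effect. The function f below is an arbitrary bumpy curve with two valleys, chosen only for illustration. It has nothing to do with a real classifier’s loss.

```python
def f(x):
    """A 1-D 'landscape' with a deep valley on the left and a shallow one on the right."""
    return x**4 - 3 * x**2 + x

def df(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=2000):
    """Plain gradient descent from starting point x."""
    for _ in range(steps):
        x -= lr * df(x)
    return x

x_from_left = descend(-2.0)    # settles in the deeper valley
x_from_right = descend(2.0)    # gets stuck in the shallower valley
print(f(x_from_left) < f(x_from_right))  # True
```

Both runs stop where the slope is zero, but only the one that started on the left ends up at the lowest point.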

Nowadays, image classifiers are widely used, with a state-of-the-art accuracy of around 95%. Of course, a lot more goes into training a model in order to achieve such results. In our example, we based our output on only one set of weights, but in reality convolutional neural networks (CNNs) have many layers, each one searching for some specific feature in the image. For example, one layer could check whether there are any ears in the image, another could look for a head with ears, and so on.

A great example of an image classification use case can be found in UserTribe’s emotion recognition. UserTribe is a platform that helps businesses gather customer intel about their projects. Lab08 has been working on UserTribe’s platform for almost 2 years now, developing new features and making sure the product stays ahead of its competitors. Through video-recorded product tests, companies can understand how their users feel about them and what they create. In order to create an automated report on someone’s opinion based on a video, UserTribe uses ML to extract sentiment from what is said, but also takes it a step further and analyses the customer’s facial expressions while they interact with the product, adding social anthropology experts into the mix.

Similarly to our ten-class example from the CIFAR-10 dataset, UserTribe recognizes 7 different emotions. Their model has been trained on 30000 images and has achieved the 95% SOTA accuracy when tested on the CK+ dataset. Every video is sampled at 1 fps (frame per second), as detecting facial emotion in every frame would be both redundant and very inefficient. After that, every frame is classified by the model, which consists of 19 convolutional networks with a softmax activation function. Activation functions are what turn a CNN into a non-linear classifier, allowing it to perform more complex tasks. But this is a step too deep for the purposes of this article. So deep, in fact, that it enters the realm not of machine learning but of deep learning, which is a whole new conversation.
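For the curious, the softmax function itself is small enough to sketch. It turns a vector of raw class scores into probabilities that sum to 1. This is a generic illustration, not UserTribe’s code, and the seven scores below are made up.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtracting top avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one frame, one per emotion class.
frame_scores = [1.2, 0.3, -0.5, 2.8, 0.0, -1.1, 0.7]
probabilities = softmax(frame_scores)
predicted = probabilities.index(max(probabilities))
print(predicted)  # 3
```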

Sophia Peneva is a Software Developer at Lab08, working on the UserTribe platform. Previously, she worked for SAP, and this summer she became one of Google’s interns, working on a YouTube algorithm project. She has experience with C++, PHP, Python, Java, MySQL, MongoDB and others. When it comes to deep learning, Sophie now has TensorFlow experience in her pocket. Sophie will be one of the main contributors to our future blog posts on Machine Learning. Stay tuned!

Be sure to follow us on social media to receive updates about other similar content!
