Universal function approximators a.k.a Deep Neural Networks (part 2)

Duminda Wijesinghe
Published in DataDrivenInvestor
6 min read · Oct 23, 2018

Boats on the Beach at Pourville, 1882 by Claude Monet

This is part 2 of the Universal function approximators series; part 1 is also available for reading.

I think you are familiar with the euphoric feeling of understanding how things work. So, let’s explore the inner workings of an artificial neural network.

Artificial neuron

These are the building blocks of neural networks. A bunch of these neurons can solve complex problems, yet the basic concept is quite simple.

Artificial neuron

x1, x2, …, xn are the inputs. Each input is multiplied by a weight (w1, w2, …, wn), then the sum of the weighted inputs is calculated and passed through an activation function (f).

The sentence above can be written mathematically like this.
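The computation described above can be sketched in a few lines of Python. The function name and example values below are hypothetical, chosen only to illustrate the weighted-sum-plus-activation pattern:

```python
def neuron(inputs, weights, activation):
    # multiply each input by its weight, sum them up,
    # then pass the result through the activation function
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)

# with the identity activation, the neuron simply outputs the weighted sum
output = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.25], lambda s: s)
```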

Weights

Weights are the parameters that determine how large a part each corresponding input plays in shaping the output.

In neural networks, these weights are initialized at the beginning of training and optimized throughout the training phase.

Weighted sum

The sum of all the inputs multiplied by their corresponding weights.

Weighted sum

Activation function

Given a weighted sum, the activation function determines what value it should output, in other words, whether the neuron should fire or not.

In neural networks there are many activation functions to select from.

Each function has its strengths. We will use Identity and ReLU activation functions in future examples.

That concludes the simplest form of an artificial neuron.

A neuron sitting by itself won’t do much in an application; it has to learn from the given data.

How does an artificial neuron learn ?

Let’s say we need our neuron to learn the relationship between the number pairs above. Since we have only one input, our neuron looks like this.

For this guy, learning means finding a value for w that most accurately represents the relationship between the number pairs. Let’s run our number series through the neuron.

When the weight is 1, we are off by 2 from the expected output. We could refer to this value as the cost of this operation.

For this setting of weights, the cost of our network is 30. After all, this is a neural network with a single neuron :) I know, it’s debatable whether it’s a “network”. Anyway, we can annotate this with the following cost function, yi being the target value and ŷi (y hat) being the network’s output.

Cost function

The cost function measures the difference between the target value and the network’s output. Different cost functions imply different notions of the network’s cost. Essentially, what a cost function does is determine how far the network’s output is from the target value. Keep in mind that the above cost function is defined for the sake of simplicity; an actual cost function would look like this. For me, the annotations and the indexing are the most confusing part.
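The simple cost function can be sketched in Python. The number pairs below are hypothetical, assuming the data follows y = 3x (which makes a weight of 3 give zero cost, as described above); with a weight of 1, this signed-difference cost reproduces the value of 30:

```python
def cost(targets, outputs):
    # simple signed difference between targets and the network's outputs
    return sum(y - y_hat for y, y_hat in zip(targets, outputs))

targets = [3, 6, 9, 12, 15]        # assumed pairs following y = 3x
outputs = [1, 2, 3, 4, 5]          # neuron outputs when the weight is 1
total_cost = cost(targets, outputs)
```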

Now it’s all about minimizing the cost of the network. Lower cost means that the network has a good approximation for the relationship(function) between number pairs.

What affects the cost? Seemingly the weight.

Let’s get back to our former formula.

We need to figure out what nudges to make to wi to minimize the cost. Should we nudge it up a bit or down?

Let’s try to visualize this with a little bit of Python and see how it goes.
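The original code snippet is not embedded here, so below is a sketch of what it might look like, assuming the number pairs follow y = 3x and the simple signed-difference cost function from above:

```python
import matplotlib.pyplot as plt

xs = [1, 2, 3, 4, 5]               # hypothetical inputs
ys = [3, 6, 9, 12, 15]             # targets, assuming y = 3x

# compute the cost for a range of candidate weights
weights = list(range(-10, 11))
costs = [sum(y - w * x for x, y in zip(xs, ys)) for w in weights]

plt.plot(weights, costs)
plt.xlabel("weight")
plt.ylabel("cost")
plt.show()
```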

Executing the above code in a Jupyter notebook will draw the following graph.

Cost function outputs against weights

We can clearly see that when the weight is 3, the cost becomes zero, and that should be our perfect weight, but our graph doesn’t represent it that well. Although the cost seems to be decreasing as the weight increases, that doesn’t mean a cost of 100 is any better than -100. So we need a better cost function to represent how far off we are from the expected value.

Let’s change the cost function of the above code to mean squared error.

MSE (mean squared error)

As we can see, the cost function (MSE) is now a clearer representation of how far off we are from an expected value.
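Swapping in MSE only changes the cost line. A minimal sketch, using the same assumed number pairs as before:

```python
def mse(targets, outputs):
    # mean of the squared differences: always non-negative,
    # and larger errors are penalized more heavily
    n = len(targets)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(targets, outputs)) / n

# outputs for weight 1 against targets assumed to follow y = 3x
error = mse([3, 6, 9, 12, 15], [1, 2, 3, 4, 5])
```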

In the above example, our perfect weight happened to be in the selected range of weights, but what if it is not, as represented in the graph below?

Most of the time, going through a series of weight values won’t get us near the minimum of the cost function (where our perfect weight lies). We need a better mechanism for navigating through weights until we find the right one.

Let’s recall differentiation from the calculus class.

Differentiation is the action of computing a derivative. The derivative of a function y = f(x) of a variable x is a measure of the rate at which the value y of the function changes with respect to the change of the variable x. It is called the derivative of f with respect to x.

We can find the derivative of the cost function with respect to the weight, and it will tell us the slope of the cost function at a given weight (read this great article for a better understanding). If the slope is negative we should increase the weight, and if it is positive we should decrease it. How much we should increase or decrease depends on the size of the slope: as we approach the minimum, the slope gets smaller, so we should take smaller steps, or we could overshoot the minimum. This method of navigating through weights is called gradient descent.

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

gradient descent
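The idea above can be sketched for our single-weight neuron with the MSE cost. The data, learning rate, and iteration count are illustrative assumptions; for MSE, the derivative with respect to w works out to -(2/n) · Σ x(y - wx):

```python
xs = [1, 2, 3, 4, 5]
ys = [3, 6, 9, 12, 15]       # assumed pairs following y = 3x

w = 0.0                      # initial weight
learning_rate = 0.01
n = len(xs)

for _ in range(1000):
    # gradient of MSE with respect to w
    grad = -(2 / n) * sum(x * (y - w * x) for x, y in zip(xs, ys))
    # step proportional to the negative of the gradient:
    # big slope -> big step, small slope near the minimum -> small step
    w -= learning_rate * grad

# w converges toward 3, the perfect weight
```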

So far we have discussed how an artificial neuron is made and how it optimizes itself with the help of gradient descent, but everything we discussed above is overly simplified. There are many things we are leaving out, like backpropagation, but hopefully by now we get the concept.

Imagine how overwhelmingly complex it would be if we considered multi-layer neural networks (deep neural networks). Remember, we only considered a single neuron, not even a single layer.

There are libraries to handle all these complexities for us. The following are some open source deep learning libraries with the most GitHub stars and contributors.

  1. TensorFlow
  2. Keras
  3. Caffe
  4. PyTorch
  5. Theano
  6. dlib

Remember the function we wanted to approximate only by looking at its inputs and outputs in Part 1? We will use Keras with a TensorFlow backend to solve that in our next example.

So, let’s jump into the Python code that makes a deep neural net.

I’m ending this post here and will continue in the next post.

Until next time, keep hacking ;)
