History

We can date the birth of artificial neural networks in 1958, with the introduction of Perceptron 1 by Frank Rosenblatt. It was the first algorithm created to reproduce the biological neuron. Conceptually, the easier perceptron that you might think of is made of a single neuron: when it’s exposed to a stimulus, it provides a binary response, just as would a biological neuron.

This model differs greatly from the neural network involving billions of neurons in a biological brain. Shortly after his birth, the researchers showed the world the problems of Perceptron: in fact, it was quickly proved that perceptrons could not be trained to recognize many classes of input patterns. To get a more powerful network, it was necessary to take advantage of multiple level of units and create a multilayers perceptron, with more intermediates neurons used to solve linearly separable2 subproblems, whose outputs were combined together by the final level to provide a concrete response to original input problem. Even though the Perceptron was just a simple but severely limited binary classifier, it introduced a great innovation: the idea to simulate the basic computational unit of a complex biological system that exists in nature.

Theory

Fundamentally, a neural network is nothing more than a really good function approximator — I mean, you give a trained network as an input vector, it performs a series of operations, and it produces an output vector. To train an ann to estimate an unknown function, the process is really simple: you have to get a training set - a collection of data points - that the network will learn from and generalize on to make future inferences. In a multilayer perceptron data points are forwarded through the network layer-by-layer until they reach the final layer. The final layer’s activations are the predictions that the network actually makes. In this article, I describe how I built with Golang my own perceptron - and then a multilayer perceptron. Let first talk about the representation of the input: all the example codes are from my go-perceptron-go repository.

Base structures - code

To create a neural network, the first thing you have to do is dealing with the definition of data structures. I create a neural package to collect all files related to architecture structure and elements.

Pattern - code

The Pattern struct represent a single input to the Perceptron struct. Look at the code:

// Pattern struct represents one pattern with dimensions and desired value
type Pattern struct {
	Features []float64
	SingleRawExpectation string
	SingleExpectation float64
	MultipleExpectation []float64
}

It satisfies our needs with only four fields:

  • Features is a slice of 64 bit float and this is perfect to represent input dimension,
  • SingleRawExpectation is a string and is filled by parser with input classification (in terms of belonging class),
  • SingleExpectation is a 64 bit float representation of the class which the pattern belongs,
  • MultipleExpectation is a slice of 64 bit float and it is used for multiple class classification problems;
Neuron - code

The NeuronUnit struct represent a single computation unit. Look at the code:

// NeuronUnit struct represents a simple NeuronUnit network with a slice of n weights.
type NeuronUnit struct {
	Weights []float64
	Bias float64
	Lrate float64
	Value float64
	Delta float64
}

A neuron corresponds to the simple binary perceptron originally proposed by Rosenblat. It is made of:

  • Weights, a slice of 64 bit float to represent the way each dimensions of the pattern is modulated,
  • Bias, a 64 bit float that represents NeuronUnit natural propensity to spread signal,
  • Lrate, a 64 bit float that represents learning rate of neuron,
  • MultipleExpectation, a 64 bit float that represents the desired value when I load the input pattner into network in Multi NeuralLayer Perceptron,
  • Delta, a 64 bit float that mantains error during execution of training algorithm (later);
Perceptron

As I said, the single perceptron schema is implemented by a single neuron. The easiest way to implement this simple classifier is to establish a threshold function, insert it into the neuron, combine the values (eventually using different weights for each of them) that describe the stimulus in a single value, provide this value to the neuron and see what it returns in output. The schema show how it works:

Metric

Why weights? What does it mean the expression dimension modulation of the the input? Well, training conceptually is “the process of learning the skills you need to do a particular job or activity”. But how do you know if you’re getting better, or if you are learning the skills you need? Of course, you need a metric of how good or bad you’re doing. Also in ANN there’s a metric generally called cost function. Suppose we want to change a certain wi weight of the network. More or less, the cost function looks at the function the network has inferred and uses it to estimate values for the data points in the training set. The difference between the outputs of the network and the training set data points are the main values for the cost function. When training your network, the goal is to get the value of this cost function as low as possible. The most basic of the training algorithms is the gradient descent. Suppose we can calculate the error E according to the variation of the weight value wi: we are therefore able to draw the graph in a graph like the one in the figure.

Therefore, if we calculate the derivative of this function, we can understand how the variation of the weight makes a positive or negative contribution to the error. In practice, whatever the derived value, we can use a single weight correction function that decrease the involved weight of derived quantity (modulated by learning rate). Despite the fact that it’s quite impossible, for any network or cost function, to be truly convex, the gradient descent follows the derivatives computed for each neuron unit to essentially “roll” down the slope until it finds its way to the center - as close as possible to the global minimum. Before continuing, let’s take a step back.

Why multilayer? The linearly separable problems

The problem with the binary perceptron made with a single neuron is the inability to handle non-linearly separable problems: these kind of problems are the ones in which, in other words, it’s impossible to define an hyperplane able to separate, in the vector space of the inputs, those that require a positive output from those requiring a negative output. An example of three non-collinear points belonging to two different classes (’+’ and ‘-’) are always linearly separable in two dimensions. This is illustrated by the first three examples in the following figure:

However, not all sets of four points, no three collinear, are linearly separable in two dimensions. The fourth image would need two straight lines and thus is not linearly separable. This is the main reason scientist start working with multilayers at the very beginning. Let’s move one step forward, introducing the NeuralLayer struct.

Neural Layer - code

The NeuralLayer struct represents a network layer with a slice of n NeuronUnits.

type NeuralLayer struct {
	NeuronUnits []NeuronUnit
	Length int
}

where:

  • NeuronUnits represents NeuronUnits in layer,
  • Length represents number of NeuronUnit in layer;

Now that we are able to build layers of neurons, we can define the MultiLayerNetwork struct.

Multilayer Perceptron - code
type MultiLayerNetwork struct {
	L_rate float64
	NeuralLayers []NeuralLayer
	T_func transferFunction
	T_func_d transferFunction
}

where:

  • NeuralLayers represents layer of neurons,
  • Length represents learning rate of neuron,
  • T_func and T_func_d represents the transferFunction and its derivative;

Inside the MultiLayerNetwork struct there’s an algorithm to create multilayer perceptron: if you pass a struct with NeuralLayers [4, 3, 3], you can define a network struct with 3 layer: input, hidden, output, with respectively 4, 3 and 3 neurons, as shown in the figure below.

The piece of code that handle network creation is the following:

// ... the following is in the neuralLayer.go

// PrepareLayer create a NeuralLayer with n NeuronUnits inside
// [n:int] is an int that specifies the number of neurons in the NeuralLayer
// [p:int] is an int that specifies the number of neurons in the previous NeuralLayer
// It returns a NeuralLayer object
func PrepareLayer(n int, p int) (l NeuralLayer) {

	l = NeuralLayer{NeuronUnits: make([]NeuronUnit, n), Length: n}

	for i := 0; i < n; i++ {
		RandomNeuronInit(&l.NeuronUnits[i], p)
	}

	log.WithFields(log.Fields{
		"level":   "info",
		"msg":     "multilayer perceptron init completed",
		"neurons": len(l.NeuronUnits),
		"lengthPreviousLayer": l.Length,
	}).Info("Complete NeuralLayer init.")

	return

}

// ... the following is in the multiLayerNetwork.go

// PrepareMLPNet create a multi layer Perceptron neural network.
// [l:[]int] is an int array with layers neurons number [input, ..., output]
// [lr:int] is the learning rate of neural network
// [tr:transferFunction] is a transfer function
// [tr:transferFunction] the respective transfer function derivative
func PrepareMLPNet(l []int, lr float64, tf transferFunction, trd transferFunction) (mlp MultiLayerNetwork) {

	// setup learning rate and transfer function
	mlp.L_rate = lr
	mlp.T_func = tf
	mlp.T_func_d = trd

	// setup layers
	mlp.NeuralLayers = make([]NeuralLayer, len(l))

	// for each layers specified
	for il, ql := range l {

		// if it is not the first
		if il != 0 {

			// prepare the GENERIC layer with specific dimension and correct number of links for each NeuronUnits
			mlp.NeuralLayers[il] = PrepareLayer(ql, l[il-1])

		} else {

			// prepare the INPUT layer with specific dimension and No links to previous.
			mlp.NeuralLayers[il] = PrepareLayer(ql, 0)

		}

	}

	log.WithFields(log.Fields{
		"level":     "info",
		"msg":       "multilayer perceptron init completed",
		"layers":  len(mlp.NeuralLayers),
		"learningRate: ": mlp.L_rate,
	}).Info("Complete Multilayer Perceptron init.")

	return

}

For classification problems the input layers has to be define with a number of neurons that match features of pattern shown to network. Of course, the output layer should have a number of unit equals to the number of class in training set.

NOTE: from the architectural point of view an interesting theorem guarantee that given a sufficient number of hidden units, everything that can be solved by a multilayer network at n levels can also be solved by a two-level network. Therefore in examples we will limit ourselves to using only two levels.

BackPropagation Algorithm - code

The learning algorithm can be divided into two phases: propagation and weight update.

Propagation - 1 of 2

Each propagation involves the following steps:

  • the propagation forward through the network to generate the output value(s) is done by Execute function,
  • the calculation of the cost (error term) is done here at the very beginning of the BackPropagate function,
  • the propagation of the output activations back through the network, using the training pattern target in order to generate the deltas (the difference between the targeted and actual output values) of all output and hidden neurons, done of coure BackPropagate function;

First, let’s have a look to the Execute function.

// Execute a multi layer Perceptron neural network.
// [mlp:MultiLayerNetwork] multilayer perceptron network pointer, [s:Pattern] input value
// It returns output values by network
func Execute(mlp *MultiLayerNetwork, s *Pattern, options ...int) (r []float64) {

	// new value
	nv := 0.0

	// result of execution for each OUTPUT NeuronUnit in OUTPUT NeuralLayer
	r = make([]float64, mlp.NeuralLayers[len(mlp.NeuralLayers)-1].Length)

	// show pattern to network =>
	for i := 0; i < len(s.Features); i++ {

		// setup value of each neurons in first layers to respective features of pattern
		mlp.NeuralLayers[0].NeuronUnits[i].Value = s.Features[i]

	}

	// execute - hiddens + output
	// for each layers from first hidden to output
	for k := 1; k < len(mlp.NeuralLayers); k++ {

		// for each neurons in focused level
		for i := 0; i < mlp.NeuralLayers[k].Length; i++ {

			// init new value
			nv = 0.0

			// for each neurons in previous level (for k = 1, INPUT)
			for j := 0; j < mlp.NeuralLayers[k - 1].Length; j++ {

				// sum output value of previous neurons multiplied by weight between previous and focused neuron
				nv += mlp.NeuralLayers[k].NeuronUnits[i].Weights[j] * mlp.NeuralLayers[k - 1].NeuronUnits[j].Value

				log.WithFields(log.Fields{
					"level":     "debug",
					"msg":       "multilayer perceptron execution",
					"len(mlp.NeuralLayers)":  len(mlp.NeuralLayers),
					"layer:  ": k,
					"neuron: ": i,
					"previous neuron: ": j,
				}).Debug("Compute output propagation.")

			}

			// add neuron bias
			nv += mlp.NeuralLayers[k].NeuronUnits[i].Bias

			// compute activation function to new output value
			mlp.NeuralLayers[k].NeuronUnits[i].Value = mlp.T_func(nv)

			log.WithFields(log.Fields{
				"level":     "debug",
				"msg":       "setup new neuron output value after transfer function application",
				"len(mlp.NeuralLayers)":  len(mlp.NeuralLayers),
				"layer:  ": k,
				"neuron: ": i,
				"outputvalue" : mlp.NeuralLayers[k].NeuronUnits[i].Value,
			}).Debug("Setup new neuron output value after transfer function application.")

		}

	}


	// get ouput values
	for i := 0; i < mlp.NeuralLayers[len(mlp.NeuralLayers)-1].Length; i++ {

		// simply accumulate values of all neurons in last level
		r[i] = mlp.NeuralLayers[len(mlp.NeuralLayers)-1].NeuronUnits[i].Value

	}

	return r

}

Basically, what Execute function does is computing the result of execution for each output NeuronUnit in output NeuralLayer. In order, it first inserts input to input NeuralLayer of the network, assigning the values of the dimensions (Features field) of each pattern to values (Value field) of each NeuronUnit in input layer (mlp.NeuralLayers[0]); after that, for each layers from first hidden to output, and for each neurons in the previous level and the current, execution algorithm computes the sum of multiplication between the weight that links two involved neurons and the (output) computed in the step before of the previous neuron - this is the meaning of most internal for. Then, the bias - natural propension to activation - of the neuron is added to the quantity nv, and output value of the current neuron in the current neural layer is updated with the activation function computed passing this quantity nv as parameter. The last for simply accumulate values of all neurons in last level and return the result. To summarize, this algorithm makes the input flow through the network, using weights to modulate the various dimensions that describe it and the activation functions to calculate the response of each neuron. In the end, the values accumulated in the neurons of the last level are returned.

Back to the BackPropagate, we already said it starts executing the network. The idea is to get the value accumulated in the neurons of the last level, to compute the error accumulated retracing the various steps backwards. With the (uncorrect) assumption of a convex function, we can imagine that solving the weight update task backwards, by calculating the derivative of the activation function, is a good way to go down towards the global optimum. In reality, there is no guarantee of not being stuck in a false minimum, and this depends on the characteristics of the function and (most likely) also on the architecture chosen for our ann.

Weights update:

Weight update - 2 of 2

For each weight in the network, the following steps must be followed:

  • the weight’s output delta and input activation are multiplied to find the gradient of the weight,
  • a percentage (modulated by learning rate) of the weight’s gradient is subtracted from the weight;

The learning rate influences the speed and quality of learning. The greater it is, the faster the neuron trains, but the lower it is, the more accurate the training is. The sign of the gradient of a weight indicates whether the error varies directly with, or inversely to, the weight. Therefore, the weight must be updated in the opposite direction - this is the reason of the name gradient descent.

// BackPropagation algorithm.
// [mlp:MultiLayerNetwork] input value		[s:Pattern] input value (scaled between 0 and 1)
// [o:[]float64] expected output value (scaled between 0 and 1)
// return [r:float64] delta error between generated output and expected output
func BackPropagate(mlp *MultiLayerNetwork, s *Pattern, o []float64, options ...int) (r float64) {

	var no []float64;
	// execute network with pattern passed over each level to output
	if len(options) == 1 {
		no = Execute(mlp, s, options[0])
	} else {
		no = Execute(mlp, s)
	}

	// init error
	e := 0.0

	// compute output error and delta in output layer
	for i := 0; i < mlp.NeuralLayers[len(mlp.NeuralLayers)-1].Length; i++ {

		// compute error in output: output for given pattern - output computed by network
		e = o[i] - no[i]

		// compute delta for each neuron in output layer as:
		// error in output * derivative of transfer function of network output
		mlp.NeuralLayers[len(mlp.NeuralLayers)-1].NeuronUnits[i].Delta = e * mlp.T_func_d(no[i])

	}

	// backpropagate error to previous layers
	// for each layers starting from the last hidden (len(mlp.NeuralLayers)-2)
	for k := len(mlp.NeuralLayers)-2; k >= 0; k-- {

		// compute actual layer errors and re-compute delta
		for i := 0; i < mlp.NeuralLayers[k].Length; i++ {

			// reset error accumulator
			e = 0.0

			// for each link to next layer
			for j := 0; j < mlp.NeuralLayers[k + 1].Length; j++ {

				// sum delta value of next neurons multiplied by weight between focused neuron and all neurons in next level
				e += mlp.NeuralLayers[k + 1].NeuronUnits[j].Delta * mlp.NeuralLayers[k + 1].NeuronUnits[j].Weights[i]

			}

			// compute delta for each neuron in focused layer as error * derivative of transfer function
			mlp.NeuralLayers[k].NeuronUnits[i].Delta = e * mlp.T_func_d(mlp.NeuralLayers[k].NeuronUnits[i].Value)

		}

		// compute weights in the next layer
		// for each link to next layer
		for i := 0; i < mlp.NeuralLayers[k + 1].Length; i++ {

			// for each neurons in actual level (for k = 0, INPUT)
			for j := 0; j < mlp.NeuralLayers[k].Length; j++ {

				// sum learning rate * next level next neuron Delta * actual level actual neuron output value
				mlp.NeuralLayers[k + 1].NeuronUnits[i].Weights[j] +=
					mlp.L_rate * mlp.NeuralLayers[k + 1].NeuronUnits[i].Delta * mlp.NeuralLayers[k].NeuronUnits[j].Value

			}

			// learning rate * next level next neuron Delta * actual level actual neuron output value
			mlp.NeuralLayers[k + 1].NeuronUnits[i].Bias += mlp.L_rate * mlp.NeuralLayers[k + 1].NeuronUnits[i].Delta

		}

	}

	// compute global errors as sum of abs difference between output execution for each neuron in output layer
	// and desired value in each neuron in output layer
	for i := 0; i < len(o); i++ {

		r += math.Abs(no[i] - o[i])

	}

	// average error
	r = r / float64(len(o))

	return

}

After execution step, BackPropagate function starts computing output error and delta for the output level. The delta for a given neuron can be calculated as follows:

delta = (expected - output) * transfer\_derivative(output)

where expected is the expected output value (o[i]) for the neuron and output is the output value for the neuron (no[i]) computed by the Execution step (the first operation is done in the code by e = o[i] - no[i] operation). Then, the transfer_derivative() calculates the slope of the neuron’s output value and the algorithm save this value to the delta fields of each of the neurons (not only in the oupput layers): this is done because the layers of the network are iterated in reverse order - or backwards, as it is shown by the k-for (k–) - starting at the output and working backwards. This ensures that the neurons in the output layer have errors values calculated first that neurons in the hidden layer can use in the subsequent iteration.

In the hidden layer, things are a little more complicated. The error signal for a neuron in the hidden layer is computed as the weighted error of each neuron in the output layer. Think of the error traveling back along the weights of the output layer to the neurons in the hidden layer: the back-propagated error signal is accumulated and then used to determine the error for the neuron in the hidden layer, as follows:

delta\_i = accumulated(weight\_i * delta\_j) * transfer\_derivative(output)

where delta_j is the error signal from the j_th neuron in the output layer, weight_i is the weight that connects the i_th neuron of the output layer to the current neuron, and output is the output of the current neuron3. After that there is the network layers weights update, that follow this rules

weight\_i = weight\_i + (learning_rate * delta\_j * input)

Finally, the errors (as the abs difference between expcted minus computed) accumulated in the neurons of the last level are returned. Wait a minute: where is the training algorithm?

Training Algorithm

Look at the code below! Basically, what it does is running for a fixed amount of epochs the BackPropagate function.

// MLPTrain train a mlp MultiLayerNetwork with BackPropagation algorithm for assisted learning.
func MLPTrain(mlp *MultiLayerNetwork, patterns []Pattern, mapped []string, epochs int) {

	epoch := 0
	output := make([]float64, len(mapped))

	// for fixed number of epochs
	for {

		// for each pattern in training set
		for _, pattern := range patterns {

			// setup desired output for each unit
			for io, _ := range output {
				output[io] = 0.0
			}
			// setup desired output for specific class of pattern focused
			output[int(pattern.SingleExpectation)] = 1.0
			// back propagation
			BackPropagate(mlp, &pattern, output)

		}

		log.WithFields(log.Fields{
			"level":             "info",
			"place":             "validation",
			"method":            "MLPTrain",
			"epoch":        	 epoch,
		}).Debug("Training epoch completed.")

		// if max number of epochs is reached
		if epoch > epochs {
			// exit
			break
		}
		// increase number of epoch
		epoch++

	}

}

Thank you everybody for reading!


  1. F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, pages 65–386, 1958. (cit. a p. 5). ↩︎

  2. This condition describes the situation in which there exists a hyperplane able to separate, in the vector space of the inputs, those that require a positive output from those requiring a negative output. ↩︎

  3. I do not want to bore you with maths, if you want to read more maths details here↩︎