Hi! I recently finished a course on neural networks. It was a good course, but short on practice. So in this post we'll review, implement, and test a restricted Boltzmann machine (RBM), a stochastic, generative neural network model. We'll train it with the Contrastive Divergence algorithm (CD-k), developed by Prof. Geoffrey Hinton, who, by the way, teaches that very course. We'll test the model on a set of typed English letters. The next post will deal with one of the drawbacks of the backpropagation algorithm and a way to initialize weights using a Boltzmann machine. Those who are not afraid of formulas and walls of text, welcome under the cut.

#### A very brief history

I won’t dwell on the origin story of the restricted Boltzmann machine (RBM); I’ll just mention that it all started with recurrent neural networks, which are networks with feedback that are extremely difficult to train. Because of this slight difficulty in training, people began to invent more *restricted* recurrent models for which simpler learning algorithms could be applied. One such model was the Hopfield network; Hopfield also introduced the notion of network energy, comparing the dynamics of a neural network to thermodynamics:

$$E = -\frac{1}{2}\sum_{i,j} w_{ij} s_i s_j - \sum_i b_i s_i$$

where *w* is the weight between neurons, *b* is a neuron bias, and *s* is a neuron state.

The next step on the road to RBM was the ordinary Boltzmann machine. It differs from the Hopfield network in that it is stochastic in nature, and its neurons are divided into two groups describing visible and hidden states (by analogy with hidden Markov models; by the way, Hinton admits in his lectures that he borrowed the word "hidden" from there). The energy of a Boltzmann machine is expressed as follows:

$$E(v,h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j - \sum_{i<k} v_i L_{ik} v_k - \sum_{j<l} h_j J_{jl} h_l$$

where the last two sums run over the connections within the visible and hidden groups respectively.

The Boltzmann machine is a fully connected undirected graph, and thus any two vertices from the same group depend on each other.

But if we remove the links within each group, so that the result is a bipartite graph, we get the structure of the RBM model.

The peculiarity of this model is that for a given state of neurons of one group, the states of neurons of another group will be independent of each other. Now we can move on to the theory where this property plays the key role.

#### Interpretation and Purpose

RBMs are interpreted similarly to hidden Markov models. We have a number of states we can observe (visible neurons, which provide the interface to the external environment) and a number of states that are hidden, whose values we cannot see directly (hidden neurons). But we can draw probabilistic conclusions about the hidden states based on the states we can observe. Having trained such a model, we also gain the ability to draw conclusions about visible states given hidden ones (nobody has repealed Bayes' theorem =), and thus to generate data from the probability distribution on which the model was trained.

Thus we can formulate the goal of training the model: the model parameters must be adjusted so that the vector reconstructed from the initial state is as close as possible to the original. By a reconstructed vector we mean a vector obtained by probabilistic inference from the hidden states, which in turn were obtained by probabilistic inference from the observable states, i.e., from the original vector.

#### Theory

Let us introduce the following notation:

- *w_ij* is the weight between the i-th visible and the j-th hidden neuron
- *a_i* is the bias of a visible neuron
- *b_j* is the bias of a hidden neuron
- *v_i* is the state of a visible neuron
- *h_j* is the state of a hidden neuron

We will consider a training set consisting of binary vectors, though this is easily generalized to vectors of real numbers. Suppose we have *n* visible and *m* hidden neurons. Let us introduce the notion of energy for the RBM:

$$E(v,h) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n}\sum_{j=1}^{m} v_i w_{ij} h_j$$

The neural network computes the joint probability of every possible pair *v* and *h* as follows:

$$p(v,h) = \frac{e^{-E(v,h)}}{Z}$$

where *Z* is the partition function (normalizer) of the following form (suppose there are N images v and M images h in total):

$$Z = \sum_{k=1}^{N} \sum_{l=1}^{M} e^{-E(v_k, h_l)}$$

Obviously, the total probability of a vector *v* is obtained by summing over all *h*:

$$p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)}$$

Consider the probability that, for a given *v*, one of the hidden states is *h_k = 1*. Picture that single neuron: let the energy of the system with the neuron at 1 be E1, and at 0 be E0. Then the probability of 1 equals e^{-E1}/(e^{-E1} + e^{-E0}) = σ(E0 - E1), i.e., the sigmoid function of the linear combination of the current neuron's weights with the state vector of the observed layer, plus the bias:

$$p(h_k = 1 \mid v) = \sigma\left(b_k + \sum_{i=1}^{n} v_i w_{ik}\right)$$

And since for a given *v* all the *h_k* are independent of each other:

$$p(h \mid v) = \prod_{k=1}^{m} p(h_k \mid v)$$

A similar conclusion holds for the probability of *v* given *h*:

$$p(v_k = 1 \mid h) = \sigma\left(a_k + \sum_{j=1}^{m} w_{kj} h_j\right), \qquad p(v \mid h) = \prod_{k=1}^{n} p(v_k \mid h)$$
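The formulas above are easy to sanity-check numerically. Below is a small Python sketch (the article's own code is in C#; this is just an illustration with names of my choosing) that builds a tiny RBM, computes the partition function by brute-force enumeration, and verifies that the probabilities sum to one and that the conditional probability of a hidden unit matches the sigmoid formula:

```python
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny RBM: n = 2 visible, m = 2 hidden neurons, arbitrary parameters.
rng = np.random.default_rng(0)
n, m = 2, 2
w = rng.normal(size=(n, m))   # weights w_ij
a = rng.normal(size=n)        # visible biases a_i
b = rng.normal(size=m)        # hidden biases b_j

def energy(v, h):
    """E(v,h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i w_ij h_j"""
    return -a @ v - b @ h - v @ w @ h

# Partition function by brute-force enumeration of all (v, h) pairs.
states = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n)]
Z = sum(np.exp(-energy(v, h)) for v in states for h in states)

# p(v) is obtained by summing the joint probability over all h.
def p_v(v):
    return sum(np.exp(-energy(v, h)) for h in states) / Z

assert abs(sum(p_v(v) for v in states) - 1.0) < 1e-12  # probabilities sum to 1

# Check that p(h_0 = 1 | v) from enumeration equals the sigmoid formula.
v = np.array([1.0, 0.0])
joint = {tuple(h): np.exp(-energy(v, h)) / Z for h in states}
p_h_given_v = sum(p for h, p in joint.items() if h[0] == 1) / sum(joint.values())
assert abs(p_h_given_v - sigmoid(b[0] + v @ w[:, 0])) < 1e-12
```

Brute force is only feasible because the toy model has four visible and four hidden configurations; the whole point of the derivation that follows is to avoid computing *Z* for real models.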

As we recall, the goal of training is to make the reconstructed vector as close as possible to the original, or, in other words, to maximize the probability *p(v)*. For this we need the partial derivatives of the probability with respect to the model parameters. Let us begin by differentiating the energy with respect to them:

$$\frac{\partial E(v,h)}{\partial w_{ij}} = -v_i h_j, \qquad \frac{\partial E(v,h)}{\partial a_i} = -v_i, \qquad \frac{\partial E(v,h)}{\partial b_j} = -h_j$$

We will also need the derivatives of the exponent of the negative energy:

$$\frac{\partial e^{-E(v,h)}}{\partial w_{ij}} = v_i h_j\, e^{-E(v,h)}$$

Differentiate the partition function:

$$\frac{\partial Z}{\partial w_{ij}} = \sum_{v}\sum_{h} v_i h_j\, e^{-E(v,h)}$$

We are almost there -) Consider the partial derivative of the probability of *v* with respect to a weight:

$$\frac{\partial p(v)}{\partial w_{ij}} = \frac{1}{Z}\sum_{h} v_i h_j\, e^{-E(v,h)} - \frac{\sum_{h} e^{-E(v,h)}}{Z^2}\,\frac{\partial Z}{\partial w_{ij}}$$

Obviously, maximizing the probability is equivalent to maximizing its logarithm, and this trick will lead to a nicer solution. Consider the derivative of the natural logarithm of the probability with respect to a weight:

$$\frac{\partial \ln p(v)}{\partial w_{ij}} = \frac{1}{p(v)}\,\frac{\partial p(v)}{\partial w_{ij}}$$

Combining the last two formulas, we get the following expression:

$$\frac{\partial \ln p(v)}{\partial w_{ij}} = \frac{\sum_{h} v_i h_j\, e^{-E(v,h)}}{\sum_{h} e^{-E(v,h)}} - \frac{1}{Z}\sum_{v'}\sum_{h} v'_i h_j\, e^{-E(v',h)}$$

Dividing the numerator and denominator of each term by *Z* and passing to probabilities, we obtain the following expression:

$$\frac{\partial \ln p(v)}{\partial w_{ij}} = \sum_{h} v_i h_j\, p(h \mid v) - \sum_{v'}\sum_{h} v'_i h_j\, p(v', h)$$

And by the definition of mathematical expectation we get the final expression (M before the square bracket denotes the expectation):

$$\frac{\partial \ln p(v)}{\partial w_{ij}} = M\big[v_i h_j\big]_{p(h \mid v)} - M\big[v_i h_j\big]_{p(v,h)}$$

The derivatives with respect to the neuron biases are obtained in a similar way:

$$\frac{\partial \ln p(v)}{\partial a_i} = v_i - M\big[v_i\big]_{p(v)}, \qquad \frac{\partial \ln p(v)}{\partial b_j} = M\big[h_j\big]_{p(h \mid v)} - M\big[h_j\big]_{p(v,h)}$$

And we end up with three rules for updating the model parameters, where *η* is the learning rate:

$$w_{ij} \leftarrow w_{ij} + \eta\left(M\big[v_i h_j\big]_{data} - M\big[v_i h_j\big]_{model}\right)$$
$$a_i \leftarrow a_i + \eta\left(M\big[v_i\big]_{data} - M\big[v_i\big]_{model}\right)$$
$$b_j \leftarrow b_j + \eta\left(M\big[h_j\big]_{data} - M\big[h_j\big]_{model}\right)$$

*PS: I may have made a mistake in the indices somewhere; there is no built-in formula editor =)*

#### Contrastive Divergence Learning Algorithm

This algorithm was invented by Professor Hinton in 2002, and it is notable for its simplicity. The main idea is that the mathematical expectations are replaced by concrete sampled values. The notion of a sampling process is introduced (Gibbs sampling; to see the connection, you can look here). The CD-k process looks like this:

- the states of the visible neurons are set to the input image
- the probabilities of the hidden layer states are computed
- each hidden-layer neuron is assigned the state "1" with probability equal to its computed activation probability
- the probabilities of the visible layer are computed from the hidden one
- if the current iteration is less than k, return to step 2
- the probabilities of the hidden layer states are computed
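The steps above can be sketched compactly. The article's implementation is in C#; here is a minimal NumPy sketch of one CD-k gradient estimate (the function name `cd_k` is mine; following section 3.2 of Hinton's guide, the visible layer is not sampled, and the positive statistics here use hidden probabilities, one of the variants the guide allows):

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k(v0, w, a, b, k=1):
    """One CD-k gradient estimate for a single binary training vector v0.

    Returns (nabla_w, nabla_a, nabla_b): positive phase minus negative phase.
    """
    # step 1: clamp the visible units to the input image
    v = v0.astype(float)
    # step 2: probabilities of the hidden layer given the visible layer
    h_prob = sigmoid(b + v @ w)
    # positive phase, collected at the start of the chain
    pos_w = np.outer(v, h_prob)
    pos_a, pos_b = v.copy(), h_prob.copy()
    for _ in range(k):
        # step 3: sample binary hidden states from their probabilities
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        # step 4: reconstruct visible probabilities from the hidden sample
        v = sigmoid(a + w @ h)
        # steps 5-6: recompute hidden probabilities (loop again if iterations remain)
        h_prob = sigmoid(b + v @ w)
    # negative phase taken from the end of the chain
    neg_w = np.outer(v, h_prob)
    return pos_w - neg_w, pos_a - v, pos_b - h_prob

# toy usage: 4 visible, 3 hidden units
w = 0.01 * rng.normal(size=(4, 3))
a, b = np.zeros(4), np.zeros(3)
g_w, g_a, g_b = cd_k(np.array([1, 0, 1, 0]), w, a, b, k=1)
assert g_w.shape == (4, 3) and g_a.shape == (4,) and g_b.shape == (3,)
```

The returned arrays play the role of the `nablaWeights`/`nablaVisibleBiases`/`nablaHiddenBiases` accumulators in the C# code below; for a batch you would average them over the examples before applying the update rules.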

In Hinton’s lectures it looks like this :

That is, the longer we sample, the more accurate our gradient estimate will be. At the same time, the professor says that even CD-1 (just one sampling iteration) already gives quite good results. The first term is called the **positive phase**, and the second, with the minus sign, the **negative phase**.

#### Talk is cheap. Show me the code.

As in the implementations of the backpropagation algorithm, I use almost the same interfaces. So let's look right away at the training function of the neural network. We will implement training with momentum.


```csharp
public void Train(IMultilayerNeuralNetwork network, IList<DataItem<double>> data)
```

I use a class with the learning-algorithm configuration; I'll list its parameters below. I tried to comment everything in the code so as to put less text in the post itself.

```csharp
LearningAlgorithmConfig config = new LearningAlgorithmConfig()
{
    BatchSize = 10, // batch size
    MaxEpoches = 1000, // maximum number of epochs before the algorithm stops
    GibbsSamplingChainLength = 30, // the k parameter from CD-k
    LearningRate = 0.01, // learning rate
    CostFunctionRecalculationStep = 1, // how often to recalculate the learning error
    Momentum = 0.9, // momentum parameter
    MinError = 0, // minimal error for stopping the algorithm
    MinErrorChange = 0, // minimal change of the error for stopping the algorithm
    UseBiases = true // use biases in the learning process
};
```

Various variables are initialized before the main training loop. **Initialization**

```csharp
ILayer visibleLayer = network.Layers[0]; // the RBM network consists of two layers: one of visible states
ILayer hiddenLayer = network.Layers[1];  // and one of hidden states

if (!_config.UseBiases) // if biases do not take part in learning, zero them out
{
    for (int i = 0; i < visibleLayer.Neurons.Length; i++)
    {
        visibleLayer.Neurons[i].Bias = 0d;
    }
    for (int i = 0; i < hiddenLayer.Neurons.Length; i++)
    {
        hiddenLayer.Neurons[i].Bias = 0d;
    }
}

// init momentum
// I will give the formula for the momentum later, when updating the weights
double[,] momentumSpeedWeights = new double[visibleLayer.Neurons.Length, hiddenLayer.Neurons.Length];
double[] momentumSpeedVisibleBiases = new double[visibleLayer.Neurons.Length];
double[] momentumSpeedHiddenBiases = new double[hiddenLayer.Neurons.Length];

// init stop factors
bool stopFlag = false;
double lastError = Double.MaxValue; // many variables have self-describing names, so I'll skip describing them
double lastErrorChange = double.MaxValue;
double learningRate = _config.LearningRate;
int currentEpoche = 0;
BatchEnumerator<DataItem<double>> batchEnumerator = new BatchEnumerator<DataItem<double>>(data, _config.BatchSize, true);
```

The BatchEnumerator code can be seen here, and batches are described in the previous post.

Then begins the main loop over epochs: **Main loop**

```csharp
do
{
    DateTime dtStart = DateTime.Now;

    // start batch processing
    foreach (IList<DataItem<double>> batch in batchEnumerator)
    {
        // batch gradient
        // as in the backpropagation algorithm from the previous article, the changes of weights/biases are accumulated per batch
        double[,] nablaWeights = new double[visibleLayer.Neurons.Length, hiddenLayer.Neurons.Length];
        double[] nablaHiddenBiases = new double[hiddenLayer.Neurons.Length];
        double[] nablaVisibleBiases = new double[visibleLayer.Neurons.Length];

        #region iterate through batch
        // ...
        #endregion

        #region compute mean of weights nabla, and update them
        // ...
        #endregion
    }

    #region Logging and error calculation
    // ...
    #endregion

    // stopping conditions of the algorithm
    currentEpoche++;
    if (currentEpoche >= _config.MaxEpoches)
    {
        stopFlag = true;
        Logger.Instance.Log("Stop: currentEpoche:" + currentEpoche + " >= _config.MaxEpoches:" + _config.MaxEpoches);
    }
    else if (_config.MinError >= lastError)
    {
        stopFlag = true;
        Logger.Instance.Log("Stop: _config.MinError:" + _config.MinError + " >= lastError:" + lastError);
    }
    else if (_config.MinErrorChange >= lastErrorChange)
    {
        stopFlag = true;
        Logger.Instance.Log("Stop: _config.MinErrorChange:" + _config.MinErrorChange + " >= lastErrorChange:" + lastErrorChange);
    }
} while (!stopFlag);
```

Let's look at the main part of the algorithm: Gibbs sampling and computing/accumulating the gradient over the parameters. **#region iterate through batch**

```csharp
// walk through each training example of the current batch
foreach (DataItem<double> dataItem in batch)
{
    // init visible layer states
    // assign the values of the current example from the batch to the visible layer states
    for (int i = 0; i < dataItem.Input.Length; i++)
    {
        visibleLayer.Neurons[i].LastState = dataItem.Input[i];
    }

    #region Gibbs sampling
    for (int k = 0; k <= _config.GibbsSamplingChainLength; k++)
    {
        // calculate hidden states probabilities
        // the hidden layer updates its states from the states of the visible layer and the model parameters; so far only probabilities are computed
        hiddenLayer.Compute();

        #region accumulate negative phase
        // if the current sampling iteration is the last one, accumulate the negative part of the gradient
        if (k == _config.GibbsSamplingChainLength)
        {
            for (int i = 0; i < visibleLayer.Neurons.Length; i++)
            {
                for (int j = 0; j < hiddenLayer.Neurons.Length; j++)
                {
                    // product of the visible states and the hidden ones
                    nablaWeights[i, j] -= visibleLayer.Neurons[i].LastState *
                                          hiddenLayer.Neurons[j].LastState;
                }
            }

            if (_config.UseBiases)
            {
                for (int i = 0; i < hiddenLayer.Neurons.Length; i++)
                {
                    nablaHiddenBiases[i] -= hiddenLayer.Neurons[i].LastState;
                }
                for (int i = 0; i < visibleLayer.Neurons.Length; i++)
                {
                    nablaVisibleBiases[i] -= visibleLayer.Neurons[i].LastState;
                }
            }

            break;
        }
        #endregion

        // sample hidden states
        // after the probabilities have been computed, the hidden layer must be sampled; this cannot be the last iteration, otherwise we would already have left the Gibbs sampling loop
        for (int i = 0; i < hiddenLayer.Neurons.Length; i++)
        {
            hiddenLayer.Neurons[i].LastState = _r.NextDouble() <= hiddenLayer.Neurons[i].LastState ? 1d : 0d;
        }

        #region accumulate positive phase
        // if this is the first iteration, collect the positive part of the gradient
        if (k == 0)
        {
            for (int i = 0; i < visibleLayer.Neurons.Length; i++)
            {
                for (int j = 0; j < hiddenLayer.Neurons.Length; j++)
                {
                    nablaWeights[i, j] += visibleLayer.Neurons[i].LastState *
                                          hiddenLayer.Neurons[j].LastState;
                }
            }

            if (_config.UseBiases)
            {
                for (int i = 0; i < hiddenLayer.Neurons.Length; i++)
                {
                    nablaHiddenBiases[i] += hiddenLayer.Neurons[i].LastState;
                }
                for (int i = 0; i < visibleLayer.Neurons.Length; i++)
                {
                    nablaVisibleBiases[i] += visibleLayer.Neurons[i].LastState;
                }
            }
        }
        #endregion

        // calculate visible probs
        // compute the probabilities of the visible layer
        visibleLayer.Compute();

        // Hinton's guide says the visible layer does not have to be sampled: sampling only increases the variance while the expectation stays the same; my experiments showed it indeed works better without this block
        // todo: may skip sampling, as in section 3.2 of http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
        // sample visible
        //for (int i = 0; i < visibleLayer.Neurons.Length; i++)
        //{
        //    visibleLayer.Neurons[i].LastState = _r.NextDouble() <= visibleLayer.Neurons[i].LastState ? 1d : 0d;
        //}
    }
    #endregion
}
```

We have left the loop over the training examples of the batch, and now the weights must be updated. Momentum is used as follows: the previous change of a model parameter is multiplied by the momentum constant (*mu*) and added to the current batch-averaged gradient; the sum is then multiplied by the learning rate.
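In isolation, the momentum rule just described can be sketched as follows (a Python illustration with made-up names, not the article's C# code):

```python
import numpy as np

def momentum_update(weights, velocity, nabla, batch_size,
                    learning_rate=0.01, momentum=0.9):
    """velocity <- mu * velocity + mean batch gradient; weights <- weights + lr * velocity."""
    velocity = momentum * velocity + nabla / batch_size
    weights = weights + learning_rate * velocity
    return weights, velocity

w = np.zeros((2, 2))
v = np.zeros((2, 2))
g = np.ones((2, 2))  # pretend this is the accumulated batch gradient
for _ in range(3):
    w, v = momentum_update(w, v, g, batch_size=1)
# with mu = 0.9 and lr = 0.01 the velocities are 1, 1.9, 2.71,
# so after three steps w = 0.01 * (1 + 1.9 + 2.71) = 0.0561
assert abs(w[0, 0] - 0.0561) < 1e-12
```

With a constant gradient the velocity converges to gradient / (1 - mu), which is why momentum of 0.9 effectively multiplies the step size by up to ten.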

**#region compute mean of weights nabla, and update them**

```csharp
for (int i = 0; i < visibleLayer.Neurons.Length; i++)
{
    for (int j = 0; j < hiddenLayer.Neurons.Length; j++)
    {
        // store the current weight change, taking the previous one into account, to be used at the next iteration
        momentumSpeedWeights[i, j] = _config.Momentum * momentumSpeedWeights[i, j] +
                                     nablaWeights[i, j] / batch.Count;
        // update the weight using the current change and the learning rate
        visibleLayer.Neurons[i].Weights[j] += learningRate * momentumSpeedWeights[i, j];
        hiddenLayer.Neurons[j].Weights[i] = visibleLayer.Neurons[i].Weights[j];
    }
}

if (_config.UseBiases)
{
    for (int i = 0; i < hiddenLayer.Neurons.Length; i++)
    {
        momentumSpeedHiddenBiases[i] = _config.Momentum * momentumSpeedHiddenBiases[i] +
                                       nablaHiddenBiases[i] / batch.Count;
        hiddenLayer.Neurons[i].Bias += learningRate * momentumSpeedHiddenBiases[i];
    }
    for (int i = 0; i < visibleLayer.Neurons.Length; i++)
    {
        momentumSpeedVisibleBiases[i] = _config.Momentum * momentumSpeedVisibleBiases[i] +
                                        nablaVisibleBiases[i] / batch.Count;
        visibleLayer.Neurons[i].Bias += learningRate * momentumSpeedVisibleBiases[i];
    }
}
```

It remains to output some related information. The model is energy-based, but it is worth remembering that *minimizing the energy does not by itself mean better reconstruction*; we are maximizing the probability that the reconstructed vector equals the original. Another option is to compute the Euclidean distance between the reconstructed vector and the original. This value will obviously not fall monotonically (it often oscillates), but overall its decrease is clearly visible. I will give an example of the log in the next section. **#region Logging and error calculation**
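A squared-Euclidean reconstruction error of the kind used below can be sketched in a few lines of Python (illustrative only; the article's code uses its own `MetricsCreator` class):

```python
import numpy as np

def reconstruction_error(data, reconstruct):
    """Sum of squared Euclidean distances between each vector and its reconstruction."""
    return sum(float(np.sum((x - reconstruct(x)) ** 2)) for x in data)

# toy check with an identity "network": the error is zero
data = [np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0])]
assert reconstruction_error(data, lambda x: x) == 0.0
# a constant reconstruction of 0.5 gives 3 * 0.25 = 0.75 per vector
assert abs(reconstruction_error(data, lambda x: np.full_like(x, 0.5)) - 1.5) < 1e-12
```

In the article's terms, `reconstruct` corresponds to `network.ComputeOutput`: inferring hidden probabilities from the input and visible probabilities back from them.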

```csharp
string msg = "Epoche #" + currentEpoche;

#region calculate error
if (currentEpoche % _config.CostFunctionRecalculationStep == 0)
{
    #region calculating squared error with reconstruction
    IMetrics<double> sed = MetricsCreator.SquareEuclideanDistance();
    double d = 0;
    foreach (DataItem<double> dataItem in data)
    {
        d += sed.Calculate(dataItem.Input, network.ComputeOutput(dataItem.Input));
    }
    msg += "; SqDist is " + d;
    lastErrorChange = Math.Abs(lastError - d);
    lastError = d;
    #endregion
}
#endregion

msg += "; Time: " + (DateTime.Now - dtStart).Duration().ToString();
Logger.Instance.Log(msg);
```

The full source code can be seen here.

#### RBM results

I ran my experiments on text recognition. For this post I chose a simple, small set of large English letters to speed up training, with some noise present (260 pictures of 29 by 29 pixels in total). You can download it for viewing here.

Training was performed with the following parameters:

```
LearningAlgorithmConfig:
LearningRate = 0.01
BatchSize = 10
RegularizationFactor = 0
MaxEpoches = 1000
MinError = 0
MinErrorChange = 0
CostFunctionRecalculationStep = 1
ErrorFunction =
Momentum = 0.9
NeuronLocalGainLimit: not set
GibbsSamplingChainLength = 30
UseBiases = True
```

As you can see, the only stopping criterion that fired was 1000 epochs. Since the batch size is 10, there were 26 weight updates per epoch, for a total of 26,000 updates. The log can be downloaded here.

From the log you can see that after the first epoch (26 parameter updates) the summed squared Euclidean distance is 13128, while by the last epoch it is already 76.

Let’s take a look at some restored pictures (original on the left, restored on the right):

Now let’s look at one of the larger images :

As you can see from the picture, even the noise is reconstructed correctly (we are not talking about the concept of overfitting for the RBM right now).

Now let's see whether there is a way to visualize the patterns found by the network. It turns out to be simple. Each hidden state has 841 (29*29) incoming weights and one bias. If we ignore the bias, the 841 weights can be laid out as a 29-by-29-pixel picture, and these are exactly the patterns we are after, found automatically by the network. The linear combination of a pattern with an image, plus the bias, passed through the sigmoid function, gives the probability that the neuron is active. Here is the resulting map:
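Assembling such a map is straightforward; here is a Python sketch (the function name and grid layout are mine) that reshapes each hidden unit's 841 weights into a 29-by-29 filter tile and arranges the tiles into one image:

```python
import numpy as np

def weights_to_filter_grid(w, side=29, cols=10, pad=1):
    """Arrange each hidden unit's incoming weights as a side x side tile in one big image.

    w has shape (side*side, n_hidden); returns a 2-D array ready for imshow/saving.
    """
    n_hidden = w.shape[1]
    rows = int(np.ceil(n_hidden / cols))
    grid = np.zeros((rows * (side + pad) - pad, cols * (side + pad) - pad))
    for j in range(n_hidden):
        tile = w[:, j].reshape(side, side)
        # normalize each filter to [0, 1] for display
        tile = (tile - tile.min()) / (tile.max() - tile.min() + 1e-12)
        r, c = divmod(j, cols)
        grid[r * (side + pad): r * (side + pad) + side,
             c * (side + pad): c * (side + pad) + side] = tile
    return grid

rng = np.random.default_rng(0)
grid = weights_to_filter_grid(rng.normal(size=(29 * 29, 20)), cols=5)
assert grid.shape == (4 * 30 - 1, 5 * 30 - 1)  # 4 rows x 5 cols of 29px tiles with 1px gaps
```

The resulting array can be fed straight to any image viewer or plotting library; per-filter normalization keeps weak filters from being washed out by strong ones.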

And this is what the same kind of map looks like for the pros, on the MNIST set of handwritten digits: http://deeplearning.net/tutorial/_images/filters_at_epoch_14.png So I still have some tuning of the training parameters ahead of me, but I seem to be on the right track =)

In the next post we will look at using an RBM to initialize the weights of a feed-forward network, which will then be trained with the backpropagation algorithm.

#### Links

- https://class.coursera.org/neuralnets-2012-001/
- http://www.cs.toronto.edu/~hinton/absps/guideTR.pdf
- http://deeplearning.net/tutorial/rbm.html#contrastive-divergence-cd-k
- http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DBNEquations
- http://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf
- and, of course, Wikipedia

*PS: Respect to user ambikontur for regularly helping to check the formulas and the text! UPD: here are the patterns after more careful parameter tuning, on a set with multiple fonts. And if you zoom in a little bit:*