How does deep learning work and how is it different from normal neural networks and/or SVM?
I'm going to try to use simple language ... basically I'm just going to try to summarize these two papers: http://www.cs.toronto.edu/~hinto... and http://machinelearning.wustl.edu...

---

First, why is deep learning different from the other methods you mentioned? "Normal" neural networks usually have one or two hidden layers and are used for SUPERVISED prediction or classification. SVMs are typically used for binary classification, and occasionally for other SUPERVISED learning tasks.

Deep learning architectures differ from "normal" neural networks because they have more hidden layers. They differ from both "normal" neural networks and SVMs because they can be trained in an UNSUPERVISED or SUPERVISED manner, for both UNSUPERVISED and SUPERVISED learning tasks. Moreover, people often train a deep network in an unsupervised manner before training it in a supervised manner.

---

How do you train an unsupervised neural network? With a supervised neural network, you try to predict a target vector, y, from a matrix of inputs, x. When you train an unsupervised neural network, you instead try to predict the matrix x using the very same matrix x as the input. In doing this, the network can learn something intrinsic about the data without the help of a target or label vector, which is often created by humans. The learned information is stored as the weights of the network.

One consequence of unsupervised training is that the network has the same number of input units as output units, because the input x matrix and the target x matrix have the same number of columns. This leads to the hourglass shape that is common in unsupervised, deep neural networks. In the diagram below, there are the same number of input units as target units, and each of these units represents a pixel in a small picture of a digit.
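The predict-x-from-x idea can be sketched as a tiny one-hidden-layer autoencoder in NumPy. Everything here — the sizes, the learning rate, the synthetic data — is an illustrative assumption, not the setup from the papers:

```python
import numpy as np

# Minimal autoencoder sketch: predict x from x through a narrow bottleneck,
# so the input and output layers have the same width (the hourglass shape).
rng = np.random.default_rng(0)
n_samples, n_inputs, n_hidden = 100, 20, 5        # hourglass: 20 -> 5 -> 20

# Synthetic data with genuine low-dimensional structure for the network to find.
x = rng.normal(size=(n_samples, 3)) @ rng.normal(size=(3, n_inputs))

W1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))   # encoder weights
W2 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))   # decoder weights

def forward(x):
    h = np.tanh(x @ W1)      # non-linear bottleneck code
    return h, h @ W2         # code, and the reconstruction of x

def loss(x):
    return np.mean((forward(x)[1] - x) ** 2)

loss_before = loss(x)
lr = 0.05
for step in range(300):
    h, x_hat = forward(x)
    err = (x_hat - x) / n_samples                   # scaled reconstruction error
    grad_W2 = h.T @ err                             # backprop through decoder
    grad_W1 = x.T @ (err @ W2.T * (1 - h ** 2))     # backprop through tanh encoder
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

loss_after = loss(x)   # reconstruction error drops as the weights learn x
```

After training, the weights W1 and W2 carry what the network learned about x, with no label vector involved — exactly the "learned information stored as weights" described above.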
You might think it sounds easy to predict x from x. Sometimes it is too easy, and the network becomes overtrained on the x matrix, so people typically add some noise, or random numbers, to x to prevent overtraining. One of the fancy names for this kind of architecture is "stacked denoising autoencoder". You might also hear "restricted Boltzmann machine", a related building block that is trained differently.

---

Why so many layers? Deep learning works because of the architecture of the network AND the optimization routine applied to that architecture. The network is a directed graph, meaning that each hidden unit is connected to many hidden units below it. So each hidden layer going further into the network is a NON-LINEAR combination of the layers below it, because of all the combining and recombining of the outputs from the previous units through their activation functions.

When the OPTIMIZATION routine is applied to the network, each hidden layer then becomes an OPTIMALLY WEIGHTED, NON-LINEAR combination of the layer below it. When each sequential hidden layer has fewer units than the one below it, each hidden layer also becomes a LOWER-DIMENSIONAL PROJECTION of the layer below it. So the information from the layer below is nicely summarized by a NON-LINEAR, OPTIMALLY WEIGHTED, LOWER-DIMENSIONAL PROJECTION in each subsequent layer of the deep network.

In the picture above, the outputs from the small middle hidden layer are a two-dimensional, optimal, non-linear projection of the input columns (i.e., pixels) of the input matrix (i.e., the set of pictures). Figs. 3a and 3b in the Hinton paper above actually plot similar outputs. Notice that the network has basically clustered the digits 0 through 9 without a label vector. So, the unsupervised training process has resulted in unsupervised learning.

---

How do you make predictions? That's the easy part.
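Both ideas from the section above — corrupting the input with noise, and stacking shrinking non-linear layers — can be sketched together. The weights below are random stand-ins for trained ones, and all sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))                  # 32 examples, 64 input units

# Denoising: corrupt the input, but keep the CLEAN x as the training target.
x_noisy = x + 0.3 * rng.normal(size=x.shape)

# Stacked, shrinking layers (64 -> 16 -> 2): each hidden layer is a
# non-linear, lower-dimensional projection of the layer below it.
W1 = rng.normal(scale=0.1, size=(64, 16))
W2 = rng.normal(scale=0.1, size=(16, 2))

h1 = np.tanh(x_noisy @ W1)   # non-linear combination of the inputs
h2 = np.tanh(h1 @ W2)        # non-linear combination of h1: a 2-D code

# h2 is the kind of two-dimensional projection that Figs. 3a/3b plot.
print(h2.shape)              # (32, 2)
```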
One approach is to break the hourglass network in half and swap the target matrix x for y, where y is some more typical target or label vector. In the picture above, you could throw away all the layers above the middle layer and put a single target unit for y right above the middle hidden layer. What you are really keeping from the bottom half of the hourglass network is the weights from the unsupervised training phase. Remember, the weights represent what was learned during unsupervised training. They now become the initial starting points for the supervised training optimization routine using the target vector y. (In the case of the digit pictures, y contains the 0-9 label of each digit.) So, the supervised training phase basically just refines the weights from the unsupervised training phase to best predict y. Since we have changed the architecture of the network to a more "normal" supervised network, the actual mechanism of prediction is similar to a "normal" neural network.
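The halving-and-swapping step above can be sketched as follows. The "pretrained" encoder weights here are random stand-ins, and the sizes and names are illustrative assumptions:

```python
import numpy as np

# Keep the encoder (the bottom half of the hourglass) and put a new
# supervised output layer for y on top of the bottleneck code.
rng = np.random.default_rng(0)
n_inputs, n_hidden, n_classes = 64, 8, 10

W_enc = rng.normal(scale=0.1, size=(n_inputs, n_hidden))   # from the unsupervised phase
W_out = rng.normal(scale=0.1, size=(n_hidden, n_classes))  # new layer for predicting y

def predict(x):
    h = np.tanh(x @ W_enc)                # reuse the unsupervised encoder
    return np.argmax(h @ W_out, axis=1)   # one 0-9 label per example

x_new = rng.normal(size=(5, n_inputs))
y_hat = predict(x_new)
```

Supervised training would then refine both W_enc and W_out against the labels, which is the "fine-tuning of the unsupervised weights" the answer describes.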