Generally speaking, DBNs are generative neural networks that stack Restricted Boltzmann Machines (RBMs) . You can think of RBMs as being generative autoencoders; if you want a deep belief net you should be stacking RBMs and not plain autoencoders as Hinton and his student Yeh proved that stacking RBMs results in sigmoid belief nets.
Convolutional neural networks have performed better than DBNs by themselves in current literature on benchmark computer vision datasets such as MNIST. If the dataset is not a computer vision one, then DBNs can most definitely perform better. In theory, DBNs should be the best models but it is very hard to estimate joint probabilities accurately at the moment. You may be interested in Lee et. al's (2009) work on Convolutional Deep Belief Networks which looks to combine the two.
I will try to explain the situation through learning shoes.
If you use DBN to learn those images here is the bad thing that will happen in your learning algorithm
there will be shoes on different places.
all the neurons will try to learn not only shoes but also the place of the shoes in the images because it will not have the concept of 'local image patch' inside weights.
DBN makes sense if all your images are aligned by means of size, translation and rotation.
the idea of convolutional networks is that, there is a concept called weight sharing. If I try to extend this 'weight sharing' concept
first you looked at 7x7 patches, and according to your example - as an example of 3 of your neurons in the first layer you can say that they learned shoes 'front', 'back-bottom' and 'back-upper' parts as these would look alike for a 7x7 patch through all shoes.
Normally the idea is to have multiple convolution layers one after another to learn
You can think of these 3 different things I told you as 3 different neurons. And such areas/neurons in your images will fire when there are shoes in some part of the image.
Pooling will protect your higher activations while sub-sampling your images and creating a lower-dimensional space to make things computationally easier and feasible.
So at last layer when you look at your 25X4x4, in other words 400 dimensional vector, if there is a shoe somewhere in the picture your 'shoe neuron(s)' will be active whereas non-shoe neurons will be close to zero.
And to understand which neurons are for shoes and which ones are not you will put that 400 dimensional vector to another supervised classifier(this can be anything like multi-class-SVM or as you said a soft-max-layer)
I can advise you to have a glance at Fukushima 1980 paper to understand what I try to say about translation invariance and line -> arc -> semicircle -> shoe front -> shoe idea (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf). Even just looking at the images in the paper will give you some idea.
In computer science, a convolutional deep belief network (CDBN) is a type of deep artificial neural network composed of multiple layers of convolutional restricted Boltzmann machines stacked together.[1] Alternatively, it is a hierarchical generative model for deep learning, which is highly effective in image processing and object recognition, though it has been used in other domains too.[2] The salient features of the model include the fact that it scales well to high-dimensional images and is translation-invariant.[3]