
Input data for convolutional neural network

  • I am trying to learn deep learning, specifically using convolutional neural networks, and I'd like to apply a simple network to some audio data. As far as I understand, CNNs are most often used for image and object recognition, so when working with audio people typically use a spectrogram (specifically a mel-spectrogram) rather than the time-domain signal. My question is: is it better to use an image (i.e. RGB or greyscale pixel values) of the spectrogram as the input to the network, or should I use the 2-D magnitude values of the spectrogram directly? Does it even make a difference?
    Thank you.
      November 8, 2021 5:43 PM IST
    0
  • The input layer of a CNN should contain image data, which is represented as a three-dimensional matrix (height x width x channels); you only need to reshape it into a single column when it is fed into fully-connected layers.

    A Convolutional Neural Network (ConvNet/CNN) is a deep-learning algorithm that takes in an input image, assigns importance (learnable weights and biases) to the various aspects and objects in the image, and learns to differentiate one from the other.
      November 9, 2021 2:12 PM IST
    0
  • The spectrogram is a lovely representation, especially for describing the process. Functionally, though, turning it into an RGB or greyscale image is merely a re-encoding of the input data that adds no information and loses a smidgen of accuracy -- which probably doesn't matter. That extra preprocessing doesn't buy you anything, so just use the 2-D magnitude values directly and let the CNN take things from there.
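
    If it helps to make that concrete, here is a rough sketch (one possible approach, not from this thread) that computes a mel-spectrogram with librosa and keeps the 2-D magnitude values as the network input; the file name is a placeholder and the mel parameters are arbitrary:

        import numpy as np
        import librosa

        # Load an audio file (placeholder path) at its native sample rate
        y, sr = librosa.load("example.wav", sr=None)

        # Mel-spectrogram magnitudes, converted to decibels for a friendlier dynamic range
        S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
        S_db = librosa.power_to_db(S, ref=np.max)

        # Feed the 2-D values directly, adding a channel axis: (n_mels, frames, 1)
        x = S_db[..., np.newaxis]
        print(x.shape)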

     
      November 11, 2021 2:30 PM IST
    0
  • The input layer of a CNN should contain image data, which is represented as a three-dimensional matrix. To feed it into a fully-connected input layer you need to reshape it into a single column: an image of dimension 28 x 28 = 784 pixels becomes a 784 x 1 vector, and with m training examples the dimension of the input will be (784, m).
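
    As a minimal NumPy sketch of that flattening step (the random array just stands in for real images):

        import numpy as np

        m = 5                                   # number of training examples
        images = np.random.rand(m, 28, 28)      # m greyscale images of 28 x 28 pixels

        # Flatten each image into a 784 x 1 column and stack the columns: shape (784, m)
        X = images.reshape(m, 28 * 28).T
        print(X.shape)                          # (784, 5)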

      November 12, 2021 1:51 PM IST
    0
  • Convolutional neural networks are distinguished from other neural networks by their superior performance with image, speech, or audio signal inputs. They have three main types of layers, which are:

    • Convolutional layer
    • Pooling layer
    • Fully-connected (FC) layer

    The convolutional layer is the first layer of a convolutional network. While convolutional layers can be followed by additional convolutional layers or pooling layers, the fully-connected layer is the final layer. With each layer, the CNN increases in its complexity, identifying greater portions of the image. Earlier layers focus on simple features, such as colors and edges. As the image data progresses through the layers of the CNN, it starts to recognize larger elements or shapes of the object until it finally identifies the intended object.
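
    As a rough illustration of how those three layer types are stacked (a minimal Keras sketch, assuming a 128 x 128 single-channel input and 10 output classes, neither of which comes from this thread):

        import tensorflow as tf
        from tensorflow.keras import layers

        model = tf.keras.Sequential([
            layers.Conv2D(16, (3, 3), activation="relu", input_shape=(128, 128, 1)),  # convolutional layer
            layers.MaxPooling2D((2, 2)),                                              # pooling layer
            layers.Conv2D(32, (3, 3), activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(10, activation="softmax"),                                   # fully-connected (FC) layer
        ])
        model.summary()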

    Convolutional Layer

    The convolutional layer is the core building block of a CNN, and it is where the majority of computation occurs. It requires a few components, which are input data, a filter, and a feature map. Let’s assume that the input will be a color image, which is made up of a matrix of pixels in 3D. This means that the input will have three dimensions—a height, width, and depth—which correspond to RGB in an image. We also have a feature detector, also known as a kernel or a filter, which will move across the receptive fields of the image, checking if the feature is present. This process is known as a convolution.

    The feature detector is a two-dimensional (2-D) array of weights, which represents part of the image. While they can vary in size, the filter size is typically a 3x3 matrix; this also determines the size of the receptive field. The filter is then applied to an area of the image, and a dot product is calculated between the input pixels and the filter. This dot product is then fed into an output array. Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept across the entire image. The final output from the series of dot products from the input and the filter is known as a feature map, activation map, or a convolved feature.
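
    To make the dot-product-and-stride idea concrete, here is a small NumPy sketch of a single 3x3 filter sliding over a 6 x 6 single-channel input with stride 1 and no padding (toy random numbers):

        import numpy as np

        image = np.random.rand(6, 6)       # single-channel input
        kernel = np.random.rand(3, 3)      # 3x3 feature detector (filter)
        stride = 1

        out = (image.shape[0] - kernel.shape[0]) // stride + 1
        feature_map = np.zeros((out, out))

        # Each output value is the dot product of the filter with one receptive field
        for i in range(out):
            for j in range(out):
                patch = image[i * stride:i * stride + 3, j * stride:j * stride + 3]
                feature_map[i, j] = np.sum(patch * kernel)

        print(feature_map.shape)           # (4, 4) feature map / convolved feature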

    After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation to the feature map, introducing nonlinearity to the model.
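
    In code, that ReLU step is just an element-wise maximum with zero applied to the feature map, for example:

        import numpy as np

        feature_map = np.array([[-1.2, 0.5],
                                [ 3.0, -0.7]])
        relu_map = np.maximum(0.0, feature_map)   # negatives become 0, positives pass through
        print(relu_map)                           # [[0.  0.5] [3.  0. ]]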

    As we mentioned earlier, another convolutional layer can follow the initial convolutional layer. When this happens, the structure of the CNN becomes hierarchical, as the later layers see the pixels within the receptive fields of prior layers. As an example, let's assume that we're trying to determine whether an image contains a bicycle. You can think of the bicycle as a sum of parts: it comprises a frame, handlebars, wheels, pedals, et cetera. Each individual part of the bicycle makes up a lower-level pattern in the neural net, and the combination of its parts represents a higher-level pattern, creating a feature hierarchy within the CNN.

      December 17, 2021 11:51 AM IST
    0