

Introduction to Artificial Neural Networks

An artificial neural network is a computational network inspired by the biological neural networks that make up the structure of the human brain. Just as the human brain has neurons interconnected with one another, an artificial neural network has neurons (called perceptrons) linked to each other across the various layers of the network.

Source: https://www.superdatascience.com/blogs/artificial-neural-networks-the-neuron


The above figure shows the structure of a Biological Neural Network.


The Structure of an Artificial Neural Network looks something like:


Source: https://www.javatpoint.com/artificial-neural-network


Comparing a biological neural network to an artificial neural network, the correspondence is as follows:


Dendrites: Inputs


Axon: Outputs


Neuron: Neuron/unit/node


Synapse: Weights


The architecture of an artificial neural network:
An artificial neural network primarily consists of three layers:


Input Layer:


The input layer is responsible for accepting the inputs in different formats.


Hidden Layer:


The hidden layer is located between the input and output layers. It performs all the calculations to find hidden features and patterns.


Output Layer:


The input goes through a series of transformations in the hidden layers, and the final result is conveyed through the output layer.
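The three layers above can be sketched as a minimal forward pass in plain Python. All weights, biases, and layer sizes here are arbitrary illustrative values, not taken from any real model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    # One fully connected layer: each unit takes a weighted sum of all
    # inputs, adds its bias, and applies the activation function.
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

x = [0.5, -1.2]                                     # input layer: 2 features
W_hidden = [[0.1, 0.4], [-0.3, 0.8], [0.7, -0.2]]   # 3 hidden units
b_hidden = [0.0, 0.1, -0.1]
W_out = [[0.2, -0.5, 0.9]]                          # 1 output unit
b_out = [0.05]

h = dense(x, W_hidden, b_hidden)   # hidden layer computes intermediate features
y = dense(h, W_out, b_out)         # output layer conveys the final result
print(y)                           # a single value between 0 and 1
```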


Now let's look at the working of a neural network with a single layer consisting of one neuron.


Source: https://hackerwins.github.io/2019-06-16/cs229a-week4


In the above figure, for one single observation, x0, x1, x2, x3, ..., x(n) represent the various inputs (independent variables) to the network. Each of these inputs is multiplied by a connection weight, or synapse (synaptic weight).


The weights are represented as w0, w1, w2, w3, ..., w(n). A weight shows the strength of a particular connection.


b is a bias value. A bias value allows you to shift the activation function up or down.


In the simplest case, these products are summed, fed to a transfer function (activation function) to generate a result, and this result is sent as output.


Mathematically: x1·w1 + x2·w2 + x3·w3 + ... + xn·wn = ∑ xi·wi


Now the activation function is applied to this sum.
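This single-neuron computation can be sketched directly. The inputs, weights, and bias below are hypothetical, and a sigmoid is used as the activation function:

```python
import math

# Hypothetical inputs, weights, and bias for one neuron.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -0.25, 0.1]
bias = 0.2

z = sum(x * w for x, w in zip(inputs, weights)) + bias  # weighted sum plus bias
out = 1.0 / (1.0 + math.exp(-z))                        # sigmoid activation
print(out)                                              # the neuron's output
```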



Talking about the activation function: it is the function that decides whether a neuron should be activated or not by computing the weighted sum and adding the bias to it. Its purpose is to introduce non-linearity into the output of a neuron.


There are different types of activation functions:


Binary step function:


A binary step function is a threshold-based activation function. If the input value is above a certain threshold, the neuron is activated and sends exactly the same signal to the next layer; otherwise it is not activated.



“activated” if Y > threshold, else not.
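As a sketch (the threshold of 0 is an arbitrary choice):

```python
def binary_step(z, threshold=0.0):
    # The neuron fires (outputs 1) only when the weighted sum exceeds the threshold.
    return 1 if z > threshold else 0

print(binary_step(0.7))   # 1
print(binary_step(-0.3))  # 0
```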


This function works for a binary classifier (1 or 0), but it breaks down if you want multiple such neurons connected to handle more classes (Class 1, Class 2, Class 3, etc.): several neurons may all output 1, and we cannot decide between them.


Sigmoid or Logistic function


A sigmoid function is a mathematical function having a characteristic "S"-shaped curve, or sigmoid curve, which ranges between 0 and 1. Inputs much larger than 1 are transformed to values near 1, and inputs much smaller than 0 are snapped to values near 0. It is therefore used in models where we need to predict a probability as the output.
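A minimal sigmoid implementation showing this squashing behavior:

```python
import math

def sigmoid(z):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # very close to 1
print(sigmoid(-10.0))  # very close to 0
```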



Hyperbolic Tangent Function — (tanh):


It is similar to the sigmoid but often performs better. It is non-linear in nature, so layers can be stacked. The function ranges between (-1, 1).



The main advantage of this function is that strongly negative inputs are mapped to strongly negative outputs, and only inputs near zero are mapped to near-zero outputs, so the network is less likely to get stuck during training.
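Python's standard library already provides tanh, so the behavior is easy to check:

```python
import math

# tanh squashes any real input into the range (-1, 1).
print(math.tanh(2.5))   # strongly positive input, output near 1
print(math.tanh(-2.5))  # strongly negative input, output near -1
print(math.tanh(0.0))   # zero input maps to exactly 0
```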


 


Rectified Linear Units — (ReLu)


ReLU is the most used activation function in CNNs (Convolutional Neural Networks) and ANNs. Its range is [0, ∞).




It gives an output x if x is positive and 0 otherwise. It may look like it has the same problem as a linear function, since it is linear on the positive axis, but ReLU is non-linear in nature, and a combination of ReLUs is also non-linear. In fact, it is a good approximator: any function can be approximated with a combination of ReLUs.
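ReLU is a one-line function:

```python
def relu(z):
    # Output z if z is positive, 0 otherwise; range is [0, inf).
    return max(0.0, z)

print(relu(3.2))   # 3.2
print(relu(-1.7))  # 0.0
```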


 


Now let's look at how a neural network learns.


Let's take an example of classifying a person who is either diabetic or non-diabetic.


 


Looking at the image and relating it to our example, x1, x2, x3, ... become the features we have, e.g. glucose_level, skin_thickness, BMI, etc. Each observation is passed to the network, multiplied by the weights, the bias is added, and the output is produced by the activation function.


Learning in a neural network is closely related to how we learn in our regular lives and activities: we perform an action and are either accepted or corrected by a trainer or coach in order to get better at a certain task. Similarly, neural networks require a trainer to describe what should have been produced in response to the input. Based on the difference between the actual value and the predicted value, an error value, also called the cost function, is computed and sent back through the system.


For each layer of the network, the cost function is analyzed and used to adjust the thresholds and weights for the next input. Our aim is to minimize the cost function: the lower the cost function, the closer the predicted value is to the actual value. In this way, the error becomes marginally smaller in each run (called an epoch) as the network learns how to analyze values.


We feed the resulting data back through the entire neural network. The weighted synapses connecting input variables to the neuron are the only thing we have control over.


As long as there exists a disparity between the actual value and the predicted value, we need to adjust those weights. Once we tweak them a little and run the neural network again, a new cost function value is produced, hopefully smaller than the last.


We repeat this process until the cost function is as small as possible.


This procedure is known as back-propagation and is applied continuously through the network until the error value is at a minimum.

source: https://www.guru99.com/backpropogation-neural-network.html
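The learning loop described above can be sketched for a single sigmoid neuron trained by repeated weight updates. The data below is a tiny made-up toy set (one feature, binary label), not real diabetes data, and the gradient formula assumes a cross-entropy cost with a sigmoid activation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny made-up toy data: one feature, binary label (0 = non-diabetic, 1 = diabetic).
X = [0.2, 0.4, 0.6, 0.8]
y = [0, 0, 1, 1]

w, b, lr = 0.0, 0.0, 0.5            # initial weight, bias, learning rate
for epoch in range(1000):
    for xi, yi in zip(X, y):
        pred = sigmoid(w * xi + b)  # forward pass
        err = pred - yi             # gradient of the cost w.r.t. the pre-activation
        w -= lr * err * xi          # back-propagate: adjust the weight...
        b -= lr * err               # ...and the bias to reduce the cost

p_low = sigmoid(w * 0.2 + b)   # should end up below 0.5
p_high = sigmoid(w * 0.8 + b)  # should end up above 0.5
```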


One of the most widely used algorithms for the above operation is “Gradient Descent”


Gradient descent is an optimization algorithm used to find the values of parameters (coefficients) of a function (f) that minimizes a cost function (cost). 


Think of a large bowl. This bowl is a plot of the cost function (f).



Source: https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781786466587/3/ch03lvl1sec21/minimizing-the-cost-function


A random position on the surface of the bowl is the cost of the current values of the coefficients (cost).


The bottom of the bowl is the cost of the best set of coefficients, the minimum of the function.


The goal is to continue to try different values for the coefficients, evaluate their cost and select new coefficients that have a slightly better (lower) cost.


Repeating this process enough times will lead to the bottom of the bowl and you will know the values of the coefficients that result in the minimum cost.




Source: https://www.researchgate.net/figure/A-graph-of-a-cost-function-modified-from_fig1_329920042


The procedure starts with initial values for the coefficients of the function; these could be small random values. The cost of the coefficients is evaluated by plugging them into the function and calculating the cost.

The derivative of the cost is then calculated. The derivative is a concept from calculus and refers to the slope of the function at a given point. We need to know the slope so that we know the direction (sign) in which to move the coefficient values in order to get a lower cost on the next iteration.


Now that we know from the derivative which direction is downhill, we can now update the coefficient values. A learning rate parameter (alpha) must be specified that controls how much the coefficients can change on each update.


This process is repeated until the cost of the coefficients (cost) is 0 or close enough to zero to be good enough.
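These update steps can be sketched on a simple bowl-shaped cost whose minimum we already know (the cost function and learning rate here are illustrative choices, not from the article):

```python
# Minimize the bowl-shaped cost(w) = (w - 3) ** 2, whose minimum is at w = 3.
def derivative(w):
    # Slope of the cost at w; its sign tells us which direction is downhill.
    return 2 * (w - 3)

w = 0.0        # arbitrary starting coefficient
alpha = 0.1    # learning rate: controls how far each update steps
for step in range(100):
    w -= alpha * derivative(w)   # move against the slope

print(w)  # very close to 3, the bottom of the bowl
```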


Batch Gradient Descent:


How closely a machine learning model fits the target function can be evaluated in a number of different ways, often specific to the algorithm. The cost function evaluates the coefficients of the model by calculating a prediction for each training instance in the dataset, comparing the predictions to the actual output values, and computing a sum or average error (such as the sum of squared residuals, SSR, in the case of linear regression).
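The sum of squared residuals mentioned above is straightforward to compute (the values here are made-up examples):

```python
def ssr(y_true, y_pred):
    # Sum of squared residuals over all training instances.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

cost = ssr([1.0, 2.0, 3.0], [1.1, 1.9, 3.2])  # approximately 0.06
print(cost)
```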


From the cost function a derivative can be calculated for each coefficient so that it can be updated using exactly the update equation described above.


The cost is calculated for a machine learning algorithm over the entire training dataset for each iteration of the gradient descent algorithm. One iteration of the algorithm is called one batch and this form of gradient descent is referred to as batch gradient descent.


Batch gradient descent is the most common form of gradient descent described in machine learning.


Stochastic Gradient Descent:


Gradient descent can be slow to run on very large datasets.


Because one iteration of the gradient descent algorithm requires a prediction for each instance in the training dataset, it can take a long time when you have many millions of instances.


In situations when you have large amounts of data, you can use a variation of gradient descent called stochastic gradient descent.


In this variation, the gradient descent procedure described above is run, but the update to the coefficients is performed for each training instance rather than at the end of the batch of instances.
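The per-instance update can be sketched for simple linear regression. The data below is made up from the line y = 2x + 1, so the procedure should recover coefficients near w = 2 and b = 1:

```python
# Made-up noiseless data from y = 2x + 1.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b, lr = 0.0, 0.0, 0.05
for epoch in range(200):
    for x, y in data:
        err = (w * x + b) - y   # residual for this single instance
        w -= lr * err * x       # coefficients are updated immediately,
        b -= lr * err           # not once per pass over the whole batch

print(w, b)  # close to the true coefficients 2 and 1
```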






 


 


 


 

