
In Tensorflow, why add an activation function to a model only when preparing to export it?

  • In the TensorFlow "ML Basics with Keras" tutorial on basic text classification, when preparing the trained model for export, the tutorial suggests including the TextVectorization layer in the model so it can "process raw strings". I understand why this is done.

    But then the code snippet is:

    export_model = tf.keras.Sequential([
      vectorize_layer,
      model,
      layers.Activation('sigmoid')
    ])


    Why, when preparing the model for export, does the tutorial also add a new activation layer, layers.Activation('sigmoid')? Why not incorporate this layer into the original model?

      October 23, 2021 2:23 PM IST
    0
  • Sometimes you want to know the model's output before the sigmoid, because it may contain useful information, for example about the shape of the score distribution and how it evolves during training. In such a scenario it is convenient to keep the final scaling as a separate entity. Otherwise you would have to remove and re-add the sigmoid layer, which means more lines of code and more possible errors. So it can be good practice to apply the sigmoid at the very end, just before saving/exporting. Or it may simply be a convention.
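
    A minimal sketch of that workflow (assuming the tutorial's model and vectorize_layer already exist; the example strings are made up): the original model keeps returning raw pre-sigmoid scores for analysis, while the exported wrapper returns probabilities.

    import tensorflow as tf
    from tensorflow.keras import layers

    examples = tf.constant(["The movie was great!", "The movie was terrible..."])

    # Raw, pre-sigmoid scores from the trained model (inputs vectorized manually).
    logits = model.predict(vectorize_layer(examples))

    # Export wrapper: raw strings in, probabilities in (0, 1) out.
    export_model = tf.keras.Sequential([
        vectorize_layer,
        model,
        layers.Activation('sigmoid')
    ])
    probabilities = export_model.predict(examples)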

     
      October 27, 2021 1:54 PM IST
    0
  • An activation function is a function applied to the output of a neural network layer, and its result is passed as the input to the next layer. Activation functions are an essential part of neural networks because they provide non-linearity; without them, a stack of layers collapses into what is effectively a single linear (or logistic regression) model. The most widely used activation function is the Rectified Linear Unit (ReLU), defined as f(x) = max(0, x). ReLU has become a popular choice for the following reasons:


    • Computationally faster: ReLU is a very simple function that is cheap to compute.
    • Fewer vanishing gradients: In machine learning, the update to a parameter is proportional to the partial derivative of the error function with respect to that parameter. If the gradient becomes extremely small, the updates are no longer effective and the network may effectively stop training. ReLU does not saturate in the positive direction, whereas activation functions such as the sigmoid and the hyperbolic tangent saturate in both directions. It therefore suffers less from vanishing gradients, which results in better training.

    The function tf.nn.relu() provides ReLU in TensorFlow; a short sketch of its behaviour follows.
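
    A quick sketch comparing ReLU with the sigmoid (the printed values are approximate):

    import tensorflow as tf

    x = tf.constant([-2.0, -0.5, 0.0, 1.5, 3.0])

    # ReLU clamps negative inputs to zero and passes positive inputs through,
    # so it does not saturate in the positive direction.
    print(tf.nn.relu(x).numpy())       # [0.  0.  0.  1.5 3. ]

    # The sigmoid squashes everything into (0, 1) and flattens out (saturates)
    # for large |x|, which is where vanishing gradients come from.
    print(tf.math.sigmoid(x).numpy())  # ~[0.119 0.378 0.5 0.818 0.953]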

      December 21, 2021 1:41 PM IST
    0
  • Before the TextVectorization layer was introduced, you had to preprocess your raw strings manually. This usually meant removing punctuation, lower-casing, tokenizing, and so forth:


    #Raw String
    "Furthermore, he asked himself why it happened to Billy?"
    
    #Remove punctuation
    "Furthermore he asked himself why it happened to Billy"
    
    #Lower-case
    "furthermore he asked himself why it happened to billy"
    
    #Tokenize
    ['furthermore', 'he', 'asked', 'himself', 'why', 'it', 'happened', 'to', 'billy']


    If you include the TextVectorization layer in your model when you export, you can essentially feed raw strings into your model for prediction without having to clean them up first.
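
    For reference, a minimal sketch of such a layer (the vocabulary size and sequence length here are made-up values, not necessarily the tutorial's): with the default standardization it lowercases, strips punctuation, and tokenizes raw strings into integer sequences.

    import tensorflow as tf
    from tensorflow.keras import layers

    vectorize_layer = layers.TextVectorization(
        max_tokens=10000,                            # hypothetical vocabulary size
        standardize='lower_and_strip_punctuation',   # the default standardization
        output_mode='int',
        output_sequence_length=250)                  # hypothetical padded length

    # adapt() builds the vocabulary from a corpus of raw strings.
    vectorize_layer.adapt(tf.constant(
        ["Furthermore, he asked himself why it happened to Billy?"]))

    print(vectorize_layer(tf.constant(["Why did it happen to Billy?"])))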

    Regarding your second question: the last Dense layer of the original model has a linear activation (i.e. no activation function), so during training it outputs raw logits. Most likely this is because the model is compiled with losses.BinaryCrossentropy(from_logits=True), which expects logits and is numerically more stable than applying a sigmoid inside the model; the sigmoid then only needs to be attached once, when the model is exported for inference.

    The problem with a linear activation function during inference is that it can output negative values:

    # With linear activation function
    
    examples = [
      "The movie was great!",
      "The movie was okay.",
      "The movie was terrible..."
    ]
    
    export_model.predict(examples)
    
    '''
    array([[ 0.4543204 ],
           [-0.26730654],
           [-0.61234593]], dtype=float32)
    '''


    For example, the value -0.26730654 could suggest that the review "The movie was okay." is negative, but this is not necessarily the case. What one actually wants to predict is the probability that a particular sample belongs to a particular class. Therefore, a sigmoid function is applied at inference time to squeeze the output values between 0 and 1. The output can then be interpreted as the probability that sample x belongs to class n:

    # With sigmoid activation function
    
    examples = [
      "The movie was great!",
      "The movie was okay.",
      "The movie was terrible..."
    ]
    
    export_model.predict(examples)
    
    '''
    array([[0.6116659 ],
           [0.43356845],
           [0.35152423]], dtype=float32)
    '''
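
    As a sanity check, here is a small sketch that applies the sigmoid to the raw logit values shown above; it reproduces the probabilities returned by the export model.

    import tensorflow as tf

    # Raw model outputs (logits) from the linear-activation example above.
    logits = tf.constant([[0.4543204], [-0.26730654], [-0.61234593]])

    # Squashing them through the sigmoid gives approximately
    # [[0.612], [0.434], [0.352]].
    print(tf.math.sigmoid(logits))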
      October 29, 2021 3:04 PM IST
    0