
What is the difference between a sigmoid followed by the cross entropy and sigmoid_cross_entropy_with_logits in TensorFlow?

  • When trying to compute the cross-entropy with a sigmoid activation function, there is a difference between

    loss1 = -tf.reduce_sum(p*tf.log(q), 1)
    
    loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q),1)
    

    But they are the same with the softmax activation function.

    Here is the sample code:

    import tensorflow as tf

    sess = tf.InteractiveSession()
    p = tf.placeholder(tf.float32, shape=[None, 5])        # labels
    logit_q = tf.placeholder(tf.float32, shape=[None, 5])  # logits
    q = tf.nn.sigmoid(logit_q)                              # sigmoid "probabilities"
    sess.run(tf.global_variables_initializer())

    feed_dict = {p: [[0, 0, 0, 1, 0], [1, 0, 0, 0, 0]],
                 logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2], [0.3, 0.3, 0.2, 0.1, 0.1]]}
    loss1 = -tf.reduce_sum(p * tf.log(q), 1).eval(feed_dict)
    loss2 = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q), 1).eval(feed_dict)

    print(p.eval(feed_dict), "\n", q.eval(feed_dict))
    print("\n", loss1, "\n", loss2)
      August 28, 2021 11:37 PM IST
  • You can understand the difference between softmax and sigmoid cross-entropy in the following way:

    1. Softmax cross-entropy works with a single probability distribution over all classes.
    2. Sigmoid cross-entropy works with multiple independent binary probability distributions; each binary distribution can be treated as a two-class distribution.

    In both cases the basic cross-entropy term is:

       p * -tf.log(q)

    For softmax cross-entropy the loss looks exactly like the formula above.

    For sigmoid cross-entropy it looks slightly different, because there is one binary probability distribution per component; for each such distribution the loss is:

    p * -tf.log(q) + (1 - p) * -tf.log(1 - q)

    Here p and (1 - p) act as the two class probabilities within each binary distribution; a small sketch below checks this against the TensorFlow op.
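
    To make that concrete, here is a minimal sketch (my own check, assuming TF 1.x and the placeholder style used elsewhere in this thread) that compares the element-wise binary formula, the numerically stable form max(x, 0) - x * z + log(1 + exp(-|x|)) given in the TensorFlow documentation for this op, and the built-in op itself:

    import tensorflow as tf

    z = tf.placeholder(tf.float32, shape=[None, 5])   # labels p
    x = tf.placeholder(tf.float32, shape=[None, 5])   # logits
    q = tf.nn.sigmoid(x)

    # Naive element-wise binary cross-entropy.
    ce_naive = z * -tf.log(q) + (1 - z) * -tf.log(1 - q)
    # Numerically stable form documented for sigmoid_cross_entropy_with_logits.
    ce_stable = tf.maximum(x, 0) - x * z + tf.log(1 + tf.exp(-tf.abs(x)))
    # The built-in op.
    ce_builtin = tf.nn.sigmoid_cross_entropy_with_logits(labels=z, logits=x)

    with tf.Session() as sess:
        feed = {z: [[0, 0, 0, 1, 0]], x: [[0.2, 0.2, 0.2, 0.2, 0.2]]}
        print(sess.run([ce_naive, ce_stable, ce_builtin], feed))  # all three should match element-wise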

     
      August 30, 2021 1:23 PM IST
  • You're confusing the cross-entropy for binary and multi-class problems.

    Multi-class cross-entropy
    The formula that you use is correct and it directly corresponds to tf.nn.softmax_cross_entropy_with_logits:

    -tf.reduce_sum(p * tf.log(q), axis=1)

    p and q are expected to be probability distributions over N classes. In particular, N can be 2, as in the following example:

    p = tf.placeholder(tf.float32, shape=[None, 2])
    logit_q = tf.placeholder(tf.float32, shape=[None, 2])
    q = tf.nn.softmax(logit_q)
    
    feed_dict = {
      p: [[0, 1],
          [1, 0],
          [1, 0]],
      logit_q: [[0.2, 0.8],
                [0.7, 0.3],
                [0.5, 0.5]]
    }
    
    prob1 = -tf.reduce_sum(p * tf.log(q), axis=1)
    prob2 = tf.nn.softmax_cross_entropy_with_logits(labels=p, logits=logit_q)
    print(prob1.eval(feed_dict))  # [ 0.43748799  0.51301527  0.69314718]
    print(prob2.eval(feed_dict))  # [ 0.43748799  0.51301527  0.69314718]

    Note that q is computed with tf.nn.softmax, i.e. it outputs a probability distribution. So this is still the multi-class cross-entropy formula, only with N = 2.
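
    As a side note (my own illustration, not part of the original answer): one quick way to see why softmax gives a distribution while sigmoid does not is to compare row sums:

    import tensorflow as tf

    logit_q = tf.placeholder(tf.float32, shape=[None, 2])
    softmax_q = tf.nn.softmax(logit_q)   # each row sums to 1
    sigmoid_q = tf.nn.sigmoid(logit_q)   # components are independent probabilities

    with tf.Session() as sess:
        feed = {logit_q: [[0.2, 0.8], [0.7, 0.3]]}
        print(sess.run(tf.reduce_sum(softmax_q, axis=1), feed))  # [1. 1.]
        print(sess.run(tf.reduce_sum(sigmoid_q, axis=1), feed))  # roughly [1.24 1.24], not a distribution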

    Binary cross-entropy
    This time the correct formula is

    p * -tf.log(q) + (1 - p) * -tf.log(1 - q)
    


    Though mathematically it's a special case of the multi-class formula, the meaning of p and q is different. In the simplest case, each p and q is a single number, corresponding to the probability of the class A.

    Important: Don't get confused by the common p * -tf.log(q) part and the sum. Previously p was a one-hot vector; now it's a number, zero or one. Same for q: it was a probability distribution, now it's a single number (a probability).

    If p is a vector, each individual component is considered an independent binary classification. See this answer that outlines the difference between softmax and sigmoid functions in TensorFlow. So the definition p = [0, 0, 0, 1, 0] doesn't mean a one-hot vector, but 5 different features, 4 of which are off and 1 is on. The definition q = [0.2, 0.2, 0.2, 0.2, 0.2] means that each of the 5 features is on with 20% probability.

    This explains the use of the sigmoid function before the cross-entropy: its goal is to squash the logit to the [0, 1] interval.

    The formula above still holds for multiple independent features, and that's exactly what tf.nn.sigmoid_cross_entropy_with_logits computes:

    p = tf.placeholder(tf.float32, shape=[None, 5])
    logit_q = tf.placeholder(tf.float32, shape=[None, 5])
    q = tf.nn.sigmoid(logit_q)
    
    feed_dict = {
      p: [[0, 0, 0, 1, 0],
          [1, 0, 0, 0, 0]],
      logit_q: [[0.2, 0.2, 0.2, 0.2, 0.2],
                [0.3, 0.3, 0.2, 0.1, 0.1]]
    }
    
    prob1 = -p * tf.log(q)
    prob2 = p * -tf.log(q) + (1 - p) * -tf.log(1 - q)
    prob3 = p * -tf.log(tf.sigmoid(logit_q)) + (1-p) * -tf.log(1-tf.sigmoid(logit_q))
    prob4 = tf.nn.sigmoid_cross_entropy_with_logits(labels=p, logits=logit_q)
    print(prob1.eval(feed_dict))
    print(prob2.eval(feed_dict))
    print(prob3.eval(feed_dict))
    print(prob4.eval(feed_dict))

    You should see that the last three tensors are equal, while prob1 is only part of the cross-entropy, so it contains the correct value only where p is 1:

    [[ 0.          0.          0.          0.59813893  0.        ]
     [ 0.55435514  0.          0.          0.          0.        ]]
    [[ 0.79813886  0.79813886  0.79813886  0.59813887  0.79813886]
     [ 0.5543552   0.85435522  0.79813886  0.74439669  0.74439669]]
    [[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
     [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]
    [[ 0.7981388   0.7981388   0.7981388   0.59813893  0.7981388 ]
     [ 0.55435514  0.85435534  0.7981388   0.74439663  0.74439663]]


    Now it should be clear that taking a sum of -p * tf.log(q) along axis=1 doesn't make sense in this setting, though it would be a valid formula in the multi-class case.
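
    As a concrete sanity check on one entry (my own arithmetic, not part of the original answer): the first row has logit 0.2 everywhere, with p = 1 in the fourth position and p = 0 elsewhere, so:

    import math

    logit = 0.2
    q = 1 / (1 + math.exp(-logit))   # sigmoid(0.2) is about 0.5498
    print(-math.log(q))              # about 0.5981, the p = 1 entry above
    print(-math.log(1 - q))          # about 0.7981, the p = 0 entries above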
      September 17, 2021 1:26 PM IST
  • sigmoid_cross_entropy_with_logits solves N binary classifications at once. ... sigmoid_cross_entropy additionally allows setting in-batch weights, i.e. making some examples more important than others.
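
    As a rough sketch of what per-example weighting means (my own illustration, built only on tf.nn.sigmoid_cross_entropy_with_logits and a hand-rolled weight vector rather than any particular weighted-loss op):

    import tensorflow as tf

    labels = tf.placeholder(tf.float32, shape=[None, 5])
    logits = tf.placeholder(tf.float32, shape=[None, 5])
    example_weights = tf.placeholder(tf.float32, shape=[None])  # one weight per example

    # Per-example loss: sum the 5 independent binary cross-entropies.
    per_example_loss = tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits), axis=1)

    # Weighted mean: up-weighted examples contribute more to the final scalar loss.
    loss = tf.reduce_sum(example_weights * per_example_loss) / tf.reduce_sum(example_weights)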
      August 31, 2021 3:48 PM IST