What are the differences between all these cross-entropy losses in Keras and TensorFlow?

  • What are the differences between all these cross-entropy losses?
    Keras offers
    • Binary cross-entropy
    • Categorical cross-entropy
    • Sparse categorical cross-entropy
    While TensorFlow has
    • Softmax cross-entropy with logits
    • Sparse softmax cross-entropy with logits
    • Sigmoid cross-entropy with logits
    What are the differences and relationships between them? What are the typical applications for them? What's the mathematical background? Are there other cross-entropy types that one should know? Are there any cross-entropy types without logits?
      September 24, 2021 12:42 PM IST
  • There is just one cross (Shannon) entropy defined as:

    H(P, Q) = - SUM_i P(X=i) log Q(X=i)

    In machine learning usage, P is the actual (ground truth) distribution and Q is the predicted distribution. All the functions you listed are just helper functions that accept different ways of representing P and Q.
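
    For concreteness, here is a minimal NumPy sketch of that definition (the numbers are made up for illustration):

    import numpy as np

    def cross_entropy(p, q):
        # H(P, Q) = - SUM_i P(X=i) log Q(X=i)
        p = np.asarray(p, dtype=float)
        q = np.asarray(q, dtype=float)
        return -np.sum(p * np.log(q))

    p = [0.0, 1.0, 0.0]   # hard target: the true class is index 1
    q = [0.1, 0.7, 0.2]   # predicted distribution
    print(cross_entropy(p, q))   # -log(0.7), approx. 0.357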

    There are three main things to consider:

    • There are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then Q(X=1) = 1 - Q(X=0), so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (and so does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one per each Q(X=i)).

    • One either produces proper probabilities (meaning that Q(X=i) >= 0 and SUM_i Q(X=i) = 1), or one just produces a "score" and has some fixed method of transforming the score to a probability. For example, a single real number can be transformed to a probability by taking its sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on (see the sketch after this list).

    • Either there is a j such that P(X=j) = 1 (there is one "true class"; targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, but 40% that it is actually a dog").
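
    To illustrate the second point, a plain NumPy sketch of the two standard score-to-probability transforms (the input values are arbitrary):

    import numpy as np

    def sigmoid(z):
        # Maps a single real score to a probability in (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        # Maps a vector of K scores to a proper distribution over K classes.
        z = np.asarray(z, dtype=float)
        e = np.exp(z - z.max())   # subtract the max for numerical stability
        return e / e.sum()

    print(sigmoid(0.5))              # ~0.622, so Q(X=1)=0.622 and Q(X=0)=0.378
    print(softmax([2.0, 1.0, 0.1]))  # non-negative and sums to 1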

    Depending on these three aspects, a different helper function should be used:

                                      outcomes     what is in Q    targets in P   
    -------------------------------------------------------------------------------
    binary CE                                2      probability         any
    categorical CE                          >2      probability         soft
    sparse categorical CE                   >2      probability         hard
    sigmoid CE with logits                   2      score               any
    softmax CE with logits                  >2      score               soft
    sparse softmax CE with logits           >2      score               hard
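
    To make the table concrete, here is a hedged sketch of how each row maps onto the actual Keras/TensorFlow helpers, assuming TF 2.x (the tensor values are arbitrary):

    import tensorflow as tf

    logits = tf.constant([[2.0, -1.0, 0.5]])   # raw scores for 3 classes
    probs  = tf.nn.softmax(logits)             # proper probabilities
    onehot = tf.constant([[0.0, 1.0, 0.0]])    # soft/one-hot target
    label  = tf.constant([1])                  # hard integer target

    # Keras losses expect probabilities:
    tf.keras.losses.categorical_crossentropy(onehot, probs)
    tf.keras.losses.sparse_categorical_crossentropy(label, probs)

    # TF "with logits" losses expect raw scores:
    tf.nn.softmax_cross_entropy_with_logits(labels=onehot, logits=logits)
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=label, logits=logits)

    # Binary case: a single output unit.
    score  = tf.constant([[0.8]])
    target = tf.constant([[1.0]])
    tf.keras.losses.binary_crossentropy(target, tf.sigmoid(score))
    tf.nn.sigmoid_cross_entropy_with_logits(labels=target, logits=score)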

    In the end, one could just use "categorical cross-entropy", as this is how it is mathematically defined; however, since things like hard targets and binary classification are very popular, modern ML libraries provide these additional helper functions to make things simpler. In particular, "stacking" sigmoid and cross-entropy can be numerically unstable, but if one knows that these two operations are applied together, there is a numerically stable version of the combination (which is what TF implements).
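
    A small NumPy demonstration of the instability; the fused formula below is the one documented for tf.nn.sigmoid_cross_entropy_with_logits:

    import numpy as np

    x = np.array([-800.0])   # an extreme but valid logit
    y = 1.0                  # the true label

    # Naive composition: sigmoid underflows to exactly 0, so log(0) blows up.
    with np.errstate(over="ignore", divide="ignore"):
        naive = -y * np.log(1.0 / (1.0 + np.exp(-x)))

    # Stable fused form: loss = max(x, 0) - x*y + log(1 + exp(-|x|))
    stable = np.maximum(x, 0) - x * y + np.log1p(np.exp(-np.abs(x)))

    print(naive, stable)     # [inf] vs [800.] -- the finite, correct value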

    It is important to notice that if you apply the wrong helper function, the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper to a binary classification network with a single output, the network will be considered to always produce "True" at that output (softmax over a single logit is always exactly 1).

    As a final note, this answer considers single-label classification. Things are slightly different in the multi-label case (when a single point can have multiple labels): then the P(X=i) do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units, as sketched below.
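
    A minimal sketch of that multi-label case (the label semantics in the comments are made up for illustration):

    import tensorflow as tf

    # One sample with 3 non-exclusive labels; say labels 0 and 2 are both true.
    logits = tf.constant([[1.2, -0.8, 3.0]])
    labels = tf.constant([[1.0, 0.0, 1.0]])   # note: does not sum to 1

    # One independent sigmoid cross-entropy per output unit:
    per_label = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)
    loss = tf.reduce_mean(per_label)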

      October 13, 2021 1:58 PM IST
  • Logits
    For this purpose, "logits" can be seen as the non-activated outputs of the model.
    • Keras losses always take an "activated" output (you must apply "sigmoid" or "softmax" before the loss).
    • TensorFlow's "with logits" losses take the "non-activated" output (you must not apply "sigmoid" or "softmax" before the loss).
    Losses "with logits" will apply the activation internally. Some Keras losses also let you choose from_logits=True or from_logits=False, which tells the function whether to apply the activation itself or to expect it already applied.

    Sparse
    • Sparse functions use the target data (ground truth) as "integer labels": 0, 1, 2, 3, 4, ...
    • Non-sparse functions use the target data as "one-hot labels": [1,0,0], [0,1,0], [0,0,1]
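
    A quick sketch of the same target expressed both ways (assuming TF 2.x):

    import tensorflow as tf

    probs = tf.constant([[0.1, 0.7, 0.2]])

    # Sparse: the target is an integer class index.
    sparse = tf.keras.losses.sparse_categorical_crossentropy(tf.constant([1]), probs)

    # Non-sparse: the same target as a one-hot vector.
    dense = tf.keras.losses.categorical_crossentropy(tf.constant([[0.0, 1.0, 0.0]]), probs)

    print(float(sparse), float(dense))   # both are -log(0.7)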

    Binary crossentropy = Sigmoid crossentropy
    • Problem type:
      • single class (false/true); or
      • non-exclusive multiclass (many classes may be correct)
    • Model output shape: (batch, ..., >=1)
    • Activation: "sigmoid"
    Categorical crossentropy = Softmax crossentropy
    • Problem type: exclusive classes (only one class may be correct)
    • Model output shape: (batch, ..., >=2)
    • Activation: "softmax"
      October 16, 2021 2:51 PM IST
    I assume that you are looking for intuitive definitions, since the technical ones take quite some time to understand. First of all, let's recall a preliminary concept needed for those definitions.

    • Decision problem: A problem with a yes or no answer.

    Now, let us define those complexity classes.

    P

    P is a complexity class that represents the set of all decision problems that can be solved in polynomial time.

    That is, given an instance of the problem, the answer yes or no can be decided in polynomial time.

    Example

    Given a connected graph G, can its vertices be coloured using two colours so that no edge is monochromatic?

    Algorithm: start with an arbitrary vertex, colour it red, colour all of its neighbours blue, and continue outward. Stop when you run out of vertices or when you are forced to make an edge have both of its endpoints the same colour (a sketch of this check follows).
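
    A sketch of that algorithm in Python, to show it is clearly polynomial (the adjacency-list representation is an assumption):

    from collections import deque

    def is_two_colourable(adj):
        # BFS 2-colouring of a connected graph given as an adjacency list.
        # Runs in O(V + E), i.e., polynomial time.
        colour = {}
        start = next(iter(adj))
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]   # give v the opposite colour
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return False                # monochromatic edge found
        return True

    # A 4-cycle is 2-colourable; a triangle is not.
    print(is_two_colourable({0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}))  # True
    print(is_two_colourable({0: [1, 2], 1: [0, 2], 2: [0, 1]}))             # False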


    NP

    NP is a complexity class that represents the set of all decision problems for which the instances where the answer is "yes" have proofs that can be verified in polynomial time.

    This means that if someone gives us an instance of the problem and a certificate (sometimes called a witness) to the answer being yes, we can check that it is correct in polynomial time.

    Example

    Integer factorisation is in NP. This is the problem: given integers n and m, is there an integer f with 1 < f < m such that f divides n (i.e., f is a small factor of n)?

    This is a decision problem because the answers are yes or no. If someone hands us an instance of the problem (so they hand us integers n and m) and an integer f with 1 < f < m, and claim that f is a factor of n (the certificate), we can check the answer in polynomial time by performing the division n / f.
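
    That polynomial-time check is essentially a one-liner; a tiny sketch:

    def verify_factor_certificate(n, m, f):
        # Polynomial-time verification of the certificate f for the instance (n, m).
        return 1 < f < m and n % f == 0

    print(verify_factor_certificate(91, 10, 7))   # True: 1 < 7 < 10 and 7 divides 91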


    NP-Complete

    NP-Complete is a complexity class which represents the set of all problems X in NP for which it is possible to reduce any other NP problem Y to X in polynomial time.

    Intuitively this means that we can solve Y quickly if we know how to solve X quickly. Precisely, Y is reducible to X, if there is a polynomial time algorithm f to transform instances y of Y to instances x = f(y) of X in polynomial time, with the property that the answer to y is yes, if and only if the answer to f(y) is yes.

    Example

    3-SAT. This is the problem wherein we are given a conjunction (ANDs) of 3-clause disjunctions (ORs), statements of the form

    (x_v11 OR x_v21 OR x_v31) AND 
    (x_v12 OR x_v22 OR x_v32) AND 
    ...                       AND 
    (x_v1n OR x_v2n OR x_v3n)
    

    where each x_vij is a boolean variable or the negation of a variable from a finite predefined list (x_1, x_2, ... x_n).
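
    Verifying a certificate (a candidate truth assignment) against such a formula is clearly polynomial, which is what places 3-SAT in NP; a minimal sketch (the encoding of literals as signed integers is an assumption):

    def verify_3sat(clauses, assignment):
        # A clause is a triple of literals: +i means x_i, -i means NOT x_i.
        # Checking every clause is linear in the formula size, hence polynomial.
        def literal_value(l):
            value = assignment[abs(l)]
            return value if l > 0 else not value
        return all(any(literal_value(l) for l in clause) for clause in clauses)

    # (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR x2)
    formula = [(1, -2, 3), (-1, 2, 2)]
    print(verify_3sat(formula, {1: True, 2: True, 3: False}))   # True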

    It can be shown that every NP problem can be reduced to 3-SAT. The proof of this is technical and requires use of the technical definition of NP (based on non-deterministic Turing machines). This is known as Cook's theorem.

    What makes NP-complete problems important is that if a deterministic polynomial time algorithm can be found to solve one of them, every NP problem is solvable in polynomial time (one problem to rule them all).


    NP-hard

    Intuitively, these are the problems that are at least as hard as the NP-complete problems. Note that NP-hard problems do not have to be in NP, and they do not have to be decision problems.

    The precise definition here is that a problem X is NP-hard, if there is an NP-complete problem Y, such that Y is reducible to X in polynomial time.

    But since any NP-complete problem can be reduced to any other NP-complete problem in polynomial time, all NP-complete problems can be reduced to any NP-hard problem in polynomial time. Then, if there is a solution to one NP-hard problem in polynomial time, there is a solution to all NP problems in polynomial time.

    Example

    The halting problem is an NP-hard problem. This is the problem that given a program P and input I, will it halt? This is a decision problem but it is not in NP. It is clear that any NP-complete problem can be reduced to this one. As another example, any NP-complete problem is NP-hard.

    My favorite NP-complete problem is the Minesweeper problem.


    P = NP

    This one is the most famous problem in computer science, and one of the most important outstanding questions in the mathematical sciences. In fact, the Clay Institute is offering one million dollars for a solution to the problem (Stephen Cook's writeup on the Clay website is quite good).

    It's clear that P is a subset of NP. The open question is whether or not NP problems have deterministic polynomial time solutions. It is largely believed that they do not. Here is an outstanding recent article on the latest (and the importance) of the P = NP problem: The Status of the P versus NP problem.

      January 4, 2022 1:14 PM IST