ERROR:tensorflow:Model diverged with loss = NaN.
Traceback...
tensorflow.contrib.learn.python.learn.monitors.NanLossDuringTrainingError: NaN loss during training.
tf.contrib.learn.DNNClassifier(feature_columns=feature_columns, hidden_units=[300, 300, 300],
    # optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.001, l1_regularization_strength=0.00001),
    n_classes=11, model_dir="/tmp/iris_model")
Although most of the points have already been discussed, I would like to highlight one more reason for NaN that has not been mentioned.
tf.estimator.DNNClassifier(
hidden_units, feature_columns, model_dir=None, n_classes=2, weight_column=None,
label_vocabulary=None, optimizer='Adagrad', activation_fn=tf.nn.relu,
dropout=None, config=None, warm_start_from=None,
loss_reduction=losses_utils.ReductionV2.SUM_OVER_BATCH_SIZE, batch_norm=False
)
By default the activation function is ReLU. It is possible that an intermediate layer produces only negative values, which ReLU converts to 0, and this gradually stops training (the dying-ReLU problem).
I observed that LeakyReLU is able to solve such problems.
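A minimal sketch of that change, assuming a tf.estimator.DNNClassifier set up like the one in the question (the feature column here is just an illustrative placeholder):

import tensorflow as tf

# Illustrative feature column; replace with your real input columns.
feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]

# Passing LeakyReLU instead of the default ReLU keeps a small gradient for
# negative pre-activations instead of zeroing them out ("dying ReLU").
classifier = tf.estimator.DNNClassifier(
    hidden_units=[300, 300, 300],
    feature_columns=feature_columns,
    n_classes=11,
    activation_fn=tf.nn.leaky_relu,
    model_dir="/tmp/iris_model")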
I'd like to plug in some (shallow) reasons I have experienced as follows:
Hope that helps.
If you're training with a cross-entropy loss, you want to add a small number like 1e-8 to your output probability.
Because log(0) is negative infinity, once your model has trained enough the output distribution will be very skewed. For instance, say I'm doing 4-class output; at the beginning my probabilities look like
0.25 0.25 0.25 0.25
but toward the end the probability will probably look like
1.0 0 0 0
If you take the cross entropy of this distribution, everything will explode. The fix is to artificially add a small number to all the terms to prevent this.
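As a minimal sketch, assuming you compute the cross entropy yourself from softmax probabilities (the function name and epsilon value are illustrative):

import tensorflow as tf

def safe_cross_entropy(labels_one_hot, probs, eps=1e-8):
    # Clip probabilities away from exact 0 so tf.math.log never returns -inf.
    probs = tf.clip_by_value(probs, eps, 1.0)
    return -tf.reduce_sum(labels_one_hot * tf.math.log(probs), axis=-1)

In practice it is often safer to skip the manual log entirely and use tf.nn.softmax_cross_entropy_with_logits, which takes the raw logits and handles this numerical issue internally.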