
NaN loss when training regression network

  • I have a data matrix in "one-hot encoding" (all ones and zeros) with 260,000 rows and 35 columns. I am using Keras to train a simple neural network to predict a continuous variable. The code to make the network is the following:

    from keras.models import Sequential
    from keras.layers.core import Dense, Activation, Dropout
    from keras.optimizers import SGD, RMSprop
    from keras.callbacks import EarlyStopping

    model = Sequential()
    model.add(Dense(1024, input_shape=(n_train,)))
    model.add(Activation('relu'))
    model.add(Dropout(0.1))
    
    model.add(Dense(512))
    model.add(Activation('relu'))
    model.add(Dropout(0.1))
    
    model.add(Dense(256))
    model.add(Activation('relu'))
    model.add(Dropout(0.1))
    model.add(Dense(1))
    
    sgd = SGD(lr=0.01, nesterov=True)
    #rms = RMSprop()
    #model.compile(loss='categorical_crossentropy', optimizer=rms, metrics=['accuracy'])
    model.compile(loss='mean_absolute_error', optimizer=sgd)
    model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1,
              validation_data=(X_test, Y_test),
              callbacks=[EarlyStopping(monitor='val_loss', patience=4)])

    However, during the training process I see the loss decrease nicely, but in the middle of the second epoch it goes to nan:

    Train on 260000 samples, validate on 64905 samples
    Epoch 1/3
    260000/260000 [==============================] - 254s - loss: 16.2775 - val_loss: 13.4925
    Epoch 2/3
     88448/260000 [=========>....................] - ETA: 161s - loss: nan

    I tried using RMSProp instead of SGD, I tried tanh instead of relu, and I tried with and without dropout, all to no avail. I tried a smaller model, i.e. with only one hidden layer, and got the same issue (it becomes nan at a different point). However, it does work with fewer features, i.e. with only 5 columns, and gives quite good predictions. It seems there is some kind of overflow, but I can't imagine why: the loss is not unreasonably large at all.

    Python version 2.7.11, running on a Linux machine, CPU only. I tested with the latest version of Theano and also get NaNs, so I went back to Theano 0.8.2 and have the same problem. The latest version of Keras has the same problem, and so does version 0.3.2.
      September 7, 2020 2:56 PM IST
    1
    • Rakesh Racharla
      Try loss='mean_squared_error', optimizer='adam' with a single hidden layer - still NaNs?
      September 7, 2020
    • Viaan Prakash
      @Rakesh Racharla, when using the above model with the Adam optimizer, I get NaNs. With just one layer, it does not give NaNs during the three epochs of training.
      September 7, 2020
  • Regression with neural networks is hard to get working because the output is unbounded, so you are especially prone to the exploding gradients problem (the likely cause of the nans).

    Historically, one key solution to exploding gradients was to reduce the learning rate, but with the advent of per-parameter adaptive learning rate algorithms like Adam, you no longer need to set a learning rate to get good performance. There is very little reason to use SGD with momentum anymore unless you're a neural network fiend and know how to tune the learning schedule.
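
    For instance, here is a minimal sketch of the compile/fit step from the question with Adam swapped in (the default Adam settings are an assumption, not a tuned recommendation):

    from keras.optimizers import Adam

    # Same architecture and loss as in the question; only the optimizer changes.
    # Adam's per-parameter adaptive learning rates usually work well at their defaults.
    model.compile(loss='mean_absolute_error', optimizer=Adam())
    model.fit(X_train, Y_train, batch_size=32, nb_epoch=3, verbose=1,
              validation_data=(X_test, Y_test))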

    Here are some things you could potentially try:

    1. Normalize your outputs by quantile normalizing or z-scoring. To be rigorous, compute this transformation on the training data, not on the entire dataset. For example, with quantile normalization, if an example is in the 60th percentile of the training set, it gets a value of 0.6. (You can also shift the quantile-normalized values down by 0.5 so that the 0th percentile is -0.5 and the 100th percentile is +0.5.) A short sketch of this is shown after the list.

    2. Add regularization, either by increasing the dropout rate or by adding L1 and L2 penalties to the weights. L1 regularization is analogous to feature selection, and since you said that reducing the number of features to 5 gives good performance, L1 may help here too.

    3. If these still don't help, reduce the size of your network. This is not always the best idea since it can harm performance, but in your case you have a large number of first-layer neurons (1024) relative to input features (35) so it may help.

    4. Increase the batch size from 32 to 128. 128 is fairly standard and could potentially increase the stability of the optimization.
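
    As mentioned in point 1, here is a short sketch of normalizing the targets using statistics computed on the training set only. It assumes Y_train and Y_test are 1-D NumPy arrays of the continuous target:

    import numpy as np

    # Z-score the targets with training-set statistics only.
    y_mean = Y_train.mean()
    y_std = Y_train.std()
    Y_train_z = (Y_train - y_mean) / y_std
    Y_test_z = (Y_test - y_mean) / y_std

    # Alternatively, quantile-normalize: map each value to its percentile in the
    # training set, shifted so the range is roughly [-0.5, +0.5].
    sorted_train = np.sort(Y_train)
    Y_train_q = np.searchsorted(sorted_train, Y_train) / float(len(sorted_train)) - 0.5
    Y_test_q = np.searchsorted(sorted_train, Y_test) / float(len(sorted_train)) - 0.5

    Remember to invert whichever transformation you use when reading off predictions.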

      September 7, 2020 3:00 PM IST
    0
  • The answer above is quite good. However, all of the fixes seem to address the issue indirectly rather than directly. I would recommend using gradient clipping, which will clip any gradients that are above a certain value.

    In Keras you can use clipnorm=1 (see https://keras.io/optimizers/) to clip all gradients whose norm exceeds 1.
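
    A minimal sketch of what that looks like with the SGD settings from the question (the threshold of 1 is just the example value above):

    from keras.optimizers import SGD

    # Any gradient whose L2 norm exceeds 1 is rescaled to norm 1 before the update.
    sgd = SGD(lr=0.01, nesterov=True, clipnorm=1.)
    model.compile(loss='mean_absolute_error', optimizer=sgd)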
      September 7, 2020 3:01 PM IST
    0
  • I faced a very similar problem, and this is how I got it to run.

    The first thing you can try is changing your activation to LeakyReLU instead of using ReLU or tanh. The reason is that often, many of the nodes within your layers have an activation of zero, and backpropagation doesn't update the weights for these nodes because their gradient is also zero. This is also called the 'dying ReLU' problem (you can read more about it here: https://datascience.stackexchange.com/questions/5706/what-is-the-dying-relu-problem-in-neural-networks).

    To do this, you can import the LeakyReLU activation using:

    from keras.layers.advanced_activations import LeakyReLU

    and incorporate it within your layers like this:

    # LeakyReLU keeps a small slope (alpha) for negative inputs instead of zeroing them out
    model.add(Dense(800, input_shape=(num_inputs,)))
    model.add(LeakyReLU(alpha=0.1))

    Additionally, it is possible that the output feature (the continuous variable you are trying to predict) is an imbalanced data set and has too many 0s. One way to fix this issue is to use smoothing. You can do this by adding 1 to the numerator of all your values in this column and dividing each of the values in this column by 1/(average of all the values in this column)

    This essentially shifts all the values from 0 to a value greater than 0 (which may still be very small). This prevents the curve from predicting 0s and minimizing the loss (eventually making it NaN). Smaller values are more greatly impacted than larger values, but on the whole, the average of the data set remains the same.
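
    For what it's worth, a literal reading of that recipe, assuming y is a 1-D NumPy array of the target column with a positive mean, would be:

    import numpy as np

    # Add 1 to every value, then divide by 1/(column average), i.e. multiply by the average.
    # This is only a sketch of the smoothing described above, not a general-purpose recipe.
    y_smoothed = (y + 1) / (1.0 / y.mean())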

      September 7, 2020 3:03 PM IST
    0
  • I faced the same problem when using an LSTM. The problem was that my data had some NaN values after standardization, so we should check the model's input data after standardization. If NaNs are present, you will see it with:

    print(np.any(np.isnan(X_test)))
    print(np.any(np.isnan(y_test)))

    You can solve this by adding a small value (0.000001) to the standard deviation, like this:
    import numpy as np

    def standardize(train, test):
        # Compute statistics on the training set only
        mean = np.mean(train, axis=0)
        # Add a small epsilon so constant columns do not produce a zero std (and NaNs)
        std = np.std(train, axis=0) + 0.000001

        X_train = (train - mean) / std
        X_test = (test - mean) / std
        return X_train, X_test
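
    For example (X_train_raw and X_test_raw are placeholder names for your unscaled matrices):

    X_train, X_test = standardize(X_train_raw, X_test_raw)
    print(np.any(np.isnan(X_train)), np.any(np.isnan(X_test)))  # should now both be False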
      September 7, 2020 3:05 PM IST
    0