How to set adaptive learning rate for GradientDescentOptimizer?

  • I am using TensorFlow to train a neural network. This is how I am initializing the GradientDescentOptimizer:

    init = tf.initialize_all_variables() 
    sess = tf.Session() 
    sess.run(init) 
    mse = tf.reduce_mean(tf.square(out - out_))
    train_step = tf.train.GradientDescentOptimizer(0.3).minimize(mse)

    The problem is that I don't know how to set an update rule or a decay value for the learning rate.

    How can I use an adaptive learning rate here?

    This post was edited by Shivakumar Kota at August 24, 2020 2:01 PM IST
      August 24, 2020 1:56 PM IST
    0
    • Pranav B
      It's a good habit to initialize all variables after you specify your optimizer, because some optimizers like AdamOptimizer use their own variables that also need to be initialized. Otherwise you may get an error that looks like this: FailedPreconditionError...
      September 16, 2020
  • First of all, tf.train.GradientDescentOptimizer is designed to use a constant learning rate for all variables in all steps. TensorFlow also provides out-of-the-box adaptive optimizers including the tf.train.AdagradOptimizer and the tf.train.AdamOptimizer, and these can be used as drop-in replacements.

    However, if you want to control the learning rate with otherwise-vanilla gradient descent, you can take advantage of the fact that the learning_rate argument to the tf.train.GradientDescentOptimizer constructor can be a Tensor object. This allows you to compute a different value for the learning rate in each step, for example:

    learning_rate = tf.placeholder(tf.float32, shape=[])
    # ...
    train_step = tf.train.GradientDescentOptimizer(
        learning_rate=learning_rate).minimize(mse)
    
    sess = tf.Session()
    
    # Feed different values for learning rate to each training step.
    sess.run(train_step, feed_dict={learning_rate: 0.1})
    sess.run(train_step, feed_dict={learning_rate: 0.1})
    sess.run(train_step, feed_dict={learning_rate: 0.01})
    sess.run(train_step, feed_dict={learning_rate: 0.01})

    Alternatively, you could create a scalar tf.Variable that holds the learning rate, and assign it each time you want to change the learning rate.
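
    As a framework-free sketch of the same idea (a learning rate kept in mutable state and reassigned between steps), the plain-Python loop below runs vanilla gradient descent on a toy loss and halves the learning rate whenever a step makes the loss worse. The toy loss, the halving rule, and every name here are assumptions made for illustration; none of them come from TensorFlow or from the answer above.

```python
def loss(w):
    """Toy loss with its minimum at w = 3 (an assumption for this sketch)."""
    return (w - 3.0) ** 2


def grad(w):
    """Derivative of the toy loss."""
    return 2.0 * (w - 3.0)


def train(w=0.0, learning_rate=1.1, steps=50, shrink=0.5):
    """Vanilla gradient descent whose learning rate is reassigned between steps."""
    best = loss(w)
    history = []
    for _ in range(steps):
        # Ordinary gradient descent update with the current learning rate.
        w = w - learning_rate * grad(w)
        current = loss(w)
        if current > best:
            # The step overshot: assign a smaller learning rate for later steps.
            learning_rate *= shrink
        else:
            best = current
        history.append(learning_rate)
    return w, history


final_w, lr_history = train()
print(final_w, lr_history[0], lr_history[-1])
```

    In TensorFlow the same pattern would presumably keep the rate in the scalar variable and run an assign op where this sketch multiplies by shrink.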
      August 24, 2020 2:05 PM IST
    0
  • From the official TensorFlow docs:

    global_step = tf.Variable(0, trainable=False)
    starter_learning_rate = 0.1
    learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step,
                                               100000, 0.96, staircase=True)
    
    # Passing global_step to minimize() will increment it at each step.
    learning_step = (
        tf.train.GradientDescentOptimizer(learning_rate)
        .minimize(...my loss..., global_step=global_step))
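
    To see what this schedule evaluates to, here is a minimal plain-Python sketch of the documented formula, decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps), where staircase=True turns the division into an integer division. The function name decayed_learning_rate is hypothetical and this is not the TensorFlow implementation.

```python
import math


def decayed_learning_rate(base_lr, global_step, decay_steps, decay_rate,
                          staircase=False):
    """Evaluate the exponential decay formula for one value of global_step."""
    exponent = global_step / decay_steps
    if staircase:
        # Staircase mode: the rate only drops at multiples of decay_steps.
        exponent = math.floor(exponent)
    return base_lr * decay_rate ** exponent


# Same constants as the snippet above: 0.1 base rate, 100000 steps, 0.96 decay.
for step in (0, 99999, 100000, 250000):
    print(step, decayed_learning_rate(0.1, step, 100000, 0.96, staircase=True))
```

    With staircase=True the rate stays at the base value for the first 100000 steps and is multiplied by 0.96 at each further multiple of 100000.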
      September 16, 2020 1:11 PM IST
    0
  • TensorFlow provides an op to automatically apply an exponential decay to a learning rate tensor: tf.train.exponential_decay. For an example of it in use, see this line in the MNIST convolutional model example. Then use @Samar Patil's suggestion above to supply this variable as the learning_rate parameter to your optimizer of choice.
    The key excerpt to look at is:
     
    # Optimizer: set up a variable that's incremented once per batch and
    # controls the learning rate decay.
    batch = tf.Variable(0)
    
    learning_rate = tf.train.exponential_decay(
      0.01,                # Base learning rate.
      batch * BATCH_SIZE,  # Current index into the dataset.
      train_size,          # Decay step.
      0.95,                # Decay rate.
      staircase=True)
    # Use simple momentum for the optimization.
    optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(
        loss, global_step=batch)
    Note the global_step=batch parameter to minimize. That tells the optimizer to helpfully increment the 'batch' parameter for you every time it trains.
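
    Because the step passed to the decay op is batch * BATCH_SIZE and the decay step is train_size, the staircase schedule in this excerpt lowers the learning rate once per pass over the training data. The plain-Python sketch below only illustrates that arithmetic; the function name lr_for_batch and the values 64 and 55000 are assumptions chosen for the example.

```python
def lr_for_batch(batch, batch_size, train_size, base_lr=0.01, decay_rate=0.95):
    """Learning rate after `batch` batches under the staircase schedule above."""
    # Number of complete passes over the training set seen so far.
    epochs_done = (batch * batch_size) // train_size
    return base_lr * decay_rate ** epochs_done


BATCH_SIZE = 64      # assumed batch size
TRAIN_SIZE = 55000   # assumed number of training examples

for batch in (0, 859, 860, 1719):
    print(batch, lr_for_batch(batch, BATCH_SIZE, TRAIN_SIZE))
```

    With these assumed sizes, batch 859 has consumed 54976 examples and still trains at the base rate, while batch 860 has crossed 55000 examples and trains at 0.95 times the base rate.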
      September 16, 2020 1:15 PM IST
    0
    • Viaan Prakash
      Usually, the variable you call batch is called global_step, and there are several convenience functions, including one for creating it, tf.train.create_global_step() (which simply creates an integer tf.Variable and adds it to the tf.GraphKeys.GLOBAL_STEP collection)...
      September 16, 2020
  • The gradient descent algorithm uses a constant learning rate, which you provide during initialization. You can pass varying learning rates in the way shown by Mrry.

    Instead, you can also use more advanced optimizers, which converge faster and adapt to the situation.

    Here is a brief explanation based on my understanding:

    • Momentum helps SGD navigate along the relevant directions and softens the oscillations in the irrelevant ones. It simply adds a fraction of the previous step to the current step. This amplifies speed in the correct direction and dampens oscillation in the wrong directions. This fraction is usually in the (0, 1) range. It also makes sense to use adaptive momentum. At the beginning of learning a big momentum will only hinder your progress, so it makes sense to use something like 0.01, and once the large gradients have disappeared you can use a bigger momentum. There is one problem with momentum: when we are very close to the goal, our momentum is in most cases very high and it does not know that it should slow down. This can cause it to miss the minimum or oscillate around it.
    • Nesterov accelerated gradient (NAG) overcomes this problem by starting to slow down early. In momentum we first compute the gradient and then make a jump in that direction, amplified by whatever momentum we had previously. NAG does the same thing in a different order: first we make a big jump based on our stored information, and then we calculate the gradient and make a small correction. This seemingly irrelevant change gives significant practical speedups.
    • AdaGrad, or adaptive gradient, allows the learning rate to adapt per parameter. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically eliminates the need to tune the learning rate. Each parameter has its own learning rate, and due to the peculiarities of the algorithm the learning rate is monotonically decreasing. This causes the biggest problem: at some point the learning rate is so small that the system stops learning.
    • AdaDelta resolves the problem of the monotonically decreasing learning rate in AdaGrad. In AdaGrad the learning rate is calculated approximately as one divided by the square root of the sum of squared gradients. At each step another squared gradient is added to the sum, so the denominator keeps growing. Instead of summing all past squared gradients, AdaDelta uses a sliding window, which allows the sum to decrease. RMSprop is very similar to AdaDelta.
    • Adam (adaptive moment estimation) is an algorithm similar to AdaDelta. In addition to storing learning rates for each of the parameters, it also stores momentum changes for each of them separately.
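
    The update rules described in the list above can be sketched in plain Python on a one-parameter toy problem. These are textbook-style formulations and not the TensorFlow implementations; the toy loss, the hyperparameter values, and all function names are assumptions made for illustration.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2 (an assumption for this sketch)."""
    return 2.0 * (w - 3.0)


def sgd_momentum(w, steps, lr=0.1, momentum=0.9, nesterov=False):
    """Gradient descent with classical or Nesterov momentum."""
    velocity = 0.0
    for _ in range(steps):
        # NAG takes the gradient at the point reached by the stored-velocity jump.
        point = w + momentum * velocity if nesterov else w
        # Keep a fraction of the previous step and add the new gradient step.
        velocity = momentum * velocity - lr * grad(point)
        w += velocity
    return w


def adagrad(w, steps, lr=1.0, eps=1e-8):
    """AdaGrad: divide by the root of the running sum of squared gradients."""
    sum_sq = 0.0
    effective_rates = []
    for _ in range(steps):
        g = grad(w)
        sum_sq += g * g  # the sum never shrinks, so the rate never grows
        rate = lr / (math.sqrt(sum_sq) + eps)
        effective_rates.append(rate)
        w -= rate * g
    return w, effective_rates


def adam(w, steps, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: decaying averages of the gradient and of the squared gradient."""
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # momentum-like average
        v = beta2 * v + (1.0 - beta2) * g * g    # per-parameter scale
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


print(sgd_momentum(0.0, 500))
print(sgd_momentum(0.0, 500, nesterov=True))
print(adagrad(0.0, 300)[0])
print(adam(0.0, 500))
```

    All four runs start at w = 0 and should approach the minimum at w = 3. The list returned by adagrad records the effective learning rate at each step, which makes the monotonic decrease mentioned above directly visible.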


      October 19, 2021 2:42 PM IST
    0