
Hyperparameter optimization for Deep Learning Structures using Bayesian Optimization

  • I have constructed a CLDNN (Convolutional, LSTM, Deep Neural Network) structure for a raw signal classification task.
    Each training epoch runs for about 90 seconds, and the hyperparameters seem to be very difficult to optimize.
    I have been researching various ways to optimize the hyperparameters (e.g. random or grid search) and found out about Bayesian Optimization.
    Although I still do not fully understand the optimization algorithm, I feel like it will help me greatly.
    I would like to ask a few questions regarding the optimization task.
    1. How do I set up the Bayesian Optimization with regards to a deep network? (What is the cost function we are trying to optimize?)
    2. What is the function I am trying to optimize? Is it the cost of the validation set after N epochs?
    3. Is spearmint a good starting point for this task? Any other suggestions for this task?
    I would greatly appreciate any insights into this problem.
      August 28, 2021 11:39 PM IST
    0
  • Hyperparameters are important for machine learning algorithms since they directly control the behavior of the training algorithm and have a significant effect on the performance of the resulting model. Several tuning techniques have been developed and applied successfully in certain application domains, but doing this well demands professional knowledge and expert experience, and sometimes it comes down to brute-force search. An efficient hyperparameter optimization algorithm that works for any given machine learning method would therefore greatly improve the efficiency of machine learning.

    One approach is to model the relationship between the performance of a machine learning model and its hyperparameters with a Gaussian process. In this way, the hyperparameter tuning problem can be abstracted as an optimization problem, and Bayesian optimization is used to solve it. Bayesian optimization is based on Bayes' theorem: it places a prior over the objective function and uses the information gathered from previous samples to update the posterior. A utility function then selects the next sample point so as to maximize the objective.

    Experiments on standard test datasets show that this approach can find good hyperparameters for widely used machine learning models such as random forests and neural networks, and even multi-grained cascade forest, while keeping the time cost under control.
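
    As a rough sketch of that loop (prior over the objective, posterior updated from the samples gathered so far, utility function picking the next sample point), here is a minimal illustration using scikit-learn's Gaussian process as the surrogate; the toy objective, the Matern kernel and the UCB-style utility are my own illustrative choices, not something prescribed by this answer.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def objective(x):
        # Stands in for "train the model with hyperparameter x and return its score".
        return -(x - 2.0) ** 2

    # Sample points evaluated so far.
    X = np.array([[0.0], [1.0], [3.0], [4.0]])
    y = objective(X).ravel()

    candidates = np.linspace(0.0, 4.0, 401).reshape(-1, 1)

    for step in range(5):
        # Posterior: the GP prior conditioned on the samples gathered so far.
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)

        # Utility function (here an upper confidence bound) selects the next sample point.
        mu, sigma = gp.predict(candidates, return_std=True)
        x_next = candidates[np.argmax(mu + 2.0 * sigma)].reshape(1, -1)

        # Evaluate the expensive objective at that point and update the samples.
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next).ravel())

    print("best hyperparameter found:", X[np.argmax(y)][0])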

      September 9, 2021 1:00 PM IST
    0
  • Hyperparameter optimization is the process of searching for a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter of a machine learning algorithm whose value is used to control the learning process, and our task in deep learning is to find the best values for these hyperparameters.

    In your problem, you want to use Bayesian Optimization for hyperparameter tuning. The Bayesian Optimization technique aims to deal with the exploration-exploitation trade-off in the multi-armed bandit problem. In this particular problem, there is an unknown function which we can evaluate at any point, but each evaluation costs something (a direct penalty or an opportunity cost), and our goal is to find the best hyperparameters in as few iterations as possible.

    Bayesian Optimization builds a model of the target function using a Gaussian Process (GP) and, at each step, chooses the most "promising" point based on that GP model.

    [Figure: Bayesian optimization of f(x) = x * sin(x) on [-10, 10]: GP mean (red curve), mean ± one standard deviation (blue), evaluated points (red dots)]

    The true function in this example is f(x) = x * sin(x) on the [-10, 10] interval. The red dots represent the evaluated points, the red curve is the GP mean, and the blue curve is the mean plus or minus one standard deviation. The GP model doesn't match the true function everywhere, but the optimizer fairly quickly identified the "hot" area around -8 and started to exploit it.
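
    If you want to reproduce a run like the one in the figure, here is a minimal sketch using scikit-optimize's gp_minimize (the library choice, number of calls and random seed are my own assumptions, not part of the original answer). gp_minimize fits a GP to the points evaluated so far and uses an acquisition function to pick each next point; since it minimizes, we return -f(x) to search for the maximum of x * sin(x).

    import numpy as np
    from skopt import gp_minimize

    def f(x):
        # gp_minimize passes a list of parameter values and minimizes,
        # so return the negative of the true function x * sin(x).
        return -(x[0] * np.sin(x[0]))

    result = gp_minimize(
        f,
        dimensions=[(-10.0, 10.0)],  # the search interval for x
        n_calls=20,                  # number of expensive evaluations (the red dots)
        random_state=0,
    )

    print("best x found:", result.x[0])
    print("best f(x):   ", -result.fun)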

     

    Hope this answer helps.

     
      September 14, 2021 4:25 PM IST
    0
  • First up, let me briefly explain the general idea. Bayesian Optimization methods aim to deal with the exploration-exploitation trade-off in the multi-armed bandit problem. In this problem, there is an unknown function which we can evaluate at any point, but each evaluation costs something (a direct penalty or an opportunity cost), and the goal is to find its maximum using as few trials as possible. Basically, the trade-off is this: you know the function at a finite set of points (some of which are good and some are bad), so you can try an area around the current local maximum, hoping to improve it (exploitation), or you can try a completely new area of the space, which can potentially be much better or much worse (exploration), or something in between.

    Bayesian Optimization methods (e.g. PI, EI, UCB) build a model of the target function using a Gaussian Process (GP) and at each step choose the most "promising" point based on that GP model (note that "promising" can be defined differently by different methods).
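
    As a concrete reference for how a "promising" point can be scored, here is a sketch of the three acquisition rules mentioned above, written as plain functions of the GP posterior mean and standard deviation at a candidate point (generic formulas, not taken from any particular library; the xi and kappa defaults are my own).

    from scipy.stats import norm

    # mu, sigma: GP posterior mean and standard deviation at a candidate point.
    # y_best: best objective value observed so far (maximization convention).

    def probability_of_improvement(mu, sigma, y_best, xi=0.01):
        # PI: probability that the candidate beats the current best by at least xi.
        return norm.cdf((mu - y_best - xi) / sigma)

    def expected_improvement(mu, sigma, y_best, xi=0.01):
        # EI: expected amount by which the candidate beats the current best.
        z = (mu - y_best - xi) / sigma
        return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

    def upper_confidence_bound(mu, sigma, kappa=2.0):
        # UCB: optimistic value; kappa trades exploration (sigma) against exploitation (mu).
        return mu + kappa * sigma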

    Here's an example:

    [Figure: f(x) = x * sin(x) with the GP fit: true function (black), GP mean (red), mean ± one standard deviation (blue), trials (red dots)]

    The true function is f(x) = x * sin(x) (black curve) on the [-10, 10] interval. The red dots represent the trials, the red curve is the GP mean, and the blue curve is the mean plus or minus one standard deviation. As you can see, the GP model doesn't match the true function everywhere, but the optimizer fairly quickly identified the "hot" area around -8 and started to exploit it.

    How do I set up the Bayesian Optimization with regards to a deep network?

    In this case, the space is defined by (possibly transformed) hyperparameters, usually a multidimensional unit hypercube.

    For example, suppose you have three hyperparameters: a learning rate α in [0.001, 0.01], a regularizer λ in [0.1, 1] (both continuous) and the hidden layer size N in [50..100] (integer). The space for optimization is then the 3-dimensional unit cube [0, 1] × [0, 1] × [0, 1]. Each point (p0, p1, p2) in this cube corresponds to a triple (α, λ, N) via the following transformation:

    p0 -> α = 10**(p0-3)
    p1 -> λ = 10**(p1-1)
    p2 -> N = int(p2*50 + 50)
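
    In code, that transformation from the unit cube to the actual hyperparameters could look like the following sketch (the function and variable names are mine, for illustration only):

    def decode(point):
        """Map a point (p0, p1, p2) from the unit cube [0, 1]^3 to (alpha, lambda, N)."""
        p0, p1, p2 = point
        alpha = 10 ** (p0 - 3)        # learning rate in [0.001, 0.01], log scale
        lam = 10 ** (p1 - 1)          # regularizer in [0.1, 1], log scale
        n_hidden = int(p2 * 50 + 50)  # hidden layer size in [50, 100], integer
        return alpha, lam, n_hidden

    # Example: the centre of the cube.
    print(decode((0.5, 0.5, 0.5)))  # -> (0.00316..., 0.316..., 75)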

    What is the function I am trying to optimize? Is it the cost of the validation set after N epochs?

    Correct, the target function is the network's validation metric (the validation accuracy, or equivalently the validation loss) after training. Clearly, each evaluation is expensive, because it requires at least several epochs of training.

    Also note that the target function is stochastic, i.e. two evaluations at the same point may differ slightly. That's not a blocker for Bayesian Optimization, though it obviously increases the uncertainty.
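
    Putting the last two points together, the function handed to the optimizer is typically a small wrapper like the sketch below: decode the point, build and train the network for a fixed, small budget of epochs, and return the validation score. This assumes a Keras-style model; build_model, the data variables and the epoch budget are placeholders for your own CLDNN setup, not something from the original answer.

    def target(point):
        """Target function for the optimizer: validation accuracy after a fixed epoch budget."""
        alpha, lam, n_hidden = decode(point)  # decode() as sketched above

        # build_model is a placeholder for your own CLDNN constructor.
        model = build_model(learning_rate=alpha, l2=lam, hidden_units=n_hidden)
        history = model.fit(
            x_train, y_train,
            validation_data=(x_val, y_val),
            epochs=10,   # small, fixed budget per evaluation; each call is expensive
            verbose=0,
        )

        # The result is stochastic (weight init, shuffling); Bayesian Optimization
        # tolerates this, it simply adds some noise to the GP model.
        return history.history["val_accuracy"][-1]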

    Is spearmint a good starting point for this task? Any other suggestions for this task?

    spearmint is a good library; you can definitely work with it. I can also recommend hyperopt.
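
    For example, a hyperopt setup for the three hyperparameters from the earlier example might look like this sketch (hyperopt's default TPE algorithm is a sequential model-based optimizer rather than a GP, but the usage pattern is the same; train_and_score is a placeholder for a function that trains your network and returns its validation accuracy):

    import numpy as np
    from hyperopt import fmin, tpe, hp, Trials, STATUS_OK

    # Search space: the same ranges as before, expressed in hyperopt's vocabulary
    # instead of a unit cube.
    space = {
        "alpha": hp.loguniform("alpha", np.log(1e-3), np.log(1e-2)),  # learning rate
        "lam": hp.loguniform("lam", np.log(0.1), np.log(1.0)),        # regularizer
        "n_hidden": hp.quniform("n_hidden", 50, 100, 1),              # hidden layer size
    }

    def hyperopt_objective(params):
        # Train with these hyperparameters and get the validation accuracy;
        # hyperopt minimizes, so return the negative accuracy as the loss.
        val_acc = train_and_score(params["alpha"], params["lam"], int(params["n_hidden"]))
        return {"loss": -val_acc, "status": STATUS_OK}

    trials = Trials()
    best = fmin(fn=hyperopt_objective, space=space,
                algo=tpe.suggest, max_evals=50, trials=trials)
    print("best hyperparameters:", best)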

    In my own research, I ended up writing my own tiny library, basically for two reasons: I wanted to code the exact Bayesian method to use (in particular, I found that a portfolio strategy of UCB and PI converged faster than anything else in my case); and there is another technique that can save up to 50% of training time, called learning curve prediction (the idea is to skip the full learning cycle when the optimizer is confident the model isn't learning as fast as it does in other areas). I'm not aware of any library that implements this, so I coded it myself, and in the end it paid off. If you're interested, the code is on GitHub.
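
    The author's learning-curve-prediction code isn't shown here, but as a rough illustration of the general idea, here is a much cruder early-discard heuristic (a median-based stopping rule of my own choosing): abandon a trial whose partial validation curve has trailed previously completed runs at the same epochs for several epochs in a row.

    import numpy as np

    def should_abandon(partial_curve, completed_curves, patience=3):
        """Crude early-discard rule (illustrative only, not the author's method).

        partial_curve:    validation accuracies of the current trial so far.
        completed_curves: full validation curves from finished trials.
        """
        if len(completed_curves) < 3 or len(partial_curve) < patience:
            return False
        bad_streak = 0
        for epoch, acc in enumerate(partial_curve):
            peers = [c[epoch] for c in completed_curves if len(c) > epoch]
            if not peers:
                break
            bad_streak = bad_streak + 1 if acc < np.median(peers) else 0
        return bad_streak >= patience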
      August 30, 2021 1:21 PM IST
    0