
Distributed training on AI Engine is slowed down because the evaluation is not distributed

  • We have been training a neural net on AI Engine with a dataset of 96,000,000 data points. The net was trained in a distributed manner and, as is customary, we used 20% of the dataset as evaluation data. To train in a distributed fashion we used TensorFlow Estimators and the method tf.estimator.train_and_evaluate. Since our dataset is very large, our evaluation set is also quite large. Looking at the CPU usage of the master node versus the worker nodes, and testing with an evaluation set of only 100 samples, it appears that the evaluation is not distributed and happens only on the master node. As a result, the number of ML units consumed increases by a factor of approximately 5 when going from an evaluation set of 100 data points to the standard-size evaluation set (20% of the total data), even though the amount of training data stays the same.

    We see two possible solutions to this problem:

    Running the evaluation in a distributed manner as well, but is that technically possible on the AI Platform?
    Finding a smaller but representative evaluation dataset. Is there a best-practice approach to building such a dataset? (A rough sketch of this option follows the code below.)
    Below is what I think is the relevant part of the code. The function input_fn returns a tf.data.Dataset that has been batched.

        run_config = tf.estimator.RunConfig(
            save_checkpoints_steps=1000, keep_checkpoint_max=10, tf_random_seed=random_seed
        )
    
        myestimator = _get_estimator(
            hidden_neurons, run_config, learning_rate, output_dir, my_rmse
        )
    
        # input_fn for the tf.estimator specs must be a callable that takes no arguments,
        # so we wrap our input_fn in a lambda.
        callable_train_input_fn = lambda: input_fn(
            filenames=train_paths,
            num_epochs=num_epochs,
            batch_size=train_batch_size,
            num_parallel_reads=num_parallel_reads,
            random_seed=random_seed,
            input_format=input_format,
        )
        callable_eval_input_fn = lambda: input_fn(
            filenames=eval_paths,
            num_epochs=num_epochs,
            batch_size=eval_batch_size,
            shuffle=False,
            num_parallel_reads=num_parallel_reads,
            random_seed=random_seed,
            input_format=input_format,
        )
    
        train_spec = tf.estimator.TrainSpec(
            input_fn=callable_train_input_fn, max_steps=max_steps_train
        )
    
        eval_spec = tf.estimator.EvalSpec(
            input_fn=callable_eval_input_fn,
            steps=max_steps_eval,
            throttle_secs=throttle_secs,
            exporters=[exporter],
            name="taxifare-eval",
        )
    
        tf.estimator.train_and_evaluate(myestimator, train_spec, eval_spec)
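
    For the second option, here is a rough sketch of what we have in mind (EVAL_SUBSAMPLE_FACTOR is a hypothetical knob, and it assumes the evaluation records are spread evenly enough across the files that a strided sub-sample stays representative). Since input_fn already returns a batched tf.data.Dataset, every k-th batch can be kept with Dataset.shard() and the resulting callable passed to EvalSpec exactly as above:

        # Hypothetical knob: keep 1 out of every 100 evaluation batches.
        EVAL_SUBSAMPLE_FACTOR = 100

        # Dataset.shard(k, 0) keeps every k-th batch, so the sub-sample is spread
        # across the whole evaluation set instead of being just its first N batches.
        callable_small_eval_input_fn = lambda: callable_eval_input_fn().shard(
            num_shards=EVAL_SUBSAMPLE_FACTOR, index=0
        )

        eval_spec = tf.estimator.EvalSpec(
            input_fn=callable_small_eval_input_fn,
            steps=max_steps_eval,  # still an upper bound on eval batches per run
            throttle_secs=throttle_secs,
            exporters=[exporter],
            name="taxifare-eval",
        )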
      August 24, 2021 2:30 PM IST
  • After seeing the comments and investigating a little more, it looks like the evaluation itself is not what slows down the process; rather, the evaluation happens twice (once during training and always again at the end of training). The training job therefore takes longer simply because one has to wait for the evaluation to finish. Thanks for all the comments.
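
    For anyone hitting the same issue: the knobs that control how often and how long the in-training evaluations run live on tf.estimator.EvalSpec (the final evaluation at the end of training still happens regardless). A minimal sketch, with purely illustrative values:

        eval_spec = tf.estimator.EvalSpec(
            input_fn=callable_eval_input_fn,
            steps=1000,            # evaluate on at most 1000 batches per evaluation
            start_delay_secs=600,  # wait 10 minutes before the first evaluation
            throttle_secs=3600,    # then evaluate at most once per hour
            exporters=[exporter],
            name="taxifare-eval",
        )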

     
      August 26, 2021 2:06 PM IST
  • TF isn't that comfortable for distributed learning. Check out MXNet; there's a nice intro here.
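
    For a flavour of what that looks like, here is a minimal sketch of synchronous data-parallel training with MXNet Gluon (the network, data and hyperparameters are placeholders, and it assumes the workers are started with MXNet's distributed launcher so that a "dist_sync" key-value store is available):

        import mxnet as mx
        from mxnet import autograd, gluon

        # Distributed key-value store: gradients are aggregated across all the
        # workers started by MXNet's launcher before each parameter update.
        store = mx.kv.create("dist_sync")

        # Placeholder model, loss and data; swap in your own network and pipeline.
        # In a real job each worker would also read only its own shard of the
        # data, e.g. based on store.rank and store.num_workers.
        net = gluon.nn.Dense(1)
        net.initialize(mx.init.Xavier())
        loss_fn = gluon.loss.L2Loss()

        trainer = gluon.Trainer(
            net.collect_params(), "sgd", {"learning_rate": 0.01}, kvstore=store
        )

        X = mx.nd.random.uniform(shape=(1000, 10))
        y = mx.nd.random.uniform(shape=(1000, 1))
        loader = gluon.data.DataLoader(
            gluon.data.ArrayDataset(X, y), batch_size=32, shuffle=True
        )

        for epoch in range(2):
            for data, label in loader:
                with autograd.record():
                    loss = loss_fn(net(data), label)
                loss.backward()
                trainer.step(batch_size=data.shape[0])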

     
      August 27, 2021 1:04 PM IST