
Distributed training on AI Engine is slowed down because the evaluation is not distributed

  • We have been training a neural net on AI Engine with a dataset of 96,000,000 data points. The net was trained in a distributed manner and, as is customary, we used 20% of the dataset as evaluation data. To train in a distributed fashion we used TensorFlow Estimators and the method tf.estimator.train_and_evaluate. Since our dataset is very large, our evaluation set is also quite large. Looking at the CPU usage of the master node versus the worker nodes, and testing with an evaluation set of only 100 samples, it appears that the evaluation is not distributed and happens only on the master node. As a result, the number of ML units consumed increases by a factor of approximately 5 when going from an evaluation set of 100 data points to the standard-size evaluation set (20% of the total data), even though the amount of training data stays the same.

    We see two possible solutions to this problem:

    Running the evaluation in a distributed manner as well, but is that technically possible on the AI Platform?
    Finding a smaller but representative evaluation dataset. Is there a best-practice approach to building such a dataset? (A rough sketch of this option follows the code below.)
    Below is what I think is the relevant part of the code. The function input_fn returns a tf.data.Dataset that has been batched.

        run_config = tf.estimator.RunConfig(
            save_checkpoints_steps=1000, keep_checkpoint_max=10, tf_random_seed=random_seed
        )
    
        myestimator = _get_estimator(
            hidden_neurons, run_config, learning_rate, output_dir, my_rmse
        )
    
        # input_fn for the tf.estimator specs must be a callable that takes no arguments,
        # so we wrap our input_fn in a lambda.
        callable_train_input_fn = lambda: input_fn(
            filenames=train_paths,
            num_epochs=num_epochs,
            batch_size=train_batch_size,
            num_parallel_reads=num_parallel_reads,
            random_seed=random_seed,
            input_format=input_format,
        )
        callable_eval_input_fn = lambda: input_fn(
            filenames=eval_paths,
            num_epochs=num_epochs,
            batch_size=eval_batch_size,
            shuffle=False,
            num_parallel_reads=num_parallel_reads,
            random_seed=random_seed,
            input_format=input_format,
        )
    
        train_spec = tf.estimator.TrainSpec(
            input_fn=callable_train_input_fn, max_steps=max_steps_train
        )
    
        eval_spec = tf.estimator.EvalSpec(
            input_fn=callable_eval_input_fn,
            steps=max_steps_eval,
            throttle_secs=throttle_secs,
            exporters=[exporter],
            name="taxifare-eval",
        )
    
        tf.estimator.train_and_evaluate(myestimator, train_spec, eval_spec)
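
    For the second option, here is a rough sketch of what we have in mind (EVAL_SUBSAMPLE_FACTOR is a hypothetical knob, and it assumes the evaluation records are spread evenly enough across the files that a strided sub-sample stays representative). Since input_fn already returns a batched tf.data.Dataset, every k-th batch can be kept with Dataset.shard() and the resulting callable passed to EvalSpec exactly as above:

        # Hypothetical knob: keep 1 out of every 100 evaluation batches.
        EVAL_SUBSAMPLE_FACTOR = 100

        # Dataset.shard(k, 0) keeps every k-th batch, so the sub-sample is spread
        # across the whole evaluation set instead of being just its first N batches.
        callable_small_eval_input_fn = lambda: callable_eval_input_fn().shard(
            num_shards=EVAL_SUBSAMPLE_FACTOR, index=0
        )

        eval_spec = tf.estimator.EvalSpec(
            input_fn=callable_small_eval_input_fn,
            steps=max_steps_eval,  # still an upper bound on eval batches per run
            throttle_secs=throttle_secs,
            exporters=[exporter],
            name="taxifare-eval",
        )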
      August 24, 2021 2:30 PM IST
  • After seeing the comments and investigating a little more, it looks like the evaluation itself is not what slows down the process; rather, the evaluation happens twice (once during training and always again at the end of training). The training job therefore takes longer simply because one has to wait for the evaluation to finish. Thanks for all the comments.
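
    For anyone hitting the same issue: the knobs that control how often and how long the in-training evaluations run live on tf.estimator.EvalSpec (the final evaluation at the end of training still happens regardless). A minimal sketch, with purely illustrative values:

        eval_spec = tf.estimator.EvalSpec(
            input_fn=callable_eval_input_fn,
            steps=1000,            # evaluate on at most 1000 batches per evaluation
            start_delay_secs=600,  # wait 10 minutes before the first evaluation
            throttle_secs=3600,    # then evaluate at most once per hour
            exporters=[exporter],
            name="taxifare-eval",
        )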

     
      August 26, 2021 2:06 PM IST
  • TF isn't that comfortable for distributed learning. Check out MXNet; there's a nice intro here.
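
    For a flavour of what that looks like, here is a minimal sketch of synchronous data-parallel training with MXNet Gluon (the network, data and hyperparameters are placeholders, and it assumes the workers are started with MXNet's distributed launcher so that a "dist_sync" key-value store is available):

        import mxnet as mx
        from mxnet import autograd, gluon

        # Distributed key-value store: gradients are aggregated across all the
        # workers started by MXNet's launcher before each parameter update.
        store = mx.kv.create("dist_sync")

        # Placeholder model, loss and data; swap in your own network and pipeline.
        # In a real job each worker would also read only its own shard of the
        # data, e.g. based on store.rank and store.num_workers.
        net = gluon.nn.Dense(1)
        net.initialize(mx.init.Xavier())
        loss_fn = gluon.loss.L2Loss()

        trainer = gluon.Trainer(
            net.collect_params(), "sgd", {"learning_rate": 0.01}, kvstore=store
        )

        X = mx.nd.random.uniform(shape=(1000, 10))
        y = mx.nd.random.uniform(shape=(1000, 1))
        loader = gluon.data.DataLoader(
            gluon.data.ArrayDataset(X, y), batch_size=32, shuffle=True
        )

        for epoch in range(2):
            for data, label in loader:
                with autograd.record():
                    loss = loss_fn(net(data), label)
                loss.backward()
                trainer.step(batch_size=data.shape[0])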

     
      August 27, 2021 1:04 PM IST