Does Tensorflow have something similar to scikit learn's One Hot Encoder?

  • Does TensorFlow have something similar to scikit-learn's OneHotEncoder for processing categorical data? Would using a placeholder of tf.string behave as categorical data?

    I realize I can manually pre-process the data before sending it to TensorFlow, but having it built in would be very convenient.

      December 22, 2020 5:52 PM IST
  • A simple and short way to one-hot encode any integer or list of integers is to look up rows of an identity matrix:
    import numpy as np
    import tensorflow as tf

    a = 5
    b = [1, 2, 3]

    # One-hot encode an integer: look up row `a` of a 10x10 identity matrix.
    one_hot_a = tf.nn.embedding_lookup(np.identity(10), a)

    # One-hot encode a list of integers; the depth is max(b) + 1.
    one_hot_b = tf.nn.embedding_lookup(np.identity(max(b) + 1), b)
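    With the values above, one_hot_a evaluates to [0, 0, 0, 0, 0, 1, 0, 0, 0, 0], and one_hot_b to rows 1-3 of a 4x4 identity matrix, i.e. [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]].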
      December 22, 2020 6:38 PM IST
  • As of TensorFlow 0.8, there is now a native one-hot op, tf.one_hot, that can convert a set of sparse labels to a dense one-hot representation. This is in addition to tf.nn.sparse_softmax_cross_entropy_with_logits, which can in some cases let you compute the cross entropy directly on the sparse labels instead of converting them to one-hot.
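
    For example, a minimal sketch (the label values here are just illustrative):

    import tensorflow as tf

    labels = tf.constant([0, 2, 1])
    one_hot = tf.one_hot(labels, depth=3)
    # => [[1., 0., 0.],
    #     [0., 0., 1.],
    #     [0., 1., 0.]]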

    Previous answer, in case you want to do it the old way: @Salvador's answer is correct in that there used to be no native op to do it. Instead of doing it in NumPy, though, you can do it natively in TensorFlow using the sparse-to-dense operators:

    num_labels = 10
    
    # label_batch is a tensor of numeric labels to process
    # 0 <= label < num_labels
    
    sparse_labels = tf.reshape(label_batch, [-1, 1])
    derived_size = tf.shape(label_batch)[0]
    indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
    # Note: this is the legacy TF 0.x API. In current TensorFlow the axis
    # is the second argument (tf.concat([indices, sparse_labels], 1)) and
    # tf.pack has been renamed to tf.stack.
    concated = tf.concat(1, [indices, sparse_labels])
    outshape = tf.pack([derived_size, num_labels])
    labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)

    The output, labels, is a one-hot matrix of shape batch_size x num_labels.

    Note that as of 2016-02-12 (which I assume will eventually be part of a 0.7 release), TensorFlow also has the tf.nn.sparse_softmax_cross_entropy_with_logits op, which in some cases lets you train without needing to convert to a one-hot encoding.
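
    In current TensorFlow, a minimal sketch of that op (the shapes and values here are illustrative):

    import tensorflow as tf

    logits = tf.random.normal([4, 10])  # batch of 4 examples, 10 classes
    labels = tf.constant([3, 1, 7, 0])  # integer class ids; no one-hot needed
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits)   # shape [4]: one loss per example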

    Edited to add: at the end, you may need to explicitly set the shape of labels, since shape inference cannot determine the size of the num_labels dimension. If you don't need a dynamic batch size with derived_size, this can be simplified.
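
    Concretely, that would be something like the following (a sketch against the code above):

    labels.set_shape([None, num_labels])  # batch dimension stays dynamic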

      December 22, 2020 6:44 PM IST
  • TensorFlow 2.0-compatible answer: you can do it efficiently using TensorFlow Transform (tf.Transform).

    Code for building one-hot feature columns with TensorFlow Transform is shown below; a usage sketch follows the listing:

    def get_feature_columns(tf_transform_output):
      """Returns the FeatureColumns for the model.
    
      Args:
        tf_transform_output: A `TFTransformOutput` object.
    
      Returns:
        A list of FeatureColumns.
      """
      # Wrap scalars as real-valued columns. NUMERIC_FEATURE_KEYS and
      # CATEGORICAL_FEATURE_KEYS are module-level lists of feature names
      # defined elsewhere in the pipeline.
      real_valued_columns = [tf.feature_column.numeric_column(key, shape=())
                             for key in NUMERIC_FEATURE_KEYS]
    
      # Wrap categorical columns. For a dense one-hot representation
      # (e.g. as input to a Keras DenseFeatures layer), wrap each of
      # these in tf.feature_column.indicator_column.
      one_hot_columns = [
          tf.feature_column.categorical_column_with_vocabulary_file(
              key=key,
              vocabulary_file=tf_transform_output.vocabulary_file_by_name(
                  vocab_filename=key))
          for key in CATEGORICAL_FEATURE_KEYS]
    
      return real_valued_columns + one_hot_columns
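
    As a usage sketch (assuming TF 2.x; the 'city' feature and its vocabulary are made up for illustration and are not part of the pipeline above), the categorical columns become dense one-hot vectors once wrapped in indicator_column and fed through a DenseFeatures layer:

    import tensorflow as tf

    # Hypothetical vocabulary; with tf.Transform it would come from
    # tf_transform_output.vocabulary_file_by_name, as above.
    city = tf.feature_column.categorical_column_with_vocabulary_list(
        'city', ['london', 'paris', 'tokyo'])
    one_hot_city = tf.feature_column.indicator_column(city)

    layer = tf.keras.layers.DenseFeatures([one_hot_city])
    features = {'city': tf.constant([['paris'], ['tokyo']])}
    print(layer(features))  # [[0., 1., 0.], [0., 0., 1.]]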
      December 22, 2020 10:34 PM IST