Does tensorflow have something similar to scikit learn's one hot encoder for processing categorical data? Would using a placeholder of tf.string behave as categorical data?
I realize I can manually pre-process the data before sending it to tensorflow, but having it built in is very convenient.
a = 5
b = [1, 2, 3]
# one hot an integer
one_hot_a = tf.nn.embedding_lookup(np.identity(10), a)
# one hot a list of integers
one_hot_b = tf.nn.embedding_lookup(np.identity(max(b)+1), b)
This post was edited by Tarun Reddy at December 22, 2020 6:38 PM IST
num_labels = 10
# label_batch is a tensor of numeric labels to process
# 0 <= label < num_labels
sparse_labels = tf.reshape(label_batch, [-1, 1])
derived_size = tf.shape(label_batch)[0]
indices = tf.reshape(tf.range(0, derived_size, 1), [-1, 1])
concated = tf.concat(1, [indices, sparse_labels])
outshape = tf.pack([derived_size, num_labels])
labels = tf.sparse_to_dense(concated, outshape, 1.0, 0.0)
The output, labels, is a one-hot matrix of batch_size x num_labels.
Note also that as of 2016-02-12 (which I assume will eventually be part of a 0.7 release), TensorFlow also has the tf.nn.sparse_softmax_cross_entropy_with_logits op, which in some cases can let you do training without needing to convert to a one-hot encoding.
Edited to add: At the end, you may need to explicitly set the shape of labels. The shape inference doesn't recognize the size of the num_labels component. If you don't need a dynamic batch size with derived_size, this can be simplified.
Tensorflow 2.0 Compatible Answer: You can do it efficiently using Tensorflow Transform
.
Code for performing One-Hot Encoding using Tensorflow Transform
is shown below:
def get_feature_columns(tf_transform_output):
"""Returns the FeatureColumns for the model.
Args:
tf_transform_output: A `TFTransformOutput` object.
Returns:
A list of FeatureColumns.
"""
# Wrap scalars as real valued columns.
real_valued_columns = [tf.feature_column.numeric_column(key, shape=())
for key in NUMERIC_FEATURE_KEYS]
# Wrap categorical columns.
one_hot_columns = [
tf.feature_column.categorical_column_with_vocabulary_file(
key=key,
vocabulary_file=tf_transform_output.vocabulary_file_by_name(
vocab_filename=key))
for key in CATEGORICAL_FEATURE_KEYS]
return real_valued_columns + one_hot_columns