
How can I one hot encode in Python?

  • I have a machine learning classification problem where 80% of the variables are categorical. Must I use one hot encoding if I want to use a classifier for the classification, or can I pass the data to a classifier without the encoding?

    I am trying to do the following for feature selection:

    1. I read the train file:

      num_rows_to_read = 10000
      train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
    2. I change the type of the categorical features to 'category':

      non_categorial_features = ['orig_destination_distance',
                                'srch_adults_cnt',
                                'srch_children_cnt',
                                'srch_rm_cnt',
                                'cnt']
      
      for categorical_feature in list(train_small.columns):
          if categorical_feature not in non_categorial_features:
              train_small[categorical_feature] = train_small[categorical_feature].astype('category')
      
       
    3. I use one hot encoding:
      train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

    The problem is that the third step often gets stuck, even though I am using a powerful machine.

    Thus, without the one hot encoding I can't do any feature selection to determine the importance of the features.

    What do you recommend?

    This post was edited by Jasmine Chacko at September 14, 2020 3:35 PM IST
      September 14, 2020 3:26 PM IST
    1
  • Firstly, the easiest way to one hot encode is to use Sklearn:

    http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
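
    For reference, here is a minimal sketch of how OneHotEncoder is typically used, assuming a reasonably recent scikit-learn (0.20+) that accepts string categories; the 'animal' column and its values are made-up placeholders:

    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd

    df = pd.DataFrame({'animal': ['dog', 'cat', 'mouse', 'cat']})  # hypothetical data
    encoder = OneHotEncoder()                        # returns a sparse matrix by default
    encoded = encoder.fit_transform(df[['animal']])  # fit and transform in one step
    print(encoder.categories_)                       # the levels learned for each column
    print(encoded.toarray())                         # dense view: one column per level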

    Secondly, I don't think using pandas to one hot encode is that simple (though I haven't confirmed this); see:

    Creating dummy variables in pandas for python

    Lastly, is it necessary for you to one hot encode? One hot encoding drastically increases the number of features (one new column per level), and with it the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead you can use dummy coding (by which I mean label/integer coding, shown below).

    Dummy coding usually works well, with much less run time and complexity. A wise prof once told me, 'Less is more.'

    Here's the code for my custom encoding function if you want.

    from sklearn.preprocessing import LabelEncoder

    # Auto-encodes any dataframe column of type category or object
    # by replacing each level with an integer code.
    def dummyEncode(df):
        columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
        le = LabelEncoder()
        for feature in columnsToEncode:
            try:
                df[feature] = le.fit_transform(df[feature])
            except Exception:
                print('Error encoding ' + feature)
        return df
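
    For example (train_small is just the dataframe from the question, used illustratively):

    train_small_encoded = dummyEncode(train_small.copy())  # copy() keeps the original frame intact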


    One-hot encoding: convert a feature with n levels into indicator columns, one per level (or n-1 columns when a reference level is dropped, as in the example below).

    Index  Animal         Index  cat  mouse
      1     dog             1     0     0
      2     cat       -->   2     1     0
      3    mouse            3     0     1

    You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.

    Dummy coding (i.e. label/integer coding):

    Index  Animal         Index  Animal
      1     dog             1      0   
      2     cat       -->   2      1 
      3    mouse            3      2

    This converts each level to a numerical code instead. It greatly saves feature space, at the cost of a bit of accuracy, since the codes impose an arbitrary ordering on the levels.
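
    In pandas you can get the same integer codes without a custom function, via category codes; a small sketch mirroring the Animal example above (note the codes follow the category order, here alphabetical):

    import pandas as pd

    df = pd.DataFrame({'Animal': ['dog', 'cat', 'mouse']})
    df['Animal'] = df['Animal'].astype('category').cat.codes  # cat=0, dog=1, mouse=2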

      September 14, 2020 3:35 PM IST
    0
    • Jasmine Chacko
      Jasmine Chacko @Shivakumar Kota, 1. I have a data set which has 80% categorical variables. To my understanding I must use one hot encoding if I want to use a classifier for this data; otherwise, without the one hot encoding, the classifier won't treat the...
      September 14, 2020
    • Shivakumar Kota
      Shivakumar Kota @Jasmine Chacko, as I said, there are two options. 1) One hot encoding --> convert every level in a categorical feature to a new column. 2) Dummy coding --> convert every column to a numeric representation.
      September 14, 2020
  • One hot encoding with pandas is very easy:

    import pandas as pd

    def one_hot(df, cols):
        """
        @param df pandas DataFrame
        @param cols a list of columns to encode
        @return a DataFrame with one-hot encoding
        """
        for each in cols:
            dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
            # note: the original column is kept alongside its dummy columns
            df = pd.concat([df, dummies], axis=1)
        return df
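
    A quick usage sketch (the column names here are hypothetical):

    df = pd.DataFrame({'animal': ['dog', 'cat'], 'color': ['black', 'white']})
    df = one_hot(df, ['animal', 'color'])  # adds animal_cat, animal_dog, color_black, color_white; originals are kept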


    Another way to one-hot encode is using sklearn's LabelBinarizer:

    from sklearn.preprocessing import LabelBinarizer 
    label_binarizer = LabelBinarizer()
    label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later
    
    def one_hot_encode(x):
        """
        One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
        : x: List of sample Labels
        : return: Numpy array of one-hot encoded labels
        """
        return label_binarizer.transform(x)
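
    A small usage sketch (the label values are made up):

    label_binarizer = LabelBinarizer()
    label_binarizer.fit(['cat', 'dog', 'mouse'])
    print(one_hot_encode(['dog', 'cat']))  # -> [[0 1 0], [1 0 0]]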
      September 14, 2020 3:41 PM IST
    0
  • You can do it with numpy.eye, using the array element selection mechanism:

    import numpy as np
    nb_classes = 6
    data = [[2, 3, 4, 0]]
    
    def indices_to_one_hot(data, nb_classes):
        """Convert an iterable of indices to one-hot encoded labels."""
        targets = np.array(data).reshape(-1)
        return np.eye(nb_classes)[targets]
     

    The return value of indices_to_one_hot(data, nb_classes) is now

    array([[ 0.,  0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.,  0.],
           [ 1.,  0.,  0.,  0.,  0.,  0.]])

    The .reshape(-1) is there to make sure the labels are in a flat format (your data might also come as [[2], [3], [4], [0]]).
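
    As a quick check of that point, the nested form gives the same result with the function defined above:

    indices_to_one_hot([[2], [3], [4], [0]], nb_classes)  # same 4 x 6 array as above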

      September 14, 2020 4:12 PM IST
    0
  • pandas has an inbuilt function, get_dummies, to get the one hot encoding of particular column(s).

    One line of code for one-hot encoding:

    df = pd.concat([df, pd.get_dummies(df['column name'], prefix='column name')], axis=1).drop(['column name'], axis=1)
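
    For instance, with a hypothetical 'animal' column in a toy dataframe:

    import pandas as pd

    df = pd.DataFrame({'animal': ['dog', 'cat', 'mouse'], 'price': [1, 2, 3]})
    df = pd.concat([df, pd.get_dummies(df['animal'], prefix='animal')], axis=1).drop(['animal'], axis=1)
    # df now has columns price, animal_cat, animal_dog, animal_mouse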
     
      September 14, 2020 4:13 PM IST
    0
  • The simplest way to one hot encode a dataframe automatically is to use this function:

    import pandas as pd

    def hot_encode(df):
        # one hot encode every object (string) column; .values returns a NumPy array
        obj_df = df.select_dtypes(include=['object'])
        return pd.get_dummies(df, columns=obj_df.columns).values
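
    For a toy frame this would look like:

    df = pd.DataFrame({'animal': ['dog', 'cat'], 'count': [1, 2]})
    X = hot_encode(df)  # NumPy array with columns count, animal_cat, animal_dog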
      September 14, 2020 4:15 PM IST
    0