
How can I one hot encode in Python?

  • I have a machine learning classification problem where 80% of the variables are categorical. Must I use one hot encoding if I want to use a classifier for the classification, or can I pass the data to a classifier without the encoding?

    I am trying to do the following for feature selection:

    1. I read the train file:

      num_rows_to_read = 10000
      train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
    2. I change the type of the categorical features to 'category':

      non_categorial_features = ['orig_destination_distance',
                                'srch_adults_cnt',
                                'srch_children_cnt',
                                'srch_rm_cnt',
                                'cnt']
      
      for categorical_feature in list(train_small.columns):
          if categorical_feature not in non_categorial_features:
              train_small[categorical_feature] = train_small[categorical_feature].astype('category')
      
       
    3. I use one hot encoding:
      train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

    The problem is that the third step often gets stuck, even though I am using a powerful machine.

    Thus, without the one hot encoding I can't do any feature selection to determine the importance of the features.

    What do you recommend?

    This post was edited by Jasmine Chacko at September 14, 2020 3:35 PM IST
      September 14, 2020 3:26 PM IST
    1
  • Firstly, the easiest way to one hot encode is to use Sklearn:

    http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
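
    For reference, here is a minimal sketch of how OneHotEncoder is typically used, assuming a reasonably recent scikit-learn (0.20+) that accepts string categories; the 'animal' column and its values are made-up placeholders:

    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd

    df = pd.DataFrame({'animal': ['dog', 'cat', 'mouse', 'cat']})  # hypothetical data
    encoder = OneHotEncoder()                        # returns a sparse matrix by default
    encoded = encoder.fit_transform(df[['animal']])  # fit and transform in one step
    print(encoder.categories_)                       # the levels learned for each column
    print(encoded.toarray())                         # dense view: one column per level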

    Secondly, I don't think using pandas to one hot encode is that simple (though I haven't confirmed this); see:

    Creating dummy variables in pandas for python

    Lastly, is it necessary for you to one hot encode? One hot encoding drastically increases the number of features (one new column per level), and with it the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead you can use dummy coding (by which I mean label/integer coding, shown below).

    Dummy coding usually works well, with much less run time and complexity. A wise prof once told me, 'Less is more.'

    Here's the code for my custom encoding function if you want.

    from sklearn.preprocessing import LabelEncoder

    # Auto-encodes any dataframe column of type category or object
    # by replacing each level with an integer code.
    def dummyEncode(df):
        columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
        le = LabelEncoder()
        for feature in columnsToEncode:
            try:
                df[feature] = le.fit_transform(df[feature])
            except Exception:
                print('Error encoding ' + feature)
        return df
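
    For example (train_small is just the dataframe from the question, used illustratively):

    train_small_encoded = dummyEncode(train_small.copy())  # copy() keeps the original frame intact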


    One-hot encoding: convert a feature with n levels into indicator columns, one per level (or n-1 columns when a reference level is dropped, as in the example below).

    Index  Animal         Index  cat  mouse
      1     dog             1     0     0
      2     cat       -->   2     1     0
      3    mouse            3     0     1

    You can see how this will explode your memory if you have many different types (or levels) in your categorical feature. Keep in mind, this is just ONE column.

    Dummy coding (i.e. label/integer coding):

    Index  Animal         Index  Animal
      1     dog             1      0   
      2     cat       -->   2      1 
      3    mouse            3      2

    This converts each level to a numerical code instead. It greatly saves feature space, at the cost of a bit of accuracy, since the codes impose an arbitrary ordering on the levels.
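
    In pandas you can get the same integer codes without a custom function, via category codes; a small sketch mirroring the Animal example above (note the codes follow the category order, here alphabetical):

    import pandas as pd

    df = pd.DataFrame({'Animal': ['dog', 'cat', 'mouse']})
    df['Animal'] = df['Animal'].astype('category').cat.codes  # cat=0, dog=1, mouse=2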

      September 14, 2020 3:35 PM IST
    0
    • Jasmine Chacko
      Jasmine Chacko @Shivakumar Kota, 1. I have a data set which has 80% categorical variables. To my understanding I must use one hot encoding if I want to use a classifier for this data; otherwise, without the one hot encoding, the classifier won't treat the...
      September 14, 2020
    • Shivakumar Kota
      Shivakumar Kota @Jasmine Chacko, as I said, there are two options. 1) One hot encoding --> convert every level in a categorical feature to a new column. 2) Dummy coding --> convert every column to a numeric representation.
      September 14, 2020
  • One hot encoding with pandas is very easy:

    import pandas as pd

    def one_hot(df, cols):
        """
        @param df pandas DataFrame
        @param cols a list of columns to encode
        @return a DataFrame with one-hot encoding
        """
        for each in cols:
            dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
            # note: the original column is kept alongside its dummy columns
            df = pd.concat([df, dummies], axis=1)
        return df
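
    A quick usage sketch (the column names here are hypothetical):

    df = pd.DataFrame({'animal': ['dog', 'cat'], 'color': ['black', 'white']})
    df = one_hot(df, ['animal', 'color'])  # adds animal_cat, animal_dog, color_black, color_white; originals are kept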


    Another way to one-hot encode is using sklearn's LabelBinarizer:

    from sklearn.preprocessing import LabelBinarizer 
    label_binarizer = LabelBinarizer()
    label_binarizer.fit(all_your_labels_list) # need to be global or remembered to use it later
    
    def one_hot_encode(x):
        """
        One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
        : x: List of sample Labels
        : return: Numpy array of one-hot encoded labels
        """
        return label_binarizer.transform(x)
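
    A small usage sketch (the label values are made up):

    label_binarizer = LabelBinarizer()
    label_binarizer.fit(['cat', 'dog', 'mouse'])
    print(one_hot_encode(['dog', 'cat']))  # -> [[0 1 0], [1 0 0]]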
      September 14, 2020 3:41 PM IST
    0
  • You can do it with numpy.eye, using the array element selection mechanism:

    import numpy as np
    nb_classes = 6
    data = [[2, 3, 4, 0]]
    
    def indices_to_one_hot(data, nb_classes):
        """Convert an iterable of indices to one-hot encoded labels."""
        targets = np.array(data).reshape(-1)
        return np.eye(nb_classes)[targets]
     

    The return value of indices_to_one_hot(data, nb_classes) is now

    array([[ 0.,  0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  1.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  1.,  0.],
           [ 1.,  0.,  0.,  0.,  0.,  0.]])

    The .reshape(-1) is there to make sure the labels are in a flat format (your data might also come as [[2], [3], [4], [0]]).
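
    As a quick check of that point, the nested form gives the same result with the function defined above:

    indices_to_one_hot([[2], [3], [4], [0]], nb_classes)  # same 4 x 6 array as above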

      September 14, 2020 4:12 PM IST
    0
  • pandas has an inbuilt function, get_dummies, to get the one hot encoding of particular column(s).

    One line of code for one-hot encoding:

    df = pd.concat([df, pd.get_dummies(df['column name'], prefix='column name')], axis=1).drop(['column name'], axis=1)
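
    For instance, with a hypothetical 'animal' column in a toy dataframe:

    import pandas as pd

    df = pd.DataFrame({'animal': ['dog', 'cat', 'mouse'], 'price': [1, 2, 3]})
    df = pd.concat([df, pd.get_dummies(df['animal'], prefix='animal')], axis=1).drop(['animal'], axis=1)
    # df now has columns price, animal_cat, animal_dog, animal_mouse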
     
      September 14, 2020 4:13 PM IST
    0
  • The simplest way to one hot encode a dataframe automatically is to use this function:

    import pandas as pd

    def hot_encode(df):
        # one hot encode every object (string) column; .values returns a NumPy array
        obj_df = df.select_dtypes(include=['object'])
        return pd.get_dummies(df, columns=obj_df.columns).values
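
    For a toy frame this would look like:

    df = pd.DataFrame({'animal': ['dog', 'cat'], 'count': [1, 2]})
    X = hot_encode(df)  # NumPy array with columns count, animal_cat, animal_dog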
      September 14, 2020 4:15 PM IST
    0