QBoard » Artificial Intelligence & ML » AI and ML - Python » Label encoding across multiple columns in scikit-learn

Label encoding across multiple columns in scikit-learn

  • I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the DataFrame has many (50+) columns, I want to avoid creating a LabelEncoder object for each column; I'd rather have one big LabelEncoder object that works across all my columns of data.

    Throwing the entire DataFrame into LabelEncoder produces the error below. Please bear in mind that I'm using dummy data here; in actuality I'm dealing with about 50 columns of string-labeled data, so I need a solution that doesn't reference any columns by name.
    import pandas
    from sklearn import preprocessing 
    
    df = pandas.DataFrame({
        'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
        'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
        'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                     'New_York']
    })
    
    le = preprocessing.LabelEncoder()
    
    le.fit(df)
    


    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/preprocessing/label.py", line 103, in fit
        y = column_or_1d(y, warn=True)
      File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py", line 306, in column_or_1d
        raise ValueError("bad input shape {0}".format(shape))
    ValueError: bad input shape (6, 3)

    Any thoughts on how to get around this problem?

      December 11, 2020 2:00 PM IST
    0
  • Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer and sklearn.preprocessing.OneHotEncoder:

    If you only have categorical variables, OneHotEncoder directly:

    from sklearn.preprocessing import OneHotEncoder
    
    OneHotEncoder(handle_unknown='ignore').fit_transform(df)
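
    For instance, on the question's sample data (a sketch; the DataFrame is copied from the original post):

```python
# OneHotEncoder handles all string columns at once and learns
# per-column categories.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(df)  # sparse matrix, one column per category
# 3 pet + 4 owner + 2 location categories -> 9 one-hot columns in total
```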

    If you have heterogeneously typed features:

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import RobustScaler
    from sklearn.preprocessing import OneHotEncoder
    
    categorical_columns = ['pets', 'owner', 'location']
    numerical_columns = ['age', 'weight', 'height']
    column_trans = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), categorical_columns),
        (RobustScaler(), numerical_columns),
    )
    column_trans.fit_transform(df)

    More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
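
    The sample DataFrame in the question has no numeric columns, so a runnable sketch of the heterogeneous case needs some made-up age/weight/height values (the numbers below are purely illustrative):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
    # hypothetical numeric columns, added only for illustration
    'age': [3, 5, 2, 8, 4, 6],
    'weight': [4.0, 20.5, 3.8, 30.2, 18.9, 22.1],
    'height': [25.0, 60.0, 23.0, 70.0, 58.0, 61.0],
})

column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), ['pets', 'owner', 'location']),
    (RobustScaler(), ['age', 'weight', 'height']),
)
out = column_trans.fit_transform(df)  # 9 one-hot columns + 3 scaled columns
```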
      December 12, 2020 5:28 PM IST
    0
  • You can easily do this, though:

    df.apply(LabelEncoder().fit_transform)

    In scikit-learn 0.20, the recommended way is

    OneHotEncoder().fit_transform(df)

    as the OneHotEncoder now supports string input. Applying OneHotEncoder only to certain columns is possible with the ColumnTransformer.
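
    If what you actually want is per-column integer codes (what LabelEncoder produces) rather than one-hot columns, scikit-learn also added OrdinalEncoder in 0.20; it fits all columns at once and supports inverse_transform. A minimal sketch on the question's sample data:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

enc = OrdinalEncoder()
codes = enc.fit_transform(df)            # one integer code per category, per column
restored = enc.inverse_transform(codes)  # round-trips back to the original strings
```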

    Since this answer is over a year old and has generated many upvotes (including a bounty), I should probably extend it further.

    For inverse_transform and transform, you have to do a bit of a hack.

    from collections import defaultdict
    d = defaultdict(LabelEncoder)

    With this, you now retain all the columns' LabelEncoders in a dictionary.

    # Encoding the variable
    fit = df.apply(lambda x: d[x.name].fit_transform(x))
    
    # Inverse the encoded
    fit.apply(lambda x: d[x.name].inverse_transform(x))
    
    # Using the dictionary to label future data
    df.apply(lambda x: d[x.name].transform(x))
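
    Put together as a self-contained round trip on the question's sample data, this looks like:

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# one LabelEncoder per column, created lazily by the defaultdict
d = defaultdict(LabelEncoder)

fit = df.apply(lambda x: d[x.name].fit_transform(x))            # encode
restored = fit.apply(lambda x: d[x.name].inverse_transform(x))  # decode
```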

    Using Neuraxle's FlattenForEach step, it's also possible to use the same LabelEncoder on all the flattened data at once:

    FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df)

    To use separate LabelEncoders for different columns, or if only some of your columns need to be label-encoded and not others, a ColumnTransformer is a solution that allows more control over your column selection and your LabelEncoder instances.

    This post was edited by Raji Reddy A at December 12, 2020 5:56 PM IST
      December 12, 2020 5:56 PM IST
    0
  • We don't need a LabelEncoder.
    You can convert the columns to categoricals and then get their codes. I used a dictionary comprehension below to apply this process to every column and wrap the result back into a dataframe of the same shape with identical indices and column names.

    >>> pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
       location  owner  pets
    0         1      1     0
    1         0      2     1
    2         0      0     0
    3         1      1     2
    4         1      3     1
    5         0      2     1

    To create a mapping dictionary, you can just enumerate the categories using a dictionary comprehension:

    >>> {col: {n: cat for n, cat in enumerate(df[col].astype('category').cat.categories)} 
         for col in df}
    
    {'location': {0: 'New_York', 1: 'San_Diego'},
     'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
     'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
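
    The same mapping works in reverse with Series.map, so the codes can be decoded without scikit-learn at all (a sketch; codes and mapping are the two structures built above):

```python
import pandas as pd

df = pd.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'],
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'],
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego',
                 'San_Diego', 'New_York'],
})

# integer codes, one set of categories per column
codes = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df},
                     index=df.index)

# per-column {code: category} mapping
mapping = {col: dict(enumerate(df[col].astype('category').cat.categories))
           for col in df}

# decode each column by looking its codes up in the mapping
decoded = codes.apply(lambda s: s.map(mapping[s.name]))
```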
      December 12, 2020 5:59 PM IST
    0
  • If you have both numerical and categorical data in your DataFrame, you can use the following (here X is my DataFrame containing both categorical and numerical variables):

    from sklearn import preprocessing
    le = preprocessing.LabelEncoder()
    
    for i in range(0, X.shape[1]):
        if X.dtypes[i] == 'object':
            X[X.columns[i]] = le.fit_transform(X[X.columns[i]])

    Note: this technique is only suitable if you don't need to convert the labels back, since the single LabelEncoder is refitted on each column and only remembers the last one it saw.

    This post was edited by Pranav B at December 12, 2020 6:17 PM IST
      December 12, 2020 6:02 PM IST
    0