One Hot Encoding using sklearn

The dataset is the famous Titanic dataset.

import numpy as np
import pandas as pd

# Load the dataset
X = pd.read_csv('titanic_data.csv')
# Limit to categorical data
X = X.select_dtypes(include=[object])
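
To confirm which columns survive this filter (in the standard Kaggle Titanic CSV they are text columns such as Name, Sex, Ticket, Cabin and Embarked, though the exact set depends on the file), a quick check helps:

# Quick sanity check: list the remaining categorical columns and the shape
print(X.columns.tolist())
print(X.shape)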

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# TODO: Create a LabelEncoder object, which will turn all labels present
# in each feature into numbers. For example, the labels ['cat', 'dog', 'fish']
# might be transformed into [0, 1, 2]

le = LabelEncoder()

# TODO: For each feature in X, apply the LabelEncoder's fit_transform
# function, which will first learn the labels for the feature (fit)
# and then change the labels to numbers (transform).

for feature in X:
    X[feature] = le.fit_transform(X[feature])
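
As a quick sanity check (on a toy feature rather than the Titanic columns), this is roughly what fit_transform does: the distinct labels are sorted and mapped to integers.

# Toy example, separate from the Titanic data
demo = LabelEncoder()
print(demo.fit_transform(['cat', 'dog', 'fish', 'cat']))  # [0 1 2 0]
print(demo.classes_)                                      # ['cat' 'dog' 'fish']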

# TODO: Create a OneHotEncoder object, which will create a feature for each
# label present in the data. For example, for a feature 'animal' that had
# the labels ['cat','dog','fish'], the new features (instead of 'animal')
# could be ['animal_cat', 'animal_dog', 'animal_fish']

ohe = OneHotEncoder()

# TODO: Apply the OneHotEncoder's fit_transform function to all of X, which
# will first learn all the (now numerical) labels in the data (fit), and then
# change the data to one-hot encoded entries (transform).

onehotlabels = ohe.fit_transform(X)

"onehotlabels" is a <891x1726 sparse matrix of type '<type 'numpy.float64'>'
with 4455 stored elements in Compressed Sparse Row format>.

Part of its printed contents:

  (0, 1725)	1.0
  (0, 1574)	1.0
  (0, 1416)	1.0
  (0, 892)	1.0
  (0, 108)	1.0
  (1, 1723)	1.0
  (1, 1656)	1.0
    :	:
  (886, 1725)	1.0
  (886, 1574)	1.0
  (886, 994)	1.0
  (886, 892)	1.0
  (886, 548)	1.0
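
The 4455 stored elements are consistent with the input: each of the 891 rows contributes exactly one 1 per original categorical column, and there are 5 such columns (891 x 5 = 4455). If a regular array is easier to work with, the sparse result can be converted, though with many distinct labels the dense form can use a lot of memory. A small sketch:

# Convert the CSR matrix to a dense NumPy array (may be large in memory)
dense = onehotlabels.toarray()
print(dense.shape)  # (891, 1726)

Note that in recent versions of scikit-learn (0.20 and later), OneHotEncoder can encode string columns directly, so the LabelEncoder step above is no longer strictly necessary.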