One Hot Encoding using sklearn

Posted on February 17, 2017 by faye1010

This post one-hot encodes the categorical columns of the famous Titanic dataset using scikit-learn.

import numpy as np
import pandas as pd

# Load the dataset
X = pd.read_csv('titanic_data.csv')
# Limit to categorical data
X = X.select_dtypes(include=[object])

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# TODO: Create a LabelEncoder object, which will turn all labels present in
# each feature to numbers. For example, the labels ['cat', 'dog', 'fish']
# might be transformed into [0, 1, 2]

le = LabelEncoder()

# TODO: For each feature in X, apply the LabelEncoder's fit_transform
# function, which will first learn the labels for the feature (fit)
# and then change the labels to numbers (transform).

for feature in X:
    X[feature] = le.fit_transform(X[feature])

# TODO: Create a OneHotEncoder object, which will create a feature for each
# label present in the data. For example, for a feature 'animal' that had
# the labels ['cat','dog','fish'], the new features (instead of 'animal')
# could be ['animal_cat', 'animal_dog', 'animal_fish']

ohe = OneHotEncoder()

# TODO: Apply the OneHotEncoder's fit_transform function to all of X, which will
# first learn of all the (now numerical) labels in the data (fit), and then
# change the data to one-hot encoded entries (transform).

onehotlabels = ohe.fit_transform(X)
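
To see what the two steps do on something small, here is a minimal sketch built on the 'animal' example from the comments above. The toy DataFrame and its column names are made up for illustration and are not part of the Titanic data; the sketch assumes pandas and scikit-learn are installed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data (not from the Titanic dataset).
toy = pd.DataFrame({
    "animal": ["cat", "dog", "fish", "dog"],
    "color": ["black", "white", "black", "black"],
})

# Step 1: label-encode each column. LabelEncoder numbers the sorted labels,
# so cat/dog/fish become 0/1/2 and black/white become 0/1.
encoded = toy.copy()
for feature in encoded:
    encoded[feature] = LabelEncoder().fit_transform(encoded[feature])

# Step 2: one-hot encode the integer codes. Every distinct code in every
# column gets its own output column: 3 (animal) + 2 (color) = 5 columns.
toy_onehot = OneHotEncoder().fit_transform(encoded)

# The result is sparse; convert to a dense array only to inspect it.
dense = toy_onehot.toarray()
print(encoded.values.tolist())
print(dense.tolist())
```

Each row of `dense` contains exactly one 1.0 per original column. Note that in scikit-learn 0.20 and later, OneHotEncoder accepts string columns directly, so the separate LabelEncoder step is only needed on older versions.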

"onehotlabels" is a <891x1726 sparse matrix of type '<type 'numpy.float64'>'
with 4455 stored elements in Compressed Sparse Row format>.
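
These numbers can be sanity-checked. Each row gets exactly one 1.0 per categorical feature, so 4455 stored elements over 891 rows means 4455 / 891 = 5 categorical features, and the 1726 columns are the total number of distinct labels across those features. The sketch below checks both relationships on made-up integer codes; the DataFrame and its column names `f1`, `f2`, `f3` are hypothetical stand-ins for the label-encoded X.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical integer-coded features, standing in for the label-encoded X.
codes = pd.DataFrame({
    "f1": [0, 1, 2, 1, 0],
    "f2": [0, 0, 1, 1, 1],
    "f3": [3, 0, 1, 2, 3],
})

m = OneHotEncoder().fit_transform(codes)

n_rows, n_features = codes.shape
# Total number of distinct labels over all features: 3 + 2 + 4.
n_labels = int(codes.nunique().sum())

# One output column per distinct label, one stored 1.0 per (row, feature).
print(m.shape, (n_rows, n_labels))
print(m.nnz, n_rows * n_features)
```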

Part of its printed contents, one stored entry per line as (row, column) followed by the value:

  (0, 1725)	1.0
  (0, 1574)	1.0
  (0, 1416)	1.0
  (0, 892)	1.0
  (0, 108)	1.0
  (1, 1723)	1.0
  (1, 1656)	1.0

    :	:
  (886, 1725)	1.0
  (886, 1574)	1.0
  (886, 994)	1.0
  (886, 892)	1.0
  (886, 548)	1.0
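
Each line of that printout is a (row, column) coordinate followed by the stored value, so row 0 has a 1.0 in columns 1725, 1574, 1416, 892 and 108. As a rough sketch of how to pull the same information out programmatically, the example below builds a small CSR matrix by hand with SciPy; the matrix contents are invented for illustration and the variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x6 one-hot style matrix: two stored 1.0 values per row.
rows = np.array([0, 0, 1, 1, 2, 2])
cols = np.array([0, 3, 1, 5, 0, 4])
vals = np.ones(6)
small = csr_matrix((vals, (rows, cols)), shape=(3, 6))

# Coordinates of all stored entries, the same information the printout shows.
r, c = small.nonzero()
pairs = sorted(zip(r.tolist(), c.tolist()))

# Column indices that are set to 1.0 in row 0.
row0_cols = sorted(small[0].nonzero()[1].tolist())

# Dense view, practical only when the matrix is small enough to fit in memory.
dense = small.toarray()

print(pairs)
print(row0_cols)
print(dense)
```

Applied to the result above, `onehotlabels[0].nonzero()[1]` would give the column indices that are active for the first row.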
