The dataset is the famous Titanic dataset.
import numpy as np
import pandas as pd

# Load the dataset
X = pd.read_csv('titanic_data.csv')

# Limit to categorical data
X = X.select_dtypes(include=[object])

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# TODO: Create a LabelEncoder object, which will turn all labels present
# in each feature to numbers. For example, the labels ['cat', 'dog', 'fish']
# might be transformed into [0, 1, 2]
le = LabelEncoder()

# TODO: For each feature in X, apply the LabelEncoder's fit_transform
# function, which will first learn the labels for the feature (fit)
# and then change the labels to numbers (transform).
for feature in X:
    X[feature] = le.fit_transform(X[feature])

# TODO: Create a OneHotEncoder object, which will create a feature for each
# label present in the data. For example, for a feature 'animal' that had
# the labels ['cat', 'dog', 'fish'], the new features (instead of 'animal')
# could be ['animal_cat', 'animal_dog', 'animal_fish']
ohe = OneHotEncoder()

# TODO: Apply the OneHotEncoder's fit_transform function to all of X, which will
# first learn all of the (now numerical) labels in the data (fit), and then
# change the data to one-hot encoded entries (transform).
onehotlabels = ohe.fit_transform(X)
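To make the two steps concrete, here is a minimal sketch on a made-up single-column frame. The `toy` data and the `animal` column are assumptions for illustration only and are not part of the Titanic file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data, used only to illustrate the two encoding steps.
toy = pd.DataFrame({"animal": ["cat", "dog", "fish", "dog"]})

# Step 1: map each label to an integer code.
# le.classes_ keeps the labels in sorted order; each code indexes into it.
le = LabelEncoder()
codes = le.fit_transform(toy["animal"])

# Step 2: turn the integer codes into one column per distinct label.
# OneHotEncoder expects a 2-D input, hence the reshape.
ohe = OneHotEncoder()
onehot = ohe.fit_transform(codes.reshape(-1, 1))

# The result is sparse by default; convert to a dense array to inspect it.
dense = onehot.toarray()
print(list(le.classes_))
print(codes)
print(dense)
```

Each row of `dense` contains a single 1.0, in the column that corresponds to that row's label.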
"onehotlabels" is a <891x1726 sparse matrix of type '<type 'numpy.float64'>'
with 4455 stored elements in Compressed Sparse Row format>.
Part of its printed contents:
(0, 1725)    1.0
(0, 1574)    1.0
(0, 1416)    1.0
(0, 892)     1.0
(0, 108)     1.0
(1, 1723)    1.0
(1, 1656)    1.0
:    :
(886, 1725)  1.0
(886, 1574)  1.0
(886, 994)   1.0
(886, 892)   1.0
(886, 548)   1.0
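Each printed line is a (row, column) coordinate followed by the value stored there. Only the nonzero entries are listed. One-hot encoding puts exactly one 1.0 per row for each encoded feature, which is consistent with the counts above: 891 rows times 5 entries per row gives 4455 stored elements. The sketch below reproduces the same pattern on a small made-up frame. The `toy` data and its column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data with two categorical features.
toy = pd.DataFrame({
    "animal": ["cat", "dog", "fish", "dog"],
    "color": ["black", "white", "black", "black"],
})

# Integer-encode each column, then one-hot encode the whole frame.
encoded = toy.copy()
for col in encoded.columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])
onehot = OneHotEncoder().fit_transform(encoded)

n_rows, n_features = toy.shape
n_labels = sum(toy[col].nunique() for col in toy.columns)

# Shape is (rows, total distinct labels); nnz is one entry per row per feature.
print(onehot.shape, (n_rows, n_labels))
print(onehot.nnz, n_rows * n_features)

# The (row, column) coordinates of the stored 1.0 values.
rows, cols = onehot.nonzero()
coords = sorted(zip(rows.tolist(), cols.tolist()))
print(coords)
```

Calling `onehot.toarray()` gives the equivalent dense array, which is easier to read for small data but wasteful for a matrix as wide as the Titanic one.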