Factors and One Hot Encoding#

Often we want to use strings in a numerical way beyond simply counting them. In data science we frequently need to include a category in our models, and to do so we can apply a wide variety of transformations to the categorical variables.

One Hot Encoding#

Perhaps the easiest to understand is the one hot encoder: essentially, we create a new column for every category in the categorical variable. In pandas, this action is performed by the get_dummies command. Let’s see it in action.

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')

df.head()
SepalLength SepalWidth PedalLength PedalWidth Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
pa.get_dummies(df.Class).head()
Iris-setosa Iris-versicolor Iris-virginica
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
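
As a quick sanity check (a small aside, reusing the df loaded above), summing each indicator column counts how many observations fall in each class.

pa.get_dummies(df.Class).sum()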

We see that each column was given the name of a category, with a 1 indicating membership and a 0 otherwise. We can merge this back into our original dataset with a concat.

df1 = pa.concat([df,pa.get_dummies(df.Class)],axis = 1)

df1 = df1.drop('Class', axis = 1)

df1
SepalLength SepalWidth PedalLength PedalWidth Iris-setosa Iris-versicolor Iris-virginica
0 5.1 3.5 1.4 0.2 1 0 0
1 4.9 3.0 1.4 0.2 1 0 0
2 4.7 3.2 1.3 0.2 1 0 0
3 4.6 3.1 1.5 0.2 1 0 0
4 5.0 3.6 1.4 0.2 1 0 0
... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 0 0 1
146 6.3 2.5 5.0 1.9 0 0 1
147 6.5 3.0 5.2 2.0 0 0 1
148 6.2 3.4 5.4 2.3 0 0 1
149 5.9 3.0 5.1 1.8 0 0 1

150 rows × 7 columns

This might have been easier by simply passing the entire dataframe to the get_dummies command, but that gives less flexibility for changing other variables independently.

pa.get_dummies(df)
SepalLength SepalWidth PedalLength PedalWidth Class_Iris-setosa Class_Iris-versicolor Class_Iris-virginica
0 5.1 3.5 1.4 0.2 1 0 0
1 4.9 3.0 1.4 0.2 1 0 0
2 4.7 3.2 1.3 0.2 1 0 0
3 4.6 3.1 1.5 0.2 1 0 0
4 5.0 3.6 1.4 0.2 1 0 0
... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 0 0 1
146 6.3 2.5 5.0 1.9 0 0 1
147 6.5 3.0 5.2 2.0 0 0 1
148 6.2 3.4 5.4 2.3 0 0 1
149 5.9 3.0 5.1 1.8 0 0 1

150 rows × 7 columns
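
A related option: the indicator columns for a single variable always sum to one, so one of them is redundant in many models. get_dummies has a drop_first argument that drops the first category for us; a quick sketch on the same dataframe:

pa.get_dummies(df, columns = ['Class'], drop_first = True).head()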

We recently ran across some data where each entry was a list (actually we had to do a bit of cleaning to get it into that form, but that is included below!). Perhaps we want to get indicators for that as well.

from bs4 import BeautifulSoup
import requests
import re

# Grab the Wikipedia page and pull out every table tagged with the "wikitable" class.
r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')
html_contents = r.text
html_soup = BeautifulSoup(html_contents,"lxml")
tables = html_soup.find_all('table',class_="wikitable")

# Read the first table into a dataframe and flatten its multi-level column header.
df1 = pa.read_html(str(tables))[0]
df1.columns = df1.columns.droplevel(0).droplevel(0)

# Clean the last column: remove footnote brackets, strip everything that is not a letter,
# then split the remaining run of country names apart on the capital letters.
newcol = df1.iloc[:,-1]
newcol = newcol.apply(lambda x: re.sub(r"\[(.+?)\]","",x))
newcol = newcol.apply(lambda x: re.sub(r"[^A-Za-z]","",x))
newcol = newcol.apply(lambda x: re.findall(r"[A-Z][a-z]*",x))

newcol
0         [Nepal, China]
1      [Pakistan, China]
2         [Nepal, India]
3         [Nepal, China]
4         [Nepal, China]
             ...        
115              [China]
116       [Nepal, China]
117      [Bhutan, China]
118       [India, China]
119           [Pakistan]
Name: Country (disputed claims in italics), Length: 120, dtype: object

First I’ll convert the Series of lists into a dataframe with multiple columns.

newcol.apply(pa.Series)
0 1 2
0 Nepal China NaN
1 Pakistan China NaN
2 Nepal India NaN
3 Nepal China NaN
4 Nepal China NaN
... ... ... ...
115 China NaN NaN
116 Nepal China NaN
117 Bhutan China NaN
118 India China NaN
119 Pakistan NaN NaN

120 rows × 3 columns

Next I’ll use the stack command to break each row apart into its individual pieces.

newcol.apply(pa.Series).stack()
0    0       Nepal
     1       China
1    0    Pakistan
     1       China
2    0       Nepal
            ...   
117  0      Bhutan
     1       China
118  0       India
     1       China
119  0    Pakistan
Length: 161, dtype: object

Now I can get the dummies!

pa.get_dummies(newcol.apply(pa.Series).stack())
Afghanistan Bhutan China India Kyrgyzstan Nepal Pakistan Tajikistan
0 0 0 0 0 0 0 1 0 0
1 0 0 1 0 0 0 0 0
1 0 0 0 0 0 0 0 1 0
1 0 0 1 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ...
117 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0
118 0 0 0 0 1 0 0 0 0
1 0 0 1 0 0 0 0 0
119 0 0 0 0 0 0 0 1 0

161 rows × 8 columns

Lastly we bring it all back together using the groupby command on the outer level of the index (level = 0) and summing the indicators for each mountain.

pa.get_dummies(newcol.apply(pa.Series).stack()).groupby(level = 0).sum()
Afghanistan Bhutan China India Kyrgyzstan Nepal Pakistan Tajikistan
0 0 0 1 0 0 1 0 0
1 0 0 1 0 0 0 1 0
2 0 0 0 1 0 1 0 0
3 0 0 1 0 0 1 0 0
4 0 0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ...
115 0 0 1 0 0 0 0 0
116 0 0 1 0 0 1 0 0
117 0 1 1 0 0 0 0 0
118 0 0 1 1 0 0 0 0
119 0 0 0 0 0 0 1 0

120 rows × 8 columns
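
For what it’s worth, newer versions of pandas (0.25 and later) offer explode, which lets us skip the apply/stack dance; a sketch using the same newcol:

pa.get_dummies(newcol.explode()).groupby(level = 0).sum()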

Factors#

One hot encoding works great for data that is nominal. What then should we do for data that has an order (ordinal)? If the data has a natural order AND the steps are equal, it may just be best to transform with a dictionary. Let’s see that in action.
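
Here is a tiny made-up illustration of the idea first (hypothetical 'low'/'medium'/'high' levels, not from the data below); the real data follows.

pa.Series(['low', 'high', 'medium', 'low']).map({'low': 0, 'medium': 1, 'high': 2})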

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')

df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no

Education is clearly well ordered. We’ll assume that each step is of equal size. (This is probably not a good assumption: in the credit world of banking, going from no high school degree to a GED/high school diploma is not the same size step as going from there to a college degree.)

edlevels = sorted(df.education.unique())

edlevels
['primary', 'secondary', 'tertiary', 'unknown']

I’ve gathered the levels and sorted them alphabetically. Now I’ll build a dictionary. Apart from the ordering, there is nothing special about the values I have assigned; many other choices would be equivalent!

eddict = {}

for i in range(4):
  eddict[edlevels[i]] = i

eddict
{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 3}
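
The same dictionary can be built more compactly with enumerate, if you prefer a one-liner; this sketch produces the identical mapping.

eddict = {level: i for i, level in enumerate(edlevels)}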

I probably shouldn’t assign the ‘unknown’ education level a value beyond college!

eddict['unknown'] = 0

eddict
{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 0}

How many were unknown, anyway?

df.groupby('education').age.agg('count')
education
primary       678
secondary    2306
tertiary     1350
unknown       187
Name: age, dtype: int64
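
An equivalent, slightly shorter way to get these tallies is value_counts (which sorts by frequency rather than alphabetically).

df.education.value_counts()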

Now I’ll apply the dictionary with map.

df.education.map(eddict)
0       0
1       1
2       2
3       2
4       1
       ..
4516    1
4517    2
4518    1
4519    1
4520    2
Name: education, Length: 4521, dtype: int64

Putting it back into the dataframe is a snap.

df.education = df.education.map(eddict)

df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married 0 no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married 1 no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single 2 no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married 2 no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married 1 no 0 yes no unknown 5 may 226 1 -1 0 unknown no
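
As an aside, pandas also has a built-in analogue of a factor: the ordered Categorical dtype. A sketch, re-reading the csv into a fresh frame since education above has already been mapped to integers; 'unknown' is left out of the ordering, so it gets the missing code -1.

bank = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')

# Declare the ordering explicitly, then pull out the integer codes.
edtype = pa.CategoricalDtype(categories = ['primary', 'secondary', 'tertiary'], ordered = True)
bank['education'] = bank.education.astype(edtype)

bank.education.cat.codes.head()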

Your Turn#

Use the data file found here on Airbnb in NYC. Convert the neighbourhood_group and the room_type into appropriate numerical data. Justify your decisions on how you made the transformations.

df = pa.read_csv("https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv")
df.head()
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0