Factors and One Hot Encoding#
Often we want to use strings numerically, beyond just counting them. In data science we frequently need to include a categorical variable in our models, and there is a wide variety of transformations for turning categories into numbers.
One Hot Encoding#
Perhaps the easiest to understand is the one hot encoder: we create a new column for every category in the categorical variable. In pandas, this action is performed by the get_dummies
command. Let’s see it in action.
import pandas as pa
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')
df.head()
| | SepalLength | SepalWidth | PedalLength | PedalWidth | Class |
|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
pa.get_dummies(df.Class).head()
| | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|
0 | 1 | 0 | 0 |
1 | 1 | 0 | 0 |
2 | 1 | 0 | 0 |
3 | 1 | 0 | 0 |
4 | 1 | 0 | 0 |
We see that each column was given the name of a category, with a 1 indicating membership and a 0 otherwise. We can join this back to our original dataset with concat
.
df1 = pa.concat([df,pa.get_dummies(df.Class)],axis = 1)
df1 = df1.drop('Class', axis = 1)
df1
| | SepalLength | SepalWidth | PedalLength | PedalWidth | Iris-setosa | Iris-versicolor | Iris-virginica |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 1 | 0 | 0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 0 | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 1 | 0 | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 1 | 0 | 0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | 0 | 0 | 1 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | 0 | 0 | 1 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | 0 | 0 | 1 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | 0 | 0 | 1 |
149 | 5.9 | 3.0 | 5.1 | 1.8 | 0 | 0 | 1 |
150 rows × 7 columns
This might have been easier by passing the entire dataframe to the get_dummies command, but that gives less flexibility for transforming other variables independently.
pa.get_dummies(df)
| | SepalLength | SepalWidth | PedalLength | PedalWidth | Class_Iris-setosa | Class_Iris-versicolor | Class_Iris-virginica |
|---|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 1 | 0 | 0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 1 | 0 | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 1 | 0 | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 1 | 0 | 0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | 0 | 0 | 1 |
146 | 6.3 | 2.5 | 5.0 | 1.9 | 0 | 0 | 1 |
147 | 6.5 | 3.0 | 5.2 | 2.0 | 0 | 0 | 1 |
148 | 6.2 | 3.4 | 5.4 | 2.3 | 0 | 0 | 1 |
149 | 5.9 | 3.0 | 5.1 | 1.8 | 0 | 0 | 1 |
150 rows × 7 columns
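If we only want certain columns encoded, get_dummies also takes a columns argument, and drop_first = True drops one indicator per encoded variable, which avoids the redundant (perfectly collinear) column in regression models. A quick sketch of both options:

# Encode only Class, dropping the first level of each encoded variable.
pa.get_dummies(df, columns = ['Class'], drop_first = True)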
We ran across some data recently where each entry was a list (actually we had to do a bit of cleaning to get it into that form, but that work is included below!). Perhaps we want indicators for that too.
from bs4 import BeautifulSoup
import requests
import re

# Grab the Wikipedia page and pull out every table styled as a 'wikitable'.
r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')
html_contents = r.text
html_soup = BeautifulSoup(html_contents,"lxml")
tables = html_soup.find_all('table',class_="wikitable")

# The first table is the one we want; drop the extra levels of its multi-level header.
df1 = pa.read_html(str(tables))[0]
df1.columns = df1.columns.droplevel(0).droplevel(0)

# Clean the last column (the countries): strip footnote markers like [12],
# drop every character that isn't a letter (which fuses the names together),
# then split the fused names back apart at each capital letter.
newcol = df1.iloc[:,-1]
newcol = newcol.apply(lambda x: re.sub(r"\[(.+?)\]","",x))
newcol = newcol.apply(lambda x: re.sub(r"[^A-Za-z]","",x))
newcol = newcol.apply(lambda x: re.findall(r"[A-Z][a-z]*",x))
newcol
0 [Nepal, China]
1 [Pakistan, China]
2 [Nepal, India]
3 [Nepal, China]
4 [Nepal, China]
...
115 [China]
116 [Nepal, China]
117 [Bhutan, China]
118 [India, China]
119 [Pakistan]
Name: Country (disputed claims in italics), Length: 120, dtype: object
First I’ll convert the Series of lists into a dataframe with multiple columns.
newcol.apply(pa.Series)
| | 0 | 1 | 2 |
|---|---|---|---|
0 | Nepal | China | NaN |
1 | Pakistan | China | NaN |
2 | Nepal | India | NaN |
3 | Nepal | China | NaN |
4 | Nepal | China | NaN |
... | ... | ... | ... |
115 | China | NaN | NaN |
116 | Nepal | China | NaN |
117 | Bhutan | China | NaN |
118 | India | China | NaN |
119 | Pakistan | NaN | NaN |
120 rows × 3 columns
Next I’ll use the stack
command to break each row apart into its individual pieces.
newcol.apply(pa.Series).stack()
0 0 Nepal
1 China
1 0 Pakistan
1 China
2 0 Nepal
...
117 0 Bhutan
1 China
118 0 India
1 China
119 0 Pakistan
Length: 161, dtype: object
Now I can get the dummies!
pa.get_dummies(newcol.apply(pa.Series).stack())
| | | Afghanistan | Bhutan | China | India | Kyrgyzstan | Nepal | Pakistan | Tajikistan |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 117 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 118 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 119 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
161 rows × 8 columns
Lastly we bring it all back together by grouping on the outer index with groupby using level = 0
, summing the indicators within each mountain.
pa.get_dummies(newcol.apply(pa.Series).stack()).groupby(level = 0).sum()
| | Afghanistan | Bhutan | China | India | Kyrgyzstan | Nepal | Pakistan | Tajikistan |
|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
115 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
116 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
117 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
118 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
119 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
120 rows × 8 columns
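As an aside, because each entry of newcol is a list of strings, the string accessor offers a shortcut that skips the stack and groupby dance: join each list into a delimited string and let str.get_dummies split it back apart. A sketch of that alternative:

# Build 'Nepal|China' style strings, then get one indicator column per country.
newcol.str.join('|').str.get_dummies(sep = '|')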
Factors#
One hot encoding works great for nominal data. What should we do, then, for data that has an order (ordinal)? If the data has a natural order AND the steps between levels are equal, it may be simplest to transform it with a dictionary. Let’s see that in action.
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')
df.head()
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
Education is clearly well ordered. We’ll assume that each step is of equal importance. (This is probably not a good assumption: no high school degree vs. GED/high school diploma vs. college graduate are not equal steps in the credit world of banking.)
edlevels = sorted(df.education.unique())
edlevels
['primary', 'secondary', 'tertiary', 'unknown']
I’ve gathered the levels and sorted them alphabetically. Now I’ll build a dictionary. Besides preserving the ordering, there is nothing special about the values I have assigned; many other choices would be equivalent!
eddict = {}
for i in range(4):
    eddict[edlevels[i]] = i
eddict
{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 3}
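For the record, a dictionary comprehension with enumerate builds the same mapping in one line:

# Pair each sorted level with its position in the list.
eddict = {level: i for i, level in enumerate(edlevels)}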
I probably shouldn’t assign an ‘unknown’ education level a value beyond college!
eddict['unknown'] = 0
eddict
{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 0}
How many unknowns were there anyway?
df.groupby('education').age.agg('count')
education
primary 678
secondary 2306
tertiary 1350
unknown 187
Name: age, dtype: int64
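As an aside, value_counts gives the same tallies without needing to pick a throwaway column to count:

# Tally the rows in each education level directly.
df.education.value_counts()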
Now I’ll apply the dictionary with map
.
df.education.map(eddict)
0 0
1 1
2 2
3 2
4 1
..
4516 1
4517 2
4518 1
4519 1
4520 2
Name: education, Length: 4521, dtype: int64
Putting it back into the dataframe is a snap.
df.education = df.education.map(eddict)
df.head()
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | 0 | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
1 | 33 | services | married | 1 | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
2 | 35 | management | single | 2 | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
3 | 30 | management | married | 2 | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
4 | 59 | blue-collar | married | 1 | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
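It is worth noting that pandas also ships an ordered categorical dtype that gets us the same codes without a hand-built dictionary. A minimal sketch, assuming education still holds its original strings (i.e., run this before the map above):

# Ordered categorical: codes follow the declared category order;
# values not listed here (like 'unknown') get code -1.
levels = ['primary', 'secondary', 'tertiary']
df.education.astype(pa.CategoricalDtype(categories = levels, ordered = True)).cat.codes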
Your Turn#
Use the data file found here on Airbnb in NYC. Convert the neighbourhood_group and the room_type into appropriate numerical data. Justify your decisions on how you made the transformations.
df = pa.read_csv("https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv")
df.head()
| | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |
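To get started, peek at the levels of each variable; how many there are and whether they carry a natural order should drive your choice between one hot encoding and a dictionary mapping:

# Inspect the levels before choosing an encoding.
print(df.neighbourhood_group.unique())
print(df.room_type.unique())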