Factors and One Hot Encoding#

Often we want to use strings in a numerical way beyond simply counting them. In data science we frequently need to include a category in our models, and to do so we can apply a wide variety of transformations to the categorical variables.

One Hot Encoding#

Perhaps the easiest to understand is the one hot encoder: essentially, we create a new column for every category in the categorical variable. In pandas, this action is performed by the get_dummies command. Let’s see it in action.

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')

df.head()
SepalLength SepalWidth PedalLength PedalWidth Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
pa.get_dummies(df.Class).head()
Iris-setosa Iris-versicolor Iris-virginica
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
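
As a quick sanity check (a small aside, reusing the df loaded above), summing each indicator column counts how many observations fall in each class.

pa.get_dummies(df.Class).sum()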

We see that each column was given the name of a category, with a 1 indicating membership and a 0 otherwise. We can merge this back into our original dataset with a concat.

df1 = pa.concat([df,pa.get_dummies(df.Class)],axis = 1)

df1 = df1.drop('Class', axis = 1)

df1
SepalLength SepalWidth PedalLength PedalWidth Iris-setosa Iris-versicolor Iris-virginica
0 5.1 3.5 1.4 0.2 1 0 0
1 4.9 3.0 1.4 0.2 1 0 0
2 4.7 3.2 1.3 0.2 1 0 0
3 4.6 3.1 1.5 0.2 1 0 0
4 5.0 3.6 1.4 0.2 1 0 0
... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 0 0 1
146 6.3 2.5 5.0 1.9 0 0 1
147 6.5 3.0 5.2 2.0 0 0 1
148 6.2 3.4 5.4 2.3 0 0 1
149 5.9 3.0 5.1 1.8 0 0 1

150 rows × 7 columns

This might have been easier by simply passing the entire dataframe to the get_dummies command, but that gives less flexibility for changing other variables independently.

pa.get_dummies(df)
SepalLength SepalWidth PedalLength PedalWidth Class_Iris-setosa Class_Iris-versicolor Class_Iris-virginica
0 5.1 3.5 1.4 0.2 1 0 0
1 4.9 3.0 1.4 0.2 1 0 0
2 4.7 3.2 1.3 0.2 1 0 0
3 4.6 3.1 1.5 0.2 1 0 0
4 5.0 3.6 1.4 0.2 1 0 0
... ... ... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 0 0 1
146 6.3 2.5 5.0 1.9 0 0 1
147 6.5 3.0 5.2 2.0 0 0 1
148 6.2 3.4 5.4 2.3 0 0 1
149 5.9 3.0 5.1 1.8 0 0 1

150 rows × 7 columns
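
A related option: the indicator columns for a single variable always sum to one, so one of them is redundant in many models. get_dummies has a drop_first argument that drops the first category for us; a quick sketch on the same dataframe:

pa.get_dummies(df, columns = ['Class'], drop_first = True).head()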

We recently ran across some data where each entry was a list (actually we had to do a bit of cleaning to get it into that form, but that is included below!). Perhaps we want to get indicators for that as well.

from bs4 import BeautifulSoup
import requests
import re

# Grab the Wikipedia page and pull out every table tagged with the "wikitable" class.
r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')
html_contents = r.text
html_soup = BeautifulSoup(html_contents,"lxml")
tables = html_soup.find_all('table',class_="wikitable")

# Read the first table into a dataframe and flatten its multi-level column header.
df1 = pa.read_html(str(tables))[0]
df1.columns = df1.columns.droplevel(0).droplevel(0)

# Clean the last column: remove footnote brackets, strip everything that is not a letter,
# then split the remaining run of country names apart on the capital letters.
newcol = df1.iloc[:,-1]
newcol = newcol.apply(lambda x: re.sub(r"\[(.+?)\]","",x))
newcol = newcol.apply(lambda x: re.sub(r"[^A-Za-z]","",x))
newcol = newcol.apply(lambda x: re.findall(r"[A-Z][a-z]*",x))

newcol
0         [Nepal, China]
1      [Pakistan, China]
2         [Nepal, India]
3         [Nepal, China]
4         [Nepal, China]
             ...        
115              [China]
116       [Nepal, China]
117      [Bhutan, China]
118       [India, China]
119           [Pakistan]
Name: Country (disputed claims in italics), Length: 120, dtype: object

First I’ll convert the Series of lists into a dataframe with multiple columns.

newcol.apply(pa.Series)
0 1 2
0 Nepal China NaN
1 Pakistan China NaN
2 Nepal India NaN
3 Nepal China NaN
4 Nepal China NaN
... ... ... ...
115 China NaN NaN
116 Nepal China NaN
117 Bhutan China NaN
118 India China NaN
119 Pakistan NaN NaN

120 rows × 3 columns

Next I’ll use the stack command to break each row apart into its individual pieces.

newcol.apply(pa.Series).stack()
0    0       Nepal
     1       China
1    0    Pakistan
     1       China
2    0       Nepal
            ...   
117  0      Bhutan
     1       China
118  0       India
     1       China
119  0    Pakistan
Length: 161, dtype: object

Now I can get the dummies!

pa.get_dummies(newcol.apply(pa.Series).stack())
Afghanistan Bhutan China India Kyrgyzstan Nepal Pakistan Tajikistan
0 0 0 0 0 0 0 1 0 0
1 0 0 1 0 0 0 0 0
1 0 0 0 0 0 0 0 1 0
1 0 0 1 0 0 0 0 0
2 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ...
117 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0
118 0 0 0 0 1 0 0 0 0
1 0 0 1 0 0 0 0 0
119 0 0 0 0 0 0 0 1 0

161 rows × 8 columns

Lastly we bring it all back together using the groupby command on the outer level of the index (level = 0) and summing the indicators for each mountain.

pa.get_dummies(newcol.apply(pa.Series).stack()).groupby(level = 0).sum()
Afghanistan Bhutan China India Kyrgyzstan Nepal Pakistan Tajikistan
0 0 0 1 0 0 1 0 0
1 0 0 1 0 0 0 1 0
2 0 0 0 1 0 1 0 0
3 0 0 1 0 0 1 0 0
4 0 0 1 0 0 1 0 0
... ... ... ... ... ... ... ... ...
115 0 0 1 0 0 0 0 0
116 0 0 1 0 0 1 0 0
117 0 1 1 0 0 0 0 0
118 0 0 1 1 0 0 0 0
119 0 0 0 0 0 0 1 0

120 rows × 8 columns
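
For what it’s worth, newer versions of pandas (0.25 and later) offer explode, which lets us skip the apply/stack dance; a sketch using the same newcol:

pa.get_dummies(newcol.explode()).groupby(level = 0).sum()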

Factors#

One hot encoding works great for data that is nominal. What then should we do for data that has an order (ordinal)? If the data has a natural order AND the steps are equal, it may just be best to transform with a dictionary. Let’s see that in action.
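
Here is a tiny made-up illustration of the idea first (hypothetical 'low'/'medium'/'high' levels, not from the data below); the real data follows.

pa.Series(['low', 'high', 'medium', 'low']).map({'low': 0, 'medium': 1, 'high': 2})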

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')

df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no

Education is clearly well ordered. We’ll assume that each step is of equal size. (This is probably not a good assumption: in the credit world of banking, going from no high school degree to a GED/high school diploma is not the same size step as going from there to a college degree.)

edlevels = sorted(df.education.unique())

edlevels
['primary', 'secondary', 'tertiary', 'unknown']

I’ve gathered the levels and sorted them alphabetically. Now I’ll build a dictionary. Apart from the ordering, there is nothing special about the values I have assigned; many other choices would be equivalent!

eddict = {}

for i in range(4):
  eddict[edlevels[i]] = i

eddict
{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 3}
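
The same dictionary can be built more compactly with enumerate, if you prefer a one-liner; this sketch produces the identical mapping.

eddict = {level: i for i, level in enumerate(edlevels)}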

I probably shouldn’t assign the ‘unknown’ education level a value beyond college!

eddict['unknown'] = 0

eddict
{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 0}

How many were unknown, anyway?

df.groupby('education').age.agg('count')
education
primary       678
secondary    2306
tertiary     1350
unknown       187
Name: age, dtype: int64
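
An equivalent, slightly shorter way to get these tallies is value_counts (which sorts by frequency rather than alphabetically).

df.education.value_counts()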

Now I’ll apply the dictionary with map.

df.education.map(eddict)
0       0
1       1
2       2
3       2
4       1
       ..
4516    1
4517    2
4518    1
4519    1
4520    2
Name: education, Length: 4521, dtype: int64

Putting it back into the dataframe is a snap.

df.education = df.education.map(eddict)

df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married 0 no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married 1 no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single 2 no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married 2 no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married 1 no 0 yes no unknown 5 may 226 1 -1 0 unknown no
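
As an aside, pandas also has a built-in analogue of a factor: the ordered Categorical dtype. A sketch, re-reading the csv into a fresh frame since education above has already been mapped to integers; 'unknown' is left out of the ordering, so it gets the missing code -1.

bank = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')

# Declare the ordering explicitly, then pull out the integer codes.
edtype = pa.CategoricalDtype(categories = ['primary', 'secondary', 'tertiary'], ordered = True)
bank['education'] = bank.education.astype(edtype)

bank.education.cat.codes.head()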

Your Turn#

Use the data file found here on Airbnb in NYC. Convert the neighbourhood_group and the room_type into appropriate numerical data. Justify your decisions on how you made the transformations.

df = pa.read_csv("https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv")
df.head()
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0