{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Untitled87.ipynb",
"provenance": [],
"authorship_tag": "ABX9TyMYr/3v8bwA55sBwwVEC4qh",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Factors and One Hot Encoding"
],
"metadata": {
"id": "QwffnQ_A7O3u"
}
},
{
"cell_type": "markdown",
"source": [
"Often we want to use strings in a numerical way besides just counting. In data science we often want to include a category in our models. To deal with this we can do a wide variety of transformations on the categorical variables."
],
"metadata": {
"id": "t1DKbTmM7YmX"
}
},
{
"cell_type": "markdown",
"source": [
"## One Hot Encoding"
],
"metadata": {
"id": "tGic7KKJ70rE"
}
},
{
"cell_type": "markdown",
"source": [
"Perhaps the easiest to understand is the one **one hot encoder** essentially we give a new column for every category in the categorical variable. In pandas, this action is preformed by the `get_dummies` command. Let's see it in action."
],
"metadata": {
"id": "U-6ZSxlI7582"
}
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "7vhr3vi17N1F",
"outputId": "efcfa73d-4c90-4a84-f83d-cd19945f2436"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
"
\n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" SepalLength \n",
" SepalWidth \n",
" PedalLength \n",
" PedalWidth \n",
" Class \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 5.1 \n",
" 3.5 \n",
" 1.4 \n",
" 0.2 \n",
" Iris-setosa \n",
" \n",
" \n",
" 1 \n",
" 4.9 \n",
" 3.0 \n",
" 1.4 \n",
" 0.2 \n",
" Iris-setosa \n",
" \n",
" \n",
" 2 \n",
" 4.7 \n",
" 3.2 \n",
" 1.3 \n",
" 0.2 \n",
" Iris-setosa \n",
" \n",
" \n",
" 3 \n",
" 4.6 \n",
" 3.1 \n",
" 1.5 \n",
" 0.2 \n",
" Iris-setosa \n",
" \n",
" \n",
" 4 \n",
" 5.0 \n",
" 3.6 \n",
" 1.4 \n",
" 0.2 \n",
" Iris-setosa \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" SepalLength SepalWidth PedalLength PedalWidth Class\n",
"0 5.1 3.5 1.4 0.2 Iris-setosa\n",
"1 4.9 3.0 1.4 0.2 Iris-setosa\n",
"2 4.7 3.2 1.3 0.2 Iris-setosa\n",
"3 4.6 3.1 1.5 0.2 Iris-setosa\n",
"4 5.0 3.6 1.4 0.2 Iris-setosa"
]
},
"metadata": {},
"execution_count": 1
}
],
"source": [
"import pandas as pa\n",
"\n",
"df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')\n",
"\n",
"df.head()"
]
},
{
"cell_type": "code",
"source": [
"pa.get_dummies(df.Class).head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "GSiO-3Ex8cMq",
"outputId": "f894cb21-aa18-4c71-f481-78cd85a51d50"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Iris-setosa \n",
" Iris-versicolor \n",
" Iris-virginica \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 3 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 4 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" Iris-setosa Iris-versicolor Iris-virginica\n",
"0 1 0 0\n",
"1 1 0 0\n",
"2 1 0 0\n",
"3 1 0 0\n",
"4 1 0 0"
]
},
"metadata": {},
"execution_count": 2
}
]
},
{
"cell_type": "markdown",
"source": [
"We see that each column was given the name of the category and a 1 indeicating membership and a 0 if not a member. We can return this into our orginal dataset through a `concat`."
],
"metadata": {
"id": "--Gpysyb80lC"
}
},
{
"cell_type": "code",
"source": [
"df1 = pa.concat([df,pa.get_dummies(df.Class)],axis = 1)\n",
"\n",
"df1 = df1.drop('Class', axis = 1)\n",
"\n",
"df1"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "gzIw4eKFDWix",
"outputId": "690b13a1-c3cd-4430-8af6-c72ee47eabe4"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" SepalLength \n",
" SepalWidth \n",
" PedalLength \n",
" PedalWidth \n",
" Iris-setosa \n",
" Iris-versicolor \n",
" Iris-virginica \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 5.1 \n",
" 3.5 \n",
" 1.4 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 4.9 \n",
" 3.0 \n",
" 1.4 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 2 \n",
" 4.7 \n",
" 3.2 \n",
" 1.3 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 3 \n",
" 4.6 \n",
" 3.1 \n",
" 1.5 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 4 \n",
" 5.0 \n",
" 3.6 \n",
" 1.4 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" \n",
" \n",
" 145 \n",
" 6.7 \n",
" 3.0 \n",
" 5.2 \n",
" 2.3 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
" 146 \n",
" 6.3 \n",
" 2.5 \n",
" 5.0 \n",
" 1.9 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
" 147 \n",
" 6.5 \n",
" 3.0 \n",
" 5.2 \n",
" 2.0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
" 148 \n",
" 6.2 \n",
" 3.4 \n",
" 5.4 \n",
" 2.3 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
" 149 \n",
" 5.9 \n",
" 3.0 \n",
" 5.1 \n",
" 1.8 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
"
\n",
"
150 rows × 7 columns
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" SepalLength SepalWidth ... Iris-versicolor Iris-virginica\n",
"0 5.1 3.5 ... 0 0\n",
"1 4.9 3.0 ... 0 0\n",
"2 4.7 3.2 ... 0 0\n",
"3 4.6 3.1 ... 0 0\n",
"4 5.0 3.6 ... 0 0\n",
".. ... ... ... ... ...\n",
"145 6.7 3.0 ... 0 1\n",
"146 6.3 2.5 ... 0 1\n",
"147 6.5 3.0 ... 0 1\n",
"148 6.2 3.4 ... 0 1\n",
"149 5.9 3.0 ... 0 1\n",
"\n",
"[150 rows x 7 columns]"
]
},
"metadata": {},
"execution_count": 3
}
]
},
{
"cell_type": "markdown",
"source": [
"This might have been easier with just passing the entire dataframe to the get_dummies command but it gives less flexibility for changing other variables independently."
],
"metadata": {
"id": "jhX6bSAPEiDl"
}
},
{
"cell_type": "code",
"source": [
"pa.get_dummies(df)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "JYol5KAZEp2U",
"outputId": "b228b934-8513-4ffc-9d9d-b3f1c2a241fd"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" SepalLength \n",
" SepalWidth \n",
" PedalLength \n",
" PedalWidth \n",
" Class_Iris-setosa \n",
" Class_Iris-versicolor \n",
" Class_Iris-virginica \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 5.1 \n",
" 3.5 \n",
" 1.4 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 4.9 \n",
" 3.0 \n",
" 1.4 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 2 \n",
" 4.7 \n",
" 3.2 \n",
" 1.3 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 3 \n",
" 4.6 \n",
" 3.1 \n",
" 1.5 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 4 \n",
" 5.0 \n",
" 3.6 \n",
" 1.4 \n",
" 0.2 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" \n",
" \n",
" 145 \n",
" 6.7 \n",
" 3.0 \n",
" 5.2 \n",
" 2.3 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
" 146 \n",
" 6.3 \n",
" 2.5 \n",
" 5.0 \n",
" 1.9 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
" 147 \n",
" 6.5 \n",
" 3.0 \n",
" 5.2 \n",
" 2.0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
" 148 \n",
" 6.2 \n",
" 3.4 \n",
" 5.4 \n",
" 2.3 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
" 149 \n",
" 5.9 \n",
" 3.0 \n",
" 5.1 \n",
" 1.8 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" \n",
" \n",
"
\n",
"
150 rows × 7 columns
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" SepalLength SepalWidth ... Class_Iris-versicolor Class_Iris-virginica\n",
"0 5.1 3.5 ... 0 0\n",
"1 4.9 3.0 ... 0 0\n",
"2 4.7 3.2 ... 0 0\n",
"3 4.6 3.1 ... 0 0\n",
"4 5.0 3.6 ... 0 0\n",
".. ... ... ... ... ...\n",
"145 6.7 3.0 ... 0 1\n",
"146 6.3 2.5 ... 0 1\n",
"147 6.5 3.0 ... 0 1\n",
"148 6.2 3.4 ... 0 1\n",
"149 5.9 3.0 ... 0 1\n",
"\n",
"[150 rows x 7 columns]"
]
},
"metadata": {},
"execution_count": 4
}
]
},
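{
"cell_type": "markdown",
"source": [
"As a side note, `get_dummies` also takes a `drop_first` argument. For models with an intercept, one dummy column per variable is redundant (the columns for a category always sum to one), so dropping the first level is common. A quick sketch:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: drop the first level of each encoded category to avoid redundant columns\n",
"pa.get_dummies(df, drop_first = True).head()"
],
"metadata": {},
"execution_count": null,
"outputs": []
},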
{
"cell_type": "markdown",
"source": [
"We ran across some data recently that had a list as the input (actually we had to do a bit of cleanning to get to it but that is included below!) Perhaps we want to get indicators for that."
],
"metadata": {
"id": "sCFCBQqv9WKB"
}
},
{
"cell_type": "code",
"source": [
"from bs4 import BeautifulSoup\n",
"import requests\n",
"import re\n",
"\n",
"r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')\n",
"html_contents = r.text\n",
"html_soup = BeautifulSoup(html_contents,\"lxml\")\n",
"tables = html_soup.find_all('table',class_=\"wikitable\")\n",
"\n",
"df1 = pa.read_html(str(tables))[0]\n",
"df1.columns = df1.columns.droplevel(0).droplevel(0)\n",
"\n",
"newcol = df1.iloc[:,-1]\n",
"newcol = newcol.apply(lambda x: re.sub(r\"\\[(.+?)\\]\",\"\",x))\n",
"newcol = newcol.apply(lambda x: re.sub(r\"[^A-Za-z]\",\"\",x))\n",
"newcol = newcol.apply(lambda x: re.findall(r\"[A-Z][a-z]*\",x))\n",
"\n",
"newcol"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "8J_jAj-98g_j",
"outputId": "c1282a5c-3deb-4e05-9867-064a64760c07"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 [Nepal, China]\n",
"1 [Pakistan, China]\n",
"2 [Nepal, India]\n",
"3 [Nepal, China]\n",
"4 [Nepal, China]\n",
" ... \n",
"115 [China]\n",
"116 [Nepal, China]\n",
"117 [Bhutan, China]\n",
"118 [India, China]\n",
"119 [Pakistan]\n",
"Name: Country (disputed claims in italics), Length: 120, dtype: object"
]
},
"metadata": {},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"source": [
"First I'll convert the Series into a dataframe with multiple columns."
],
"metadata": {
"id": "G6_5rG8r_vgK"
}
},
{
"cell_type": "code",
"source": [
"newcol.apply(pa.Series)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "nhARZeK8_YEU",
"outputId": "9e961f17-16f6-4e83-e83c-0d88b4b83989"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" 0 \n",
" 1 \n",
" 2 \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" Nepal \n",
" China \n",
" NaN \n",
" \n",
" \n",
" 1 \n",
" Pakistan \n",
" China \n",
" NaN \n",
" \n",
" \n",
" 2 \n",
" Nepal \n",
" India \n",
" NaN \n",
" \n",
" \n",
" 3 \n",
" Nepal \n",
" China \n",
" NaN \n",
" \n",
" \n",
" 4 \n",
" Nepal \n",
" China \n",
" NaN \n",
" \n",
" \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" \n",
" \n",
" 115 \n",
" China \n",
" NaN \n",
" NaN \n",
" \n",
" \n",
" 116 \n",
" Nepal \n",
" China \n",
" NaN \n",
" \n",
" \n",
" 117 \n",
" Bhutan \n",
" China \n",
" NaN \n",
" \n",
" \n",
" 118 \n",
" India \n",
" China \n",
" NaN \n",
" \n",
" \n",
" 119 \n",
" Pakistan \n",
" NaN \n",
" NaN \n",
" \n",
" \n",
"
\n",
"
120 rows × 3 columns
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" 0 1 2\n",
"0 Nepal China NaN\n",
"1 Pakistan China NaN\n",
"2 Nepal India NaN\n",
"3 Nepal China NaN\n",
"4 Nepal China NaN\n",
".. ... ... ...\n",
"115 China NaN NaN\n",
"116 Nepal China NaN\n",
"117 Bhutan China NaN\n",
"118 India China NaN\n",
"119 Pakistan NaN NaN\n",
"\n",
"[120 rows x 3 columns]"
]
},
"metadata": {},
"execution_count": 17
}
]
},
{
"cell_type": "markdown",
"source": [
"Next I'll use the `stack` command to break each row apart into its individual pieces."
],
"metadata": {
"id": "fkphLSLoAS-k"
}
},
{
"cell_type": "code",
"source": [
"newcol.apply(pa.Series).stack()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "17GBojy3Ahm5",
"outputId": "f8e2b061-9e0f-4c9e-9331-f4c9b0b9156c"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 0 Nepal\n",
" 1 China\n",
"1 0 Pakistan\n",
" 1 China\n",
"2 0 Nepal\n",
" ... \n",
"117 0 Bhutan\n",
" 1 China\n",
"118 0 India\n",
" 1 China\n",
"119 0 Pakistan\n",
"Length: 161, dtype: object"
]
},
"metadata": {},
"execution_count": 18
}
]
},
{
"cell_type": "markdown",
"source": [
"Now I can get the dummies!"
],
"metadata": {
"id": "vkE4bwEwAoKt"
}
},
{
"cell_type": "code",
"source": [
"pa.get_dummies(newcol.apply(pa.Series).stack())"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "j-bY53qRArL1",
"outputId": "4121cb8a-c779-4ced-9380-cdbd8d0d6e2c"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" Afghanistan \n",
" Bhutan \n",
" China \n",
" India \n",
" Kyrgyzstan \n",
" Nepal \n",
" Pakistan \n",
" Tajikistan \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 2 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" \n",
" \n",
" 117 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 118 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 119 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" \n",
" \n",
"
\n",
"
161 rows × 8 columns
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" Afghanistan Bhutan China ... Nepal Pakistan Tajikistan\n",
"0 0 0 0 0 ... 1 0 0\n",
" 1 0 0 1 ... 0 0 0\n",
"1 0 0 0 0 ... 0 1 0\n",
" 1 0 0 1 ... 0 0 0\n",
"2 0 0 0 0 ... 1 0 0\n",
"... ... ... ... ... ... ... ...\n",
"117 0 0 1 0 ... 0 0 0\n",
" 1 0 0 1 ... 0 0 0\n",
"118 0 0 0 0 ... 0 0 0\n",
" 1 0 0 1 ... 0 0 0\n",
"119 0 0 0 0 ... 0 1 0\n",
"\n",
"[161 rows x 8 columns]"
]
},
"metadata": {},
"execution_count": 19
}
]
},
{
"cell_type": "markdown",
"source": [
"Lastly we bring it all back together using the groupby command on the indexes using `level = 0`. We sum the totals as well."
],
"metadata": {
"id": "u0u6hU14A6Ou"
}
},
{
"cell_type": "code",
"source": [
"pa.get_dummies(newcol.apply(pa.Series).stack()).groupby(level = 0).sum()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 424
},
"id": "BAcZeYvp9jJo",
"outputId": "532914a0-484b-4bd0-f176-bf88ae367aa5"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Afghanistan \n",
" Bhutan \n",
" China \n",
" India \n",
" Kyrgyzstan \n",
" Nepal \n",
" Pakistan \n",
" Tajikistan \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" \n",
" \n",
" 2 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 3 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 4 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" ... \n",
" \n",
" \n",
" 115 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 116 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 117 \n",
" 0 \n",
" 1 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 118 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 1 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" \n",
" \n",
" 119 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 0 \n",
" 1 \n",
" 0 \n",
" \n",
" \n",
"
\n",
"
120 rows × 8 columns
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" Afghanistan Bhutan China India Kyrgyzstan Nepal Pakistan Tajikistan\n",
"0 0 0 1 0 0 1 0 0\n",
"1 0 0 1 0 0 0 1 0\n",
"2 0 0 0 1 0 1 0 0\n",
"3 0 0 1 0 0 1 0 0\n",
"4 0 0 1 0 0 1 0 0\n",
".. ... ... ... ... ... ... ... ...\n",
"115 0 0 1 0 0 0 0 0\n",
"116 0 0 1 0 0 1 0 0\n",
"117 0 1 1 0 0 0 0 0\n",
"118 0 0 1 1 0 0 0 0\n",
"119 0 0 0 0 0 0 1 0\n",
"\n",
"[120 rows x 8 columns]"
]
},
"metadata": {},
"execution_count": 20
}
]
},
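{
"cell_type": "markdown",
"source": [
"(A side note: in newer versions of pandas, `Series.explode` unpacks the lists directly, so it can stand in for the `apply(pa.Series).stack()` step above.)"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: explode the lists, one hot encode, then recombine rows by the original index\n",
"pa.get_dummies(newcol.explode()).groupby(level = 0).sum()"
],
"metadata": {},
"execution_count": null,
"outputs": []
},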
{
"cell_type": "markdown",
"source": [
"## Factors"
],
"metadata": {
"id": "Jz-YTrLhC4Tw"
}
},
{
"cell_type": "markdown",
"source": [
"One hot encoding works great for data that is nominal. What then should we do for data that has an order (ordinal)? If the data has a natural order **AND** the steps are equal, it may just be best to transform with a dictionary. Let's see that in action."
],
"metadata": {
"id": "_4LaXQzAC6Wf"
}
},
{
"cell_type": "code",
"source": [
"df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')\n",
"\n",
"df.head()"
],
"metadata": {
"id": "MSEOQLWV9q6l",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"outputId": "3c88dfe1-cc5c-482c-b9d5-a5b6c4820158"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" age \n",
" job \n",
" marital \n",
" education \n",
" default \n",
" balance \n",
" housing \n",
" loan \n",
" contact \n",
" day \n",
" month \n",
" duration \n",
" campaign \n",
" pdays \n",
" previous \n",
" poutcome \n",
" y \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 30 \n",
" unemployed \n",
" married \n",
" primary \n",
" no \n",
" 1787 \n",
" no \n",
" no \n",
" cellular \n",
" 19 \n",
" oct \n",
" 79 \n",
" 1 \n",
" -1 \n",
" 0 \n",
" unknown \n",
" no \n",
" \n",
" \n",
" 1 \n",
" 33 \n",
" services \n",
" married \n",
" secondary \n",
" no \n",
" 4789 \n",
" yes \n",
" yes \n",
" cellular \n",
" 11 \n",
" may \n",
" 220 \n",
" 1 \n",
" 339 \n",
" 4 \n",
" failure \n",
" no \n",
" \n",
" \n",
" 2 \n",
" 35 \n",
" management \n",
" single \n",
" tertiary \n",
" no \n",
" 1350 \n",
" yes \n",
" no \n",
" cellular \n",
" 16 \n",
" apr \n",
" 185 \n",
" 1 \n",
" 330 \n",
" 1 \n",
" failure \n",
" no \n",
" \n",
" \n",
" 3 \n",
" 30 \n",
" management \n",
" married \n",
" tertiary \n",
" no \n",
" 1476 \n",
" yes \n",
" yes \n",
" unknown \n",
" 3 \n",
" jun \n",
" 199 \n",
" 4 \n",
" -1 \n",
" 0 \n",
" unknown \n",
" no \n",
" \n",
" \n",
" 4 \n",
" 59 \n",
" blue-collar \n",
" married \n",
" secondary \n",
" no \n",
" 0 \n",
" yes \n",
" no \n",
" unknown \n",
" 5 \n",
" may \n",
" 226 \n",
" 1 \n",
" -1 \n",
" 0 \n",
" unknown \n",
" no \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" age job marital education ... pdays previous poutcome y\n",
"0 30 unemployed married primary ... -1 0 unknown no\n",
"1 33 services married secondary ... 339 4 failure no\n",
"2 35 management single tertiary ... 330 1 failure no\n",
"3 30 management married tertiary ... -1 0 unknown no\n",
"4 59 blue-collar married secondary ... -1 0 unknown no\n",
"\n",
"[5 rows x 17 columns]"
]
},
"metadata": {},
"execution_count": 6
}
]
},
{
"cell_type": "markdown",
"source": [
"Education is clearly well ordered. We'll assume that each step is of equal importance. (This is probably not a good assumption as not having a high school degree VS GED/High School Diploma VS College Grad are not equal steps in the credit world of banking.)"
],
"metadata": {
"id": "CQW-L6dKFZPH"
}
},
{
"cell_type": "code",
"source": [
"edlevels = sorted(df.education.unique())\n",
"\n",
"edlevels"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2uK3bSTJFYVF",
"outputId": "f5579a83-f108-467c-f218-40903527bd5d"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['primary', 'secondary', 'tertiary', 'unknown']"
]
},
"metadata": {},
"execution_count": 7
}
]
},
{
"cell_type": "markdown",
"source": [
"I've gathered the levels and sorted them alphabetically. Now I'll build a dictionary. Besides the ordering, there is nothing special about the dictionary I have assigned; many other assignments would work equally well!"
],
"metadata": {
"id": "NuPZn5LKGsvU"
}
},
{
"cell_type": "code",
"source": [
"eddict = {}\n",
"\n",
"for i in range(4):\n",
" eddict[edlevels[i]] = i\n",
"\n",
"eddict"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "1NDkCIB1F4Ss",
"outputId": "0ccbed5a-be51-493f-dbf5-f9547666449f"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 3}"
]
},
"metadata": {},
"execution_count": 8
}
]
},
{
"cell_type": "markdown",
"source": [
"I probably shouldn't assign the 'unknown' education level a value beyond college!"
],
"metadata": {
"id": "eAcYcN8cFaFl"
}
},
{
"cell_type": "code",
"source": [
"eddict['unknown'] = 0\n",
"\n",
"eddict"
],
"metadata": {
"id": "P6McQU-_Dn_P",
"outputId": "5b664a47-da5c-4673-edc2-9204c4917d7b",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 0}"
]
},
"metadata": {},
"execution_count": 9
}
]
},
{
"cell_type": "markdown",
"source": [
"How many were there that were unknown any way?"
],
"metadata": {
"id": "dF5ockAnFjZ1"
}
},
{
"cell_type": "code",
"source": [
"df.groupby('education').age.agg('count')"
],
"metadata": {
"id": "HcGLbWECD737",
"outputId": "6693e36e-ce6d-4cd0-82b0-9d5f37f882a1",
"colab": {
"base_uri": "https://localhost:8080/"
}
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"education\n",
"primary 678\n",
"secondary 2306\n",
"tertiary 1350\n",
"unknown 187\n",
"Name: age, dtype: int64"
]
},
"metadata": {},
"execution_count": 11
}
]
},
{
"cell_type": "markdown",
"source": [
"Now I'll apply the dictionary with `map`."
],
"metadata": {
"id": "OezYTzdOGrpb"
}
},
{
"cell_type": "code",
"source": [
"df.education.map(eddict)"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pRGzPi2eGfMb",
"outputId": "ebb04320-740f-4d1a-8793-4cd705e71eec"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"0 0\n",
"1 1\n",
"2 2\n",
"3 2\n",
"4 1\n",
" ..\n",
"4516 1\n",
"4517 2\n",
"4518 1\n",
"4519 1\n",
"4520 2\n",
"Name: education, Length: 4521, dtype: int64"
]
},
"metadata": {},
"execution_count": 12
}
]
},
{
"cell_type": "markdown",
"source": [
"Putting it back into the dataframe is a snap."
],
"metadata": {
"id": "ra5c-1e4HiTg"
}
},
{
"cell_type": "code",
"source": [
"df.education = df.education.map(eddict)\n",
"\n",
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 206
},
"id": "804o3iyZGpsc",
"outputId": "dbac132b-29b0-4aed-d8c1-3c6783556bb0"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" age \n",
" job \n",
" marital \n",
" education \n",
" default \n",
" balance \n",
" housing \n",
" loan \n",
" contact \n",
" day \n",
" month \n",
" duration \n",
" campaign \n",
" pdays \n",
" previous \n",
" poutcome \n",
" y \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 30 \n",
" unemployed \n",
" married \n",
" 0 \n",
" no \n",
" 1787 \n",
" no \n",
" no \n",
" cellular \n",
" 19 \n",
" oct \n",
" 79 \n",
" 1 \n",
" -1 \n",
" 0 \n",
" unknown \n",
" no \n",
" \n",
" \n",
" 1 \n",
" 33 \n",
" services \n",
" married \n",
" 1 \n",
" no \n",
" 4789 \n",
" yes \n",
" yes \n",
" cellular \n",
" 11 \n",
" may \n",
" 220 \n",
" 1 \n",
" 339 \n",
" 4 \n",
" failure \n",
" no \n",
" \n",
" \n",
" 2 \n",
" 35 \n",
" management \n",
" single \n",
" 2 \n",
" no \n",
" 1350 \n",
" yes \n",
" no \n",
" cellular \n",
" 16 \n",
" apr \n",
" 185 \n",
" 1 \n",
" 330 \n",
" 1 \n",
" failure \n",
" no \n",
" \n",
" \n",
" 3 \n",
" 30 \n",
" management \n",
" married \n",
" 2 \n",
" no \n",
" 1476 \n",
" yes \n",
" yes \n",
" unknown \n",
" 3 \n",
" jun \n",
" 199 \n",
" 4 \n",
" -1 \n",
" 0 \n",
" unknown \n",
" no \n",
" \n",
" \n",
" 4 \n",
" 59 \n",
" blue-collar \n",
" married \n",
" 1 \n",
" no \n",
" 0 \n",
" yes \n",
" no \n",
" unknown \n",
" 5 \n",
" may \n",
" 226 \n",
" 1 \n",
" -1 \n",
" 0 \n",
" unknown \n",
" no \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" age job marital education ... pdays previous poutcome y\n",
"0 30 unemployed married 0 ... -1 0 unknown no\n",
"1 33 services married 1 ... 339 4 failure no\n",
"2 35 management single 2 ... 330 1 failure no\n",
"3 30 management married 2 ... -1 0 unknown no\n",
"4 59 blue-collar married 1 ... -1 0 unknown no\n",
"\n",
"[5 rows x 17 columns]"
]
},
"metadata": {},
"execution_count": 13
}
]
},
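{
"cell_type": "markdown",
"source": [
"(A side note: pandas also has an ordered `Categorical` type that produces integer codes like these. The values below are a made-up illustration, not the bank data.)"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sketch: an ordered Categorical assigns codes 0, 1, 2, ... following the stated order\n",
"edu = pa.Series(['primary', 'tertiary', 'secondary', 'primary'])\n",
"pa.Categorical(edu, categories = ['primary', 'secondary', 'tertiary'], ordered = True).codes"
],
"metadata": {},
"execution_count": null,
"outputs": []
},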
{
"cell_type": "markdown",
"source": [
"## Your Turn"
],
"metadata": {
"id": "_6g6Be1sKrzH"
}
},
{
"cell_type": "markdown",
"source": [
"Use the data file found [here](https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv) on [AirB&B in NYC](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). Convert the neighbourhood_group and the room_type into appropriate numerical data. Justify your decisisons on how you made the transformations."
],
"metadata": {
"id": "eOEJzG9jKuPM"
}
},
{
"cell_type": "code",
"source": [
"df = pa.read_csv(\"https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv\")\n",
"df.head()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 427
},
"id": "2cKI8mDZHpzy",
"outputId": "97d7bae1-b7db-4a05-cff8-cf3281317b06"
},
"execution_count": null,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"\n",
" \n",
"
\n",
"
\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" id \n",
" name \n",
" host_id \n",
" host_name \n",
" neighbourhood_group \n",
" neighbourhood \n",
" latitude \n",
" longitude \n",
" room_type \n",
" price \n",
" minimum_nights \n",
" number_of_reviews \n",
" last_review \n",
" reviews_per_month \n",
" calculated_host_listings_count \n",
" availability_365 \n",
" \n",
" \n",
" \n",
" \n",
" 0 \n",
" 2539 \n",
" Clean & quiet apt home by the park \n",
" 2787 \n",
" John \n",
" Brooklyn \n",
" Kensington \n",
" 40.64749 \n",
" -73.97237 \n",
" Private room \n",
" 149 \n",
" 1 \n",
" 9 \n",
" 2018-10-19 \n",
" 0.21 \n",
" 6 \n",
" 365 \n",
" \n",
" \n",
" 1 \n",
" 2595 \n",
" Skylit Midtown Castle \n",
" 2845 \n",
" Jennifer \n",
" Manhattan \n",
" Midtown \n",
" 40.75362 \n",
" -73.98377 \n",
" Entire home/apt \n",
" 225 \n",
" 1 \n",
" 45 \n",
" 2019-05-21 \n",
" 0.38 \n",
" 2 \n",
" 355 \n",
" \n",
" \n",
" 2 \n",
" 3647 \n",
" THE VILLAGE OF HARLEM....NEW YORK ! \n",
" 4632 \n",
" Elisabeth \n",
" Manhattan \n",
" Harlem \n",
" 40.80902 \n",
" -73.94190 \n",
" Private room \n",
" 150 \n",
" 3 \n",
" 0 \n",
" NaN \n",
" NaN \n",
" 1 \n",
" 365 \n",
" \n",
" \n",
" 3 \n",
" 3831 \n",
" Cozy Entire Floor of Brownstone \n",
" 4869 \n",
" LisaRoxanne \n",
" Brooklyn \n",
" Clinton Hill \n",
" 40.68514 \n",
" -73.95976 \n",
" Entire home/apt \n",
" 89 \n",
" 1 \n",
" 270 \n",
" 2019-07-05 \n",
" 4.64 \n",
" 1 \n",
" 194 \n",
" \n",
" \n",
" 4 \n",
" 5022 \n",
" Entire Apt: Spacious Studio/Loft by central park \n",
" 7192 \n",
" Laura \n",
" Manhattan \n",
" East Harlem \n",
" 40.79851 \n",
" -73.94399 \n",
" Entire home/apt \n",
" 80 \n",
" 10 \n",
" 9 \n",
" 2018-11-19 \n",
" 0.10 \n",
" 1 \n",
" 0 \n",
" \n",
" \n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
"\n",
" \n",
"
\n",
"
\n",
" "
],
"text/plain": [
" id ... availability_365\n",
"0 2539 ... 365\n",
"1 2595 ... 355\n",
"2 3647 ... 365\n",
"3 3831 ... 194\n",
"4 5022 ... 0\n",
"\n",
"[5 rows x 16 columns]"
]
},
"metadata": {},
"execution_count": 37
}
]
}
]
}