{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Untitled87.ipynb", "provenance": [], "authorship_tag": "ABX9TyMYr/3v8bwA55sBwwVEC4qh", "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "# Factors and One Hot Encoding" ], "metadata": { "id": "QwffnQ_A7O3u" } }, { "cell_type": "markdown", "source": [ "Often we want to use strings in a numerical way besides just counting. In data science we often want to include a category in our models. To deal with this we can do a wide variety of transformations on the categorical variables." ], "metadata": { "id": "t1DKbTmM7YmX" } }, { "cell_type": "markdown", "source": [ "## One Hot Encoding" ], "metadata": { "id": "tGic7KKJ70rE" } }, { "cell_type": "markdown", "source": [ "Perhaps the easiest to understand is the one **one hot encoder** essentially we give a new column for every category in the categorical variable. In pandas, this action is preformed by the `get_dummies` command. Let's see it in action." ], "metadata": { "id": "U-6ZSxlI7582" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "7vhr3vi17N1F", "outputId": "efcfa73d-4c90-4a84-f83d-cd19945f2436" }, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SepalLengthSepalWidthPedalLengthPedalWidthClass
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " SepalLength SepalWidth PedalLength PedalWidth Class\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa" ] }, "metadata": {}, "execution_count": 1 } ], "source": [ "import pandas as pa\n", "\n", "df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')\n", "\n", "df.head()" ] }, { "cell_type": "code", "source": [ "pa.get_dummies(df.Class).head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "GSiO-3Ex8cMq", "outputId": "f894cb21-aa18-4c71-f481-78cd85a51d50" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Iris-setosaIris-versicolorIris-virginica
0100
1100
2100
3100
4100
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " Iris-setosa Iris-versicolor Iris-virginica\n", "0 1 0 0\n", "1 1 0 0\n", "2 1 0 0\n", "3 1 0 0\n", "4 1 0 0" ] }, "metadata": {}, "execution_count": 2 } ] }, { "cell_type": "markdown", "source": [ "We see that each column was given the name of the category and a 1 indeicating membership and a 0 if not a member. We can return this into our orginal dataset through a `concat`." ], "metadata": { "id": "--Gpysyb80lC" } }, { "cell_type": "code", "source": [ "df1 = pa.concat([df,pa.get_dummies(df.Class)],axis = 1)\n", "\n", "df1 = df1.drop('Class', axis = 1)\n", "\n", "df1" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "gzIw4eKFDWix", "outputId": "690b13a1-c3cd-4430-8af6-c72ee47eabe4" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SepalLengthSepalWidthPedalLengthPedalWidthIris-setosaIris-versicolorIris-virginica
05.13.51.40.2100
14.93.01.40.2100
24.73.21.30.2100
34.63.11.50.2100
45.03.61.40.2100
........................
1456.73.05.22.3001
1466.32.55.01.9001
1476.53.05.22.0001
1486.23.45.42.3001
1495.93.05.11.8001
\n", "

150 rows × 7 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " SepalLength SepalWidth ... Iris-versicolor Iris-virginica\n", "0 5.1 3.5 ... 0 0\n", "1 4.9 3.0 ... 0 0\n", "2 4.7 3.2 ... 0 0\n", "3 4.6 3.1 ... 0 0\n", "4 5.0 3.6 ... 0 0\n", ".. ... ... ... ... ...\n", "145 6.7 3.0 ... 0 1\n", "146 6.3 2.5 ... 0 1\n", "147 6.5 3.0 ... 0 1\n", "148 6.2 3.4 ... 0 1\n", "149 5.9 3.0 ... 0 1\n", "\n", "[150 rows x 7 columns]" ] }, "metadata": {}, "execution_count": 3 } ] }, { "cell_type": "markdown", "source": [ "This might have been easier with just passing the entire dataframe to the get_dummies command but it gives less flexibility for changing other variables independently." ], "metadata": { "id": "jhX6bSAPEiDl" } }, { "cell_type": "code", "source": [ "pa.get_dummies(df)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "JYol5KAZEp2U", "outputId": "b228b934-8513-4ffc-9d9d-b3f1c2a241fd" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SepalLengthSepalWidthPedalLengthPedalWidthClass_Iris-setosaClass_Iris-versicolorClass_Iris-virginica
05.13.51.40.2100
14.93.01.40.2100
24.73.21.30.2100
34.63.11.50.2100
45.03.61.40.2100
........................
1456.73.05.22.3001
1466.32.55.01.9001
1476.53.05.22.0001
1486.23.45.42.3001
1495.93.05.11.8001
\n", "

150 rows × 7 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " SepalLength SepalWidth ... Class_Iris-versicolor Class_Iris-virginica\n", "0 5.1 3.5 ... 0 0\n", "1 4.9 3.0 ... 0 0\n", "2 4.7 3.2 ... 0 0\n", "3 4.6 3.1 ... 0 0\n", "4 5.0 3.6 ... 0 0\n", ".. ... ... ... ... ...\n", "145 6.7 3.0 ... 0 1\n", "146 6.3 2.5 ... 0 1\n", "147 6.5 3.0 ... 0 1\n", "148 6.2 3.4 ... 0 1\n", "149 5.9 3.0 ... 0 1\n", "\n", "[150 rows x 7 columns]" ] }, "metadata": {}, "execution_count": 4 } ] }, { "cell_type": "markdown", "source": [ "We ran across some data recently that had a list as the input (actually we had to do a bit of cleanning to get to it but that is included below!) Perhaps we want to get indicators for that." ], "metadata": { "id": "sCFCBQqv9WKB" } }, { "cell_type": "code", "source": [ "from bs4 import BeautifulSoup\n", "import requests\n", "import re\n", "\n", "r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')\n", "html_contents = r.text\n", "html_soup = BeautifulSoup(html_contents,\"lxml\")\n", "tables = html_soup.find_all('table',class_=\"wikitable\")\n", "\n", "df1 = pa.read_html(str(tables))[0]\n", "df1.columns = df1.columns.droplevel(0).droplevel(0)\n", "\n", "newcol = df1.iloc[:,-1]\n", "newcol = newcol.apply(lambda x: re.sub(r\"\\[(.+?)\\]\",\"\",x))\n", "newcol = newcol.apply(lambda x: re.sub(r\"[^A-z]\",\"\",x))\n", "newcol = newcol.apply(lambda x: re.findall(r\"[A-Z][a-z]*\",x))\n", "\n", "newcol" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "8J_jAj-98g_j", "outputId": "c1282a5c-3deb-4e05-9867-064a64760c07" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 [Nepal, China]\n", "1 [Pakistan, China]\n", "2 [Nepal, India]\n", "3 [Nepal, China]\n", "4 [Nepal, China]\n", " ... \n", "115 [China]\n", "116 [Nepal, China]\n", "117 [Bhutan, China]\n", "118 [India, China]\n", "119 [Pakistan]\n", "Name: Country (disputed claims in italics), Length: 120, dtype: object" ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "markdown", "source": [ "First I'll convert the data Series into a dataframe with multiple columns." ], "metadata": { "id": "G6_5rG8r_vgK" } }, { "cell_type": "code", "source": [ "newcol.apply(pa.Series)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "nhARZeK8_YEU", "outputId": "9e961f17-16f6-4e83-e83c-0d88b4b83989" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
012
0NepalChinaNaN
1PakistanChinaNaN
2NepalIndiaNaN
3NepalChinaNaN
4NepalChinaNaN
............
115ChinaNaNNaN
116NepalChinaNaN
117BhutanChinaNaN
118IndiaChinaNaN
119PakistanNaNNaN
\n", "

120 rows × 3 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " 0 1 2\n", "0 Nepal China NaN\n", "1 Pakistan China NaN\n", "2 Nepal India NaN\n", "3 Nepal China NaN\n", "4 Nepal China NaN\n", ".. ... ... ...\n", "115 China NaN NaN\n", "116 Nepal China NaN\n", "117 Bhutan China NaN\n", "118 India China NaN\n", "119 Pakistan NaN NaN\n", "\n", "[120 rows x 3 columns]" ] }, "metadata": {}, "execution_count": 17 } ] }, { "cell_type": "markdown", "source": [ "Next I'll use the `stack` command to break each row apart into its individual peices." ], "metadata": { "id": "fkphLSLoAS-k" } }, { "cell_type": "code", "source": [ "newcol.apply(pa.Series).stack()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "17GBojy3Ahm5", "outputId": "f8e2b061-9e0f-4c9e-9331-f4c9b0b9156c" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 0 Nepal\n", " 1 China\n", "1 0 Pakistan\n", " 1 China\n", "2 0 Nepal\n", " ... \n", "117 0 Bhutan\n", " 1 China\n", "118 0 India\n", " 1 China\n", "119 0 Pakistan\n", "Length: 161, dtype: object" ] }, "metadata": {}, "execution_count": 18 } ] }, { "cell_type": "markdown", "source": [ "Now I can get the dummies!" ], "metadata": { "id": "vkE4bwEwAoKt" } }, { "cell_type": "code", "source": [ "pa.get_dummies(newcol.apply(pa.Series).stack())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "j-bY53qRArL1", "outputId": "4121cb8a-c779-4ced-9380-cdbd8d0d6e2c" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AfghanistanBhutanChinaIndiaKyrgyzstanNepalPakistanTajikistan
0000000100
100100000
1000000010
100100000
2000000100
..............................
117001000000
100100000
118000010000
100100000
119000000010
\n", "

161 rows × 8 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " Afghanistan Bhutan China ... Nepal Pakistan Tajikistan\n", "0 0 0 0 0 ... 1 0 0\n", " 1 0 0 1 ... 0 0 0\n", "1 0 0 0 0 ... 0 1 0\n", " 1 0 0 1 ... 0 0 0\n", "2 0 0 0 0 ... 1 0 0\n", "... ... ... ... ... ... ... ...\n", "117 0 0 1 0 ... 0 0 0\n", " 1 0 0 1 ... 0 0 0\n", "118 0 0 0 0 ... 0 0 0\n", " 1 0 0 1 ... 0 0 0\n", "119 0 0 0 0 ... 0 1 0\n", "\n", "[161 rows x 8 columns]" ] }, "metadata": {}, "execution_count": 19 } ] }, { "cell_type": "markdown", "source": [ "Lastly we bring it all back together using the groupby command on the indexes using `level = 0`. We sum the totals as well." ], "metadata": { "id": "u0u6hU14A6Ou" } }, { "cell_type": "code", "source": [ "pa.get_dummies(newcol.apply(pa.Series).stack()).groupby(level = 0).sum()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 424 }, "id": "BAcZeYvp9jJo", "outputId": "532914a0-484b-4bd0-f176-bf88ae367aa5" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AfghanistanBhutanChinaIndiaKyrgyzstanNepalPakistanTajikistan
000100100
100100010
200010100
300100100
400100100
...........................
11500100000
11600100100
11701100000
11800110000
11900000010
\n", "

120 rows × 8 columns

\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " Afghanistan Bhutan China India Kyrgyzstan Nepal Pakistan Tajikistan\n", "0 0 0 1 0 0 1 0 0\n", "1 0 0 1 0 0 0 1 0\n", "2 0 0 0 1 0 1 0 0\n", "3 0 0 1 0 0 1 0 0\n", "4 0 0 1 0 0 1 0 0\n", ".. ... ... ... ... ... ... ... ...\n", "115 0 0 1 0 0 0 0 0\n", "116 0 0 1 0 0 1 0 0\n", "117 0 1 1 0 0 0 0 0\n", "118 0 0 1 1 0 0 0 0\n", "119 0 0 0 0 0 0 1 0\n", "\n", "[120 rows x 8 columns]" ] }, "metadata": {}, "execution_count": 20 } ] }, { "cell_type": "markdown", "source": [ "## Factors" ], "metadata": { "id": "Jz-YTrLhC4Tw" } }, { "cell_type": "markdown", "source": [ "One hot encoding works great for data that is nominal. What then should we do for data that has an order (ordinal)? If the data has a natural order **AND** the steps are equal, it may just be best to transform with a dictionary. Let's see that in action." ], "metadata": { "id": "_4LaXQzAC6Wf" } }, { "cell_type": "code", "source": [ "df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')\n", "\n", "df.head()" ], "metadata": { "id": "MSEOQLWV9q6l", "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "outputId": "3c88dfe1-cc5c-482c-b9d5-a5b6c4820158" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agejobmaritaleducationdefaultbalancehousingloancontactdaymonthdurationcampaignpdayspreviouspoutcomey
030unemployedmarriedprimaryno1787nonocellular19oct791-10unknownno
133servicesmarriedsecondaryno4789yesyescellular11may22013394failureno
235managementsingletertiaryno1350yesnocellular16apr18513301failureno
330managementmarriedtertiaryno1476yesyesunknown3jun1994-10unknownno
459blue-collarmarriedsecondaryno0yesnounknown5may2261-10unknownno
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " age job marital education ... pdays previous poutcome y\n", "0 30 unemployed married primary ... -1 0 unknown no\n", "1 33 services married secondary ... 339 4 failure no\n", "2 35 management single tertiary ... 330 1 failure no\n", "3 30 management married tertiary ... -1 0 unknown no\n", "4 59 blue-collar married secondary ... -1 0 unknown no\n", "\n", "[5 rows x 17 columns]" ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "markdown", "source": [ "Education is clearly well ordered. We'll assume that each step is of equal importance. (This is probably not a good assumption as not having a high school degree VS GED/High School Diploma VS College Grad are not equal steps in the credit world of banking.)" ], "metadata": { "id": "CQW-L6dKFZPH" } }, { "cell_type": "code", "source": [ "edlevels = sorted(df.education.unique())\n", "\n", "edlevels" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "2uK3bSTJFYVF", "outputId": "f5579a83-f108-467c-f218-40903527bd5d" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "['primary', 'secondary', 'tertiary', 'unknown']" ] }, "metadata": {}, "execution_count": 7 } ] }, { "cell_type": "markdown", "source": [ "I've gathered the levels and sorted them alphabetically. Now I'll build a dictionary. Besides the ordering, there is nothing special about the dictionary I have assigned. There are many others that could be equivalent!" ], "metadata": { "id": "NuPZn5LKGsvU" } }, { "cell_type": "code", "source": [ "eddict = {}\n", "\n", "for i in range(4):\n", " eddict[edlevels[i]] = i\n", "\n", "eddict" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1NDkCIB1F4Ss", "outputId": "0ccbed5a-be51-493f-dbf5-f9547666449f" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 3}" ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "markdown", "source": [ "I probably shouldn't assign an 'unknown' education level to beyond college!" ], "metadata": { "id": "eAcYcN8cFaFl" } }, { "cell_type": "code", "source": [ "eddict['unknown'] = 0\n", "\n", "eddict" ], "metadata": { "id": "P6McQU-_Dn_P", "outputId": "5b664a47-da5c-4673-edc2-9204c4917d7b", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "{'primary': 0, 'secondary': 1, 'tertiary': 2, 'unknown': 0}" ] }, "metadata": {}, "execution_count": 9 } ] }, { "cell_type": "markdown", "source": [ "How many were there that were unknown any way?" ], "metadata": { "id": "dF5ockAnFjZ1" } }, { "cell_type": "code", "source": [ "df.groupby('education').age.agg('count')" ], "metadata": { "id": "HcGLbWECD737", "outputId": "6693e36e-ce6d-4cd0-82b0-9d5f37f882a1", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "education\n", "primary 678\n", "secondary 2306\n", "tertiary 1350\n", "unknown 187\n", "Name: age, dtype: int64" ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "markdown", "source": [ "Now I'll apply the dictionary with `map`." ], "metadata": { "id": "OezYTzdOGrpb" } }, { "cell_type": "code", "source": [ "df.education.map(eddict)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "pRGzPi2eGfMb", "outputId": "ebb04320-740f-4d1a-8793-4cd705e71eec" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 0\n", "1 1\n", "2 2\n", "3 2\n", "4 1\n", " ..\n", "4516 1\n", "4517 2\n", "4518 1\n", "4519 1\n", "4520 2\n", "Name: education, Length: 4521, dtype: int64" ] }, "metadata": {}, "execution_count": 12 } ] }, { "cell_type": "markdown", "source": [ "Putting it back into the dataframe is a snap." ], "metadata": { "id": "ra5c-1e4HiTg" } }, { "cell_type": "code", "source": [ "df.education = df.education.map(eddict)\n", "\n", "df.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "804o3iyZGpsc", "outputId": "dbac132b-29b0-4aed-d8c1-3c6783556bb0" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
agejobmaritaleducationdefaultbalancehousingloancontactdaymonthdurationcampaignpdayspreviouspoutcomey
030unemployedmarried0no1787nonocellular19oct791-10unknownno
133servicesmarried1no4789yesyescellular11may22013394failureno
235managementsingle2no1350yesnocellular16apr18513301failureno
330managementmarried2no1476yesyesunknown3jun1994-10unknownno
459blue-collarmarried1no0yesnounknown5may2261-10unknownno
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " age job marital education ... pdays previous poutcome y\n", "0 30 unemployed married 0 ... -1 0 unknown no\n", "1 33 services married 1 ... 339 4 failure no\n", "2 35 management single 2 ... 330 1 failure no\n", "3 30 management married 2 ... -1 0 unknown no\n", "4 59 blue-collar married 1 ... -1 0 unknown no\n", "\n", "[5 rows x 17 columns]" ] }, "metadata": {}, "execution_count": 13 } ] }, { "cell_type": "markdown", "source": [ "## Your Turn" ], "metadata": { "id": "_6g6Be1sKrzH" } }, { "cell_type": "markdown", "source": [ "Use the data file found [here](https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv) on [AirB&B in NYC](https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data). Convert the neighbourhood_group and the room_type into appropriate numerical data. Justify your decisisons on how you made the transformations." ], "metadata": { "id": "eOEJzG9jKuPM" } }, { "cell_type": "code", "source": [ "df = pa.read_csv(\"https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv\")\n", "df.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 427 }, "id": "2cKI8mDZHpzy", "outputId": "97d7bae1-b7db-4a05-cff8-cf3281317b06" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnamehost_idhost_nameneighbourhood_groupneighbourhoodlatitudelongituderoom_typepriceminimum_nightsnumber_of_reviewslast_reviewreviews_per_monthcalculated_host_listings_countavailability_365
02539Clean & quiet apt home by the park2787JohnBrooklynKensington40.64749-73.97237Private room149192018-10-190.216365
12595Skylit Midtown Castle2845JenniferManhattanMidtown40.75362-73.98377Entire home/apt2251452019-05-210.382355
23647THE VILLAGE OF HARLEM....NEW YORK !4632ElisabethManhattanHarlem40.80902-73.94190Private room15030NaNNaN1365
33831Cozy Entire Floor of Brownstone4869LisaRoxanneBrooklynClinton Hill40.68514-73.95976Entire home/apt8912702019-07-054.641194
45022Entire Apt: Spacious Studio/Loft by central park7192LauraManhattanEast Harlem40.79851-73.94399Entire home/apt801092018-11-190.1010
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " id ... availability_365\n", "0 2539 ... 365\n", "1 2595 ... 355\n", "2 3647 ... 365\n", "3 3831 ... 194\n", "4 5022 ... 0\n", "\n", "[5 rows x 16 columns]" ] }, "metadata": {}, "execution_count": 37 } ] } ] }