{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Untitled78.ipynb", "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Data Types" ], "metadata": { "id": "GR4Jyk8u1cZk" } }, { "cell_type": "markdown", "source": [ "## Lots of Data" ], "metadata": { "id": "QHF7m2Ef7Nea" } }, { "cell_type": "markdown", "source": [ "Before we start cleaning our data, we need to know what type of data we have and the best tools for working with it!\n", "\n", "Let's first look at how we bring our data into the Python environment. We have worked mostly with `pandas` and will continue to do so! Pandas is actually built on top of another library, `numpy`. At some point in our work we will need both!" ], "metadata": { "id": "33R-OSZj1f_q" } }, { "cell_type": "markdown", "source": [ "### Pandas" ], "metadata": { "id": "pCoJ0ero2Q0f" } }, { "cell_type": "markdown", "source": [ "Pandas is great for loading data. We have seen it handle CSV, HTML and the results of an SQL query. We can also load JSON and Excel files.\n", "\n", "`DataFrame` is the table structure we've used before, and a `Series` is similar to a single column.\n", "\n", "You should use a pandas DataFrame when your data contains categorical data.\n", "\n", "Pandas is best when dealing with large datasets."
], "metadata": { "id": "NB0yJEUk2TML" } }, { "cell_type": "code", "source": [ "import pandas as pa\n", "\n", "df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')\n", "\n", "df.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "ZnmcDmsw8NZ5", "outputId": "5b9cf3e1-9d27-4b6b-98cc-3751ad1081ab" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " SepalLength SepalWidth PedalLength PedalWidth Class\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa" ] }, "metadata": {}, "execution_count": 1 } ] }, { "cell_type": "markdown", "source": [ "### Numpy" ], "metadata": { "id": "j7csniAy4DeT" } }, { "cell_type": "markdown", "source": [ "The other important tool in Python is `numpy`. It is the foundation of the pandas module, but it has some limitations.\n", "\n", "Numpy is excellent for higher-dimensional data, stored as an `array`. Think of multiple sheets in an Excel spreadsheet: data that will not fit in a two-dimensional table.\n", "\n", "Numpy arrays can be accessed easily by their indices; element-by-element access like this is much less efficient in pandas.\n", "\n", "Numpy data should be just numbers! Categorical data should be converted to numbers before it goes into a numpy array. Numpy is efficient and fast but works best on smaller datasets.\n" ], "metadata": { "id": "5yVYYKMI4FuL" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "\n", "df1 = pa.get_dummies(data = df)\n", "\n", "X = np.array(df1)\n", "\n", "X" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "qw2-bFP98keb", "outputId": "b52daa4c-e0ac-4066-a59b-03cb5dcb7660" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array([[5.1, 3.5, 1.4, ..., 1. , 0. , 0. ],\n", " [4.9, 3. , 1.4, ..., 1. , 0. , 0. ],\n", " [4.7, 3.2, 1.3, ..., 1. , 0. , 0. ],\n", " ...,\n", " [6.5, 3. , 5.2, ..., 0. , 0. , 1. ],\n", " [6.2, 3.4, 5.4, ..., 0. , 0. , 1. ],\n", " [5.9, 3. , 5.1, ..., 0. , 0. , 1. ]])" ] }, "metadata": {}, "execution_count": 2 } ] }, { "cell_type": "markdown", "source": [ "We will discuss what `get_dummies` does in due time. For now, just know that it converted the class into numbers for use in a numpy array."
], "metadata": { "id": "QTO8Lilb-KeW" } }, { "cell_type": "code", "source": [ "df1.dtypes" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "voqkO1G3ylv6", "outputId": "139f9db6-351d-4846-b12e-a6f45cfa98e7" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "SepalLength float64\n", "SepalWidth float64\n", "PedalLength float64\n", "PedalWidth float64\n", "Class_Iris-setosa uint8\n", "Class_Iris-versicolor uint8\n", "Class_Iris-virginica uint8\n", "dtype: object" ] }, "metadata": {}, "execution_count": 23 } ] }, { "cell_type": "markdown", "source": [ "### Which is Best?" ], "metadata": { "id": "Wd0zlK7T6vzC" } }, { "cell_type": "markdown", "source": [ "Very often I will use both in a project. I'll start with pandas for loading, cleaning and basic analysis. Then I will convert the data to a numpy array and create models for predictions." ], "metadata": { "id": "ICqp5xjU61OD" } }, { "cell_type": "markdown", "source": [ "## Less Data" ], "metadata": { "id": "ryPaQGPR7K8E" } }, { "cell_type": "markdown", "source": [ "Now that we have lots of data, we'll have to start examining each piece! I am following [this page](https://pbpython.com/pandas_dtypes.html) on the different pandas data types." ], "metadata": { "id": "nfgrj3I67zCN" } }, { "cell_type": "markdown", "source": [ "### Strings" ], "metadata": { "id": "y61PnqpzG5Sr" } }, { "cell_type": "markdown", "source": [ "The most common type of data we examine is a string. We will spend a lot of time dealing with strings. Often data in another format is actually given as a string, so we'll have our work cut out for us manipulating strings.\n", "\n", "In the `iris` dataset, the class was given as a string."
], "metadata": { "id": "-n7hvkbMFpme" } }, { "cell_type": "code", "source": [ "df.Class.iloc[0]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "id": "-6tRnxCBGINt", "outputId": "f8a1f6cf-907d-4da3-f7b9-5cb04e22405f" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Iris-setosa'" ] }, "metadata": {}, "execution_count": 39 } ] }, { "cell_type": "markdown", "source": [ "The tell-tale sign of a string is the quotes. Of course, we can store a string in a variable too." ], "metadata": { "id": "gFOs_z7YGgJ6" } }, { "cell_type": "code", "source": [ "a_string = 'My really cool string'\n", "\n", "print(a_string)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AFu3cEzAGwHx", "outputId": "e110a63e-4fdb-4467-ad79-2331da2d8f6e" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "My really cool string\n" ] } ] }, { "cell_type": "markdown", "source": [ "Pandas uses the `object` dtype ('O') for any strings it is passed." ], "metadata": { "id": "aBokifMD0FhE" } }, { "cell_type": "code", "source": [ "df.Class.dtype" ], "metadata": { "id": "gu58lnGD0M2L", "outputId": "669f7229-1396-48cf-9326-67f14a6a92d0", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "dtype('O')" ] }, "metadata": {}, "execution_count": 30 } ] }, { "cell_type": "markdown", "source": [ "### Boolean" ], "metadata": { "id": "L8nX1RwBGvVM" } }, { "cell_type": "markdown", "source": [ "A Boolean is a logical value, taking only one of two values, `True` or `False`. We can combine Booleans using the usual logical connectives (and, or, not). We can also get a Boolean by doing comparisons."
], "metadata": { "id": "6hXPO8SYG_Sb" } }, { "cell_type": "code", "source": [ "a = True\n", "b = False\n", "\n", "print(a and b) # for plain bools, a & b gives the same result\n", "\n", "print(a or b) # likewise a | b\n", "\n", "print(not a) # not ~a: for a plain Python bool ~True is -2; ~ negates element-wise only on pandas/numpy Booleans" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ukAhD0KOIRaU", "outputId": "4aabf856-cf8e-4c93-802f-acd628510523" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "False\n", "True\n", "False\n" ] } ] }, { "cell_type": "code", "source": [ "print(3 == 4)\n", "\n", "print(5>-2)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "v_VSmnttInMF", "outputId": "e57e6d43-72a2-417b-ecc2-30ade51b38f3" }, "execution_count": null, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "False\n", "True\n" ] } ] }, { "cell_type": "code", "source": [ "bool(0)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "bVi-awoDNc2z", "outputId": "1c54eb19-4c38-4033-8e47-1d61e2b36a81" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "False" ] }, "metadata": {}, "execution_count": 14 } ] }, { "cell_type": "markdown", "source": [ "This may show up when manipulating data! You can ask which rows in your dataset have the class *Iris-setosa*." ], "metadata": { "id": "DXWyq5W_LXaY" } }, { "cell_type": "code", "source": [ "df.Class == 'Iris-setosa'" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oGTG6Lq_Ll98", "outputId": "e06efaa9-092a-402a-973b-721ee576e278" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 True\n", "1 True\n", "2 True\n", "3 True\n", "4 True\n", " ...\n", "145 False\n", "146 False\n", "147 False\n", "148 False\n", "149 False\n", "Name: Class, Length: 150, dtype: bool" ] }, "metadata": {}, "execution_count": 44 } ] }, { "cell_type": "code", "source": [ "(df.Class == 'Iris-setosa').dtype" ], "metadata": { "id": "vHQiojaA0Vam", "outputId": "f760aee7-0871-4ae2-9c41-163ce434922d", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "dtype('bool')" ] }, "metadata": {}, "execution_count": 31 } ] }, { "cell_type": "markdown", "source": [ "Then you can pass that Boolean Series back into the DataFrame, and it will give you only the rows where it was true." ], "metadata": { "id": "CxxBSPL8Lr8E" } }, { "cell_type": "code", "source": [ "df[df.Class == 'Iris-setosa'].head(10) #I've added head to limit the output to 10 entries" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 363 }, "id": "z9KY0TT_Lwu8", "outputId": "dd63afec-7563-40e9-ea7b-5c844df0ed9e" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " SepalLength SepalWidth PedalLength PedalWidth Class\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa\n", "5 5.4 3.9 1.7 0.4 Iris-setosa\n", "6 4.6 3.4 1.4 0.3 Iris-setosa\n", "7 5.0 3.4 1.5 0.2 Iris-setosa\n", "8 4.4 2.9 1.4 0.2 Iris-setosa\n", "9 4.9 3.1 1.5 0.1 Iris-setosa" ] }, "metadata": {}, "execution_count": 20 } ] }, { "cell_type": "markdown", "source": [ "If we want to combine several Boolean Series, use `&` for *and* and `|` for *or*. Wrap each comparison in parentheses, because `&` and `|` bind more tightly than `==` and `>`.\n" ], "metadata": { "id": "hSXI0RyMwWQ5" } }, { "cell_type": "code", "source": [ "df[(df.Class == 'Iris-setosa') & (df.SepalLength > 5.2)]" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 394 }, "id": "MgIbhKDpwgjY", "outputId": "4ec1eecf-39c3-4aa5-ba25-49d60a7b878c" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " SepalLength SepalWidth PedalLength PedalWidth Class\n", "5 5.4 3.9 1.7 0.4 Iris-setosa\n", "10 5.4 3.7 1.5 0.2 Iris-setosa\n", "14 5.8 4.0 1.2 0.2 Iris-setosa\n", "15 5.7 4.4 1.5 0.4 Iris-setosa\n", "16 5.4 3.9 1.3 0.4 Iris-setosa\n", "18 5.7 3.8 1.7 0.3 Iris-setosa\n", "20 5.4 3.4 1.7 0.2 Iris-setosa\n", "31 5.4 3.4 1.5 0.4 Iris-setosa\n", "33 5.5 4.2 1.4 0.2 Iris-setosa\n", "36 5.5 3.5 1.3 0.2 Iris-setosa\n", "48 5.3 3.7 1.5 0.2 Iris-setosa" ] }, "metadata": {}, "execution_count": 21 } ] }, { "cell_type": "markdown", "source": [ "### Integers" ], "metadata": { "id": "XhmvO1dhMSnY" } }, { "cell_type": "markdown", "source": [ "Integers are whole numbers that can be positive, negative or zero. Integers are closed under addition, subtraction and multiplication (but not division). Using integers can save some memory, so if your values are whole numbers, store them as integers.\n", "\n", "Some examples of integers are customer numbers and counts of objects. The function to convert a value to an integer is `int`." ], "metadata": { "id": "su1hq2d2MVNg" } }, { "cell_type": "code", "source": [ "int(-3.0000)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "lmxUCvHKNSc4", "outputId": "bf533867-1503-4b0c-bacc-1be006d22b86" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "-3" ] }, "metadata": {}, "execution_count": 46 } ] }, { "cell_type": "markdown", "source": [ "### Floats" ], "metadata": { "id": "xCATTI0NOHVS" } }, { "cell_type": "markdown", "source": [ "A float is a real number stored with limited precision (64 bits in pandas, which is roughly 15 to 17 significant decimal digits). Be wary of the last few digits, and of more of them after many computations, because rounding errors accumulate.\n", "\n", "In the `iris` dataset most columns are floats."
], "metadata": { "id": "jqaiIV4QOK5T" } }, { "cell_type": "code", "source": [ "df.dtypes" ], "metadata": { "id": "_qPrgL2EUHQ7", "outputId": "7e6116b0-aeb8-4503-a1f8-2caa437b9b33", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "SepalLength float64\n", "SepalWidth float64\n", "PedalLength float64\n", "PedalWidth float64\n", "Class object\n", "dtype: object" ] }, "metadata": {}, "execution_count": 47 } ] }, { "cell_type": "markdown", "source": [ "### Datetime" ], "metadata": { "id": "TLlYE_zcPSg2" } }, { "cell_type": "markdown", "source": [ "These are dates and times, and they let you compute differences easily! I'll grab a dataset with some dates in it. Pandas did not recognize the dates automatically, so I had to convert the column." ], "metadata": { "id": "mxgXThX_PVYS" } }, { "cell_type": "code", "source": [ "df2 = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/Landslides_From_NASA.csv')\n", "ds = pa.to_datetime(df2.event_date) # parses the strings; astype('datetime64') with no unit is rejected by newer pandas versions\n", "\n", "ds" ], "metadata": { "id": "X2zRNVSrLmBm", "outputId": "0f2f9098-fa66-4a2a-e24d-9018505f8eba", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 2008-08-01 00:00:00\n", "1 2009-01-02 02:00:00\n", "2 2007-01-19 00:00:00\n", "3 2009-07-31 00:00:00\n", "4 2010-10-16 12:00:00\n", " ...\n", "11028 2017-04-01 13:34:00\n", "11029 2017-03-25 17:32:00\n", "11030 2016-12-15 05:00:00\n", "11031 2017-04-29 19:03:00\n", "11032 2017-03-13 14:32:00\n", "Name: event_date, Length: 11033, dtype: datetime64[ns]" ] }, "metadata": {}, "execution_count": 10 } ] }, { "cell_type": "code", "source": [ "ds[1]-ds[0]" ], "metadata": { "id": "0gn37nUbMfFr", "outputId": "787731e5-4874-4e36-cadd-8fb36f6f7cd3", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Timedelta('154 days 02:00:00')" ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "markdown", "source": [ "The `Timedelta` is itself another data structure!" ], "metadata": { "id": "t5YERsL6MpYh" } }, { "cell_type": "code", "source": [ "(ds[1]-ds[0]).total_seconds()" ], "metadata": { "id": "FqFtI5osM2qn", "outputId": "28b0afb8-d2a7-4c3a-80a6-248d0e724ba7", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "13312800.0" ] }, "metadata": {}, "execution_count": 13 } ] }, { "cell_type": "markdown", "source": [ "This gives us the elapsed time in seconds (154 days and 2 hours here). I'm certain there are other things you could do with a `Timedelta`. When you need them, you'll have to explore!" ], "metadata": { "id": "0jwx-HKuN7ij" } }, { "cell_type": "markdown", "source": [ "### Category" ], "metadata": { "id": "jJAVX4chPz7j" } }, { "cell_type": "markdown", "source": [ "I'll convert Class into a category inside pandas by converting the Class Series to the category type and assigning it back to the DataFrame."
], "metadata": { "id": "BsMEdTRX0fMl" } }, { "cell_type": "code", "source": [ "df.Class = df.Class.astype('category')\n", "\n", "df.dtypes" ], "metadata": { "id": "4-C-716NUVp_", "outputId": "eab8c49d-fe94-4c60-b7f9-7813e0b0591e", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "SepalLength float64\n", "SepalWidth float64\n", "PedalLength float64\n", "PedalWidth float64\n", "Class category\n", "dtype: object" ] }, "metadata": {}, "execution_count": 48 } ] }, { "cell_type": "markdown", "source": [ "You can list each category with the `unique` method." ], "metadata": { "id": "gDBZaIt_00K1" } }, { "cell_type": "code", "source": [ "df.Class.unique()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "OpuHwR54yWD8", "outputId": "19e72bcd-19b1-4ed1-fd7c-2a313f3a4179" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)" ] }, "metadata": {}, "execution_count": 22 } ] }, { "cell_type": "markdown", "source": [ "Category is a structure that exists only in pandas. It lets you put an order on ordinal categorical variables." ], "metadata": { "id": "qhTI2dMbP8k1" } }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "3YzlpyBF1bxb", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "de660b33-e131-4734-cafb-27d9442e6260" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "CategoricalDtype(categories=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], ordered=True)" ] }, "metadata": {}, "execution_count": 25 } ], "source": [ "from pandas.api.types import CategoricalDtype\n", "\n", "\n", "cat_type = CategoricalDtype(categories=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], ordered=True)\n", "\n", "dfClass = df.Class.astype(cat_type)\n", "\n", "dfClass.dtype" ] }, { "cell_type": "markdown", "source": [ "This may be nice for many categorical and ordinal variables! The example is adapted from the [pandas user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categoricaldtype)." ], "metadata": { "id": "-uevwluhRjTd" } }, { "cell_type": "markdown", "source": [ "You might be asking why I would order a categorical variable (especially since the example above has no natural order).\n", "\n", "Some things do have an order! Monday, Tuesday, Wednesday, etc. GED, high school graduate, Associate's, Bachelor's, Master's, etc.\n", "\n", "Some things have an order that is not always well defined: Single, Married, Divorced, Widowed?\n", "\n", "Some things have a circular order: Winter, Spring, Summer, Fall, Winter, etc.\n", "\n", "Be careful and use order correctly!" ], "metadata": { "id": "SynrFbgivglH" } }, { "cell_type": "markdown", "source": [ "## Your Turn" ], "metadata": { "id": "njEPBHzKT6bH" } }, { "cell_type": "markdown", "source": [ "Use the `banks` dataset, retrieved from the [UCI Machine Learning Repo](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) and also accessible [here](https://raw.githubusercontent.com/nurfnick/Data_Viz/main/bank.csv), to explore the following questions:\n", "\n", "1. What is the data type of the first column? Should you change it, or is that the best data type?\n", "2. The second column is *job*. Could you assign an order to this string? Could you assign an order to the fourth column, *education*? If yes, what order would you assign? Do so.\n", "3. Could any of the columns have been a Boolean? Find one and create a new column in the DataFrame that is just Boolean and has a descriptive name.\n", "4. Name one more interesting fact about this dataset and its data types." ], "metadata": { "id": "m8lLbPwPT809" } }, { "cell_type": "code", "source": [ "" ], "metadata": { "id": "y3J890JCRdII" }, "execution_count": null, "outputs": [] } ] }