{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Untitled67.ipynb", "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "# Pandas for Data Analysis" ], "metadata": { "id": "gzNvteiezgJW" } }, { "cell_type": "code", "source": [ "import pandas as pd" ], "metadata": { "id": "w0H09dgLz6A2" }, "execution_count": 1, "outputs": [] }, { "cell_type": "markdown", "source": [ "I think rather than getting busy with another scripting langauge, we should buckle down and try to get some `pandas` essentials under our belts. My goal here is to not load any other library, think I can succeed?\n", "\n", "Here is the wine dataset again. I've added the \"nice\" headers." ], "metadata": { "id": "deXQNMLjkhSY" } }, { "cell_type": "code", "source": [ "df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',header = None)\n", "head = ['Class','Alcohol','MalicAcid','Ash','AlcalinityAsh','Magnesium','Phenols','Flavanoids','NonflavanoidPhenols','Proanthocyanins','ColorIntensity','Hue','OD280/OD315','Proline']\n", "#https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names more info on the data file than you could ever use!\n", "df.columns = head\n", "df.head()" ], "metadata": { "id": "njbnZ_nz6su0", "outputId": "9036f15d-968c-4967-b116-5cbbd90d8b27", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "execution_count": 2, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Class Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols \\\n", "0 1 14.23 1.71 2.43 15.6 127 2.80 \n", "1 1 13.20 1.78 2.14 11.2 100 2.65 \n", "2 1 13.16 2.36 2.67 18.6 101 2.80 \n", "3 1 14.37 1.95 2.50 16.8 113 3.85 \n", "4 1 13.24 2.59 2.87 21.0 118 2.80 \n", "\n", " Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue \\\n", "0 3.06 0.28 2.29 5.64 1.04 \n", "1 2.76 0.26 1.28 4.38 1.05 \n", "2 3.24 0.30 2.81 5.68 1.03 \n", "3 3.49 0.24 2.18 7.80 0.86 \n", "4 2.69 0.39 1.82 4.32 1.04 \n", "\n", " OD280/OD315 Proline \n", "0 3.92 1065 \n", "1 3.40 1050 \n", "2 3.17 1185 \n", "3 3.45 1480 \n", "4 2.93 735 " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ClassAlcoholMalicAcidAshAlcalinityAshMagnesiumPhenolsFlavanoidsNonflavanoidPhenolsProanthocyaninsColorIntensityHueOD280/OD315Proline
0114.231.712.4315.61272.803.060.282.295.641.043.921065
1113.201.782.1411.21002.652.760.261.284.381.053.401050
2113.162.362.6718.61012.803.240.302.815.681.033.171185
3114.371.952.5016.81133.853.490.242.187.800.863.451480
4113.242.592.8721.01182.802.690.391.824.321.042.93735
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 2 } ] }, { "cell_type": "markdown", "source": [ "Why do I keep calling my dataset `df`, TRADITION! No really you might consider using something more descriptive. You should note that I demanded that the columns have descriptive names because it makes the rest easier but was not nesseccary!" ], "metadata": { "id": "w0qQdfh16wPK" } }, { "cell_type": "code", "source": [ "wineData = df" ], "metadata": { "id": "MBjcvyDkdvDe" }, "execution_count": 3, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Basics About Your Data" ], "metadata": { "id": "b0dNE88flejI" } }, { "cell_type": "markdown", "source": [ "The first is just the column names but sometimes it can be nice to have that as a list" ], "metadata": { "id": "K79voOvolkiv" } }, { "cell_type": "code", "source": [ "wineData.columns" ], "metadata": { "id": "huPxWYY64S7k", "outputId": "2e799853-40a2-4445-e797-b3df5d2f7649", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 5, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['Class', 'Alcohol', 'MalicAcid', 'Ash', 'AlcalinityAsh', 'Magnesium',\n", " 'Phenols', 'Flavanoids', 'NonflavanoidPhenols', 'Proanthocyanins',\n", " 'ColorIntensity', 'Hue', 'OD280/OD315', 'Proline'],\n", " dtype='object')" ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "markdown", "source": [ "Size and shape are two excellent resources. How many entries and how many rows and columns" ], "metadata": { "id": "vZe6SOs1_4ev" } }, { "cell_type": "code", "source": [ "print('Size:' ,wineData.size)\n", "print('Shape', wineData.shape)\n", "\n", "178*14" ], "metadata": { "id": "QruZFQdtBJ4I", "outputId": "746ad920-c74f-4c85-9e6c-370c90a6ac2d", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 6, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Size: 2492\n", "Shape (178, 14)\n" ] }, { "output_type": "execute_result", "data": { "text/plain": [ "2492" ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "markdown", "source": [ "Data types will be important as we clean our data." ], "metadata": { "id": "KHsYit96A6sy" } }, { "cell_type": "code", "source": [ "wineData.info()" ], "metadata": { "id": "ufa5YzItBC-w", "outputId": "3e0b2ca5-efb5-4ebf-beee-03b8b77135ba", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 8, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "\n", "RangeIndex: 178 entries, 0 to 177\n", "Data columns (total 14 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Class 178 non-null int64 \n", " 1 Alcohol 178 non-null float64\n", " 2 MalicAcid 178 non-null float64\n", " 3 Ash 178 non-null float64\n", " 4 AlcalinityAsh 178 non-null float64\n", " 5 Magnesium 178 non-null int64 \n", " 6 Phenols 178 non-null float64\n", " 7 Flavanoids 178 non-null float64\n", " 8 NonflavanoidPhenols 178 non-null float64\n", " 9 Proanthocyanins 178 non-null float64\n", " 10 ColorIntensity 178 non-null float64\n", " 11 Hue 178 non-null float64\n", " 12 OD280/OD315 178 non-null float64\n", " 13 Proline 178 non-null int64 \n", "dtypes: float64(11), int64(3)\n", "memory usage: 19.6 KB\n" ] } ] }, { "cell_type": "markdown", "source": [ "Similarly you can ask how many different values are in each column, this command is short for \"number unique\"" ], "metadata": { "id": "BrnKkfEpCLcO" } }, { "cell_type": "code", "source": [ "wineData.nunique()" ], "metadata": { "id": "c5pF3Rxu8pg2", "outputId": "dc5cc174-8c50-44fd-a5db-7affdae012b0", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 9, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Class 3\n", "Alcohol 126\n", "MalicAcid 133\n", "Ash 79\n", "AlcalinityAsh 63\n", "Magnesium 53\n", "Phenols 97\n", "Flavanoids 132\n", "NonflavanoidPhenols 39\n", "Proanthocyanins 101\n", "ColorIntensity 132\n", "Hue 78\n", "OD280/OD315 122\n", "Proline 121\n", "dtype: int64" ] }, "metadata": {}, "execution_count": 9 } ] }, { "cell_type": "markdown", "source": [ "We saw this one last time and may see it again soon!" ], "metadata": { "id": "oiKNePOvCb8O" } }, { "cell_type": "code", "source": [ "wineData.describe()" ], "metadata": { "id": "yOMkCqttB8k7", "outputId": "e562ef02-1234-486c-e680-b8029ed9430b", "colab": { "base_uri": "https://localhost:8080/", "height": 300 } }, "execution_count": 11, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Class Alcohol MalicAcid Ash AlcalinityAsh \\\n", "count 178.000000 178.000000 178.000000 178.000000 178.000000 \n", "mean 1.938202 13.000618 2.336348 2.366517 19.494944 \n", "std 0.775035 0.811827 1.117146 0.274344 3.339564 \n", "min 1.000000 11.030000 0.740000 1.360000 10.600000 \n", "25% 1.000000 12.362500 1.602500 2.210000 17.200000 \n", "50% 2.000000 13.050000 1.865000 2.360000 19.500000 \n", "75% 3.000000 13.677500 3.082500 2.557500 21.500000 \n", "max 3.000000 14.830000 5.800000 3.230000 30.000000 \n", "\n", " Magnesium Phenols Flavanoids NonflavanoidPhenols \\\n", "count 178.000000 178.000000 178.000000 178.000000 \n", "mean 99.741573 2.295112 2.029270 0.361854 \n", "std 14.282484 0.625851 0.998859 0.124453 \n", "min 70.000000 0.980000 0.340000 0.130000 \n", "25% 88.000000 1.742500 1.205000 0.270000 \n", "50% 98.000000 2.355000 2.135000 0.340000 \n", "75% 107.000000 2.800000 2.875000 0.437500 \n", "max 162.000000 3.880000 5.080000 0.660000 \n", "\n", " Proanthocyanins ColorIntensity Hue OD280/OD315 Proline \n", "count 178.000000 178.000000 178.000000 178.000000 178.000000 \n", "mean 1.590899 5.058090 0.957449 2.611685 746.893258 \n", "std 0.572359 2.318286 0.228572 0.709990 314.907474 \n", "min 0.410000 1.280000 0.480000 1.270000 278.000000 \n", "25% 1.250000 3.220000 0.782500 1.937500 500.500000 \n", "50% 1.555000 4.690000 0.965000 2.780000 673.500000 \n", "75% 1.950000 6.200000 1.120000 3.170000 985.000000 \n", "max 3.580000 13.000000 1.710000 4.000000 1680.000000 " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ClassAlcoholMalicAcidAshAlcalinityAshMagnesiumPhenolsFlavanoidsNonflavanoidPhenolsProanthocyaninsColorIntensityHueOD280/OD315Proline
count178.000000178.000000178.000000178.000000178.000000178.000000178.000000178.000000178.000000178.000000178.000000178.000000178.000000178.000000
mean1.93820213.0006182.3363482.36651719.49494499.7415732.2951122.0292700.3618541.5908995.0580900.9574492.611685746.893258
std0.7750350.8118271.1171460.2743443.33956414.2824840.6258510.9988590.1244530.5723592.3182860.2285720.709990314.907474
min1.00000011.0300000.7400001.36000010.60000070.0000000.9800000.3400000.1300000.4100001.2800000.4800001.270000278.000000
25%1.00000012.3625001.6025002.21000017.20000088.0000001.7425001.2050000.2700001.2500003.2200000.7825001.937500500.500000
50%2.00000013.0500001.8650002.36000019.50000098.0000002.3550002.1350000.3400001.5550004.6900000.9650002.780000673.500000
75%3.00000013.6775003.0825002.55750021.500000107.0000002.8000002.8750000.4375001.9500006.2000001.1200003.170000985.000000
max3.00000014.8300005.8000003.23000030.000000162.0000003.8800005.0800000.6600003.58000013.0000001.7100004.0000001680.000000
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "markdown", "source": [ "## Acsessing The Data" ], "metadata": { "id": "TBfnnF-8m9e7" } }, { "cell_type": "markdown", "source": [ "The above commands were on the entire dataset but sometimes we might want to consider only a subset. Let's see some tools for that!\n", "\n", "The most straight forward is by column name although it does not work if you have spaces in column names (don't do that!)" ], "metadata": { "id": "73QrYnWpnBOv" } }, { "cell_type": "code", "source": [ "wineData.Alcohol" ], "metadata": { "id": "SL-y6l31C6iu", "outputId": "f2f80d31-e529-4fec-d1dd-d57694ee4807", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 13, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 14.23\n", "1 13.20\n", "2 13.16\n", "3 14.37\n", "4 13.24\n", " ... \n", "173 13.71\n", "174 13.40\n", "175 13.27\n", "176 13.17\n", "177 14.13\n", "Name: Alcohol, Length: 178, dtype: float64" ] }, "metadata": {}, "execution_count": 13 } ] }, { "cell_type": "markdown", "source": [ "Or" ], "metadata": { "id": "xjuUJtEjndWj" } }, { "cell_type": "code", "source": [ "wineData['Alcohol']" ], "metadata": { "id": "UcqfYM5OC-Hs", "outputId": "2d3731e9-52b3-4593-9c2a-722a40c9e641", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 15, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 14.23\n", "1 13.20\n", "2 13.16\n", "3 14.37\n", "4 13.24\n", " ... \n", "173 13.71\n", "174 13.40\n", "175 13.27\n", "176 13.17\n", "177 14.13\n", "Name: Alcohol, Length: 178, dtype: float64" ] }, "metadata": {}, "execution_count": 15 } ] }, { "cell_type": "markdown", "source": [ "Or if you want a dataframe still" ], "metadata": { "id": "EoOTWjUOnkpS" } }, { "cell_type": "code", "source": [ "wineData[['Alcohol']]" ], "metadata": { "id": "P-QqotIKDElB", "outputId": "d69dc2e5-a039-4194-ae77-148e045a2382", "colab": { "base_uri": "https://localhost:8080/", "height": 424 } }, "execution_count": 17, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Alcohol\n", "0 14.23\n", "1 13.20\n", "2 13.16\n", "3 14.37\n", "4 13.24\n", ".. ...\n", "173 13.71\n", "174 13.40\n", "175 13.27\n", "176 13.17\n", "177 14.13\n", "\n", "[178 rows x 1 columns]" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Alcohol
014.23
113.20
213.16
314.37
413.24
......
17313.71
17413.40
17513.27
17613.17
17714.13
\n", "

178 rows × 1 columns

\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 17 } ] }, { "cell_type": "markdown", "source": [ "If you need multiple columns" ], "metadata": { "id": "c4UOEMfKnuhh" } }, { "cell_type": "code", "source": [ "wineData[['Alcohol','Ash']]" ], "metadata": { "id": "Mrh3gv0HDsvN", "outputId": "4d145cfa-832c-4d2f-f18f-386af5679a40", "colab": { "base_uri": "https://localhost:8080/", "height": 424 } }, "execution_count": 19, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Alcohol Ash\n", "0 14.23 2.43\n", "1 13.20 2.14\n", "2 13.16 2.67\n", "3 14.37 2.50\n", "4 13.24 2.87\n", ".. ... ...\n", "173 13.71 2.45\n", "174 13.40 2.48\n", "175 13.27 2.26\n", "176 13.17 2.37\n", "177 14.13 2.74\n", "\n", "[178 rows x 2 columns]" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlcoholAsh
014.232.43
113.202.14
213.162.67
314.372.50
413.242.87
.........
17313.712.45
17413.402.48
17513.272.26
17613.172.37
17714.132.74
\n", "

178 rows × 2 columns

\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 19 } ] }, { "cell_type": "markdown", "source": [ "Let's get a little more exotic use `loc` You give it the row and the column. There are some great tricks here." ], "metadata": { "id": "oHKDpprWFAhQ" } }, { "cell_type": "code", "source": [ "wineData.loc[:,'Ash']#all\n", "wineData.loc[1:3,'Ash']#1,2,3\n", "wineData.loc[[1,5,7],'Ash']#1,5,7" ], "metadata": { "id": "ehUx6J_TD0UE", "outputId": "5ae4460f-e2e3-40f8-b0fe-8a64e51af00b", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 21, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "1 2.14\n", "5 2.45\n", "7 2.61\n", "Name: Ash, dtype: float64" ] }, "metadata": {}, "execution_count": 21 } ] }, { "cell_type": "markdown", "source": [ "You may want to try some of those!\n", "\n", "There is also `iloc`. It is similar but you have to use columns by number not name." ], "metadata": { "id": "Orkkj2jRo2X6" } }, { "cell_type": "code", "source": [ "wineData.iloc[:-10,3]#last 10 entries still 'Ash'" ], "metadata": { "id": "xf9VWZ9BpFE2", "outputId": "af22b082-1bde-4e2f-e153-82869c7e7177", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": 22, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 2.43\n", "1 2.14\n", "2 2.67\n", "3 2.50\n", "4 2.87\n", " ... \n", "163 2.35\n", "164 2.30\n", "165 2.26\n", "166 2.60\n", "167 2.30\n", "Name: Ash, Length: 168, dtype: float64" ] }, "metadata": {}, "execution_count": 22 } ] }, { "cell_type": "markdown", "source": [ "Lastly, what if we want data by a condition? `loc` will work for this. Here we get all the `Class 1` wines." ], "metadata": { "id": "VmMhpMKZIYMu" } }, { "cell_type": "code", "source": [ "wineData.loc[wineData.Class==1].head()" ], "metadata": { "id": "rYRPbAncHpc8", "outputId": "a4e1fd3b-c23b-4257-804d-125d7a16747a", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "execution_count": 23, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Class Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols \\\n", "0 1 14.23 1.71 2.43 15.6 127 2.80 \n", "1 1 13.20 1.78 2.14 11.2 100 2.65 \n", "2 1 13.16 2.36 2.67 18.6 101 2.80 \n", "3 1 14.37 1.95 2.50 16.8 113 3.85 \n", "4 1 13.24 2.59 2.87 21.0 118 2.80 \n", "\n", " Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue \\\n", "0 3.06 0.28 2.29 5.64 1.04 \n", "1 2.76 0.26 1.28 4.38 1.05 \n", "2 3.24 0.30 2.81 5.68 1.03 \n", "3 3.49 0.24 2.18 7.80 0.86 \n", "4 2.69 0.39 1.82 4.32 1.04 \n", "\n", " OD280/OD315 Proline \n", "0 3.92 1065 \n", "1 3.40 1050 \n", "2 3.17 1185 \n", "3 3.45 1480 \n", "4 2.93 735 " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ClassAlcoholMalicAcidAshAlcalinityAshMagnesiumPhenolsFlavanoidsNonflavanoidPhenolsProanthocyaninsColorIntensityHueOD280/OD315Proline
0114.231.712.4315.61272.803.060.282.295.641.043.921065
1113.201.782.1411.21002.652.760.261.284.381.053.401050
2113.162.362.6718.61012.803.240.302.815.681.033.171185
3114.371.952.5016.81133.853.490.242.187.800.863.451480
4113.242.592.8721.01182.802.690.391.824.321.042.93735
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 23 } ] }, { "cell_type": "markdown", "source": [ "But I like to avoid nesting if I can. I suggest we use the `query` command." ], "metadata": { "id": "HZ0cSZXBpsM_" } }, { "cell_type": "code", "source": [ "wineData.query('Class ==1').head()#head is just here to limit the output" ], "metadata": { "id": "S9EqDO_6MYNp", "outputId": "373746d5-f808-4c03-ef44-3ebdbbd53d07", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ClassAlcoholMalicAcidAshAlcalinityAshMagnesiumPhenolsFlavanoidsNonflavanoidPhenolsProanthocyaninsColorIntensityHueOD280/OD315Proline
0114.231.712.4315.61272.803.060.282.295.641.043.921065
1113.201.782.1411.21002.652.760.261.284.381.053.401050
2113.162.362.6718.61012.803.240.302.815.681.033.171185
3114.371.952.5016.81133.853.490.242.187.800.863.451480
4113.242.592.8721.01182.802.690.391.824.321.042.93735
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " Class Alcohol MalicAcid Ash ... ColorIntensity Hue OD280/OD315 Proline\n", "0 1 14.23 1.71 2.43 ... 5.64 1.04 3.92 1065\n", "1 1 13.20 1.78 2.14 ... 4.38 1.05 3.40 1050\n", "2 1 13.16 2.36 2.67 ... 5.68 1.03 3.17 1185\n", "3 1 14.37 1.95 2.50 ... 7.80 0.86 3.45 1480\n", "4 1 13.24 2.59 2.87 ... 4.32 1.04 2.93 735\n", "\n", "[5 rows x 14 columns]" ] }, "metadata": {}, "execution_count": 16 } ] }, { "cell_type": "markdown", "source": [ "`query` will allow me to combine multiple statements!" ], "metadata": { "id": "tx2D5w2ZRgyD" } }, { "cell_type": "code", "source": [ "wineData.query('Class ==1 and Ash >2.8')" ], "metadata": { "id": "drxx7ZALqaxn", "outputId": "e6c718c3-a473-4abb-abc6-c2baf5521ce2", "colab": { "base_uri": "https://localhost:8080/", "height": 143 } }, "execution_count": 25, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Class Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols \\\n", "4 1 13.24 2.59 2.87 21.0 118 2.80 \n", "25 1 13.05 2.05 3.22 25.0 124 2.63 \n", "36 1 13.28 1.64 2.84 15.5 110 2.60 \n", "\n", " Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue \\\n", "4 2.69 0.39 1.82 4.32 1.04 \n", "25 2.68 0.47 1.92 3.58 1.13 \n", "36 2.68 0.34 1.36 4.60 1.09 \n", "\n", " OD280/OD315 Proline \n", "4 2.93 735 \n", "25 3.20 830 \n", "36 2.78 880 " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
ClassAlcoholMalicAcidAshAlcalinityAshMagnesiumPhenolsFlavanoidsNonflavanoidPhenolsProanthocyaninsColorIntensityHueOD280/OD315Proline
4113.242.592.8721.01182.802.690.391.824.321.042.93735
25113.052.053.2225.01242.632.680.471.923.581.133.20830
36113.281.642.8415.51102.602.680.341.364.601.092.78880
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 25 } ] }, { "cell_type": "markdown", "source": [ "## More Advanced Analysis" ], "metadata": { "id": "_qTDicjvqyqF" } }, { "cell_type": "markdown", "source": [ "The melt command is very powerful and does some nifty things to large datasets quickly!" ], "metadata": { "id": "lg4Sem0PRxBI" } }, { "cell_type": "code", "source": [ "df_melt = wineData.melt(id_vars = 'Class',\n", " value_vars = ['Flavanoids','Hue'],\n", " var_name = 'colmuns')\n", "df_melt.tail()" ], "metadata": { "id": "MQTXu5heQgCg", "outputId": "8a41f5c4-c7f0-4101-9a53-b9d80f994c61", "colab": { "base_uri": "https://localhost:8080/", "height": 206 } }, "execution_count": 26, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Class colmuns value\n", "351 3 Hue 0.64\n", "352 3 Hue 0.70\n", "353 3 Hue 0.59\n", "354 3 Hue 0.60\n", "355 3 Hue 0.61" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Classcolmunsvalue
3513Hue0.64
3523Hue0.70
3533Hue0.59
3543Hue0.60
3553Hue0.61
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 26 } ] }, { "cell_type": "markdown", "source": [ "This is the same data as before but now *Flavanoids* and *Hue* are in a column and there values became another column. Why would this help us? Some time this is referred to as longer data and gives us a way graph. " ], "metadata": { "id": "lxvRhg3FrRdF" } }, { "cell_type": "markdown", "source": [ "Next is `pivot`. It would work best on a dataset with multiple categorical variables. It would take a long table and return a wide table. I cannot use it here without just undoing what I did above. " ], "metadata": { "id": "BA8MpcxPr_mN" } }, { "cell_type": "code", "source": [ "df_melt.pivot(columns = 'colmuns', values = 'value')" ], "metadata": { "id": "7Ht8u2X2Q9_o", "outputId": "6ae66309-88b1-42c7-b83d-10bf7d668f2c", "colab": { "base_uri": "https://localhost:8080/", "height": 424 } }, "execution_count": 27, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "colmuns Flavanoids Hue\n", "0 3.06 NaN\n", "1 2.76 NaN\n", "2 3.24 NaN\n", "3 3.49 NaN\n", "4 2.69 NaN\n", ".. ... ...\n", "351 NaN 0.64\n", "352 NaN 0.70\n", "353 NaN 0.59\n", "354 NaN 0.60\n", "355 NaN 0.61\n", "\n", "[356 rows x 2 columns]" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
colmunsFlavanoidsHue
03.06NaN
12.76NaN
23.24NaN
33.49NaN
42.69NaN
.........
351NaN0.64
352NaN0.70
353NaN0.59
354NaN0.60
355NaN0.61
\n", "

356 rows × 2 columns

\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 27 } ] }, { "cell_type": "markdown", "source": [ "I actually couldn't undo that because it was not clear which value for *Flavanoids* went with which *Hue*. If I would have told, I would have give it `index = 'Index'` that would allow me to match the data back up. It is important not to lose data while doing analysis. You should always be able to reproduce your results but here we could not return!" ], "metadata": { "id": "wxIk5TNvR-iJ" } }, { "cell_type": "markdown", "source": [ "`groupby` is excellent! You need a statistic with it too." ], "metadata": { "id": "FGBGV3n5tvqo" } }, { "cell_type": "code", "source": [ "wineData.groupby(by = 'Class').mean()" ], "metadata": { "id": "oNm3GmdTRS2I", "outputId": "a2e84185-cdc9-4fe5-bad0-13f30d84a702", "colab": { "base_uri": "https://localhost:8080/", "height": 175 } }, "execution_count": 28, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols \\\n", "Class \n", "1 13.744746 2.010678 2.455593 17.037288 106.338983 2.840169 \n", "2 12.278732 1.932676 2.244789 20.238028 94.549296 2.258873 \n", "3 13.153750 3.333750 2.437083 21.416667 99.312500 1.678750 \n", "\n", " Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity \\\n", "Class \n", "1 2.982373 0.290000 1.899322 5.528305 \n", "2 2.080845 0.363662 1.630282 3.086620 \n", "3 0.781458 0.447500 1.153542 7.396250 \n", "\n", " Hue OD280/OD315 Proline \n", "Class \n", "1 1.062034 3.157797 1115.711864 \n", "2 1.056282 2.785352 519.507042 \n", "3 0.682708 1.683542 629.895833 " ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlcoholMalicAcidAshAlcalinityAshMagnesiumPhenolsFlavanoidsNonflavanoidPhenolsProanthocyaninsColorIntensityHueOD280/OD315Proline
Class
113.7447462.0106782.45559317.037288106.3389832.8401692.9823730.2900001.8993225.5283051.0620343.1577971115.711864
212.2787321.9326762.24478920.23802894.5492962.2588732.0808450.3636621.6302823.0866201.0562822.785352519.507042
313.1537503.3337502.43708321.41666799.3125001.6787500.7814580.4475001.1535427.3962500.6827081.683542629.895833
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 28 } ] }, { "cell_type": "markdown", "source": [ "You can add to this by using `agg`" ], "metadata": { "id": "rJCpgbxSvPNt" } }, { "cell_type": "code", "source": [ "wineData.groupby(by = 'Class').agg(['mean','median'])" ], "metadata": { "id": "ii45_yXBSP01", "outputId": "ebd48dc7-d378-4071-d45c-2c78e5ad416b", "colab": { "base_uri": "https://localhost:8080/", "height": 236 } }, "execution_count": 30, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " Alcohol MalicAcid Ash AlcalinityAsh \\\n", " mean median mean median mean median mean \n", "Class \n", "1 13.744746 13.750 2.010678 1.770 2.455593 2.44 17.037288 \n", "2 12.278732 12.290 1.932676 1.610 2.244789 2.24 20.238028 \n", "3 13.153750 13.165 3.333750 3.265 2.437083 2.38 21.416667 \n", "\n", " Magnesium ... Proanthocyanins ColorIntensity \\\n", " median mean median ... mean median mean \n", "Class ... \n", "1 16.8 106.338983 104.0 ... 1.899322 1.870 5.528305 \n", "2 20.0 94.549296 88.0 ... 1.630282 1.610 3.086620 \n", "3 21.0 99.312500 97.0 ... 1.153542 1.105 7.396250 \n", "\n", " Hue OD280/OD315 Proline \n", " median mean median mean median mean median \n", "Class \n", "1 5.40 1.062034 1.070 3.157797 3.17 1115.711864 1095.0 \n", "2 2.90 1.056282 1.040 2.785352 2.83 519.507042 495.0 \n", "3 7.55 0.682708 0.665 1.683542 1.66 629.895833 627.5 \n", "\n", "[3 rows x 26 columns]" ], "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
AlcoholMalicAcidAshAlcalinityAshMagnesium...ProanthocyaninsColorIntensityHueOD280/OD315Proline
meanmedianmeanmedianmeanmedianmeanmedianmeanmedian...meanmedianmeanmedianmeanmedianmeanmedianmeanmedian
Class
113.74474613.7502.0106781.7702.4555932.4417.03728816.8106.338983104.0...1.8993221.8705.5283055.401.0620341.0703.1577973.171115.7118641095.0
212.27873212.2901.9326761.6102.2447892.2420.23802820.094.54929688.0...1.6302821.6103.0866202.901.0562821.0402.7853522.83519.507042495.0
313.15375013.1653.3337503.2652.4370832.3821.41666721.099.31250097.0...1.1535421.1057.3962507.550.6827080.6651.6835421.66629.895833627.5
\n", "

3 rows × 26 columns

\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ] }, "metadata": {}, "execution_count": 30 } ] }, { "cell_type": "markdown", "source": [ "So that is a lot! You can see there are many actions you can prefrom so it is time for you to try!" ], "metadata": { "id": "q50nAzPUTSd6" } }, { "cell_type": "markdown", "source": [ "## Your Turn" ], "metadata": { "id": "qiY7TzWWT8QK" } }, { "cell_type": "markdown", "source": [ "1. Add to your previous document containing the [iris](https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv) dataset\n", "2. Get the column names\n", "3. Call a column, find it's mean\n", "4. Find the last row\n", "5. Melt the data keeping two columns of values and the type of flower\n", "6. Select flowers that have pedal length greater than 5 and sepal length less than 7\n", "6. Group the data by type of flower and compute mean and median" ], "metadata": { "id": "K7nBWhroT-pp" } } ] }