{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Untitled82.ipynb",
      "provenance": [],
      "authorship_tag": "ABX9TyObCbQFZ0mfH94sG7NeKnG3",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/nurfnick/Data_Viz/blob/main/Content/Data_Cleaning/15_String_Cleaning.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Strings And Cleaning a DataFrame"
      ],
      "metadata": {
        "id": "dS1ulNfIwrFD"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Lambda"
      ],
      "metadata": {
        "id": "fTid8y890Eht"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "To clean the strings properly in a dataframe, we are going to first need to talk about `lambda` functions.  Lambda functions are different then normal functions in python.  They are anonymous and has not been named.  They are great because you can pass them in other calls and they avoid some of the slowness of for loops.\n",
        "\n",
        "\n",
        "https://realpython.com/python-lambda/"
      ],
      "metadata": {
        "id": "L_Wx5V2LxSfb"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "dyp4KUp-wqhx",
        "outputId": "15439351-f324-485e-9824-dfc2fc0bdba2"
      },
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "<function __main__.<lambda>>"
            ]
          },
          "metadata": {},
          "execution_count": 1
        }
      ],
      "source": [
        "lambda x: x^2"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "This defines a function that will square any entry.  Let's see it in action"
      ],
      "metadata": {
        "id": "hFoNwrbryu61"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "list(map(lambda x: x**2,[1,2,3,4,5]))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "aExg-A9WylIG",
        "outputId": "346bfde0-33d4-4df0-fb6c-1907a539ad7f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[1, 4, 9, 16, 25]"
            ]
          },
          "metadata": {},
          "execution_count": 1
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now we easily could have done this with a regular function and a for loop see the following as an example."
      ],
      "metadata": {
        "id": "gWm01MdkzcrF"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def square(x):\n",
        "  return x**2\n",
        "\n",
        "newlist = []\n",
        "for i in [1,2,3,4,5]:\n",
        "  newlist.append(square(i))\n",
        "\n",
        "newlist"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "iHyI7s5gy5Bu",
        "outputId": "f301fb5f-c950-43b8-9120-881d4e93bdf3"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[1, 4, 9, 16, 25]"
            ]
          },
          "metadata": {},
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "But this will take more memory and will not work on large datasets!"
      ],
      "metadata": {
        "id": "y5crpu8Wz5Yb"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Apply"
      ],
      "metadata": {
        "id": "LezXKUTT0CLj"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "We will use the `apply` function to pass a lambda function to a dataframe.  Let's see one in action!"
      ],
      "metadata": {
        "id": "eD1HCLn60HeQ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import pandas as pa\n",
        "\n",
        "df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')\n",
        "\n",
        "df.head()\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 206
        },
        "id": "Sw68Gt9NzuU1",
        "outputId": "4e3812d0-3d43-4abe-eb05-762c78c881de"
      },
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "   SepalLength  SepalWidth  PedalLength  PedalWidth        Class\n",
              "0          5.1         3.5          1.4         0.2  Iris-setosa\n",
              "1          4.9         3.0          1.4         0.2  Iris-setosa\n",
              "2          4.7         3.2          1.3         0.2  Iris-setosa\n",
              "3          4.6         3.1          1.5         0.2  Iris-setosa\n",
              "4          5.0         3.6          1.4         0.2  Iris-setosa"
            ],
            "text/html": [
              "\n",
              "  <div id=\"df-35dbe008-4509-4704-85a0-c003a3cf4f63\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>SepalLength</th>\n",
              "      <th>SepalWidth</th>\n",
              "      <th>PedalLength</th>\n",
              "      <th>PedalWidth</th>\n",
              "      <th>Class</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>5.1</td>\n",
              "      <td>3.5</td>\n",
              "      <td>1.4</td>\n",
              "      <td>0.2</td>\n",
              "      <td>Iris-setosa</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>4.9</td>\n",
              "      <td>3.0</td>\n",
              "      <td>1.4</td>\n",
              "      <td>0.2</td>\n",
              "      <td>Iris-setosa</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>4.7</td>\n",
              "      <td>3.2</td>\n",
              "      <td>1.3</td>\n",
              "      <td>0.2</td>\n",
              "      <td>Iris-setosa</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>4.6</td>\n",
              "      <td>3.1</td>\n",
              "      <td>1.5</td>\n",
              "      <td>0.2</td>\n",
              "      <td>Iris-setosa</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>5.0</td>\n",
              "      <td>3.6</td>\n",
              "      <td>1.4</td>\n",
              "      <td>0.2</td>\n",
              "      <td>Iris-setosa</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-35dbe008-4509-4704-85a0-c003a3cf4f63')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-35dbe008-4509-4704-85a0-c003a3cf4f63 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-35dbe008-4509-4704-85a0-c003a3cf4f63');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ]
          },
          "metadata": {},
          "execution_count": 1
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's remove the capitalization from each letter."
      ],
      "metadata": {
        "id": "b380iJLK00Oa"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "df.Class.apply(lambda s: s.lower())"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "O_bHLmbL0iqD",
        "outputId": "b1fbaa62-2f11-4150-ed68-91d6a9d0c0f6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0         iris-setosa\n",
              "1         iris-setosa\n",
              "2         iris-setosa\n",
              "3         iris-setosa\n",
              "4         iris-setosa\n",
              "            ...      \n",
              "145    iris-virginica\n",
              "146    iris-virginica\n",
              "147    iris-virginica\n",
              "148    iris-virginica\n",
              "149    iris-virginica\n",
              "Name: Class, Length: 150, dtype: object"
            ]
          },
          "metadata": {},
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Actually the word iris is also redundent so let's strip all that away too!  We'll capitalize as well to keep our data looking sharp!"
      ],
      "metadata": {
        "id": "zqYJD4Bg06J_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "df.Class.apply(lambda s: s[5:].capitalize())"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "5hEOgNiy0yK5",
        "outputId": "1cdb15a5-9608-4b38-cc62-d9558d82c93c"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0         Setosa\n",
              "1         Setosa\n",
              "2         Setosa\n",
              "3         Setosa\n",
              "4         Setosa\n",
              "         ...    \n",
              "145    Virginica\n",
              "146    Virginica\n",
              "147    Virginica\n",
              "148    Virginica\n",
              "149    Virginica\n",
              "Name: Class, Length: 150, dtype: object"
            ]
          },
          "metadata": {},
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Patterns and Regular Expressions"
      ],
      "metadata": {
        "id": "4mJsuUsd2cB9"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Often when I am dealing with strings I get annoying extras hanging around.  This happens a lot when I scrape data from the web.  Let's gather an example."
      ],
      "metadata": {
        "id": "aG7gwCCa2ee2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import requests\n",
        "import pandas as pa\n",
        "from bs4 import BeautifulSoup\n",
        "\n",
        "\n",
        "r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')\n",
        "html_contents = r.text\n",
        "html_soup = BeautifulSoup(html_contents,\"lxml\")\n",
        "tables = html_soup.find_all('table',class_=\"wikitable\")\n",
        "\n",
        "df1 = pa.read_html(str(tables))[0]\n",
        "df1.columns = df1.columns.droplevel(0).droplevel(0)\n",
        "df1.head()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 372
        },
        "id": "5lY05_tR2sji",
        "outputId": "e2278a12-91ba-44d7-d147-9c991bf6de49"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/html": [
              "\n",
              "  <div id=\"df-95e7bfa8-69e8-4d40-aaa8-935349e3c698\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Rank[dp 1]</th>\n",
              "      <th>Mountain name(s)</th>\n",
              "      <th>m</th>\n",
              "      <th>ft</th>\n",
              "      <th>m</th>\n",
              "      <th>ft</th>\n",
              "      <th>Range</th>\n",
              "      <th>Coordinates[dp 4]</th>\n",
              "      <th>Parent mountain[dp 5]</th>\n",
              "      <th>1st</th>\n",
              "      <th>y</th>\n",
              "      <th>n</th>\n",
              "      <th>Country (disputed claims in italics)</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>1</td>\n",
              "      <td>.mw-parser-output ul.cslist,.mw-parser-output ...</td>\n",
              "      <td>8848</td>\n",
              "      <td>29,029[dp 7]</td>\n",
              "      <td>8848</td>\n",
              "      <td>29029</td>\n",
              "      <td>Mahalangur Himalaya</td>\n",
              "      <td>.mw-parser-output .geo-default,.mw-parser-outp...</td>\n",
              "      <td>—</td>\n",
              "      <td>1953</td>\n",
              "      <td>145</td>\n",
              "      <td>121</td>\n",
              "      <td>NepalChina</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>2</td>\n",
              "      <td>K2</td>\n",
              "      <td>8611</td>\n",
              "      <td>28251</td>\n",
              "      <td>4020</td>\n",
              "      <td>13190</td>\n",
              "      <td>Baltoro Karakoram</td>\n",
              "      <td>35°52′53″N 76°30′48″E﻿ / ﻿35.88139°N 76.51333°E</td>\n",
              "      <td>Mount Everest</td>\n",
              "      <td>1954</td>\n",
              "      <td>45</td>\n",
              "      <td>44</td>\n",
              "      <td>Pakistan[dp 8]China[12]</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>3</td>\n",
              "      <td>Kangchenjunga</td>\n",
              "      <td>8586</td>\n",
              "      <td>28169</td>\n",
              "      <td>3922</td>\n",
              "      <td>12867</td>\n",
              "      <td>Kangchenjunga Himalaya</td>\n",
              "      <td>27°42′12″N 88°08′51″E﻿ / ﻿27.70333°N 88.14750°E *</td>\n",
              "      <td>Mount Everest</td>\n",
              "      <td>1955</td>\n",
              "      <td>38</td>\n",
              "      <td>24</td>\n",
              "      <td>NepalIndia</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>4</td>\n",
              "      <td>Lhotse</td>\n",
              "      <td>8516</td>\n",
              "      <td>27940</td>\n",
              "      <td>610</td>\n",
              "      <td>2000</td>\n",
              "      <td>Mahalangur Himalaya</td>\n",
              "      <td>27°57′42″N 86°55′59″E﻿ / ﻿27.96167°N 86.93306°E</td>\n",
              "      <td>Mount Everest</td>\n",
              "      <td>1956</td>\n",
              "      <td>26</td>\n",
              "      <td>26</td>\n",
              "      <td>NepalChina</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>5</td>\n",
              "      <td>Makalu</td>\n",
              "      <td>8485</td>\n",
              "      <td>27838</td>\n",
              "      <td>2378</td>\n",
              "      <td>7802</td>\n",
              "      <td>Mahalangur Himalaya</td>\n",
              "      <td>27°53′23″N 87°05′20″E﻿ / ﻿27.88972°N 87.08889°E</td>\n",
              "      <td>Mount Everest</td>\n",
              "      <td>1955</td>\n",
              "      <td>45</td>\n",
              "      <td>—</td>\n",
              "      <td>NepalChina</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-95e7bfa8-69e8-4d40-aaa8-935349e3c698')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "        \n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "      \n",
              "  <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-95e7bfa8-69e8-4d40-aaa8-935349e3c698 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-95e7bfa8-69e8-4d40-aaa8-935349e3c698');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n",
              "  "
            ],
            "text/plain": [
              "  Rank[dp 1]  ... Country (disputed claims in italics)\n",
              "0          1  ...                           NepalChina\n",
              "1          2  ...              Pakistan[dp 8]China[12]\n",
              "2          3  ...                           NepalIndia\n",
              "3          4  ...                           NepalChina\n",
              "4          5  ...                           NepalChina\n",
              "\n",
              "[5 rows x 13 columns]"
            ]
          },
          "metadata": {},
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "I see right away that Everest has some issues with its height in feet."
      ],
      "metadata": {
        "id": "FGJ0AjpI4JxX"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "df1.iloc[0,3]\n",
        "\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "n9rfu2NG4Rp6",
        "outputId": "d63c0ae8-3f5a-4d76-dcf7-799294827a0e"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'29,029[dp 7]'"
            ]
          },
          "metadata": {},
          "execution_count": 8
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's see if we can strip away anything that is not a digit.  I'll use the RegEx package.  The most useful commands in RegEx is hard to say.  There are normally lots of ways to do things.  RegEx looks for certain characters depending on what you tell it. A more complete table can be found [here](https://docs.python.org/3/library/re.html) but I'll try to update this with anything I need to use \n",
        "\n",
        "Symbol|Meaning in RegEx\n",
        "------|-----------------\n",
        "\\d | digits\n",
        "\\D | Letter but not digits\n",
        "[a-z]| Lowercase letters\n",
        "[A-Z]| Uppercase letters\n",
        "\\w | All letters and numbers\n",
        ". | Any characters including spaces\n",
        "? | Makes it non-greedy only picking the first of the patterns\n",
        "*| Allows for 0 or more repititions\n",
        "+| Allows for 1 or more repititions of your pattern\n",
        "^| Not in\n",
        "\n",
        "\n",
        "If you need to look for any of the characters above, use a backslash in front of it.  Let's play around on the height.  First I'll sub anything not a digit with no space (this is how I remove things!)"
      ],
      "metadata": {
        "id": "vLmeVS254xVz"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import re\n",
        "\n",
        "re.sub(r\"\\D\",\"\",df1.iloc[0,3])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "h9ig5qjn7qzm",
        "outputId": "45fd1207-801a-4e3c-fd74-63ea0da8336d"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'290297'"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "This is not quite right!  **7** came inside the block so it was not part of the height!"
      ],
      "metadata": {
        "id": "jHAj84AI9u9C"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "We do see that the problem is of the shape [...] perhaps we can look for that."
      ],
      "metadata": {
        "id": "WId-6N02ALY3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "re.sub(r\"\\[\\w*\\]\",\"\",df1.iloc[0,3])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "1gOmRlJZWP46",
        "outputId": "78088023-e411-4fb8-c5bb-52fa79245ada"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'29,029[dp 7]'"
            ]
          },
          "metadata": {},
          "execution_count": 15
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "This isn't what I wanted!  The issue is the space! `\\w` saw the space and didn't eliminate the part we wanted.  Instead I'll use the `.`.  I'll need to add a repition since there are multiple and I have to put these special characters inside of `()` "
      ],
      "metadata": {
        "id": "zPaGxXaqWcs5"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "re.sub(r\"\\[(.+)\\]\",\"\",df1.iloc[0,3])"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "fCrrxRZJAZX_",
        "outputId": "5257b925-49b7-4a0d-f21f-ed4463b598ee"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'29,029'"
            ]
          },
          "metadata": {},
          "execution_count": 20
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The trickiest part here is that there are spaces!  We need to grab everything inside.  We literally grab the [ and the ] but also grab everything inside with `(.+)`"
      ],
      "metadata": {
        "id": "j0lheAgIDOoK"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's do another example of regular expressions and try to simplify the iris class column!  Maybe we want to change the hyphen to a space and capitalize the name of the flower "
      ],
      "metadata": {
        "id": "1HfaTBHcX-GM"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "\n",
        "\n",
        "list1 = re.sub(r\"-\", \" \",df.Class[0]).split()\n",
        "\n",
        "list1[0] + \" \" + list1[1].capitalize()\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 36
        },
        "id": "WWFtH12eYK3r",
        "outputId": "45c5b017-917f-4300-c49d-29d3480db9c7"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "application/vnd.google.colaboratory.intrinsic+json": {
              "type": "string"
            },
            "text/plain": [
              "'Iris Setosa'"
            ]
          },
          "metadata": {},
          "execution_count": 22
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "To apply this to the whole column, we write it as two lambda functions."
      ],
      "metadata": {
        "id": "Iq_BhQtUaDif"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "split_the_string = lambda s: re.sub(r\"-\", \" \",s).split()\n",
        "reform_the_string = lambda list1: list1[0] + \" \" + list1[1].capitalize()\n",
        "\n",
        "df.Class.apply(lambda s: reform_the_string(split_the_string(s)) )"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ktuEM0PRaKH9",
        "outputId": "c53a82fa-151f-4205-ea82-132911c07c66"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0         Iris Setosa\n",
              "1         Iris Setosa\n",
              "2         Iris Setosa\n",
              "3         Iris Setosa\n",
              "4         Iris Setosa\n",
              "            ...      \n",
              "145    Iris Virginica\n",
              "146    Iris Virginica\n",
              "147    Iris Virginica\n",
              "148    Iris Virginica\n",
              "149    Iris Virginica\n",
              "Name: Class, Length: 150, dtype: object"
            ]
          },
          "metadata": {},
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Just to get the point across, I want to show this in another way.  I find it easier to write one complicated function and then pass that into the apply activating it with a lambda.  You can observe the same thing here."
      ],
      "metadata": {
        "id": "3l8_6YN-tnLq"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "def clean_name(dirty_name):\n",
        "  list1 = re.sub(r\"-\", \" \",dirty_name).split()\n",
        "  return list1[0] + \" \" + list1[1].capitalize()\n",
        "\n",
        "df.Class.apply(lambda s: clean_name(s))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "QV-P3z2Jt4-h",
        "outputId": "2901a4d5-ade6-4b32-9b24-47bd94c544b9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0         Iris Setosa\n",
              "1         Iris Setosa\n",
              "2         Iris Setosa\n",
              "3         Iris Setosa\n",
              "4         Iris Setosa\n",
              "            ...      \n",
              "145    Iris Virginica\n",
              "146    Iris Virginica\n",
              "147    Iris Virginica\n",
              "148    Iris Virginica\n",
              "149    Iris Virginica\n",
              "Name: Class, Length: 150, dtype: object"
            ]
          },
          "metadata": {},
          "execution_count": 14
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Dictionary"
      ],
      "metadata": {
        "id": "zxPP0Trr1ZsP"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Sometimes you just want to change the names all together.  Dictionaries would be great for that!"
      ],
      "metadata": {
        "id": "FQ-nZEQ51eLx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "dictionary = {'Iris-setosa':'Iris1', 'Iris-versicolor':'Iris2','Iris-virginica':'Iris3'}"
      ],
      "metadata": {
        "id": "e4Sj31Ho1FG7"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "df.Class.map(dictionary)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "sKUX9LU015na",
        "outputId": "7c0a5fb7-69f8-4b45-df9c-86f57212db45"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0      Iris1\n",
              "1      Iris1\n",
              "2      Iris1\n",
              "3      Iris1\n",
              "4      Iris1\n",
              "       ...  \n",
              "145    Iris3\n",
              "146    Iris3\n",
              "147    Iris3\n",
              "148    Iris3\n",
              "149    Iris3\n",
              "Name: Class, Length: 150, dtype: object"
            ]
          },
          "metadata": {},
          "execution_count": 16
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Your Turn"
      ],
      "metadata": {
        "id": "iZCgxWsv15Co"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Using the techniques discussed here for regular expressions, clean up the column names in the dataframe `df1`.  Get rid of notes, parentheticals and spaces.  Decide on a capitalization scheme for all column names and apply it."
      ],
      "metadata": {
        "id": "rA_MSaqAPK3X"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "df1.columns"
      ],
      "metadata": {
        "id": "lwvhCdhYPKol",
        "outputId": "af7626f4-8aba-4cb9-fa82-c34e8f27e9eb",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "Index(['Rank[dp 1]', 'Mountain name(s)', 'm', 'ft', 'm', 'ft', 'Range',\n",
              "       'Coordinates[dp 4]', 'Parent mountain[dp 5]', '1st', 'y', 'n',\n",
              "       'Country (disputed claims in italics)'],\n",
              "      dtype='object')"
            ]
          },
          "metadata": {},
          "execution_count": 23
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Examine the last column.  Describe a way in which you might be able to clean it up to make sense of it and allow for later analysis."
      ],
      "metadata": {
        "id": "DLTtoQ3LQ2Lm"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "df1.iloc[:,-1]"
      ],
      "metadata": {
        "id": "o0EhHeh7Xs80",
        "outputId": "efd306dd-a61d-47e4-ddbf-ef48f854812c",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0                            NepalChina\n",
              "1               Pakistan[dp 8]China[12]\n",
              "2                            NepalIndia\n",
              "3                            NepalChina\n",
              "4                            NepalChina\n",
              "                     ...               \n",
              "115                               China\n",
              "116                          NepalChina\n",
              "117                  BhutanChina[dp 18]\n",
              "118    IndiaChina[dp 10][dp 11]'[dp 12]\n",
              "119                      Pakistan[dp 8]\n",
              "Name: Country (disputed claims in italics), Length: 120, dtype: object"
            ]
          },
          "metadata": {},
          "execution_count": 57
        }
      ]
    }
  ]
}