{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "Untitled82.ipynb", "provenance": [], "authorship_tag": "ABX9TyObCbQFZ0mfH94sG7NeKnG3", "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "source": [ "# Strings And Cleaning a DataFrame" ], "metadata": { "id": "dS1ulNfIwrFD" } }, { "cell_type": "markdown", "source": [ "## Lambda" ], "metadata": { "id": "fTid8y890Eht" } }, { "cell_type": "markdown", "source": [ "To clean the strings properly in a dataframe, we are going to first need to talk about `lambda` functions. Lambda functions are different then normal functions in python. They are anonymous and has not been named. They are great because you can pass them in other calls and they avoid some of the slowness of for loops.\n", "\n", "\n", "https://realpython.com/python-lambda/" ], "metadata": { "id": "L_Wx5V2LxSfb" } }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dyp4KUp-wqhx", "outputId": "15439351-f324-485e-9824-dfc2fc0bdba2" }, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ ">" ] }, "metadata": {}, "execution_count": 1 } ], "source": [ "lambda x: x^2" ] }, { "cell_type": "markdown", "source": [ "This defines a function that will square any entry. Let's see it in action" ], "metadata": { "id": "hFoNwrbryu61" } }, { "cell_type": "code", "source": [ "list(map(lambda x: x**2,[1,2,3,4,5]))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "aExg-A9WylIG", "outputId": "346bfde0-33d4-4df0-fb6c-1907a539ad7f" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "metadata": {}, "execution_count": 1 } ] }, { "cell_type": "markdown", "source": [ "Now we easily could have done this with a regular function and a for loop see the following as an example." ], "metadata": { "id": "gWm01MdkzcrF" } }, { "cell_type": "code", "source": [ "def square(x):\n", " return x**2\n", "\n", "newlist = []\n", "for i in [1,2,3,4,5]:\n", " newlist.append(square(i))\n", "\n", "newlist" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "iHyI7s5gy5Bu", "outputId": "f301fb5f-c950-43b8-9120-881d4e93bdf3" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "[1, 4, 9, 16, 25]" ] }, "metadata": {}, "execution_count": 3 } ] }, { "cell_type": "markdown", "source": [ "But this will take more memory and will not work on large datasets!" ], "metadata": { "id": "y5crpu8Wz5Yb" } }, { "cell_type": "markdown", "source": [ "## Apply" ], "metadata": { "id": "LezXKUTT0CLj" } }, { "cell_type": "markdown", "source": [ "We will use the `apply` function to pass a lambda function to a dataframe. Let's see one in action!" ], "metadata": { "id": "eD1HCLn60HeQ" } }, { "cell_type": "code", "source": [ "import pandas as pa\n", "\n", "df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')\n", "\n", "df.head()\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "Sw68Gt9NzuU1", "outputId": "4e3812d0-3d43-4abe-eb05-762c78c881de" }, "execution_count": 1, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " SepalLength SepalWidth PedalLength PedalWidth Class\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa" ], "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SepalLengthSepalWidthPedalLengthPedalWidthClass
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ] }, "metadata": {}, "execution_count": 1 } ] }, { "cell_type": "markdown", "source": [ "Let's remove the capitalization from each letter." ], "metadata": { "id": "b380iJLK00Oa" } }, { "cell_type": "code", "source": [ "df.Class.apply(lambda s: s.lower())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "O_bHLmbL0iqD", "outputId": "b1fbaa62-2f11-4150-ed68-91d6a9d0c0f6" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 iris-setosa\n", "1 iris-setosa\n", "2 iris-setosa\n", "3 iris-setosa\n", "4 iris-setosa\n", " ... \n", "145 iris-virginica\n", "146 iris-virginica\n", "147 iris-virginica\n", "148 iris-virginica\n", "149 iris-virginica\n", "Name: Class, Length: 150, dtype: object" ] }, "metadata": {}, "execution_count": 5 } ] }, { "cell_type": "markdown", "source": [ "Actually the word iris is also redundent so let's strip all that away too! We'll capitalize as well to keep our data looking sharp!" ], "metadata": { "id": "zqYJD4Bg06J_" } }, { "cell_type": "code", "source": [ "df.Class.apply(lambda s: s[5:].capitalize())" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "5hEOgNiy0yK5", "outputId": "1cdb15a5-9608-4b38-cc62-d9558d82c93c" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 Setosa\n", "1 Setosa\n", "2 Setosa\n", "3 Setosa\n", "4 Setosa\n", " ... \n", "145 Virginica\n", "146 Virginica\n", "147 Virginica\n", "148 Virginica\n", "149 Virginica\n", "Name: Class, Length: 150, dtype: object" ] }, "metadata": {}, "execution_count": 6 } ] }, { "cell_type": "markdown", "source": [ "## Patterns and Regular Expressions" ], "metadata": { "id": "4mJsuUsd2cB9" } }, { "cell_type": "markdown", "source": [ "Often when I am dealing with strings I get annoying extras hanging around. This happens a lot when I scrape data from the web. Let's gather an example." ], "metadata": { "id": "aG7gwCCa2ee2" } }, { "cell_type": "code", "source": [ "import requests\n", "import pandas as pa\n", "from bs4 import BeautifulSoup\n", "\n", "\n", "r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')\n", "html_contents = r.text\n", "html_soup = BeautifulSoup(html_contents,\"lxml\")\n", "tables = html_soup.find_all('table',class_=\"wikitable\")\n", "\n", "df1 = pa.read_html(str(tables))[0]\n", "df1.columns = df1.columns.droplevel(0).droplevel(0)\n", "df1.head()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 372 }, "id": "5lY05_tR2sji", "outputId": "e2278a12-91ba-44d7-d147-9c991bf6de49" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "\n", "
\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Rank[dp 1]Mountain name(s)mftmftRangeCoordinates[dp 4]Parent mountain[dp 5]1stynCountry (disputed claims in italics)
01.mw-parser-output ul.cslist,.mw-parser-output ...884829,029[dp 7]884829029Mahalangur Himalaya.mw-parser-output .geo-default,.mw-parser-outp...1953145121NepalChina
12K2861128251402013190Baltoro Karakoram35°52′53″N 76°30′48″E / 35.88139°N 76.51333°EMount Everest19544544Pakistan[dp 8]China[12]
23Kangchenjunga858628169392212867Kangchenjunga Himalaya27°42′12″N 88°08′51″E / 27.70333°N 88.14750°E *Mount Everest19553824NepalIndia
34Lhotse8516279406102000Mahalangur Himalaya27°57′42″N 86°55′59″E / 27.96167°N 86.93306°EMount Everest19562626NepalChina
45Makalu84852783823787802Mahalangur Himalaya27°53′23″N 87°05′20″E / 27.88972°N 87.08889°EMount Everest195545NepalChina
\n", "
\n", " \n", " \n", " \n", "\n", " \n", "
\n", "
\n", " " ], "text/plain": [ " Rank[dp 1] ... Country (disputed claims in italics)\n", "0 1 ... NepalChina\n", "1 2 ... Pakistan[dp 8]China[12]\n", "2 3 ... NepalIndia\n", "3 4 ... NepalChina\n", "4 5 ... NepalChina\n", "\n", "[5 rows x 13 columns]" ] }, "metadata": {}, "execution_count": 11 } ] }, { "cell_type": "markdown", "source": [ "I see right away that Everest has some issues with its height in feet." ], "metadata": { "id": "FGJ0AjpI4JxX" } }, { "cell_type": "code", "source": [ "df1.iloc[0,3]\n", "\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "n9rfu2NG4Rp6", "outputId": "d63c0ae8-3f5a-4d76-dcf7-799294827a0e" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'29,029[dp 7]'" ] }, "metadata": {}, "execution_count": 8 } ] }, { "cell_type": "markdown", "source": [ "Let's see if we can strip away anything that is not a digit. I'll use the RegEx package. The most useful commands in RegEx is hard to say. There are normally lots of ways to do things. RegEx looks for certain characters depending on what you tell it. A more complete table can be found [here](https://docs.python.org/3/library/re.html) but I'll try to update this with anything I need to use \n", "\n", "Symbol|Meaning in RegEx\n", "------|-----------------\n", "\\d | digits\n", "\\D | Letter but not digits\n", "[a-z]| Lowercase letters\n", "[A-Z]| Uppercase letters\n", "\\w | All letters and numbers\n", ". | Any characters including spaces\n", "? | Makes it non-greedy only picking the first of the patterns\n", "*| Allows for 0 or more repititions\n", "+| Allows for 1 or more repititions of your pattern\n", "^| Not in\n", "\n", "\n", "If you need to look for any of the characters above, use a backslash in front of it. Let's play around on the height. First I'll sub anything not a digit with no space (this is how I remove things!)" ], "metadata": { "id": "vLmeVS254xVz" } }, { "cell_type": "code", "source": [ "import re\n", "\n", "re.sub(r\"\\D\",\"\",df1.iloc[0,3])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "h9ig5qjn7qzm", "outputId": "45fd1207-801a-4e3c-fd74-63ea0da8336d" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'290297'" ] }, "metadata": {}, "execution_count": 14 } ] }, { "cell_type": "markdown", "source": [ "This is not quite right! **7** came inside the block so it was not part of the height!" ], "metadata": { "id": "jHAj84AI9u9C" } }, { "cell_type": "markdown", "source": [ "We do see that the problem is of the shape [...] perhaps we can look for that." ], "metadata": { "id": "WId-6N02ALY3" } }, { "cell_type": "code", "source": [ "re.sub(r\"\\[\\w*\\]\",\"\",df1.iloc[0,3])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "1gOmRlJZWP46", "outputId": "78088023-e411-4fb8-c5bb-52fa79245ada" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'29,029[dp 7]'" ] }, "metadata": {}, "execution_count": 15 } ] }, { "cell_type": "markdown", "source": [ "This isn't what I wanted! The issue is the space! `\\w` saw the space and didn't eliminate the part we wanted. Instead I'll use the `.`. I'll need to add a repition since there are multiple and I have to put these special characters inside of `()` " ], "metadata": { "id": "zPaGxXaqWcs5" } }, { "cell_type": "code", "source": [ "re.sub(r\"\\[(.+)\\]\",\"\",df1.iloc[0,3])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "fCrrxRZJAZX_", "outputId": "5257b925-49b7-4a0d-f21f-ed4463b598ee" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'29,029'" ] }, "metadata": {}, "execution_count": 20 } ] }, { "cell_type": "markdown", "source": [ "The trickiest part here is that there are spaces! We need to grab everything inside. We literally grab the [ and the ] but also grab everything inside with `(.+)`" ], "metadata": { "id": "j0lheAgIDOoK" } }, { "cell_type": "markdown", "source": [ "Let's do another example of regular expressions and try to simplify the iris class column! Maybe we want to change the hyphen to a space and capitalize the name of the flower " ], "metadata": { "id": "1HfaTBHcX-GM" } }, { "cell_type": "code", "source": [ "\n", "\n", "list1 = re.sub(r\"-\", \" \",df.Class[0]).split()\n", "\n", "list1[0] + \" \" + list1[1].capitalize()\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 36 }, "id": "WWFtH12eYK3r", "outputId": "45c5b017-917f-4300-c49d-29d3480db9c7" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Iris Setosa'" ] }, "metadata": {}, "execution_count": 22 } ] }, { "cell_type": "markdown", "source": [ "To apply this to the whole column, we write it as two lambda functions." ], "metadata": { "id": "Iq_BhQtUaDif" } }, { "cell_type": "code", "source": [ "split_the_string = lambda s: re.sub(r\"-\", \" \",s).split()\n", "reform_the_string = lambda list1: list1[0] + \" \" + list1[1].capitalize()\n", "\n", "df.Class.apply(lambda s: reform_the_string(split_the_string(s)) )" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ktuEM0PRaKH9", "outputId": "c53a82fa-151f-4205-ea82-132911c07c66" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 Iris Setosa\n", "1 Iris Setosa\n", "2 Iris Setosa\n", "3 Iris Setosa\n", "4 Iris Setosa\n", " ... \n", "145 Iris Virginica\n", "146 Iris Virginica\n", "147 Iris Virginica\n", "148 Iris Virginica\n", "149 Iris Virginica\n", "Name: Class, Length: 150, dtype: object" ] }, "metadata": {}, "execution_count": 13 } ] }, { "cell_type": "markdown", "source": [ "Just to get the point across, I want to show this in another way. I find it easier to write one complicated function and then pass that into the apply activating it with a lambda. You can observe the same thing here." ], "metadata": { "id": "3l8_6YN-tnLq" } }, { "cell_type": "code", "source": [ "def clean_name(dirty_name):\n", " list1 = re.sub(r\"-\", \" \",dirty_name).split()\n", " return list1[0] + \" \" + list1[1].capitalize()\n", "\n", "df.Class.apply(lambda s: clean_name(s))" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "QV-P3z2Jt4-h", "outputId": "2901a4d5-ade6-4b32-9b24-47bd94c544b9" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 Iris Setosa\n", "1 Iris Setosa\n", "2 Iris Setosa\n", "3 Iris Setosa\n", "4 Iris Setosa\n", " ... \n", "145 Iris Virginica\n", "146 Iris Virginica\n", "147 Iris Virginica\n", "148 Iris Virginica\n", "149 Iris Virginica\n", "Name: Class, Length: 150, dtype: object" ] }, "metadata": {}, "execution_count": 14 } ] }, { "cell_type": "markdown", "source": [ "## Dictionary" ], "metadata": { "id": "zxPP0Trr1ZsP" } }, { "cell_type": "markdown", "source": [ "Sometimes you just want to change the names all together. Dictionaries would be great for that!" ], "metadata": { "id": "FQ-nZEQ51eLx" } }, { "cell_type": "code", "source": [ "dictionary = {'Iris-setosa':'Iris1', 'Iris-versicolor':'Iris2','Iris-virginica':'Iris3'}" ], "metadata": { "id": "e4Sj31Ho1FG7" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "df.Class.map(dictionary)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "sKUX9LU015na", "outputId": "7c0a5fb7-69f8-4b45-df9c-86f57212db45" }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 Iris1\n", "1 Iris1\n", "2 Iris1\n", "3 Iris1\n", "4 Iris1\n", " ... \n", "145 Iris3\n", "146 Iris3\n", "147 Iris3\n", "148 Iris3\n", "149 Iris3\n", "Name: Class, Length: 150, dtype: object" ] }, "metadata": {}, "execution_count": 16 } ] }, { "cell_type": "markdown", "source": [ "## Your Turn" ], "metadata": { "id": "iZCgxWsv15Co" } }, { "cell_type": "markdown", "source": [ "Using the techniques discussed here for regular expressions, clean up the column names in the dataframe `df1`. Get rid of notes, parentheticals and spaces. Decide on a capitalization scheme for all column names and apply it." ], "metadata": { "id": "rA_MSaqAPK3X" } }, { "cell_type": "code", "source": [ "df1.columns" ], "metadata": { "id": "lwvhCdhYPKol", "outputId": "af7626f4-8aba-4cb9-fa82-c34e8f27e9eb", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Index(['Rank[dp 1]', 'Mountain name(s)', 'm', 'ft', 'm', 'ft', 'Range',\n", " 'Coordinates[dp 4]', 'Parent mountain[dp 5]', '1st', 'y', 'n',\n", " 'Country (disputed claims in italics)'],\n", " dtype='object')" ] }, "metadata": {}, "execution_count": 23 } ] }, { "cell_type": "markdown", "source": [ "Examine the last column. Describe a way in which you might be able to clean it up to make sense of it and allow for later analysis." ], "metadata": { "id": "DLTtoQ3LQ2Lm" } }, { "cell_type": "code", "source": [ "df1.iloc[:,-1]" ], "metadata": { "id": "o0EhHeh7Xs80", "outputId": "efd306dd-a61d-47e4-ddbf-ef48f854812c", "colab": { "base_uri": "https://localhost:8080/" } }, "execution_count": null, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 NepalChina\n", "1 Pakistan[dp 8]China[12]\n", "2 NepalIndia\n", "3 NepalChina\n", "4 NepalChina\n", " ... \n", "115 China\n", "116 NepalChina\n", "117 BhutanChina[dp 18]\n", "118 IndiaChina[dp 10][dp 11]'[dp 12]\n", "119 Pakistan[dp 8]\n", "Name: Country (disputed claims in italics), Length: 120, dtype: object" ] }, "metadata": {}, "execution_count": 57 } ] } ] }