Strings And Cleaning a DataFrame#
Lambda#
To clean the strings properly in a dataframe, we are going to first need to talk about lambda
functions. Lambda functions are different then normal functions in python. They are anonymous and has not been named. They are great because you can pass them in other calls and they avoid some of the slowness of for loops.
https://realpython.com/python-lambda/
lambda x: x^2
<function __main__.<lambda>>
This defines a function that will square any entry. Let’s see it in action
list(map(lambda x: x**2,[1,2,3,4,5]))
[1, 4, 9, 16, 25]
Now we easily could have done this with a regular function and a for loop see the following as an example.
def square(x):
return x**2
newlist = []
for i in [1,2,3,4,5]:
newlist.append(square(i))
newlist
[1, 4, 9, 16, 25]
But this will take more memory and will not work on large datasets!
Apply#
We will use the apply
function to pass a lambda function to a dataframe. Let’s see one in action!
import pandas as pa
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')
df.head()
SepalLength | SepalWidth | PedalLength | PedalWidth | Class | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Let’s remove the capitalization from each letter.
df.Class.apply(lambda s: s.lower())
0 iris-setosa
1 iris-setosa
2 iris-setosa
3 iris-setosa
4 iris-setosa
...
145 iris-virginica
146 iris-virginica
147 iris-virginica
148 iris-virginica
149 iris-virginica
Name: Class, Length: 150, dtype: object
Actually the word iris is also redundent so let’s strip all that away too! We’ll capitalize as well to keep our data looking sharp!
df.Class.apply(lambda s: s[5:].capitalize())
0 Setosa
1 Setosa
2 Setosa
3 Setosa
4 Setosa
...
145 Virginica
146 Virginica
147 Virginica
148 Virginica
149 Virginica
Name: Class, Length: 150, dtype: object
Patterns and Regular Expressions#
Often when I am dealing with strings I get annoying extras hanging around. This happens a lot when I scrape data from the web. Let’s gather an example.
import requests
import pandas as pa
from bs4 import BeautifulSoup
r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')
html_contents = r.text
html_soup = BeautifulSoup(html_contents,"lxml")
tables = html_soup.find_all('table',class_="wikitable")
df1 = pa.read_html(str(tables))[0]
df1.columns = df1.columns.droplevel(0).droplevel(0)
df1.head()
Rank[dp 1] | Mountain name(s) | m | ft | m | ft | Range | Coordinates[dp 4] | Parent mountain[dp 5] | 1st | y | n | Country (disputed claims in italics) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | .mw-parser-output ul.cslist,.mw-parser-output ... | 8848 | 29,029[dp 7] | 8848 | 29029 | Mahalangur Himalaya | .mw-parser-output .geo-default,.mw-parser-outp... | — | 1953 | 145 | 121 | NepalChina |
1 | 2 | K2 | 8611 | 28251 | 4020 | 13190 | Baltoro Karakoram | 35°52′53″N 76°30′48″E / 35.88139°N 76.51333°E | Mount Everest | 1954 | 45 | 44 | Pakistan[dp 8]China[12] |
2 | 3 | Kangchenjunga | 8586 | 28169 | 3922 | 12867 | Kangchenjunga Himalaya | 27°42′12″N 88°08′51″E / 27.70333°N 88.14750°E * | Mount Everest | 1955 | 38 | 24 | NepalIndia |
3 | 4 | Lhotse | 8516 | 27940 | 610 | 2000 | Mahalangur Himalaya | 27°57′42″N 86°55′59″E / 27.96167°N 86.93306°E | Mount Everest | 1956 | 26 | 26 | NepalChina |
4 | 5 | Makalu | 8485 | 27838 | 2378 | 7802 | Mahalangur Himalaya | 27°53′23″N 87°05′20″E / 27.88972°N 87.08889°E | Mount Everest | 1955 | 45 | — | NepalChina |
I see right away that Everest has some issues with its height in feet.
df1.iloc[0,3]
'29,029[dp 7]'
Let’s see if we can strip away anything that is not a digit. I’ll use the RegEx package. The most useful commands in RegEx is hard to say. There are normally lots of ways to do things. RegEx looks for certain characters depending on what you tell it. A more complete table can be found here but I’ll try to update this with anything I need to use
Symbol |
Meaning in RegEx |
---|---|
\d |
digits |
\D |
Letter but not digits |
[a-z] |
Lowercase letters |
[A-Z] |
Uppercase letters |
\w |
All letters and numbers |
. |
Any characters including spaces |
? |
Makes it non-greedy only picking the first of the patterns |
* |
Allows for 0 or more repititions |
+ |
Allows for 1 or more repititions of your pattern |
^ |
Not in |
If you need to look for any of the characters above, use a backslash in front of it. Let’s play around on the height. First I’ll sub anything not a digit with no space (this is how I remove things!)
import re
re.sub(r"\D","",df1.iloc[0,3])
'290297'
This is not quite right! 7 came inside the block so it was not part of the height!
We do see that the problem is of the shape […] perhaps we can look for that.
re.sub(r"\[\w*\]","",df1.iloc[0,3])
'29,029[dp 7]'
This isn’t what I wanted! The issue is the space! \w
saw the space and didn’t eliminate the part we wanted. Instead I’ll use the .
. I’ll need to add a repition since there are multiple and I have to put these special characters inside of ()
re.sub(r"\[(.+)\]","",df1.iloc[0,3])
'29,029'
The trickiest part here is that there are spaces! We need to grab everything inside. We literally grab the [ and the ] but also grab everything inside with (.+)
Let’s do another example of regular expressions and try to simplify the iris class column! Maybe we want to change the hyphen to a space and capitalize the name of the flower
list1 = re.sub(r"-", " ",df.Class[0]).split()
list1[0] + " " + list1[1].capitalize()
'Iris Setosa'
To apply this to the whole column, we write it as two lambda functions.
split_the_string = lambda s: re.sub(r"-", " ",s).split()
reform_the_string = lambda list1: list1[0] + " " + list1[1].capitalize()
df.Class.apply(lambda s: reform_the_string(split_the_string(s)) )
0 Iris Setosa
1 Iris Setosa
2 Iris Setosa
3 Iris Setosa
4 Iris Setosa
...
145 Iris Virginica
146 Iris Virginica
147 Iris Virginica
148 Iris Virginica
149 Iris Virginica
Name: Class, Length: 150, dtype: object
Just to get the point across, I want to show this in another way. I find it easier to write one complicated function and then pass that into the apply activating it with a lambda. You can observe the same thing here.
def clean_name(dirty_name):
list1 = re.sub(r"-", " ",dirty_name).split()
return list1[0] + " " + list1[1].capitalize()
df.Class.apply(lambda s: clean_name(s))
0 Iris Setosa
1 Iris Setosa
2 Iris Setosa
3 Iris Setosa
4 Iris Setosa
...
145 Iris Virginica
146 Iris Virginica
147 Iris Virginica
148 Iris Virginica
149 Iris Virginica
Name: Class, Length: 150, dtype: object
Dictionary#
Sometimes you just want to change the names all together. Dictionaries would be great for that!
dictionary = {'Iris-setosa':'Iris1', 'Iris-versicolor':'Iris2','Iris-virginica':'Iris3'}
df.Class.map(dictionary)
0 Iris1
1 Iris1
2 Iris1
3 Iris1
4 Iris1
...
145 Iris3
146 Iris3
147 Iris3
148 Iris3
149 Iris3
Name: Class, Length: 150, dtype: object
Your Turn#
Using the techniques discussed here for regular expressions, clean up the column names in the dataframe df1
. Get rid of notes, parentheticals and spaces. Decide on a capitalization scheme for all column names and apply it.
df1.columns
Index(['Rank[dp 1]', 'Mountain name(s)', 'm', 'ft', 'm', 'ft', 'Range',
'Coordinates[dp 4]', 'Parent mountain[dp 5]', '1st', 'y', 'n',
'Country (disputed claims in italics)'],
dtype='object')
Examine the last column. Describe a way in which you might be able to clean it up to make sense of it and allow for later analysis.
df1.iloc[:,-1]
0 NepalChina
1 Pakistan[dp 8]China[12]
2 NepalIndia
3 NepalChina
4 NepalChina
...
115 China
116 NepalChina
117 BhutanChina[dp 18]
118 IndiaChina[dp 10][dp 11]'[dp 12]
119 Pakistan[dp 8]
Name: Country (disputed claims in italics), Length: 120, dtype: object