Strings And Cleaning a DataFrame

Open In Colab

Strings And Cleaning a DataFrame#

Lambda#

To clean the strings properly in a dataframe, we are going to first need to talk about lambda functions. Lambda functions are different then normal functions in python. They are anonymous and has not been named. They are great because you can pass them in other calls and they avoid some of the slowness of for loops.

https://realpython.com/python-lambda/

lambda x: x^2
<function __main__.<lambda>>

This defines a function that will square any entry. Let’s see it in action

list(map(lambda x: x**2,[1,2,3,4,5]))
[1, 4, 9, 16, 25]

Now we easily could have done this with a regular function and a for loop see the following as an example.

def square(x):
  return x**2

newlist = []
for i in [1,2,3,4,5]:
  newlist.append(square(i))

newlist
[1, 4, 9, 16, 25]

But this will take more memory and will not work on large datasets!

Apply#

We will use the apply function to pass a lambda function to a dataframe. Let’s see one in action!

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')

df.head()
SepalLength SepalWidth PedalLength PedalWidth Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Let’s remove the capitalization from each letter.

df.Class.apply(lambda s: s.lower())
0         iris-setosa
1         iris-setosa
2         iris-setosa
3         iris-setosa
4         iris-setosa
            ...      
145    iris-virginica
146    iris-virginica
147    iris-virginica
148    iris-virginica
149    iris-virginica
Name: Class, Length: 150, dtype: object

Actually the word iris is also redundent so let’s strip all that away too! We’ll capitalize as well to keep our data looking sharp!

df.Class.apply(lambda s: s[5:].capitalize())
0         Setosa
1         Setosa
2         Setosa
3         Setosa
4         Setosa
         ...    
145    Virginica
146    Virginica
147    Virginica
148    Virginica
149    Virginica
Name: Class, Length: 150, dtype: object

Patterns and Regular Expressions#

Often when I am dealing with strings I get annoying extras hanging around. This happens a lot when I scrape data from the web. Let’s gather an example.

import requests
import pandas as pa
from bs4 import BeautifulSoup


r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')
html_contents = r.text
html_soup = BeautifulSoup(html_contents,"lxml")
tables = html_soup.find_all('table',class_="wikitable")

df1 = pa.read_html(str(tables))[0]
df1.columns = df1.columns.droplevel(0).droplevel(0)
df1.head()
Rank[dp 1] Mountain name(s) m ft m ft Range Coordinates[dp 4] Parent mountain[dp 5] 1st y n Country (disputed claims in italics)
0 1 .mw-parser-output ul.cslist,.mw-parser-output ... 8848 29,029[dp 7] 8848 29029 Mahalangur Himalaya .mw-parser-output .geo-default,.mw-parser-outp... 1953 145 121 NepalChina
1 2 K2 8611 28251 4020 13190 Baltoro Karakoram 35°52′53″N 76°30′48″E / 35.88139°N 76.51333°E Mount Everest 1954 45 44 Pakistan[dp 8]China[12]
2 3 Kangchenjunga 8586 28169 3922 12867 Kangchenjunga Himalaya 27°42′12″N 88°08′51″E / 27.70333°N 88.14750°E * Mount Everest 1955 38 24 NepalIndia
3 4 Lhotse 8516 27940 610 2000 Mahalangur Himalaya 27°57′42″N 86°55′59″E / 27.96167°N 86.93306°E Mount Everest 1956 26 26 NepalChina
4 5 Makalu 8485 27838 2378 7802 Mahalangur Himalaya 27°53′23″N 87°05′20″E / 27.88972°N 87.08889°E Mount Everest 1955 45 NepalChina

I see right away that Everest has some issues with its height in feet.

df1.iloc[0,3]

'29,029[dp 7]'

Let’s see if we can strip away anything that is not a digit. I’ll use the RegEx package. The most useful commands in RegEx is hard to say. There are normally lots of ways to do things. RegEx looks for certain characters depending on what you tell it. A more complete table can be found here but I’ll try to update this with anything I need to use

Symbol

Meaning in RegEx

\d

digits

\D

Letter but not digits

[a-z]

Lowercase letters

[A-Z]

Uppercase letters

\w

All letters and numbers

.

Any characters including spaces

?

Makes it non-greedy only picking the first of the patterns

*

Allows for 0 or more repititions

+

Allows for 1 or more repititions of your pattern

^

Not in

If you need to look for any of the characters above, use a backslash in front of it. Let’s play around on the height. First I’ll sub anything not a digit with no space (this is how I remove things!)

import re

re.sub(r"\D","",df1.iloc[0,3])
'290297'

This is not quite right! 7 came inside the block so it was not part of the height!

We do see that the problem is of the shape […] perhaps we can look for that.

re.sub(r"\[\w*\]","",df1.iloc[0,3])
'29,029[dp 7]'

This isn’t what I wanted! The issue is the space! \w saw the space and didn’t eliminate the part we wanted. Instead I’ll use the .. I’ll need to add a repition since there are multiple and I have to put these special characters inside of ()

re.sub(r"\[(.+)\]","",df1.iloc[0,3])
'29,029'

The trickiest part here is that there are spaces! We need to grab everything inside. We literally grab the [ and the ] but also grab everything inside with (.+)

Let’s do another example of regular expressions and try to simplify the iris class column! Maybe we want to change the hyphen to a space and capitalize the name of the flower



list1 = re.sub(r"-", " ",df.Class[0]).split()

list1[0] + " " + list1[1].capitalize()
'Iris Setosa'

To apply this to the whole column, we write it as two lambda functions.

split_the_string = lambda s: re.sub(r"-", " ",s).split()
reform_the_string = lambda list1: list1[0] + " " + list1[1].capitalize()

df.Class.apply(lambda s: reform_the_string(split_the_string(s)) )
0         Iris Setosa
1         Iris Setosa
2         Iris Setosa
3         Iris Setosa
4         Iris Setosa
            ...      
145    Iris Virginica
146    Iris Virginica
147    Iris Virginica
148    Iris Virginica
149    Iris Virginica
Name: Class, Length: 150, dtype: object

Just to get the point across, I want to show this in another way. I find it easier to write one complicated function and then pass that into the apply activating it with a lambda. You can observe the same thing here.

def clean_name(dirty_name):
  list1 = re.sub(r"-", " ",dirty_name).split()
  return list1[0] + " " + list1[1].capitalize()

df.Class.apply(lambda s: clean_name(s))
0         Iris Setosa
1         Iris Setosa
2         Iris Setosa
3         Iris Setosa
4         Iris Setosa
            ...      
145    Iris Virginica
146    Iris Virginica
147    Iris Virginica
148    Iris Virginica
149    Iris Virginica
Name: Class, Length: 150, dtype: object

Dictionary#

Sometimes you just want to change the names all together. Dictionaries would be great for that!

dictionary = {'Iris-setosa':'Iris1', 'Iris-versicolor':'Iris2','Iris-virginica':'Iris3'}
df.Class.map(dictionary)
0      Iris1
1      Iris1
2      Iris1
3      Iris1
4      Iris1
       ...  
145    Iris3
146    Iris3
147    Iris3
148    Iris3
149    Iris3
Name: Class, Length: 150, dtype: object

Your Turn#

Using the techniques discussed here for regular expressions, clean up the column names in the dataframe df1. Get rid of notes, parentheticals and spaces. Decide on a capitalization scheme for all column names and apply it.

df1.columns
Index(['Rank[dp 1]', 'Mountain name(s)', 'm', 'ft', 'm', 'ft', 'Range',
       'Coordinates[dp 4]', 'Parent mountain[dp 5]', '1st', 'y', 'n',
       'Country (disputed claims in italics)'],
      dtype='object')

Examine the last column. Describe a way in which you might be able to clean it up to make sense of it and allow for later analysis.

df1.iloc[:,-1]
0                            NepalChina
1               Pakistan[dp 8]China[12]
2                            NepalIndia
3                            NepalChina
4                            NepalChina
                     ...               
115                               China
116                          NepalChina
117                  BhutanChina[dp 18]
118    IndiaChina[dp 10][dp 11]'[dp 12]
119                      Pakistan[dp 8]
Name: Country (disputed claims in italics), Length: 120, dtype: object