Strings And Cleaning a DataFrame

Strings And Cleaning a DataFrame#

Lambda#

To clean the strings properly in a dataframe, we are going to first need to talk about lambda functions. Lambda functions are different then normal functions in python. They are anonymous and has not been named. They are great because you can pass them in other calls and they avoid some of the slowness of for loops.

https://realpython.com/python-lambda/

lambda x: x^2

<function __main__.<lambda>>

This defines a function that will square any entry. Let’s see it in action

list(map(lambda x: x**2,[1,2,3,4,5]))

[1, 4, 9, 16, 25]

Now we easily could have done this with a regular function and a for loop see the following as an example.

def square(x):
  return x**2

newlist = []
for i in [1,2,3,4,5]:
  newlist.append(square(i))

newlist

[1, 4, 9, 16, 25]

But this will take more memory and will not work on large datasets!

Apply#

We will use the apply function to pass a lambda function to a dataframe. Let’s see one in action!

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')

df.head()

	SepalLength	SepalWidth	PedalLength	PedalWidth	Class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

Let’s remove the capitalization from each letter.

df.Class.apply(lambda s: s.lower())

       iris-setosa
       iris-setosa
       iris-setosa
       iris-setosa
       iris-setosa
            ...      
  iris-virginica
  iris-virginica
  iris-virginica
  iris-virginica
  iris-virginica
Name: Class, Length: 150, dtype: object

Actually the word iris is also redundent so let’s strip all that away too! We’ll capitalize as well to keep our data looking sharp!

df.Class.apply(lambda s: s[5:].capitalize())

       Setosa
       Setosa
       Setosa
       Setosa
       Setosa
         ...    
  Virginica
  Virginica
  Virginica
  Virginica
  Virginica
Name: Class, Length: 150, dtype: object

Patterns and Regular Expressions#

Often when I am dealing with strings I get annoying extras hanging around. This happens a lot when I scrape data from the web. Let’s gather an example.

import requests
import pandas as pa
from bs4 import BeautifulSoup


r = requests.get('https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth')
html_contents = r.text
html_soup = BeautifulSoup(html_contents,"lxml")
tables = html_soup.find_all('table',class_="wikitable")

df1 = pa.read_html(str(tables))[0]
df1.columns = df1.columns.droplevel(0).droplevel(0)
df1.head()

	Rank[dp 1]	Mountain name(s)	m	ft	m	ft	Range	Coordinates[dp 4]	Parent mountain[dp 5]	1st	y	n	Country (disputed claims in italics)
0	1	.mw-parser-output ul.cslist,.mw-parser-output ...	8848	29,029[dp 7]	8848	29029	Mahalangur Himalaya	.mw-parser-output .geo-default,.mw-parser-outp...	—	1953	145	121	NepalChina
1	2	K2	8611	28251	4020	13190	Baltoro Karakoram	35°52′53″N 76°30′48″E / 35.88139°N 76.51333°E	Mount Everest	1954	45	44	Pakistan[dp 8]China[12]
2	3	Kangchenjunga	8586	28169	3922	12867	Kangchenjunga Himalaya	27°42′12″N 88°08′51″E / 27.70333°N 88.14750°E *	Mount Everest	1955	38	24	NepalIndia
3	4	Lhotse	8516	27940	610	2000	Mahalangur Himalaya	27°57′42″N 86°55′59″E / 27.96167°N 86.93306°E	Mount Everest	1956	26	26	NepalChina
4	5	Makalu	8485	27838	2378	7802	Mahalangur Himalaya	27°53′23″N 87°05′20″E / 27.88972°N 87.08889°E	Mount Everest	1955	45	—	NepalChina

I see right away that Everest has some issues with its height in feet.

df1.iloc[0,3]

'29,029[dp 7]'

Let’s see if we can strip away anything that is not a digit. I’ll use the RegEx package. The most useful commands in RegEx is hard to say. There are normally lots of ways to do things. RegEx looks for certain characters depending on what you tell it. A more complete table can be found here but I’ll try to update this with anything I need to use

Symbol	Meaning in RegEx
\d	digits
\D	Letter but not digits
[a-z]	Lowercase letters
[A-Z]	Uppercase letters
\w	All letters and numbers
.	Any characters including spaces
?	Makes it non-greedy only picking the first of the patterns
*	Allows for 0 or more repititions
+	Allows for 1 or more repititions of your pattern
^	Not in

If you need to look for any of the characters above, use a backslash in front of it. Let’s play around on the height. First I’ll sub anything not a digit with no space (this is how I remove things!)

import re

re.sub(r"\D","",df1.iloc[0,3])

'290297'

This is not quite right! 7 came inside the block so it was not part of the height!

We do see that the problem is of the shape […] perhaps we can look for that.

re.sub(r"\[\w*\]","",df1.iloc[0,3])

'29,029[dp 7]'

This isn’t what I wanted! The issue is the space! \w saw the space and didn’t eliminate the part we wanted. Instead I’ll use the .. I’ll need to add a repition since there are multiple and I have to put these special characters inside of ()

re.sub(r"\[(.+)\]","",df1.iloc[0,3])

'29,029'

The trickiest part here is that there are spaces! We need to grab everything inside. We literally grab the [ and the ] but also grab everything inside with (.+)

Let’s do another example of regular expressions and try to simplify the iris class column! Maybe we want to change the hyphen to a space and capitalize the name of the flower

list1 = re.sub(r"-", " ",df.Class[0]).split()

list1[0] + " " + list1[1].capitalize()

'Iris Setosa'

To apply this to the whole column, we write it as two lambda functions.

split_the_string = lambda s: re.sub(r"-", " ",s).split()
reform_the_string = lambda list1: list1[0] + " " + list1[1].capitalize()

df.Class.apply(lambda s: reform_the_string(split_the_string(s)) )

       Iris Setosa
       Iris Setosa
       Iris Setosa
       Iris Setosa
       Iris Setosa
            ...      
  Iris Virginica
  Iris Virginica
  Iris Virginica
  Iris Virginica
  Iris Virginica
Name: Class, Length: 150, dtype: object

Just to get the point across, I want to show this in another way. I find it easier to write one complicated function and then pass that into the apply activating it with a lambda. You can observe the same thing here.

def clean_name(dirty_name):
  list1 = re.sub(r"-", " ",dirty_name).split()
  return list1[0] + " " + list1[1].capitalize()

df.Class.apply(lambda s: clean_name(s))

       Iris Setosa
       Iris Setosa
       Iris Setosa
       Iris Setosa
       Iris Setosa
            ...      
  Iris Virginica
  Iris Virginica
  Iris Virginica
  Iris Virginica
  Iris Virginica
Name: Class, Length: 150, dtype: object

Dictionary#

Sometimes you just want to change the names all together. Dictionaries would be great for that!

dictionary = {'Iris-setosa':'Iris1', 'Iris-versicolor':'Iris2','Iris-virginica':'Iris3'}

df.Class.map(dictionary)

    Iris1
    Iris1
    Iris1
    Iris1
    Iris1
       ...  
  Iris3
  Iris3
  Iris3
  Iris3
  Iris3
Name: Class, Length: 150, dtype: object

Your Turn#

Using the techniques discussed here for regular expressions, clean up the column names in the dataframe df1. Get rid of notes, parentheticals and spaces. Decide on a capitalization scheme for all column names and apply it.

df1.columns

Index(['Rank[dp 1]', 'Mountain name(s)', 'm', 'ft', 'm', 'ft', 'Range',
       'Coordinates[dp 4]', 'Parent mountain[dp 5]', '1st', 'y', 'n',
       'Country (disputed claims in italics)'],
      dtype='object')

Examine the last column. Describe a way in which you might be able to clean it up to make sense of it and allow for later analysis.

df1.iloc[:,-1]

                          NepalChina
             Pakistan[dp 8]China[12]
                          NepalIndia
                          NepalChina
                          NepalChina
                     ...               
                             China
                        NepalChina
                BhutanChina[dp 18]
  IndiaChina[dp 10][dp 11]'[dp 12]
                    Pakistan[dp 8]
Name: Country (disputed claims in italics), Length: 120, dtype: object