Missing and Incomplete#
Often datasets will be missing entries. There are many approaches we can take to dealing with these errors and omissions. I will examine a dataset on the characters from The Lord of The Rings
Finding NaN’s#
import pandas as pa
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/lotr_characters.csv')
df.head()
birth | death | gender | hair | height | name | race | realm | spouse | |
---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | Female | NaN | NaN | Adanel | Men | NaN | Belemir |
1 | TA 2978 | February 26 ,3019 | Male | Dark (book) Light brown (movie) | NaN | Boromir | Men | NaN | NaN |
2 | NaN | March ,3019 | Male | NaN | NaN | Lagduf | Orcs | NaN | NaN |
3 | TA 280 | TA 515 | Male | NaN | NaN | Tarcil | Men | Arnor | Unnamed wife |
4 | NaN | NaN | Male | NaN | NaN | Fire-drake of Gondolin | Dragon | NaN | NaN |
We see right away that there are lots of NaN
’s. This is an empty field in our dataset. Some characters are mentioned but never given much more background than a name.
df.isnull().sum(axis = 0)
birth 207
death 315
gender 143
hair 734
height 813
name 0
race 140
realm 714
spouse 403
dtype: int64
There are null values in every column except name.
df.isnull().sum(axis = 1).value_counts().sort_index()
0 15
1 59
2 185
3 236
4 178
5 81
6 20
7 1
8 136
dtype: int64
Here we see that there are only 15 entries with all fields and 136 that are name only (since name was never blank!) Let’s look at just those characters.
df[~df.isnull().any(axis = 1)]
birth | death | gender | hair | height | name | race | realm | spouse | |
---|---|---|---|---|---|---|---|---|---|
125 | SA 3209 | TA 2 | Male | Black | Very tall almost 7'1 | Isildur | Men | Arnor,Gondor | Unnamed wife |
134 | YT, and perhaps firstborn | Still Alive | Male | Probably Golden | Tall | Ingwë | Elves | Valinor,Taniquetil | Unnamed wife |
166 | YT | FA 400 | Male | Dark | Tall | Eöl | Elves | Nan Elmoth | Aredhel |
186 | TA 2990 | FO 63 | Male | Dirty blond | Tall-6'6 | omer | Men | Rohan | Lothíriel after the War of the Ring |
194 | FA 532 | Still alive; departed to ,Aman, on ,September ... | Male | Dark | Tall | Elrond | Half-elven | Rivendell | Celebrían |
204 | SA 3119 | SA 3441 | Male | Brown | 7' 10" | Elendil | Men | Arnor,Gondor | Unnamed wife |
530 | YT | Still alive, departed over the sea in the earl... | Male | Silver | Tall | Celeborn | Elves | Eregion,Lothlórien,Caras Galadhon | Galadriel |
551 | Possibly pre First Age | Unknown; possibly still alive | Most likely male | None | Huge | Watcher in the Water | Urulóki | Doors of Durin | Most likely none |
579 | 3019 | February 293019 | Male | Dark (movie) | 6' 6" (movie) | Uglúk | Uruk-hai | Isengard | None |
620 | TA 2925 | TA 3007 | Male | Brown (film) | 1.76m / 5'9" (film) | Bain | Men | Dale | Unnamed wife |
686 | YT 1362 | Still alive: Departed over the sea on ,Septemb... | Female | Golden | Tall | Galadriel | Elves | Eregion,Lothlórien,Caras Galadhon | Celeborn |
692 | YT 1169 | YT 1497 | Male | Raven | Tall | Fëanor | Elves | Tirion,Formenos | Nerdanel |
795 | First Age | Presumably departed to ,Aman | Male | Golden | Tall | Thranduil | Elves | Woodland Realm,Mirkwood | Unnamed wife |
802 | YT 1050 | FA 502 | Male | Silver | Tallest of the Elven-folk, 8'2" | Thingol | Elves | Doriath | Melian |
873 | March 1 ,2931 | FO 120 | Male | Dark | 198cm (6'6") | Aragorn II Elessar | Men | Reunited Kingdom,Arnor,Gondor | Arwen |
Of course we could ask for just the ones with 8 null values.
df[df.isnull().sum(axis = 1) == 8].name
8 Angrim
14 Angelimar
17 Linda (Baggins) Proudfoot
18 Bodo Proudfoot
40 Tanta (Hornblower) Baggins
...
886 Andvír
891 Amlach
904 Aghan
905 Agathor
907 Aerandir
Name: name, Length: 136, dtype: object
I only included the names since the rest of the dataset was null!
Of course we can use this method to include only entries that have 4 or less null entries.
df[df.isnull().sum(axis = 1) <= 4]
birth | death | gender | hair | height | name | race | realm | spouse | |
---|---|---|---|---|---|---|---|---|---|
1 | TA 2978 | February 26 ,3019 | Male | Dark (book) Light brown (movie) | NaN | Boromir | Men | NaN | NaN |
3 | TA 280 | TA 515 | Male | NaN | NaN | Tarcil | Men | Arnor | Unnamed wife |
5 | SA 2709 | SA 2962 | Male | NaN | NaN | Ar-Adûnakhôr | Men | Númenor | Unnamed wife |
7 | YT | FA 455 | Male | Golden | NaN | Angrod | Elves | NaN | Eldalótë |
9 | SA 3219 | SA 3440 | Male | NaN | NaN | Anárion | Men | Gondor | Unnamed wife |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
903 | TA 2827 | TA 2932 | Male | NaN | NaN | Aglahad | Men | NaN | Unnamed wife |
906 | Mid ,First Age | FA 495 | Female | NaN | NaN | Aerin | Men | NaN | Brodda |
908 | YT during the ,Noontide of Valinor | FA 455 | Male | Golden | NaN | Aegnor | Elves | NaN | Loved ,Andreth but remained unmarried |
909 | TA 2917 | TA 3010 | Male | NaN | NaN | Adrahil II | Men | NaN | Unnamed wife |
910 | Before ,TA 1944 | Late ,Third Age | Male | NaN | NaN | Adrahil I | Men | NaN | NaN |
673 rows × 9 columns
Maybe we only want the characters whose realm has been included. We’ll negate the isnull()
command.
df[~df.realm.isnull()]
birth | death | gender | hair | height | name | race | realm | spouse | |
---|---|---|---|---|---|---|---|---|---|
3 | TA 280 | TA 515 | Male | NaN | NaN | Tarcil | Men | Arnor | Unnamed wife |
5 | SA 2709 | SA 2962 | Male | NaN | NaN | Ar-Adûnakhôr | Men | Númenor | Unnamed wife |
9 | SA 3219 | SA 3440 | Male | NaN | NaN | Anárion | Men | Gondor | Unnamed wife |
10 | SA 3118 | Still alive | Male | NaN | Tall | Ar-Pharazôn | Men | Númenor | Tar-Míriel |
11 | SA 2876 | SA 3102 | Male | NaN | NaN | Ar-Sakalthôr | Men | Númenor | Unnamed wife |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
890 | TA 726 | TA 946 | Male | NaN | NaN | Amlaith | Men | Arthedain | Unnamed wife |
892 | Sometime during ,Years of the Trees, or the ,F... | SA 3434 | Male | NaN | NaN | Amdír | Elves | Lórien | Unnamed wife |
898 | NaN | NaN | Female | NaN | NaN | Almarian | Men | Númenor | Tar-Meneldur |
900 | TA 2544 | TA 2645 | Male | NaN | NaN | Aldor | Men | Rohan | Unnamed wife |
901 | TA 1330 | TA 1540 | Male | NaN | NaN | Aldamir | Men | Gondor | Unnamed wife |
197 rows × 9 columns
Imputing#
The simplest method for filling in NaN
s is to just place a value there.
df.fillna(value = 0)
birth | death | gender | hair | height | name | race | realm | spouse | |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | Female | 0 | 0 | Adanel | Men | 0 | Belemir |
1 | TA 2978 | February 26 ,3019 | Male | Dark (book) Light brown (movie) | 0 | Boromir | Men | 0 | 0 |
2 | 0 | March ,3019 | Male | 0 | 0 | Lagduf | Orcs | 0 | 0 |
3 | TA 280 | TA 515 | Male | 0 | 0 | Tarcil | Men | Arnor | Unnamed wife |
4 | 0 | 0 | Male | 0 | 0 | Fire-drake of Gondolin | Dragon | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
906 | Mid ,First Age | FA 495 | Female | 0 | 0 | Aerin | Men | 0 | Brodda |
907 | 0 | 0 | 0 | 0 | 0 | Aerandir | 0 | 0 | 0 |
908 | YT during the ,Noontide of Valinor | FA 455 | Male | Golden | 0 | Aegnor | Elves | 0 | Loved ,Andreth but remained unmarried |
909 | TA 2917 | TA 3010 | Male | 0 | 0 | Adrahil II | Men | 0 | Unnamed wife |
910 | Before ,TA 1944 | Late ,Third Age | Male | 0 | 0 | Adrahil I | Men | 0 | 0 |
911 rows × 9 columns
You should note right away that some of these zeros make no sense. You might be more careful with your zeros.
df.height.fillna(value = 0)
0 0
1 0
2 0
3 0
4 0
..
906 0
907 0
908 0
909 0
910 0
Name: height, Length: 911, dtype: object
Or you might not want to skew the average so much. You could assign the mean if the remaining values were numerical. Unfortuantely these are mostly strings with little hope of converting to a numerical value.
df.height[~df.height.isnull()]
10 Tall
19 Tall
20 Tallest of the Elves of Gondolin
41 Tall
74 Large and immense
...
831 8'5
850 Tall
853 Tall
873 198cm (6'6")
881 As tall as a mountain
Name: height, Length: 98, dtype: object
We can also fill the empties by grabbing other values around our missing.
df.height.fillna(method= 'pad')
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
906 As tall as a mountain
907 As tall as a mountain
908 As tall as a mountain
909 As tall as a mountain
910 As tall as a mountain
Name: height, Length: 911, dtype: object
pad
took the last value and filled it forward. We can also go the otherway with bfill
df.height.fillna(method= 'bfill')
0 Tall
1 Tall
2 Tall
3 Tall
4 Tall
...
906 NaN
907 NaN
908 NaN
909 NaN
910 NaN
Name: height, Length: 911, dtype: object
Filling ing my mode is a little tricky as the mode returns an array rather than a single value. The code below changes all to height to the mode.
df.height.transform(lambda x: x.fillna(value = x.mode()[0]))
0 Tall
1 Tall
2 Tall
3 Tall
4 Tall
...
906 Tall
907 Tall
908 Tall
909 Tall
910 Tall
Name: height, Length: 911, dtype: object
Imputing by Category#
There is no quantitative data here so I actually have to work a little harder than I’d like. If height was just a number you’d run some code like
df.height.fillna(df.groupby('realm').height.transform('mean'))
To fill the NaNs with the mean from there group. To deal with the categories I’ll need to get the most frequent from category first.
df.groupby(['race']).height.agg(pa.Series.mode)
race
Ainur Varies
Ainur,Maiar []
Balrog []
Black Uruk 7'1
Dragon []
Dragons [As tall as a mountain, Gigantic]
Drúedain Short
Dwarf []
Dwarven []
Dwarves [4'5 - 5' (Estimate) , 4'5" (film)]
Eagle []
Eagles []
Elf []
Elves Tall
Elves,Maiar []
Elves,Noldor []
Ents Very tall
Ents,Onodrim 15'4
Goblin,Orc 8,4 Body weight = 190kg
God Varies
Great Eagles 30
Great Spiders [Enormous, Large and immense]
Half-elven Tall
Half-elven,Men []
Hobbit [1.06m (3'6"), 1.17m (3'10"), 1.2m (3'11"), 1....
Hobbits 1.22m (4'0")
Horse []
Maiar Various until
Maiar,Balrog Slightly larger and taller than a Man (book), ...
Maiar,Balrogs []
Men Tall
Men,Rohirrim []
Men,Skin-changer Tall (in Man-form)
Men,Undead Tall
Men,Wraith 7' 1" (2.13 metres)
Orc 5'9" - 6'4" (film)
Orc,Goblin []
Orcs [8'5, About nine feet (film)]
Raven []
Skin-changer Tall
Stone-trolls About 13'
Uruk-hai [6' 6" (movie), 6'1 (film)]
Uruk-hai,Orc medium
Urulóki Huge
Vampire []
Werewolves Gigantic
Wolfhound Horse-sized
Name: height, dtype: object
This is showing be that the most common height by each realm is mostly NaN. We could to get rid of all that to help this imputation.
dfrh = df[(~df.race.isna())&(~df.height.isna())]
dfrh.groupby(['race']).height.agg(pa.Series.mode)
race
Ainur Varies
Black Uruk 7'1
Dragons [As tall as a mountain, Gigantic]
Drúedain Short
Dwarves [4'5 - 5' (Estimate) , 4'5" (film)]
Elves Tall
Ents Very tall
Ents,Onodrim 15'4
Goblin,Orc 8,4 Body weight = 190kg
God Varies
Great Eagles 30
Great Spiders [Enormous, Large and immense]
Half-elven Tall
Hobbit [1.06m (3'6"), 1.17m (3'10"), 1.2m (3'11"), 1....
Hobbits 1.22m (4'0")
Maiar Various until
Maiar,Balrog Slightly larger and taller than a Man (book), ...
Men Tall
Men,Skin-changer Tall (in Man-form)
Men,Undead Tall
Men,Wraith 7' 1" (2.13 metres)
Orc 5'9" - 6'4" (film)
Orcs [8'5, About nine feet (film)]
Skin-changer Tall
Stone-trolls About 13'
Uruk-hai [6' 6" (movie), 6'1 (film)]
Uruk-hai,Orc medium
Urulóki Huge
Werewolves Gigantic
Wolfhound Horse-sized
Name: height, dtype: object
The next line of code is not working as intended, only changing the first of each category to the mode.
df.height.fillna(df.groupby('race').height.transform(lambda s: s.mode()))
0 Tall
1 NaN
2 NaN
3 NaN
4 NaN
...
906 NaN
907 NaN
908 NaN
909 NaN
910 NaN
Name: height, Length: 911, dtype: object
I believe this line of code does the same mistake but I leave it as another way to do the transformation and might be useful at some point.
df.groupby('race', sort=False).height.apply(lambda x: x.fillna(value = x.mode()))
0 Tall
1 NaN
2 NaN
3 NaN
4 NaN
...
903 NaN
906 NaN
908 NaN
909 NaN
910 NaN
Name: height, Length: 771, dtype: object
Below I am finally able to do the conversion. I’ll be honest in saying I don’t understand why this works but the pandas methods would not allow the transformation on the entire mode.
import numpy as np
df.height = df.height.fillna(df.groupby('race').height.transform(lambda x: next(iter(x.mode()), np.nan)))
df
birth | death | gender | hair | height | name | race | realm | spouse | |
---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | Female | NaN | Tall | Adanel | Men | NaN | Belemir |
1 | TA 2978 | February 26 ,3019 | Male | Dark (book) Light brown (movie) | Tall | Boromir | Men | NaN | NaN |
2 | NaN | March ,3019 | Male | NaN | 8'5 | Lagduf | Orcs | NaN | NaN |
3 | TA 280 | TA 515 | Male | NaN | Tall | Tarcil | Men | Arnor | Unnamed wife |
4 | NaN | NaN | Male | NaN | NaN | Fire-drake of Gondolin | Dragon | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
906 | Mid ,First Age | FA 495 | Female | NaN | Tall | Aerin | Men | NaN | Brodda |
907 | NaN | NaN | NaN | NaN | NaN | Aerandir | NaN | NaN | NaN |
908 | YT during the ,Noontide of Valinor | FA 455 | Male | Golden | Tall | Aegnor | Elves | NaN | Loved ,Andreth but remained unmarried |
909 | TA 2917 | TA 3010 | Male | NaN | Tall | Adrahil II | Men | NaN | Unnamed wife |
910 | Before ,TA 1944 | Late ,Third Age | Male | NaN | Tall | Adrahil I | Men | NaN | NaN |
911 rows × 9 columns
Lastly, I’ll demonstrate the entire dataset transforming by the mode when grouped by race.
df.fillna(df.groupby('race').transform(lambda x: next(iter(x.mode()), np.nan)))
birth | death | gender | hair | height | name | race | realm | spouse | |
---|---|---|---|---|---|---|---|---|---|
0 | Late ,Third Age | FA 473 | Female | Dark | Tall | Adanel | Men | Gondor | Belemir |
1 | TA 2978 | February 26 ,3019 | Male | Dark (book) Light brown (movie) | Tall | Boromir | Men | Gondor | Unnamed wife |
2 | NaN | March ,3019 | Male | Grey/white strands of hair (film) | 8'5 | Lagduf | Orcs | Moria,Mount Gundabad | NaN |
3 | TA 280 | TA 515 | Male | Dark | Tall | Tarcil | Men | Arnor | Unnamed wife |
4 | NaN | NaN | Male | NaN | NaN | Fire-drake of Gondolin | Dragon | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
906 | Mid ,First Age | FA 495 | Female | Dark | Tall | Aerin | Men | Gondor | Brodda |
907 | NaN | NaN | NaN | NaN | NaN | Aerandir | NaN | NaN | NaN |
908 | YT during the ,Noontide of Valinor | FA 455 | Male | Golden | Tall | Aegnor | Elves | Doriath | Loved ,Andreth but remained unmarried |
909 | TA 2917 | TA 3010 | Male | Dark | Tall | Adrahil II | Men | Gondor | Unnamed wife |
910 | Before ,TA 1944 | Late ,Third Age | Male | Dark | Tall | Adrahil I | Men | Gondor | Unnamed wife |
911 rows × 9 columns
This is a bit silly as the first person is dead before they are born!
Your Turn#
Check out the Air B&B dataset, https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv. Examine how many entries are null. Impute for atleast two columns that are null in an approriate fashion.
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv')
df.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2539 | Clean & quiet apt home by the park | 2787 | John | Brooklyn | Kensington | 40.64749 | -73.97237 | Private room | 149 | 1 | 9 | 2018-10-19 | 0.21 | 6 | 365 |
1 | 2595 | Skylit Midtown Castle | 2845 | Jennifer | Manhattan | Midtown | 40.75362 | -73.98377 | Entire home/apt | 225 | 1 | 45 | 2019-05-21 | 0.38 | 2 | 355 |
2 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! | 4632 | Elisabeth | Manhattan | Harlem | 40.80902 | -73.94190 | Private room | 150 | 3 | 0 | NaN | NaN | 1 | 365 |
3 | 3831 | Cozy Entire Floor of Brownstone | 4869 | LisaRoxanne | Brooklyn | Clinton Hill | 40.68514 | -73.95976 | Entire home/apt | 89 | 1 | 270 | 2019-07-05 | 4.64 | 1 | 194 |
4 | 5022 | Entire Apt: Spacious Studio/Loft by central park | 7192 | Laura | Manhattan | East Harlem | 40.79851 | -73.94399 | Entire home/apt | 80 | 10 | 9 | 2018-11-19 | 0.10 | 1 | 0 |