Missing and Incomplete

Open In Colab

Missing and Incomplete#

Often datasets will be missing entries. There are many approaches we can take to dealing with these errors and omissions. I will examine a dataset on the characters from The Lord of The Rings

Finding NaN’s#

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/lotr_characters.csv')

df.head()
birth death gender hair height name race realm spouse
0 NaN NaN Female NaN NaN Adanel Men NaN Belemir
1 TA 2978 February 26 ,3019 Male Dark (book) Light brown (movie) NaN Boromir Men NaN NaN
2 NaN March ,3019 Male NaN NaN Lagduf Orcs NaN NaN
3 TA 280 TA 515 Male NaN NaN Tarcil Men Arnor Unnamed wife
4 NaN NaN Male NaN NaN Fire-drake of Gondolin Dragon NaN NaN

We see right away that there are lots of NaN’s. This is an empty field in our dataset. Some characters are mentioned but never given much more background than a name.

df.isnull().sum(axis = 0)
birth     207
death     315
gender    143
hair      734
height    813
name        0
race      140
realm     714
spouse    403
dtype: int64

There are null values in every column except name.

df.isnull().sum(axis = 1).value_counts().sort_index()
0     15
1     59
2    185
3    236
4    178
5     81
6     20
7      1
8    136
dtype: int64

Here we see that there are only 15 entries with all fields and 136 that are name only (since name was never blank!) Let’s look at just those characters.

df[~df.isnull().any(axis = 1)]
birth death gender hair height name race realm spouse
125 SA 3209 TA 2 Male Black Very tall almost 7'1 Isildur Men Arnor,Gondor Unnamed wife
134 YT, and perhaps firstborn Still Alive Male Probably Golden Tall Ingwë Elves Valinor,Taniquetil Unnamed wife
166 YT FA 400 Male Dark Tall Eöl Elves Nan Elmoth Aredhel
186 TA 2990 FO 63 Male Dirty blond Tall-6'6 omer Men Rohan Lothíriel after the War of the Ring
194 FA 532 Still alive; departed to ,Aman, on ,September ... Male Dark Tall Elrond Half-elven Rivendell Celebrían
204 SA 3119 SA 3441 Male Brown 7' 10" Elendil Men Arnor,Gondor Unnamed wife
530 YT Still alive, departed over the sea in the earl... Male Silver Tall Celeborn Elves Eregion,Lothlórien,Caras Galadhon Galadriel
551 Possibly pre First Age Unknown; possibly still alive Most likely male None Huge Watcher in the Water Urulóki Doors of Durin Most likely none
579 3019 February 293019 Male Dark (movie) 6' 6" (movie) Uglúk Uruk-hai Isengard None
620 TA 2925 TA 3007 Male Brown (film) 1.76m / 5'9" (film) Bain Men Dale Unnamed wife
686 YT 1362 Still alive: Departed over the sea on ,Septemb... Female Golden Tall Galadriel Elves Eregion,Lothlórien,Caras Galadhon Celeborn
692 YT 1169 YT 1497 Male Raven Tall Fëanor Elves Tirion,Formenos Nerdanel
795 First Age Presumably departed to ,Aman Male Golden Tall Thranduil Elves Woodland Realm,Mirkwood Unnamed wife
802 YT 1050 FA 502 Male Silver Tallest of the Elven-folk, 8'2" Thingol Elves Doriath Melian
873 March 1 ,2931 FO 120 Male Dark 198cm (6'6") Aragorn II Elessar Men Reunited Kingdom,Arnor,Gondor Arwen

Of course we could ask for just the ones with 8 null values.

df[df.isnull().sum(axis = 1) == 8].name
8                          Angrim
14                      Angelimar
17      Linda (Baggins) Proudfoot
18                 Bodo Proudfoot
40     Tanta (Hornblower) Baggins
                  ...            
886                        Andvír
891                        Amlach
904                         Aghan
905                       Agathor
907                      Aerandir
Name: name, Length: 136, dtype: object

I only included the names since the rest of the dataset was null!

Of course we can use this method to include only entries that have 4 or less null entries.

df[df.isnull().sum(axis = 1) <= 4]
birth death gender hair height name race realm spouse
1 TA 2978 February 26 ,3019 Male Dark (book) Light brown (movie) NaN Boromir Men NaN NaN
3 TA 280 TA 515 Male NaN NaN Tarcil Men Arnor Unnamed wife
5 SA 2709 SA 2962 Male NaN NaN Ar-Adûnakhôr Men Númenor Unnamed wife
7 YT FA 455 Male Golden NaN Angrod Elves NaN Eldalótë
9 SA 3219 SA 3440 Male NaN NaN Anárion Men Gondor Unnamed wife
... ... ... ... ... ... ... ... ... ...
903 TA 2827 TA 2932 Male NaN NaN Aglahad Men NaN Unnamed wife
906 Mid ,First Age FA 495 Female NaN NaN Aerin Men NaN Brodda
908 YT during the ,Noontide of Valinor FA 455 Male Golden NaN Aegnor Elves NaN Loved ,Andreth but remained unmarried
909 TA 2917 TA 3010 Male NaN NaN Adrahil II Men NaN Unnamed wife
910 Before ,TA 1944 Late ,Third Age Male NaN NaN Adrahil I Men NaN NaN

673 rows × 9 columns

Maybe we only want the characters whose realm has been included. We’ll negate the isnull() command.

df[~df.realm.isnull()]
birth death gender hair height name race realm spouse
3 TA 280 TA 515 Male NaN NaN Tarcil Men Arnor Unnamed wife
5 SA 2709 SA 2962 Male NaN NaN Ar-Adûnakhôr Men Númenor Unnamed wife
9 SA 3219 SA 3440 Male NaN NaN Anárion Men Gondor Unnamed wife
10 SA 3118 Still alive Male NaN Tall Ar-Pharazôn Men Númenor Tar-Míriel
11 SA 2876 SA 3102 Male NaN NaN Ar-Sakalthôr Men Númenor Unnamed wife
... ... ... ... ... ... ... ... ... ...
890 TA 726 TA 946 Male NaN NaN Amlaith Men Arthedain Unnamed wife
892 Sometime during ,Years of the Trees, or the ,F... SA 3434 Male NaN NaN Amdír Elves Lórien Unnamed wife
898 NaN NaN Female NaN NaN Almarian Men Númenor Tar-Meneldur
900 TA 2544 TA 2645 Male NaN NaN Aldor Men Rohan Unnamed wife
901 TA 1330 TA 1540 Male NaN NaN Aldamir Men Gondor Unnamed wife

197 rows × 9 columns

Imputing#

The simplest method for filling in NaNs is to just place a value there.

df.fillna(value = 0)
birth death gender hair height name race realm spouse
0 0 0 Female 0 0 Adanel Men 0 Belemir
1 TA 2978 February 26 ,3019 Male Dark (book) Light brown (movie) 0 Boromir Men 0 0
2 0 March ,3019 Male 0 0 Lagduf Orcs 0 0
3 TA 280 TA 515 Male 0 0 Tarcil Men Arnor Unnamed wife
4 0 0 Male 0 0 Fire-drake of Gondolin Dragon 0 0
... ... ... ... ... ... ... ... ... ...
906 Mid ,First Age FA 495 Female 0 0 Aerin Men 0 Brodda
907 0 0 0 0 0 Aerandir 0 0 0
908 YT during the ,Noontide of Valinor FA 455 Male Golden 0 Aegnor Elves 0 Loved ,Andreth but remained unmarried
909 TA 2917 TA 3010 Male 0 0 Adrahil II Men 0 Unnamed wife
910 Before ,TA 1944 Late ,Third Age Male 0 0 Adrahil I Men 0 0

911 rows × 9 columns

You should note right away that some of these zeros make no sense. You might be more careful with your zeros.

df.height.fillna(value = 0)
0      0
1      0
2      0
3      0
4      0
      ..
906    0
907    0
908    0
909    0
910    0
Name: height, Length: 911, dtype: object

Or you might not want to skew the average so much. You could assign the mean if the remaining values were numerical. Unfortuantely these are mostly strings with little hope of converting to a numerical value.

df.height[~df.height.isnull()]
10                                 Tall
19                                 Tall
20     Tallest of the Elves of Gondolin
41                                 Tall
74                    Large and immense
                     ...               
831                                 8'5
850                                Tall
853                                Tall
873                        198cm (6'6")
881               As tall as a mountain
Name: height, Length: 98, dtype: object

We can also fill the empties by grabbing other values around our missing.

df.height.fillna(method= 'pad')
0                        NaN
1                        NaN
2                        NaN
3                        NaN
4                        NaN
               ...          
906    As tall as a mountain
907    As tall as a mountain
908    As tall as a mountain
909    As tall as a mountain
910    As tall as a mountain
Name: height, Length: 911, dtype: object

pad took the last value and filled it forward. We can also go the otherway with bfill

df.height.fillna(method= 'bfill')
0      Tall
1      Tall
2      Tall
3      Tall
4      Tall
       ... 
906     NaN
907     NaN
908     NaN
909     NaN
910     NaN
Name: height, Length: 911, dtype: object

Filling ing my mode is a little tricky as the mode returns an array rather than a single value. The code below changes all to height to the mode.


df.height.transform(lambda x: x.fillna(value = x.mode()[0]))
0      Tall
1      Tall
2      Tall
3      Tall
4      Tall
       ... 
906    Tall
907    Tall
908    Tall
909    Tall
910    Tall
Name: height, Length: 911, dtype: object

Imputing by Category#

There is no quantitative data here so I actually have to work a little harder than I’d like. If height was just a number you’d run some code like

df.height.fillna(df.groupby('realm').height.transform('mean'))

To fill the NaNs with the mean from there group. To deal with the categories I’ll need to get the most frequent from category first.

df.groupby(['race']).height.agg(pa.Series.mode)
race
Ainur                                                          Varies
Ainur,Maiar                                                        []
Balrog                                                             []
Black Uruk                                                        7'1
Dragon                                                             []
Dragons                             [As tall as a mountain, Gigantic]
Drúedain                                                        Short
Dwarf                                                              []
Dwarven                                                            []
Dwarves                           [4'5 - 5' (Estimate) , 4'5" (film)]
Eagle                                                              []
Eagles                                                             []
Elf                                                                []
Elves                                                            Tall
Elves,Maiar                                                        []
Elves,Noldor                                                       []
Ents                                                        Very tall
Ents,Onodrim                                                     15'4
Goblin,Orc                                   8,4  Body weight = 190kg
God                                                            Varies
Great Eagles                                                       30
Great Spiders                           [Enormous, Large and immense]
Half-elven                                                       Tall
Half-elven,Men                                                     []
Hobbit              [1.06m (3'6"), 1.17m (3'10"), 1.2m (3'11"), 1....
Hobbits                                                  1.22m (4'0")
Horse                                                              []
Maiar                                                  Various until 
Maiar,Balrog        Slightly larger and taller than a Man (book), ...
Maiar,Balrogs                                                      []
Men                                                              Tall
Men,Rohirrim                                                       []
Men,Skin-changer                                   Tall (in Man-form)
Men,Undead                                                       Tall
Men,Wraith                                        7' 1" (2.13 metres)
Orc                                                5'9" - 6'4" (film)
Orc,Goblin                                                         []
Orcs                                    [8'5, About nine feet (film)]
Raven                                                              []
Skin-changer                                                     Tall
Stone-trolls                                                About 13'
Uruk-hai                                  [6' 6" (movie), 6'1 (film)]
Uruk-hai,Orc                                                   medium
Urulóki                                                          Huge
Vampire                                                            []
Werewolves                                                   Gigantic
Wolfhound                                                 Horse-sized
Name: height, dtype: object

This is showing be that the most common height by each realm is mostly NaN. We could to get rid of all that to help this imputation.

dfrh = df[(~df.race.isna())&(~df.height.isna())]

dfrh.groupby(['race']).height.agg(pa.Series.mode)
race
Ainur                                                          Varies
Black Uruk                                                        7'1
Dragons                             [As tall as a mountain, Gigantic]
Drúedain                                                        Short
Dwarves                           [4'5 - 5' (Estimate) , 4'5" (film)]
Elves                                                            Tall
Ents                                                        Very tall
Ents,Onodrim                                                     15'4
Goblin,Orc                                   8,4  Body weight = 190kg
God                                                            Varies
Great Eagles                                                       30
Great Spiders                           [Enormous, Large and immense]
Half-elven                                                       Tall
Hobbit              [1.06m (3'6"), 1.17m (3'10"), 1.2m (3'11"), 1....
Hobbits                                                  1.22m (4'0")
Maiar                                                  Various until 
Maiar,Balrog        Slightly larger and taller than a Man (book), ...
Men                                                              Tall
Men,Skin-changer                                   Tall (in Man-form)
Men,Undead                                                       Tall
Men,Wraith                                        7' 1" (2.13 metres)
Orc                                                5'9" - 6'4" (film)
Orcs                                    [8'5, About nine feet (film)]
Skin-changer                                                     Tall
Stone-trolls                                                About 13'
Uruk-hai                                  [6' 6" (movie), 6'1 (film)]
Uruk-hai,Orc                                                   medium
Urulóki                                                          Huge
Werewolves                                                   Gigantic
Wolfhound                                                 Horse-sized
Name: height, dtype: object

The next line of code is not working as intended, only changing the first of each category to the mode.

df.height.fillna(df.groupby('race').height.transform(lambda s: s.mode()))
0      Tall
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
906     NaN
907     NaN
908     NaN
909     NaN
910     NaN
Name: height, Length: 911, dtype: object

I believe this line of code does the same mistake but I leave it as another way to do the transformation and might be useful at some point.

df.groupby('race', sort=False).height.apply(lambda x: x.fillna(value = x.mode()))

0      Tall
1       NaN
2       NaN
3       NaN
4       NaN
       ... 
903     NaN
906     NaN
908     NaN
909     NaN
910     NaN
Name: height, Length: 771, dtype: object

Below I am finally able to do the conversion. I’ll be honest in saying I don’t understand why this works but the pandas methods would not allow the transformation on the entire mode.

import numpy as np

df.height = df.height.fillna(df.groupby('race').height.transform(lambda x: next(iter(x.mode()), np.nan)))

df
birth death gender hair height name race realm spouse
0 NaN NaN Female NaN Tall Adanel Men NaN Belemir
1 TA 2978 February 26 ,3019 Male Dark (book) Light brown (movie) Tall Boromir Men NaN NaN
2 NaN March ,3019 Male NaN 8'5 Lagduf Orcs NaN NaN
3 TA 280 TA 515 Male NaN Tall Tarcil Men Arnor Unnamed wife
4 NaN NaN Male NaN NaN Fire-drake of Gondolin Dragon NaN NaN
... ... ... ... ... ... ... ... ... ...
906 Mid ,First Age FA 495 Female NaN Tall Aerin Men NaN Brodda
907 NaN NaN NaN NaN NaN Aerandir NaN NaN NaN
908 YT during the ,Noontide of Valinor FA 455 Male Golden Tall Aegnor Elves NaN Loved ,Andreth but remained unmarried
909 TA 2917 TA 3010 Male NaN Tall Adrahil II Men NaN Unnamed wife
910 Before ,TA 1944 Late ,Third Age Male NaN Tall Adrahil I Men NaN NaN

911 rows × 9 columns

Lastly, I’ll demonstrate the entire dataset transforming by the mode when grouped by race.

df.fillna(df.groupby('race').transform(lambda x: next(iter(x.mode()), np.nan)))
birth death gender hair height name race realm spouse
0 Late ,Third Age FA 473 Female Dark Tall Adanel Men Gondor Belemir
1 TA 2978 February 26 ,3019 Male Dark (book) Light brown (movie) Tall Boromir Men Gondor Unnamed wife
2 NaN March ,3019 Male Grey/white strands of hair (film) 8'5 Lagduf Orcs Moria,Mount Gundabad NaN
3 TA 280 TA 515 Male Dark Tall Tarcil Men Arnor Unnamed wife
4 NaN NaN Male NaN NaN Fire-drake of Gondolin Dragon NaN NaN
... ... ... ... ... ... ... ... ... ...
906 Mid ,First Age FA 495 Female Dark Tall Aerin Men Gondor Brodda
907 NaN NaN NaN NaN NaN Aerandir NaN NaN NaN
908 YT during the ,Noontide of Valinor FA 455 Male Golden Tall Aegnor Elves Doriath Loved ,Andreth but remained unmarried
909 TA 2917 TA 3010 Male Dark Tall Adrahil II Men Gondor Unnamed wife
910 Before ,TA 1944 Late ,Third Age Male Dark Tall Adrahil I Men Gondor Unnamed wife

911 rows × 9 columns

This is a bit silly as the first person is dead before they are born!

Your Turn#

Check out the Air B&B dataset, https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv. Examine how many entries are null. Impute for atleast two columns that are null in an approriate fashion.

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/AB_NYC_2019.csv')

df.head()
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0