
Exam 2 Review#

Using the bank dataset from nurfnick/Data_Viz, answer the following questions.

  1. What are the datatypes for each column? Comment on the appropriateness and change any easy/obvious ones.

  2. Using regular expressions, remove the hyphen “-” from any job and replace it with a space.

  3. Convert any place where pdays is -1 to NaN.

  4. Fill the NaN in pdays with a zero.

  5. Compute the mean and median of balance when grouped by job.

  6. Add a column that is a datetime. Use year of 2020.

  7. Which day of the week (Monday, Tuesday, etc.) had the most approved loans (y = yes)?

  8. Create an indicator column that shows if the person has more than $500 in their account.

  9. Are divorced admin. people given loans at a higher rate?

  10. Which job is most likely to have a default on their record?

  11. Are highly educated people contacted by cellular phone more or less often than average?

  12. What is the marital status of the person with the maximum balance?

  13. Write another question based on your knowledge of the dataset.

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')

df.head(20)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no
5 35 management single tertiary no 747 no no cellular 23 feb 141 2 176 3 failure no
6 36 self-employed married tertiary no 307 yes no cellular 14 may 341 1 330 2 other no
7 39 technician married secondary no 147 yes no cellular 6 may 151 2 -1 0 unknown no
8 41 entrepreneur married tertiary no 221 yes no unknown 14 may 57 2 -1 0 unknown no
9 43 services married primary no -88 yes yes cellular 17 apr 313 1 147 2 failure no
10 39 services married secondary no 9374 yes no unknown 20 may 273 1 -1 0 unknown no
11 43 admin. married secondary no 264 yes no cellular 17 apr 113 2 -1 0 unknown no
12 36 technician married tertiary no 1109 no no cellular 13 aug 328 2 -1 0 unknown no
13 20 student single secondary no 502 no no cellular 30 apr 261 1 -1 0 unknown yes
14 31 blue-collar married secondary no 360 yes yes cellular 29 jan 89 1 241 1 failure no
15 40 management married tertiary no 194 no yes cellular 29 aug 189 2 -1 0 unknown no
16 56 technician married secondary no 4073 no no cellular 27 aug 239 5 -1 0 unknown no
17 37 admin. single tertiary no 2317 yes no cellular 20 apr 114 1 152 2 failure no
18 25 blue-collar single primary no -221 yes no unknown 23 may 250 1 -1 0 unknown no
19 31 services married secondary no 132 no no cellular 7 jul 148 1 152 1 other no

Data Types#

df.dtypes
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y            object
dtype: object

I’ll change the y column to boolean.

df.y = df.y=='yes'
df.dtypes
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
y              bool
dtype: object

Regular Expressions#

import re

re.sub(r"-"," ",df.job[6])
'self employed'

The above code successfully removes the ‘-’, so I just need to apply it to the entire column and not forget to overwrite the current column.

df.job.apply(lambda x: re.sub(r"-"," ",x))
0          unemployed
1            services
2          management
3          management
4         blue collar
            ...      
4516         services
4517    self employed
4518       technician
4519      blue collar
4520     entrepreneur
Name: job, Length: 4521, dtype: object
df.job = df.job.apply(lambda x: re.sub(r"-"," ",x))
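As an aside, the same cleanup can be done without `re` at all, using pandas’ built-in vectorized string methods. A minimal sketch on a stand-in series:

```python
import pandas as pa

# Pandas' vectorized .str.replace does the same job as apply + re.sub.
# Stand-in series here; on the real data this would be df.job.
jobs = pa.Series(['self-employed', 'blue-collar', 'management'])
cleaned = jobs.str.replace('-', ' ', regex=False)
print(list(cleaned))  # ['self employed', 'blue collar', 'management']
```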

Replace With NaN#

I needed numpy to do this. This was a silly exercise but still a nice thing to think about.

import numpy as np

df.pdays.replace(-1,np.nan)
0         NaN
1       339.0
2       330.0
3         NaN
4         NaN
        ...  
4516      NaN
4517      NaN
4518      NaN
4519    211.0
4520    249.0
Name: pdays, Length: 4521, dtype: float64
df.pdays = df.pdays.replace(-1,np.nan)

Fill NaN#

Now I immediately undo what I just did.

df.pdays = df.pdays.fillna(0)
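Since steps 3 and 4 cancel each other out, a single replace would land in the same place directly. A sketch on a stand-in series, which also shows the one dtype difference the NaN round trip introduces:

```python
import pandas as pa
import numpy as np

pdays = pa.Series([-1, 339, -1, 330])

# One step: -1 straight to 0, staying integer the whole way.
direct = pdays.replace(-1, 0)

# Two steps as above: the NaN intermediate forces a float64 column.
two_step = pdays.replace(-1, np.nan).fillna(0)
```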

Group By#

df.groupby('job').balance.agg(['mean','median','count','std'])
mean median count std
job
admin. 1226.736402 430.0 478 2370.119128
blue collar 1085.161734 408.5 946 2040.218220
entrepreneur 1645.125000 365.5 168 4441.303620
housemaid 2083.803571 296.5 112 4603.836647
management 1766.928793 577.0 969 3267.733077
retired 2319.191304 672.5 230 5846.379889
self employed 1392.409836 483.0 183 2479.640999
services 1103.956835 288.0 417 2445.239976
student 1543.821429 422.5 84 2579.886671
technician 1330.996094 434.5 768 2630.253390
unemployed 1089.421875 473.5 128 1692.267628
unknown 1501.710526 655.5 38 1957.258258

Dates#

This requires some conversion of the months to numerical.

monthConvert = {
    'jan':1,
    'feb':2,
    'mar':3,
    'apr':4,
    'may':5,
    'jun':6,
    'jul':7,
    'aug':8,
    'sep':9,
    'oct':10,
    'nov':11,
    'dec':12
}

This next line actually creates the datetime column.

date = pa.to_datetime(dict(year = 2020, day = df.day, month = df.month.map(monthConvert)))

I’ll add it to my dataframe. One way to add a new column is concat with a DataFrame created from the series.

I struggled in class to do this because I needed a dictionary with the column name…

df = pa.concat([df,pa.DataFrame( {'date':date})],axis = 1)

df.head()
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y date
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 0.0 0 unknown False 2020-10-19
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339.0 4 failure False 2020-05-11
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330.0 1 failure False 2020-04-16
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 0.0 0 unknown False 2020-06-03
4 59 blue collar married secondary no 0 yes no unknown 5 may 226 1 0.0 0 unknown False 2020-05-05
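For what it’s worth, concat is not the only way in: plain column assignment also adds the series, and it is the shorter idiom. A sketch on a small stand-in frame:

```python
import pandas as pa

# Stand-in frame; on the real data this would be df with its day/month columns.
sketch = pa.DataFrame({'day': [19, 11], 'month': [10, 5]})
date = pa.to_datetime(dict(year=2020, day=sketch.day, month=sketch.month))

# Same result as pa.concat([sketch, pa.DataFrame({'date': date})], axis=1)
sketch['date'] = date
```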

Day of the Week#

Here is one way to do it.

pa.crosstab(df.date.dt.day_name(),df.y)
y False True
date
Friday 705 81
Monday 295 70
Saturday 491 36
Sunday 77 29
Thursday 846 105
Tuesday 753 107
Wednesday 833 93

Another way might be with groupby.

df.groupby([df.date.dt.day_name(),df.y]).age.agg('count')
           y    
Friday     False    705
           True      81
Monday     False    295
           True      70
Saturday   False    491
           True      36
Sunday     False     77
           True      29
Thursday   False    846
           True     105
Tuesday    False    753
           True     107
Wednesday  False    833
           True      93
Name: age, dtype: int64

I think you could look at this by doing some subsetting of the data too.

df[df.y == True].groupby(df.date.dt.day_name()).y.agg('count')
date
Friday        81
Monday        70
Saturday      36
Sunday        29
Thursday     105
Tuesday      107
Wednesday     93
Name: y, dtype: int64

This one was rather difficult. Finding the day names was what I was after!
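To pull the answer straight out of the table rather than eyeballing it, `idxmax` on the yes counts names the winning day. A sketch using the counts copied from the output above; on the full frame this would be `pa.crosstab(df.date.dt.day_name(), df.y)[True].idxmax()`:

```python
import pandas as pa

# Yes counts per weekday, copied from the crosstab output above.
approved = pa.Series({'Friday': 81, 'Monday': 70, 'Saturday': 36,
                      'Sunday': 29, 'Thursday': 105, 'Tuesday': 107,
                      'Wednesday': 93})
approved.idxmax()  # 'Tuesday', with 107 approvals
```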

Indicator#

bal = df.balance > 500
pa.concat([df,pa.DataFrame({'FiveHundo':bal})], axis = 1)
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y date FiveHundo
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 0.0 0 unknown False 2020-10-19 True
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339.0 4 failure False 2020-05-11 True
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330.0 1 failure False 2020-04-16 True
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 0.0 0 unknown False 2020-06-03 True
4 59 blue collar married secondary no 0 yes no unknown 5 may 226 1 0.0 0 unknown False 2020-05-05 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4516 33 services married secondary no -333 yes no cellular 30 jul 329 5 0.0 0 unknown False 2020-07-30 False
4517 57 self employed married tertiary yes -3313 yes yes unknown 9 may 153 1 0.0 0 unknown False 2020-05-09 False
4518 57 technician married secondary no 295 no no cellular 19 aug 151 11 0.0 0 unknown False 2020-08-19 False
4519 28 blue collar married secondary no 1137 no no cellular 6 feb 129 4 211.0 3 other False 2020-02-06 True
4520 44 entrepreneur single tertiary no 1136 yes yes cellular 3 apr 345 2 249.0 7 other False 2020-04-03 True

4521 rows × 19 columns
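One caution: the concat above was only displayed, never assigned back, so `df` itself never gains the column. Direct assignment both builds and stores the indicator; a sketch on a stand-in frame:

```python
import pandas as pa

# Stand-in frame; on the real data this would be df.
sketch = pa.DataFrame({'balance': [1787, 0, -333]})
sketch['FiveHundo'] = sketch.balance > 500  # indicator lands in the frame
```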

Subsetting Data#

df[(df.job == 'admin.')&(df.marital == 'divorced')]
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y date
35 42 admin. divorced secondary no 1811 yes no unknown 14 may 150 1 0.0 0 unknown False 2020-05-14
80 27 admin. divorced secondary no 451 yes no cellular 16 jul 652 1 0.0 0 unknown True 2020-07-16
120 31 admin. divorced secondary no 1890 yes no cellular 21 jul 588 1 0.0 0 unknown False 2020-07-21
183 40 admin. divorced secondary no 6 no no unknown 11 jun 140 1 0.0 0 unknown False 2020-06-11
283 42 admin. divorced secondary no 63 no no unknown 16 may 88 6 0.0 0 unknown False 2020-05-16
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4128 34 admin. divorced secondary no 1268 yes no cellular 15 may 259 3 0.0 0 unknown False 2020-05-15
4244 39 admin. divorced secondary no 83 yes no cellular 30 jul 69 6 0.0 0 unknown False 2020-07-30
4282 34 admin. divorced secondary no -251 no no cellular 18 jul 641 1 0.0 0 unknown True 2020-07-18
4407 45 admin. divorced secondary no 59 yes no cellular 24 jul 873 8 0.0 0 unknown True 2020-07-24
4438 52 admin. divorced secondary no 98 no yes unknown 19 jun 148 1 0.0 0 unknown False 2020-06-19

69 rows × 18 columns

Here are those folks! Let’s see what their stats are for getting approved.

df[(df.job == 'admin.')&(df.marital == 'divorced')].groupby('y').age.agg('count')
y
False    58
True     11
Name: age, dtype: int64
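To actually answer “at a higher rate?”, the counts need to become rates and be compared against the overall rate. A quick back-of-the-envelope check, using numbers copied from the outputs above (521 is the sum of the True column in the day-of-week table):

```python
# 11 approvals out of 69 divorced admin. people, from the groupby above.
divorced_admin_rate = 11 / 69       # about 15.9%

# 521 approvals out of 4521 rows overall (True column of the weekday table).
overall_rate = 521 / 4521           # about 11.5%

divorced_admin_rate > overall_rate  # True: they are approved more often
```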

Another Group By?#

df.groupby(['job','default']).age.agg('count')
job            default
admin.         no         472
               yes          6
blue collar    no         932
               yes         14
entrepreneur   no         161
               yes          7
housemaid      no         110
               yes          2
management     no         955
               yes         14
retired        no         227
               yes          3
self employed  no         179
               yes          4
services       no         410
               yes          7
student        no          83
               yes          1
technician     no         753
               yes         15
unemployed     no         125
               yes          3
unknown        no          38
Name: age, dtype: int64
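Raw counts favor the big job categories (technician’s 15 defaults beat entrepreneur’s 7, but there are far more technicians). Dividing defaults by group size gives the rate that the question is really after; the counts below are copied from the output above. On the full frame, `pa.crosstab(df.job, df.default, normalize='index')` would produce the same proportions.

```python
import pandas as pa

# Default counts and group sizes, copied from the groupby output above.
yes = pa.Series({'admin.': 6, 'blue collar': 14, 'entrepreneur': 7,
                 'housemaid': 2, 'management': 14, 'retired': 3,
                 'self employed': 4, 'services': 7, 'student': 1,
                 'technician': 15, 'unemployed': 3, 'unknown': 0})
total = pa.Series({'admin.': 478, 'blue collar': 946, 'entrepreneur': 168,
                   'housemaid': 112, 'management': 969, 'retired': 230,
                   'self employed': 183, 'services': 417, 'student': 84,
                   'technician': 768, 'unemployed': 128, 'unknown': 38})
rate = yes / total
rate.idxmax()  # 'entrepreneur', at about a 4.2% default rate
```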

Two Way Tables#

Below is how I looked at number 11 in class.

df.groupby(['education','contact']).age.agg('count')
education  contact  
primary    cellular      343
           telephone      75
           unknown       260
secondary  cellular     1446
           telephone     141
           unknown       719
tertiary   cellular     1012
           telephone      66
           unknown       272
unknown    cellular       95
           telephone      19
           unknown        73
Name: age, dtype: int64

Below is a nicer way to create a two-way table, with pandas.crosstab.

pa.crosstab(df.education,df.contact)
contact cellular telephone unknown
education
primary 343 75 260
secondary 1446 141 719
tertiary 1012 66 272
unknown 95 19 73
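Counts alone still don’t settle the more-or-less-than-average question; row proportions do. A sketch with the counts copied from the crosstab above (on the full frame, `pa.crosstab(df.education, df.contact, normalize='index')` gives the same proportions):

```python
import pandas as pa

# Counts copied from the crosstab output above.
counts = pa.DataFrame(
    {'cellular': [343, 1446, 1012, 95],
     'telephone': [75, 141, 66, 19],
     'unknown': [260, 719, 272, 73]},
    index=['primary', 'secondary', 'tertiary', 'unknown'])

# Row proportions: share of each education level reached by each method.
props = counts.div(counts.sum(axis=1), axis=0)

overall = counts['cellular'].sum() / counts.values.sum()
# tertiary is about 75% cellular vs about 64% overall: more than average.
props.loc['tertiary', 'cellular'] > overall
```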

Maximum#

df[df.balance == max(df.balance)].marital
3700    married
Name: marital, dtype: object
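An equivalent route, for comparison: `idxmax` returns the index label of the largest balance, and `.loc` then picks out the marital status in one line. A sketch on a stand-in frame with made-up balances:

```python
import pandas as pa

# Stand-in frame with invented numbers; on the real data this would be
# df.loc[df.balance.idxmax(), 'marital'].
sketch = pa.DataFrame({'balance': [1787, 42045, 502],
                       'marital': ['married', 'married', 'single']})
sketch.loc[sketch.balance.idxmax(), 'marital']  # 'married'
```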