Exam 2 Review#
Utilizing the bank dataset from nurfnick/Data_Viz, answer the following questions.
What are the datatypes for each column? Comment on their appropriateness and change any easy/obvious ones.
Using regular expressions, remove the hyphen “-” from any job and replace it with a space.
Convert any place where pdays is -1 to NaN.
Fill the NaN in pdays with a zero.
Compute the mean and median of balance when grouped by job.
Add a column that is a datetime. Use year of 2020.
Which day of the week (Monday, Tuesday, etc.) had the most approved loans (y = yes)?
Create an indicator column that shows if the person has more than $500 in their account.
Are divorced, admin people given loans at a higher rate?
Which job is most likely to have a default on their record?
Do highly educated people have more or fewer cellular phones than average?
What is the marital status of the person with the maximum balance?
Write another question based on your knowledge of the dataset.
import pandas as pa
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')
df.head(20)
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
5 | 35 | management | single | tertiary | no | 747 | no | no | cellular | 23 | feb | 141 | 2 | 176 | 3 | failure | no |
6 | 36 | self-employed | married | tertiary | no | 307 | yes | no | cellular | 14 | may | 341 | 1 | 330 | 2 | other | no |
7 | 39 | technician | married | secondary | no | 147 | yes | no | cellular | 6 | may | 151 | 2 | -1 | 0 | unknown | no |
8 | 41 | entrepreneur | married | tertiary | no | 221 | yes | no | unknown | 14 | may | 57 | 2 | -1 | 0 | unknown | no |
9 | 43 | services | married | primary | no | -88 | yes | yes | cellular | 17 | apr | 313 | 1 | 147 | 2 | failure | no |
10 | 39 | services | married | secondary | no | 9374 | yes | no | unknown | 20 | may | 273 | 1 | -1 | 0 | unknown | no |
11 | 43 | admin. | married | secondary | no | 264 | yes | no | cellular | 17 | apr | 113 | 2 | -1 | 0 | unknown | no |
12 | 36 | technician | married | tertiary | no | 1109 | no | no | cellular | 13 | aug | 328 | 2 | -1 | 0 | unknown | no |
13 | 20 | student | single | secondary | no | 502 | no | no | cellular | 30 | apr | 261 | 1 | -1 | 0 | unknown | yes |
14 | 31 | blue-collar | married | secondary | no | 360 | yes | yes | cellular | 29 | jan | 89 | 1 | 241 | 1 | failure | no |
15 | 40 | management | married | tertiary | no | 194 | no | yes | cellular | 29 | aug | 189 | 2 | -1 | 0 | unknown | no |
16 | 56 | technician | married | secondary | no | 4073 | no | no | cellular | 27 | aug | 239 | 5 | -1 | 0 | unknown | no |
17 | 37 | admin. | single | tertiary | no | 2317 | yes | no | cellular | 20 | apr | 114 | 1 | 152 | 2 | failure | no |
18 | 25 | blue-collar | single | primary | no | -221 | yes | no | unknown | 23 | may | 250 | 1 | -1 | 0 | unknown | no |
19 | 31 | services | married | secondary | no | 132 | no | no | cellular | 7 | jul | 148 | 1 | 152 | 1 | other | no |
Data Types#
df.dtypes
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object
I’ll change the y column to boolean.
df.y = df.y=='yes'
df.dtypes
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y bool
dtype: object
Regular Expressions#
import re
re.sub(r"-"," ",df.job[6])
'self employed'
The above code successfully removes the ‘-’, so I just need to apply it to the entire column and remember to overwrite the current column.
df.job.apply(lambda x: re.sub(r"-"," ",x))
0 unemployed
1 services
2 management
3 management
4 blue collar
...
4516 services
4517 self employed
4518 technician
4519 blue collar
4520 entrepreneur
Name: job, Length: 4521, dtype: object
df.job = df.job.apply(lambda x: re.sub(r"-"," ",x))
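For what it’s worth, the same replacement works without apply; here is a small sketch using pandas’ vectorized string method, equivalent to the lambda above.
# Equivalent to the apply above: the vectorized string accessor.
df.job.str.replace("-", " ", regex=True)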
Replace With NaN#
I needed numpy to do this. This was a silly exercise but still a nice thing to think about.
import numpy as np
df.pdays.replace(-1,np.nan)
0 NaN
1 339.0
2 330.0
3 NaN
4 NaN
...
4516 NaN
4517 NaN
4518 NaN
4519 211.0
4520 249.0
Name: pdays, Length: 4521, dtype: float64
df.pdays = df.pdays.replace(-1,np.nan)
Fill NaN#
Now I immediately undo what I just did.
df.pdays = df.pdays.fillna(0)
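Since the fill immediately undoes the replacement, the two steps could also be collapsed into one; a minimal sketch that is equivalent on the original integer column.
# One step instead of two: send the -1 sentinel straight to 0.
df.pdays.replace(-1, 0)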
Group By#
df.groupby('job').balance.agg(['mean','median','count','std'])
| job | mean | median | count | std |
|---|---|---|---|---|
admin. | 1226.736402 | 430.0 | 478 | 2370.119128 |
blue collar | 1085.161734 | 408.5 | 946 | 2040.218220 |
entrepreneur | 1645.125000 | 365.5 | 168 | 4441.303620 |
housemaid | 2083.803571 | 296.5 | 112 | 4603.836647 |
management | 1766.928793 | 577.0 | 969 | 3267.733077 |
retired | 2319.191304 | 672.5 | 230 | 5846.379889 |
self employed | 1392.409836 | 483.0 | 183 | 2479.640999 |
services | 1103.956835 | 288.0 | 417 | 2445.239976 |
student | 1543.821429 | 422.5 | 84 | 2579.886671 |
technician | 1330.996094 | 434.5 | 768 | 2630.253390 |
unemployed | 1089.421875 | 473.5 | 128 | 1692.267628 |
unknown | 1501.710526 | 655.5 | 38 | 1957.258258 |
Dates#
One approach is to convert the month abbreviations to numbers.
monthConvert = {
'jan':1,
'feb':2,
'mar':3,
'apr':4,
'may':5,
'jun':6,
'jul':7,
'aug':8,
'sep':9,
'oct':10,
'nov':11,
'dec':12
}
This next line actually creates the datetime column.
date = pa.to_datetime(dict(year = 2020, day = df.day, month = df.month.map(monthConvert)))
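As an aside, to_datetime can also parse the three-letter month abbreviations itself via a format string, which sidesteps the lookup dictionary; a sketch, assuming the abbreviations are capitalized first so %b matches them.
# Alternative: build date strings and let %b parse the month abbreviation.
pa.to_datetime('2020-' + df.month.str.capitalize() + '-' + df.day.astype(str), format='%Y-%b-%d')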
I’ll add it to my dataframe. One way to add a new column is with concat, creating a DataFrame from the Series. I struggled in class to do this because I needed a dictionary with the column name…
df = pa.concat([df,pa.DataFrame( {'date':date})],axis = 1)
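A simpler alternative to the concat route is direct column assignment, which skips building the intermediate DataFrame; either way ends with the same date column.
# Equivalent shortcut: assign the Series as a new column directly.
df['date'] = date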
df.head()
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | 0.0 | 0 | unknown | False | 2020-10-19 |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339.0 | 4 | failure | False | 2020-05-11 |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330.0 | 1 | failure | False | 2020-04-16 |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | 0.0 | 0 | unknown | False | 2020-06-03 |
4 | 59 | blue collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | 0.0 | 0 | unknown | False | 2020-05-05 |
Day of the Week#
Here is one way to do it.
pa.crosstab(df.date.dt.day_name(),df.y)
| date | y = False | y = True |
|---|---|---|
Friday | 705 | 81 |
Monday | 295 | 70 |
Saturday | 491 | 36 |
Sunday | 77 | 29 |
Thursday | 846 | 105 |
Tuesday | 753 | 107 |
Wednesday | 833 | 93 |
Another way might be with groupby.
df.groupby([date.dt.day_name(),df.y]).age.agg('count')
y
Friday False 705
True 81
Monday False 295
True 70
Saturday False 491
True 36
Sunday False 77
True 29
Thursday False 846
True 105
Tuesday False 753
True 107
Wednesday False 833
True 93
Name: age, dtype: int64
I think you could look at this by doing some subsetting of the data too.
df[df.y == True].groupby(df.date.dt.day_name()).y.agg('count')
date
Friday 81
Monday 70
Saturday 36
Sunday 29
Thursday 105
Tuesday 107
Wednesday 93
Name: y, dtype: int64
This one was rather difficult. Finding the day names was what I was after!
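To pull the single busiest day out of any of these tables, idxmax on the counts of approved rows does it in one line; a sketch assuming the boolean y column from earlier (Tuesday, with 107, per the crosstab above).
# Day name with the most y == True rows.
df[df.y].date.dt.day_name().value_counts().idxmax()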
Indicator#
bal = df.balance > 500
pa.concat([df,pa.DataFrame({'FiveHundo':bal})], axis = 1)
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | date | FiveHundo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | 0.0 | 0 | unknown | False | 2020-10-19 | True |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339.0 | 4 | failure | False | 2020-05-11 | True |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330.0 | 1 | failure | False | 2020-04-16 | True |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | 0.0 | 0 | unknown | False | 2020-06-03 | True |
4 | 59 | blue collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | 0.0 | 0 | unknown | False | 2020-05-05 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4516 | 33 | services | married | secondary | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | 0.0 | 0 | unknown | False | 2020-07-30 | False |
4517 | 57 | self employed | married | tertiary | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | 0.0 | 0 | unknown | False | 2020-05-09 | False |
4518 | 57 | technician | married | secondary | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | 0.0 | 0 | unknown | False | 2020-08-19 | False |
4519 | 28 | blue collar | married | secondary | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211.0 | 3 | other | False | 2020-02-06 | True |
4520 | 44 | entrepreneur | single | tertiary | no | 1136 | yes | yes | cellular | 3 | apr | 345 | 2 | 249.0 | 7 | other | False | 2020-04-03 | True |
4521 rows × 19 columns
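Note that the concat above is only displayed, not assigned back, so df itself never gains the FiveHundo column (the later tables still show 18 columns). To build the same indicator without touching df, assign returns a copy; a sketch.
# Builds the indicator on a copy, leaving df unchanged.
df.assign(FiveHundo = df.balance > 500).head()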
Subsetting Data#
df[(df.job == 'admin.')&(df.marital == 'divorced')]
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | 42 | admin. | divorced | secondary | no | 1811 | yes | no | unknown | 14 | may | 150 | 1 | 0.0 | 0 | unknown | False | 2020-05-14 |
80 | 27 | admin. | divorced | secondary | no | 451 | yes | no | cellular | 16 | jul | 652 | 1 | 0.0 | 0 | unknown | True | 2020-07-16 |
120 | 31 | admin. | divorced | secondary | no | 1890 | yes | no | cellular | 21 | jul | 588 | 1 | 0.0 | 0 | unknown | False | 2020-07-21 |
183 | 40 | admin. | divorced | secondary | no | 6 | no | no | unknown | 11 | jun | 140 | 1 | 0.0 | 0 | unknown | False | 2020-06-11 |
283 | 42 | admin. | divorced | secondary | no | 63 | no | no | unknown | 16 | may | 88 | 6 | 0.0 | 0 | unknown | False | 2020-05-16 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4128 | 34 | admin. | divorced | secondary | no | 1268 | yes | no | cellular | 15 | may | 259 | 3 | 0.0 | 0 | unknown | False | 2020-05-15 |
4244 | 39 | admin. | divorced | secondary | no | 83 | yes | no | cellular | 30 | jul | 69 | 6 | 0.0 | 0 | unknown | False | 2020-07-30 |
4282 | 34 | admin. | divorced | secondary | no | -251 | no | no | cellular | 18 | jul | 641 | 1 | 0.0 | 0 | unknown | True | 2020-07-18 |
4407 | 45 | admin. | divorced | secondary | no | 59 | yes | no | cellular | 24 | jul | 873 | 8 | 0.0 | 0 | unknown | True | 2020-07-24 |
4438 | 52 | admin. | divorced | secondary | no | 98 | no | yes | unknown | 19 | jun | 148 | 1 | 0.0 | 0 | unknown | False | 2020-06-19 |
69 rows × 18 columns
Here are those folks! Let’s see what their stats are for getting approved.
df[(df.job == 'admin.')&(df.marital == 'divorced')].groupby('y').age.agg('count')
y
False 58
True 11
Name: age, dtype: int64
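Since y is boolean, its mean is the approval rate, which makes the comparison direct; a quick sketch.
# Approval rate for divorced admin. workers versus the overall rate.
divorcedAdmin = df[(df.job == 'admin.') & (df.marital == 'divorced')]
divorcedAdmin.y.mean(), df.y.mean()
From the counts above that is 11 of 69, roughly 16%, against about 521 of 4521, roughly 11.5%, overall, so this group does get approved at a somewhat higher rate.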
Another Group By?#
df.groupby(['job','default']).age.agg('count')
job default
admin. no 472
yes 6
blue collar no 932
yes 14
entrepreneur no 161
yes 7
housemaid no 110
yes 2
management no 955
yes 14
retired no 227
yes 3
self employed no 179
yes 4
services no 410
yes 7
student no 83
yes 1
technician no 753
yes 15
unemployed no 125
yes 3
unknown no 38
Name: age, dtype: int64
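Raw counts favor the biggest job categories, so “most likely to default” really calls for a rate per job; a minimal sketch.
# Share of each job group with default == 'yes', largest first.
(df.default == 'yes').groupby(df.job).mean().sort_values(ascending = False)
From the counts above, entrepreneur comes out highest at 7 of 168, roughly 4%.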
Two Way Tables#
Below is how I looked at number 11 in class.
df.groupby(['education','contact']).age.agg('count')
education contact
primary cellular 343
telephone 75
unknown 260
secondary cellular 1446
telephone 141
unknown 719
tertiary cellular 1012
telephone 66
unknown 272
unknown cellular 95
telephone 19
unknown 73
Name: age, dtype: int64
Below is a nicer way to create a two-way table with pandas.crosstab.
pa.crosstab(df.education,df.contact)
| education | cellular | telephone | unknown |
|---|---|---|---|
primary | 343 | 75 | 260 |
secondary | 1446 | 141 | 719 |
tertiary | 1012 | 66 | 272 |
unknown | 95 | 19 | 73 |
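To compare against the average, it helps to turn each education row into proportions, which crosstab can do directly; a sketch.
# Row-wise proportions: share of each education level reached by cellular.
pa.crosstab(df.education, df.contact, normalize = 'index')
Tertiary works out to 1012 of 1350, about 75% cellular, versus roughly 2896 of 4521, about 64%, overall, so the highly educated are contacted by cellular more often than average.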
Maximum#
df[df.balance == max(df.balance)].marital
3700 married
Name: marital, dtype: object
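An equivalent lookup uses idxmax to grab the row label of the largest balance directly; a sketch that returns the same ‘married’ for row 3700.
# Row label of the maximum balance, then its marital status.
df.loc[df.balance.idxmax(), 'marital']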