Exam 2 Review#
Utilizing the bank dataset from nurfnick/Data_Viz, answer the following questions.
What are the datatypes for each column? Comment on their appropriateness and change any easy/obvious ones.
Using regular expressions, remove the hyphen “-” from any job and replace it with a space.
Convert any place where pdays is -1 to NaN.
Fill the NaN in pdays with a zero.
Compute the mean and median of balance when grouped by job.
Add a column that is a datetime. Use year of 2020.
Which day of the week (Monday, Tuesday, etc.) had the most approved loans (y = yes)?
Create an indicator column that shows if the person has more than $500 in their account.
Are divorced, admin people given loans at a higher rate?
Which job is most likely to have a default on their record?
Do highly educated people have more or fewer cellular phones than average?
What is the marital status of the person with the maximum balance?
Write another question based on your knowledge of the dataset.
import pandas as pa
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')
df.head(20)
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
5 | 35 | management | single | tertiary | no | 747 | no | no | cellular | 23 | feb | 141 | 2 | 176 | 3 | failure | no |
6 | 36 | self-employed | married | tertiary | no | 307 | yes | no | cellular | 14 | may | 341 | 1 | 330 | 2 | other | no |
7 | 39 | technician | married | secondary | no | 147 | yes | no | cellular | 6 | may | 151 | 2 | -1 | 0 | unknown | no |
8 | 41 | entrepreneur | married | tertiary | no | 221 | yes | no | unknown | 14 | may | 57 | 2 | -1 | 0 | unknown | no |
9 | 43 | services | married | primary | no | -88 | yes | yes | cellular | 17 | apr | 313 | 1 | 147 | 2 | failure | no |
10 | 39 | services | married | secondary | no | 9374 | yes | no | unknown | 20 | may | 273 | 1 | -1 | 0 | unknown | no |
11 | 43 | admin. | married | secondary | no | 264 | yes | no | cellular | 17 | apr | 113 | 2 | -1 | 0 | unknown | no |
12 | 36 | technician | married | tertiary | no | 1109 | no | no | cellular | 13 | aug | 328 | 2 | -1 | 0 | unknown | no |
13 | 20 | student | single | secondary | no | 502 | no | no | cellular | 30 | apr | 261 | 1 | -1 | 0 | unknown | yes |
14 | 31 | blue-collar | married | secondary | no | 360 | yes | yes | cellular | 29 | jan | 89 | 1 | 241 | 1 | failure | no |
15 | 40 | management | married | tertiary | no | 194 | no | yes | cellular | 29 | aug | 189 | 2 | -1 | 0 | unknown | no |
16 | 56 | technician | married | secondary | no | 4073 | no | no | cellular | 27 | aug | 239 | 5 | -1 | 0 | unknown | no |
17 | 37 | admin. | single | tertiary | no | 2317 | yes | no | cellular | 20 | apr | 114 | 1 | 152 | 2 | failure | no |
18 | 25 | blue-collar | single | primary | no | -221 | yes | no | unknown | 23 | may | 250 | 1 | -1 | 0 | unknown | no |
19 | 31 | services | married | secondary | no | 132 | no | no | cellular | 7 | jul | 148 | 1 | 152 | 1 | other | no |
Data Types#
df.dtypes
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y object
dtype: object
I’ll change the y column to boolean.
df.y = df.y=='yes'
df.dtypes
age int64
job object
marital object
education object
default object
balance int64
housing object
loan object
contact object
day int64
month object
duration int64
campaign int64
pdays int64
previous int64
poutcome object
y bool
dtype: object
Regular Expressions#
import re
re.sub(r"-"," ",df.job[6])
'self employed'
The above code successfully removes the ‘-’, so I just need to apply it to the entire column and remember to overwrite the current column.
df.job.apply(lambda x: re.sub(r"-"," ",x))
0 unemployed
1 services
2 management
3 management
4 blue collar
...
4516 services
4517 self employed
4518 technician
4519 blue collar
4520 entrepreneur
Name: job, Length: 4521, dtype: object
df.job = df.job.apply(lambda x: re.sub(r"-"," ",x))
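For what it’s worth, the same replacement works without apply; here is a small sketch using pandas’ vectorized string method, equivalent to the lambda above.
# Equivalent to the apply above: the vectorized string accessor.
df.job.str.replace("-", " ", regex=True)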
Replace With NaN#
I needed numpy to do this. This was a silly exercise but still a nice thing to think about.
import numpy as np
df.pdays.replace(-1,np.nan)
0 NaN
1 339.0
2 330.0
3 NaN
4 NaN
...
4516 NaN
4517 NaN
4518 NaN
4519 211.0
4520 249.0
Name: pdays, Length: 4521, dtype: float64
df.pdays = df.pdays.replace(-1,np.nan)
Fill NaN#
Now I immediately undo what I just did.
df.pdays = df.pdays.fillna(0)
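Since the fill immediately undoes the replacement, the two steps could also be collapsed into one; a minimal sketch that is equivalent on the original integer column.
# One step instead of two: send the -1 sentinel straight to 0.
df.pdays.replace(-1, 0)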
Group By#
df.groupby('job').balance.agg(['mean','median','count','std'])
| job | mean | median | count | std |
|---|---|---|---|---|
admin. | 1226.736402 | 430.0 | 478 | 2370.119128 |
blue collar | 1085.161734 | 408.5 | 946 | 2040.218220 |
entrepreneur | 1645.125000 | 365.5 | 168 | 4441.303620 |
housemaid | 2083.803571 | 296.5 | 112 | 4603.836647 |
management | 1766.928793 | 577.0 | 969 | 3267.733077 |
retired | 2319.191304 | 672.5 | 230 | 5846.379889 |
self employed | 1392.409836 | 483.0 | 183 | 2479.640999 |
services | 1103.956835 | 288.0 | 417 | 2445.239976 |
student | 1543.821429 | 422.5 | 84 | 2579.886671 |
technician | 1330.996094 | 434.5 | 768 | 2630.253390 |
unemployed | 1089.421875 | 473.5 | 128 | 1692.267628 |
unknown | 1501.710526 | 655.5 | 38 | 1957.258258 |
Dates#
One approach is to convert the month abbreviations to numbers.
monthConvert = {
'jan':1,
'feb':2,
'mar':3,
'apr':4,
'may':5,
'jun':6,
'jul':7,
'aug':8,
'sep':9,
'oct':10,
'nov':11,
'dec':12
}
This next line actually creates the datetime column.
date = pa.to_datetime(dict(year = 2020, day = df.day, month = df.month.map(monthConvert)))
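As an aside, to_datetime can also parse the three-letter month abbreviations itself via a format string, which sidesteps the lookup dictionary; a sketch, assuming the abbreviations are capitalized first so %b matches them.
# Alternative: build date strings and let %b parse the month abbreviation.
pa.to_datetime('2020-' + df.month.str.capitalize() + '-' + df.day.astype(str), format='%Y-%b-%d')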
I’ll add it to my dataframe. One way to add a new column is with concat, creating a DataFrame from the Series. I struggled in class to do this because I needed a dictionary with the column name…
df = pa.concat([df,pa.DataFrame( {'date':date})],axis = 1)
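A simpler alternative to the concat route is direct column assignment, which skips building the intermediate DataFrame; either way ends with the same date column.
# Equivalent shortcut: assign the Series as a new column directly.
df['date'] = date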
df.head()
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | 0.0 | 0 | unknown | False | 2020-10-19 |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339.0 | 4 | failure | False | 2020-05-11 |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330.0 | 1 | failure | False | 2020-04-16 |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | 0.0 | 0 | unknown | False | 2020-06-03 |
4 | 59 | blue collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | 0.0 | 0 | unknown | False | 2020-05-05 |
Day of the Week#
Here is one way to do it.
pa.crosstab(df.date.dt.day_name(),df.y)
| date | y = False | y = True |
|---|---|---|
Friday | 705 | 81 |
Monday | 295 | 70 |
Saturday | 491 | 36 |
Sunday | 77 | 29 |
Thursday | 846 | 105 |
Tuesday | 753 | 107 |
Wednesday | 833 | 93 |
Another way might be with groupby.
df.groupby([date.dt.day_name(),df.y]).age.agg('count')
y
Friday False 705
True 81
Monday False 295
True 70
Saturday False 491
True 36
Sunday False 77
True 29
Thursday False 846
True 105
Tuesday False 753
True 107
Wednesday False 833
True 93
Name: age, dtype: int64
I think you could look at this by doing some subsetting of the data too.
df[df.y == True].groupby(df.date.dt.day_name()).y.agg('count')
date
Friday 81
Monday 70
Saturday 36
Sunday 29
Thursday 105
Tuesday 107
Wednesday 93
Name: y, dtype: int64
This one was rather difficult. Finding the day names was what I was after!
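To pull the single busiest day out of any of these tables, idxmax on the counts of approved rows does it in one line; a sketch assuming the boolean y column from earlier (Tuesday, with 107, per the crosstab above).
# Day name with the most y == True rows.
df[df.y].date.dt.day_name().value_counts().idxmax()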
Indicator#
bal = df.balance > 500
pa.concat([df,pa.DataFrame({'FiveHundo':bal})], axis = 1)
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | date | FiveHundo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | 0.0 | 0 | unknown | False | 2020-10-19 | True |
1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339.0 | 4 | failure | False | 2020-05-11 | True |
2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330.0 | 1 | failure | False | 2020-04-16 | True |
3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | 0.0 | 0 | unknown | False | 2020-06-03 | True |
4 | 59 | blue collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | 0.0 | 0 | unknown | False | 2020-05-05 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4516 | 33 | services | married | secondary | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | 0.0 | 0 | unknown | False | 2020-07-30 | False |
4517 | 57 | self employed | married | tertiary | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | 0.0 | 0 | unknown | False | 2020-05-09 | False |
4518 | 57 | technician | married | secondary | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | 0.0 | 0 | unknown | False | 2020-08-19 | False |
4519 | 28 | blue collar | married | secondary | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211.0 | 3 | other | False | 2020-02-06 | True |
4520 | 44 | entrepreneur | single | tertiary | no | 1136 | yes | yes | cellular | 3 | apr | 345 | 2 | 249.0 | 7 | other | False | 2020-04-03 | True |
4521 rows × 19 columns
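Note that the concat above is only displayed, not assigned back, so df itself never gains the FiveHundo column (the later tables still show 18 columns). To build the same indicator without touching df, assign returns a copy; a sketch.
# Builds the indicator on a copy, leaving df unchanged.
df.assign(FiveHundo = df.balance > 500).head()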
Subsetting Data#
df[(df.job == 'admin.')&(df.marital == 'divorced')]
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
35 | 42 | admin. | divorced | secondary | no | 1811 | yes | no | unknown | 14 | may | 150 | 1 | 0.0 | 0 | unknown | False | 2020-05-14 |
80 | 27 | admin. | divorced | secondary | no | 451 | yes | no | cellular | 16 | jul | 652 | 1 | 0.0 | 0 | unknown | True | 2020-07-16 |
120 | 31 | admin. | divorced | secondary | no | 1890 | yes | no | cellular | 21 | jul | 588 | 1 | 0.0 | 0 | unknown | False | 2020-07-21 |
183 | 40 | admin. | divorced | secondary | no | 6 | no | no | unknown | 11 | jun | 140 | 1 | 0.0 | 0 | unknown | False | 2020-06-11 |
283 | 42 | admin. | divorced | secondary | no | 63 | no | no | unknown | 16 | may | 88 | 6 | 0.0 | 0 | unknown | False | 2020-05-16 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4128 | 34 | admin. | divorced | secondary | no | 1268 | yes | no | cellular | 15 | may | 259 | 3 | 0.0 | 0 | unknown | False | 2020-05-15 |
4244 | 39 | admin. | divorced | secondary | no | 83 | yes | no | cellular | 30 | jul | 69 | 6 | 0.0 | 0 | unknown | False | 2020-07-30 |
4282 | 34 | admin. | divorced | secondary | no | -251 | no | no | cellular | 18 | jul | 641 | 1 | 0.0 | 0 | unknown | True | 2020-07-18 |
4407 | 45 | admin. | divorced | secondary | no | 59 | yes | no | cellular | 24 | jul | 873 | 8 | 0.0 | 0 | unknown | True | 2020-07-24 |
4438 | 52 | admin. | divorced | secondary | no | 98 | no | yes | unknown | 19 | jun | 148 | 1 | 0.0 | 0 | unknown | False | 2020-06-19 |
69 rows × 18 columns
Here are those folks! Let’s see what their stats are for getting approved.
df[(df.job == 'admin.')&(df.marital == 'divorced')].groupby('y').age.agg('count')
y
False 58
True 11
Name: age, dtype: int64
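Since y is boolean, its mean is the approval rate, which makes the comparison direct; a quick sketch.
# Approval rate for divorced admin. workers versus the overall rate.
divorcedAdmin = df[(df.job == 'admin.') & (df.marital == 'divorced')]
divorcedAdmin.y.mean(), df.y.mean()
From the counts above that is 11 of 69, roughly 16%, against about 521 of 4521, roughly 11.5%, overall, so this group does get approved at a somewhat higher rate.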
Another Group By?#
df.groupby(['job','default']).age.agg('count')
job default
admin. no 472
yes 6
blue collar no 932
yes 14
entrepreneur no 161
yes 7
housemaid no 110
yes 2
management no 955
yes 14
retired no 227
yes 3
self employed no 179
yes 4
services no 410
yes 7
student no 83
yes 1
technician no 753
yes 15
unemployed no 125
yes 3
unknown no 38
Name: age, dtype: int64
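Raw counts favor the biggest job categories, so “most likely to default” really calls for a rate per job; a minimal sketch.
# Share of each job group with default == 'yes', largest first.
(df.default == 'yes').groupby(df.job).mean().sort_values(ascending = False)
From the counts above, entrepreneur comes out highest at 7 of 168, roughly 4%.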
Two Way Tables#
Below is how I looked at number 11 in class.
df.groupby(['education','contact']).age.agg('count')
education contact
primary cellular 343
telephone 75
unknown 260
secondary cellular 1446
telephone 141
unknown 719
tertiary cellular 1012
telephone 66
unknown 272
unknown cellular 95
telephone 19
unknown 73
Name: age, dtype: int64
Below is a nicer way to create a two-way table with pandas.crosstab.
pa.crosstab(df.education,df.contact)
| education | cellular | telephone | unknown |
|---|---|---|---|
primary | 343 | 75 | 260 |
secondary | 1446 | 141 | 719 |
tertiary | 1012 | 66 | 272 |
unknown | 95 | 19 | 73 |
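To compare against the average, it helps to turn each education row into proportions, which crosstab can do directly; a sketch.
# Row-wise proportions: share of each education level reached by cellular.
pa.crosstab(df.education, df.contact, normalize = 'index')
Tertiary works out to 1012 of 1350, about 75% cellular, versus roughly 2896 of 4521, about 64%, overall, so the highly educated are contacted by cellular more often than average.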
Maximum#
df[df.balance == max(df.balance)].marital
3700 married
Name: marital, dtype: object
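An equivalent lookup uses idxmax to grab the row label of the largest balance directly; a sketch that returns the same ‘married’ for row 3700.
# Row label of the maximum balance, then its marital status.
df.loc[df.balance.idxmax(), 'marital']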