
Visualization Conclusion#

What do you think will be on the exam?

Why Do We Visualize#

Visualization allows us to see many data points at once. It lets us compare those points quickly and efficiently, spot trends, and find interesting aspects of our data that would be hard to parse out without a graphic. The graphics we have covered in this course are:

  • Bar Chart

  • Histogram

  • Violin Plot

  • Box and Whisker Plot

  • Scatter Plot

  • Correlation Heatmap

  • Error Bars

  • Pie Charts

  • Mosaic Plots

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/bank.csv')
df.head()
   age          job  marital  education  default  balance  housing  loan   contact  day  month  duration  campaign  pdays  previous  poutcome   y
0   30   unemployed  married    primary       no     1787       no    no  cellular   19    oct        79         1     -1         0   unknown  no
1   33     services  married  secondary       no     4789      yes   yes  cellular   11    may       220         1    339         4   failure  no
2   35   management   single   tertiary       no     1350      yes    no  cellular   16    apr       185         1    330         1   failure  no
3   30   management  married   tertiary       no     1476      yes   yes   unknown    3    jun       199         4     -1         0   unknown  no
4   59  blue-collar  married  secondary       no        0      yes    no   unknown    5    may       226         1     -1         0   unknown  no
df.loc[:100].duration.plot.bar()
../../_images/330d40413f2cfa0442d214fe7489b8ee2620a7f943e2b5e34be03ea55d58a8fd.png
ax = df.groupby('education').duration.agg('mean').plot.barh()
ax.set(title = "Average Duration by Education Level", xlabel = 'Average Duration')
plt.show()
../../_images/544f8799ee2da8fe0a24212f9b56bd7f618dbea95eb827937a60bc4b99e153b6.png

What does this graph show? The average duration is about the same at every education level.

Histogram#

ax = df.duration.hist(bins = 25)
ax.set(title = "Histogram of Duration")
plt.show()
../../_images/1451a59df0112b4b09d435cb6e5983c966324ed2c68a34f796d9d14e20300a80.png

It looks like there are some serious outliers in duration.

Violin#

ax = sns.violinplot(data = df, x = 'duration', y = 'education')
ax.set(title = 'Violin of Duration and Education')

plt.show()
../../_images/885607d2030b22c374e6e97f69eaf3207deacc2b1cf5596be1abfe1e11a7d15c.png

It seems the unknown group doesn't have the extreme outliers. Perhaps only the least educated (primary) and the most educated (tertiary) have the longest durations.

Box and Whisker#

df.balance.plot.box()
../../_images/555020f2bc5a6970ff64c3c4c1fd998cce3efdcaa9eb261ff47e828969cb5214.png
sns.boxplot(data = df, x = 'balance', y = 'housing')
../../_images/e65ac262b4b215b39cfb83dac9265aaab9ccd3a547417743a5fb9c3c6aaff687.png

In the box plot, a balance of more than about $3k is drawn as an outlier!
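Where does that cutoff come from? By default, both matplotlib and seaborn draw a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. Here is a minimal sketch of that rule. The function name `upper_whisker_limit` and the `toy_balance` values are made up for illustration and are not taken from bank.csv.

```python
import pandas as pd

def upper_whisker_limit(values: pd.Series) -> float:
    """Return Q3 + 1.5 * IQR, the default upper fence of a box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# Hypothetical balances, invented for this example (not from bank.csv)
toy_balance = pd.Series([0, 100, 200, 300, 400, 500, 600, 700, 5000])

limit = upper_whisker_limit(toy_balance)
outliers = toy_balance[toy_balance > limit]

print(limit)              # fence above which points count as outliers
print(outliers.tolist())  # values beyond the fence
```

Calling `upper_whisker_limit(df.balance)` on the banking data should give the threshold that the box plot uses.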

Scatterplot#

ax = sns.scatterplot(data = df, x = 'balance', y = 'duration', hue = 'housing', style= 'education')
ax.set(title = "Duration vs. Balance")

plt.show()
../../_images/0547da1c26b2d02d313f570351e612508225a38a693a82a9071a19dec69694f1.png

Non-linear trend! Clients with little money have the long durations (wait!), while clients with lots of money have short ones (close the deal!).
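One way to check a trend like this without squinting at the scatter is to cut the x variable into equal-count bins and average y inside each bin. The sketch below does that with `pd.qcut`. The helper name `mean_by_bin` and the `toy` numbers are assumptions for illustration only; they are not results from the banking data.

```python
import pandas as pd

def mean_by_bin(frame: pd.DataFrame, x: str, y: str, bins: int = 4) -> pd.Series:
    """Average of column y within equal-count bins of column x."""
    binned = pd.qcut(frame[x], q=bins, duplicates="drop")
    return frame.groupby(binned, observed=True)[y].mean()

# Hypothetical data, invented for this example (not from bank.csv)
toy = pd.DataFrame({
    "balance":  [0, 10, 20, 30, 1000, 2000, 3000, 4000],
    "duration": [300, 320, 310, 330, 100, 120, 110, 90],
})

trend = mean_by_bin(toy, "balance", "duration", bins=2)
print(trend)  # one mean duration per balance bin
```

On the real data, `mean_by_bin(df, 'balance', 'duration')` would give one average per balance quartile, which you could then plot as a bar chart.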

Correlation Heatmap#

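# numeric_only=True skips the text columns; recent pandas raises an error without it
ax = sns.heatmap(df.corr(numeric_only = True), vmin = -1, vmax = 1)
ax.set(title = 'Heatmap of Banking')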

plt.show()
../../_images/7e0dd01d74b1f37b3d4a2d155d85aea65b1df578eb7d08223b57396d5e25ce2d.png

campaign and day seem to have some correlation.
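Reading the strongest pair off a heatmap by eye can be hard, so it may help to rank the pairs numerically. This is a minimal sketch, assuming we only care about the numeric columns. The helper name `strongest_pairs` and the `toy` frame are made up for illustration.

```python
import numpy as np
import pandas as pd

def strongest_pairs(frame: pd.DataFrame, top: int = 3) -> pd.Series:
    """Rank column pairs by the absolute value of their correlation."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the upper triangle so each pair appears once, without the diagonal
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# Hypothetical data, invented for this example (not from bank.csv)
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],    # exactly twice a
    "c": [5, 3, 4, 1, 2],
    "label": list("xyxyx"),   # text column, ignored by select_dtypes
})

ranked = strongest_pairs(toy)
print(ranked)
```

On the banking data, `strongest_pairs(df)` would list the pairs behind the brightest and darkest cells of the heatmap.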

Error Bars#

df.groupby('education').duration.agg(['mean', 'std', 'count'])
                 mean         std  count
education
primary    261.709440  271.988443    678
secondary  269.863833  260.896979   2306
tertiary   256.881481  254.290937   1350
unknown    250.449198  241.189640    187
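import numpy as np

df1 = df.groupby('education').duration.agg(['mean', 'std', 'count'])

def SE(std,n):
  # Two standard errors of the mean, roughly the half-width of a 95% interval
  return 2*std/np.sqrt(n)

# Works on whole columns at once, so no row-by-row apply is needed
df1['SE'] = SE(df1['std'], df1['count'])

# Keep the Axes returned by the plot so the title and label land on this chart
ax = df1.plot.bar(y = 'mean', yerr = 'SE')
ax.set(title = "Average Duration by Education Level", ylabel = 'Average Duration')
plt.show()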
../../_images/fa2f7faae7124cbb3aeb67aa201d6da24afef572cb8baeab38195c816c603d6d.png

The error bars all overlap, so the groups appear to have about the same mean.
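We can check the overlap with arithmetic instead of by eye. The sketch below rebuilds the intervals (mean plus or minus two standard errors) from the summary table printed earlier in this section and tests whether they all share a common point. Overlapping intervals are only an informal check, not a formal hypothesis test. The names `summary`, `half_width`, and `all_overlap` are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Summary statistics copied from the groupby table in this section
summary = pd.DataFrame(
    {
        "mean": [261.709440, 269.863833, 256.881481, 250.449198],
        "std": [271.988443, 260.896979, 254.290937, 241.189640],
        "count": [678, 2306, 1350, 187],
    },
    index=["primary", "secondary", "tertiary", "unknown"],
)

# Two standard errors of the mean for each group
half_width = 2 * summary["std"] / np.sqrt(summary["count"])
summary["low"] = summary["mean"] - half_width
summary["high"] = summary["mean"] + half_width

# All intervals share a common point when the largest lower end
# is no bigger than the smallest upper end
all_overlap = summary["low"].max() <= summary["high"].min()

print(summary[["low", "high"]].round(1))
print(all_overlap)
```

Notice that the unknown group has the widest interval because it has the fewest observations.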

Your Turn#

Submit at least two graphics you made while thinking about what may be on the exam.