Visualizing Association#
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')
Scatter Plots#
For studying association, the most important visualization is the scatter plot. It helps us see the association between two (or possibly more) variables.
ax = sns.scatterplot(data = df, x = 'SepalLength', y = 'SepalWidth')
ax.set(title = "Length vs Width",
       xticks = list(range(4, 9)))
plt.show()
A nice part of seaborn is that I can add other aspects quickly. Here the hue parameter colors each point by its class.
sns.scatterplot(data = df, x = 'SepalLength', y = 'SepalWidth', hue = "Class")
I can pick the colors I want too! Here I do it with a dictionary.
colors = ['blue', 'green', 'orange']
colordict = {}
# Map each class name to a color, in order of first appearance
for i, name in enumerate(df.Class.unique()):
    colordict[name] = colors[i]
sns.scatterplot(data = df,
                x = 'SepalLength',
                y = 'SepalWidth',
                hue = "Class",
                palette = colordict)
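The loop that builds the palette dictionary can also be written in one line. This is only a sketch: the class labels below are hypothetical stand-ins for whatever df.Class.unique() returns, not values read from the dataset.

```python
# Hypothetical class labels standing in for df.Class.unique()
class_names = ["setosa", "versicolor", "virginica"]
colors = ["blue", "green", "orange"]

# zip pairs each label with a color; dict turns the pairs into a lookup table
colordict = dict(zip(class_names, colors))
print(colordict)
```

Note that zip stops at the shorter list, so if there are more classes than colors, the extra classes get no palette entry.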
If you prefer, you can also change the marker for each class with the style parameter.
sns.scatterplot(data = df,
                x = 'SepalLength',
                y = 'SepalWidth',
                hue = 'Class',
                style = 'Class')
We can vary the size of each entry too.
ax = sns.scatterplot(data = df,
                     x = 'SepalLength',
                     y = 'SepalWidth',
                     hue = 'Class',
                     size = 'PedalWidth')
# Move the legend outside the axes so it does not cover the points
sns.move_legend(ax, "upper right", bbox_to_anchor=(-.2, 1))
Adding the line of best fit (the regression line) is easy.
sns.regplot(data = df,
            x = 'SepalLength',
            y = 'SepalWidth',
            ci = None, # I removed the confidence interval!
            order = 1) # order = 1 fits a straight line
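To see the numbers behind a line of best fit, here is a minimal sketch that computes a least-squares line with NumPy. The data points are made up for illustration (they lie exactly on y = 2x + 1), and this is not meant as a description of how seaborn computes its fit internally.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1, so the fit is easy to check
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit is the least-squares line;
# the coefficients come back with the highest power first
slope, intercept = np.polyfit(x, y, deg=1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

With real data the points will not sit on the line, and the slope and intercept are the values that minimize the sum of squared vertical distances to it.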
To fit a separate line for each class, use lmplot with hue.
sns.lmplot(data = df,
           x = 'SepalLength',
           y = 'SepalWidth',
           hue = 'Class',
           ci = None)
Often it is nice to look at all of the associations in your data quickly.
g = sns.PairGrid(df, hue="Class")
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
g.add_legend()
plt.show()
Heat Map#
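Heat maps quickly show the correlation between each pair of variables. You’ll need to pass the correlation matrix, not the raw data, to make the map work.
# numeric_only = True skips the text column Class (needed on pandas 2.0 and later)
sns.heatmap(df.corr(numeric_only = True), annot=True, linewidths=0.5, vmin = -1)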
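It can help to look at the correlation matrix itself before drawing it. Here is a small sketch with a made-up DataFrame (the column names and values are hypothetical) that keeps only the numeric columns and then computes the pairwise correlations.

```python
import pandas as pd

# Made-up data for illustration only
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # exactly 2 * a, so its correlation with a is +1
    "c": [4.0, 3.0, 2.0, 1.0],   # falls as a rises, so its correlation with a is -1
    "label": ["x", "y", "x", "y"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

Each cell of the heat map is one entry of this matrix, which is why the diagonal always shows a correlation of 1.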
Your Turn#
Using the workout dataset, create a scatter plot with as many features as possible. Can you get five or six variables represented in one graphic?