Many Distributions

Often we have many distributions we want to look at. In the previous lecture I had three histograms on the same axis. That will be my limit! If I have more data than that, I am going to use other tools for visualization.

Box and Whisker Plots

We have already seen box and whisker plots, but I find them very important for visualizing distributions. I will normally include a box and whisker plot even for just one variable!

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/Activity_Dataset_V1.csv')
df.calories.plot.box()
<matplotlib.axes._subplots.AxesSubplot at 0x7f2aa5e439d0>
../../_images/012bb67b6a7f01a991795003368e5ccd18a15704bf7787873f418d96ed2fd07c.png

The central green bar is the median; the other four bars are the minimum, Q1, Q3, and the maximum. The whiskers reach all the way to the minimum and maximum because there are no outliers here! An outlier would have been drawn as a dot falling outside the whisker.
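To make the five bars concrete, here is a minimal sketch that computes them by hand. The values are hypothetical stand-ins for df.calories, and the 1.5 × IQR rule for flagging outliers is matplotlib's default whisker setting rather than something specific to this dataset.

```python
import pandas as pa  # same alias as the rest of this lecture

# Hypothetical calorie values, standing in for df.calories
calories = pa.Series([120, 150, 180, 200, 220, 260, 300, 900])

# Quartiles: the edges of the box and the median line inside it
q1, median, q3 = calories.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (calories < lower_fence) | (calories > upper_fence)
outliers = calories[is_outlier]

# Each whisker stops at the most extreme value that is still inside the fences
whisker_low = calories[~is_outlier].min()
whisker_high = calories[~is_outlier].max()

print(whisker_low, q1, median, q3, whisker_high)
print(outliers.tolist())
```

In this toy data the value 900 lies beyond the upper fence, so the upper whisker stops at 300 and 900 would appear as a dot.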

Let’s get many boxplots, one per category, with the by argument.

df.boxplot(column = 'calories',by = 'workout_type')
<matplotlib.axes._subplots.AxesSubplot at 0x7f2aa5db1ed0>
../../_images/379dcf8ccfcd50f81b6dc7095de4c8112c6272fe6786376a6dc89e46bcd244ff.png

This is not a great visualization, as you cannot read the categories! Rotating the tick labels by \(45^\circ\) with the rot argument fixes that.

import matplotlib.pyplot as plt


ax = df.boxplot(column = 'calories',by = 'workout_type',rot = 45)
ax.set_ylabel('Calories')
plt.show()
../../_images/1f271e5bcb1fb015d5d5aae8ace3d1741db50d81a55003a0aa52ac2610b95e20.png

A better option would be to rotate the whole graph, with the number of calories on the \(x\) axis and the categories on the \(y\) axis.

df.boxplot(column = 'calories',by = 'workout_type',vert = False)
<matplotlib.axes._subplots.AxesSubplot at 0x7f6734f66dd0>
../../_images/541a2714b7497378be7d7a46a379363445aa0253447dd4908f069be7def5f7d1.png

Violin Plots

We’ll repeat the graphics above with violin plots. I’ll need to use a different tool to make it work. Here I have chosen to use matplotlib, which is actually what is running in the background of pandas to make all the other graphics we have done thus far.

import matplotlib.pyplot as plt

plt.violinplot(df.calories)
{'bodies': [<matplotlib.collections.PolyCollection at 0x7f673512f510>],
 'cbars': <matplotlib.collections.LineCollection at 0x7f673512f310>,
 'cmaxes': <matplotlib.collections.LineCollection at 0x7f673512f390>,
 'cmins': <matplotlib.collections.LineCollection at 0x7f673512f950>}
../../_images/e0f1d3e5170a3cdad5e6d4a1abe94cd4b37f471bbdda8cfa67b5be9c69cdcd7d.png

If I wanted multiple violins, I’d do the following with seaborn.

import seaborn as sns

ax = sns.violinplot(data = df, x = 'workout_type', y = 'calories')
ax.set_xticklabels(ax.get_xticklabels(),rotation = 45)
ax.set_title('Violin Plot of Calories Burned')
plt.show()
../../_images/703f362cc4f9f5e82c59783ceaaa0bbf4d5789a771c154ab7590ff8093e63969.png
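For comparison, matplotlib alone can also draw several violins, provided the values are first split into one array per category. The sketch below assumes a small hypothetical DataFrame named toy with the same two column names as df; the Agg backend line is only there so the sketch runs without a display and is not needed in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pa

# Hypothetical stand-in for the lecture's df, using the same column names
toy = pa.DataFrame({
    "workout_type": ["Running"] * 4 + ["Cycling"] * 4 + ["Trekking"] * 4,
    "calories": [300, 320, 350, 400, 200, 240, 260, 310, 150, 180, 220, 230],
})

# One array of calorie values per workout type, in a fixed label order
labels = sorted(toy["workout_type"].unique())
groups = [
    toy.loc[toy["workout_type"] == label, "calories"].to_numpy()
    for label in labels
]

fig, ax = plt.subplots()
parts = ax.violinplot(groups, showmedians=True)  # violins sit at x = 1, 2, 3

# violinplot does not label the categories, so set the ticks by hand
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels, rotation=45)
ax.set_ylabel("calories")
ax.set_title("Violin Plot of Calories Burned")
fig.tight_layout()
```

The extra bookkeeping for grouping and tick labels is the main reason seaborn is the more convenient choice here.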

What if we wanted just ‘Trekking’?

df.query("workout_type == 'Trekking'").calories.plot.box()
<matplotlib.axes._subplots.AxesSubplot at 0x7f2a924a6210>
../../_images/ccc0562d7061bdecdc84ccda49f51cc59ed6e257da69cb3a18e215fb7f4e5012.png
ax = sns.violinplot(data = df.query("workout_type == 'Trekking'"), x = 'workout_type', y = 'calories')
ax.set_xticklabels(ax.get_xticklabels(),rotation = 45)
ax.set_title('Violin Plot of Calories Burned')
plt.show()
../../_images/cafa1f4ddae158b5fdf18f480cdccdf05a147df16429ba44d098b6d9bbdf23cf.png

To change the direction, swap \(x\) and \(y\).

sns.violinplot(data = df, y = 'workout_type', x = 'calories')
<matplotlib.axes._subplots.AxesSubplot at 0x7f8bce402590>
../../_images/54c7a3768d1c723527e915f6a4a6dd5521d5185dbef1edb1c43b3561f8edff70.png

The colors distinguish the workout types, but that seems redundant to me since the axis already labels them, so below I use a single color. I also added the points with some ‘jittering’, called stripplot in seaborn.

sns.violinplot(data = df, y = 'workout_type', x = 'calories', color = 'orange')
sns.stripplot(data = df, y = 'workout_type', x = 'calories' , color = 'black')
<matplotlib.axes._subplots.AxesSubplot at 0x7f2a90792e50>
../../_images/8fe8e7c8f73ae14095694024d411f025331e533341947f78c4d0dd2a7f87925a.png
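Jittering can be sketched in a few lines: every point keeps its calorie value but receives a small random offset along the category axis, so identical values no longer sit on top of each other. The data, the seed, and the offset width of 0.1 below are hypothetical choices for illustration; seaborn's stripplot controls the amount through its jitter parameter and picks its own default.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the offsets are reproducible

# Hypothetical calories for one workout type; note the repeated values
calories = np.array([250.0, 250.0, 250.0, 310.0, 310.0])

category_position = 0  # where this category sits on the categorical axis
jitter_width = 0.1     # assumed maximum offset on either side

# Random offsets on the categorical axis only; calories are left untouched
offsets = rng.uniform(-jitter_width, jitter_width, size=calories.size)
y_positions = category_position + offsets

for x, y in zip(calories, y_positions):
    print(f"calories={x:.0f}  axis position={y:+.3f}")
```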
rgb = [255, 78, 0]
# matplotlib and seaborn expect RGB channels scaled to the range 0-1
rgbscaled = [channel / 255 for channel in rgb]
rgbscaled
[1.0, 0.3058823529411765, 0.0]
sns.violinplot(data = df, y = 'workout_type', x = 'calories', color = rgbscaled)
<matplotlib.axes._subplots.AxesSubplot at 0x7f2a8ff48990>
../../_images/6f049d8c1558b6b61950116a81e56db7c4eb24e435fe9d66101c90b04cfc8ae3.png
hexcolor = "#FF4E00"

ax = sns.violinplot(data = df, y = 'workout_type', x = 'calories', color = hexcolor)
sns.stripplot(data = df, y = 'workout_type', x = 'calories' , color = '#000000')

ax.set_title('Go Tigers!')
plt.show()
../../_images/cf9da36267261df5eb64c5afee5fd983d8d055c4b9710450850e820c477d5d2a.png
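The scaled RGB list and the hex string used above describe the same orange. A minimal sketch of the conversion in both directions follows; the helper names rgb_to_hex and hex_to_scaled_rgb are hypothetical and written just for this illustration.

```python
def rgb_to_hex(rgb):
    """Format 0-255 integer RGB channels as a '#RRGGBB' string."""
    return "#" + "".join(f"{channel:02X}" for channel in rgb)


def hex_to_scaled_rgb(hexcolor):
    """Turn a '#RRGGBB' string into three floats in the range 0-1."""
    digits = hexcolor.lstrip("#")
    return [int(digits[i:i + 2], 16) / 255 for i in (0, 2, 4)]


rgb = [255, 78, 0]
hexcolor = rgb_to_hex(rgb)            # two hex digits per channel
scaled = hex_to_scaled_rgb(hexcolor)  # back to the 0-1 scale

print(hexcolor)
print(scaled)
```

Either form can be passed as the color argument, so the choice comes down to which one is easier to copy from a style guide.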

Your Turn

Recreate the three boxplots in this lecture using seaborn. Add the jittered points to the final boxplot.