
Histograms#

Next we look at quantitative variables. A histogram gives us an idea of the shape of a variable's distribution.

import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')
df.SepalLength.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7fbf0139b590>
../../_images/5903082c1e252ccac697fdeeca41a7c702a881fbd70dfdff7f2e5f6295aeeb6c.png
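
To see what the histogram is actually drawing, it can help to compute the bin counts directly. The sketch below uses `numpy.histogram` on a synthetic sample; the sample is a made-up stand-in for `df.SepalLength` (so the block runs without downloading anything), and the variable names are assumptions for illustration only.

```python
import numpy as np

# Synthetic stand-in for df.SepalLength (assumed values, not the iris data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.8, scale=0.8, size=150)

# 10 equal-width bins spanning the minimum to the maximum of the sample.
counts, edges = np.histogram(sample, bins=10)

# Each bar's height is the number of observations that fall in its bin.
# (Bins are half-open on the right, except the last one, which includes its right edge.)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {count}")
```

Every observation lands in exactly one bin, so the counts add up to the number of rows.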

You can get a histogram of every numeric column rather quickly!

df.hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fbf020eb0d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fbf01fb9d90>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fbf018340d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fbf019916d0>]],
      dtype=object)
../../_images/695d2281b7a7fdae60cbe256487e2b1d5cf062964d16771a98d6cd29aea3827b.png

Number of Bins#

Remember to try several bin counts when drawing a histogram, since the choice can change how your data appear! The default is 10 bins, which is not bad here.

df.SepalLength.plot.hist(bins = 100)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdda6826bd0>
../../_images/ca4be32b969b62688f5340eed56a29987acdeb300920b25a758d18b45e539187.png

100 was too many and 5 too few.

df.SepalLength.plot.hist(bins = 5)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdda620acd0>
../../_images/f86f30271dfdf82bf49c0ae3b4624c5898d4b58390e2faa7a22117395d8a2880.png

25 bins gave a little more clarity to how the data was distributed.

df.SepalLength.plot.hist(bins = 25)
<matplotlib.axes._subplots.AxesSubplot at 0x7fdda639b9d0>
../../_images/4e4d038dba282e2513d7788e5d2129310b550f3a986c1a37bb0d22bc8a4ff7ab.png
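
Rather than guessing, one option is to let NumPy suggest a bin count as a starting point. The sketch below compares a few of the named rules accepted by `numpy.histogram_bin_edges` (`'sturges'`, `'fd'` and `'auto'`) on synthetic data standing in for `df.SepalLength`. The sample and the `suggested` dictionary are assumptions for illustration, and these rules are no substitute for looking at several choices yourself.

```python
import numpy as np

# Synthetic stand-in for df.SepalLength (assumed values, not the iris data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.8, scale=0.8, size=150)

suggested = {}
for rule in ["sturges", "fd", "auto"]:
    # histogram_bin_edges returns the edges; there is one fewer bin than edges.
    edges = np.histogram_bin_edges(sample, bins=rule)
    suggested[rule] = len(edges) - 1
    print(rule, suggested[rule])
```

A suggested count can then be passed along as an integer, for example `df.SepalLength.plot.hist(bins = suggested['fd'])`.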

Density Shows Much the Same as Histograms#

Density plots can give you a view similar to the histogram.

df.SepalLength.plot(kind = 'density')
<matplotlib.axes._subplots.AxesSubplot at 0x7fdd931bf910>
../../_images/4793e7618d56936ca809f1d15f4e894ad80924ebb0a1c7dcf9d1bd43325cc77d.png

Adjusting bw_method changes the bandwidth of the density estimate. A smaller value can give you more detail but can also lead to overfitting.

df.SepalLength.plot(kind = 'density',bw_method = 0.3)
<matplotlib.axes._subplots.AxesSubplot at 0x7fbef10f7510>
../../_images/927656e8424257cb7c506416d547d321f3baa8f2e3bc20ac9f3ca176fd692cc8.png
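
To get a feel for what the bandwidth does, the sketch below evaluates a kernel density estimate with `scipy.stats.gaussian_kde` (the estimator pandas relies on for density plots) at three values of `bw_method`. The data are a synthetic stand-in for `df.SepalLength`, and counting local peaks is just a rough, assumed way of measuring how wiggly each curve is.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic stand-in for df.SepalLength (assumed values, not the iris data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.8, scale=0.8, size=150)

# Points at which each density curve is evaluated.
grid = np.linspace(sample.min() - 1, sample.max() + 1, 200)

for bw in [0.1, 0.3, 1.0]:
    kde = gaussian_kde(sample, bw_method=bw)
    density = kde(grid)
    # A grid point is a local peak if it is higher than both of its neighbours.
    inner = density[1:-1]
    peaks = int(np.sum((inner > density[:-2]) & (inner > density[2:])))
    print(f"bw_method={bw}: {peaks} local peak(s)")

# With no bw_method given, SciPy falls back to Scott's rule.
default_factor = gaussian_kde(sample).factor
print(f"default factor: {default_factor:.3f}")
```

For 150 observations the default factor is about 0.37, so `bw_method = 0.3` smooths only slightly less than the default.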

Multiple Data Points#

Often we want to see whether a distribution differs across the levels of a categorical variable. Here we look at SepalLength by Class.

df['SepalLength'].hist(by = df['Class'])
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fbef1000610>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fbef0fb6410>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fbef0f79a10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fbef0f25b10>]],
      dtype=object)
../../_images/8219b3126c8bc07f443d3f069f17dd85b1692d984eb9c102e8e6f62c9cbc1082.png

One of the issues in the above graph is that the axes do not match. Can you fix that?
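
If you get stuck, one possible approach is to ask pandas to share the axes across the panels. The sketch below uses the `sharex` and `sharey` arguments of `DataFrame.hist` on a synthetic data frame; `demo` and its values are made up to mimic the shape of the iris data and are not the real measurements.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the iris data (assumed values, three groups of 50).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "SepalLength": np.concatenate([
        rng.normal(5.0, 0.35, 50),
        rng.normal(5.9, 0.50, 50),
        rng.normal(6.6, 0.60, 50),
    ]),
    "Class": np.repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], 50),
})

# sharex/sharey force every panel onto the same x and y ranges.
axes = demo.hist(column="SepalLength", by="Class", sharex=True, sharey=True)
```

With the real data the same arguments would apply, for example `df.hist(column = 'SepalLength', by = 'Class', sharex = True, sharey = True)`.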

Below I put them all together on the same axis. The alpha value makes them transparent.

df.groupby('Class').SepalLength.plot.hist(alpha = .7)
Class
Iris-setosa        AxesSubplot(0.125,0.125;0.775x0.755)
Iris-versicolor    AxesSubplot(0.125,0.125;0.775x0.755)
Iris-virginica     AxesSubplot(0.125,0.125;0.775x0.755)
Name: SepalLength, dtype: object
../../_images/d0de769a1263fd0b2919afb1379d88f879c5801699b3c302f6d2cec751057ced.png
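
When histograms are overlaid like this, each group appears to get its own bin edges, so the bars can have different widths from one group to the next. A possible refinement, sketched below with matplotlib directly, is to compute one set of edges from all of the data and reuse it for every group, adding a legend along the way. The `demo` data frame is synthetic and the choice of 15 bins is arbitrary; both are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the iris data (assumed values, three groups of 50).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "SepalLength": np.concatenate([
        rng.normal(5.0, 0.35, 50),
        rng.normal(5.9, 0.50, 50),
        rng.normal(6.6, 0.60, 50),
    ]),
    "Class": np.repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], 50),
})

# One set of bin edges computed from all rows, shared by every group.
edges = np.histogram_bin_edges(demo["SepalLength"], bins=15)

fig, ax = plt.subplots()
for name, group in demo.groupby("Class"):
    ax.hist(group["SepalLength"], bins=edges, alpha=0.7, label=name)

ax.set_xlabel("SepalLength")
ax.set_ylabel("Frequency")
ax.legend()
```

Because every group uses the same edges, the bar heights in any given bin can be compared directly.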

Some Things to Avoid#

Putting non-comparable data on the same axis is easy but not appropriate! The following don’t really make any sense since we should not be comparing length and width on the same axis!

df.groupby('Class').plot.hist(alpha = .7, title = "Bad Graphs!")
Class
Iris-setosa        AxesSubplot(0.125,0.125;0.775x0.755)
Iris-versicolor    AxesSubplot(0.125,0.125;0.775x0.755)
Iris-virginica     AxesSubplot(0.125,0.125;0.775x0.755)
dtype: object
../../_images/c12d7b552ce3b38b4796b4994b0a1a1397608cbdafde58c927fbad978e382ca2.png ../../_images/ed6810dfdb791e460e711c7ba2e9789317ad000f4b9e2acb8ff4f837557b6f68.png ../../_images/1e7c552ee9ead7cfb2180ff88833b82a5428668c0272ef5497c8f29e29850d29.png

Here is another that doesn’t make any sense, since the columns are not comparable!

df.plot.hist(title = "Bad Graph!")
<matplotlib.axes._subplots.AxesSubplot at 0x7fdda38f49d0>
../../_images/3f4e93f71bb139a73c5b670514c63b4bee4eb12f817b823a4b85de8add999606.png

Your Turn#

Look at the NYC Airbnb data set. Provide a histogram of the price and a histogram of the price broken down by ‘neighbourhood_group’. Make sure to give titles! Play with the number of bins to find an appropriate shape. Comment on why you are seeing so little of the data and how you might see more.