Visualize Amounts#

To get started with visualization, we’ll look at one of the simplest ideas: single quantities. Let’s grab some data too!

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')

Bar Charts#

I’ll start with a small bar chart of the mean sepal length for each Class of flower.

df.groupby('Class').SepalLength.agg('mean')
Class
Iris-setosa        5.006
Iris-versicolor    5.936
Iris-virginica     6.588
Name: SepalLength, dtype: float64
df.groupby('Class').SepalLength.agg('mean').plot(kind = 'bar')
<AxesSubplot:xlabel='Class'>
../../_images/d00142ad0d3cf81eb6b96d48221175076d17ac9380f87e7e36700463a696929f.png
df.groupby('Class').SepalLength.agg('mean').plot.bar()
<AxesSubplot:xlabel='Class'>
../../_images/d00142ad0d3cf81eb6b96d48221175076d17ac9380f87e7e36700463a696929f.png

There are lots of options, some of which we should be using regularly. A title is always nice.

df.groupby('Class').SepalLength.agg('mean').plot(kind = 'bar', title = 'Mean by Class')
<AxesSubplot:title={'center':'Mean by Class'}, xlabel='Class'>
../../_images/d57b8d748e8503b1b2b0446d7b39eaa311c40e7603797727931ad5909823081f.png

A label describing what the \(y\) axis represents should not be forgotten!

df.groupby('Class').SepalLength.agg('mean').plot(kind = 'bar',
                                                 title = 'Mean by Class', 
                                                 ylabel= 'Mean of Sepal Length')
<AxesSubplot:title={'center':'Mean by Class'}, xlabel='Class', ylabel='Mean of Sepal Length'>
../../_images/2f7f4739f3e6ab1b0672ff9a1a765d02160d3128391b64fb00b51f30b59fe2fc.png

One complaint about a graphic like this is the length of the class names: the tick labels take up a lot of vertical space. With a barh plot you can change the orientation of the bars so the names read horizontally.

df.groupby('Class').SepalLength.agg('mean').plot(kind = 'barh',
                                                 title = 'Mean by Class', 
                                                 ylabel= 'Mean of Sepal Length')
<AxesSubplot:title={'center':'Mean by Class'}, ylabel='Class'>
../../_images/5914c5f3f11acba2370044aa73a9c01dfe1e77ad559ba7d0de4d55c9f3c7ba13.png

I couldn’t get the label for the values to appear (the output above only reports ylabel='Class'); maybe you can?
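One approach that may work regardless of the pandas version is to keep the Axes object that plot returns and set the label on it with matplotlib's set_xlabel. The sketch below is only an illustration: it uses a small hand-typed Series (the hypothetical name means) holding the three class means printed earlier, in place of the downloaded data.

```python
import pandas as pa

# The three class means printed earlier, typed in by hand for illustration.
means = pa.Series(
    [5.006, 5.936, 6.588],
    index=pa.Index(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], name='Class'),
    name='SepalLength',
)

# plot() returns a matplotlib Axes, so the label can be set on it directly.
ax = means.plot(kind='barh', title='Mean by Class')

# In a horizontal bar chart the values run along the x axis.
ax.set_xlabel('Mean of Sepal Length')
```

Setting the label after the plot is drawn sidesteps the question of how a given pandas version maps xlabel and ylabel for horizontal bars.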

If there are lots of values, don’t use bars! Let’s see this with a different dataset.

df2 = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Activity_Dataset_V1.csv')

In the following graph it is very difficult to follow the data points across.

df2.groupby('workout_type').calories.agg('mean').sort_values(ascending = True).plot(kind = 'barh')
<AxesSubplot:ylabel='workout_type'>
../../_images/f7ededf3868374c482ee78ef6c6d428629ba4bf738fdfdb1e37ab89b85492ac9.png

To clear this up you could use a point instead of a bar!

Dot Plots Work Well Too#

df2.groupby('workout_type').calories.agg(['mean']).sort_values(by = 'mean',ascending = True).reset_index().plot.scatter(x = 'mean', y = 'workout_type')
<AxesSubplot:xlabel='mean', ylabel='workout_type'>
../../_images/b1750746a00b860fa53b00f7fa571b7ad41b6e43e768b3625b439492c47f66b6.png

This creates another issue: the horizontal axis no longer starts at zero. To fix that, we simply require that the x limits go from 0 to 310.

df2.groupby('workout_type').calories.agg(['mean']).sort_values(by = 'mean',ascending = True).reset_index().plot.scatter(x = 'mean', y = 'workout_type', xlim = [0,310])
<AxesSubplot:xlabel='mean', ylabel='workout_type'>
../../_images/ccb5f5df884995e793809ef495dc0b40c06c389729b6577dc86bd6333c385804.png
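The upper limit of 310 above was chosen by eye. As a hedged alternative, matplotlib's set_xlim accepts only a left value, which pins the origin at zero and keeps the automatically chosen right edge. The sketch below draws the dots with matplotlib directly and uses made-up calorie values (the Series means and its three workout names are hypothetical stand-ins for the real data).

```python
import matplotlib.pyplot as plt
import pandas as pa

# Hypothetical mean calories per workout type, standing in for the real data.
means = pa.Series(
    {'Yoga': 280.0, 'Cycling': 295.0, 'Running': 301.0},
    name='mean',
).sort_values()

fig, ax = plt.subplots()

# A list of strings on the y axis is treated as categories by matplotlib.
ax.scatter(list(means.values), list(means.index))

# Pin the left edge at zero and keep the automatically chosen right edge.
ax.set_xlim(left=0)
ax.set_xlabel('mean')
ax.set_ylabel('workout_type')
```

This way the axis still starts at zero if the data changes and the largest mean grows past 310.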

All the workout types are giving us about the same bang for our buck, at least in mean calories.

Adding Labels#

It might also be nice to see the numbers presented with the data. This is especially nice for a small number of quantities.

# Compute the sorted means once so the plot and the labels share the same data.
means = df2.groupby('workout_type').calories.agg(['mean']).sort_values(by = 'mean',ascending = True).reset_index()

ax = means.plot.scatter(x = 'mean', y = 'workout_type') # this makes the same graph as above

for i,k in enumerate(means['mean']): # loop through the values, k, and their indices, i
  ax.annotate(str(int(k)),[k+.2,i+.2]) # write each value just up and to the right of its point
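The offsets of .2 above are in data units, so the gap between a dot and its number depends on the axis scale. A possible variation, sketched here with made-up values and drawn with matplotlib directly, is to give annotate the point in xy and a fixed shift in xytext with textcoords='offset points'. The list pairs and its contents are hypothetical. Note that the format string rounds, whereas int truncates.

```python
import matplotlib.pyplot as plt

# Hypothetical (workout type, mean calories) pairs, already sorted by mean.
pairs = [('Yoga', 280.4), ('Cycling', 295.1), ('Running', 301.7)]
labels = [name for name, _ in pairs]
values = [value for _, value in pairs]

fig, ax = plt.subplots()
ax.scatter(values, labels)  # categories are placed at y = 0, 1, 2 in order

for i, k in enumerate(values):
    # xy is the point in data coordinates; xytext shifts the text 5 points
    # right and 5 points up, so the gap is the same at any axis scale.
    ax.annotate(f'{k:.0f}', xy=(k, i), xytext=(5, 5), textcoords='offset points')
```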

../../_images/4b06703c53723b2db79d26e1d0282667fbfb6b4de6eba913c1a6d6664caf5143.png

Bar Charts with Multiple Data#

df.groupby('Class').agg('mean').plot(kind = 'bar')
<AxesSubplot:xlabel='Class'>
../../_images/b17bd40bc1cbbb5aaaa3370c34a43c7346a5241ab21dd64f97198441d17c560f.png
df.groupby('Class').agg('mean').plot(kind = 'bar', stacked = True)
<AxesSubplot:xlabel='Class'>
../../_images/99083e86535bd4a4c52ce1728ce5db98a625926439ed12947aeeffe38e369a13.png

Adding labels should be simple, but the version of matplotlib on Colab is out of date (bar_label needs matplotlib 3.4 or newer), so I upgrade it here.

!pip install --upgrade matplotlib
import matplotlib
matplotlib.__version__
'3.5.1'
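Instead of reading the version string by eye, one could also check for the feature itself. The short sketch below assumes only that bar_label is a method of matplotlib's Axes class, which is the case from matplotlib 3.4 onward; the variable name has_bar_label is just for illustration.

```python
import matplotlib
from matplotlib.axes import Axes

# bar_label was added to Axes in matplotlib 3.4, so its presence tells us
# whether the installed version is new enough for the labels used here.
has_bar_label = hasattr(Axes, 'bar_label')
print(matplotlib.__version__, has_bar_label)
```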

Now with the correct version it is actually really easy.

ax = df.groupby('Class').agg('mean').plot(kind = 'bar', ylim =[0,8])

for container in ax.containers:
    ax.bar_label(container)
../../_images/5b012ff20f0161a07f7c91bc4e32d85a9f5f2c8fad1ff134d306406058ce737f.png
ax = df.groupby('Class').agg('mean').plot(kind = 'bar', stacked = True)

for container in ax.containers:
    ax.bar_label(container)
../../_images/089d1959989450da6347e688a935788870f4ff09af01f2655d0af356b19d8e92.png

Be careful with stacked = True, as each bar then shows a cumulative total. This doesn’t really make any sense here, since the sum of the means of different measurements is not a meaningful quantity.
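If a stacked chart is still wanted, bar_label has a label_type option that may help: with 'center' each segment is labeled with its own height rather than the running total at its top edge. The sketch below needs matplotlib 3.4 or newer and uses a small hand-typed DataFrame with illustrative values (the name means and the numbers are assumptions, not the real group means).

```python
import pandas as pa

# Illustrative per-class means for two measurements, typed in by hand.
means = pa.DataFrame(
    {'SepalLength': [5.0, 5.9, 6.6], 'PetalLength': [1.5, 4.3, 5.6]},
    index=pa.Index(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], name='Class'),
)

ax = means.plot(kind='bar', stacked=True)

for container in ax.containers:
    # label_type='center' prints each segment's own height inside the segment;
    # the default 'edge' prints the position of the segment's top, which is
    # the cumulative total in a stacked chart.
    ax.bar_label(container, label_type='center', fmt='%.1f')
```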

Your Turn#

Using the Airbnb NYC data, complete the following tasks.

  1. Create a bar graph of the maximum ‘price’ by ‘neighbourhood_group’. Include the ‘price’ values as labels in your graph.

  2. Create a multiple bar graph with ‘neighbourhood_group’ and ‘room_type’ by looking at the average ‘price’.