Open In Colab

Pandas for Data Analysis#

import pandas as pd

I think rather than getting busy with another scripting langauge, we should buckle down and try to get some pandas essentials under our belts. My goal here is to not load any other library, think I can succeed?

Here is the wine dataset again. I’ve added the “nice” headers.

df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',header = None)
head = ['Class','Alcohol','MalicAcid','Ash','AlcalinityAsh','Magnesium','Phenols','Flavanoids','NonflavanoidPhenols','Proanthocyanins','ColorIntensity','Hue','OD280/OD315','Proline']
#https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.names more info on the data file than you could ever use!
df.columns = head
df.head()
Class Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue OD280/OD315 Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

Why do I keep calling my dataset df, TRADITION! No really you might consider using something more descriptive. You should note that I demanded that the columns have descriptive names because it makes the rest easier but was not nesseccary!

wineData = df

Basics About Your Data#

The first is just the column names but sometimes it can be nice to have that as a list

wineData.columns
Index(['Class', 'Alcohol', 'MalicAcid', 'Ash', 'AlcalinityAsh', 'Magnesium',
       'Phenols', 'Flavanoids', 'NonflavanoidPhenols', 'Proanthocyanins',
       'ColorIntensity', 'Hue', 'OD280/OD315', 'Proline'],
      dtype='object')

Size and shape are two excellent resources. How many entries and how many rows and columns

print('Size:' ,wineData.size)
print('Shape', wineData.shape)

178*14
Size: 2492
Shape (178, 14)
2492

Data types will be important as we clean our data.

wineData.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Class                178 non-null    int64  
 1   Alcohol              178 non-null    float64
 2   MalicAcid            178 non-null    float64
 3   Ash                  178 non-null    float64
 4   AlcalinityAsh        178 non-null    float64
 5   Magnesium            178 non-null    int64  
 6   Phenols              178 non-null    float64
 7   Flavanoids           178 non-null    float64
 8   NonflavanoidPhenols  178 non-null    float64
 9   Proanthocyanins      178 non-null    float64
 10  ColorIntensity       178 non-null    float64
 11  Hue                  178 non-null    float64
 12  OD280/OD315          178 non-null    float64
 13  Proline              178 non-null    int64  
dtypes: float64(11), int64(3)
memory usage: 19.6 KB

Similarly you can ask how many different values are in each column, this command is short for “number unique”

wineData.nunique()
Class                    3
Alcohol                126
MalicAcid              133
Ash                     79
AlcalinityAsh           63
Magnesium               53
Phenols                 97
Flavanoids             132
NonflavanoidPhenols     39
Proanthocyanins        101
ColorIntensity         132
Hue                     78
OD280/OD315            122
Proline                121
dtype: int64

We saw this one last time and may see it again soon!

wineData.describe()
Class Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue OD280/OD315 Proline
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 1.938202 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270 0.361854 1.590899 5.058090 0.957449 2.611685 746.893258
std 0.775035 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859 0.124453 0.572359 2.318286 0.228572 0.709990 314.907474
min 1.000000 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000 0.130000 0.410000 1.280000 0.480000 1.270000 278.000000
25% 1.000000 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000 0.270000 1.250000 3.220000 0.782500 1.937500 500.500000
50% 2.000000 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000 0.340000 1.555000 4.690000 0.965000 2.780000 673.500000
75% 3.000000 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000 0.437500 1.950000 6.200000 1.120000 3.170000 985.000000
max 3.000000 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000 0.660000 3.580000 13.000000 1.710000 4.000000 1680.000000

Acsessing The Data#

The above commands were on the entire dataset but sometimes we might want to consider only a subset. Let’s see some tools for that!

The most straight forward is by column name although it does not work if you have spaces in column names (don’t do that!)

wineData.Alcohol
0      14.23
1      13.20
2      13.16
3      14.37
4      13.24
       ...  
173    13.71
174    13.40
175    13.27
176    13.17
177    14.13
Name: Alcohol, Length: 178, dtype: float64

Or

wineData['Alcohol']
0      14.23
1      13.20
2      13.16
3      14.37
4      13.24
       ...  
173    13.71
174    13.40
175    13.27
176    13.17
177    14.13
Name: Alcohol, Length: 178, dtype: float64

Or if you want a dataframe still

wineData[['Alcohol']]
Alcohol
0 14.23
1 13.20
2 13.16
3 14.37
4 13.24
... ...
173 13.71
174 13.40
175 13.27
176 13.17
177 14.13

178 rows × 1 columns

If you need multiple columns

wineData[['Alcohol','Ash']]
Alcohol Ash
0 14.23 2.43
1 13.20 2.14
2 13.16 2.67
3 14.37 2.50
4 13.24 2.87
... ... ...
173 13.71 2.45
174 13.40 2.48
175 13.27 2.26
176 13.17 2.37
177 14.13 2.74

178 rows × 2 columns

Let’s get a little more exotic use loc You give it the row and the column. There are some great tricks here.

wineData.loc[:,'Ash']#all
wineData.loc[1:3,'Ash']#1,2,3
wineData.loc[[1,5,7],'Ash']#1,5,7
1    2.14
5    2.45
7    2.61
Name: Ash, dtype: float64

You may want to try some of those!

There is also iloc. It is similar but you have to use columns by number not name.

wineData.iloc[:-10,3]#last 10 entries still 'Ash'
0      2.43
1      2.14
2      2.67
3      2.50
4      2.87
       ... 
163    2.35
164    2.30
165    2.26
166    2.60
167    2.30
Name: Ash, Length: 168, dtype: float64

Lastly, what if we want data by a condition? loc will work for this. Here we get all the Class 1 wines.

wineData.loc[wineData.Class==1].head()
Class Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue OD280/OD315 Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

But I like to avoid nesting if I can. I suggest we use the query command.

wineData.query('Class ==1').head()#head is just here to limit the output
Class Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue OD280/OD315 Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

query will allow me to combine multiple statements!

wineData.query('Class ==1 and Ash >2.8')
Class Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue OD280/OD315 Proline
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735
25 1 13.05 2.05 3.22 25.0 124 2.63 2.68 0.47 1.92 3.58 1.13 3.20 830
36 1 13.28 1.64 2.84 15.5 110 2.60 2.68 0.34 1.36 4.60 1.09 2.78 880

More Advanced Analysis#

The melt command is very powerful and does some nifty things to large datasets quickly!

df_melt = wineData.melt(id_vars = 'Class',
                  value_vars = ['Flavanoids','Hue'],
                  var_name = 'colmuns')
df_melt.tail()
Class colmuns value
351 3 Hue 0.64
352 3 Hue 0.70
353 3 Hue 0.59
354 3 Hue 0.60
355 3 Hue 0.61

This is the same data as before but now Flavanoids and Hue are in a column and there values became another column. Why would this help us? Some time this is referred to as longer data and gives us a way graph.

Next is pivot. It would work best on a dataset with multiple categorical variables. It would take a long table and return a wide table. I cannot use it here without just undoing what I did above.

df_melt.pivot(columns = 'colmuns', values = 'value')
colmuns Flavanoids Hue
0 3.06 NaN
1 2.76 NaN
2 3.24 NaN
3 3.49 NaN
4 2.69 NaN
... ... ...
351 NaN 0.64
352 NaN 0.70
353 NaN 0.59
354 NaN 0.60
355 NaN 0.61

356 rows × 2 columns

I actually couldn’t undo that because it was not clear which value for Flavanoids went with which Hue. If I would have told, I would have give it index = 'Index' that would allow me to match the data back up. It is important not to lose data while doing analysis. You should always be able to reproduce your results but here we could not return!

groupby is excellent! You need a statistic with it too.

wineData.groupby(by = 'Class').mean()
Alcohol MalicAcid Ash AlcalinityAsh Magnesium Phenols Flavanoids NonflavanoidPhenols Proanthocyanins ColorIntensity Hue OD280/OD315 Proline
Class
1 13.744746 2.010678 2.455593 17.037288 106.338983 2.840169 2.982373 0.290000 1.899322 5.528305 1.062034 3.157797 1115.711864
2 12.278732 1.932676 2.244789 20.238028 94.549296 2.258873 2.080845 0.363662 1.630282 3.086620 1.056282 2.785352 519.507042
3 13.153750 3.333750 2.437083 21.416667 99.312500 1.678750 0.781458 0.447500 1.153542 7.396250 0.682708 1.683542 629.895833

You can add to this by using agg

wineData.groupby(by = 'Class').agg(['mean','median'])
Alcohol MalicAcid Ash AlcalinityAsh Magnesium ... Proanthocyanins ColorIntensity Hue OD280/OD315 Proline
mean median mean median mean median mean median mean median ... mean median mean median mean median mean median mean median
Class
1 13.744746 13.750 2.010678 1.770 2.455593 2.44 17.037288 16.8 106.338983 104.0 ... 1.899322 1.870 5.528305 5.40 1.062034 1.070 3.157797 3.17 1115.711864 1095.0
2 12.278732 12.290 1.932676 1.610 2.244789 2.24 20.238028 20.0 94.549296 88.0 ... 1.630282 1.610 3.086620 2.90 1.056282 1.040 2.785352 2.83 519.507042 495.0
3 13.153750 13.165 3.333750 3.265 2.437083 2.38 21.416667 21.0 99.312500 97.0 ... 1.153542 1.105 7.396250 7.55 0.682708 0.665 1.683542 1.66 629.895833 627.5

3 rows × 26 columns

So that is a lot! You can see there are many actions you can prefrom so it is time for you to try!

Your Turn#

  1. Add to your previous document containing the iris dataset

  2. Get the column names

  3. Call a column, find it’s mean

  4. Find the last row

  5. Melt the data keeping two columns of values and the type of flower

  6. Select flowers that have pedal length greater than 5 and sepal length less than 7

  7. Group the data by type of flower and compute mean and median