
Data Types#

Lots of Data#

Before we get to the task of cleaning our data, it is important to know what type of data we have and the best tools for working with it!

Let’s first look at how we bring our data into the Python environment. We have worked mostly with pandas and will continue to do so! Pandas is actually built on top of another library, numpy. At some point in our work we will need both!

Pandas#

Pandas is great for loading data. We have seen it handle CSV, HTML, and data from an SQL call. We can also load JSON and Excel files.

DataFrame is the table structure we’ve used before and a Series is similar to a single column.

You should use a pandas DataFrame when your data contains categorical variables.

Pandas is best when dealing with large datasets.

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')

df.head()
SepalLength SepalWidth PedalLength PedalWidth Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
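A single column of that DataFrame is a Series. Pulling one out is a quick way to see the difference; a small sketch using the df loaded above:

type(df)                 # pandas.core.frame.DataFrame
type(df['SepalLength'])  # pandas.core.series.Series, one column plus its index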

Numpy#

The other important tool in Python is numpy. It is the foundation of the pandas module, but it has some limitations.

Numpy is excellent for higher-dimensional data, stored as an array. Think multiple sheets in an Excel spreadsheet: data that will not simply fit in a 2-dimensional array.

Numpy arrays can be accessed easily by their indices; this is very inefficient in pandas.

Numpy data should be just numbers! Categorical data should be converted before utilizing a numpy array. Numpy is efficient and fast but works best on smaller datasets.

import numpy as np

df1 = pa.get_dummies(data = df)

X = np.array(df1)

X
array([[5.1, 3.5, 1.4, ..., 1. , 0. , 0. ],
       [4.9, 3. , 1.4, ..., 1. , 0. , 0. ],
       [4.7, 3.2, 1.3, ..., 1. , 0. , 0. ],
       ...,
       [6.5, 3. , 5.2, ..., 0. , 0. , 1. ],
       [6.2, 3.4, 5.4, ..., 0. , 0. , 1. ],
       [5.9, 3. , 5.1, ..., 0. , 0. , 1. ]])

We will discuss what get_dummies does in due time. For now, just know that it converted the class into numbers for use in a numpy array.

df1.dtypes
SepalLength              float64
SepalWidth               float64
PedalLength              float64
PedalWidth               float64
Class_Iris-setosa          uint8
Class_Iris-versicolor      uint8
Class_Iris-virginica       uint8
dtype: object
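To see the index-based access mentioned above, plain integer indices and slices pull entries straight out of the array, and reshape shows how the same values could live in more than two dimensions. A small sketch using the X array built above:

X[0, 0]              # first row, first column: 5.1
X[:3, :4]            # first three rows of the four measurement columns
X.reshape(3, 50, 7)  # the same 150 x 7 values viewed as a 3-dimensional array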

Which is Best?#

Very often I will use both in a project. I’ll start with pandas for loading, cleaning, and basic analysis. Then I will convert the data to a numpy array and create models for predictions.

Less Data#

Now that we have lots of data, we’ll have to start examining each piece! I am following this page for the different types.

Strings#

The most common type of data we examine is a string. We will spend a lot of time dealing with strings. Often data in another format is actually given as a string, so we’ll have our work cut out for us manipulating strings.

In the iris dataset, the class was given as a string.

df.Class.iloc[0]
'Iris-setosa'

The tell-tale sign of a string is the quotes. Of course we can save a string in a variable too.

a_string = 'My really cool string'

print(a_string)
My really cool string
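As a preview of the string work to come, Python strings carry built-in methods for the most common manipulations. A quick sketch with the string above:

a_string.upper()                    # 'MY REALLY COOL STRING'
a_string.split(' ')                 # ['My', 'really', 'cool', 'string']
a_string.replace('cool', 'useful')  # 'My really useful string'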

Pandas uses the datatype object (‘O’) for any string it is passed.

df.Class.dtype
dtype('O')

Boolean#

Boolean is the logical data type, taking only two values, True or False. We can combine booleans using the normal logical connectives. We can also get a boolean by doing comparisons.

a = True
b = False

print(a and b) #or a & b

print(a or b) # or a | b

print( not a ) # pandas uses ~ for element-wise not; note ~True is -2 for a plain Python bool
False
True
False
print(3 == 4)

print(5>-2)
False
True
bool(0)
False

This may show up when manipulating data! You can ask which entries of your dataset have the class Iris-setosa.

df.Class == 'Iris-setosa'
0       True
1       True
2       True
3       True
4       True
       ...  
145    False
146    False
147    False
148    False
149    False
Name: Class, Length: 150, dtype: bool
(df.Class == 'Iris-setosa').dtype
dtype('bool')

Then you can pass that back into the dataframe and it will give you only the entries that were True.

df[df.Class == 'Iris-setosa'].head(10) #I've added head to limit the output to 10 entries
SepalLength SepalWidth PedalLength PedalWidth Class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa

If we want to combine several boolean Series, use & for and and | for or.

df[(df.Class == 'Iris-setosa') & (df.SepalLength > 5.2)]
SepalLength SepalWidth PedalLength PedalWidth Class
5 5.4 3.9 1.7 0.4 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa
31 5.4 3.4 1.5 0.4 Iris-setosa
33 5.5 4.2 1.4 0.2 Iris-setosa
36 5.5 3.5 1.3 0.2 Iris-setosa
48 5.3 3.7 1.5 0.2 Iris-setosa
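Alongside & and |, the ~ operator is the element-wise not for a boolean Series, so you can just as easily ask for everything that is not Iris-setosa. A quick sketch:

df[~(df.Class == 'Iris-setosa')].head()  # rows whose class is anything other than Iris-setosa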

Integers#

Integers are whole numbers that can be positive or negative. Integers are closed under addition, subtraction, and multiplication (not division). Using integers saves some memory, so if your entry is an integer you should store it that way.

Some examples of integers are customer numbers and counts of objects. The code to convert to an integer is int.

int(-3.0000)
-3
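In pandas the conversion is done with astype. The Series below is made up just for illustration, since every iris column has a fractional part:

counts = pa.Series([12.0, 7.0, 3.0, 9.0])  # whole numbers stored as 64-bit floats
counts.astype('int64')                     # the same values stored as integers
counts.astype('int8')                      # smaller integer types use even less memory per entry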

Floats#

A float is a generic number stored to a limited precision (64 bits in pandas). Be wary of the last few decimals, especially if you have done lots of computations.
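The classic example of the round-off to be wary of:

0.1 + 0.2         # 0.30000000000000004, not exactly 0.3
0.1 + 0.2 == 0.3  # False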

In the iris dataset most columns are floats.

df.dtypes
SepalLength    float64
SepalWidth     float64
PedalLength    float64
PedalWidth     float64
Class           object
dtype: object

Datetime#

These are dates and times and they allow you to compute differences easily! I’ll grab a dataset with some dates in it. Pandas does not recognize the datetime automatically, so I had to convert it.

df2 = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/Landslides_From_NASA.csv')
ds = pa.to_datetime(df2.event_date) # parse the date strings into datetime64

ds
0       2008-08-01 00:00:00
1       2009-01-02 02:00:00
2       2007-01-19 00:00:00
3       2009-07-31 00:00:00
4       2010-10-16 12:00:00
                ...        
11028   2017-04-01 13:34:00
11029   2017-03-25 17:32:00
11030   2016-12-15 05:00:00
11031   2017-04-29 19:03:00
11032   2017-03-13 14:32:00
Name: event_date, Length: 11033, dtype: datetime64[ns]
ds[1]-ds[0]
Timedelta('154 days 02:00:00')

The Timedelta is itself another data structure!

(ds[1]-ds[0]).total_seconds()
13312800.0

This gives us the total seconds in the elapsed time. I’m certain there are other things you could do here. When you need them, you’ll have to explore!
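One place to start exploring is the .dt accessor, which pulls the pieces out of each date. A small sketch using the ds Series above:

ds.dt.year.head()           # the year of each event
ds.dt.month_name().head()   # 'August', 'January', ...
(ds.max() - ds.min()).days  # span of the whole dataset in days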

Category#

I’ll convert the Class into a category inside pandas by converting the Series of Class into a category and passing it back to the dataframe.

df.Class = df.Class.astype('category')

df.dtypes
SepalLength     float64
SepalWidth      float64
PedalLength     float64
PedalWidth      float64
Class          category
dtype: object

You can get at each category through the unique command.

df.Class.unique()
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
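Once the column is a category, the .cat accessor gives another route to the same information, along with the integer codes pandas stores under the hood. A quick sketch:

df.Class.cat.categories    # the three category labels
df.Class.cat.codes.head()  # each row stored as a small integer code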

Category is a structure only in pandas. It allows you to put an order on ordinal categorical variables.

from pandas.api.types import CategoricalDtype


cat_type = CategoricalDtype(categories=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], ordered=True)

dfClass = df.Class.astype(cat_type)

dfClass.dtype
CategoricalDtype(categories=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], ordered=True)

This may be nice for many categorical and ordinal variables! Taken from here

You might be asking why I would order a categorical variable (especially since the above example has no natural order).

Some things do have an order! Monday, Tuesday, Wednesday, etc. GED, High school Grad, Associates, Bachelors, Masters, etc.

Some things have an order but not always well defined: Single, Married, Divorced, Widowed?

Some things have a circular order: Winter, Spring, Summer, Fall, Winter, etc.

Be careful and use order correctly!
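As a small demonstration of what an order buys you, comparisons and min/max respect it once the dtype is ordered. A sketch using the dfClass Series built above:

(dfClass > 'Iris-setosa').head()  # element-wise comparison against a category
dfClass.min()                     # 'Iris-setosa', the smallest category in the defined order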

Your Turn#

Using the banks dataset, retrieved from the UCI Machine Learning Repo but also accessible here, explore the following questions:

  1. What is the data type of the first column? Should you change it, or is that the best data type?

  2. The second column is job; could you assign an order to this string? Could you assign an order to the fourth column, education? If yes, what order would you assign? Do so.

  3. Could any of the columns have been a boolean? Find one and create a new column in the dataframe that is just boolean and has a descriptive name.

  4. Name one more interesting fact about this dataset and the datatypes.