Data Types#
Lots of Data#
Before we get to the task of cleaning our data, it is important to know what types of data we have and the best tools for working with them!
Let’s first look at how we bring our data into the Python environment. We have worked mostly with pandas and will continue to do so! Pandas is actually built on top of another library, numpy. At some point in our work we will need both!
Pandas#
Pandas is great for loading data. We have seen it handle CSV, HTML and data from an SQL call. We can also load JSON and Excel files.
A DataFrame is the table structure we’ve used before, and a Series is similar to a single column.
You should use a pandas dataframe when your data contains categorical data.
Pandas is best when dealing with large datasets.
import pandas as pa
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/iris.csv')
df.head()
 | SepalLength | SepalWidth | PedalLength | PedalWidth | Class |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
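To make the DataFrame and Series distinction above concrete, here is a minimal sketch. The `toy` table and its values are made up for illustration and are not part of the iris file.

```python
import pandas as pa

# A tiny hand-made table (hypothetical values, not from the iris file)
toy = pa.DataFrame({
    "length": [5.1, 4.9, 6.3],
    "species": ["setosa", "setosa", "virginica"],
})

column = toy["species"]          # selecting one column gives a Series
print(type(toy).__name__)        # DataFrame
print(type(column).__name__)     # Series
print(toy.shape, column.shape)   # (3, 2) (3,)
```

The DataFrame has two dimensions (rows and columns), while the Series pulled out of it has only one.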
Numpy#
The other important tool in Python is numpy. It is the foundation of the pandas module but it has some limitations.
Numpy is excellent for higher dimensional data, stored as an array. Think of multiple sheets in an Excel spreadsheet: data that will not simply fit in a 2 dimensional table.
Numpy arrays can be accessed easily by their indices; this is very inefficient in pandas.
Numpy data should be just numbers! Categorical data should be converted first before utilizing a numpy array. Numpy is efficient and fast but works best on smaller datasets.
import numpy as np
df1 = pa.get_dummies(data = df)
X = np.array(df1)
X
array([[5.1, 3.5, 1.4, ..., 1. , 0. , 0. ],
[4.9, 3. , 1.4, ..., 1. , 0. , 0. ],
[4.7, 3.2, 1.3, ..., 1. , 0. , 0. ],
...,
[6.5, 3. , 5.2, ..., 0. , 0. , 1. ],
[6.2, 3.4, 5.4, ..., 0. , 0. , 1. ],
[5.9, 3. , 5.1, ..., 0. , 0. , 1. ]])
We will discuss what get_dummies does in due time. For now just know that it converted the class into numbers for use in a numpy array.
df1.dtypes
SepalLength float64
SepalWidth float64
PedalLength float64
PedalWidth float64
Class_Iris-setosa uint8
Class_Iris-versicolor uint8
Class_Iris-virginica uint8
dtype: object
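The earlier remark about higher dimensional data and index access can be sketched as follows. The array `sheets` is a made-up stand-in for a workbook with two sheets; the numbers are just a counter, not real data.

```python
import numpy as np

# Hypothetical example: 2 "sheets", each with 3 rows and 4 columns
sheets = np.arange(24).reshape(2, 3, 4)

print(sheets.shape)               # (2, 3, 4)
print(sheets[1, 2, 3])            # sheet 1, row 2, column 3 -> 23
print(sheets[0, :, 1])            # column 1 of sheet 0 -> [1 5 9]
print(sheets.mean(axis=0).shape)  # average across the sheets -> (3, 4)
```

Each index position refers to one axis, so a single entry, a whole column, or an average across sheets are all one short expression.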
Which is Best?#
Very often I will use both in a project. I’ll start with pandas for loading, cleaning and basic analysis. Then I will convert the data to a numpy array and create models for predictions.
Less Data#
Now that we have lots of data, we’ll have to start examining each piece! I am following this page on the different types.
Strings#
The most common type of data we examine is a string. We will spend a lot of time dealing with strings. Often data in another format is actually given as a string, so we’ll have our work cut out for us manipulating strings.
In the iris dataset, the class was given as a string.
df.Class.iloc[0]
'Iris-setosa'
The tell-tale sign of a string is the quotes. Of course we can save a string too.
a_string = 'My really cool string'
print(a_string)
My really cool string
Pandas reports the datatype object (‘O’) for any column of strings it is passed.
df.Class.dtype
dtype('O')
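As noted above, data in another format often arrives as strings. Here is a minimal sketch of one way to handle that, assuming a made-up column `raw` where numbers were read in as text with stray spaces.

```python
import pandas as pa

# Hypothetical column where numbers arrived as text
raw = pa.Series([" 3.50", "12 ", "7.25"])

cleaned = raw.str.strip()          # remove the stray whitespace
numbers = pa.to_numeric(cleaned)   # convert the text to numbers

print(cleaned.tolist())   # ['3.50', '12', '7.25']
print(numbers.sum())      # 22.75
```

The `.str` accessor applies a string method to every entry, and `to_numeric` does the conversion once the text is clean.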
Boolean#
A Boolean is a logical value, taking only two values, True or False. We can combine them using the normal logical connectives. We can also get a boolean by doing comparisons.
a = True
b = False
print(a and b)  # for pandas Series, use a & b
print(a or b)   # for pandas Series, use a | b
print(not a)    # for pandas Series, use ~a (on a plain bool, ~True is -2)
False
True
False
print(3 == 4)
print(5>-2)
False
True
bool(0)
False
This may show up when manipulating data! You can ask which rows in your dataset have the class Iris-setosa:
df.Class == 'Iris-setosa'
0 True
1 True
2 True
3 True
4 True
...
145 False
146 False
147 False
148 False
149 False
Name: Class, Length: 150, dtype: bool
(df.Class == 'Iris-setosa').dtype
dtype('bool')
Then you can pass that back into the dataframe and it will only give you the entries that were true.
df[df.Class == 'Iris-setosa'].head(10) #I've added head to limit the output to 10 entries
 | SepalLength | SepalWidth | PedalLength | PedalWidth | Class |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
6 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
7 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
8 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
9 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
If we want to combine several boolean Series, use & for and and | for or. Wrap each comparison in parentheses.
df[(df.Class == 'Iris-setosa') & (df.SepalLength > 5.2)]
 | SepalLength | SepalWidth | PedalLength | PedalWidth | Class |
---|---|---|---|---|---|
5 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
10 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
14 | 5.8 | 4.0 | 1.2 | 0.2 | Iris-setosa |
15 | 5.7 | 4.4 | 1.5 | 0.4 | Iris-setosa |
16 | 5.4 | 3.9 | 1.3 | 0.4 | Iris-setosa |
18 | 5.7 | 3.8 | 1.7 | 0.3 | Iris-setosa |
20 | 5.4 | 3.4 | 1.7 | 0.2 | Iris-setosa |
31 | 5.4 | 3.4 | 1.5 | 0.4 | Iris-setosa |
33 | 5.5 | 4.2 | 1.4 | 0.2 | Iris-setosa |
36 | 5.5 | 3.5 | 1.3 | 0.2 | Iris-setosa |
48 | 5.3 | 3.7 | 1.5 | 0.2 | Iris-setosa |
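The same idea extends to or and not. The sketch below uses a small made-up table `toy` (not the full iris data) so the result of each mask is easy to check by eye.

```python
import pandas as pa

# Hypothetical four-row table with the same column names as the iris data
toy = pa.DataFrame({
    "Class": ["Iris-setosa", "Iris-setosa", "Iris-virginica", "Iris-versicolor"],
    "SepalLength": [5.1, 5.4, 6.3, 4.9],
})

is_setosa = toy.Class == "Iris-setosa"
is_long = toy.SepalLength > 5.2

print(toy[is_setosa | is_long].index.tolist())    # either condition -> [0, 1, 2]
print(toy[~is_setosa].index.tolist())             # not setosa -> [2, 3]
print(toy[is_setosa & ~is_long].index.tolist())   # setosa and short -> [0]
```

Saving each mask under a descriptive name keeps the combined filter readable and avoids forgetting the parentheses around each comparison.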
Integers#
Integers are whole numbers that can be positive, negative or zero. Integers are closed under addition, subtraction and multiplication (but not division). Using integers can save some memory, so if your entry is an integer you should store it that way.
Some examples of integers are customer numbers and counts of objects. The code to convert to an integer is int.
int(-3.0000)
-3
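A few more conversions are worth seeing side by side. This is a small sketch; the Series `counts` is a made-up example of whole numbers that were stored as floats.

```python
import pandas as pa

print(int(-3.7))    # -3: int() truncates toward zero, it does not round
print(int("42"))    # 42: a string of digits converts too

product = 7 * 2     # integers stay integers under multiplication
quotient = 7 / 2    # true division always returns a float
print(type(product).__name__, type(quotient).__name__)  # int float
print(7 // 2)       # 3: floor division keeps the result an integer

counts = pa.Series([1.0, 2.0, 3.0])   # hypothetical counts stored as floats
as_int = counts.astype("int64")       # convert the whole column at once
print(as_int.dtype)                   # int64
```

Note that `int` drops the decimal part rather than rounding, so check for non-whole values before converting a column.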
Floats#
A float is a general real number stored to a limited precision (64 bits by default in pandas, which is about 15 to 17 significant decimal digits). Be wary of the last few decimals, even more so if you have done lots of computations.
In the iris dataset most columns are floats.
df.dtypes
SepalLength float64
SepalWidth float64
PedalLength float64
PedalWidth float64
Class object
dtype: object
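The warning about the last few decimals can be seen with a few lines of plain Python. This sketch adds 0.1 ten times and compares the result to 1.0, first exactly and then with a tolerance.

```python
import math

total = 0.0
for _ in range(10):
    total += 0.1     # 0.1 cannot be stored exactly in binary

print(total == 1.0)              # False: tiny round-off errors have built up
print(math.isclose(total, 1.0))  # True: compare with a tolerance instead
print(0.1 + 0.2 == 0.3)          # False, for the same reason
```

When comparing floats, a tolerance-based check such as `math.isclose` is usually the safer choice over `==`.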
Datetime#
These are dates and times and allow you to manipulate differences easily! I’ll grab a dataset with some dates in it. Pandas does not recognize the datetime automatically so I had to convert.
df2 = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Sets_For_Stats/master/CuratedDataSets/Landslides_From_NASA.csv')
ds = df2.event_date.astype('datetime64[ns]')  # recent pandas versions require an explicit unit such as [ns]
ds
0 2008-08-01 00:00:00
1 2009-01-02 02:00:00
2 2007-01-19 00:00:00
3 2009-07-31 00:00:00
4 2010-10-16 12:00:00
...
11028 2017-04-01 13:34:00
11029 2017-03-25 17:32:00
11030 2016-12-15 05:00:00
11031 2017-04-29 19:03:00
11032 2017-03-13 14:32:00
Name: event_date, Length: 11033, dtype: datetime64[ns]
ds[1]-ds[0]
Timedelta('154 days 02:00:00')
The Timedelta is itself another data structure!
(ds[1]-ds[0]).total_seconds()
13312800.0
This gives us the total seconds in the elapsed time. I’m certain there are other things you could do here. When you need them, you’ll have to explore!
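As a starting point for that exploration, here is a minimal sketch of a few other datetime tools. It uses two hand-typed dates (the same first two values shown in the output above) rather than the full landslide file.

```python
import pandas as pa

# Two dates typed in by hand so the example runs without a download
dates = pa.to_datetime(pa.Series(["2008-08-01 00:00:00", "2009-01-02 02:00:00"]))

gap = dates[1] - dates[0]
print(gap.days)              # 154: whole days in the Timedelta
print(gap.total_seconds())   # 13312800.0

print(dates.dt.year.tolist())         # [2008, 2009]
print(dates.dt.day_name().tolist())   # ['Friday', 'Friday']
```

The `.dt` accessor pulls pieces such as the year or weekday out of every entry in a datetime Series.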
Category#
I’ll convert the Class column into a category inside pandas by converting the Series of Class into a category and passing it back to the dataframe.
df.Class = df.Class.astype('category')
df.dtypes
SepalLength float64
SepalWidth float64
PedalLength float64
PedalWidth float64
Class category
dtype: object
You can get at each category through the unique method.
df.Class.unique()
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
Category is a structure found only in pandas. It also allows you to put an order on ordinal categorical variables.
from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=['Iris-setosa', 'Iris-versicolor', "Iris-virginica"], ordered=True)
dfClass = df.Class.astype(cat_type)
dfClass.dtype
CategoricalDtype(categories=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], ordered=True)
This may be nice for many categorical and ordinal variables! Taken from here
You might be asking why I would order a categorical variable (especially since the above example has no defined order).
Some things do have an order! Monday, Tuesday, Wednesday, etc. GED, High school Grad, Associates, Bachelors, Masters, etc.
Some things have an order but not always a well defined one: Single, Married, Divorced, Widowed?
Some things have a circular order: Winter, Spring, Summer, Fall, Winter, etc.
Be careful and use order correctly!
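To show what an order buys you, here is a minimal sketch using the education levels mentioned above. The Series `edu` and its four entries are made up for illustration.

```python
import pandas as pa
from pandas.api.types import CategoricalDtype

# Levels listed from lowest to highest; ordered=True tells pandas to respect that
levels = ["GED", "High school", "Associates", "Bachelors", "Masters"]
edu_type = CategoricalDtype(categories=levels, ordered=True)

# Hypothetical survey answers
edu = pa.Series(["Bachelors", "GED", "Masters", "High school"]).astype(edu_type)

print(edu.sort_values().tolist())       # ['GED', 'High school', 'Bachelors', 'Masters']
print((edu > "High school").tolist())   # [True, False, True, False]
print(edu.min(), edu.max())             # GED Masters
```

Without the ordered category, sorting would be alphabetical and comparisons like `>` would not reflect the intended ranking.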
Your Turn#
Use the banks dataset, retrieved from the UCI Machine Learning Repo but also accessible here, to explore the following questions:
What is the data type of the first column? Should you change it, or is that the best data type?
The second column is job. Could you assign an order to this string? Could you assign an order to the fourth column, education? If yes, what order would you assign? Do so.
Could any of the columns have been a boolean? Find one and create a new column in the dataframe that is just boolean and has a descriptive name.
Name one more interesting fact about this dataset and the datatypes.