Integers and Floats
There is not a ton to say about integers and floats except that they are numbers, and in data problems numbers are what we want to deal with if we can.
Integers can take less memory when stored in a narrow type such as int8 or int32 (pandas' default int64 and float64 both use 8 bytes per value), so it is best to use them when appropriate, but often you cannot avoid floats.
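As a rough illustration of the memory point, the sketch below stores the same made-up column of small whole numbers under three dtypes and compares the bytes used. The column and the variable names are assumptions for illustration, not part of the hotel data.

```python
import numpy as np
import pandas as pa

# Hypothetical column of small whole numbers (not taken from the hotel data).
values = pa.Series(np.arange(1000) % 10)

as_int64 = values.astype('int64')
as_float64 = values.astype('float64')
as_int8 = values.astype('int8')   # safe here because every value is between 0 and 9

# memory_usage(index=False) reports the bytes used by the values only.
bytes_int64 = as_int64.memory_usage(index=False)
bytes_float64 = as_float64.memory_usage(index=False)
bytes_int8 = as_int8.memory_usage(index=False)
print(bytes_int64, bytes_float64, bytes_int8)  # 8000 8000 1000
```

The saving only appears once the integers are stored in a narrower type, and that is only safe when every value fits in its range.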
Conversions Between the Two
import pandas as pa
df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/H1.csv')
df.head().T
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
IsCanceled | 0 | 0 | 0 | 0 | 0 |
LeadTime | 342 | 737 | 7 | 13 | 14 |
ArrivalDateYear | 2015 | 2015 | 2015 | 2015 | 2015 |
ArrivalDateMonth | July | July | July | July | July |
ArrivalDateWeekNumber | 27 | 27 | 27 | 27 | 27 |
ArrivalDateDayOfMonth | 1 | 1 | 1 | 1 | 1 |
StaysInWeekendNights | 0 | 0 | 0 | 0 | 0 |
StaysInWeekNights | 0 | 0 | 1 | 1 | 2 |
Adults | 2 | 2 | 1 | 1 | 2 |
Children | 0 | 0 | 0 | 0 | 0 |
Babies | 0 | 0 | 0 | 0 | 0 |
Meal | BB | BB | BB | BB | BB |
Country | PRT | PRT | GBR | GBR | GBR |
MarketSegment | Direct | Direct | Direct | Corporate | Online TA |
DistributionChannel | Direct | Direct | Direct | Corporate | TA/TO |
IsRepeatedGuest | 0 | 0 | 0 | 0 | 0 |
PreviousCancellations | 0 | 0 | 0 | 0 | 0 |
PreviousBookingsNotCanceled | 0 | 0 | 0 | 0 | 0 |
ReservedRoomType | C | C | A | A | A |
AssignedRoomType | C | C | C | A | A |
BookingChanges | 3 | 4 | 0 | 0 | 0 |
DepositType | No Deposit | No Deposit | No Deposit | No Deposit | No Deposit |
Agent | NULL | NULL | NULL | 304 | 240 |
Company | NULL | NULL | NULL | NULL | NULL |
DaysInWaitingList | 0 | 0 | 0 | 0 | 0 |
CustomerType | Transient | Transient | Transient | Transient | Transient |
ADR | 0.0 | 0.0 | 75.0 | 75.0 | 98.0 |
RequiredCarParkingSpaces | 0 | 0 | 0 | 0 | 0 |
TotalOfSpecialRequests | 0 | 0 | 0 | 0 | 1 |
ReservationStatus | Check-Out | Check-Out | Check-Out | Check-Out | Check-Out |
ReservationStatusDate | 7/1/2015 | 7/1/2015 | 7/2/2015 | 7/2/2015 | 7/3/2015 |
The ADR column is a float. Let’s check it out and see how to convert it to an integer.
df.ADR.astype('int')
0 0
1 0
2 75
3 75
4 98
...
40055 89
40056 202
40057 153
40058 112
40059 99
Name: ADR, Length: 40060, dtype: int64
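One detail worth checking when converting floats to integers is what happens to the fractional part. The sketch below uses made-up prices (not rows from the hotel data) to compare a plain astype with rounding first.

```python
import pandas as pa

# Made-up prices, not rows from the hotel data.
prices = pa.Series([89.75, 202.27, -1.6])

truncated = prices.astype('int')          # drops the fractional part (toward zero)
rounded = prices.round().astype('int')    # rounds to the nearest whole number first
print(truncated.tolist())  # [89, 202, -1]
print(rounded.tolist())    # [90, 202, -2]
```

If the nearest whole number is what you want, round before converting.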
Similarly, I can change BookingChanges into a float.
df.BookingChanges.astype('float')
0 3.0
1 4.0
2 0.0
3 0.0
4 0.0
...
40055 1.0
40056 0.0
40057 0.0
40058 0.0
40059 0.0
Name: BookingChanges, Length: 40060, dtype: float64
If I want to pass that back into my dataframe with the same name, I do the following.
df.BookingChanges = df.BookingChanges.astype('float')
df.head().T
| | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
IsCanceled | 0 | 0 | 0 | 0 | 0 |
LeadTime | 342 | 737 | 7 | 13 | 14 |
ArrivalDateYear | 2015 | 2015 | 2015 | 2015 | 2015 |
ArrivalDateMonth | July | July | July | July | July |
ArrivalDateWeekNumber | 27 | 27 | 27 | 27 | 27 |
ArrivalDateDayOfMonth | 1 | 1 | 1 | 1 | 1 |
StaysInWeekendNights | 0 | 0 | 0 | 0 | 0 |
StaysInWeekNights | 0 | 0 | 1 | 1 | 2 |
Adults | 2 | 2 | 1 | 1 | 2 |
Children | 0 | 0 | 0 | 0 | 0 |
Babies | 0 | 0 | 0 | 0 | 0 |
Meal | BB | BB | BB | BB | BB |
Country | PRT | PRT | GBR | GBR | GBR |
MarketSegment | Direct | Direct | Direct | Corporate | Online TA |
DistributionChannel | Direct | Direct | Direct | Corporate | TA/TO |
IsRepeatedGuest | 0 | 0 | 0 | 0 | 0 |
PreviousCancellations | 0 | 0 | 0 | 0 | 0 |
PreviousBookingsNotCanceled | 0 | 0 | 0 | 0 | 0 |
ReservedRoomType | C | C | A | A | A |
AssignedRoomType | C | C | C | A | A |
BookingChanges | 3.0 | 4.0 | 0.0 | 0.0 | 0.0 |
DepositType | No Deposit | No Deposit | No Deposit | No Deposit | No Deposit |
Agent | NULL | NULL | NULL | 304 | 240 |
Company | NULL | NULL | NULL | NULL | NULL |
DaysInWaitingList | 0 | 0 | 0 | 0 | 0 |
CustomerType | Transient | Transient | Transient | Transient | Transient |
ADR | 0.0 | 0.0 | 75.0 | 75.0 | 98.0 |
RequiredCarParkingSpaces | 0 | 0 | 0 | 0 | 0 |
TotalOfSpecialRequests | 0 | 0 | 0 | 0 | 1 |
ReservationStatus | Check-Out | Check-Out | Check-Out | Check-Out | Check-Out |
ReservationStatusDate | 7/1/2015 | 7/1/2015 | 7/2/2015 | 7/2/2015 | 7/3/2015 |
Note that ADR has not been changed in the dataframe! astype returns a new Series, and we never assigned that result back to df.ADR.
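To convert several columns in one step, DataFrame.astype also accepts a dictionary mapping column names to types. A minimal sketch on a tiny stand-in dataframe (the name small and its values are made up for illustration):

```python
import pandas as pa

# Tiny stand-in for the hotel dataframe; the values are made up.
small = pa.DataFrame({'ADR': [0.0, 75.5, 98.9], 'BookingChanges': [3, 4, 0]})

small.ADR.astype('int')               # returns a new Series; small itself is unchanged
dtype_before = str(small.ADR.dtype)   # still 'float64'

# Assigning the result back is what changes the dataframe.
small = small.astype({'ADR': 'int', 'BookingChanges': 'float'})
print(dtype_before)
print(small.dtypes)
```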
Grouping and Stats
Much like in SQL, we can do lots of operations on our dataframe. We have used many of these already, but this is as good a place as any to review.
df.groupby('DistributionChannel').ADR.agg(['mean','median','count', 'std'])
DistributionChannel | mean | median | count | std |
---|---|---|---|---|
Corporate | 53.277788 | 45.0 | 3269 | 30.156894 |
Direct | 103.074526 | 80.0 | 7865 | 67.650012 |
TA/TO | 97.453947 | 80.0 | 28925 | 60.505996 |
Undefined | 112.700000 | 112.7 | 1 | NaN |
Let’s review what the code above does! First I group by DistributionChannel, which is where the booking to the hotel came from. Next I select ADR; I think this is the price of the room (the average daily rate). Finally I aggregate the data, collecting the mean, median, count and standard deviation. Why does Undefined not have a std?
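One way to explore that last question is to compute the standard deviation of a single value directly. The sketch below uses a one-element series (the name one_booking is made up) and relies on the fact that pandas computes the sample standard deviation by default, with ddof=1.

```python
import pandas as pa

# A group with a single booking, like the Undefined row in the table.
one_booking = pa.Series([112.7])

sample_std = one_booking.std()            # default ddof=1 divides by n - 1 = 0
population_std = one_booking.std(ddof=0)  # divides by n = 1 instead
print(sample_std, population_std)         # nan 0.0
```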
Transform
We saw apply in action with strings. There is also a transform command.
df.ADR.transform(lambda x: x+1)
0 1.00
1 1.00
2 76.00
3 76.00
4 99.00
...
40055 90.75
40056 203.27
40057 154.57
40058 113.80
40059 100.06
Name: ADR, Length: 40060, dtype: float64
df.ADR.apply(lambda x: x+1)
0 1.00
1 1.00
2 76.00
3 76.00
4 99.00
...
40055 90.75
40056 203.27
40057 154.57
40058 113.80
40059 100.06
Name: ADR, Length: 40060, dtype: float64
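The two outputs above look identical. A quick check on a made-up series (not rows from the hotel data) confirms it, and also shows one thing transform can do that goes beyond a single function: it accepts a list of functions and returns one column per function.

```python
import numpy as np
import pandas as pa

# Made-up values, not rows from the hotel data.
s = pa.Series([0.0, 75.0, 98.0])

via_transform = s.transform(lambda x: x + 1)
via_apply = s.apply(lambda x: x + 1)
same = via_transform.equals(via_apply)
print(same)  # True

# A list of functions gives a DataFrame with one column per function.
both = s.transform([np.sqrt, np.exp])
print(both.columns.tolist())  # ['sqrt', 'exp']
```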
While these seem similar, transform always returns a result the same length as its input. You can also pass it a built-in function directly, without wrapping it in a lambda function, which might make your code more readable.
df.Meal.transform(len)
0 9
1 9
2 9
3 9
4 9
..
40055 9
40056 9
40057 9
40058 9
40059 9
Name: Meal, Length: 40060, dtype: int64
This is the length of each string. You should be surprised by this result, since BB is only two characters long, until you see the following output.
df.Meal[0]
'BB       '
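Trailing whitespace like this can be removed with the string accessor. The sketch below builds a small stand-in for the Meal column; the padding to nine characters is assumed from the lengths shown above, and the values other than BB are made up.

```python
import pandas as pa

# Stand-in for the Meal column, padded with spaces to nine characters.
meals = pa.Series(['BB'.ljust(9), 'HB'.ljust(9), 'Undefined'])

lengths_before = meals.str.len().tolist()
cleaned = meals.str.strip()              # remove leading and trailing whitespace
lengths_after = cleaned.str.len().tolist()
print(lengths_before)  # [9, 9, 9]
print(lengths_after)   # [2, 2, 9]
```

Stripping before grouping or comparing avoids surprises such as 'BB' failing to match the padded value.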
Rolling Window
Sometimes it is nice to know what is happening over several entries. A rolling (or moving) average is commonplace in finance.
df.ADR.rolling(2).sum()
0 NaN
1 0.00
2 75.00
3 150.00
4 173.00
...
40055 294.02
40056 292.02
40057 355.84
40058 266.37
40059 211.86
Name: ADR, Length: 40060, dtype: float64
This adds the previous entry to the current one. To get an average instead, call mean in place of sum. If we wanted to look at the total daily take, we would first have to gather the daily totals by grouping.
totaldailies = df.groupby('ReservationStatusDate').ADR.agg('sum')
totaldailies
ReservationStatusDate
1/1/2015 185.90
1/1/2016 2202.59
1/1/2017 14069.98
1/10/2016 1283.39
1/10/2017 2324.99
...
9/8/2016 3531.79
9/8/2017 404.05
9/9/2015 3587.90
9/9/2016 4162.33
9/9/2017 886.67
Name: ADR, Length: 913, dtype: float64
totaldailies.rolling(5).mean()
ReservationStatusDate
1/1/2015 NaN
1/1/2016 NaN
1/1/2017 NaN
1/10/2016 NaN
1/10/2017 4013.370
...
9/8/2016 3165.520
9/8/2017 2353.238
9/9/2015 2424.142
9/9/2016 2860.156
9/9/2017 2514.548
Name: ADR, Length: 913, dtype: float64
This did not work as I intended because the days are not in order: the dates are stored as strings, so they sort alphabetically rather than chronologically. Let’s convert the index into datetime format and try again.
totaldailies.index = pa.to_datetime(totaldailies.index)
I’ll need to sort them by the index too.
totaldailies = totaldailies.sort_index()
totaldailies
ReservationStatusDate
2014-11-18 0.00
2015-01-01 185.90
2015-01-02 154.14
2015-01-18 0.00
2015-01-21 3394.41
...
2017-09-08 404.05
2017-09-09 886.67
2017-09-10 581.09
2017-09-12 153.57
2017-09-14 211.86
Name: ADR, Length: 913, dtype: float64
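The effect of the conversion is easier to see on a handful of dates. This sketch uses made-up totals and assumes the month/day/year layout seen in ReservationStatusDate; passing format explicitly is optional, but it avoids any guessing about which number is the month.

```python
import pandas as pa

# Made-up daily totals indexed by date strings in month/day/year form.
totals = pa.Series([1.0, 2.0, 3.0, 4.0],
                   index=['1/1/2015', '1/1/2016', '1/10/2015', '9/9/2015'])

# As strings, the dates sort alphabetically, so 2016 lands between two 2015 dates.
string_order = totals.sort_index().index.tolist()
print(string_order)

# After conversion they sort chronologically.
totals.index = pa.to_datetime(totals.index, format='%m/%d/%Y')
totals = totals.sort_index()
print(totals)
```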
Now I think I am ready?
totaldailies.rolling('5d').mean()
ReservationStatusDate
2014-11-18 0.000000
2015-01-01 185.900000
2015-01-02 170.020000
2015-01-18 0.000000
2015-01-21 1697.205000
...
2017-09-08 1851.234000
2017-09-09 1614.150000
2017-09-10 1191.084000
2017-09-12 506.345000
2017-09-14 315.506667
Name: ADR, Length: 913, dtype: float64
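The difference between a window given as a count and one given as a time span shows up wherever dates are missing. A minimal sketch with made-up totals and a gap in the dates (the names by_count and by_time are for illustration only):

```python
import pandas as pa

# Made-up daily totals with a gap between the second and third dates.
dates = pa.to_datetime(['2015-01-01', '2015-01-02', '2015-01-18', '2015-01-21'])
totals = pa.Series([10.0, 20.0, 30.0, 50.0], index=dates)

by_count = totals.rolling(2).mean()     # always the current row and the one before it
by_time = totals.rolling('5D').mean()   # only rows dated within the 5 days ending at the current row
print(by_count.tolist())  # [nan, 15.0, 25.0, 40.0]
print(by_time.tolist())   # [10.0, 15.0, 30.0, 40.0]
```

With a time span, a row that has no neighbours in the previous five days is averaged on its own, which is why isolated dates in the output above simply repeat their own total.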
Your Turn
Grab the iris dataset. Answer the following questions:
Does converting SepalLength to integer increase or decrease the mean?
Does the direction of the shift remain the same if you groupby Class?
Gather the mean, median, count and standard deviation of all columns when grouped by Class.