
Integers and Floats#

There is not much to say about integers and floats except that they are numbers, and in data problems numbers are what we want to work with whenever we can.

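Integers represent counts exactly and can be stored in narrow types such as int8 that take less memory, so it is best to use them when appropriate, but often you cannot avoid floats. Keep in mind that the pandas defaults, int64 and float64, each use 8 bytes per value.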
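As a rough check on the memory point, here is a small sketch that compares the bytes used by the same column under three dtypes. The column is made up for illustration (it is not from the hotel data used below), and numpy is only there to build it.

```python
import numpy as np
import pandas as pa

# Hypothetical column of 1000 small counts (values 0 through 4).
counts = pa.Series(np.arange(1000) % 5)

as_int64 = counts.astype('int64')
as_float64 = counts.astype('float64')
as_int8 = counts.astype('int8')

# memory_usage(index=False) reports the bytes used by the values alone.
for name, col in [('int64', as_int64), ('float64', as_float64), ('int8', as_int8)]:
    print(name, col.memory_usage(index=False))
```

The two 64-bit columns come out the same size, so any saving comes from picking a narrower integer type, which is only safe when every value fits in it.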

Conversions Between the Two#

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/H1.csv')

df.head().T
0 1 2 3 4
IsCanceled 0 0 0 0 0
LeadTime 342 737 7 13 14
ArrivalDateYear 2015 2015 2015 2015 2015
ArrivalDateMonth July July July July July
ArrivalDateWeekNumber 27 27 27 27 27
ArrivalDateDayOfMonth 1 1 1 1 1
StaysInWeekendNights 0 0 0 0 0
StaysInWeekNights 0 0 1 1 2
Adults 2 2 1 1 2
Children 0 0 0 0 0
Babies 0 0 0 0 0
Meal BB BB BB BB BB
Country PRT PRT GBR GBR GBR
MarketSegment Direct Direct Direct Corporate Online TA
DistributionChannel Direct Direct Direct Corporate TA/TO
IsRepeatedGuest 0 0 0 0 0
PreviousCancellations 0 0 0 0 0
PreviousBookingsNotCanceled 0 0 0 0 0
ReservedRoomType C C A A A
AssignedRoomType C C C A A
BookingChanges 3 4 0 0 0
DepositType No Deposit No Deposit No Deposit No Deposit No Deposit
Agent NULL NULL NULL 304 240
Company NULL NULL NULL NULL NULL
DaysInWaitingList 0 0 0 0 0
CustomerType Transient Transient Transient Transient Transient
ADR 0.0 0.0 75.0 75.0 98.0
RequiredCarParkingSpaces 0 0 0 0 0
TotalOfSpecialRequests 0 0 0 0 1
ReservationStatus Check-Out Check-Out Check-Out Check-Out Check-Out
ReservationStatusDate 7/1/2015 7/1/2015 7/2/2015 7/2/2015 7/3/2015

The ADR column is a float. Let’s check it out and see how to convert it. Note that astype('int') drops the fractional part rather than rounding.

df.ADR.astype('int')
0          0
1          0
2         75
3         75
4         98
        ... 
40055     89
40056    202
40057    153
40058    112
40059     99
Name: ADR, Length: 40060, dtype: int64
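It is worth seeing what the conversion does to the decimals. The sketch below uses a few made-up prices (not rows from H1.csv) to compare converting directly with rounding first.

```python
import pandas as pa

# Hypothetical prices chosen to show the difference.
prices = pa.Series([89.75, 202.27, -1.7, 98.0])

# astype('int') throws away the fractional part (truncation toward zero).
truncated = prices.astype('int')

# Rounding first gives the nearest whole number instead.
rounded = prices.round().astype('int')

print(truncated.tolist())  # [89, 202, -1, 98]
print(rounded.tolist())    # [90, 202, -2, 98]
```

If the nearest whole number is what you want, round before converting.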

Similarly I can change BookingChanges into a float.

df.BookingChanges.astype('float')
0        3.0
1        4.0
2        0.0
3        0.0
4        0.0
        ... 
40055    1.0
40056    0.0
40057    0.0
40058    0.0
40059    0.0
Name: BookingChanges, Length: 40060, dtype: float64

If I want to pass that back into my dataframe with the same name, I do the following.

df.BookingChanges = df.BookingChanges.astype('float')

df.head().T
0 1 2 3 4
IsCanceled 0 0 0 0 0
LeadTime 342 737 7 13 14
ArrivalDateYear 2015 2015 2015 2015 2015
ArrivalDateMonth July July July July July
ArrivalDateWeekNumber 27 27 27 27 27
ArrivalDateDayOfMonth 1 1 1 1 1
StaysInWeekendNights 0 0 0 0 0
StaysInWeekNights 0 0 1 1 2
Adults 2 2 1 1 2
Children 0 0 0 0 0
Babies 0 0 0 0 0
Meal BB BB BB BB BB
Country PRT PRT GBR GBR GBR
MarketSegment Direct Direct Direct Corporate Online TA
DistributionChannel Direct Direct Direct Corporate TA/TO
IsRepeatedGuest 0 0 0 0 0
PreviousCancellations 0 0 0 0 0
PreviousBookingsNotCanceled 0 0 0 0 0
ReservedRoomType C C A A A
AssignedRoomType C C C A A
BookingChanges 3.0 4.0 0.0 0.0 0.0
DepositType No Deposit No Deposit No Deposit No Deposit No Deposit
Agent NULL NULL NULL 304 240
Company NULL NULL NULL NULL NULL
DaysInWaitingList 0 0 0 0 0
CustomerType Transient Transient Transient Transient Transient
ADR 0.0 0.0 75.0 75.0 98.0
RequiredCarParkingSpaces 0 0 0 0 0
TotalOfSpecialRequests 0 0 0 0 1
ReservationStatus Check-Out Check-Out Check-Out Check-Out Check-Out
ReservationStatusDate 7/1/2015 7/1/2015 7/2/2015 7/2/2015 7/3/2015

Note that ADR has not been changed in the dataframe! astype returns a new Series, and we never assigned the converted ADR back.
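A quick way to confirm which conversions stuck is to look at the dtypes. This is a minimal sketch on a stand-in frame: the column names are borrowed from the hotel data, but the values and the name toy are made up.

```python
import pandas as pa

# Small stand-in frame with one float column and one integer column.
toy = pa.DataFrame({'ADR': [0.0, 75.0, 98.0], 'BookingChanges': [3, 4, 0]})

# The result is computed and then discarded, so toy.ADR stays a float.
toy.ADR.astype('int')

# Assigning the result back is what changes the column.
toy['BookingChanges'] = toy.BookingChanges.astype('float')

print(toy.dtypes)
```

I used bracket assignment here; for a column that already exists it does the same thing as the attribute form used above.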

Grouping and Stats#

Much like in SQL, we can perform many operations on our dataframe. We have used many of these already, but this is as good a place as any to review.

df.groupby('DistributionChannel').ADR.agg(['mean','median','count', 'std'])
mean median count std
DistributionChannel
Corporate 53.277788 45.0 3269 30.156894
Direct 103.074526 80.0 7865 67.650012
TA/TO 97.453947 80.0 28925 60.505996
Undefined 112.700000 112.7 1 NaN

Let’s review what the code above does! First I group by DistributionChannel, which is where the booking to the hotel came from. Next I select ADR, the average daily rate, which is roughly the price of the room. Finally I aggregate the data, collecting the mean, median, count and standard deviation. Why does Undefined not have a std?
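One way to explore that closing question is to rebuild the situation on a tiny frame. The values below are made up; the only thing borrowed from the real data is that one channel has a single booking.

```python
import pandas as pa

# Hypothetical bookings: three in one channel, one in another.
toy = pa.DataFrame({
    'DistributionChannel': ['Direct', 'Direct', 'Direct', 'Undefined'],
    'ADR': [80.0, 100.0, 120.0, 112.7],
})

summary = toy.groupby('DistributionChannel').ADR.agg(['mean', 'median', 'count', 'std'])
print(summary)
```

pandas computes the sample standard deviation by default, which divides by one less than the count, so compare the count and std columns for each group.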

Transform#

We saw apply in action with strings. There is also a transform method.

df.ADR.transform(lambda x: x+1)
0          1.00
1          1.00
2         76.00
3         76.00
4         99.00
          ...  
40055     90.75
40056    203.27
40057    154.57
40058    113.80
40059    100.06
Name: ADR, Length: 40060, dtype: float64

df.ADR.apply(lambda x: x+1)
0          1.00
1          1.00
2         76.00
3         76.00
4         99.00
          ...  
40055     90.75
40056    203.27
40057    154.57
40058    113.80
40059    100.06
Name: ADR, Length: 40060, dtype: float64

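These two give the same result here; the difference is that transform requires its result to be the same length as its input. Either one can also be given a built-in function directly, without the lambda, which can make your code more readable.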

df.Meal.transform(len)
0        9
1        9
2        9
3        9
4        9
        ..
40055    9
40056    9
40057    9
40058    9
40059    9
Name: Meal, Length: 40060, dtype: int64

This is the length of each string. You should be surprised by this result until you see the following output.

df.Meal[0]
'BB       '
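If the padding is not wanted, it can be stripped before measuring. Here is a minimal sketch on made-up meal codes padded to nine characters to mimic df.Meal[0]. I have not checked whether every entry in the real column is padded the same way.

```python
import pandas as pa

# Made-up meal codes, each padded with trailing spaces to 9 characters.
meals = pa.Series([code.ljust(9) for code in ['BB', 'HB', 'Undefined']])

# transform and apply both call len on each entry.
padded_lengths = meals.transform(len)
same_with_apply = meals.apply(len)

# str.strip() removes leading and trailing whitespace from each entry.
stripped_lengths = meals.str.strip().str.len()

print(padded_lengths.tolist())    # [9, 9, 9]
print(stripped_lengths.tolist())  # [2, 2, 9]
```

On the real data the stripped strings would have to be assigned back to df.Meal for the change to stick, just as with astype above.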

Rolling Window#

Sometimes it is nice to know what is happening over several entries. A rolling (or moving) average is commonplace in finance.

df.ADR.rolling(2).sum()
0           NaN
1          0.00
2         75.00
3        150.00
4        173.00
          ...  
40055    294.02
40056    292.02
40057    355.84
40058    266.37
40059    211.86
Name: ADR, Length: 40060, dtype: float64

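This adds the previous entry to the current one; the first entry is NaN because it has no predecessor. To compute a rolling average instead, call mean() in place of sum(). If we wanted to look at the total daily take, we would first have to gather the daily totals by grouping.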

totaldailies = df.groupby('ReservationStatusDate').ADR.agg('sum')

totaldailies
ReservationStatusDate
1/1/2015       185.90
1/1/2016      2202.59
1/1/2017     14069.98
1/10/2016     1283.39
1/10/2017     2324.99
               ...   
9/8/2016      3531.79
9/8/2017       404.05
9/9/2015      3587.90
9/9/2016      4162.33
9/9/2017       886.67
Name: ADR, Length: 913, dtype: float64

totaldailies.rolling(5).mean()
ReservationStatusDate
1/1/2015          NaN
1/1/2016          NaN
1/1/2017          NaN
1/10/2016         NaN
1/10/2017    4013.370
               ...   
9/8/2016     3165.520
9/8/2017     2353.238
9/9/2015     2424.142
9/9/2016     2860.156
9/9/2017     2514.548
Name: ADR, Length: 913, dtype: float64

This did not work as I intended because the days are not in order: the index holds date strings, which sort as text rather than chronologically. Let’s convert the index to datetime format and try again.

totaldailies.index = pa.to_datetime(totaldailies.index)

I’ll need to sort them by the index too.

totaldailies = totaldailies.sort_index()
totaldailies
ReservationStatusDate
2014-11-18       0.00
2015-01-01     185.90
2015-01-02     154.14
2015-01-18       0.00
2015-01-21    3394.41
               ...   
2017-09-08     404.05
2017-09-09     886.67
2017-09-10     581.09
2017-09-12     153.57
2017-09-14     211.86
Name: ADR, Length: 913, dtype: float64
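The ordering problem is easy to see on a tiny example. The totals below are made up, and I pass an explicit format to to_datetime on the assumption that the strings are month/day/year, which matches rows like 7/1/2015 in the July data above.

```python
import pandas as pa

# Made-up daily totals keyed by month/day/year strings.
totals = pa.Series([10.0, 20.0, 30.0], index=['1/10/2016', '1/2/2016', '9/9/2015'])

# Sorting a string index compares character by character, so '1/10' lands before '1/2'.
as_strings = totals.sort_index()

# Parsing the strings first makes the sort chronological.
as_dates = totals.copy()
as_dates.index = pa.to_datetime(as_dates.index, format='%m/%d/%Y')
as_dates = as_dates.sort_index()

print(list(as_strings.index))  # ['1/10/2016', '1/2/2016', '9/9/2015']
print(as_dates.tolist())       # [30.0, 20.0, 10.0]
```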

Now I think I am ready?

totaldailies.rolling('5d').mean()
ReservationStatusDate
2014-11-18       0.000000
2015-01-01     185.900000
2015-01-02     170.020000
2015-01-18       0.000000
2015-01-21    1697.205000
                 ...     
2017-09-08    1851.234000
2017-09-09    1614.150000
2017-09-10    1191.084000
2017-09-12     506.345000
2017-09-14     315.506667
Name: ADR, Length: 913, dtype: float64
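Notice that there are no NaN values this time. A window given as a number of days behaves differently from a window given as a number of rows, and the sketch below compares the two on made-up totals with a gap in the dates. I write the window as '5D', using the uppercase day alias.

```python
import pandas as pa

# Made-up daily totals with a gap between 3 January and 10 January.
dates = pa.to_datetime(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-10'])
totals = pa.Series([100.0, 200.0, 300.0, 400.0], index=dates)

# Integer window: always the last 2 rows, however far apart their dates are.
pair_sums = totals.rolling(2).sum()
by_rows = totals.rolling(2).mean()

# Offset window: every row dated within the 5 days ending at the current row.
by_days = totals.rolling('5D').mean()

print(by_rows.tolist())  # [nan, 150.0, 250.0, 350.0]
print(by_days.tolist())  # [100.0, 150.0, 200.0, 400.0]
```

With the day-based window, a date with no neighbors in the previous five days is averaged on its own, which matches the 0.0 reported for 2015-01-18 above.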

Your Turn#

Grab the iris dataset. Answer the following questions:

  1. Does converting SepalLength to integer increase or decrease the mean?

  2. Does the direction of the shift remain the same if you group by Class?

  3. Gather the mean, median, count and standard deviation of all columns when grouped by Class.