Integers and Floats

There is not much to say about integers and floats except that they are numbers, and in data problems numbers are what we want to work with whenever we can.

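Integers are exact and can be stored in small types such as int8, so it is best to use them when appropriate, but often you cannot avoid floats. Note that pandas' default int64 and float64 both take 8 bytes per value, so the memory savings only appear when you choose a smaller integer type.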
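To make the memory point concrete, here is a minimal sketch that stores the same whole numbers under three dtypes and measures the bytes used. The data is a made-up toy array, not the hotel dataset, and `memory_usage(index=False)` counts only the values.

```python
import numpy as np
import pandas as pa  # same alias the rest of this page uses

# Hypothetical toy data: 1,000 small whole numbers (all fit in int8).
values = np.arange(1000) % 10

# Bytes used by the values alone (index excluded) for three dtypes.
sizes = {
    dtype: pa.Series(values, dtype=dtype).memory_usage(index=False)
    for dtype in ["int8", "int64", "float64"]
}
print(sizes)
```

The int8 version needs 1,000 bytes, while int64 and float64 both need 8,000, so the saving comes from picking a narrower integer type rather than from integers as such.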

Conversions Between the Two

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/Data_Sets/H1.csv')

df.head().T

	0	1	2	3	4
IsCanceled	0	0	0	0	0
LeadTime	342	737	7	13	14
ArrivalDateYear	2015	2015	2015	2015	2015
ArrivalDateMonth	July	July	July	July	July
ArrivalDateWeekNumber	27	27	27	27	27
ArrivalDateDayOfMonth	1	1	1	1	1
StaysInWeekendNights	0	0	0	0	0
StaysInWeekNights	0	0	1	1	2
Adults	2	2	1	1	2
Children	0	0	0	0	0
Babies	0	0	0	0	0
Meal	BB	BB	BB	BB	BB
Country	PRT	PRT	GBR	GBR	GBR
MarketSegment	Direct	Direct	Direct	Corporate	Online TA
DistributionChannel	Direct	Direct	Direct	Corporate	TA/TO
IsRepeatedGuest	0	0	0	0	0
PreviousCancellations	0	0	0	0	0
PreviousBookingsNotCanceled	0	0	0	0	0
ReservedRoomType	C	C	A	A	A
AssignedRoomType	C	C	C	A	A
BookingChanges	3	4	0	0	0
DepositType	No Deposit	No Deposit	No Deposit	No Deposit	No Deposit
Agent	NULL	NULL	NULL	304	240
Company	NULL	NULL	NULL	NULL	NULL
DaysInWaitingList	0	0	0	0	0
CustomerType	Transient	Transient	Transient	Transient	Transient
ADR	0.0	0.0	75.0	75.0	98.0
RequiredCarParkingSpaces	0	0	0	0	0
TotalOfSpecialRequests	0	0	0	0	1
ReservationStatus	Check-Out	Check-Out	Check-Out	Check-Out	Check-Out
ReservationStatusDate	7/1/2015	7/1/2015	7/2/2015	7/2/2015	7/3/2015

The ADR column is a float. Let’s see how to convert it to an integer.

df.ADR.astype('int')

0          0
1          0
2         75
3         75
4         98
        ... 
40055     89
40056    202
40057    153
40058    112
40059     99
Name: ADR, Length: 40060, dtype: int64
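One thing worth checking is what happens to the fractional part. The sketch below uses a few made-up prices (not rows from the dataset) to compare a plain `astype("int")` with rounding first.

```python
import pandas as pa

# Hypothetical prices, including a negative value to show the direction of truncation.
prices = pa.Series([0.0, 75.0, 89.75, 202.27, -1.7])

truncated = prices.astype("int")        # drops the fractional part (toward zero)
rounded = prices.round().astype("int")  # rounds to the nearest whole number first

print(truncated.tolist())  # [0, 75, 89, 202, -1]
print(rounded.tolist())    # [0, 75, 90, 202, -2]
```

So `astype("int")` truncates rather than rounds. If you want the nearest integer, call `round()` before converting.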

Similarly, I can change BookingChanges into a float.

df.BookingChanges.astype('float')

0        3.0
1        4.0
2        0.0
3        0.0
4        0.0
        ... 
40055    1.0
40056    0.0
40057    0.0
40058    0.0
40059    0.0
Name: BookingChanges, Length: 40060, dtype: float64

If I want to pass that back into my dataframe with the same name, I do the following.

df.BookingChanges = df.BookingChanges.astype('float')

df.head().T

	0	1	2	3	4
IsCanceled	0	0	0	0	0
LeadTime	342	737	7	13	14
ArrivalDateYear	2015	2015	2015	2015	2015
ArrivalDateMonth	July	July	July	July	July
ArrivalDateWeekNumber	27	27	27	27	27
ArrivalDateDayOfMonth	1	1	1	1	1
StaysInWeekendNights	0	0	0	0	0
StaysInWeekNights	0	0	1	1	2
Adults	2	2	1	1	2
Children	0	0	0	0	0
Babies	0	0	0	0	0
Meal	BB	BB	BB	BB	BB
Country	PRT	PRT	GBR	GBR	GBR
MarketSegment	Direct	Direct	Direct	Corporate	Online TA
DistributionChannel	Direct	Direct	Direct	Corporate	TA/TO
IsRepeatedGuest	0	0	0	0	0
PreviousCancellations	0	0	0	0	0
PreviousBookingsNotCanceled	0	0	0	0	0
ReservedRoomType	C	C	A	A	A
AssignedRoomType	C	C	C	A	A
BookingChanges	3.0	4.0	0.0	0.0	0.0
DepositType	No Deposit	No Deposit	No Deposit	No Deposit	No Deposit
Agent	NULL	NULL	NULL	304	240
Company	NULL	NULL	NULL	NULL	NULL
DaysInWaitingList	0	0	0	0	0
CustomerType	Transient	Transient	Transient	Transient	Transient
ADR	0.0	0.0	75.0	75.0	98.0
RequiredCarParkingSpaces	0	0	0	0	0
TotalOfSpecialRequests	0	0	0	0	1
ReservationStatus	Check-Out	Check-Out	Check-Out	Check-Out	Check-Out
ReservationStatusDate	7/1/2015	7/1/2015	7/2/2015	7/2/2015	7/3/2015

Note that ADR has not been changed in the dataframe! astype returns a new Series, and I never assigned the converted ADR back to df.
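A small sketch of the same behavior on a made-up two-column table: calling `astype` without assigning leaves the dataframe alone, and `DataFrame.astype` with a dictionary converts several columns in one step. The table `toy` is hypothetical, and the explicit "int64"/"float64" names are used here only so the result is the same on every platform.

```python
import pandas as pa

# Hypothetical miniature of the bookings table.
toy = pa.DataFrame({"BookingChanges": [3, 4, 0], "ADR": [0.0, 75.0, 98.0]})

toy.ADR.astype("int64")            # result is discarded, toy is unchanged
dtype_before = toy.dtypes["ADR"]
print(dtype_before)                # float64

# Assigning the result back is what changes the dataframe.
toy = toy.astype({"BookingChanges": "float64", "ADR": "int64"})
print(toy.dtypes)
```

After the assignment, BookingChanges is float64 and ADR is int64.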

Grouping and Stats

Much like in SQL, we can perform many operations on our dataframe. We have used many of these already, but this is as good a place as any to review.

df.groupby('DistributionChannel').ADR.agg(['mean','median','count', 'std'])

	mean	median	count	std
DistributionChannel
Corporate	53.277788	45.0	3269	30.156894
Direct	103.074526	80.0	7865	67.650012
TA/TO	97.453947	80.0	28925	60.505996
Undefined	112.700000	112.7	1	NaN

Let’s review what the code above does! First, I group by DistributionChannel, which records where the hotel booking came from. Next, I select ADR, the average daily rate, which is roughly the price of the room. Finally, I aggregate the data, collecting the mean, median, count, and standard deviation. Why does Undefined not have a std?
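One way to explore that last question is with a tiny made-up table in which one group has a single row. The column names mirror the dataset, but the values are hypothetical.

```python
import pandas as pa

# Hypothetical bookings: "Undefined" appears only once.
toy = pa.DataFrame({
    "DistributionChannel": ["Direct", "Direct", "Undefined"],
    "ADR": [80.0, 120.0, 112.7],
})

stats = toy.groupby("DistributionChannel").ADR.agg(["mean", "count", "std"])
print(stats)

# std defaults to ddof=1 (divide by n - 1), so a single row gives NaN.
# With ddof=0 (divide by n) the same group gets a standard deviation of 0.
population_std = toy.groupby("DistributionChannel").ADR.std(ddof=0)
print(population_std)
```

The sample standard deviation divides by n - 1, which is zero for a group of one, so pandas reports NaN there.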

Transform

We saw apply in action with strings. There is also a transform method.

df.ADR.transform(lambda x: x+1)

0          1.00
1          1.00
2         76.00
3         76.00
4         99.00
          ...  
40055     90.75
40056    203.27
40057    154.57
40058    113.80
40059    100.06
Name: ADR, Length: 40060, dtype: float64

df.ADR.apply(lambda x: x+1)

0          1.00
1          1.00
2         76.00
3         76.00
4         99.00
          ...  
40055     90.75
40056    203.27
40057    154.57
40058    113.80
40059    100.06
Name: ADR, Length: 40060, dtype: float64
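The two outputs above match, and a short sketch can confirm that and show one place where the methods part ways. The series `prices` is made up for illustration.

```python
import pandas as pa

# Hypothetical prices standing in for the ADR column.
prices = pa.Series([0.0, 75.0, 98.0])

# For an element-wise function the two methods agree.
same = prices.apply(lambda x: x + 1).equals(prices.transform(lambda x: x + 1))
print(same)  # True

# transform insists on a result with the same index as its input,
# so a reduction such as "sum" is rejected with a ValueError.
try:
    prices.transform("sum")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

In other words, transform is the stricter of the two: it only accepts functions that give back one value per input row.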

Both methods also accept an existing function such as len directly, without the lambda, which might make your code more readable.

df.Meal.transform(len)

0        9
1        9
2        9
3        9
4        9
        ..
40055    9
40056    9
40057    9
40058    9
40059    9
Name: Meal, Length: 40060, dtype: int64

This is the length of each string. You should be surprised by this result, since BB has only two characters, until you see the following output.

df.Meal[0]

'BB       '
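The padding can be removed with the `.str.strip()` accessor. This is a minimal sketch on made-up values padded to nine characters like the entry above, not on the real column.

```python
import pandas as pa

# Hypothetical meal codes padded with trailing spaces to 9 characters.
meals = pa.Series([code.ljust(9) for code in ["BB", "HB", "BB"]])

padded_lengths = meals.transform(len).tolist()
print(padded_lengths)   # [9, 9, 9]

# .str.strip() removes leading and trailing whitespace from every entry.
cleaned = meals.str.strip()
clean_lengths = cleaned.transform(len).tolist()
print(clean_lengths)    # [2, 2, 2]
print(cleaned.tolist()) # ['BB', 'HB', 'BB']
```

Stripping before grouping or comparing avoids surprises such as 'BB' not matching 'BB       '.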

Rolling Window

Sometimes it is nice to know what is happening over several entries. A rolling (or moving) average is commonplace in finance.

df.ADR.rolling(2).sum()

0           NaN
1          0.00
2         75.00
3        150.00
4        173.00
          ...  
40055    294.02
40056    292.02
40057    355.84
40058    266.37
40059    211.86
Name: ADR, Length: 40060, dtype: float64

This adds the previous entry to the current one. To get an average instead, call mean() in place of sum(). If we wanted to look at the total daily take, we would first have to gather the daily totals by grouping.

totaldailies = df.groupby('ReservationStatusDate').ADR.agg('sum')

totaldailies

ReservationStatusDate
1/1/2015       185.90
1/1/2016      2202.59
1/1/2017     14069.98
1/10/2016     1283.39
1/10/2017     2324.99
               ...   
9/8/2016      3531.79
9/8/2017       404.05
9/9/2015      3587.90
9/9/2016      4162.33
9/9/2017       886.67
Name: ADR, Length: 913, dtype: float64

totaldailies.rolling(5).mean()

ReservationStatusDate
1/1/2015          NaN
1/1/2016          NaN
1/1/2017          NaN
1/10/2016         NaN
1/10/2017    4013.370
               ...   
9/8/2016     3165.520
9/8/2017     2353.238
9/9/2015     2424.142
9/9/2016     2860.156
9/9/2017     2514.548
Name: ADR, Length: 913, dtype: float64

This did not work as I intended because the dates are not in order: the index holds strings, which sort alphabetically rather than chronologically. Let’s convert the index to datetime format and try again.

totaldailies.index = pa.to_datetime(totaldailies.index)

I’ll need to sort them by the index too.

totaldailies = totaldailies.sort_index()
totaldailies

ReservationStatusDate
2014-11-18       0.00
2015-01-01     185.90
2015-01-02     154.14
2015-01-18       0.00
2015-01-21    3394.41
               ...   
2017-09-08     404.05
2017-09-09     886.67
2017-09-10     581.09
2017-09-12     153.57
2017-09-14     211.86
Name: ADR, Length: 913, dtype: float64
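The difference between the two orderings is easy to see on three made-up entries. The labels follow the month/day/year pattern of ReservationStatusDate, the values are placeholders, and passing `format="%m/%d/%Y"` is an assumption about how the dates are written.

```python
import pandas as pa

# Hypothetical daily totals keyed by month/day/year strings.
totals = pa.Series(
    [1.0, 2.0, 3.0],
    index=["9/9/2015", "1/10/2016", "1/1/2017"],
)

# As strings, the labels sort character by character.
as_text = totals.sort_index().index.tolist()
print(as_text)   # ['1/1/2017', '1/10/2016', '9/9/2015']

# As datetimes, they sort chronologically.
totals.index = pa.to_datetime(totals.index, format="%m/%d/%Y")
as_dates = totals.sort_index().index.strftime("%Y-%m-%d").tolist()
print(as_dates)  # ['2015-09-09', '2016-01-10', '2017-01-01']
```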

Now I think I am ready?

totaldailies.rolling('5d').mean()

ReservationStatusDate
2014-11-18       0.000000
2015-01-01     185.900000
2015-01-02     170.020000
2015-01-18       0.000000
2015-01-21    1697.205000
                 ...     
2017-09-08    1851.234000
2017-09-09    1614.150000
2017-09-10    1191.084000
2017-09-12     506.345000
2017-09-14     315.506667
Name: ADR, Length: 913, dtype: float64
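To see how a time-based window differs from a count-based one, here is a sketch that reuses the four 2015 dates and totals shown in the sorted output above. The uppercase "5D" is used for the five-day offset, and the variable names are my own.

```python
import pandas as pa

# The first four 2015 dates and totals from the sorted output above.
dates = pa.to_datetime(["2015-01-01", "2015-01-02", "2015-01-18", "2015-01-21"])
daily = pa.Series([185.90, 154.14, 0.00, 3394.41], index=dates)

by_count = daily.rolling(2).mean()     # the last 2 rows, however far apart
by_time = daily.rolling("5D").mean()   # rows dated within the last 5 days

print(pa.DataFrame({"rolling(2)": by_count, "rolling('5D')": by_time}))
```

On 2015-01-18 the count-based window averages that day with 2015-01-02, sixteen days earlier, while the five-day window sees only that day's total. The time-based window also returns a value for the first row, because it needs only one observation by default.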

Your Turn

Grab the iris dataset. Answer the following questions:

Does converting SepalLength to integer increase or decrease the mean?
Does the direction of the shift remain the same if you groupby Class?
Gather the mean, median, count and standard deviation of all columns when grouped by Class.