Open In Colab

Dates#

Converting Seperate Columns into Datetime#

I gathered the folloiwng dataset from this paper. It looks at actual hotel stays and reservations. It is interesting because it has dates in several formats.

import pandas as pa

df = pa.read_csv('https://raw.githubusercontent.com/nurfnick/Data_Viz/main/H1.csv')

df.head(15).T
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
IsCanceled 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0
LeadTime 342 737 7 13 14 14 0 9 85 75 23 35 68 18 37
ArrivalDateYear 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015
ArrivalDateMonth July July July July July July July July July July July July July July July
ArrivalDateWeekNumber 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27
ArrivalDateDayOfMonth 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
StaysInWeekendNights 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
StaysInWeekNights 0 0 1 1 2 2 2 2 3 3 4 4 4 4 4
Adults 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2
Children 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Babies 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Meal BB BB BB BB BB BB BB FB BB HB BB HB BB HB BB
Country PRT PRT GBR GBR GBR GBR PRT PRT PRT PRT PRT PRT USA ESP PRT
MarketSegment Direct Direct Direct Corporate Online TA Online TA Direct Direct Online TA Offline TA/TO Online TA Online TA Online TA Online TA Online TA
DistributionChannel Direct Direct Direct Corporate TA/TO TA/TO Direct Direct TA/TO TA/TO TA/TO TA/TO TA/TO TA/TO TA/TO
IsRepeatedGuest 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
PreviousCancellations 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
PreviousBookingsNotCanceled 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ReservedRoomType C C A A A A C C A D E D D G E
AssignedRoomType C C C A A A C C A D E D E G E
BookingChanges 3 4 0 0 0 0 0 0 0 0 0 0 0 1 0
DepositType No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit No Deposit
Agent NULL NULL NULL 304 240 240 NULL 303 240 15 240 240 240 241 241
Company NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
DaysInWaitingList 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
CustomerType Transient Transient Transient Transient Transient Transient Transient Transient Transient Transient Transient Transient Transient Transient Transient
ADR 0.0 0.0 75.0 75.0 98.0 98.0 107.0 103.0 82.0 105.5 123.0 145.0 97.0 154.77 94.71
RequiredCarParkingSpaces 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
TotalOfSpecialRequests 0 0 0 0 1 1 0 1 1 0 0 0 3 1 0
ReservationStatus Check-Out Check-Out Check-Out Check-Out Check-Out Check-Out Check-Out Check-Out Canceled Canceled Canceled Check-Out Check-Out Check-Out Check-Out
ReservationStatusDate 7/1/2015 7/1/2015 7/2/2015 7/2/2015 7/3/2015 7/3/2015 7/3/2015 7/3/2015 5/6/2015 4/22/2015 6/23/2015 7/5/2015 7/5/2015 7/5/2015 7/5/2015
df.dtypes
IsCanceled                       int64
LeadTime                         int64
ArrivalDateYear                  int64
ArrivalDateMonth                object
ArrivalDateWeekNumber            int64
ArrivalDateDayOfMonth            int64
StaysInWeekendNights             int64
StaysInWeekNights                int64
Adults                           int64
Children                         int64
Babies                           int64
Meal                            object
Country                         object
MarketSegment                   object
DistributionChannel             object
IsRepeatedGuest                  int64
PreviousCancellations            int64
PreviousBookingsNotCanceled      int64
ReservedRoomType                object
AssignedRoomType                object
BookingChanges                   int64
DepositType                     object
Agent                           object
Company                         object
DaysInWaitingList                int64
CustomerType                    object
ADR                            float64
RequiredCarParkingSpaces         int64
TotalOfSpecialRequests           int64
ReservationStatus               object
ReservationStatusDate           object
dtype: object

We can see nothing has been interpreted as a date yet! Let’s grab the low hanging fruit and convert the ReservationStatusDate into datetime. I’ll have to give it the correct format of the date. In this case it is in the Month/Day/Year.

pa.to_datetime(df.ReservationStatusDate, format='%m/%d/%Y')
0       2015-07-01
1       2015-07-01
2       2015-07-02
3       2015-07-02
4       2015-07-03
           ...    
40055   2017-09-10
40056   2017-09-10
40057   2017-09-12
40058   2017-09-14
40059   2017-09-14
Name: ReservationStatusDate, Length: 40060, dtype: datetime64[ns]

I’ll add that to the dataframe converting it’s current column.

df.ReservationStatusDate = pa.to_datetime(df.ReservationStatusDate, format='%m/%d/%Y')

If I’d like to do is get the Arrival Date into a date format. To do that I’ll first convert the month into a numeric using a dictionary.

df.ArrivalDateMonth.unique()
array(['July', 'August', 'September', 'October', 'November', 'December',
       'January', 'February', 'March', 'April', 'May', 'June'],
      dtype=object)
monthConvert = {
    'January':1,
    'February':2,
    'March':3,
    'April':4,
    'May':5,
    'June':6,
    'July':7,
    'August':8,
    'September':9,
    'October':10,
    'November':11,
    'December':12
}
df.ArrivalDateMonth.apply(lambda s: monthConvert[s])
0        7
1        7
2        7
3        7
4        7
        ..
40055    8
40056    8
40057    8
40058    8
40059    8
Name: ArrivalDateMonth, Length: 40060, dtype: int64
df['ArrivalDate'] = pa.to_datetime(dict(year = df.ArrivalDateYear, month = df.ArrivalDateMonth.apply(lambda s: monthConvert[s]), day = df.ArrivalDateDayOfMonth))

df.dtypes
IsCanceled                              int64
LeadTime                                int64
ArrivalDateYear                         int64
ArrivalDateMonth                       object
ArrivalDateWeekNumber                   int64
ArrivalDateDayOfMonth                   int64
StaysInWeekendNights                    int64
StaysInWeekNights                       int64
Adults                                  int64
Children                                int64
Babies                                  int64
Meal                                   object
Country                                object
MarketSegment                          object
DistributionChannel                    object
IsRepeatedGuest                         int64
PreviousCancellations                   int64
PreviousBookingsNotCanceled             int64
ReservedRoomType                       object
AssignedRoomType                       object
BookingChanges                          int64
DepositType                            object
Agent                                  object
Company                                object
DaysInWaitingList                       int64
CustomerType                           object
ADR                                   float64
RequiredCarParkingSpaces                int64
TotalOfSpecialRequests                  int64
ReservationStatus                      object
ReservationStatusDate          datetime64[ns]
ArrivalDate                    datetime64[ns]
dtype: object

Date Differences#

I am curious if these two dates have any differences…

differenceInDates =  df.ArrivalDate -df.ReservationStatusDate

differenceInDates.describe()
count                         40060
mean     18 days 01:33:23.275087369
std      54 days 20:29:21.893473870
min              -69 days +00:00:00
25%               -5 days +00:00:00
50%               -2 days +00:00:00
75%                 3 days 00:00:00
max               411 days 00:00:00
dtype: object

We see that on average people make their reservation about 18 days ahead of time and some more than a year! Not so sure what is going on with all the negative times…

If graphed this in terms of days. Of course there are other options found here

differenceInDates.astype('timedelta64[D]').plot(kind = 'hist')
<matplotlib.axes._subplots.AxesSubplot at 0x7f3d7c013a10>
../../_images/697315d691f2b7c3000e9a7822681d05fe8932c9fdde50b53d74d91ad4c1a190.png
differenceInDates.astype('timedelta64[D]').plot.box()
<matplotlib.axes._subplots.AxesSubplot at 0x7f3d7bf89410>
../../_images/801e1c6e052d6544f4bc80ffd618c6f1b58ed2ac333b3e6153e30cc2bd11401b.png

Extracting Peices#

What if we wanted to get the day from the datetime? To get it you’ll need to convert it to datetime it is shortened to dt here.

df.ArrivalDate.dt.day
0         1
1         1
2         1
3         1
4         1
         ..
40055    31
40056    30
40057    29
40058    31
40059    31
Name: ArrivalDate, Length: 40060, dtype: int64
df.ArrivalDate.dt.week
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: Series.dt.weekofyear and Series.dt.week have been deprecated.  Please use Series.dt.isocalendar().week instead.
  """Entry point for launching an IPython kernel.
0        27
1        27
2        27
3        27
4        27
         ..
40055    35
40056    35
40057    35
40058    35
40059    35
Name: ArrivalDate, Length: 40060, dtype: int64
df.ArrivalDate.dt.year
0        2015
1        2015
2        2015
3        2015
4        2015
         ... 
40055    2017
40056    2017
40057    2017
40058    2017
40059    2017
Name: ArrivalDate, Length: 40060, dtype: int64
df.ArrivalDate.dt.month
0        7
1        7
2        7
3        7
4        7
        ..
40055    8
40056    8
40057    8
40058    8
40059    8
Name: ArrivalDate, Length: 40060, dtype: int64

We could also ask what day of the week it was.

df.ArrivalDate.dt.dayofweek
0        2
1        2
2        2
3        2
4        2
        ..
40055    3
40056    2
40057    1
40058    3
40059    3
Name: ArrivalDate, Length: 40060, dtype: int64
df.ArrivalDate.dt.day_name()
0        Wednesday
1        Wednesday
2        Wednesday
3        Wednesday
4        Wednesday
           ...    
40055     Thursday
40056    Wednesday
40057      Tuesday
40058     Thursday
40059     Thursday
Name: ArrivalDate, Length: 40060, dtype: object
df.ArrivalDate.dt.weekday
0        2
1        2
2        2
3        2
4        2
        ..
40055    3
40056    2
40057    1
40058    3
40059    3
Name: ArrivalDate, Length: 40060, dtype: int64

The weekend is 5 and 6 for Saturday and Sunday so we can ask how many were then by the following

df.ArrivalDate.dt.weekday[df.ArrivalDate.dt.weekday>4].count()
12009

We might wonder if that is an appropriate amount, it aappears to be about

12009/40060
0.29977533699450826
2/7
0.2857142857142857

Yeah thos proportions look similar but we’d have to do a hypothesis test to verify it statistically.

Numpy Can Handle Dates#

Sometimes you might need to know several dates in a row. Numpy can handle that with np.arange.

import numpy as np

np.arange('2018-01-01','2018-01-08', dtype='datetime64[D]')
array(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
       '2018-01-05', '2018-01-06', '2018-01-07'], dtype='datetime64[D]')
pa.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', freq='D')

Actually pandas did have a similar call and was less picky about the input style of the dates.

If you want to check for business days in a range, week days! Use bdate_range. We see the weekend was the 6th and 7th.

pa.bdate_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08'],
              dtype='datetime64[ns]', freq='B')

Lag For TimeSeries Data#

Sometimes our data is set in order in a time series and we are interested in knowing what came before it. To look at this we would use the shift function.

df.StaysInWeekendNights.shift()
0        NaN
1        0.0
2        0.0
3        0.0
4        0.0
        ... 
40055    2.0
40056    2.0
40057    2.0
40058    4.0
40059    4.0
Name: StaysInWeekendNights, Length: 40060, dtype: float64

We can then ask about the difference of those columns. I’ll force the first to be filled with a zero.

df.StaysInWeekendNights.shift(fill_value=0)-df.StaysInWeekendNights
0        0
1        0
2        0
3        0
4        0
        ..
40055    0
40056    0
40057   -2
40058    0
40059    0
Name: StaysInWeekendNights, Length: 40060, dtype: int64

This might be excellent for something like a moving average! My example is a 5 step moveing average but that can be changed easily.

cols = df.StaysInWeekendNights
for i in range(1,5):
  cols += df.StaysInWeekendNights.shift(periods = i)

cols/5
0         NaN
1         NaN
2         NaN
3         NaN
4         NaN
         ... 
40055    10.0
40056    10.0
40057    10.0
40058    10.0
40059    10.0
Name: StaysInWeekendNights, Length: 40060, dtype: float64

I had some issues making this code do what I expected so be warry!

Selecting Dates#

Some times we want to select certain dates from a dataset. Perhaps I wanted to gather the fourth of July week from the dataset, 01-07-18 to 07-07-18.

df.ArrivalDate.between('01-07-15','07-07-15')
0         True
1         True
2         True
3         True
4         True
         ...  
40055    False
40056    False
40057    False
40058    False
40059    False
Name: ArrivalDate, Length: 40060, dtype: bool

Then to get at that data, we can pass the booleen values into the dataframe.

df[df.ArrivalDate.between('01-07-15','07-07-15')].tail(100)
IsCanceled LeadTime ArrivalDateYear ArrivalDateMonth ArrivalDateWeekNumber ArrivalDateDayOfMonth StaysInWeekendNights StaysInWeekNights Adults Children Babies Meal Country MarketSegment DistributionChannel IsRepeatedGuest PreviousCancellations PreviousBookingsNotCanceled ReservedRoomType AssignedRoomType BookingChanges DepositType Agent Company DaysInWaitingList CustomerType ADR RequiredCarParkingSpaces TotalOfSpecialRequests ReservationStatus ReservationStatusDate ArrivalDate
202 0 66 2015 July 28 7 0 4 2 0 0 BB PRT Online TA TA/TO 0 0 0 D D 0 No Deposit 240 NULL 0 Transient 97.00 0 2 Check-Out 2015-07-11 2015-07-07
203 0 88 2015 July 28 7 0 4 3 0 0 HB USA Offline TA/TO TA/TO 0 0 0 D E 0 No Deposit 95 NULL 0 Transient 150.00 0 1 Check-Out 2015-07-11 2015-07-07
204 0 0 2015 July 28 7 0 4 2 0 0 BB NLD Direct Direct 0 0 0 A A 2 No Deposit NULL NULL 0 Transient 98.00 0 0 Check-Out 2015-07-11 2015-07-07
205 1 43 2015 July 28 7 0 5 2 0 0 HB PRT Offline TA/TO TA/TO 0 0 0 A A 0 No Deposit 134 NULL 0 Transient 91.50 0 0 Canceled 2015-05-28 2015-07-07
206 1 88 2015 July 28 7 2 5 2 0 0 HB PRT Online TA TA/TO 0 0 0 E E 0 No Deposit 240 NULL 0 Transient 139.00 0 2 Canceled 2015-04-13 2015-07-07
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15443 0 185 2015 July 28 6 3 8 2 0 0 BB NLD Offline TA/TO TA/TO 0 0 0 A A 0 No Deposit 128 NULL 0 Contract 90.95 0 0 Check-Out 2015-07-17 2015-07-06
15447 0 184 2015 July 28 7 2 9 2 0 0 HB GBR Offline TA/TO TA/TO 0 0 0 A A 0 No Deposit 143 NULL 0 Contract 110.50 0 0 Check-Out 2015-07-18 2015-07-07
15459 0 184 2015 July 27 4 4 11 2 0 0 BB IRL Offline TA/TO TA/TO 0 0 0 A A 0 No Deposit 156 NULL 0 Contract 90.95 0 0 Check-Out 2015-07-19 2015-07-04
15460 0 93 2015 July 28 5 4 10 2 0 0 BB GBR Offline TA/TO TA/TO 0 0 0 A A 0 No Deposit 243 NULL 0 Contract 88.40 0 1 Check-Out 2015-07-19 2015-07-05
15470 0 125 2015 July 28 5 4 10 2 0 0 HB DEU Offline TA/TO TA/TO 0 0 0 A A 1 No Deposit 69 NULL 0 Contract 113.05 0 0 Check-Out 2015-07-19 2015-07-05

100 rows × 32 columns

The one strange thing about the between command is that it is inclusive so both the 1st and the 7th are included. If times are included only exactly at midnight on the 7th is included. You can change the inclusion with the command inclusive it can be both, neither, left or right.

Your Turn#

Using the dataset from an IOT device retreived from here and introduced to here. Uploaded to the web for you to retrieve easily here https://raw.githubusercontent.com/nurfnick/Data_Viz/main/IOT-temp.csv Answer the folloiwng questions.

  1. Convert the noted_date into DateTime, it is in day-month-Year Hour:Minute format.

  2. What is the minimum and maximum date and time?

  3. What is the largest interval inbetween temperature checks?

  4. What is the average datetime?

  5. What was the average outside temperature on ‘09-11-2018’?