Open In Colab

Learning Some SQL with BigQuery#

The first bit below comes directly from Google, you’ll need to do each of those to be successful in getting this document to work.

Before you begin#

  1. Use the Cloud Resource Manager to Create a Cloud Platform project if you do not already have one.

  2. Enable billing for the project.

  3. Enable BigQuery APIs for the project.

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated

Now that I am authenticated, I can start to play around in the dataset. I am going to look at the liquor sales data from Iowa and try to find the most and least sales by city. I do have a project called pic-math in my BigQuery interface. So you’ll need to make one but keep the name simple but identifiable!

Why do we use SQL#

Below you’ll see a basic SQL call. This illustrates why excel is not useful, 22 million rows is about 21.5 million more than excel can handle! Essentially SQL will do the data manipulations on the database server side instead of on you machine (or in the cloud with colab)

%%bigquery --project pic-math
SELECT
  COUNT(*) as total_rows
FROM `bigquery-public-data.iowa_liquor_sales.sales`
total_rows
0 28176372

We see that is a lot of rows. We really don’t want to try to store that into memory! Let’s have a peak at the data.

The only two required features of an SQL call are SELECT and FROM. SELECT picks the columns you want by name in the data table. FROM picks the table you want to look at. Both can be shorted in the call and sometimes it is nesseccary to all the table name with the column. Below I do the same thing in two different ways. Do you see a difference in the output?

%%bigquery --project pic-math
SELECT
  AVG(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
f0_
0 144.279646
%%bigquery --project pic-math
SELECT AVG(table.sale_dollars) as average_sale_dollars
FROM `bigquery-public-data.iowa_liquor_sales.sales` as table
average_sale_dollars
0 144.279646

See any difference? You should be asking yourself why it would be adventageous to name your tables. Well, we will see shortly that joining the tables (remember relational database?) is going to be an important task! Sometimes some info we want will be in one table and some of it in another.

Here is a command that will allow you to peak at the whole table (like head!) The * gives you all the columns and the LIMIT gives only how many you specify. There is not preferred order in SQL.

%%bigquery --project pic-math
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5
invoice_and_item_number date store_number store_name address city zip_code store_location county_number county ... item_number item_description pack bottle_volume_ml state_bottle_cost state_bottle_retail bottles_sold sale_dollars volume_sold_liters volume_sold_gallons
0 INV-23344200024 2019-11-19 4452 SELECT MART GORDON DR 2825 GORDON DR SIOUX CITY 51105 POINT(-96.372014 42.489632) 97 WOODBURY ... 100510 RUMCHATA MINICHATAS HOLIDAY MUG 12 25 5.00 7.50 1 7.50 0.02 0.00
1 INV-22726700022 2019-10-22 5504 NEIGHBORHOOD MART 2102 LAFAYETTE ST WATERLOO 50703 POINT(-92.313923 42.489893) 7 BLACK HAWK ... 100794 RUMCHATA MINICHATAS SUMMER TUMBLER VAP 12 25 5.00 7.50 1 7.50 0.02 0.00
2 INV-40707200098 2021-10-05 5423 STAMMER LIQUOR CORP 615 2ND AVE SHELDON 51201.0 None 71 O'BRIEN ... 65259 JAGERMEISTER LIQUEUR MINI MEISTERS 12 20 4.93 7.40 1 7.40 0.02 0.00
3 INV-61319400030 2023-08-16 5822 CASEY'S GENERAL STORE #2836 / MONROE 509 S MONROE ST MONROE 50170.0 POINT(-93.10249604 41.51755801) None JASPER ... 65259 JAGERMEISTER LIQUEUR MINI MEISTERS 12 20 5.63 8.45 1 8.45 0.02 0.00
4 INV-33181800087 2021-01-04 2614 HY-VEE #3 FOOD & DRUGSTORE / DAVENPORT 1823 E KIMBERLY RD DAVENPORT 52807.0 POINT(-90.548919 41.556781) 82 SCOTT ... 100413 FIREBALL CINNAMON WHISKEY PARTY BUCKET 1 50 51.60 77.40 1 77.40 0.05 0.01

5 rows × 24 columns

Not sure how much we might use this but if I wanted the data into a pandas dataframe, I just add a name for it after the bigquery project name.

import pandas_gbq

# TODO: Set project_id to your Google Cloud Platform project ID.
# project_id = "my-project"

sql = """
SELECT
  city,
  store_name,
  SUM(sale_dollars) as total_sales
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null
GROUP BY city, store_name
"""
df = pandas_gbq.read_gbq(sql, project_id="pic-math")


Downloading: 100%|██████████|
df
city store_name total_sales
0 WEST DES MOINES HY-VEE FOOD AND DRUG (1888) / GRAND / WDM 1080634.67
1 CEDAR RAPIDS CASEY'S GENERAL STORE #2782 / CEDAR RAPIDS 172046.88
2 SIOUX CITY HY-VEE FOOD STORE / SIOUX CITY 10342610.75
3 CEDAR RAPIDS QUICK STOP / CEDAR RAPIDS 734653.22
4 DES MOINES FAREWAY STORES #900 / EUCLID 3001919.51
... ... ... ...
3260 DES MOINES CASEY'S GENERAL STORE #2099 / CLIVE 2244.50
3261 DUNCOMBE WILDCAT DISTILLING 1296.00
3262 WEST DES MOINES THE FRESH MARKET / W DES MOINES 16748.67
3263 COLFAX COLFAX TRUCK STOP LLC / COLFAX 1125.03
3264 RAYMOND THOME ENTERPRISES LLC 288.00

3265 rows × 3 columns



groupeddf = df.groupby('city')
maxdf = groupeddf.max()
maxdf
store_name total_sales
city
ACKLEY KUM & GO #513 / ACKLEY 256895.99
ADAIR KUM & GO #76 / ADAIR 474742.46
ADEL WORLD LIQUOR & TOBACCO PLUS VAPORS / ADEL 4036269.71
AFTON CASEY'S GENERAL STORE #2626 / AFTON 528773.74
AINSWORTH AINSWORTH FOUR CORNERS / AINSWORTH 77552.47
... ... ...
WINTHROP SPEEDE SHOP / WINTHROP 444054.97
WOODBINE FOODLAND SUPER MARKETS / WOODBINE 864858.43
WOODWARD CASEY'S GENERAL STORE #2551 / WOODWARD 307674.78
WYOMING CASEY'S GENERAL STORE #1126 / WYOMING 108172.59
ZWINGLE SNK GAS & FOOD LLC 293386.31

486 rows × 2 columns

mindf = groupeddf.min()

mindf
store_name total_sales
city
ACKLEY ACKLEY SUPER FOODS 126629.70
ADAIR CASEY'S GENERAL STORE #2521 / ADAIR 269633.10
ADEL CASEY'S GENERAL STORE #1680 / ADEL 18870.67
AFTON CASEY'S GENERAL STORE #2626 / AFTON 528773.74
AINSWORTH AINSWORTH FOUR CORNERS / AINSWORTH 77552.47
... ... ...
WINTHROP MCELROY'S FOOD / WINTHROP 13408.22
WOODBINE CASEY'S GENERAL STORE #2665 / WOODBINE 6439.33
WOODWARD CASEY'S GENERAL STORE #2551 / WOODWA 136097.63
WYOMING CASEY'S GENERAL STORE #1126 / WYOMING 108172.59
ZWINGLE OTTER CREEK STOP / ZWINGLE 76944.05

486 rows × 2 columns

I am clearly just showing off now. I have left more along this line at the bottom but let’s get your assignment up!

I notice a few things attempting this. While I think I have a solution, it is clearly not the best. Zwingle and ZWINGLE are probably the same town and SNK may just be the only store but the fact that it appears four times in my lists is disappointing!

maxdf.sort_values('total_sales',ascending=False)
store_name total_sales
city
DES MOINES WORLD LIQUOR & TOBACCO + VAPORS / DES MOINES 1.294751e+08
IOWA CITY WALGREENS #05077 / IOWA CITY 4.876863e+07
CEDAR RAPIDS YR DOLLAR STOP 4.252915e+07
WINDSOR HEIGHTS WINE AND SPIRITS GALLERY 3.638091e+07
BETTENDORF WALGREENS #06553 / BETTENDORF 3.092285e+07
... ... ...
WILLIAMS JAY BROS / WILLIAMS 5.741640e+03
DOUDS DOUDS KWIK STOP / DOUDS 4.163310e+03
DUNCOMBE WILDCAT DISTILLING / DUNCOMBE 1.296000e+03
ALVORD DIRECT LIQUOR / ALVORD 7.659100e+02
COLORADO SPRINGS DASH EVENTS LLC 3.114000e+01

486 rows × 2 columns

I want to try the extra challenge and do it all in SQL. I’ll try to find which store had the most sales by date!

%%bigquery --project pic-math

WITH bestday as (
SELECT
  date,
  store_name,
  city,
  SUM(sale_dollars) as total_sales,
  RANK() over (PARTITION BY date ORDER BY SUM(sale_dollars) desc) as top_sales_rank
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null
GROUP BY date, store_name, city
)

SELECT
  date,
  store_name,
  city,
  total_sales
FROM bestday
WHERE top_sales_rank = 1
ORDER BY date
date store_name city total_sales
0 2012-01-03 SAM'S CLUB 8238 / DAVENPORT DAVENPORT 51350.70
1 2012-01-04 HY-VEE FOOD STORE / SHELDON SHELDON 73296.69
2 2012-01-05 HY-VEE #3 / BDI / DES MOINES DES MOINES 74085.23
3 2012-01-09 CENTRAL CITY LIQUOR, INC. DES MOINES 60435.09
4 2012-01-10 SAM'S CLUB 8238 / DAVENPORT DAVENPORT 33423.60
... ... ... ... ...
3097 2023-12-26 HY-VEE #3 / BDI / DES MOINES DES MOINES 188329.84
3098 2023-12-27 COSTCO WHOLESALE #788 / WDM WEST DES MOINES 112209.65
3099 2023-12-28 MISSISSIPPI RIVER DISTILLERY - LECLAIRE (ET) LECLAIRE 184514.04
3100 2023-12-29 HY-VEE #3 / BDI / DES MOINES DES MOINES 114838.96
3101 2023-12-30 ANOTHER ROUND / DEWITT DEWITT 73238.16

3102 rows × 4 columns

Your Turn#

  1. Start a notebook getting BigQuery to work. Feel free to use the authentication atop but you will need to create your own project. You should be able to follow the links at the top AND not have to enter a credit card.

  2. Navigate to the dataset ‘bigquery-public-data.austin_bikeshare.bikeshare_trips’

  3. Compute how many entries are in the dataset

  4. Compute the longest trip from ‘duration_minutes’

  5. Compute the average time for a trip