Learning Some SQL with BigQuery#
The first bit below comes directly from Google; you'll need to complete each of these steps before anything in this document will run.
Before you begin#
Use the Cloud Resource Manager to create a Cloud Platform project if you do not already have one.
Enable billing for the project.
Enable BigQuery APIs for the project.
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
Now that I am authenticated, I can start to play around in the dataset. I am going to look at the Iowa liquor sales data and try to find the stores with the most and the least sales in each city. I already have a project called pic-math in my BigQuery interface, so you'll need to make your own; keep the name simple but identifiable!
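One setup note: Colab usually has the BigQuery cell magic available already, but if %%bigquery is not recognized in your environment, loading the extension first should register it. This is a hedged sketch; the extension name below is the one shipped with the google-cloud-bigquery client library, and newer versions may use a different name.
# Only needed if the %%bigquery magic below is not already registered.
# Extension name assumes the google-cloud-bigquery client library.
%load_ext google.cloud.bigquery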
Why do we use SQL?#
Below you'll see a basic SQL call. This illustrates why Excel is not useful here: 28 million rows is about 27 million more than Excel can handle! Essentially, SQL does the data manipulation on the database server side instead of on your machine (or in the Colab cloud runtime).
%%bigquery --project pic-math
SELECT
COUNT(*) as total_rows
FROM `bigquery-public-data.iowa_liquor_sales.sales`
 | total_rows |
---|---|
0 | 28176372 |
We see that is a lot of rows. We really don't want to try to load all of that into memory! Let's have a peek at the data.
The only two required pieces of a SQL query are SELECT and FROM. SELECT picks the columns you want by name, and FROM picks the table you want to look at. Both can be shortened with an alias in the query, and sometimes it is necessary to attach the table name to the column. Below I do the same thing in two different ways. Do you see a difference in the output?
%%bigquery --project pic-math
SELECT
AVG(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
 | f0_ |
---|---|
0 | 144.279646 |
%%bigquery --project pic-math
SELECT AVG(table.sale_dollars) as average_sale_dollars
FROM `bigquery-public-data.iowa_liquor_sales.sales` as table
 | average_sale_dollars |
---|---|
0 | 144.279646 |
See any difference? The two results are identical, but the alias gives the output column a readable name (average_sale_dollars) instead of the default f0_. You should also be asking yourself why it would be advantageous to name your tables. Well, we will see shortly that joining tables (remember relational databases?) is going to be an important task! Sometimes some of the info we want will be in one table and some of it in another.
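To make that concrete, here is a hedged sketch of what a join with table aliases might look like. The stores table and its opened_date column are hypothetical (this public dataset only exposes the sales table); the point is just that short aliases like s and st keep the SELECT list and the ON clause readable.
%%bigquery --project pic-math
-- Hypothetical sketch: `your-project.your_dataset.stores` does not exist;
-- it only illustrates how table aliases keep a join readable.
SELECT
  s.store_name,
  st.opened_date
FROM `bigquery-public-data.iowa_liquor_sales.sales` AS s
JOIN `your-project.your_dataset.stores` AS st
  ON s.store_number = st.store_number
LIMIT 5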
Here is a command that will allow you to peek at the whole table (like head in pandas!). The * gives you all the columns and LIMIT returns only as many rows as you specify. Note that without an ORDER BY, SQL makes no promise about which rows come back or in what order.
%%bigquery --project pic-math
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5
 | invoice_and_item_number | date | store_number | store_name | address | city | zip_code | store_location | county_number | county | ... | item_number | item_description | pack | bottle_volume_ml | state_bottle_cost | state_bottle_retail | bottles_sold | sale_dollars | volume_sold_liters | volume_sold_gallons |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | INV-23344200024 | 2019-11-19 | 4452 | SELECT MART GORDON DR | 2825 GORDON DR | SIOUX CITY | 51105 | POINT(-96.372014 42.489632) | 97 | WOODBURY | ... | 100510 | RUMCHATA MINICHATAS HOLIDAY MUG | 12 | 25 | 5.00 | 7.50 | 1 | 7.50 | 0.02 | 0.00 |
1 | INV-22726700022 | 2019-10-22 | 5504 | NEIGHBORHOOD MART | 2102 LAFAYETTE ST | WATERLOO | 50703 | POINT(-92.313923 42.489893) | 7 | BLACK HAWK | ... | 100794 | RUMCHATA MINICHATAS SUMMER TUMBLER VAP | 12 | 25 | 5.00 | 7.50 | 1 | 7.50 | 0.02 | 0.00 |
2 | INV-40707200098 | 2021-10-05 | 5423 | STAMMER LIQUOR CORP | 615 2ND AVE | SHELDON | 51201.0 | None | 71 | O'BRIEN | ... | 65259 | JAGERMEISTER LIQUEUR MINI MEISTERS | 12 | 20 | 4.93 | 7.40 | 1 | 7.40 | 0.02 | 0.00 |
3 | INV-61319400030 | 2023-08-16 | 5822 | CASEY'S GENERAL STORE #2836 / MONROE | 509 S MONROE ST | MONROE | 50170.0 | POINT(-93.10249604 41.51755801) | None | JASPER | ... | 65259 | JAGERMEISTER LIQUEUR MINI MEISTERS | 12 | 20 | 5.63 | 8.45 | 1 | 8.45 | 0.02 | 0.00 |
4 | INV-33181800087 | 2021-01-04 | 2614 | HY-VEE #3 FOOD & DRUGSTORE / DAVENPORT | 1823 E KIMBERLY RD | DAVENPORT | 52807.0 | POINT(-90.548919 41.556781) | 82 | SCOTT | ... | 100413 | FIREBALL CINNAMON WHISKEY PARTY BUCKET | 1 | 50 | 51.60 | 77.40 | 1 | 77.40 | 0.05 | 0.01 |
5 rows × 24 columns
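Since rows come back in no particular order, ORDER BY is the clause to reach for when you do care. Here is a quick sketch against the same table (I have not included its output here) that would pull the five largest single transactions by sale_dollars:
%%bigquery --project pic-math
SELECT
  date,
  store_name,
  sale_dollars
FROM `bigquery-public-data.iowa_liquor_sales.sales`
ORDER BY sale_dollars DESC
LIMIT 5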
Not sure how much we might use this, but if I want the query results in a pandas DataFrame, I can just add a name for it after the BigQuery project name in the cell magic (sketched below), or query with pandas_gbq directly, which is what the next cell does.
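Here is what the magic version might look like. I have not run this variant in this document, and the exact placement of the variable name relative to the flags can vary by library version, but the idea is that the magic stores the result in the named DataFrame (sales_head is just a name I made up) instead of only displaying it. After the cell runs, sales_head is an ordinary pandas DataFrame in the notebook's namespace.
%%bigquery --project pic-math sales_head
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5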
import pandas_gbq
# Use your own Google Cloud project ID here (mine is pic-math).
sql = """
SELECT
  city,
  store_name,
  SUM(sale_dollars) as total_sales  -- lifetime sales for each city/store pair
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null
GROUP BY city, store_name  -- one output row per city/store combination
"""
df = pandas_gbq.read_gbq(sql, project_id="pic-math")
Downloading: 100%|██████████|
df
 | city | store_name | total_sales |
---|---|---|---|
0 | WEST DES MOINES | HY-VEE FOOD AND DRUG (1888) / GRAND / WDM | 1080634.67 |
1 | CEDAR RAPIDS | CASEY'S GENERAL STORE #2782 / CEDAR RAPIDS | 172046.88 |
2 | SIOUX CITY | HY-VEE FOOD STORE / SIOUX CITY | 10342610.75 |
3 | CEDAR RAPIDS | QUICK STOP / CEDAR RAPIDS | 734653.22 |
4 | DES MOINES | FAREWAY STORES #900 / EUCLID | 3001919.51 |
... | ... | ... | ... |
3260 | DES MOINES | CASEY'S GENERAL STORE #2099 / CLIVE | 2244.50 |
3261 | DUNCOMBE | WILDCAT DISTILLING | 1296.00 |
3262 | WEST DES MOINES | THE FRESH MARKET / W DES MOINES | 16748.67 |
3263 | COLFAX | COLFAX TRUCK STOP LLC / COLFAX | 1125.03 |
3264 | RAYMOND | THOME ENTERPRISES LLC | 288.00 |
3265 rows × 3 columns
groupeddf = df.groupby('city')
# Note: .max() is column-wise, so store_name and total_sales may come from different rows
maxdf = groupeddf.max()
maxdf
city | store_name | total_sales
---|---|---
ACKLEY | KUM & GO #513 / ACKLEY | 256895.99 |
ADAIR | KUM & GO #76 / ADAIR | 474742.46 |
ADEL | WORLD LIQUOR & TOBACCO PLUS VAPORS / ADEL | 4036269.71 |
AFTON | CASEY'S GENERAL STORE #2626 / AFTON | 528773.74 |
AINSWORTH | AINSWORTH FOUR CORNERS / AINSWORTH | 77552.47 |
... | ... | ... |
WINTHROP | SPEEDE SHOP / WINTHROP | 444054.97 |
WOODBINE | FOODLAND SUPER MARKETS / WOODBINE | 864858.43 |
WOODWARD | CASEY'S GENERAL STORE #2551 / WOODWARD | 307674.78 |
WYOMING | CASEY'S GENERAL STORE #1126 / WYOMING | 108172.59 |
ZWINGLE | SNK GAS & FOOD LLC | 293386.31 |
486 rows × 2 columns
mindf = groupeddf.min()
mindf
city | store_name | total_sales
---|---|---
ACKLEY | ACKLEY SUPER FOODS | 126629.70 |
ADAIR | CASEY'S GENERAL STORE #2521 / ADAIR | 269633.10 |
ADEL | CASEY'S GENERAL STORE #1680 / ADEL | 18870.67 |
AFTON | CASEY'S GENERAL STORE #2626 / AFTON | 528773.74 |
AINSWORTH | AINSWORTH FOUR CORNERS / AINSWORTH | 77552.47 |
... | ... | ... |
WINTHROP | MCELROY'S FOOD / WINTHROP | 13408.22 |
WOODBINE | CASEY'S GENERAL STORE #2665 / WOODBINE | 6439.33 |
WOODWARD | CASEY'S GENERAL STORE #2551 / WOODWA | 136097.63 |
WYOMING | CASEY'S GENERAL STORE #1126 / WYOMING | 108172.59 |
ZWINGLE | OTTER CREEK STOP / ZWINGLE | 76944.05 |
486 rows × 2 columns
I am clearly just showing off now. I have left more along this line at the bottom but let’s get your assignment up!
I noticed a few things while attempting this. While I think I have a solution, it is clearly not the best. Zwingle and ZWINGLE are probably the same town, and SNK may just be the only store in town, but the fact that it appears four times in my lists is disappointing!
maxdf.sort_values('total_sales',ascending=False)
city | store_name | total_sales
---|---|---
DES MOINES | WORLD LIQUOR & TOBACCO + VAPORS / DES MOINES | 1.294751e+08 |
IOWA CITY | WALGREENS #05077 / IOWA CITY | 4.876863e+07 |
CEDAR RAPIDS | YR DOLLAR STOP | 4.252915e+07 |
WINDSOR HEIGHTS | WINE AND SPIRITS GALLERY | 3.638091e+07 |
BETTENDORF | WALGREENS #06553 / BETTENDORF | 3.092285e+07 |
... | ... | ... |
WILLIAMS | JAY BROS / WILLIAMS | 5.741640e+03 |
DOUDS | DOUDS KWIK STOP / DOUDS | 4.163310e+03 |
DUNCOMBE | WILDCAT DISTILLING / DUNCOMBE | 1.296000e+03 |
ALVORD | DIRECT LIQUOR / ALVORD | 7.659100e+02 |
COLORADO SPRINGS | DASH EVENTS LLC | 3.114000e+01 |
486 rows × 2 columns
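For what it's worth, here is a sketch of one way to patch the pandas version (using the df built above; I did not run this for the document): keep the single highest-grossing row per city instead of taking column-wise maxima, so the store name and the dollar figure always belong to the same store.
# Keep the highest-grossing store row in each city, so store_name and
# total_sales stay paired (unlike groupby(...).max(), which is column-wise).
best_store_per_city = (
    df.sort_values('total_sales', ascending=False)
      .drop_duplicates(subset='city', keep='first')
      .set_index('city')
      .sort_index()
)
best_store_per_city.head()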
I want to try the extra challenge and do it all in SQL. I'll try to find which store had the most sales on each date! The trick is the window function: RANK() OVER (PARTITION BY date ORDER BY SUM(sale_dollars) DESC) ranks each store's daily total within its own date, so keeping only rank 1 gives the top store per day.
%%bigquery --project pic-math
WITH bestday as (
SELECT
date,
store_name,
city,
SUM(sale_dollars) as total_sales,
RANK() over (PARTITION BY date ORDER BY SUM(sale_dollars) desc) as top_sales_rank
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null
GROUP BY date, store_name, city
)
SELECT
date,
store_name,
city,
total_sales
FROM bestday
WHERE top_sales_rank = 1
ORDER BY date
 | date | store_name | city | total_sales |
---|---|---|---|---|
0 | 2012-01-03 | SAM'S CLUB 8238 / DAVENPORT | DAVENPORT | 51350.70 |
1 | 2012-01-04 | HY-VEE FOOD STORE / SHELDON | SHELDON | 73296.69 |
2 | 2012-01-05 | HY-VEE #3 / BDI / DES MOINES | DES MOINES | 74085.23 |
3 | 2012-01-09 | CENTRAL CITY LIQUOR, INC. | DES MOINES | 60435.09 |
4 | 2012-01-10 | SAM'S CLUB 8238 / DAVENPORT | DAVENPORT | 33423.60 |
... | ... | ... | ... | ... |
3097 | 2023-12-26 | HY-VEE #3 / BDI / DES MOINES | DES MOINES | 188329.84 |
3098 | 2023-12-27 | COSTCO WHOLESALE #788 / WDM | WEST DES MOINES | 112209.65 |
3099 | 2023-12-28 | MISSISSIPPI RIVER DISTILLERY - LECLAIRE (ET) | LECLAIRE | 184514.04 |
3100 | 2023-12-29 | HY-VEE #3 / BDI / DES MOINES | DES MOINES | 114838.96 |
3101 | 2023-12-30 | ANOTHER ROUND / DEWITT | DEWITT | 73238.16 |
3102 rows × 4 columns
Your Turn#
Start a notebook and get BigQuery working. Feel free to reuse the authentication code at the top, but you will need to create your own project. You should be able to follow the links at the top AND not have to enter a credit card.
Navigate to the dataset ‘bigquery-public-data.austin_bikeshare.bikeshare_trips’
Compute how many entries are in the dataset
Compute the longest trip from ‘duration_minutes’
Compute the average time for a trip