Learning Some SQL with BigQuery#
The first bit below comes directly from Google; you'll need to complete each of these steps before anything in this document will run.
Before you begin#
Use the Cloud Resource Manager to create a Cloud Platform project if you do not already have one.
Enable billing for the project.
Enable BigQuery APIs for the project.
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
Now that I am authenticated, I can start to play around in the dataset. I am going to look at the Iowa liquor sales data and try to find the stores with the most and the least sales in each city. I already have a project called pic-math in my BigQuery interface, so you'll need to make your own; keep the name simple but identifiable!
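One setup note: Colab usually has the BigQuery cell magic available already, but if %%bigquery is not recognized in your environment, loading the extension first should register it. This is a hedged sketch; the extension name below is the one shipped with the google-cloud-bigquery client library, and newer versions may use a different name.
# Only needed if the %%bigquery magic below is not already registered.
# Extension name assumes the google-cloud-bigquery client library.
%load_ext google.cloud.bigquery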
Why do we use SQL?#
Below you'll see a basic SQL call. This illustrates why Excel is not useful here: 28 million rows is about 27 million more than Excel can handle! Essentially, SQL does the data manipulation on the database server side instead of on your machine (or in the Colab cloud runtime).
%%bigquery --project pic-math
SELECT
COUNT(*) as total_rows
FROM `bigquery-public-data.iowa_liquor_sales.sales`
 | total_rows |
---|---|
0 | 28176372 |
We see that is a lot of rows. We really don't want to try to load all of that into memory! Let's have a peek at the data.
The only two required pieces of a SQL query are SELECT and FROM. SELECT picks the columns you want by name, and FROM picks the table you want to look at. Both can be shortened with an alias in the query, and sometimes it is necessary to attach the table name to the column. Below I do the same thing in two different ways. Do you see a difference in the output?
%%bigquery --project pic-math
SELECT
AVG(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
 | f0_ |
---|---|
0 | 144.279646 |
%%bigquery --project pic-math
SELECT AVG(table.sale_dollars) as average_sale_dollars
FROM `bigquery-public-data.iowa_liquor_sales.sales` as table
 | average_sale_dollars |
---|---|
0 | 144.279646 |
See any difference? The two results are identical, but the alias gives the output column a readable name (average_sale_dollars) instead of the default f0_. You should also be asking yourself why it would be advantageous to name your tables. Well, we will see shortly that joining tables (remember relational databases?) is going to be an important task! Sometimes some of the info we want will be in one table and some of it in another.
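To make that concrete, here is a hedged sketch of what a join with table aliases might look like. The stores table and its opened_date column are hypothetical (this public dataset only exposes the sales table); the point is just that short aliases like s and st keep the SELECT list and the ON clause readable.
%%bigquery --project pic-math
-- Hypothetical sketch: `your-project.your_dataset.stores` does not exist;
-- it only illustrates how table aliases keep a join readable.
SELECT
  s.store_name,
  st.opened_date
FROM `bigquery-public-data.iowa_liquor_sales.sales` AS s
JOIN `your-project.your_dataset.stores` AS st
  ON s.store_number = st.store_number
LIMIT 5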
Here is a command that will allow you to peek at the whole table (like head in pandas!). The * gives you all the columns and LIMIT returns only as many rows as you specify. Note that without an ORDER BY, SQL makes no promise about which rows come back or in what order.
%%bigquery --project pic-math
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5
 | invoice_and_item_number | date | store_number | store_name | address | city | zip_code | store_location | county_number | county | ... | item_number | item_description | pack | bottle_volume_ml | state_bottle_cost | state_bottle_retail | bottles_sold | sale_dollars | volume_sold_liters | volume_sold_gallons |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | INV-23344200024 | 2019-11-19 | 4452 | SELECT MART GORDON DR | 2825 GORDON DR | SIOUX CITY | 51105 | POINT(-96.372014 42.489632) | 97 | WOODBURY | ... | 100510 | RUMCHATA MINICHATAS HOLIDAY MUG | 12 | 25 | 5.00 | 7.50 | 1 | 7.50 | 0.02 | 0.00 |
1 | INV-22726700022 | 2019-10-22 | 5504 | NEIGHBORHOOD MART | 2102 LAFAYETTE ST | WATERLOO | 50703 | POINT(-92.313923 42.489893) | 7 | BLACK HAWK | ... | 100794 | RUMCHATA MINICHATAS SUMMER TUMBLER VAP | 12 | 25 | 5.00 | 7.50 | 1 | 7.50 | 0.02 | 0.00 |
2 | INV-40707200098 | 2021-10-05 | 5423 | STAMMER LIQUOR CORP | 615 2ND AVE | SHELDON | 51201.0 | None | 71 | O'BRIEN | ... | 65259 | JAGERMEISTER LIQUEUR MINI MEISTERS | 12 | 20 | 4.93 | 7.40 | 1 | 7.40 | 0.02 | 0.00 |
3 | INV-61319400030 | 2023-08-16 | 5822 | CASEY'S GENERAL STORE #2836 / MONROE | 509 S MONROE ST | MONROE | 50170.0 | POINT(-93.10249604 41.51755801) | None | JASPER | ... | 65259 | JAGERMEISTER LIQUEUR MINI MEISTERS | 12 | 20 | 5.63 | 8.45 | 1 | 8.45 | 0.02 | 0.00 |
4 | INV-33181800087 | 2021-01-04 | 2614 | HY-VEE #3 FOOD & DRUGSTORE / DAVENPORT | 1823 E KIMBERLY RD | DAVENPORT | 52807.0 | POINT(-90.548919 41.556781) | 82 | SCOTT | ... | 100413 | FIREBALL CINNAMON WHISKEY PARTY BUCKET | 1 | 50 | 51.60 | 77.40 | 1 | 77.40 | 0.05 | 0.01 |
5 rows × 24 columns
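Since rows come back in no particular order, ORDER BY is the clause to reach for when you do care. Here is a quick sketch against the same table (I have not included its output here) that would pull the five largest single transactions by sale_dollars:
%%bigquery --project pic-math
SELECT
  date,
  store_name,
  sale_dollars
FROM `bigquery-public-data.iowa_liquor_sales.sales`
ORDER BY sale_dollars DESC
LIMIT 5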
Not sure how much we might use this, but if I want the query results in a pandas DataFrame, I can just add a name for it after the BigQuery project name in the cell magic (sketched below), or query with pandas_gbq directly, which is what the next cell does.
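Here is what the magic version might look like. I have not run this variant in this document, and the exact placement of the variable name relative to the flags can vary by library version, but the idea is that the magic stores the result in the named DataFrame (sales_head is just a name I made up) instead of only displaying it. After the cell runs, sales_head is an ordinary pandas DataFrame in the notebook's namespace.
%%bigquery --project pic-math sales_head
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5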
import pandas_gbq
# Use your own Google Cloud project ID here (mine is pic-math).
sql = """
SELECT
  city,
  store_name,
  SUM(sale_dollars) as total_sales  -- lifetime sales for each city/store pair
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null
GROUP BY city, store_name  -- one output row per city/store combination
"""
df = pandas_gbq.read_gbq(sql, project_id="pic-math")
Downloading: 100%|██████████|
df
 | city | store_name | total_sales |
---|---|---|---|
0 | WEST DES MOINES | HY-VEE FOOD AND DRUG (1888) / GRAND / WDM | 1080634.67 |
1 | CEDAR RAPIDS | CASEY'S GENERAL STORE #2782 / CEDAR RAPIDS | 172046.88 |
2 | SIOUX CITY | HY-VEE FOOD STORE / SIOUX CITY | 10342610.75 |
3 | CEDAR RAPIDS | QUICK STOP / CEDAR RAPIDS | 734653.22 |
4 | DES MOINES | FAREWAY STORES #900 / EUCLID | 3001919.51 |
... | ... | ... | ... |
3260 | DES MOINES | CASEY'S GENERAL STORE #2099 / CLIVE | 2244.50 |
3261 | DUNCOMBE | WILDCAT DISTILLING | 1296.00 |
3262 | WEST DES MOINES | THE FRESH MARKET / W DES MOINES | 16748.67 |
3263 | COLFAX | COLFAX TRUCK STOP LLC / COLFAX | 1125.03 |
3264 | RAYMOND | THOME ENTERPRISES LLC | 288.00 |
3265 rows × 3 columns
groupeddf = df.groupby('city')
# Note: .max() is column-wise, so store_name and total_sales may come from different rows
maxdf = groupeddf.max()
maxdf
city | store_name | total_sales
---|---|---
ACKLEY | KUM & GO #513 / ACKLEY | 256895.99 |
ADAIR | KUM & GO #76 / ADAIR | 474742.46 |
ADEL | WORLD LIQUOR & TOBACCO PLUS VAPORS / ADEL | 4036269.71 |
AFTON | CASEY'S GENERAL STORE #2626 / AFTON | 528773.74 |
AINSWORTH | AINSWORTH FOUR CORNERS / AINSWORTH | 77552.47 |
... | ... | ... |
WINTHROP | SPEEDE SHOP / WINTHROP | 444054.97 |
WOODBINE | FOODLAND SUPER MARKETS / WOODBINE | 864858.43 |
WOODWARD | CASEY'S GENERAL STORE #2551 / WOODWARD | 307674.78 |
WYOMING | CASEY'S GENERAL STORE #1126 / WYOMING | 108172.59 |
ZWINGLE | SNK GAS & FOOD LLC | 293386.31 |
486 rows × 2 columns
mindf = groupeddf.min()
mindf
city | store_name | total_sales
---|---|---
ACKLEY | ACKLEY SUPER FOODS | 126629.70 |
ADAIR | CASEY'S GENERAL STORE #2521 / ADAIR | 269633.10 |
ADEL | CASEY'S GENERAL STORE #1680 / ADEL | 18870.67 |
AFTON | CASEY'S GENERAL STORE #2626 / AFTON | 528773.74 |
AINSWORTH | AINSWORTH FOUR CORNERS / AINSWORTH | 77552.47 |
... | ... | ... |
WINTHROP | MCELROY'S FOOD / WINTHROP | 13408.22 |
WOODBINE | CASEY'S GENERAL STORE #2665 / WOODBINE | 6439.33 |
WOODWARD | CASEY'S GENERAL STORE #2551 / WOODWA | 136097.63 |
WYOMING | CASEY'S GENERAL STORE #1126 / WYOMING | 108172.59 |
ZWINGLE | OTTER CREEK STOP / ZWINGLE | 76944.05 |
486 rows × 2 columns
I am clearly just showing off now. I have left more along this line at the bottom but let’s get your assignment up!
I noticed a few things while attempting this. While I think I have a solution, it is clearly not the best. Zwingle and ZWINGLE are probably the same town, and SNK may just be the only store in town, but the fact that it appears four times in my lists is disappointing!
maxdf.sort_values('total_sales',ascending=False)
city | store_name | total_sales
---|---|---
DES MOINES | WORLD LIQUOR & TOBACCO + VAPORS / DES MOINES | 1.294751e+08 |
IOWA CITY | WALGREENS #05077 / IOWA CITY | 4.876863e+07 |
CEDAR RAPIDS | YR DOLLAR STOP | 4.252915e+07 |
WINDSOR HEIGHTS | WINE AND SPIRITS GALLERY | 3.638091e+07 |
BETTENDORF | WALGREENS #06553 / BETTENDORF | 3.092285e+07 |
... | ... | ... |
WILLIAMS | JAY BROS / WILLIAMS | 5.741640e+03 |
DOUDS | DOUDS KWIK STOP / DOUDS | 4.163310e+03 |
DUNCOMBE | WILDCAT DISTILLING / DUNCOMBE | 1.296000e+03 |
ALVORD | DIRECT LIQUOR / ALVORD | 7.659100e+02 |
COLORADO SPRINGS | DASH EVENTS LLC | 3.114000e+01 |
486 rows × 2 columns
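For what it's worth, here is a sketch of one way to patch the pandas version (using the df built above; I did not run this for the document): keep the single highest-grossing row per city instead of taking column-wise maxima, so the store name and the dollar figure always belong to the same store.
# Keep the highest-grossing store row in each city, so store_name and
# total_sales stay paired (unlike groupby(...).max(), which is column-wise).
best_store_per_city = (
    df.sort_values('total_sales', ascending=False)
      .drop_duplicates(subset='city', keep='first')
      .set_index('city')
      .sort_index()
)
best_store_per_city.head()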
I want to try the extra challenge and do it all in SQL. I'll try to find which store had the most sales on each date! The trick is the window function: RANK() OVER (PARTITION BY date ORDER BY SUM(sale_dollars) DESC) ranks each store's daily total within its own date, so keeping only rank 1 gives the top store per day.
%%bigquery --project pic-math
WITH bestday as (
SELECT
date,
store_name,
city,
SUM(sale_dollars) as total_sales,
RANK() over (PARTITION BY date ORDER BY SUM(sale_dollars) desc) as top_sales_rank
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE city is not null
GROUP BY date, store_name, city
)
SELECT
date,
store_name,
city,
total_sales
FROM bestday
WHERE top_sales_rank = 1
ORDER BY date
 | date | store_name | city | total_sales |
---|---|---|---|---|
0 | 2012-01-03 | SAM'S CLUB 8238 / DAVENPORT | DAVENPORT | 51350.70 |
1 | 2012-01-04 | HY-VEE FOOD STORE / SHELDON | SHELDON | 73296.69 |
2 | 2012-01-05 | HY-VEE #3 / BDI / DES MOINES | DES MOINES | 74085.23 |
3 | 2012-01-09 | CENTRAL CITY LIQUOR, INC. | DES MOINES | 60435.09 |
4 | 2012-01-10 | SAM'S CLUB 8238 / DAVENPORT | DAVENPORT | 33423.60 |
... | ... | ... | ... | ... |
3097 | 2023-12-26 | HY-VEE #3 / BDI / DES MOINES | DES MOINES | 188329.84 |
3098 | 2023-12-27 | COSTCO WHOLESALE #788 / WDM | WEST DES MOINES | 112209.65 |
3099 | 2023-12-28 | MISSISSIPPI RIVER DISTILLERY - LECLAIRE (ET) | LECLAIRE | 184514.04 |
3100 | 2023-12-29 | HY-VEE #3 / BDI / DES MOINES | DES MOINES | 114838.96 |
3101 | 2023-12-30 | ANOTHER ROUND / DEWITT | DEWITT | 73238.16 |
3102 rows × 4 columns
Your Turn#
Start a notebook and get BigQuery working. Feel free to reuse the authentication code at the top, but you will need to create your own project. You should be able to follow the links at the top AND not have to enter a credit card.
Navigate to the dataset ‘bigquery-public-data.austin_bikeshare.bikeshare_trips’
Compute how many entries are in the dataset
Compute the longest trip from ‘duration_minutes’
Compute the average time for a trip