SQL Essentials#
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
Let’s start with the basics. I’ll continue to work with the liquor store data.
%%bigquery --project pic-math
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5
invoice_and_item_number | date | store_number | store_name | address | city | zip_code | store_location | county_number | county | ... | item_number | item_description | pack | bottle_volume_ml | state_bottle_cost | state_bottle_retail | bottles_sold | sale_dollars | volume_sold_liters | volume_sold_gallons | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | INV-23344200024 | 2019-11-19 | 4452 | SELECT MART GORDON DR | 2825 GORDON DR | SIOUX CITY | 51105 | POINT(-96.372014 42.489632) | 97 | WOODBURY | ... | 100510 | RUMCHATA MINICHATAS HOLIDAY MUG | 12 | 25 | 5.00 | 7.50 | 1 | 7.50 | 0.02 | 0.00 |
1 | INV-22726700022 | 2019-10-22 | 5504 | NEIGHBORHOOD MART | 2102 LAFAYETTE ST | WATERLOO | 50703 | POINT(-92.313923 42.489893) | 7 | BLACK HAWK | ... | 100794 | RUMCHATA MINICHATAS SUMMER TUMBLER VAP | 12 | 25 | 5.00 | 7.50 | 1 | 7.50 | 0.02 | 0.00 |
2 | INV-40707200098 | 2021-10-05 | 5423 | STAMMER LIQUOR CORP | 615 2ND AVE | SHELDON | 51201.0 | None | 71 | O'BRIEN | ... | 65259 | JAGERMEISTER LIQUEUR MINI MEISTERS | 12 | 20 | 4.93 | 7.40 | 1 | 7.40 | 0.02 | 0.00 |
3 | INV-61319400030 | 2023-08-16 | 5822 | CASEY'S GENERAL STORE #2836 / MONROE | 509 S MONROE ST | MONROE | 50170.0 | POINT(-93.10249604 41.51755801) | None | JASPER | ... | 65259 | JAGERMEISTER LIQUEUR MINI MEISTERS | 12 | 20 | 5.63 | 8.45 | 1 | 8.45 | 0.02 | 0.00 |
4 | INV-33181800087 | 2021-01-04 | 2614 | HY-VEE #3 FOOD & DRUGSTORE / DAVENPORT | 1823 E KIMBERLY RD | DAVENPORT | 52807.0 | POINT(-90.548919 41.556781) | 82 | SCOTT | ... | 100413 | FIREBALL CINNAMON WHISKEY PARTY BUCKET | 1 | 50 | 51.60 | 77.40 | 1 | 77.40 | 0.05 | 0.01 |
5 rows × 24 columns
Why did I run that command? Well it gives me an idea of what is in the table to reference and think about what questions I might ask! Let’s see how many gallons of liqour have been sold.
%%bigquery --project pic-math
SELECT SUM(volume_sold_gallons) as Total_Gallons_of_Liquor
FROM `bigquery-public-data.iowa_liquor_sales.sales`
Total_Gallons_of_Liquor | |
---|---|
0 | 68071897.63 |
Let’s do a small conversion just to see what that means. An olympic pool holds 660 000 gallons so
(5.560763*10**7)/660000
84.25398484848485
About 84 swimming pools of liquor in Iowa! Fun times…
Let’s make it more complicated. Let’s see what the dollars per gallon is on the full dataset.
%%bigquery --project pic-math
SELECT SUM(sale_dollars)/SUM(volume_sold_gallons) as Total_Dollars_Per_Gallon
FROM `bigquery-public-data.iowa_liquor_sales.sales`
Total_Dollars_Per_Gallon | |
---|---|
0 | 59.720321 |
Okay not terribly interesting. Let’s take that quesiton and add too it. Let’s create a column that is dollars per gallon of liquor.
%%bigquery --project pic-math
SELECT sale_dollars/volume_sold_gallons as Dollars_Per_Gallon
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0
LIMIT 5
Dollars_Per_Gallon | |
---|---|
0 | 7740.0 |
1 | 7740.0 |
2 | 7740.0 |
3 | 7740.0 |
4 | 7740.0 |
I have added a few new things here. The WHERE
clause allows me to restrict what I consider. You can combine several of the statements logically
%%bigquery --project pic-math
SELECT sale_dollars/volume_sold_gallons as Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'COFFEE LIQUEURS'
Dollars_Per_Gallon | item_description | |
---|---|---|
0 | 792.000000 | KAHLUA COFFEE MINI |
1 | 792.000000 | KAHLUA COFFEE MINI |
2 | 792.000000 | KAHLUA COFFEE MINI |
3 | 112.333333 | KAHLUA COFFEE |
4 | 108.222222 | KAHLUA COFFEE |
... | ... | ... |
177721 | 86.500000 | KAHLUA COFFEE LIQUEUR |
177722 | 86.500000 | KAHLUA COFFEE LIQUEUR |
177723 | 86.500000 | KAHLUA COFFEE LIQUEUR |
177724 | 86.500000 | KAHLUA COFFEE LIQUEUR |
177725 | 86.500000 | KAHLUA COFFEE LIQUEUR |
177726 rows × 2 columns
I added the item_description
so that I could see which ones were different. It is not utilized the the analysis yet. Let’s include it by getting the average price of coffee liqueurs based on the description. To do this I’ll add the GROUP BY
command
%%bigquery --project pic-math
SELECT AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'COFFEE LIQUEURS'
GROUP BY item_description
Average_Dollars_Per_Gallon | item_description | |
---|---|---|
0 | 98.042729 | KAHLUA COFFEE |
1 | 79.648073 | KAHLUA COFFEE LIQUEUR MINI |
2 | 93.760237 | KAHLUA ESPECIAL |
3 | 89.804451 | KAHLUA COFFEE LIQUEUR |
4 | 38.196790 | CHILA COFFEE LIQUEUR |
5 | 63.064841 | KARIMBA COFFEE LIQUEUR |
6 | 16.996845 | CONCIERE COFFEE LIQUEUR |
7 | 99.494591 | KAHLUA VANILLA |
8 | 70.697000 | TEQUILA ROSE W/2 SHOT GLASSES HA |
9 | 48.306810 | KAMORA COFFEE LIQUEUR |
10 | 110.811697 | IRISHMAN IRISH COFFEE |
11 | 109.122690 | PATRON XO CAFE |
12 | 154.526005 | IOWA COFFEE COMPANY LIQUEUR |
13 | 35.378913 | SABROSO |
14 | 88.089693 | KAHLUA MOCHA LIQUEUR |
15 | 99.575605 | KAHLUA MIDNIGHT |
16 | 134.012833 | MR. BLACK COLD BREW COFFEE LIQUEUR |
17 | 126.801440 | LUXARDO ESPRESSO ITALIAN COFFEE LIQUEUR |
18 | 68.273560 | GRIND ESPRESSO SHOT |
19 | 90.549048 | KAHLUA FRENCH VANILLA LIQUEUR DISCO |
20 | 114.019258 | CANTERA NEGRA CAFE |
21 | 96.567398 | KAHLUA SALTED CARAMEL |
22 | 42.167376 | COPA DE ORO MEXICAN COFFEE |
23 | 114.796784 | MOZART CHOCOLATE COFFEE CREAM LIQUEUR |
24 | 72.934872 | CAROLANS COLD BREW |
25 | 90.036168 | AVION ESPRESSO |
26 | 138.051271 | ORIGINAL SECRET FAMILY RECIPE - A COFFEE LIQUEUR |
27 | 247.964566 | CCD COFFEE LIQUEUR |
28 | 1136.210526 | KAHLUA CHILI CHOCOLATE |
29 | 830.290931 | KAHLUA COFFEE MINI |
30 | 103.666408 | J. RIEGER & CO. CAFFE AMARO |
31 | 98.117663 | TIA MARIA COFFEE LIQUEUR |
32 | 109.663866 | PATRON XO CAFE DARK COCOA |
33 | 181.670886 | JUMPING GOAT COLD BREWED COFFEE LIQUEUR WITH N... |
34 | 974.160000 | KAMORA COFFEE LIQUEUR MINI |
35 | 38.087056 | CAFFE LOLITA COFFEE LIQUEUR |
36 | 88.947355 | KAHLUA FRENCH VANILLA LIQUEUR |
37 | 112.367730 | CAZADORES CAFE |
38 | 41.799355 | KAPALI COFFEE LIQUEUR |
39 | 163.065954 | PATRON XO CAFE MINI |
40 | 90.327941 | KAHLUA PEPPERMINT MOCHA |
41 | 102.377486 | XO CAFE DARK COCOA |
42 | 65.656795 | DR MCGILLICUDDYS COFFEE |
43 | 36.908517 | HIRAM WALKER COFFEE BRANDY |
44 | 181.670886 | JUMPING GOAT COLD BREWED COFFEE LIQUEUR BLACK ... |
Okay but we pointed out earlier that there is no order… Let’s force an order on it
%%bigquery --project pic-math
SELECT AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon, item_description
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and category_name = 'COFFEE LIQUEURS'
GROUP BY item_description
ORDER BY Average_Dollars_Per_Gallon DESC
Average_Dollars_Per_Gallon | item_description | |
---|---|---|
0 | 1136.210526 | KAHLUA CHILI CHOCOLATE |
1 | 974.160000 | KAMORA COFFEE LIQUEUR MINI |
2 | 830.290931 | KAHLUA COFFEE MINI |
3 | 247.964566 | CCD COFFEE LIQUEUR |
4 | 181.670886 | JUMPING GOAT COLD BREWED COFFEE LIQUEUR WITH N... |
5 | 181.670886 | JUMPING GOAT COLD BREWED COFFEE LIQUEUR BLACK ... |
6 | 163.065954 | PATRON XO CAFE MINI |
7 | 154.526005 | IOWA COFFEE COMPANY LIQUEUR |
8 | 138.051271 | ORIGINAL SECRET FAMILY RECIPE - A COFFEE LIQUEUR |
9 | 134.012833 | MR. BLACK COLD BREW COFFEE LIQUEUR |
10 | 126.801440 | LUXARDO ESPRESSO ITALIAN COFFEE LIQUEUR |
11 | 114.796784 | MOZART CHOCOLATE COFFEE CREAM LIQUEUR |
12 | 114.019258 | CANTERA NEGRA CAFE |
13 | 112.367730 | CAZADORES CAFE |
14 | 110.811697 | IRISHMAN IRISH COFFEE |
15 | 109.663866 | PATRON XO CAFE DARK COCOA |
16 | 109.122690 | PATRON XO CAFE |
17 | 103.666408 | J. RIEGER & CO. CAFFE AMARO |
18 | 102.377486 | XO CAFE DARK COCOA |
19 | 99.575605 | KAHLUA MIDNIGHT |
20 | 99.494591 | KAHLUA VANILLA |
21 | 98.117663 | TIA MARIA COFFEE LIQUEUR |
22 | 98.042729 | KAHLUA COFFEE |
23 | 96.567398 | KAHLUA SALTED CARAMEL |
24 | 93.760237 | KAHLUA ESPECIAL |
25 | 90.549048 | KAHLUA FRENCH VANILLA LIQUEUR DISCO |
26 | 90.327941 | KAHLUA PEPPERMINT MOCHA |
27 | 90.036168 | AVION ESPRESSO |
28 | 89.804451 | KAHLUA COFFEE LIQUEUR |
29 | 88.947355 | KAHLUA FRENCH VANILLA LIQUEUR |
30 | 88.089693 | KAHLUA MOCHA LIQUEUR |
31 | 79.648073 | KAHLUA COFFEE LIQUEUR MINI |
32 | 72.934872 | CAROLANS COLD BREW |
33 | 70.697000 | TEQUILA ROSE W/2 SHOT GLASSES HA |
34 | 68.273560 | GRIND ESPRESSO SHOT |
35 | 65.656795 | DR MCGILLICUDDYS COFFEE |
36 | 63.064841 | KARIMBA COFFEE LIQUEUR |
37 | 48.306810 | KAMORA COFFEE LIQUEUR |
38 | 42.167376 | COPA DE ORO MEXICAN COFFEE |
39 | 41.799355 | KAPALI COFFEE LIQUEUR |
40 | 38.196790 | CHILA COFFEE LIQUEUR |
41 | 38.087056 | CAFFE LOLITA COFFEE LIQUEUR |
42 | 36.908517 | HIRAM WALKER COFFEE BRANDY |
43 | 35.378913 | SABROSO |
44 | 16.996845 | CONCIERE COFFEE LIQUEUR |
Let’s keep going down the rabbit hole here. What if we want to rank them? There are lots of ways ROW_NUMBER
, RANK
and DENSE_RANK
. I find them difficult to use because they require lots of other inputs.
The general call is something like
ROW_NUMBER() OVER(PARTITION BY ________ ORDER BY _________)
Partition is like grouping. I’ll add another liquor to use it
%%bigquery --project pic-math
SELECT
AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon,
item_description,
ROW_NUMBER() OVER(
PARTITION BY category_name
ORDER BY item_description) row_num
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and (category_name = 'COFFEE LIQUEURS')
GROUP BY item_description, category_name
ORDER BY Average_Dollars_Per_Gallon DESC
Average_Dollars_Per_Gallon | item_description | row_num | |
---|---|---|---|
0 | 1136.210526 | KAHLUA CHILI CHOCOLATE | 18 |
1 | 974.160000 | KAMORA COFFEE LIQUEUR MINI | 32 |
2 | 830.290931 | KAHLUA COFFEE MINI | 22 |
3 | 247.964566 | CCD COFFEE LIQUEUR | 6 |
4 | 181.670886 | JUMPING GOAT COLD BREWED COFFEE LIQUEUR BLACK ... | 16 |
5 | 181.670886 | JUMPING GOAT COLD BREWED COFFEE LIQUEUR WITH N... | 17 |
6 | 163.065954 | PATRON XO CAFE MINI | 41 |
7 | 154.526005 | IOWA COFFEE COMPANY LIQUEUR | 13 |
8 | 138.051271 | ORIGINAL SECRET FAMILY RECIPE - A COFFEE LIQUEUR | 38 |
9 | 134.012833 | MR. BLACK COLD BREW COFFEE LIQUEUR | 37 |
10 | 126.801440 | LUXARDO ESPRESSO ITALIAN COFFEE LIQUEUR | 35 |
11 | 114.796784 | MOZART CHOCOLATE COFFEE CREAM LIQUEUR | 36 |
12 | 114.019258 | CANTERA NEGRA CAFE | 3 |
13 | 112.367730 | CAZADORES CAFE | 5 |
14 | 110.811697 | IRISHMAN IRISH COFFEE | 14 |
15 | 109.663866 | PATRON XO CAFE DARK COCOA | 40 |
16 | 109.122690 | PATRON XO CAFE | 39 |
17 | 103.666408 | J. RIEGER & CO. CAFFE AMARO | 15 |
18 | 102.377486 | XO CAFE DARK COCOA | 45 |
19 | 99.575605 | KAHLUA MIDNIGHT | 26 |
20 | 99.494591 | KAHLUA VANILLA | 30 |
21 | 98.117663 | TIA MARIA COFFEE LIQUEUR | 44 |
22 | 98.042729 | KAHLUA COFFEE | 19 |
23 | 96.567398 | KAHLUA SALTED CARAMEL | 29 |
24 | 93.760237 | KAHLUA ESPECIAL | 23 |
25 | 90.549048 | KAHLUA FRENCH VANILLA LIQUEUR DISCO | 25 |
26 | 90.327941 | KAHLUA PEPPERMINT MOCHA | 28 |
27 | 90.036168 | AVION ESPRESSO | 1 |
28 | 89.804451 | KAHLUA COFFEE LIQUEUR | 20 |
29 | 88.947355 | KAHLUA FRENCH VANILLA LIQUEUR | 24 |
30 | 88.089693 | KAHLUA MOCHA LIQUEUR | 27 |
31 | 79.648073 | KAHLUA COFFEE LIQUEUR MINI | 21 |
32 | 72.934872 | CAROLANS COLD BREW | 4 |
33 | 70.697000 | TEQUILA ROSE W/2 SHOT GLASSES HA | 43 |
34 | 68.273560 | GRIND ESPRESSO SHOT | 11 |
35 | 65.656795 | DR MCGILLICUDDYS COFFEE | 10 |
36 | 63.064841 | KARIMBA COFFEE LIQUEUR | 34 |
37 | 48.306810 | KAMORA COFFEE LIQUEUR | 31 |
38 | 42.167376 | COPA DE ORO MEXICAN COFFEE | 9 |
39 | 41.799355 | KAPALI COFFEE LIQUEUR | 33 |
40 | 38.196790 | CHILA COFFEE LIQUEUR | 7 |
41 | 38.087056 | CAFFE LOLITA COFFEE LIQUEUR | 2 |
42 | 36.908517 | HIRAM WALKER COFFEE BRANDY | 12 |
43 | 35.378913 | SABROSO | 42 |
44 | 16.996845 | CONCIERE COFFEE LIQUEUR | 8 |
You should notice that the ROW_NUMBER
didn’t do what we needed. You will not be able to do the row nor rank on the column we created because it is not yet available to the SQL call. This leads to sub-processees. Let’s show one today and come back to it next class.
%%bigquery --project pic-math
WITH t as(
SELECT
AVG(sale_dollars/volume_sold_gallons) as Average_Dollars_Per_Gallon,
item_description, category_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE volume_sold_gallons != 0 and (category_name = 'COFFEE LIQUEURS' or category_name = 'IMPORTED VODKAS')
GROUP BY item_description, category_name
ORDER BY Average_Dollars_Per_Gallon DESC
)
SELECT *,
RANK() OVER(
PARTITION BY category_name
ORDER BY Average_Dollars_Per_Gallon) rk_num
FROM t
ORDER BY rk_num
Average_Dollars_Per_Gallon | item_description | category_name | rk_num | |
---|---|---|---|---|
0 | 16.996845 | CONCIERE COFFEE LIQUEUR | COFFEE LIQUEURS | 1 |
1 | 16.573795 | PURUS ORGANIC VODKA MINI | IMPORTED VODKAS | 1 |
2 | 35.378913 | SABROSO | COFFEE LIQUEURS | 2 |
3 | 23.316243 | SOOH RELSKA 80 PROOF VODKA | IMPORTED VODKAS | 2 |
4 | 36.908517 | HIRAM WALKER COFFEE BRANDY | COFFEE LIQUEURS | 3 |
... | ... | ... | ... | ... |
346 | 2939.445000 | GREY GOOSE ESSENCES STRAWBERRY & LEMONGRASS MINI | IMPORTED VODKAS | 302 |
347 | 2945.650394 | GREY GOOSE ESSENCES WATERMELON & BASIL MINI | IMPORTED VODKAS | 303 |
348 | 9000.000000 | OUTERSPACE VODKA MINI | IMPORTED VODKAS | 304 |
349 | 16996.723367 | E.T.51 PREMIUM VODKA MINI | IMPORTED VODKAS | 305 |
350 | 17136.000000 | E.T. 51 PREMIUM VODKA MINI | IMPORTED VODKAS | 306 |
351 rows × 4 columns
We will come back to this but a nice taste of some of the really powerful aspects of SQL!
Dealing With Strings#
I wanted to find why ‘vodka’ wasn’t in the table. Well it was but with other qualifiers. I needed to use a string command LIKE
I also looked at CONTAINS
but didn’t get it to work.
%%bigquery --project pic-math
SELECT category_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE category_name LIKE '%VODKA%'
GROUP BY category_name
category_name | |
---|---|
0 | VODKA FLAVORED |
1 | IMPORTED VODKAS |
2 | IMPORTED FLAVORED VODKA |
3 | VODKA 80 PROOF |
4 | 100 PROOF VODKA |
5 | IMPORTED VODKA - MISC |
6 | OTHER PROOF VODKA |
7 | AMERICAN FLAVORED VODKA |
8 | AMERICAN VODKAS |
9 | IMPORTED VODKA - CHERRY |
10 | LOW PROOF VODKA |
Your Turn#
Assignement for today
Start a notebook getting BigQuery to work. Feel free to use the authentication atop.
Navigate to the dataset ‘austin_bikeshare.bikeshare_trips’
Compute the average time for a trip based on starting point
Compute how many trips start at each starting point.