SQL Sub-Queries#
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
I am still gonna keep playing around with liquor data. I want to add some complications into our lives and do somethings that will require multiple tables to be generated. We will do this all on the SQL server so there will be a sub-Query that we might not see. Let’s see it in action!
First the table to remind us what is in it.
%%bigquery --project pic-math
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5
invoice_and_item_number | date | store_number | store_name | address | city | zip_code | store_location | county_number | county | category | category_name | vendor_number | vendor_name | item_number | item_description | pack | bottle_volume_ml | state_bottle_cost | state_bottle_retail | bottles_sold | sale_dollars | volume_sold_liters | volume_sold_gallons | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | INV-37596300067 | 2021-06-17 | 5102 | Wilkie Liquors | 724 1st St SE | Mount Vernon | 52314.0 | POINT (-91.410401 41.918328) | 57 | LINN | 1092100.0 | Imported Distilled Spirit Specialty | 434 | LUXCO INC | 75087 | Juarez Gold DSS | 12 | 1000 | 5.00 | 7.50 | 660 | 4950.00 | 660.0 | 174.35 |
1 | INV-19930800010 | 2019-06-11 | 3784 | Hartig Drug #14 / Independence | 200 1st St East | Independence | 50644 | POINT (-91.893016 42.468721) | 10 | BUCHANAN | 1092100.0 | Imported Distilled Spirit Specialty | 434 | LUXCO INC | 75087 | Juarez Gold Dss | 12 | 1000 | 5.00 | 7.50 | 48 | 360.00 | 48.0 | 12.68 |
2 | INV-13427700191 | 2018-07-24 | 2572 | Hy-Vee Food Store / Cedar Falls | 6301 University | Cedar Falls | 50613 | POINT (-92.435236 42.512789) | 7 | BLACK HAWK | 1081500.0 | Triple Sec | 421 | SAZERAC COMPANY INC | 86637 | Tortilla Triple Sec | 12 | 1000 | 3.01 | 4.52 | 4 | 18.08 | 4.0 | 1.06 |
3 | S06959800006 | 2012-08-07 | 4029 | The Store | 204 S MAIN ST | COLESBURG | 52035 | POINT (-91.368032 42.958986) | 28 | Delaware | 1081312.0 | BUTTERSCOTCH SCHNAPPS | 65 | Jim Beam Brands | 82787 | Dekuyper Buttershots | 12 | 1000 | 7.35 | 11.02 | 5 | 55.10 | 5.0 | 1.32 |
4 | S13503000052 | 2013-07-23 | 3705 | Liquor Locker | 507 1ST AVE #100 | ROCK RAPIDS | 51246 | None | 60 | Lyon | 1042100.0 | IMPORTED DRY GINS | 260 | Diageo Americas | 28867 | Tanqueray Gin | 12 | 1000 | 14.99 | 22.48 | 4 | 89.92 | 4.0 | 1.06 |
Let’s ask a simple question, what is the most expensive bottle is in each category.
%%bigquery --project pic-math
SELECT category_name, MAX(state_bottle_retail) as max_state_bottle_retail
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name
category_name | max_state_bottle_retail | |
---|---|---|
0 | Single Barrel Bourbon Whiskies | 192.30 |
1 | Triple Sec | 18.50 |
2 | GRAPE SCHNAPPS | 11.81 |
3 | Single Malt Scotch | 1275.00 |
4 | Coffee Liqueurs | 39.72 |
... | ... | ... |
131 | Temporary & Specialty Packages | 13.11 |
132 | IMPORTED VODKA - CHERRY | 10.58 |
133 | Imported Whiskies | 122.76 |
134 | American Whiskies | 28.50 |
135 | Delisted / Special Order Items | 27.75 |
136 rows × 2 columns
This was not too hard. Let’s be mean though and ask for the second most expensive bottle of “DISTILLED SPIRITS SPECIALTY”. Here is where the sub-query is going to come into play. We don’t want to include the most expensive so instead we remove those from the computation.
%%bigquery --project pic-math
SELECT MAX(state_bottle_retail) as max_state_bottle_retail
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE category_name = "DISTILLED SPIRITS SPECIALTY"
max_state_bottle_retail | |
---|---|
0 | 112.01 |
%%bigquery --project pic-math
SELECT MAX(state_bottle_retail) as second_most_expensive
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE state_bottle_retail != (SELECT MAX(state_bottle_retail) as max_state_bottle_retail FROM `bigquery-public-data.iowa_liquor_sales.sales` WHERE category_name = "DISTILLED SPIRITS SPECIALTY")
AND category_name = "DISTILLED SPIRITS SPECIALTY"
second_most_expensive | |
---|---|
0 | 96.51 |
BTW, this is not nessecarily the way I would have done this. (Rank = 2 might have worked well (still would have required a sub-query though!)
%%bigquery --project pic-math
WITH table as (
SELECT state_bottle_retail, RANK() OVER(ORDER BY state_bottle_retail DESC) as rank_by_price
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE category_name = "DISTILLED SPIRITS SPECIALTY"
GROUP BY state_bottle_retail
)
SELECT state_bottle_retail as second_highest_price
FROM table
WHERE rank_by_price = 2
ORDER BY rank_by_price DESC
second_highest_price | |
---|---|
0 | 96.51 |
Let’s try another. What if we want to find how many bottles in a category are over $100.
%%bigquery --project pic-math
SELECT category_name, COUNTIF(state_bottle_retail>100) as number_bottles_over_benjamin
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description
Order BY number_bottles_over_benjamin DESC
category_name | number_bottles_over_benjamin | |
---|---|---|
0 | Scotch Whiskies | 2552 |
1 | SCOTCH WHISKIES | 1897 |
2 | 100% Agave Tequila | 1579 |
3 | SINGLE MALT SCOTCH | 1119 |
4 | Imported Brandies | 1090 |
... | ... | ... |
13825 | Straight Bourbon Whiskies | 0 |
13826 | American Cordials & Liqueur | 0 |
13827 | American Cordials & Liqueur | 0 |
13828 | MISC. AMERICAN CORDIALS & LIQUEURS | 0 |
13829 | MISC. IMPORTED CORDIALS & LIQUEURS | 0 |
13830 rows × 2 columns
I think the book I am working from pre-dates COUNTIF
. Let me show you a sub-query solution.
I’ll make a column asking if the bottle is over $100.
%%bigquery --project pic-math
SELECT *, (CASE WHEN state_bottle_retail >100 THEN 1 ELSE 0 END) as over_benjamin
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 10
invoice_and_item_number | date | store_number | store_name | address | city | zip_code | store_location | county_number | county | category | category_name | vendor_number | vendor_name | item_number | item_description | pack | bottle_volume_ml | state_bottle_cost | state_bottle_retail | bottles_sold | sale_dollars | volume_sold_liters | volume_sold_gallons | over_benjamin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | INV-37606800019 | 2021-06-18 | 2190 | Central City Liquor, Inc. | 1460 2ND AVE | Des Moines | 50314.0 | POINT (-93.619787 41.60566) | 77 | POLK | 1082100.0 | Imported Cordials & Liqueur | 619 | Campari America | 65127 | Grand Marnier Cordon Rouge | 6 | 1000 | 25.55 | 38.33 | 2 | 76.66 | 2.00 | 0.52 | 0 |
1 | INV-19821600132 | 2019-06-05 | 2629 | Hy-Vee Food Store #2 / Council Bluffs | 1745 Madison Ave | Council Bluffs | 51503 | POINT (-95.825137 41.242732) | 78 | POTTAWATTA | None | None | 55 | SAZERAC NORTH AMERICA | 84197 | 99 Pineapple Mini | 10 | 600 | 6.00 | 9.00 | 2 | 18.00 | 1.20 | 0.31 | 0 |
2 | INV-20266400001 | 2019-06-27 | 3713 | Wal-Mart 0810 / Mason City | 4151 4th St SW | Mason City | 50401 | POINT (-93.261648 43.148239) | 17 | CERRO GORD | 1062300.0 | Aged Dark Rum | 482 | Sovereign Brands, LLC | 64529 | Bumbu Rum | 6 | 750 | 18.50 | 27.75 | 18 | 499.50 | 13.50 | 3.56 | 0 |
3 | INV-13387900079 | 2018-07-23 | 4829 | Central City 2 | 1501 Michigan Ave | Des Moines | 50314 | POINT (-93.613739 41.60572) | 77 | POLK | 1081500.0 | Triple Sec | 434 | LUXCO INC | 86251 | Juarez Triple Sec | 12 | 1000 | 2.42 | 3.63 | 180 | 653.40 | 180.00 | 47.55 | 0 |
4 | INV-13998900081 | 2018-08-22 | 5257 | MAD Ave Quik Shop | 405, Madison Ave | Ottumwa | 52501 | None | 90 | WAPELLO | 1092100.0 | Imported Distilled Spirit Specialty | 55.0 | SAZERAC NORTH AMERICA | 76227 | Montezuma Blue | 12 | 1000 | 4.66 | 6.99 | 120 | 838.80 | 120.00 | 31.70 | 0 |
5 | S21846500003 | 2014-10-20 | 3618 | Wal-Mart 2716 / Cedar Rapids | 3601 29TH AVE SW | CEDAR RAPIDS | 52404 | POINT (-91.718042 41.949215) | 57 | Linn | 1701100.0 | DECANTERS & SPECIALTY PACKAGES | 65 | Jim Beam Brands | 789 | Canadian Club w/Glass | 6 | 750 | 7.88 | 11.82 | 30 | 354.60 | 22.50 | 5.94 | 0 |
6 | S26037800001 | 2015-06-09 | 3944 | Sam's Club 4973 / Dubuque | 4400 ASBURY RD | DUBUQUE | 52002 | POINT (-90.737821 42.515289) | 31 | Dubuque | 1701100.0 | DECANTERS & SPECIALTY PACKAGES | 65 | Jim Beam Brands | 19082 | Jim Beam Operation Homefront | 6 | 1750 | 18.37 | 27.56 | 210 | 5787.60 | 367.50 | 97.08 | 0 |
7 | INV-06908200123 | 2017-08-28 | 3612 | B and C Liquor / Maquoketa | 509 E Platt | Maquoketa | 52060.0 | POINT (-90.659692 42.06922000000001) | 49 | JACKSON | 1062100.0 | Gold Rum | 434 | LUXCO INC | 45248 | Paramount Gold Rum | 6 | 1750 | 7.84 | 11.76 | 3 | 35.28 | 5.25 | 1.39 | 0 |
8 | INV-20029700068 | 2019-06-17 | 2576 | Hy-Vee Wine and Spirits / Storm Lake | 1250 N Lake St | Storm Lake | 50588 | POINT (-95.200758 42.65318400000001) | 11 | BUENA VIST | 1062300.0 | Aged Dark Rum | 65 | Jim Beam Brands | 44487 | Cruzan Dark | 12 | 1000 | 8.39 | 12.59 | 2 | 25.18 | 2.00 | 0.52 | 0 |
9 | INV-24809400002 | 2020-01-27 | 3773 | Benz Distributing | 501 7th Ave SE | Cedar Rapids | 52401.0 | POINT (-91.659875 41.97574) | 57 | LINN | 1011800.0 | Iowa Distillery Whiskies | 125.0 | CEDAR RIDGE VINEYARDS LL | 928271 | Cedar Ridge Private Cask Iowa Bourbon Selectio... | 6 | 750 | 23.00 | 34.50 | 228 | 7866.00 | 171.00 | 45.17 | 0 |
The new column is at the end. I included the LIMIT
as to reduce computation time for displaying. I won’t use that in the subquery here.
%%bigquery --project pic-math
SELECT category_name, SUM(over_benjamin) as number_bottles_over_benjamin
FROM (SELECT *, (CASE WHEN state_bottle_retail >100 THEN 1 ELSE 0 END) as over_benjamin
FROM `bigquery-public-data.iowa_liquor_sales.sales`)
GROUP BY category_name, item_description
ORDER BY number_bottles_over_benjamin DESC
category_name | number_bottles_over_benjamin | |
---|---|---|
0 | Scotch Whiskies | 2552 |
1 | SCOTCH WHISKIES | 1897 |
2 | 100% Agave Tequila | 1579 |
3 | SINGLE MALT SCOTCH | 1119 |
4 | Imported Brandies | 1090 |
... | ... | ... |
13825 | American Cordials & Liqueur | 0 |
13826 | MISC. AMERICAN CORDIALS & LIQUEURS | 0 |
13827 | MISC. IMPORTED CORDIALS & LIQUEURS | 0 |
13828 | MISC. IMPORTED CORDIALS & LIQUEURS | 0 |
13829 | MISC. IMPORTED CORDIALS & LIQUEURS | 0 |
13830 rows × 2 columns
Well not really sure which is faster but you get to see the CASE
command too for adding a column. By the way this is sometimes called an indicator variable. When something happens (bottle over $100) you get 1 and 0 otherwise. This is an excellent technique that we will revisist again!
Let’s add some more. Let’s get the percentage of bottles over a benjamin.
%%bigquery --project pic-math
SELECT category_name, COUNTIF(state_bottle_retail>100) as number_bottles_over_benjamin, COUNT(*) as number_of_bottles
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description
Order BY number_bottles_over_benjamin DESC
category_name | number_bottles_over_benjamin | number_of_bottles | |
---|---|---|---|
0 | Scotch Whiskies | 2552 | 2553 |
1 | SCOTCH WHISKIES | 1897 | 1897 |
2 | 100% Agave Tequila | 1579 | 3069 |
3 | SINGLE MALT SCOTCH | 1119 | 1119 |
4 | Imported Brandies | 1090 | 1194 |
... | ... | ... | ... |
13825 | Straight Bourbon Whiskies | 0 | 2 |
13826 | American Cordials & Liqueur | 0 | 1 |
13827 | PUERTO RICO & VIRGIN ISLANDS RUM | 0 | 1 |
13828 | MISC. IMPORTED CORDIALS & LIQUEURS | 0 | 2 |
13829 | MISC. IMPORTED CORDIALS & LIQUEURS | 0 | 1 |
13830 rows × 3 columns
Now I cannot just divide those two columns because they don’t exist yet!
%%bigquery --project pic-math
SELECT category_name, number_bottles_over_benjamin, number_of_bottles, number_bottles_over_benjamin/number_of_bottles as percent_over_benjamin
FROM (SELECT category_name, COUNTIF(state_bottle_retail>100) as number_bottles_over_benjamin, COUNT(*) as number_of_bottles
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description)
GROUP BY category_name, number_bottles_over_benjamin, number_of_bottles
ORDER BY percent_over_benjamin DESC
category_name | number_bottles_over_benjamin | number_of_bottles | percent_over_benjamin | |
---|---|---|---|---|
0 | Straight Rye Whiskies | 199 | 199 | 1.0 |
1 | DECANTERS & SPECIALTY PACKAGES | 292 | 292 | 1.0 |
2 | Special Order Items | 33 | 33 | 1.0 |
3 | Special Order Items | 24 | 24 | 1.0 |
4 | Single Malt Scotch | 256 | 256 | 1.0 |
... | ... | ... | ... | ... |
8349 | Temporary & Specialty Packages | 0 | 4 | 0.0 |
8350 | Flavored Gin | 0 | 2 | 0.0 |
8351 | CHERRY BRANDIES | 0 | 1 | 0.0 |
8352 | CREME DE ALMOND | 0 | 1 | 0.0 |
8353 | Delisted / Special Order Items | 0 | 1 | 0.0 |
8354 rows × 4 columns
I got here and realized this is not really the number of bottles. How might you fix that so that this number represents the actual number of bottles sold?
Your Turn#
Using the dataset ‘austin_bikeshare.bikeshare_trips’ answer the following questions:
What was the second most popular starting station?
How many trips lasted over an hour and were a round trip (started and stopped at the same station)?
The following table will be helpful place to start for number 1.
%%bigquery --project pic-math
SELECT COUNT(*) as number_trips, start_station_name
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name
ORDER BY number_trips DESC
number_trips | start_station_name | |
---|---|---|
0 | 72799 | 21st & Speedway @PCL |
1 | 40635 | Riverside @ S. Lamar |
2 | 36520 | City Hall / Lavaca & 2nd |
3 | 35307 | 2nd & Congress |
4 | 34758 | Rainey St @ Cummings |
... | ... | ... |
188 | 4 | Marketing Event |
189 | 2 | Eeyore's 2018 |
190 | 1 | Stolen |
191 | 1 | Eeyore's 2017 |
192 | 1 | cesar Chavez/Congress |
193 rows × 2 columns
This table (without limit) may be useful for number 2)
%%bigquery --project pic-math
SELECT start_station_name, duration_minutes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_name = end_station_name
LIMIT 100
start_station_name | duration_minutes | |
---|---|---|
0 | Toomey Rd @ South Lamar | 31 |
1 | Toomey Rd @ South Lamar | 31 |
2 | Toomey Rd @ South Lamar | 30 |
3 | State Capitol @ 14th & Colorado | 19 |
4 | State Capitol @ 14th & Colorado | 17 |
... | ... | ... |
95 | Rainey @ River St | 87 |
96 | Zilker Park West | 11 |
97 | ACC - West & 12th Street | 17 |
98 | Rainey @ River St | 87 |
99 | Toomey Rd @ South Lamar | 0 |
100 rows × 2 columns