SQL Sub-Queries

Contents

Open In Colab

SQL Sub-Queries#

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated

I am still gonna keep playing around with liquor data. I want to add some complications into our lives and do somethings that will require multiple tables to be generated. We will do this all on the SQL server so there will be a sub-Query that we might not see. Let’s see it in action!

First the table to remind us what is in it.

%%bigquery --project pic-math
SELECT *
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 5
invoice_and_item_number date store_number store_name address city zip_code store_location county_number county category category_name vendor_number vendor_name item_number item_description pack bottle_volume_ml state_bottle_cost state_bottle_retail bottles_sold sale_dollars volume_sold_liters volume_sold_gallons
0 INV-37596300067 2021-06-17 5102 Wilkie Liquors 724 1st St SE Mount Vernon 52314.0 POINT (-91.410401 41.918328) 57 LINN 1092100.0 Imported Distilled Spirit Specialty 434 LUXCO INC 75087 Juarez Gold DSS 12 1000 5.00 7.50 660 4950.00 660.0 174.35
1 INV-19930800010 2019-06-11 3784 Hartig Drug #14 / Independence 200 1st St East Independence 50644 POINT (-91.893016 42.468721) 10 BUCHANAN 1092100.0 Imported Distilled Spirit Specialty 434 LUXCO INC 75087 Juarez Gold Dss 12 1000 5.00 7.50 48 360.00 48.0 12.68
2 INV-13427700191 2018-07-24 2572 Hy-Vee Food Store / Cedar Falls 6301 University Cedar Falls 50613 POINT (-92.435236 42.512789) 7 BLACK HAWK 1081500.0 Triple Sec 421 SAZERAC COMPANY INC 86637 Tortilla Triple Sec 12 1000 3.01 4.52 4 18.08 4.0 1.06
3 S06959800006 2012-08-07 4029 The Store 204 S MAIN ST COLESBURG 52035 POINT (-91.368032 42.958986) 28 Delaware 1081312.0 BUTTERSCOTCH SCHNAPPS 65 Jim Beam Brands 82787 Dekuyper Buttershots 12 1000 7.35 11.02 5 55.10 5.0 1.32
4 S13503000052 2013-07-23 3705 Liquor Locker 507 1ST AVE #100 ROCK RAPIDS 51246 None 60 Lyon 1042100.0 IMPORTED DRY GINS 260 Diageo Americas 28867 Tanqueray Gin 12 1000 14.99 22.48 4 89.92 4.0 1.06

Let’s ask a simple question, what is the most expensive bottle is in each category.

%%bigquery --project pic-math
SELECT category_name, MAX(state_bottle_retail) as max_state_bottle_retail
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name
category_name max_state_bottle_retail
0 Single Barrel Bourbon Whiskies 192.30
1 Triple Sec 18.50
2 GRAPE SCHNAPPS 11.81
3 Single Malt Scotch 1275.00
4 Coffee Liqueurs 39.72
... ... ...
131 Temporary & Specialty Packages 13.11
132 IMPORTED VODKA - CHERRY 10.58
133 Imported Whiskies 122.76
134 American Whiskies 28.50
135 Delisted / Special Order Items 27.75

136 rows × 2 columns

This was not too hard. Let’s be mean though and ask for the second most expensive bottle of “DISTILLED SPIRITS SPECIALTY”. Here is where the sub-query is going to come into play. We don’t want to include the most expensive so instead we remove those from the computation.

%%bigquery --project pic-math

SELECT MAX(state_bottle_retail) as max_state_bottle_retail
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE category_name = "DISTILLED SPIRITS SPECIALTY"

max_state_bottle_retail
0 112.01
%%bigquery --project pic-math
SELECT MAX(state_bottle_retail) as second_most_expensive
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE state_bottle_retail != (SELECT MAX(state_bottle_retail) as max_state_bottle_retail FROM `bigquery-public-data.iowa_liquor_sales.sales` WHERE category_name = "DISTILLED SPIRITS SPECIALTY")
      AND category_name = "DISTILLED SPIRITS SPECIALTY"
second_most_expensive
0 96.51

BTW, this is not nessecarily the way I would have done this. (Rank = 2 might have worked well (still would have required a sub-query though!)

%%bigquery --project pic-math
WITH table as (
SELECT state_bottle_retail, RANK() OVER(ORDER BY state_bottle_retail DESC) as rank_by_price
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE category_name = "DISTILLED SPIRITS SPECIALTY"
GROUP BY state_bottle_retail
)

SELECT state_bottle_retail as second_highest_price
FROM table
WHERE rank_by_price = 2
ORDER BY rank_by_price DESC

second_highest_price
0 96.51

Let’s try another. What if we want to find how many bottles in a category are over $100.

%%bigquery --project pic-math

SELECT category_name, COUNTIF(state_bottle_retail>100) as number_bottles_over_benjamin
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description
Order BY number_bottles_over_benjamin DESC
category_name number_bottles_over_benjamin
0 Scotch Whiskies 2552
1 SCOTCH WHISKIES 1897
2 100% Agave Tequila 1579
3 SINGLE MALT SCOTCH 1119
4 Imported Brandies 1090
... ... ...
13825 Straight Bourbon Whiskies 0
13826 American Cordials & Liqueur 0
13827 American Cordials & Liqueur 0
13828 MISC. AMERICAN CORDIALS & LIQUEURS 0
13829 MISC. IMPORTED CORDIALS & LIQUEURS 0

13830 rows × 2 columns

I think the book I am working from pre-dates COUNTIF. Let me show you a sub-query solution.

I’ll make a column asking if the bottle is over $100.

%%bigquery --project pic-math

SELECT *, (CASE WHEN state_bottle_retail >100 THEN 1 ELSE 0 END) as over_benjamin
FROM `bigquery-public-data.iowa_liquor_sales.sales`
LIMIT 10
invoice_and_item_number date store_number store_name address city zip_code store_location county_number county category category_name vendor_number vendor_name item_number item_description pack bottle_volume_ml state_bottle_cost state_bottle_retail bottles_sold sale_dollars volume_sold_liters volume_sold_gallons over_benjamin
0 INV-37606800019 2021-06-18 2190 Central City Liquor, Inc. 1460 2ND AVE Des Moines 50314.0 POINT (-93.619787 41.60566) 77 POLK 1082100.0 Imported Cordials & Liqueur 619 Campari America 65127 Grand Marnier Cordon Rouge 6 1000 25.55 38.33 2 76.66 2.00 0.52 0
1 INV-19821600132 2019-06-05 2629 Hy-Vee Food Store #2 / Council Bluffs 1745 Madison Ave Council Bluffs 51503 POINT (-95.825137 41.242732) 78 POTTAWATTA None None 55 SAZERAC NORTH AMERICA 84197 99 Pineapple Mini 10 600 6.00 9.00 2 18.00 1.20 0.31 0
2 INV-20266400001 2019-06-27 3713 Wal-Mart 0810 / Mason City 4151 4th St SW Mason City 50401 POINT (-93.261648 43.148239) 17 CERRO GORD 1062300.0 Aged Dark Rum 482 Sovereign Brands, LLC 64529 Bumbu Rum 6 750 18.50 27.75 18 499.50 13.50 3.56 0
3 INV-13387900079 2018-07-23 4829 Central City 2 1501 Michigan Ave Des Moines 50314 POINT (-93.613739 41.60572) 77 POLK 1081500.0 Triple Sec 434 LUXCO INC 86251 Juarez Triple Sec 12 1000 2.42 3.63 180 653.40 180.00 47.55 0
4 INV-13998900081 2018-08-22 5257 MAD Ave Quik Shop 405, Madison Ave Ottumwa 52501 None 90 WAPELLO 1092100.0 Imported Distilled Spirit Specialty 55.0 SAZERAC NORTH AMERICA 76227 Montezuma Blue 12 1000 4.66 6.99 120 838.80 120.00 31.70 0
5 S21846500003 2014-10-20 3618 Wal-Mart 2716 / Cedar Rapids 3601 29TH AVE SW CEDAR RAPIDS 52404 POINT (-91.718042 41.949215) 57 Linn 1701100.0 DECANTERS & SPECIALTY PACKAGES 65 Jim Beam Brands 789 Canadian Club w/Glass 6 750 7.88 11.82 30 354.60 22.50 5.94 0
6 S26037800001 2015-06-09 3944 Sam's Club 4973 / Dubuque 4400 ASBURY RD DUBUQUE 52002 POINT (-90.737821 42.515289) 31 Dubuque 1701100.0 DECANTERS & SPECIALTY PACKAGES 65 Jim Beam Brands 19082 Jim Beam Operation Homefront 6 1750 18.37 27.56 210 5787.60 367.50 97.08 0
7 INV-06908200123 2017-08-28 3612 B and C Liquor / Maquoketa 509 E Platt Maquoketa 52060.0 POINT (-90.659692 42.06922000000001) 49 JACKSON 1062100.0 Gold Rum 434 LUXCO INC 45248 Paramount Gold Rum 6 1750 7.84 11.76 3 35.28 5.25 1.39 0
8 INV-20029700068 2019-06-17 2576 Hy-Vee Wine and Spirits / Storm Lake 1250 N Lake St Storm Lake 50588 POINT (-95.200758 42.65318400000001) 11 BUENA VIST 1062300.0 Aged Dark Rum 65 Jim Beam Brands 44487 Cruzan Dark 12 1000 8.39 12.59 2 25.18 2.00 0.52 0
9 INV-24809400002 2020-01-27 3773 Benz Distributing 501 7th Ave SE Cedar Rapids 52401.0 POINT (-91.659875 41.97574) 57 LINN 1011800.0 Iowa Distillery Whiskies 125.0 CEDAR RIDGE VINEYARDS LL 928271 Cedar Ridge Private Cask Iowa Bourbon Selectio... 6 750 23.00 34.50 228 7866.00 171.00 45.17 0

The new column is at the end. I included the LIMIT as to reduce computation time for displaying. I won’t use that in the subquery here.

%%bigquery --project pic-math

SELECT category_name, SUM(over_benjamin) as number_bottles_over_benjamin
FROM (SELECT *, (CASE WHEN state_bottle_retail >100 THEN 1 ELSE 0 END) as over_benjamin
      FROM `bigquery-public-data.iowa_liquor_sales.sales`)
GROUP BY category_name, item_description
ORDER BY number_bottles_over_benjamin DESC
category_name number_bottles_over_benjamin
0 Scotch Whiskies 2552
1 SCOTCH WHISKIES 1897
2 100% Agave Tequila 1579
3 SINGLE MALT SCOTCH 1119
4 Imported Brandies 1090
... ... ...
13825 American Cordials & Liqueur 0
13826 MISC. AMERICAN CORDIALS & LIQUEURS 0
13827 MISC. IMPORTED CORDIALS & LIQUEURS 0
13828 MISC. IMPORTED CORDIALS & LIQUEURS 0
13829 MISC. IMPORTED CORDIALS & LIQUEURS 0

13830 rows × 2 columns

Well not really sure which is faster but you get to see the CASE command too for adding a column. By the way this is sometimes called an indicator variable. When something happens (bottle over $100) you get 1 and 0 otherwise. This is an excellent technique that we will revisist again!

Let’s add some more. Let’s get the percentage of bottles over a benjamin.

%%bigquery --project pic-math

SELECT category_name, COUNTIF(state_bottle_retail>100) as number_bottles_over_benjamin, COUNT(*) as number_of_bottles
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description
Order BY number_bottles_over_benjamin DESC
category_name number_bottles_over_benjamin number_of_bottles
0 Scotch Whiskies 2552 2553
1 SCOTCH WHISKIES 1897 1897
2 100% Agave Tequila 1579 3069
3 SINGLE MALT SCOTCH 1119 1119
4 Imported Brandies 1090 1194
... ... ... ...
13825 Straight Bourbon Whiskies 0 2
13826 American Cordials & Liqueur 0 1
13827 PUERTO RICO & VIRGIN ISLANDS RUM 0 1
13828 MISC. IMPORTED CORDIALS & LIQUEURS 0 2
13829 MISC. IMPORTED CORDIALS & LIQUEURS 0 1

13830 rows × 3 columns

Now I cannot just divide those two columns because they don’t exist yet!

%%bigquery --project pic-math

SELECT category_name, number_bottles_over_benjamin, number_of_bottles, number_bottles_over_benjamin/number_of_bottles as percent_over_benjamin
FROM (SELECT category_name, COUNTIF(state_bottle_retail>100) as number_bottles_over_benjamin, COUNT(*) as number_of_bottles
      FROM `bigquery-public-data.iowa_liquor_sales.sales`
      GROUP BY category_name, item_description)
GROUP BY category_name, number_bottles_over_benjamin, number_of_bottles
ORDER BY percent_over_benjamin DESC
category_name number_bottles_over_benjamin number_of_bottles percent_over_benjamin
0 Straight Rye Whiskies 199 199 1.0
1 DECANTERS & SPECIALTY PACKAGES 292 292 1.0
2 Special Order Items 33 33 1.0
3 Special Order Items 24 24 1.0
4 Single Malt Scotch 256 256 1.0
... ... ... ... ...
8349 Temporary & Specialty Packages 0 4 0.0
8350 Flavored Gin 0 2 0.0
8351 CHERRY BRANDIES 0 1 0.0
8352 CREME DE ALMOND 0 1 0.0
8353 Delisted / Special Order Items 0 1 0.0

8354 rows × 4 columns

I got here and realized this is not really the number of bottles. How might you fix that so that this number represents the actual number of bottles sold?

Your Turn#

Using the dataset ‘austin_bikeshare.bikeshare_trips’ answer the following questions:

  1. What was the second most popular starting station?

  2. How many trips lasted over an hour and were a round trip (started and stopped at the same station)?

The following table will be helpful place to start for number 1.

%%bigquery --project pic-math

SELECT COUNT(*) as number_trips, start_station_name
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
GROUP BY start_station_name
ORDER BY number_trips DESC
number_trips start_station_name
0 72799 21st & Speedway @PCL
1 40635 Riverside @ S. Lamar
2 36520 City Hall / Lavaca & 2nd
3 35307 2nd & Congress
4 34758 Rainey St @ Cummings
... ... ...
188 4 Marketing Event
189 2 Eeyore's 2018
190 1 Stolen
191 1 Eeyore's 2017
192 1 cesar Chavez/Congress

193 rows × 2 columns

This table (without limit) may be useful for number 2)

%%bigquery --project pic-math

SELECT start_station_name, duration_minutes
FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
WHERE start_station_name = end_station_name
LIMIT 100
start_station_name duration_minutes
0 Toomey Rd @ South Lamar 31
1 Toomey Rd @ South Lamar 31
2 Toomey Rd @ South Lamar 30
3 State Capitol @ 14th & Colorado 19
4 State Capitol @ 14th & Colorado 17
... ... ...
95 Rainey @ River St 87
96 Zilker Park West 11
97 ACC - West & 12th Street 17
98 Rainey @ River St 87
99 Toomey Rd @ South Lamar 0

100 rows × 2 columns