SQL Wrap Up

Contents

Open In Colab

SQL Wrap Up#

To be honest, I am not sure what else to show you. Not that there isn’t more to cover but that I have exhausted my knowledge.

What I think we should do today is to challenge one another. We are familiar with some of the datasets, can we challenge one another with questions about the data.

from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated

I’ll start with an example.

How many bottles in each category cost more than two standard deviations above the mean?

First I’ll gather the stats of mean and standard deviation.

%%bigquery --project pic-math

SELECT category_name, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation 
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name
LIMIT 5
invoice_and_item_number date store_number store_name address city zip_code store_location county_number county category category_name vendor_number vendor_name item_number item_description pack bottle_volume_ml state_bottle_cost state_bottle_retail bottles_sold sale_dollars volume_sold_liters volume_sold_gallons
0 INV-37645300014 2021-06-18 3849 Hartig Drug Co #6 / Dyersville 711 16th Ave SE Dyersville 52040.0 POINT (-91.116087 42.470285) 31 DUBUQUE 1081100.0 Coffee Liqueurs 370 PERNOD RICARD USA 67527 Kahlua Coffee 12 1000 15.99 23.99 2 47.98 2.0 0.52
1 INV-37770500048 2021-06-24 5293 The Boonedocks 228, N Hwy 71 Arnolds Park 51331.0 POINT (-95.125928 43.370941) 30 DICKINSON 1081100.0 Coffee Liqueurs 370 PERNOD RICARD USA 67527 Kahlua Coffee 12 1000 15.99 23.99 4 95.96 4.0 1.05
2 S04111300054 2012-02-15 2565 Hy-Vee Food Store / Spencer 819 N GRAND AVE SPENCER 51301 POINT (-95.14506 43.145897) 21 Clay 1081330.0 PEACH SCHNAPPS 65 Jim Beam Brands 82847 Dekuyper Peachtree 12 1000 7.35 11.02 5 55.10 5.0 1.32
3 INV-06891500007 2017-08-25 2835 CVS Pharmacy #8538 / Cedar Falls 2302 West First St Cedar Falls 50613.0 POINT (-92.472778 42.539874) 7 BLACK HAWK 1011600.0 Straight Rye Whiskies 255 Infinium Spirits 27102 Templeton 4YR Rye 6 750 18.09 27.14 18 488.52 13.5 3.57
4 INV-37843000106 2021-06-25 2644 Hy-Vee Fort Dodge Wine and Spirits 1511 2nd Ave North Fort Dodge 50501.0 POINT (-94.177165 42.508344) 94 WEBSTER 1081500.0 Triple Sec 434 LUXCO INC 86251 Juarez Triple Sec 12 1000 2.42 3.63 48 174.24 48.0 12.68

I am going to use the above table to join with the full table and make the comparison.

%%bigquery --project pic-math

WITH statsTable as(
SELECT category_name, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation 
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name
)

SELECT t.category_name, COUNT(*) as bottles_over_two_sd
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable 
      ON t.category_name = statsTable.category_name
WHERE t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation
GROUP BY category_name
category_name bottles_over_two_sd
0 Neutral Grain Spirits Flavored 1182
1 IMPORTED DRY GINS 10258
2 Triple Sec 5063
3 AMERICAN AMARETTO 5398
4 COFFEE LIQUEURS 2771
... ... ...
109 WHITE CREME DE MENTHE 2
110 WHITE CREME DE CACAO 1
111 SCHNAPPS - IMPORTED 1
112 DARK CREME DE CACAO 1
113 CHERRY BRANDIES 1

114 rows × 2 columns

I have a result but I see several issues here. One I am counting the same bottles over and over again. I think I need to group by the item_description before I take an average over the category.

%%bigquery --project pic-math

SELECT category_name, item_description, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation 
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description
category_name item_description average standarddeviation
0 Triple Sec Tortilla Triple Sec 4.520000 0.000000
1 BUTTERSCOTCH SCHNAPPS Dekuyper Buttershots 11.354986 0.239511
2 IMPORTED DRY GINS Tanqueray Gin 21.887818 7.089758
3 Coffee Liqueurs Kahlua Coffee Liqueur 20.277870 6.306792
4 Coffee Liqueurs Kahlua Coffee 20.795766 6.624718
... ... ... ... ...
13825 Imported Distilled Spirit Specialty Plantation Trinidad Rum 2005 49.500000 NaN
13826 Imported Distilled Spirit Specialty Tequila Anejo 750ml Fenice 69.240000 NaN
13827 Imported Distilled Spirit Specialty Vikre Øvrevann Aquavit 29.010000 NaN
13828 Imported Distilled Spirit Specialty Amrita Indian Whiskey 375ml Pilar 56.450000 0.000000
13829 Imported Distilled Spirit Specialty Stoli Crushed Pineapple Mini 8.820000 NaN

13830 rows × 4 columns

That did not work as I’d have hoped (and took a long time to run!) Instead I am going to gather the MAX price of each bottle by item_description and use that. It won’t give perfect statistics but it will be better than what I have done.

%%bigquery --project pic-math

SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description, bottle_volume_ml
category_name item_description bottle_volume_ml retail_max
0 DECANTERS & SPECIALTY PACKAGES Hennessy VS w/Hennessy Black 50ml 800 27.74
1 DECANTERS & SPECIALTY PACKAGES Johnnie Walker Collection Pack 800 89.99
2 DECANTERS & SPECIALTY PACKAGES Jim Beam Black w/4-50mls 950 20.48
3 DECANTERS & SPECIALTY PACKAGES Malibu w/50ml Black & Red 850 11.24
4 DECANTERS & SPECIALTY PACKAGES Captain Morgan Original Spiced 1.75L w/CM Whit... 1950 27.00
... ... ... ... ...
16698 VODKA FLAVORED Burnetts Appleberry Mini 500 6.38
16699 American Flavored Vodka Pearl Vanilla Bean Mini 500 6.60
16700 None 360 Huckleberry Mini 500 7.80
16701 MISCELLANEOUS SCHNAPPS 99 Apples Mini Display 500 4.95
16702 CREAM LIQUEURS Tippy Cow Vanilla Mini 500 11.13

16703 rows × 4 columns

%%bigquery --project pic-math

WITH statsTable as(
SELECT category_name, AVG(retail_max) as average, STDDEV(retail_max) as standarddeviation 
FROM (SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
      FROM `bigquery-public-data.iowa_liquor_sales.sales`
      GROUP BY category_name, item_description, bottle_volume_ml) 
GROUP BY category_name
)

SELECT t.category_name, COUNT(*) as bottles_over_two_sd
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable 
      ON t.category_name = statsTable.category_name
WHERE t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation
GROUP BY category_name
ORDER BY bottles_over_two_sd
category_name bottles_over_two_sd
0 Mezcal 1
1 FLAVORED RUM 1
2 PEPPERMINT SCHNAPPS 1
3 Triple Sec 1
4 CREME DE ALMOND 1
... ... ...
97 American Flavored Vodka 6817
98 Whiskey Liqueur 7330
99 Flavored Rum 10316
100 CANADIAN WHISKIES 15258
101 WHISKEY LIQUEUR 35516

102 rows × 2 columns

Can you improve the way I have done this? Does it make sense that there are that many bottles of Whiskey?

I am just going to see if these are correct. Let’s see if we can get the table to include total number of bottles.

%%bigquery --project pic-math

WITH statsTable as(
SELECT category_name, AVG(retail_max) as average, STDDEV(retail_max) as standarddeviation 
FROM (SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
      FROM `bigquery-public-data.iowa_liquor_sales.sales`
      GROUP BY category_name, item_description, bottle_volume_ml)
GROUP BY category_name
)

SELECT t.category_name, 
        COUNTIF(t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation) as bottles_over_two_sd, 
        COUNT(*) as total_bottles, 
        AVG(statsTable.average)as average_price, 
        AVG(statsTable.standarddeviation) as standard_deviation_price
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable 
      ON t.category_name = statsTable.category_name
GROUP BY category_name
ORDER BY bottles_over_two_sd
category_name bottles_over_two_sd total_bottles average_price standard_deviation_price
0 WATERMELON SCHNAPPS 0 17844 7.247500 4.970288
1 ROOT BEER SCHNAPPS 0 13615 10.398182 3.146531
2 American Sloe Gins 0 4252 11.630000 7.000000
3 GRAPE SCHNAPPS 0 13044 10.630000 1.668772
4 Distilled Spirits Specialty 0 293 20.760000 12.332202
... ... ... ... ... ...
130 American Flavored Vodka 6817 637286 11.274107 5.030962
131 Whiskey Liqueur 7330 651397 19.606241 21.243081
132 Flavored Rum 10316 328127 13.934458 5.546112
133 CANADIAN WHISKIES 15258 936212 18.787585 14.252754
134 WHISKEY LIQUEUR 35516 334572 13.477604 9.556290

135 rows × 5 columns

Yeah maybe that is a reasonable number of whiskeys…

If the price of the bottles was normally distributed, we’d only expect 2.5% but I doubt the price is normal. Using the mean and standard deviation we see that it is clearly a funky distribution.

Your Turn#

Challenge your mates with a difficult question. Try to stump them and don’t forget to try it yourself! For the quiz, pair (or thruple, three or less please!) up with someone in the class and try to come up with a challenge for you both to tackle. I’d prefer you consider the datasets we have already looked at; liquors, bikeshare (stations or trips) and census. Try to pose the challenge clearly and see if you can solve it. Only new questions please!

Examples#

Please only use these if you are really stumped on what to ask!

  1. What zip code consumes the most liquor per capita in Iowa?

  2. Which zip code uses the bikeshare the most per capita in Austin?

  3. Does the percentage of children in an area effect the amount of liquor consumed? (You need the per capita consumption for this to be accurate!)

  4. What type of bikeshare station had the most trips going over a day?