SQL Wrap Up#
To be honest, I am not sure what else to show you. Not that there isn’t more to cover but that I have exhausted my knowledge.
What I think we should do today is to challenge one another. We are familiar with some of the datasets, can we challenge one another with questions about the data.
from google.colab import auth
auth.authenticate_user()
print('Authenticated')
Authenticated
I’ll start with an example.
How many bottles in each category cost more than two standard deviations above the mean?
First I’ll gather the stats of mean and standard deviation.
%%bigquery --project pic-math
SELECT category_name, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name
LIMIT 5
invoice_and_item_number | date | store_number | store_name | address | city | zip_code | store_location | county_number | county | category | category_name | vendor_number | vendor_name | item_number | item_description | pack | bottle_volume_ml | state_bottle_cost | state_bottle_retail | bottles_sold | sale_dollars | volume_sold_liters | volume_sold_gallons | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | INV-37645300014 | 2021-06-18 | 3849 | Hartig Drug Co #6 / Dyersville | 711 16th Ave SE | Dyersville | 52040.0 | POINT (-91.116087 42.470285) | 31 | DUBUQUE | 1081100.0 | Coffee Liqueurs | 370 | PERNOD RICARD USA | 67527 | Kahlua Coffee | 12 | 1000 | 15.99 | 23.99 | 2 | 47.98 | 2.0 | 0.52 |
1 | INV-37770500048 | 2021-06-24 | 5293 | The Boonedocks | 228, N Hwy 71 | Arnolds Park | 51331.0 | POINT (-95.125928 43.370941) | 30 | DICKINSON | 1081100.0 | Coffee Liqueurs | 370 | PERNOD RICARD USA | 67527 | Kahlua Coffee | 12 | 1000 | 15.99 | 23.99 | 4 | 95.96 | 4.0 | 1.05 |
2 | S04111300054 | 2012-02-15 | 2565 | Hy-Vee Food Store / Spencer | 819 N GRAND AVE | SPENCER | 51301 | POINT (-95.14506 43.145897) | 21 | Clay | 1081330.0 | PEACH SCHNAPPS | 65 | Jim Beam Brands | 82847 | Dekuyper Peachtree | 12 | 1000 | 7.35 | 11.02 | 5 | 55.10 | 5.0 | 1.32 |
3 | INV-06891500007 | 2017-08-25 | 2835 | CVS Pharmacy #8538 / Cedar Falls | 2302 West First St | Cedar Falls | 50613.0 | POINT (-92.472778 42.539874) | 7 | BLACK HAWK | 1011600.0 | Straight Rye Whiskies | 255 | Infinium Spirits | 27102 | Templeton 4YR Rye | 6 | 750 | 18.09 | 27.14 | 18 | 488.52 | 13.5 | 3.57 |
4 | INV-37843000106 | 2021-06-25 | 2644 | Hy-Vee Fort Dodge Wine and Spirits | 1511 2nd Ave North | Fort Dodge | 50501.0 | POINT (-94.177165 42.508344) | 94 | WEBSTER | 1081500.0 | Triple Sec | 434 | LUXCO INC | 86251 | Juarez Triple Sec | 12 | 1000 | 2.42 | 3.63 | 48 | 174.24 | 48.0 | 12.68 |
I am going to use the above table to join with the full table and make the comparison.
%%bigquery --project pic-math
WITH statsTable as(
SELECT category_name, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name
)
SELECT t.category_name, COUNT(*) as bottles_over_two_sd
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable
ON t.category_name = statsTable.category_name
WHERE t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation
GROUP BY category_name
category_name | bottles_over_two_sd | |
---|---|---|
0 | Neutral Grain Spirits Flavored | 1182 |
1 | IMPORTED DRY GINS | 10258 |
2 | Triple Sec | 5063 |
3 | AMERICAN AMARETTO | 5398 |
4 | COFFEE LIQUEURS | 2771 |
... | ... | ... |
109 | WHITE CREME DE MENTHE | 2 |
110 | WHITE CREME DE CACAO | 1 |
111 | SCHNAPPS - IMPORTED | 1 |
112 | DARK CREME DE CACAO | 1 |
113 | CHERRY BRANDIES | 1 |
114 rows × 2 columns
I have a result but I see several issues here. One I am counting the same bottles over and over again. I think I need to group by the item_description before I take an average over the category.
%%bigquery --project pic-math
SELECT category_name, item_description, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description
category_name | item_description | average | standarddeviation | |
---|---|---|---|---|
0 | Triple Sec | Tortilla Triple Sec | 4.520000 | 0.000000 |
1 | BUTTERSCOTCH SCHNAPPS | Dekuyper Buttershots | 11.354986 | 0.239511 |
2 | IMPORTED DRY GINS | Tanqueray Gin | 21.887818 | 7.089758 |
3 | Coffee Liqueurs | Kahlua Coffee Liqueur | 20.277870 | 6.306792 |
4 | Coffee Liqueurs | Kahlua Coffee | 20.795766 | 6.624718 |
... | ... | ... | ... | ... |
13825 | Imported Distilled Spirit Specialty | Plantation Trinidad Rum 2005 | 49.500000 | NaN |
13826 | Imported Distilled Spirit Specialty | Tequila Anejo 750ml Fenice | 69.240000 | NaN |
13827 | Imported Distilled Spirit Specialty | Vikre Øvrevann Aquavit | 29.010000 | NaN |
13828 | Imported Distilled Spirit Specialty | Amrita Indian Whiskey 375ml Pilar | 56.450000 | 0.000000 |
13829 | Imported Distilled Spirit Specialty | Stoli Crushed Pineapple Mini | 8.820000 | NaN |
13830 rows × 4 columns
That did not work as I’d have hoped (and took a long time to run!) Instead I am going to gather the MAX
price of each bottle by item_description and use that. It won’t give perfect statistics but it will be better than what I have done.
%%bigquery --project pic-math
SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description, bottle_volume_ml
category_name | item_description | bottle_volume_ml | retail_max | |
---|---|---|---|---|
0 | DECANTERS & SPECIALTY PACKAGES | Hennessy VS w/Hennessy Black 50ml | 800 | 27.74 |
1 | DECANTERS & SPECIALTY PACKAGES | Johnnie Walker Collection Pack | 800 | 89.99 |
2 | DECANTERS & SPECIALTY PACKAGES | Jim Beam Black w/4-50mls | 950 | 20.48 |
3 | DECANTERS & SPECIALTY PACKAGES | Malibu w/50ml Black & Red | 850 | 11.24 |
4 | DECANTERS & SPECIALTY PACKAGES | Captain Morgan Original Spiced 1.75L w/CM Whit... | 1950 | 27.00 |
... | ... | ... | ... | ... |
16698 | VODKA FLAVORED | Burnetts Appleberry Mini | 500 | 6.38 |
16699 | American Flavored Vodka | Pearl Vanilla Bean Mini | 500 | 6.60 |
16700 | None | 360 Huckleberry Mini | 500 | 7.80 |
16701 | MISCELLANEOUS SCHNAPPS | 99 Apples Mini Display | 500 | 4.95 |
16702 | CREAM LIQUEURS | Tippy Cow Vanilla Mini | 500 | 11.13 |
16703 rows × 4 columns
%%bigquery --project pic-math
WITH statsTable as(
SELECT category_name, AVG(retail_max) as average, STDDEV(retail_max) as standarddeviation
FROM (SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description, bottle_volume_ml)
GROUP BY category_name
)
SELECT t.category_name, COUNT(*) as bottles_over_two_sd
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable
ON t.category_name = statsTable.category_name
WHERE t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation
GROUP BY category_name
ORDER BY bottles_over_two_sd
category_name | bottles_over_two_sd | |
---|---|---|
0 | Mezcal | 1 |
1 | FLAVORED RUM | 1 |
2 | PEPPERMINT SCHNAPPS | 1 |
3 | Triple Sec | 1 |
4 | CREME DE ALMOND | 1 |
... | ... | ... |
97 | American Flavored Vodka | 6817 |
98 | Whiskey Liqueur | 7330 |
99 | Flavored Rum | 10316 |
100 | CANADIAN WHISKIES | 15258 |
101 | WHISKEY LIQUEUR | 35516 |
102 rows × 2 columns
Can you improve the way I have done this? Does it make sense that there are that many bottles of Whiskey?
I am just going to see if these are correct. Let’s see if we can get the table to include total number of bottles.
%%bigquery --project pic-math
WITH statsTable as(
SELECT category_name, AVG(retail_max) as average, STDDEV(retail_max) as standarddeviation
FROM (SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description, bottle_volume_ml)
GROUP BY category_name
)
SELECT t.category_name,
COUNTIF(t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation) as bottles_over_two_sd,
COUNT(*) as total_bottles,
AVG(statsTable.average)as average_price,
AVG(statsTable.standarddeviation) as standard_deviation_price
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable
ON t.category_name = statsTable.category_name
GROUP BY category_name
ORDER BY bottles_over_two_sd
category_name | bottles_over_two_sd | total_bottles | average_price | standard_deviation_price | |
---|---|---|---|---|---|
0 | WATERMELON SCHNAPPS | 0 | 17844 | 7.247500 | 4.970288 |
1 | ROOT BEER SCHNAPPS | 0 | 13615 | 10.398182 | 3.146531 |
2 | American Sloe Gins | 0 | 4252 | 11.630000 | 7.000000 |
3 | GRAPE SCHNAPPS | 0 | 13044 | 10.630000 | 1.668772 |
4 | Distilled Spirits Specialty | 0 | 293 | 20.760000 | 12.332202 |
... | ... | ... | ... | ... | ... |
130 | American Flavored Vodka | 6817 | 637286 | 11.274107 | 5.030962 |
131 | Whiskey Liqueur | 7330 | 651397 | 19.606241 | 21.243081 |
132 | Flavored Rum | 10316 | 328127 | 13.934458 | 5.546112 |
133 | CANADIAN WHISKIES | 15258 | 936212 | 18.787585 | 14.252754 |
134 | WHISKEY LIQUEUR | 35516 | 334572 | 13.477604 | 9.556290 |
135 rows × 5 columns
Yeah maybe that is a reasonable number of whiskeys…
If the price of the bottles was normally distributed, we’d only expect 2.5% but I doubt the price is normal. Using the mean and standard deviation we see that it is clearly a funky distribution.
Your Turn#
Challenge your mates with a difficult question. Try to stump them and don’t forget to try it yourself! For the quiz, pair (or thruple, three or less please!) up with someone in the class and try to come up with a challenge for you both to tackle. I’d prefer you consider the datasets we have already looked at; liquors, bikeshare (stations or trips) and census. Try to pose the challenge clearly and see if you can solve it. Only new questions please!
Examples#
Please only use these if you are really stumped on what to ask!
What zip code consumes the most liquor per capita in Iowa?
Which zip code uses the bikeshare the most per capita in Austin?
Does the percentage of children in an area effect the amount of liquor consumed? (You need the per capita consumption for this to be accurate!)
What type of bikeshare station had the most trips going over a day?