SQL Wrap Up

SQL Wrap Up#

To be honest, I am not sure what else to show you. Not that there isn’t more to cover but that I have exhausted my knowledge.

What I think we should do today is to challenge one another. We are familiar with some of the datasets, can we challenge one another with questions about the data.

from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated

I’ll start with an example.

How many bottles in each category cost more than two standard deviations above the mean?

First I’ll gather the stats of mean and standard deviation.

%%bigquery --project pic-math

SELECT category_name, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation 
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name
LIMIT 5

	invoice_and_item_number	date	store_number	store_name	address	city	zip_code	store_location	county_number	county	category	category_name	vendor_number	vendor_name	item_number	item_description	pack	bottle_volume_ml	state_bottle_cost	state_bottle_retail	bottles_sold	sale_dollars	volume_sold_liters	volume_sold_gallons
0	INV-37645300014	2021-06-18	3849	Hartig Drug Co #6 / Dyersville	711 16th Ave SE	Dyersville	52040.0	POINT (-91.116087 42.470285)	31	DUBUQUE	1081100.0	Coffee Liqueurs	370	PERNOD RICARD USA	67527	Kahlua Coffee	12	1000	15.99	23.99	2	47.98	2.0	0.52
1	INV-37770500048	2021-06-24	5293	The Boonedocks	228, N Hwy 71	Arnolds Park	51331.0	POINT (-95.125928 43.370941)	30	DICKINSON	1081100.0	Coffee Liqueurs	370	PERNOD RICARD USA	67527	Kahlua Coffee	12	1000	15.99	23.99	4	95.96	4.0	1.05
2	S04111300054	2012-02-15	2565	Hy-Vee Food Store / Spencer	819 N GRAND AVE	SPENCER	51301	POINT (-95.14506 43.145897)	21	Clay	1081330.0	PEACH SCHNAPPS	65	Jim Beam Brands	82847	Dekuyper Peachtree	12	1000	7.35	11.02	5	55.10	5.0	1.32
3	INV-06891500007	2017-08-25	2835	CVS Pharmacy #8538 / Cedar Falls	2302 West First St	Cedar Falls	50613.0	POINT (-92.472778 42.539874)	7	BLACK HAWK	1011600.0	Straight Rye Whiskies	255	Infinium Spirits	27102	Templeton 4YR Rye	6	750	18.09	27.14	18	488.52	13.5	3.57
4	INV-37843000106	2021-06-25	2644	Hy-Vee Fort Dodge Wine and Spirits	1511 2nd Ave North	Fort Dodge	50501.0	POINT (-94.177165 42.508344)	94	WEBSTER	1081500.0	Triple Sec	434	LUXCO INC	86251	Juarez Triple Sec	12	1000	2.42	3.63	48	174.24	48.0	12.68

I am going to use the above table to join with the full table and make the comparison.

%%bigquery --project pic-math

WITH statsTable as(
SELECT category_name, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation 
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name
)

SELECT t.category_name, COUNT(*) as bottles_over_two_sd
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable 
      ON t.category_name = statsTable.category_name
WHERE t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation
GROUP BY category_name

	category_name	bottles_over_two_sd
0	Neutral Grain Spirits Flavored	1182
1	IMPORTED DRY GINS	10258
2	Triple Sec	5063
3	AMERICAN AMARETTO	5398
4	COFFEE LIQUEURS	2771
...	...	...
109	WHITE CREME DE MENTHE	2
110	WHITE CREME DE CACAO	1
111	SCHNAPPS - IMPORTED	1
112	DARK CREME DE CACAO	1
113	CHERRY BRANDIES	1

114 rows × 2 columns

I have a result but I see several issues here. One I am counting the same bottles over and over again. I think I need to group by the item_description before I take an average over the category.

%%bigquery --project pic-math

SELECT category_name, item_description, AVG(state_bottle_retail) as average, STDDEV(state_bottle_retail) as standarddeviation 
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description

	category_name	item_description	average	standarddeviation
0	Triple Sec	Tortilla Triple Sec	4.520000	0.000000
1	BUTTERSCOTCH SCHNAPPS	Dekuyper Buttershots	11.354986	0.239511
2	IMPORTED DRY GINS	Tanqueray Gin	21.887818	7.089758
3	Coffee Liqueurs	Kahlua Coffee Liqueur	20.277870	6.306792
4	Coffee Liqueurs	Kahlua Coffee	20.795766	6.624718
...	...	...	...	...
13825	Imported Distilled Spirit Specialty	Plantation Trinidad Rum 2005	49.500000	NaN
13826	Imported Distilled Spirit Specialty	Tequila Anejo 750ml Fenice	69.240000	NaN
13827	Imported Distilled Spirit Specialty	Vikre Øvrevann Aquavit	29.010000	NaN
13828	Imported Distilled Spirit Specialty	Amrita Indian Whiskey 375ml Pilar	56.450000	0.000000
13829	Imported Distilled Spirit Specialty	Stoli Crushed Pineapple Mini	8.820000	NaN

13830 rows × 4 columns

That did not work as I’d have hoped (and took a long time to run!) Instead I am going to gather the MAX price of each bottle by item_description and use that. It won’t give perfect statistics but it will be better than what I have done.

%%bigquery --project pic-math

SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
FROM `bigquery-public-data.iowa_liquor_sales.sales`
GROUP BY category_name, item_description, bottle_volume_ml

	category_name	item_description	bottle_volume_ml	retail_max
0	DECANTERS & SPECIALTY PACKAGES	Hennessy VS w/Hennessy Black 50ml	800	27.74
1	DECANTERS & SPECIALTY PACKAGES	Johnnie Walker Collection Pack	800	89.99
2	DECANTERS & SPECIALTY PACKAGES	Jim Beam Black w/4-50mls	950	20.48
3	DECANTERS & SPECIALTY PACKAGES	Malibu w/50ml Black & Red	850	11.24
4	DECANTERS & SPECIALTY PACKAGES	Captain Morgan Original Spiced 1.75L w/CM Whit...	1950	27.00
...	...	...	...	...
16698	VODKA FLAVORED	Burnetts Appleberry Mini	500	6.38
16699	American Flavored Vodka	Pearl Vanilla Bean Mini	500	6.60
16700	None	360 Huckleberry Mini	500	7.80
16701	MISCELLANEOUS SCHNAPPS	99 Apples Mini Display	500	4.95
16702	CREAM LIQUEURS	Tippy Cow Vanilla Mini	500	11.13

16703 rows × 4 columns

%%bigquery --project pic-math

WITH statsTable as(
SELECT category_name, AVG(retail_max) as average, STDDEV(retail_max) as standarddeviation 
FROM (SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
      FROM `bigquery-public-data.iowa_liquor_sales.sales`
      GROUP BY category_name, item_description, bottle_volume_ml) 
GROUP BY category_name
)

SELECT t.category_name, COUNT(*) as bottles_over_two_sd
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable 
      ON t.category_name = statsTable.category_name
WHERE t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation
GROUP BY category_name
ORDER BY bottles_over_two_sd

	category_name	bottles_over_two_sd
0	Mezcal	1
1	FLAVORED RUM	1
2	PEPPERMINT SCHNAPPS	1
3	Triple Sec	1
4	CREME DE ALMOND	1
...	...	...
97	American Flavored Vodka	6817
98	Whiskey Liqueur	7330
99	Flavored Rum	10316
100	CANADIAN WHISKIES	15258
101	WHISKEY LIQUEUR	35516

102 rows × 2 columns

Can you improve the way I have done this? Does it make sense that there are that many bottles of Whiskey?

I am just going to see if these are correct. Let’s see if we can get the table to include total number of bottles.

%%bigquery --project pic-math

WITH statsTable as(
SELECT category_name, AVG(retail_max) as average, STDDEV(retail_max) as standarddeviation 
FROM (SELECT category_name, item_description, bottle_volume_ml, MAX(state_bottle_retail) as retail_max
      FROM `bigquery-public-data.iowa_liquor_sales.sales`
      GROUP BY category_name, item_description, bottle_volume_ml)
GROUP BY category_name
)

SELECT t.category_name, 
        COUNTIF(t.state_bottle_retail > statsTable.average + 2*statsTable.standarddeviation) as bottles_over_two_sd, 
        COUNT(*) as total_bottles, 
        AVG(statsTable.average)as average_price, 
        AVG(statsTable.standarddeviation) as standard_deviation_price
FROM `bigquery-public-data.iowa_liquor_sales.sales` t JOIN statsTable 
      ON t.category_name = statsTable.category_name
GROUP BY category_name
ORDER BY bottles_over_two_sd

	category_name	bottles_over_two_sd	total_bottles	average_price	standard_deviation_price
0	WATERMELON SCHNAPPS	0	17844	7.247500	4.970288
1	ROOT BEER SCHNAPPS	0	13615	10.398182	3.146531
2	American Sloe Gins	0	4252	11.630000	7.000000
3	GRAPE SCHNAPPS	0	13044	10.630000	1.668772
4	Distilled Spirits Specialty	0	293	20.760000	12.332202
...	...	...	...	...	...
130	American Flavored Vodka	6817	637286	11.274107	5.030962
131	Whiskey Liqueur	7330	651397	19.606241	21.243081
132	Flavored Rum	10316	328127	13.934458	5.546112
133	CANADIAN WHISKIES	15258	936212	18.787585	14.252754
134	WHISKEY LIQUEUR	35516	334572	13.477604	9.556290

135 rows × 5 columns

Yeah maybe that is a reasonable number of whiskeys…

If the price of the bottles was normally distributed, we’d only expect 2.5% but I doubt the price is normal. Using the mean and standard deviation we see that it is clearly a funky distribution.

Your Turn#

Challenge your mates with a difficult question. Try to stump them and don’t forget to try it yourself! For the quiz, pair (or thruple, three or less please!) up with someone in the class and try to come up with a challenge for you both to tackle. I’d prefer you consider the datasets we have already looked at; liquors, bikeshare (stations or trips) and census. Try to pose the challenge clearly and see if you can solve it. Only new questions please!

Examples#

Please only use these if you are really stumped on what to ask!

What zip code consumes the most liquor per capita in Iowa?
Which zip code uses the bikeshare the most per capita in Austin?
Does the percentage of children in an area effect the amount of liquor consumed? (You need the per capita consumption for this to be accurate!)
What type of bikeshare station had the most trips going over a day?

SQL Wrap Up

Contents

SQL Wrap Up#

Your Turn#

Examples#