The SQL aggregate functions like , , and from the previous lesson process an entire column to come up with an answer. But what if you want to aggregate only part of a table?
For example, instead of finding the average price all Amazon products orders, what if you want to find the average spent by category (i.e. average spent on appliances vs. electronics?).
For Excel users, the"pivot table" comes to mind, and the SQL equivalent to that concept is the command which tells the database to separate your data into groups, which can then be aggregated independently.
Here's a example that groups Amazon product sales by category:
The output of that query yields this result:
category | sum |
---|---|
electronics | 1007.54 |
appliance | 1135.22 |
Try playing around with yourself by copy-pasting the earlier code and running it against the data in this Amazon SQL interview question:
Both in SQL and Pivot Tables in Excel are tools designed to aggregate and summarize data. To demonstrate their similarities, let's first create a Pivot Table for a data set of FAANG stocks.
This pivot table finds the average open price for each stock, and the max close price for each stock:
You can notice that the "rows" dialog box in the spreadsheet tool is essentially asking you what column you want to group by ("ticker"), and the values box is asking you what column values you want to display ("open", "close"), and how they'll be aggregated (average, max).
This is the equivalent to the SQL query:
Notice how the SQL query output matches up to the pivot table!
You can GROUP BY two columns, even multiple columns, in SQL. To group by multiple categories, just separate column names with commas (just like the syntax in the !).
For an example of grouping by multiple columns, here's the average stock open price grouped by each ticker symbol and group by each year:
Here's a sample of that output:
ticker | year | avg_open |
---|---|---|
NFLX | 2023 | 364.91 |
META | 2023 | 220.35 |
AMZN | 2023 | 109.27 |
MSFT | 2023 | 291.96 |
AAPL | 2023 | 171.21 |
GOOG | 2023 | 108.10 |
MSFT | 2022 | 276.78 |
AAPL | 2022 | 151.56 |
META | 2022 | 193.06 |
... | ... | ... |
And again, just like with syntax for multiple columns, for shorthand, you can use numbers instead of typing out the full column names, to make your query more concise. Here's the exact same query in the format:
Given FAANG stock prices data, can you write a SQL query which uses GROUP BY to find the lowest price each stock ever opened at? Order your results by price, in descending order. Your output should look like this:
ticker | min |
---|---|
NFLX | 176.49 |
MSFT | 153.00 |
META | 94.33 |
AMZN | 85.46 |
AAPL | 61.63 |
GOOG | 56.10 |
Click the below to practice this GROUP BY exercise question in the DataLemur interactive SQL editor:
While isn't exactly used for the purpose of finding duplicates, the command does allow you to collapse multiple rows with the same values into a single row. So, in a way, it could be considered as a way to get rid of duplicates for certain columns.
To demonstrate this, imagine you were analyzing this website traffic data from a Google Analytics report:
browser | visits |
---|---|
Chrome | 6 |
Safari | 2 |
Safari | 4 |
Edge | 3 |
Safari | 5 |
Chrome | 3 |
Chrome | 4 |
To get rid of duplicate browser data, and aggregate the visits information, you could use the following GROUP BY query:
This would eliminate the duplicate browser information, and output the following result:
browser | visits |
---|---|
Chrome | 13 |
Safari | 11 |
Edge | 3 |
Suppose you are given a table of Data Science candidates, and their technical skills:
Sample Input:
candidate_id | skill |
---|---|
123 | Python |
234 | R |
234 | Python |
234 | SQL Server |
345 | Python |
... | ... |
How many candidates possess each of the different skills?
Sort your answers based on the count of candidates, from highest to lowest. Here's what the expected output looks like:
skill | count |
---|---|
Python | 3 |
R | 1 |
SQL Server | 1 |
Click below to solve this GROUP BY practice question in the DataLemur interactive SQL editor:
People sometimes get confused between and because both commands have in them, but their actual function is quite different!
ORDER BY helps you output your rows in a specific order, such as alphabetically on some text column, or from smallest to biggest, for some text column.
, as you saw earlier, is all about grouping your data into categories! Because they are quite different commands, it's absolutely possible for a query to have both GROUP BY and ORDER BY, with GROUP BY coming first!
Here's a query that uses both GROUP BY and ORDER BY:
It takes the FAANG stock data, and finds the average stock open price for each ticker symbol. Then, the clause orders the average stock open price, from highest to lowest:
Your now ready for the next lesson on HAVING which is 99.9% of the time used in conjunction with the GROUP BY clause. Curious about this dynamic duo?
Click on over to the next tutorial!
Next Lesson
SQL HAVING π€