Back to questions
This is the same question as problem #2 in the SQL Chapter of Ace the Data Science Interview!
Assume you're given the tables containing completed trade orders and user details in a Robinhood trading system.
Write a query to retrieve the top three cities that have the highest number of completed trade orders listed in descending order. Output the city name and the corresponding number of completed trade orders.
Column Name | Type |
---|---|
order_id | integer |
user_id | integer |
quantity | integer |
status | string ('Completed', 'Cancelled') |
date | timestamp |
price | decimal (5, 2) |
order_id | user_id | quantity | status | date | price |
---|---|---|---|---|---|
100101 | 111 | 10 | Cancelled | 08/17/2022 12:00:00 | 9.80 |
100102 | 111 | 10 | Completed | 08/17/2022 12:00:00 | 10.00 |
100259 | 148 | 35 | Completed | 08/25/2022 12:00:00 | 5.10 |
100264 | 148 | 40 | Completed | 08/26/2022 12:00:00 | 4.80 |
100305 | 300 | 15 | Completed | 09/05/2022 12:00:00 | 10.00 |
100400 | 178 | 32 | Completed | 09/17/2022 12:00:00 | 12.00 |
100565 | 265 | 2 | Completed | 09/27/2022 12:00:00 | 8.70 |
Column Name | Type |
---|---|
user_id | integer |
city | string |
string | |
signup_date | datetime |
user_id | city | signup_date | |
---|---|---|---|
111 | San Francisco | rrok10@gmail.com | 08/03/2021 12:00:00 |
148 | Boston | sailor9820@gmail.com | 08/20/2021 12:00:00 |
178 | San Francisco | harrypotterfan182@gmail.com | 01/05/2022 12:00:00 |
265 | Denver | shadower_@hotmail.com | 02/26/2022 12:00:00 |
300 | San Francisco | houstoncowboy1122@hotmail.com | 06/30/2022 12:00:00 |
city | total_orders |
---|---|
San Francisco | 3 |
Boston | 2 |
Denver | 1 |
In the given dataset, San Francisco has the highest number of completed trade orders with 3 orders. Boston holds the second position with 2 orders, and Denver ranks third with 1 order.
The dataset you are querying against may have different input & output - this is just an example!
We begin by joining the and tables based on the related column . This is because the 'Completed' order is stored in the table, while the cities are stored in the table.
In the statement, we pull the field from the table and the field from the table.
Output (showing the first 5 rows only):
city | order_id |
---|---|
San Francisco | 100777 |
San Francisco | 100102 |
San Francisco | 100101 |
Boston | 100259 |
Boston | 100264 |
Next, we filter the 'Completed' orders and retrieve the number of orders for each city using the aggregate function. We group the results by the column using the statement.
The GROUP BY statement is commonly employed in conjunction with aggregate functions such as , , , , and to group the results based on non-aggregate columns.
Did you notice that our output is grouped by the column?
city | total_orders |
---|---|
Boston | 1 |
New York | 2 |
San Francisco | 4 |
Finally, to arrange the output in descending order based on the highest number of completed orders, we utilize the clause and limit the results to the top 3 orders using the clause.
city | total_orders |
---|---|
San Francisco | 4 |
Boston | 3 |
Denver | 2 |
Based on the results, San Francisco has the highest number of completed orders with 4 orders. Boston has the second-highest number of completed orders with 3 orders and Denver has the third-highest number of completed orders with 2 orders.