Back to questions
This is the same question as problem #30 in the SQL Chapter of Ace the Data Science Interview!
Assume you are given the following tables on Walmart transactions and products. Find the number of unique product combinations that are bought together (purchased in the same transaction).
For example, if I find two transactions where apples and bananas are bought, and another transaction where bananas and soy milk are bought, my output would be 2 to represent the 2 unique combinations. Your output should be a single number.
Assumption:
Column Name | Type |
---|---|
transaction_id | integer |
product_id | integer |
user_id | integer |
transaction_date | datetime |
transaction_id | product_id | user_id | transaction_date |
---|---|---|---|
231574 | 111 | 234 | 03/01/2022 12:00:00 |
231574 | 444 | 234 | 03/01/2022 12:00:00 |
231574 | 222 | 234 | 03/01/2022 12:00:00 |
137124 | 111 | 125 | 03/05/2022 12:00:00 |
137124 | 444 | 125 | 03/05/2022 12:00:00 |
Column Name | Type |
---|---|
product_id | integer |
product_name | string |
product_id | product_name |
---|---|
111 | apple |
222 | soy milk |
333 | instant oatmeal |
444 | banana |
555 | chia seed |
combo_num |
---|
4 |
There are 4 unique purchase combinations present in the example data.
The dataset you are querying against may have different input & output - this is just an example!
Goal: Find the number of unique combinations purchased in a single transaction.
Let's start small by joining the and tables together based on the related field so that we can see all the product names clearly:
Showing the first 5 rows of output:
product_id | transaction_id | product_name |
---|---|---|
111 | 231574 | apple |
444 | 231574 | banana |
222 | 231574 | soy milk |
444 | 137124 | banana |
222 | 256234 | soy milk |
We'll create a CTE containing the above query and perform a to fetch products that were purchased together by a single user by joining on .
Note that we want all pairs of the products, but we do not want to overcount the same product. For example, if user A purchased an apple and instant oatmeal in the same transaction, then we only want to count the (apple, instant oatmeal) transaction once, and not also (instant oatmeal, apple).
To avoid double-counting, we can use a condition within the that the of A is less than that of of B.
Note that we used here, but should yield the same result!
Do you see how this technique solves the overcounting problem? To visualize it more clearly, instead of the in the main and you'll see that there are no overcounting issues.
Solution:
PostgreSQL 14