Back to questions
When you log in to your retailer client's database, you notice that their product catalog data is full of gaps in the column. Can you write a SQL query that returns the product catalog with the missing data filled in?
Assumptions
Column Name | Type |
---|---|
product_id | integer |
category | varchar |
name | varchar |
product_id | category | name |
---|---|---|
1 | Shoes | Sperry Boat Shoe |
2 | Adidas Stan Smith | |
3 | Vans Authentic | |
4 | Jeans | Levi 511 |
5 | Wrangler Straight Fit | |
6 | Shirts | Lacoste Classic Polo |
7 | Nautica Linen Shirt |
product_id | category | name |
---|---|---|
1 | Shoes | Sperry Boat Shoe |
2 | Shoes | Adidas Stan Smith |
3 | Shoes | Vans Authentic |
4 | Jeans | Levi 511 |
5 | Jeans | Wrangler Straight Fit |
6 | Shirts | Lacoste Classic Polo |
7 | Shirts | Nautica Linen Shirt |
Shoes will replace all values below the product Sperry Boat Shoe until Jeans appears. Similarly, Jeans will replace s for the product Wrangler Straight Fit, and so on.
The dataset you are querying against may have different input & output - this is just an example!
It is given that all the products belonging to same category are listed together in sequence, and that the first product in a group will have its category defined. The first thing we want to do is label products that are in the same category so that we can tell them apart.
We can use as a window function to assign a number for each type of category. function will compute the number of rows with non-null values in the column.
The goal of this exercise is to understand the category groups so that we know which values should fill in the s.
The query should look like this:
Query output:
product_id | category | name | category_group |
---|---|---|---|
1 | Shoes | Sperry Boat Shoe | 1 |
2 | Adidas Stan Smith | 1 | |
3 | Vans Authentic | 1 | |
4 | Jeans | Levi 511 | 2 |
5 | Wrangler Straight Fit | 2 |
Let's interpret the output.
The first product in the table is the Sperry Boat Shoe. The value for this product will be determined by the function as 1. No category has been assigned to the second or third products. In this case, the counter will not be increased, and will remain at 1 for each of them. The fourth product, Levi 511, is labelled as Jeans, which is a non-null value, hence the counter gets increased for this product.
On the basis of the newly generated column , the next step is to fill the missing category for each product. We can accomplish this by using another window function: FIRST_VALUE.
Let's apply this function to the query we have created using a common table expression(CTE).
Output:
product_id | name | category_group | category |
---|---|---|---|
1 | Sperry Boat Shoe | 1 | Shoes |
2 | Adidas Stan Smith | 1 | Shoes |
3 | Vans Authentic | 1 | Shoes |
4 | Levi 511 | 2 | Jeans |
5 | Wrangler Straight Fit | 2 | Jeans |
The clause will distribute rows by into several partitions, one for 1, another for 2, and so on. The function is applied to each category group separately. For the first partition, it will return Shoes, and for the second partition it will return Jeans because these categories were the first value of each category group.
Let's clean-up our code a little bit to format the result as the task requested.
Bonus: The technical term for this type of task is a fill-down activity.
Full Solution:
PostgreSQL 14