Back to questions
You are trying to find the median number of items bought per order on Alibaba, rounded to 1 decimal point.
However, instead of doing analytics on all Alibaba orders, you have access to a summary table, which describes how many items were in an order, and the number of orders that had that many items.
Column Name | Type |
---|---|
item_count | integer |
order_occurrences | integer |
item_count | order_occurrences |
---|---|
1 | 500 |
2 | 1000 |
3 | 800 |
4 | 1000 |
median |
---|
3.0 |
The total orders in the field in this dataset is 3300, meaning that the median item count would be for the 1650th order (3300 / 2 = 1650).
If we compare this to the running sum of , we can see that the median item count is 3.
The dataset you are querying against may have different input & output - this is just an example!
Finding the median shouldn't be a difficult task. However, given the way the dataset is structured we have to perform some data prepping.
Start by forming a common table expression (CTE) or subquery that will expand each by the number of its occurrences.
Using the function, we can pass in 1 as the start value and the value as the stop value. For example, this will give us a set of numbers [1, 1, 1, ..., 1, 2, 2, 2, …] etc. See this link for more information on the function.
Since PostgreSQL does not provide a straightforward function to find the median value, we use the function. This function takes the given percentile as an argument, in this case, it is 0.5, which is the 50th percentile which is also the median. The clause creates an ordered subset of search values returned by the CTE or subquery mentioned above that can be used to perform aggregations.
To return the output in decimal type, it's necessary to convert the values returned by the from an integer type to decimal type using .
Finally, to ensure that the output is rounded to one decimal place, use the function.
In addition, we propose another solution which provides more optimised query performance for very large datasets (i.e. hundreds of millions of rows).
This is because Solution #1 essentially disaggregates the dataset while this solution uses the aggregated data.
We have to find the median which is the centre point of an ordered dataset. In other words, the median equals the total frequency of a dataset divided into 2 equal parts: or .
Therefore, we could find the median by comparing a given value against the total sum of orders. If they are equal, we arrived at the median value.
In our case, however, we do not have individual values but ranges of values. This is why in the clause,