Back to questions
As a data analyst at Uber, it's your job to report the latest metrics for specific groups of Uber users. Some riders create their Uber account the same day they book their first ride; the rider engagement team calls them "in-the-moment" users.
Uber wants to know the average delay between the day of user sign-up and the day of their 2nd ride. Write a query to pull the average 2nd ride delay for "in-the-moment" Uber users. Round the answer to 2-decimal places.
Tip:
| Column Name | Type |
|---|---|
| user_id | integer |
| registration_date | date |
| user_id | registration_date |
|---|---|
| 1 | 08/15/2022 |
| 2 | 08/21/2022 |
| Column Name | Type |
|---|---|
| ride_id | integer |
| user_id | integer |
| ride_date | date |
| ride_id | user_id | ride_date |
|---|---|---|
| 1 | 1 | 08/15/2022 |
| 2 | 1 | 08/16/2022 |
| 3 | 2 | 09/20/2022 |
| 4 | 2 | 09/23/2022 |
| average_delay |
|---|
| 1 |
The dataset you are querying against may have different input & output - this is just an example!
Step 1 Before we proceed with the calculations, let's identify users who can be classified as "in-the-moment" users. These are users who create their accounts on the same day as their first ride. We can join the table with on the common identifier and pull results where the registration date and ride date are equal.
As there may have users with more than 1 ride on the first date, simply use to obtain the unique users. In addition, wrap the query into a CTE, so we can reuse the output later.
The query will look something like this:
Step 2 The question asks for the average delay between the first and second rides. That means we need to identify a trip number for each user's rides. To do so, let's create a new column with that enumerates trips per user.
Now, we need to set this up so we can compare each ride date to the previous one using row-wise subtraction. To accomplish this, we will use the function to pull the date of the previous trip in a new column.
You can create a lagged ride date column by using the function on the ride date column while partitioning by users:
Query output:
| user_id | ride_date | ride_nr | lag_ride_date |
|---|---|---|---|
| 1 | 08/15/2022 00:00:00 | 1 | |
| 1 | 08/16/2022 00:00:00 | 2 | 08/15/2022 00:00:00 |
| 3 | 07/11/2022 00:00:00 | 1 | |
| 4 | 05/16/2022 00:00:00 | 1 | |
| 4 | 05/17/2022 00:00:00 | 2 | 05/16/2022 00:00:00 |
| 4 | 05/18/2022 00:00:00 | 3 | 05/17/2022 00:00:00 |
Step 3 We're almost there – all that's left to do is the row-wise subtraction and the average calculation!
Simply subtract the from to obtain the delay for each ride. Then wrap this subtraction in an function to obtain the average of all ride delays.
Keep in mind that the task asked for the 2nd ride delay, so make sure to filter on results where the is 2. Also, don't forget to round the result to 2 decimals ;)