At DiDi (aka the Chinese version of Uber), SQL is used frequently to analyze traffic data for route optimization and customer data for personalized marketing campaigns. Because of this, DiDi asks SQL problems in interviews for Data Analyst, Data Science, and BI jobs.
As such, to help you prepare for the DiDi SQL interview, this blog covers 11 DiDi SQL interview questions – can you answer them all?
The ride-hailing company DiDi has a database which tracks rides taken by its users. Your task is to identify the "power users". For this exercise, a "power user" is defined as a user who has completed over 20 rides in a month. Write a SQL query to identify these users and calculate the total kilometres and average cost of their rides for every month in 2022.
Your database has two tables: users and rides. The users table has a one-to-many relationship with the rides table (each user can have multiple rides, but each ride belongs to only one user).
Sample data for the users table:

Sample data for the rides table:
This problem can be solved by using a combination of aggregation, filtering, and joining operations on the users and rides tables. Here's one possible way to write the SQL query in PostgreSQL.
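This sketch assumes the rides table carries user_id, ride_date, distance_km, and cost columns (the column names are assumptions, since the sample data isn't reproduced above):

```sql
SELECT
  user_id,
  DATE_TRUNC('month', ride_date) AS ride_month,
  COUNT(*) AS rides_taken,
  SUM(distance_km) AS total_km,               -- total kilometres per user-month
  AVG(cost) AS avg_cost                       -- average ride cost per user-month
FROM rides
WHERE EXTRACT(YEAR FROM ride_date) = 2022     -- only rides from 2022
GROUP BY user_id, DATE_TRUNC('month', ride_date)
HAVING COUNT(*) > 20                          -- "power users": over 20 rides in a month
ORDER BY user_id, ride_month;
```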
This query first filters the rides table to include only rides that took place in 2022. It then groups the results by user and month; for each user-month pair, it calculates the number of rides taken, the total kilometres traveled, and the average ride cost. Finally, it keeps only the groups where more than 20 rides were taken, and orders the results by user and month.
To practice a similar customer analytics SQL question where you can solve it interactively and have your SQL query automatically checked, try this Walmart SQL Interview Question:
Consider a ride-sharing company like DiDi where you have a record of completed trips data. Your task is to write a SQL query that calculates the running seven days average of ride durations for each rider_id.
The company stores data on trip durations in the table, which has the following columns:
| trip_id | rider_id | trip_date | duration_minutes |
|---|---|---|---|
| 1 | 001 | 2022-07-01 | 15 |
| 2 | 002 | 2022-07-01 | 20 |
| 3 | 001 | 2022-07-02 | 30 |
| 4 | 002 | 2022-07-02 | 25 |
| 5 | 001 | 2022-07-03 | 20 |
| 6 | 002 | 2022-07-03 | 30 |
| 7 | 001 | 2022-07-04 | 25 |
The output columns should include rider_id, trip_date, and 7_days_avg_duration.
| rider_id | trip_date | 7_days_avg_duration |
|---|---|---|
| 001 | 2022-07-01 | 15.00 |
| 002 | 2022-07-01 | 20.00 |
| 001 | 2022-07-02 | 22.50 |
| 002 | 2022-07-02 | 22.50 |
| 001 | 2022-07-03 | 21.67 |
| 002 | 2022-07-03 | 25.00 |
| 001 | 2022-07-04 | 22.50 |
You should use the AVG() window function, with the window frame defined to cover the trailing seven days (including the current row) using ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. This gives the running seven-day average.
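A sketch in PostgreSQL, assuming the trip data lives in a table named trips (the table name isn't given, so it's an assumption here):

```sql
SELECT
  rider_id,
  trip_date,
  ROUND(AVG(duration_minutes) OVER (
    PARTITION BY rider_id                       -- one running average per rider
    ORDER BY trip_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW    -- current trip plus up to 6 prior trips
  ), 2) AS "7_days_avg_duration"
FROM trips
ORDER BY trip_date, rider_id;
```

Note that a ROWS frame counts rows rather than calendar days, so it equals a trailing seven days only if each rider takes at most one trip per day, as in the sample data.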
This query first partitions the data by rider_id and then sorts it by trip_date within each partition. It evaluates the average for each row using a window frame that includes the current row and the six preceding rows. For the first six rows of each rider_id, the average is computed over only the rows available so far.
To practice a related window function SQL problem on DataLemur's free interactive SQL code editor, try this Google SQL Interview Question:
One way to find duplicate records in a table is to GROUP BY the columns that should be unique, and then use HAVING to see which groups have more than one occurrence:
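For example, against a hypothetical rides table (the table and columns here are just for illustration):

```sql
SELECT user_id, ride_date, COUNT(*) AS occurrences
FROM rides
GROUP BY user_id, ride_date     -- the columns that should be unique together
HAVING COUNT(*) > 1;            -- groups appearing more than once are duplicates
```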
Another way is to use the ROW_NUMBER() window function:
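A sketch against the same hypothetical rides table:

```sql
SELECT *
FROM (
  SELECT
    ride_id,
    user_id,
    ride_date,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, ride_date   -- the duplicate-defining columns
      ORDER BY ride_id
    ) AS row_num
  FROM rides
) numbered
WHERE row_num > 1;   -- every row after the first in its group is a duplicate
```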
DiDi, a ride-hailing service, has launched a new promotional campaign. They want to measure the campaign's performance by analyzing the Click-through and Conversion rates. A user first views a promotional ad (Click-through) and then books a ride (Conversion). Here are the sample datasets.
Click-through data:

| ad_id | user_id | click_date |
|---|---|---|
| 101 | 123 | 06/08/2022 |
| 102 | 265 | 06/10/2022 |
| 101 | 362 | 06/18/2022 |
| 103 | 192 | 07/26/2022 |
| 102 | 981 | 07/05/2022 |
Booking (conversion) data:

| booking_id | user_id | ad_id | booking_date |
|---|---|---|---|
| 8001 | 123 | 101 | 06/09/2022 |
| 8050 | 265 | 102 | 06/11/2022 |
| 8502 | 362 | 101 | 06/19/2022 |
| 8259 | 192 | null | 07/26/2022 |
| 8901 | 981 | 102 | 07/06/2022 |
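A sketch of the query described below, assuming the tables are named ad_clicks and ride_bookings (hypothetical names inferred from the sample data):

```sql
SELECT
  a.ad_id,
  ROUND(
    COUNT(DISTINCT b.user_id)::DECIMAL          -- distinct users who booked via this ad
      / NULLIF(COUNT(DISTINCT a.user_id), 0),   -- distinct users who clicked it; NULLIF avoids /0
    2
  ) AS conversion_rate
FROM ad_clicks a
LEFT JOIN ride_bookings b
  ON a.user_id = b.user_id
 AND a.ad_id = b.ad_id
GROUP BY a.ad_id;
```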
This query performs a left join of the clicks table (aliased a) and the bookings table (aliased b) on both the user_id and ad_id columns. Then, for each ad_id, it calculates the conversion rate as the ratio of distinct users who booked a ride to distinct users who clicked the ad. If no users clicked the ad (the denominator is 0), NULLIF prevents division by zero and the ratio comes back as NULL.
To practice a similar problem about calculating rates, try this TikTok SQL question within DataLemur's online SQL code editor:
If you want to return records with no duplicates, you can use the DISTINCT keyword in your SELECT statement.
For example, if you had a table of DiDi employees and wanted to see all the unique job titles currently held at the company, you could write the following query:
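Assuming the table is called didi_employees (a hypothetical name):

```sql
SELECT DISTINCT job_title
FROM didi_employees;
```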
If didi_employees had the following data:
| f_name | job_title |
|---|---|
| Akash | Data Analyst |
| Brittany | Data Scientist |
| Carlos | Data Engineer |
| Diego | Data Engineer |
| Eva | Data Analyst |
Then the output from the query would be:
| job_title |
|---|
| Data Analyst |
| Data Scientist |
| Data Engineer |
Imagine you work at DiDi, the company that provides ride-hailing services. You are asked to analyze the ride durations for each car type to understand the product usage and customer experience. Given the rides data, how would you find out what is the average ride duration for each car type?
Here's an example of how the rides table is structured:
| ride_id | user_id | ride_date | car_type | duration_minutes |
|---|---|---|---|---|
| 2001 | 123 | 06/08/2022 | Luxury | 30 |
| 2002 | 265 | 06/10/2022 | Standard | 45 |
| 2003 | 362 | 06/18/2022 | Luxury | 60 |
| 2004 | 192 | 07/26/2022 | Standard | 20 |
| 2005 | 981 | 07/05/2022 | Luxury | 45 |
Write a SQL query that will return the average ride duration for each car type.
Expected output:

| car_type | avg_duration_minutes |
|---|---|
| Luxury | 45.00 |
| Standard | 32.50 |
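One way to write it in PostgreSQL (assuming the rides table above):

```sql
SELECT
  car_type,
  ROUND(AVG(duration_minutes), 2) AS avg_duration_minutes
FROM rides
GROUP BY car_type;
```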
The SQL query uses the GROUP BY statement coupled with the AVG() function to find the average duration of rides for each car type. It pulls data from the duration_minutes column, groups the rows by the values in the car_type column, and then calculates the average duration for each group.
Normalization involves breaking up your tables into smaller, more specialized ones and using primary and foreign keys to define relationships between them. Not only does this make your database more flexible and scalable, it also makes it easier to maintain. Plus, normalization helps to keep your data accurate by reducing the chance of inconsistencies and errors.
The only downside is that your queries will now involve more joins, which are slow AF and often a DB performance bottleneck.
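A toy sketch of what normalization looks like in practice (all names here are made up for illustration):

```sql
-- Instead of one wide table that repeats user details on every ride,
-- split it into two tables linked by a foreign key.
CREATE TABLE users (
  user_id INTEGER PRIMARY KEY,
  name    VARCHAR(100) NOT NULL
);

CREATE TABLE rides (
  ride_id   INTEGER PRIMARY KEY,
  user_id   INTEGER REFERENCES users (user_id),  -- foreign key back to users
  ride_date DATE
);
```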
DiDi is a major player in the ride-hailing market. They have a huge customer database consisting of customer names, email addresses and their ride details. There is a requirement to find the details of customers whose email addresses belong to a particular domain, say 'gmail.com'.
We need to filter the records of DiDi's customers present in their "customers" database that have 'gmail.com' in their email addresses.
| customer_id | name | email | ride_count | registration_date |
|---|---|---|---|---|
| 101 | John Doe | john.doe@gmail.com | 15 | 01/10/2019 |
| 102 | Jane Smith | jane.smith@yahoo.com | 22 | 15/04/2017 |
| 103 | Alicia Brown | aliciabrown@gmail.com | 30 | 20/11/2018 |
| 104 | Michael Black | michaelblack@hotmail.com | 45 | 01/01/2020 |
| 105 | Samuel Green | samuel.green@yahoo.com | 35 | 05/06/2018 |
| 106 | Anna White | annawhite@gmail.com | 18 | 01/12/2019 |
Expected output:

| customer_id | name | email | ride_count | registration_date |
|---|---|---|---|---|
| 101 | John Doe | john.doe@gmail.com | 15 | 01/10/2019 |
| 103 | Alicia Brown | aliciabrown@gmail.com | 30 | 20/11/2018 |
| 106 | Anna White | annawhite@gmail.com | 18 | 01/12/2019 |
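The filter itself is a single LIKE predicate; a sketch, assuming the customers table has an email column:

```sql
SELECT *
FROM customers
WHERE email LIKE '%gmail.com%';   -- % matches any sequence of characters
```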
In this PostgreSQL query, we use the LIKE keyword to filter rows from the customers table. The % wildcards before and after 'gmail.com' match any sequence of characters, so the query finds every value in the email column that contains 'gmail.com'. The output is a table listing the details of all customers whose email addresses contain 'gmail.com'.
As a data analyst at DiDi, you are tasked to analyze the customer ride data. You need to find out which payment method is most popular among customers for each ride type. Write a SQL query to solve this problem.
You are provided with two tables:
rides table:
| ride_id | customer_id | ride_type | start_time |
|---|---|---|---|
| 1 | 2 | Economy | 2021-05-09 08:00:00 |
| 2 | 3 | Luxury | 2021-05-09 09:00:00 |
| 3 | 2 | Economy | 2021-05-10 10:00:00 |
| 4 | 1 | Luxury | 2021-05-11 11:00:00 |
payments table:
| payment_id | ride_id | payment_type | payment_amount |
|---|---|---|---|
| 1 | 1 | Credit Card | 30.00 |
| 2 | 2 |  | 150.00 |
| 3 | 4 | AliPay | 200.00 |
| 4 | 3 |  | 35.00 |
In PostgreSQL, you could use a subquery to first count how many times each payment type was used per ride type. The RANK() window function can then be used to select the most popular payment method for each ride type.
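A sketch, using the rides and payments tables above:

```sql
SELECT ride_type, payment_type, payment_count
FROM (
  SELECT
    r.ride_type,
    p.payment_type,
    COUNT(*) AS payment_count,
    RANK() OVER (
      PARTITION BY r.ride_type
      ORDER BY COUNT(*) DESC       -- most-used payment method ranks first
    ) AS payment_rank
  FROM rides r
  JOIN payments p ON r.ride_id = p.ride_id
  GROUP BY r.ride_type, p.payment_type
) ranked
WHERE payment_rank = 1;            -- RANK() keeps ties, so co-leaders all appear
```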
This query will return the most popular payment method for each type of ride based on the number of times each payment method was used for each ride type. If more than one payment method shares the top spot for a certain ride type, all will be included in the output.
Since joins come up frequently during SQL interviews, try this interactive Snapchat JOIN SQL interview question:
SQL constraints are used to specify rules for the data in a table. They can be applied to single or multiple fields in a table when the table is created, or after the table has been created using the ALTER TABLE command.
For example, say you had a database that stores ad campaign data from DiDi's Google Analytics account.
Here's what some constraints could look like:
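(The tables and columns below are hypothetical, for illustration only.)

```sql
CREATE TABLE employees (
  id INTEGER PRIMARY KEY
);

CREATE TABLE ad_campaigns (
  campaign_id   INTEGER PRIMARY KEY,                -- uniquely identifies each campaign
  campaign_name VARCHAR(100) NOT NULL,              -- must always have a value
  budget        DECIMAL(10, 2) CHECK (budget > 0),  -- must be a positive amount
  channel       VARCHAR(50) DEFAULT 'Google',       -- falls back to 'Google' if omitted
  manager_id    INTEGER REFERENCES employees (id)   -- foreign key to an existing employee
);
```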
Based on the trip data from the DiDi app, calculate the average trip distance from each starting location, round the result to 2 decimal places, and find the absolute difference between the maximum and minimum average trip distance. Also, find the square root of the total number of trips made from each starting location. Assume that the starting location is stored as a string.
| trip_id | start_location | end_location | trip_distance (km) |
|---|---|---|---|
| 1 | 'New York' | 'Los Angeles' | 500 |
| 2 | 'New York' | 'San Francisco' | 480 |
| 3 | 'San Francisco' | 'Los Angeles' | 385 |
| 4 | 'San Francisco' | 'New York' | 520 |
| 5 | 'Los Angeles' | 'New York' | 515 |
Expected output:

| start_location | average_distance | distance_difference | sqrt_of_total_trips |
|---|---|---|---|
| 'New York' | 490.00 | 35.00 | 14.14 |
| 'San Francisco' | 452.50 | 67.50 | 14.14 |
| 'Los Angeles' | 515.00 | 0.00 | 1.00 |
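A sketch following the description below, assuming a trips table with a trip_distance column (the names are assumptions; note that, as written, the description yields a single global distance_difference):

```sql
WITH location_stats AS (
  SELECT
    start_location,
    ROUND(AVG(trip_distance), 2) AS average_distance,
    COUNT(*) AS total_trips
  FROM trips
  GROUP BY start_location
),
distance_range AS (
  SELECT
    ABS(MAX(average_distance) - MIN(average_distance)) AS distance_difference
  FROM location_stats
)
SELECT
  ls.start_location,
  ls.average_distance,
  dr.distance_difference,
  ROUND(SQRT(ls.total_trips)::NUMERIC, 2) AS sqrt_of_total_trips
FROM location_stats ls
CROSS JOIN distance_range dr;
```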
In the provided solution, a common table expression (CTE) calculates the average trip distance and the total number of trips, grouped by start_location. Another CTE calculates the absolute difference between the maximum and minimum average distances across starting locations.
Finally, the necessary information is selected from these CTEs and the square root of the total number of trips is also calculated.
To practice a very similar question try this interactive Amazon Highest-Grossing Items Question which is similar for requiring calculations with aggregate functions or this Alibaba Compressed Mean Question which is similar for providing statistics and rounding values to a specific decimal place.
The key to acing a DiDi SQL interview is to practice, practice, and then practice some more! Beyond just solving the earlier DiDi SQL interview questions, you should also solve the 200+ DataLemur SQL Interview Questions which come from companies like Microsoft, Google, and Facebook.
Each interview question has multiple hints and full answers, and most importantly, there's an interactive coding environment so you can code up your SQL query right in the browser and have it graded.
To prep for the DiDi SQL interview it is also helpful to practice SQL problems from other tech companies like:
However, if your SQL coding skills are weak, forget about jumping right into solving questions – improve your SQL foundations with this SQL interview tutorial.
This tutorial covers topics including window functions and removing NULLs – both of these pop up often during DiDi SQL assessments.
In addition to SQL interview questions, the other question categories covered in the DiDi Data Science Interview are:
The best way to prepare for DiDi Data Science interviews is by reading Ace the Data Science Interview. The book's got: