At Baidu, SQL is used day-to-day to analyze user behavior patterns to fine-tune search algorithms, and to manage the large datasets that power Baidu's advertising unit. Unsurprisingly, this is why Baidu LOVES to ask SQL query questions in interviews for Data Analytics, Data Science, and Data Engineering jobs.
So, to help you practice for the Baidu SQL interview, this blog covers 9 Baidu SQL interview questions – can you solve them?
Baidu, being a web services company, greatly values its users. Power users, or those who frequently engage in the company's various activities, are of particular importance. For this assessment, assume that a "power user" is defined as a user who performs a significant number of searches monthly.
You are provided with the following table:
Please write a SQL query to find the top 5 users who performed the highest number of searches in the past month, returning both their `user_id` and the number of searches they performed.
search_id | user_id | search_date |
---|---|---|
1001 | 2345 | 2023-12-10 |
1002 | 9876 | 2023-12-15 |
1003 | 2345 | 2023-12-16 |
1004 | 1111 | 2022-06-01 |
1005 | 9876 | 2023-12-17 |
1006 | 2345 | 2023-12-18 |
1007 | 1111 | 2023-12-19 |
1008 | 1111 | 2023-12-20 |
(Note: this is a simplified table; other columns that are irrelevant to this question have been omitted.)
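A sketch of such a query in PostgreSQL (the table name `searches` is an assumption; adjust it to match the actual schema):

```sql
SELECT
  user_id,
  COUNT(DISTINCT search_id) AS number_of_searches
FROM searches
WHERE search_date >= CURRENT_DATE - INTERVAL '1 month'
GROUP BY user_id
ORDER BY number_of_searches DESC
LIMIT 5;
```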
This query calculates the number of searches each user performed in the past month by counting the unique `search_id`s for each `user_id`. The result is ordered by that count in descending order and limited to 5 rows, so the output is the top 5 users along with the number of searches each performed in the last month.
To solve a related customer analytics question on DataLemur's free online SQL coding environment, try this Microsoft Azure Cloud SQL Interview Question:
As a data analyst at Baidu, you have been tasked with performing analysis on users' search history data. You have been asked to find out the most popular search terms per user for every month.
The database has a table with the following structure:
search_id | user_id | search_date | search_term |
---|---|---|---|
101 | 123 | 04/02/2022 | machine learning |
102 | 234 | 04/03/2022 | AI |
103 | 123 | 04/04/2022 | machine learning |
104 | 234 | 04/05/2022 | big data |
105 | 456 | 04/06/2022 | blockchain |
106 | 123 | 04/07/2022 | machine learning |
107 | 123 | 05/02/2022 | data science |
108 | 234 | 05/03/2022 | AI |
109 | 123 | 05/04/2022 | data science |
In the table, `search_id` is a unique identifier of the search, `user_id` identifies the user who performed it, `search_date` is the date the search was performed, and `search_term` is the term that was searched.
Write a SQL query to find out the most popular search terms per user for each month. If there are multiple search terms with the same frequency, choose the most recent one.
month | user_id | popular_search_term |
---|---|---|
4 | 123 | machine learning |
4 | 234 | AI |
5 | 123 | data science |
The following PostgreSQL query can solve this problem:
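One possible version (a sketch, again assuming the table is named `searches`):

```sql
WITH term_counts AS (
  -- frequency and most recent date of each term, per user per month
  SELECT
    EXTRACT(MONTH FROM search_date) AS month,
    user_id,
    search_term,
    COUNT(*) AS frequency,
    MAX(search_date) AS last_searched
  FROM searches
  GROUP BY 1, 2, 3
)
SELECT
  month,
  user_id,
  search_term AS popular_search_term
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY month, user_id
      ORDER BY frequency DESC, last_searched DESC  -- recency breaks ties
    ) AS rn
  FROM term_counts
) ranked
WHERE rn = 1;
```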
This query first calculates the frequency of each search term per user per month in a CTE, along with the date each term was last searched. It then selects the most popular (most frequent) search term per user per month. When multiple search terms tie on frequency, the most recent one is selected because the last search date is the tie-breaker in the window function's `ORDER BY` clause.
For more window function practice, try this Uber SQL problem on DataLemur's online SQL coding environment:
To explain the difference between a primary key and foreign key, let's start with some data from Baidu's marketing analytics database which stores data from Google Ads campaigns:
The `ads` table (the name is illustrative):

ad_id | campaign_id | keyword | click_count |
---|---|---|---|
1 | 100 | Baidu pricing | 10 |
2 | 100 | Baidu reviews | 15 |
3 | 101 | Baidu alternatives | 7 |
4 | 101 | buy Baidu | 12 |
In this table, `ad_id` could be a primary key. It is unique for each row in the table and cannot contain null values.
`campaign_id` could be a foreign key. It references the `campaign_id` of the Google Ads campaign that each ad belongs to, establishing a relationship between the ads and their campaigns. This foreign key allows you to easily query the table to find out which ads belong to a specific campaign, or which campaign a specific ad belongs to.
It is also possible for a table to have multiple foreign keys that reference primary keys in different tables. For example, the ads table could have an additional `ad_group_id` foreign key for the ad group that each ad belongs to, and an `account_id` for the Google Ads account that the campaigns belong to.
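As a minimal DDL sketch (all table and column names here are illustrative):

```sql
CREATE TABLE campaigns (
  campaign_id   INT PRIMARY KEY,  -- primary key: unique and non-null
  campaign_name TEXT
);

CREATE TABLE ads (
  ad_id       INT PRIMARY KEY,
  -- foreign key linking each ad to its campaign
  campaign_id INT REFERENCES campaigns (campaign_id),
  keyword     TEXT,
  click_count INT
);
```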
As a data analyst at Baidu, one of your tasks is to understand the search behavior of different groups of users. The system has gathered user information, including their registration dates and regions. Additionally, the system logs all search queries made by these users along with a timestamp. The aim is to find the number of unique users in every region who have made at least one search query each month.
Baidu's user information can be stored in a `users` table and the search activity in a `searches` table:
user_id | registration_date | region |
---|---|---|
4521 | 2019-11-27 | East Asia |
3857 | 2020-06-12 | South Asia |
2953 | 2019-12-15 | Europe |
4189 | 2021-05-22 | North America |
1234 | 2020-02-18 | South America |
user_id | search_date | query |
---|---|---|
4521 | 2022-06-10 | Python |
3857 | 2022-07-01 | Big Data |
3857 | 2022-07-05 | Machine Learning |
2953 | 2022-06-30 | Artificial Intelligence |
1234 | 2022-07-11 | Data Science |
4189 | 2022-08-05 | Java |
4521 | 2022-08-12 | JavaScript |
You can use the following SQL query in PostgreSQL to solve the problem:
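A sketch (assuming the tables are named `users` and `searches` as above):

```sql
SELECT
  DATE_TRUNC('month', s.search_date)::DATE AS month,
  u.region,
  COUNT(DISTINCT s.user_id) AS active_users
FROM searches s
JOIN users u
  ON s.user_id = u.user_id
GROUP BY 1, u.region
ORDER BY 1, u.region;
```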
This query joins the `searches` and `users` tables on `user_id`, then groups by month and region. Within each group, it counts the distinct `user_id`s to determine the monthly active users per region. The `DATE_TRUNC()` function truncates each search date to its month, and the cast to `DATE` converts the resulting timestamp (with time and possibly timezone) to a simple date.
`INTERSECT` combines the results from two or more SELECT statements, but returns only the rows that appear in both result sets.

For example, let's use `INTERSECT` to find all of Baidu's Facebook video ads with more than 10k views that are also being run on YouTube:
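A sketch (the table names `facebook_video_ads` and `youtube_video_ads` and their columns are assumptions for illustration):

```sql
SELECT ad_id
FROM facebook_video_ads
WHERE views > 10000

INTERSECT

SELECT ad_id
FROM youtube_video_ads;
```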
Baidu is a Chinese multinational technology company specializing in Internet-related services and products, and artificial intelligence. One valuable metric for them might be the average number of clicked ads per day. The following tables represent simplified versions of Baidu's `users`, `ads`, and `clicks` tables:
user_id | signup_date | region |
---|---|---|
123 | 01/08/2022 | South China |
265 | 11/12/2021 | North China |
362 | 25/07/2022 | East China |
192 | 10/08/2021 | West China |
981 | 15/01/2022 | Central China |
ad_id | company_id | post_date |
---|---|---|
1001 | 3000 | 05/08/2022 |
2100 | 4500 | 15/11/2021 |
1200 | 5500 | 06/10/2022 |
8900 | 6500 | 18/08/2022 |
6000 | 7500 | 19/07/2022 |
click_id | click_date | user_id | ad_id |
---|---|---|---|
5253 | 26/08/2022 | 123 | 2100 |
7202 | 28/08/2022 | 192 | 8900 |
3946 | 30/08/2022 | 362 | 6000 |
3245 | 28/08/2022 | 265 | 1001 |
9898 | 29/08/2022 | 981 | 2100 |
Using these tables, write a SQL query to find the average number of ads clicked per day for all users.
day | avg_clicks |
---|---|
26/08/2022 | 1.00 |
28/08/2022 | 2.00 |
29/08/2022 | 1.00 |
30/08/2022 | 1.00 |
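A sketch that reproduces the expected output (the table name `clicks` follows the column headers above; note that in this sample data each day's figure is simply that day's click total):

```sql
SELECT
  click_date AS day,
  ROUND(COUNT(click_id)::NUMERIC, 2) AS avg_clicks
FROM clicks
GROUP BY click_date
ORDER BY click_date;
```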
This query groups the clicks by `click_date`, counting the number of clicks each day, and orders the resulting output by click date. With this sample data, each day's average equals that day's total click count.
To practice a very similar question try this interactive Facebook App Click-through Rate (CTR) Question which is similar for calculating click rates or this Facebook Active User Retention Question which is similar for handling user data.
Database views are virtual tables based on the result of a SQL statement. You query them just like regular tables, but views let you present simplified versions of tables or hide sensitive data from certain users.

In PostgreSQL, you can create a view with the `CREATE VIEW` command. Here's an example:
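A sketch using a hypothetical `users` table, exposing only its non-sensitive columns:

```sql
CREATE VIEW public_users AS
SELECT
  user_id,
  region,
  signup_date
FROM users;
```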
You are a data analyst at Baidu, a popular search engine company in China. The development team needs to check if some search categories are slowing down. Your task is to find the average duration of search queries for each search category over the last month.
Please consider the following "searches" table for this analysis:
search_id | user_id | search_date | search_category | search_duration |
---|---|---|---|---|
101 | 456 | 08/10/2022 10:15:00 | Tech | 0.15 |
102 | 789 | 08/12/2022 11:25:00 | Fashion | 0.18 |
103 | 456 | 08/20/2022 13:33:00 | Travel | 0.2 |
104 | 321 | 08/21/2022 16:00:00 | Tech | 0.12 |
105 | 123 | 08/23/2022 17:45:00 | Fashion | 0.16 |
Your task is to produce a report of the average search duration by category for the last month. The output should include the month (in YYYY-MM format), the search category, and the average search duration rounded to two decimal places.
mth | category | avg_duration |
---|---|---|
2022-08 | Tech | 0.14 |
2022-08 | Fashion | 0.17 |
2022-08 | Travel | 0.20 |
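A PostgreSQL sketch (using the `searches` table named above):

```sql
SELECT
  TO_CHAR(search_date, 'YYYY-MM') AS mth,
  search_category AS category,
  ROUND(AVG(search_duration)::NUMERIC, 2) AS avg_duration
FROM searches
-- previous full calendar month relative to today
WHERE search_date >= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 month'
  AND search_date <  DATE_TRUNC('month', CURRENT_DATE)
GROUP BY 1, 2;
```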
This SQL query computes the average search duration for each search category, rounded to two decimal places. It covers the previous calendar month relative to the current date: for instance, if the current date is September 15, it considers the period from August 1 through August 31. The results are grouped by month and search category.
As a data analyst at Baidu, you've been given a dataset of user website visit logs. The user visits are marked with a timestamp. Your task is to generate a report that shows the number of daily unique visitors for the past week.
The `visit_logs` table has the following structure:
log_id | user_id | visit_time |
---|---|---|
1 | 123 | 2022-07-29 08:30:00 |
2 | 123 | 2022-07-29 10:20:00 |
3 | 265 | 2022-07-28 15:42:00 |
4 | 362 | 2022-07-28 22:12:00 |
5 | 192 | 2022-07-27 16:45:00 |
6 | 192 | 2022-07-27 18:55:00 |
7 | 981 | 2022-07-26 13:22:00 |
8 | 981 | 2022-07-31 08:44:00 |
9 | 525 | 2022-07-25 05:40:00 |
10 | 525 | 2022-07-25 06:10:00 |
11 | 264 | 2022-07-24 14:42:00 |
12 | 123 | 2022-07-24 15:22:00 |
We would like to produce the following result:
visit_date | unique_visitors |
---|---|
2022-07-24 | 2 |
2022-07-25 | 1 |
2022-07-26 | 1 |
2022-07-27 | 1 |
2022-07-28 | 2 |
2022-07-29 | 1 |
2022-07-30 | 0 |
2022-07-31 | 1 |
A PostgreSQL query that can be used to answer this question is:
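One approach (a sketch; the table name `visit_logs` is an assumption). Generating a full calendar of the past week and LEFT JOINing the logs onto it ensures that days with no visits still appear with a count of 0:

```sql
SELECT
  d::DATE AS visit_date,
  COUNT(DISTINCT v.user_id) AS unique_visitors
FROM GENERATE_SERIES(
       CURRENT_DATE - INTERVAL '7 days',
       CURRENT_DATE,
       INTERVAL '1 day'
     ) AS d
LEFT JOIN visit_logs v
  ON v.visit_time::DATE = d::DATE
GROUP BY d
ORDER BY visit_date;
```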
This query groups the logs by the date of visit (ignoring the time part) for the past week and counts the distinct user IDs on each date, giving the number of unique visitors per day. Note that days with no visits, such as 2022-07-30 in the expected output, should still appear with a count of 0, which requires joining the logs onto a full calendar of dates. The result is ordered by visit date in ascending order.
The best way to prepare for a Baidu SQL interview is to practice, practice, practice. In addition to solving the above Baidu SQL interview questions, you should also solve the 200+ tricky SQL questions on DataLemur, which come from companies like FAANG and tech startups.
Each DataLemur SQL question has multiple hints, step-by-step solutions, and most importantly, an interactive coding environment so you can code up your SQL query right in the browser and have it graded.
To prep for the Baidu SQL interview it is also wise to solve SQL problems from other tech companies like:
If your SQL coding skills are weak, don't jump straight into solving questions – first learn SQL with this SQL tutorial for Data Scientists & Analysts.
This tutorial covers topics including math functions like ROUND()/CEIL() and logical operators like AND/OR/NOT – both of which pop up often in SQL interviews at Baidu.
Besides SQL interview questions, the other question categories covered in the Baidu Data Science Interview are:
To prepare for Baidu Data Science interviews read the book Ace the Data Science Interview because it's got: