At Chindata, SQL is crucial for extracting and manipulating customer data for insights and designing databases to house server-related information. That's why Chindata frequently asks SQL query questions during interviews for Data Analyst, Data Science, and BI jobs.
As such, to help you ace the Chindata SQL interview, here are 11 Chindata SQL interview questions. How many can you solve?
Identify the top 5 power users at Chindata based on the total number of transactions made within the last 30 days. Power users are defined as users who have made at least 10 transactions within the last 30 days.
transactions Example Input:
transaction_id | user_id | transaction_date | amount |
---|---|---|---|
1024 | 567 | 10/01/2022 00:00:00 | 50.00 |
2048 | 123 | 10/02/2022 00:00:00 | 75.00 |
4096 | 456 | 10/05/2022 00:00:00 | 100.00 |
8192 | 123 | 10/10/2022 00:00:00 | 150.00 |
16384 | 567 | 10/20/2022 00:00:00 | 50.00 |
Example Output:
user_id | total_transactions | total_amount |
---|---|---|
123 | 2 | 225.00 |
567 | 2 | 100.00 |
456 | 1 | 100.00 |
SELECT user_id, count(*) AS total_transactions, sum(amount) AS total_amount FROM transactions WHERE transaction_date >= current_date - INTERVAL '30 days' GROUP BY user_id HAVING count(*) >= 10 ORDER BY total_transactions DESC, total_amount DESC LIMIT 5;
This query identifies and returns the top 5 power users within the last 30 days. It does this by grouping transactions by user_id, counting the number of transactions, and summing the total transaction amount per user, considering only transactions made within the past 30 days. The query then filters to keep only users with ten or more transactions in that window. The result is ordered first by the total number of transactions in descending order, and then by the total transaction amount in descending order (to break ties when two users have the same number of transactions).
To work on another SQL customer analytics question that you can solve right in the browser and have your SQL code instantly graded, try this Walmart SQL Interview Question:
You're given a table representing the usage data of Chindata's servers. Each row of the table represents a single day of usage for a given server. The table contains the following columns:
- server_id: The unique identifier of a server
- usage_date: The date of data collection
- cpu_usage: The CPU usage in percentage
- ram_usage: The RAM usage in percentage
With this data, write a SQL query to find the average CPU usage and RAM usage of each server in the previous seven days for each given day.
server_usage Example Input:
server_id | usage_date | cpu_usage | ram_usage |
---|---|---|---|
S101 | 2022-05-01 | 45 | 63 |
S102 | 2022-05-01 | 46 | 60 |
S101 | 2022-05-02 | 35 | 50 |
S102 | 2022-05-02 | 55 | 75 |
S101 | 2022-05-03 | 40 | 65 |
S102 | 2022-05-03 | 50 | 70 |
S101 | 2022-05-04 | 55 | 75 |
S102 | 2022-05-04 | 45 | 68 |
Example Output:
server_id | usage_date | avg_cpu_usage | avg_ram_usage |
---|---|---|---|
S101 | 2022-05-02 | 40 | 56.5 |
S102 | 2022-05-02 | 50.5 | 67.5 |
S101 | 2022-05-03 | 40 | 59.3 |
S102 | 2022-05-03 | 50.3 | 68.3 |
S101 | 2022-05-04 | 43.7 | 63.7 |
S102 | 2022-05-04 | 49 | 68.3 |
SELECT server_id, usage_date, AVG(cpu_usage) OVER (PARTITION BY server_id ORDER BY usage_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_cpu_usage, AVG(ram_usage) OVER (PARTITION BY server_id ORDER BY usage_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_ram_usage FROM server_usage ORDER BY server_id, usage_date;
The SQL query above calculates the rolling 7-day average CPU and RAM usage. We need a window function because we're calculating a running average that depends on the current row and the 6 preceding rows. This is accomplished with the ROWS BETWEEN 6 PRECEDING AND CURRENT ROW clause inside the AVG function. The PARTITION BY clause ensures the rolling average is computed separately for each server, so one server's values don't spill into another server's average.
p.s. Window functions show up pretty frequently during SQL interviews, so practice the 27+ window function questions on DataLemur
Think of SQL constraints like the rules of a game. Just like a game needs rules to keep things fair and fun, a database needs constraints to keep things organized and accurate.
There are several types of SQL constraints like:
- NOT NULL: This constraint is like a bouncer at a nightclub - it won't let anything NULL through the door.
- UNIQUE: This constraint is like a VIP list - only special, one-of-a-kind values get in.
- PRIMARY KEY: This constraint is like an elected official - it's made up of NOT NULL and UNIQUE values and helps identify each row in the table.
- FOREIGN KEY: This constraint is like a diplomatic ambassador - it helps establish relationships between tables.
- CHECK: This constraint is like a referee - it makes sure everything follows the rules.
- DEFAULT: This constraint is like a backup plan - it provides a default value if no other value is specified.
So, whether you're playing a game or organizing a database, constraints are an important part of the process!
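To make this concrete, here's a minimal sketch showing all six constraints in a couple of CREATE TABLE statements (the tables and columns are hypothetical, just for illustration):

CREATE TABLE data_centers (
  dc_id INT PRIMARY KEY,          -- PRIMARY KEY: NOT NULL + UNIQUE row identifier
  region VARCHAR(50) NOT NULL     -- NOT NULL: every data center must have a region
);

CREATE TABLE servers (
  server_id INT PRIMARY KEY,
  serial_no VARCHAR(50) UNIQUE,                            -- UNIQUE: no two servers share a serial number
  dc_id INT REFERENCES data_centers (dc_id),               -- FOREIGN KEY: must match an existing data center
  cpu_usage NUMERIC CHECK (cpu_usage BETWEEN 0 AND 100),   -- CHECK: keep the percentage in a valid range
  status VARCHAR(20) DEFAULT 'active'                      -- DEFAULT: used when no status is supplied
);

With these in place, the database itself rejects bad rows (for example, a server pointing at a data center that doesn't exist) instead of relying on application code to catch them.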
Chindata is a company that provides cloud storage services to its clients. As their clients' cloud storage utilization grows, they need to understand usage patterns to offer better and more efficient services. The company wants a SQL query that determines the average usage and the maximum usage each month for every client, so they can plan for future upgrades and efficiently assign resources.
They maintain two main tables, clients and usage, representing the information about their clients and their storage usage respectively.
clients Example Input:
client_id | name | client_since |
---|---|---|
1 | Client A | 01/01/2022 |
2 | Client B | 03/25/2022 |
3 | Client C | 05/15/2022 |
usage Example Input:
usage_id | client_id | usage_date | usage_gb |
---|---|---|---|
1 | 1 | 06/01/2022 | 200 |
2 | 1 | 06/15/2022 | 215 |
3 | 1 | 07/01/2022 | 220 |
4 | 2 | 06/25/2022 | 150 |
5 | 2 | 07/01/2022 | 155 |
6 | 3 | 05/15/2022 | 100 |
7 | 3 | 06/01/2022 | 120 |
8 | 3 | 07/01/2022 | 130 |
9 | 3 | 07/15/2022 | 140 |
SELECT DATE_TRUNC('month', usage_date) AS month, name AS client_name, AVG(usage_gb) AS avg_usage, MAX(usage_gb) AS max_usage FROM usage INNER JOIN clients ON usage.client_id = clients.client_id GROUP BY month, client_name;
This query joins the clients and usage tables on the client_id field and extracts the month from the usage_date. It then groups the data by month and client_name and computes the average and maximum usage for each group. The DATE_TRUNC function truncates the date to the month, and the AVG and MAX functions are used to calculate the average and maximum usage respectively.
Normalization involves breaking up your tables into smaller, more specialized ones and using primary and foreign keys to define relationships between them. Not only does this make your database more flexible and scalable, it also makes it easier to maintain. Plus, normalization helps to keep your data accurate by reducing the chance of inconsistencies and errors.
The only downside is that your queries will now involve more joins, which can be slow and are often a DB performance bottleneck.
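As a rough sketch of what this looks like in practice (assuming a hypothetical denormalized table where every usage row repeats the client's name), normalization splits it into the clients and usage tables used earlier, linked by a foreign key:

-- Denormalized (hypothetical): usage(usage_id, client_name, client_since, usage_date, usage_gb)
-- Normalized: client details live in one table, referenced by a foreign key
CREATE TABLE clients (
  client_id INT PRIMARY KEY,
  name VARCHAR(100) NOT NULL,
  client_since DATE
);

CREATE TABLE usage (
  usage_id INT PRIMARY KEY,
  client_id INT REFERENCES clients (client_id),
  usage_date DATE,
  usage_gb INT
);

-- The trade-off mentioned above: reads now need a join
SELECT c.name, SUM(u.usage_gb) AS total_gb
FROM usage u
JOIN clients c ON u.client_id = c.client_id
GROUP BY c.name;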
As the data analyst of Chindata, your task is to track the performance of company servers. Given a table named serverlogs that contains server data logged every minute, write an SQL query to find the average CPU usage by server, sorted by the server with the highest average usage.
serverlogs Example Input:
log_id | server_id | log_time | cpu_usage |
---|---|---|---|
101 | A1 | 01/10/2022 00:01:00 | 5% |
102 | A1 | 01/10/2022 00:02:00 | 10% |
103 | A2 | 01/10/2022 00:02:00 | 15% |
104 | B1 | 01/10/2022 00:03:00 | 20% |
105 | A1 | 01/10/2022 00:03:00 | 10% |
106 | B1 | 01/10/2022 00:04:00 | 15% |
Note: CPU usage is a percentage string.
Example Output:
server_id | avg_cpu_usage |
---|---|
B1 | 17.5% |
A2 | 15.0% |
A1 | 8.33% |
SELECT server_id, TO_CHAR(AVG(CAST(REPLACE(cpu_usage, '%', '') AS FLOAT)), 'FM990.00') || '%' AS avg_cpu_usage FROM serverlogs GROUP BY server_id ORDER BY AVG(CAST(REPLACE(cpu_usage, '%', '') AS FLOAT)) DESC;
The AVG function is used here to find the average CPU usage per server. The REPLACE and CAST functions are used to transform the cpu_usage field into a numeric field because it is originally a string field containing a percentage sign (%). The TO_CHAR function is used to convert the resulting average back into a percentage string format in the result set. The server_id results are then ordered in descending order according to their average CPU usage.
To practice a very similar question try this interactive Amazon Server Utilization Time Question which is similar for calculating server time usage or this Google Odd and Even Measurements Question which is similar for working with different types of server data.
Why is the FOREIGN KEY constraint used in a database?
A FOREIGN KEY is a column or set of columns in a table that references the primary key of another table. It is used to link the two tables together, and to ensure that the data in the foreign key column is valid.
The FOREIGN KEY constraint helps to ensure the integrity of the data in the database by preventing the insertion of rows in the foreign key table that do not have corresponding entries in the primary key table. It also helps to enforce the relationship between the two tables, and can be used to ensure that data is not deleted from the primary key table if there are still references to it in the foreign key table.
For example, if you have a table of Chindata customers and an orders table, the customer_id column in the orders table could be a FOREIGN KEY that references the id column (which is the primary key) in the Chindata customers table.
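Here's a minimal sketch of that customers/orders example (the column types are assumed for illustration):

CREATE TABLE customers (
  id INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE orders (
  order_id INT PRIMARY KEY,
  customer_id INT NOT NULL,
  order_date DATE,
  CONSTRAINT fk_customer FOREIGN KEY (customer_id) REFERENCES customers (id)
);

-- This insert would be rejected, because customer 999 doesn't exist in customers:
-- INSERT INTO orders (order_id, customer_id, order_date) VALUES (1, 999, '2022-06-01');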
Recently, Chindata has launched several digital ads and they want to analyze their click-through rates (CTR). The CTR is calculated as the number of users who clicked on the ad divided by the number of total impressions (the number of times the ad was shown). Assume you have access to two tables.
The first one, named ad_impressions, contains data for every time an ad was shown to a user. The second one, named ad_clicks, contains data for every time a user clicked on an ad.
ad_impressions Example Input:
impression_id | user_id | time | ad_id |
---|---|---|---|
6751 | 823 | 06/08/2021 00:00:00 | 403 |
8752 | 432 | 06/10/2021 00:00:00 | 403 |
4543 | 678 | 06/18/2021 00:00:00 | 612 |
5872 | 995 | 07/26/2021 00:00:00 | 612 |
6437 | 900 | 07/05/2021 00:00:00 | 522 |
ad_clicks Example Input:
click_id | user_id | time | ad_id |
---|---|---|---|
4932 | 823 | 06/08/2021 00:00:01 | 403 |
2658 | 543 | 06/10/2021 00:00:01 | 612 |
7532 | 995 | 07/26/2021 00:00:01 | 612 |
4317 | 797 | 07/05/2021 00:00:01 | 403 |
3214 | 721 | 07/10/2021 00:00:01 | 522 |
Using the above two tables, write a SQL query that provides the CTR per advertisement.
SELECT i.ad_id, COUNT(DISTINCT c.user_id)::float / COUNT(DISTINCT i.user_id) AS click_through_rate FROM ad_impressions i LEFT JOIN ad_clicks c ON i.ad_id = c.ad_id AND i.user_id = c.user_id GROUP BY i.ad_id;
This query uses a LEFT JOIN to combine the impressions and clicks tables on ad_id and user_id. Since we're interested in the CTR, we count the number of unique users who clicked on the ad with COUNT(DISTINCT c.user_id) and divide it by the number of unique users the ad was shown to, COUNT(DISTINCT i.user_id). The result is the CTR for each ad.
To practice a related SQL interview question on DataLemur's free online SQL coding environment, solve this Meta SQL interview question:
Assume Chindata is a data center providing hosting services. At the end of each day, the system automatically records the amount of data transferred by each hosted server. Your goal is to write a SQL query that calculates the total amount of data transferred for each server on a monthly basis.
The usage table has the following columns:
- usage_id -- a unique identifier for each usage record (int)
- server_id -- the identifier of the server (int)
- date -- the date when the data usage was recorded (datetime)
- data_gb -- the amount of data transferred in gigabytes (int)
usage Example Input:
usage_id | server_id | date | data_gb |
---|---|---|---|
101 | 1 | 06/01/2022 00:00:00 | 5 |
102 | 1 | 06/02/2022 00:00:00 | 7 |
103 | 2 | 06/02/2022 00:00:00 | 2 |
104 | 1 | 07/01/2022 00:00:00 | 10 |
105 | 2 | 07/01/2022 00:00:00 | 15 |
Example Output:
month | server_id | total_data_gb |
---|---|---|
6 | 1 | 12 |
6 | 2 | 2 |
7 | 1 | 10 |
7 | 2 | 15 |
SELECT DATE_PART('month', date) AS month, server_id, SUM(data_gb) AS total_data_gb FROM usage GROUP BY month, server_id ORDER BY month, server_id;
The above query first extracts the month part from the date using the DATE_PART function. It then groups the rows by month and server_id and calculates the sum of data_gb for each group. The output is ordered by month and server_id for readability.
There are several normal forms that define the rules for normalizing a database:
A database is in first normal form (1NF) if it meets the following criteria:
- Each column contains atomic (indivisible) values
- There are no repeating groups or arrays of values within a row
A database is in second normal form (2NF) if it meets the following criteria:
- It is already in 1NF
- Every non-key column depends on the entire primary key, not just part of it (no partial dependencies)
Said another way, to achieve 2NF, besides following all the rules from 1NF, all the columns in a given table should be dependent only on that table's primary key.
A database is in third normal form (3NF) if it meets the following criteria:
- It is already in 2NF
- It has no transitive dependencies
A transitive dependency means that a piece of data in one column is derived from another column. For example, it wouldn't make sense to keep both a "user's age" column and a "user's birthdate" column (because age can be derived from birthdate).
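As a quick sketch of that birthdate example (the table and column names are hypothetical), you'd store only the birthdate and compute the age at query time:

CREATE TABLE users (
  user_id INT PRIMARY KEY,
  birthdate DATE NOT NULL
  -- no "age" column: it would be transitively dependent on birthdate
);

-- Derive age whenever it's needed instead of storing it
SELECT user_id, DATE_PART('year', AGE(birthdate)) AS age
FROM users;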
While there are also a 4th and 5th normal form, they're too pedantic and hence out-of-scope to know for the Chindata SQL interview.
Chindata has several data centers distributed across various regions. Each data center consists of several server racks. Each rack contains several servers, and the power consumption of each varies. Chindata wants to calculate the average power consumption per rack for each region.
Below are the data structures for the data_centers, server_racks, and servers tables.
data_centers Example Input:
dc_id | region |
---|---|
101 | East |
102 | West |
103 | North |
104 | South |
server_racks Example Input:
rack_id | dc_id |
---|---|
201 | 101 |
202 | 101 |
203 | 102 |
204 | 103 |
205 | 104 |
servers Example Input:
server_id | rack_id | power_consumption |
---|---|---|
301 | 201 | 150 |
302 | 201 | 200 |
303 | 202 | 250 |
304 | 203 | 350 |
305 | 204 | 120 |
306 | 205 | 360 |
Example Output:
region | avg_power_per_rack |
---|---|
East | 200 |
West | 350 |
North | 120 |
South | 360 |
SELECT dc.region, ROUND(AVG(s.power_consumption)) as avg_power_per_rack FROM data_centers dc JOIN server_racks sr ON dc.dc_id = sr.dc_id JOIN servers s ON sr.rack_id = s.rack_id GROUP BY dc.region
This query joins the data_centers, server_racks, and servers tables on their respective foreign keys. It then groups the result by region and generates the average power consumption per rack in each region using PostgreSQL's ROUND() and AVG() functions. The ROUND() function is used to round the result to the nearest whole number.
To practice a very similar question try this interactive Amazon Server Utilization Time Question which is similar for calculating server related metrics or this Microsoft Supercloud Customer Question which is similar for dealing with server racks.
The best way to prepare for a Chindata SQL interview is to practice, practice, practice.
Beyond just solving the above Chindata SQL interview questions, you should also solve the 200+ SQL exercises on DataLemur which come from companies like Netflix, Google, and Amazon.
Each SQL question has hints to guide you, step-by-step solutions, and best of all, there's an interactive coding environment so you can run your query right in the browser and have it executed.
To prep for the Chindata SQL interview it is also wise to practice interview questions from other tech companies like:
But if your SQL skills are weak, don't worry about going right into solving questions – go learn SQL with this DataLemur SQL Tutorial.
This tutorial covers SQL concepts such as window functions and CASE/WHEN statements – both of these come up routinely in SQL job interviews at Chindata.
In addition to SQL query questions, the other question categories to prepare for the Chindata Data Science Interview are:
The best way to prepare for Chindata Data Science interviews is by reading Ace the Data Science Interview. The book's got: