At Chindata, SQL is crucial for extracting and manipulating customer data for insights and designing databases to house server-related information. That's why Chindata frequently asks SQL query questions during interviews for Data Analyst, Data Science, and BI jobs.
As such, to help you ace the Chindata SQL interview, here are 11 Chindata SQL interview questions – how many can you solve?
Identify the top 5 power users in "Chindata" company based on the total number of transactions made within the last 30 days. Power Users are defined as users who have made at least 10 transactions within the last 30 days.
transaction_id | user_id | transaction_date | amount |
---|---|---|---|
1024 | 567 | 10/01/2022 00:00:00 | 50.00 |
2048 | 123 | 10/02/2022 00:00:00 | 75.00 |
4096 | 456 | 10/05/2022 00:00:00 | 100.00 |
8192 | 123 | 10/10/2022 00:00:00 | 150.00 |
16384 | 567 | 10/20/2022 00:00:00 | 50.00 |
user_id | total_transactions | total_amount |
---|---|---|
123 | 2 | 225.00 |
567 | 2 | 100.00 |
456 | 1 | 100.00 |
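One possible answer, sketched in PostgreSQL syntax (the table name `transactions` is assumed from the sample columns above):

```sql
SELECT
  user_id,
  COUNT(*) AS total_transactions,
  SUM(amount) AS total_amount
FROM transactions
-- only look at the last 30 days
WHERE transaction_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY user_id
-- power users = 10+ transactions in the window
HAVING COUNT(*) >= 10
ORDER BY total_transactions DESC, total_amount DESC
LIMIT 5;
```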
This query identifies and returns the top 5 power users within the last 30 days. It does this by grouping transactions by user_id, counting the number of transactions, and summing the total transaction amount per user, considering only transactions made within the past 30 days. The query then filters to keep only users with ten or more transactions in that window. The result is ordered first by the total number of transactions in descending order, then by the total transaction amount in descending order (to break ties when two users have an equal number of transactions).
To work on another SQL customer analytics question where you can solve it right in the browser and have your SQL code instantly graded, try this Walmart SQL Interview Question:
You're given a table representing the usage data of Chindata's servers. Each row of the table represents a single day of usage for a given server; the table's columns are shown in the sample below.
With this data, write a SQL query to find the average CPU usage and RAM usage of each server in the previous seven days for each given day.
server_id | usage_date | cpu_usage | ram_usage |
---|---|---|---|
S101 | 2022-05-01 | 45 | 63 |
S102 | 2022-05-01 | 46 | 60 |
S101 | 2022-05-02 | 35 | 50 |
S102 | 2022-05-02 | 55 | 75 |
S101 | 2022-05-03 | 40 | 65 |
S102 | 2022-05-03 | 50 | 70 |
S101 | 2022-05-04 | 55 | 75 |
S102 | 2022-05-04 | 45 | 68 |
server_id | usage_date | avg_cpu_usage | avg_ram_usage |
---|---|---|---|
S101 | 2022-05-02 | 40 | 56.5 |
S102 | 2022-05-02 | 50.5 | 67.5 |
S101 | 2022-05-03 | 40 | 59.3 |
S102 | 2022-05-03 | 50.3 | 68.3 |
S101 | 2022-05-04 | 43.7 | 63.7 |
S102 | 2022-05-04 | 49 | 68.3 |
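A possible solution using a window function (PostgreSQL syntax; the table name `server_usage` is assumed from the sample columns):

```sql
SELECT
  server_id,
  usage_date,
  -- rolling average over the current day and the 6 days before it
  AVG(cpu_usage) OVER (
    PARTITION BY server_id
    ORDER BY usage_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_cpu_usage,
  AVG(ram_usage) OVER (
    PARTITION BY server_id
    ORDER BY usage_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_ram_usage
FROM server_usage;
```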
This query calculates a rolling 7-day average of CPU and RAM usage. A window function is required because the running average depends on the current row and the 6 preceding rows, which is expressed with the ROWS BETWEEN 6 PRECEDING AND CURRENT ROW clause inside the AVG() OVER() function. The PARTITION BY server_id clause restarts the average for each server, so one server's usage doesn't spill over into another's.
p.s. Window functions show up pretty frequently during SQL interviews, so practice the 27+ window function questions on DataLemur.
Think of SQL constraints like the rules of a game. Just like a game needs rules to keep things fair and fun, a database needs constraints to keep things organized and accurate.
There are several types of SQL constraints like:
- NOT NULL: This constraint is like a bouncer at a nightclub - it won't let anything NULL through the door.
- UNIQUE: This constraint is like a VIP list - only special, one-of-a-kind values get in.
- PRIMARY KEY: This constraint is like an elected official - it's made up of NOT NULL and UNIQUE values and helps identify each row in the table.
- FOREIGN KEY: This constraint is like a diplomatic ambassador - it helps establish relationships between tables.
- CHECK: This constraint is like a referee - it makes sure everything follows the rules.
- DEFAULT: This constraint is like a backup plan - it provides a default value if no other value is specified.
So, whether you're playing a game or organizing a database, constraints are an important part of the process!
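Here's how those constraints might look in practice, on a hypothetical table (the table and column names are made up for illustration):

```sql
CREATE TABLE servers (
  server_id INT PRIMARY KEY,                     -- NOT NULL + UNIQUE, identifies each row
  hostname  VARCHAR(50) NOT NULL UNIQUE,         -- must exist, no duplicates allowed
  dc_id     INT REFERENCES data_centers(dc_id),  -- FOREIGN KEY linking to another table
  cpu_cores INT CHECK (cpu_cores > 0),           -- values must follow the rule
  status    VARCHAR(10) DEFAULT 'active'         -- backup value if none is specified
);
```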
Chindata is a company that provides cloud storage services to its clients. As their cloud storage utilization grows, they need to understand usage patterns to offer better, more efficient services. The company wants a SQL query that determines the average usage and the maximum usage each month for every client, so they can plan future upgrades and assign resources efficiently.
They maintain two main tables: one holding information about their clients, and one recording each client's storage usage.
client_id | name | client_since |
---|---|---|
1 | Client A | 01/01/2022 |
2 | Client B | 03/25/2022 |
3 | Client C | 05/15/2022 |
usage_id | client_id | usage_date | usage_gb |
---|---|---|---|
1 | 1 | 06/01/2022 | 200 |
2 | 1 | 06/15/2022 | 215 |
3 | 1 | 07/01/2022 | 220 |
4 | 2 | 06/25/2022 | 150 |
5 | 2 | 07/01/2022 | 155 |
6 | 3 | 05/15/2022 | 100 |
7 | 3 | 06/01/2022 | 120 |
8 | 3 | 07/01/2022 | 130 |
9 | 3 | 07/15/2022 | 140 |
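A possible answer in PostgreSQL (the table names `clients` and `usage` are assumed from the sample columns):

```sql
SELECT
  c.name,
  DATE_TRUNC('month', u.usage_date) AS month,
  AVG(u.usage_gb) AS avg_usage_gb,
  MAX(u.usage_gb) AS max_usage_gb
FROM clients c
JOIN usage u
  ON c.client_id = u.client_id
-- one row per client per month
GROUP BY c.name, DATE_TRUNC('month', u.usage_date)
ORDER BY c.name, month;
```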
This query joins the two tables on the client_id column and extracts the month from usage_date. It then groups the data by client and month and computes the average and maximum usage for each group. The DATE_TRUNC function truncates the date to the month, while the AVG and MAX functions calculate the average and maximum usage respectively.
Normalization involves breaking up your tables into smaller, more specialized ones and using primary and foreign keys to define relationships between them. Not only does this make your database more flexible and scalable, it also makes it easier to maintain. Plus, normalization helps to keep your data accurate by reducing the chance of inconsistencies and errors.
The only downside now is that your queries will involve more joins, which are slow AF and often a DB performance bottleneck.
As the data analyst of Chindata, your task is to track the performance of company servers. Given a table of server logs, where a row is recorded every minute, write a SQL query to find the average CPU usage by server, sorted from the highest average usage to the lowest.
log_id | server_id | log_time | cpu_usage |
---|---|---|---|
101 | A1 | 01/10/2022 00:01:00 | 5% |
102 | A1 | 01/10/2022 00:02:00 | 10% |
103 | A2 | 01/10/2022 00:02:00 | 15% |
104 | B1 | 01/10/2022 00:03:00 | 20% |
105 | A1 | 01/10/2022 00:03:00 | 10% |
106 | B1 | 01/10/2022 00:04:00 | 15% |
Note: CPU usage is a percentage string.
server_id | avg_cpu_usage |
---|---|
B1 | 17.5% |
A2 | 15.0% |
A1 | 8.33% |
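One way to handle the percentage strings in PostgreSQL (the table name `server_logs` is assumed from the sample columns; the exact TO_CHAR format string is a sketch):

```sql
SELECT
  server_id,
  -- strip the '%', cast to a number, average, then format back to a percentage string
  TO_CHAR(
    AVG(CAST(REPLACE(cpu_usage, '%', '') AS DECIMAL)),
    'FM999990.99'
  ) || '%' AS avg_cpu_usage
FROM server_logs
GROUP BY server_id
ORDER BY AVG(CAST(REPLACE(cpu_usage, '%', '') AS DECIMAL)) DESC;
```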
The AVG function is used here to find the average CPU usage per server. The REPLACE and CAST functions are used to transform the cpu_usage field into a numeric field because it is originally a string field containing a percentage sign (%). The TO_CHAR function is used to convert the resulting average back into a percentage string format in the result set. The server_id results are then ordered in descending order according to their average CPU usage.
To practice a very similar question try this interactive Amazon Server Utilization Time Question which is similar for calculating server time usage or this Google Odd and Even Measurements Question which is similar for working with different types of server data.
A foreign key is a column or set of columns in a table that references the primary key of another table. It is used to link the two tables together, and to ensure that the data in the foreign key column is valid.
The FOREIGN KEY constraint helps ensure the integrity of the data in the database by preventing the insertion of rows in the foreign key table that do not have corresponding entries in the primary key table. It also enforces the relationship between the two tables, and can be used to ensure that data is not deleted from the primary key table while there are still references to it in the foreign key table.
For example, if you have a table of Chindata customers and an orders table, the customer_id column in the orders table could be a foreign key that references the id column (which is the primary key) in the Chindata customers table.
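That customers/orders relationship could be declared like this (column types are illustrative):

```sql
CREATE TABLE customers (
  id   INT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE orders (
  order_id    INT PRIMARY KEY,
  -- foreign key: every order must point at an existing customer
  customer_id INT REFERENCES customers(id)
);
```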
Recently, Chindata has launched several digital ads and they want to analyze their click-through rates (CTR). The CTR is calculated as the number of users who clicked on the ad divided by the number of total impressions (the number of times the ad was shown). Assume you have access to two tables.
The first one, named impressions, contains a row for every time an ad was shown to a user. The second one, named clicks, contains a row for every time a user clicked on an ad.
impression_id | user_id | time | ad_id |
---|---|---|---|
6751 | 823 | 06/08/2021 00:00:00 | 403 |
8752 | 432 | 06/10/2021 00:00:00 | 403 |
4543 | 678 | 06/18/2021 00:00:00 | 612 |
5872 | 995 | 07/26/2021 00:00:00 | 612 |
6437 | 900 | 07/05/2021 00:00:00 | 522 |
click_id | user_id | time | ad_id |
---|---|---|---|
4932 | 823 | 06/08/2021 00:00:01 | 403 |
2658 | 543 | 06/10/2021 00:00:01 | 612 |
7532 | 995 | 07/26/2021 00:00:01 | 612 |
4317 | 797 | 07/05/2021 00:00:01 | 403 |
3214 | 721 | 07/10/2021 00:00:01 | 522 |
Using the above two tables, write a SQL query that provides the CTR per advertisement.
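A possible answer (PostgreSQL syntax), using the impressions and clicks tables:

```sql
SELECT
  i.ad_id,
  -- unique clicking users divided by unique impressions
  COUNT(DISTINCT c.user_id)::DECIMAL
    / COUNT(DISTINCT i.impression_id) AS ctr
FROM impressions i
LEFT JOIN clicks c
  ON  i.ad_id = c.ad_id
  AND i.user_id = c.user_id
GROUP BY i.ad_id;
```

The LEFT JOIN keeps every impression even when no matching click exists, so ads that were shown but never clicked still get a CTR (of zero).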
This query uses a LEFT JOIN to combine the impressions and clicks tables on ad_id and user_id. Since we're interested in the CTR, we count the number of unique users who clicked on each ad and divide it by the total number of unique impressions. The result is the CTR for each ad.
To practice a related SQL interview question on DataLemur's free online SQL coding environment, solve this Meta SQL interview question:
Assume Chindata is a data center providing hosting services. At the end of each day, the system automatically records the amount of data transferred by each hosted server. Your goal is to write a SQL query that calculates the total amount of data transferred for each server on a monthly basis.
The table has the following columns:
usage_id | server_id | date | data_gb |
---|---|---|---|
101 | 1 | 06/01/2022 00:00:00 | 5 |
102 | 1 | 06/02/2022 00:00:00 | 7 |
103 | 2 | 06/02/2022 00:00:00 | 2 |
104 | 1 | 07/01/2022 00:00:00 | 10 |
105 | 2 | 07/01/2022 00:00:00 | 15 |
month | server_id | total_data_gb |
---|---|---|
6 | 1 | 12 |
6 | 2 | 2 |
7 | 1 | 10 |
7 | 2 | 15 |
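A possible answer in PostgreSQL (the table name `server_usage` is assumed from the sample columns):

```sql
SELECT
  DATE_PART('month', date) AS month,
  server_id,
  SUM(data_gb) AS total_data_gb
FROM server_usage
GROUP BY DATE_PART('month', date), server_id
ORDER BY month, server_id;
```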
The query above first extracts the month part from the date using the DATE_PART function. It then groups the rows by month and server_id and calculates the sum of data_gb for each group. The output is ordered by month and server_id for readability.
There are several normal forms that define the rules for normalizing a database:
A database is in first normal form (1NF) if it meets the following criteria:

- Each column contains only atomic (indivisible) values
- There are no repeating groups or arrays of values in a single column
- Each row is unique, typically enforced with a primary key
A database is in second normal form (2NF) if it meets the following criteria:

- It is already in 1NF
- Every non-key column depends on the entire primary key, not just part of it (no partial dependencies)
Said another way, to achieve 2NF, besides following all the rules from 1NF all the columns in a given table should be dependent only on that table's primary key.
A database is in third normal form (3NF) if it meets the following criteria:

- It is already in 2NF
- Non-key columns depend only on the primary key, not on other non-key columns (no transitive dependencies)
A transitive dependency means that a piece of data in one column is derived from another column. For example, it wouldn't make sense to keep both a "user's age" column and a "user's birthdate" column (because age can be derived from birthdate).
While there are also 4th and 5th normal forms, they're too pedantic and hence out-of-scope for the Chindata SQL interview.
Chindata has several data centers distributed across various regions. Each data center consists of several server racks. Each rack contains several servers, and the power consumption of each varies. Chindata wants to calculate the average power consumption per rack for each region.
Below are the data structures for the data center, rack, and server tables.
dc_id | region |
---|---|
101 | East |
102 | West |
103 | North |
104 | South |
rack_id | dc_id |
---|---|
201 | 101 |
202 | 101 |
203 | 102 |
204 | 103 |
205 | 104 |
server_id | rack_id | power_consumption |
---|---|---|
301 | 201 | 150 |
302 | 201 | 200 |
303 | 202 | 250 |
304 | 203 | 350 |
305 | 204 | 120 |
306 | 205 | 360 |
region | avg_power_per_rack |
---|---|
East | 200 |
West | 350 |
North | 120 |
South | 360 |
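A possible answer in PostgreSQL (the table names `data_centers`, `racks`, and `servers` are assumed from the sample columns):

```sql
SELECT
  d.region,
  -- average server power consumption per region, rounded to a whole number
  ROUND(AVG(s.power_consumption)) AS avg_power_per_rack
FROM data_centers d
JOIN racks r
  ON d.dc_id = r.dc_id
JOIN servers s
  ON r.rack_id = s.rack_id
GROUP BY d.region;
```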
This query joins the data center, rack, and server tables on their respective foreign keys. It then groups the result by region and computes the average power consumption across each region's servers using PostgreSQL's AVG and ROUND functions. The ROUND function rounds the result to the nearest whole number.
To practice a very similar question try this interactive Amazon Server Utilization Time Question which is similar for calculating server related metrics or this Microsoft Supercloud Customer Question which is similar for dealing with server racks.
The best way to prepare for a Chindata SQL interview is to practice, practice, practice. Beyond just solving the above Chindata SQL interview questions, you should also solve the 200+ SQL exercises on DataLemur which come from companies like Netflix, Google, and Amazon.
Each SQL question has hints to guide you, step-by-step solutions, and best of all, there's an interactive coding environment so you can run your query right in the browser and have it executed instantly.
To prep for the Chindata SQL interview it is also wise to practice interview questions from other tech companies like:
But if your SQL skills are weak, don't worry about going right into solving questions – go learn SQL with this DataLemur SQL Tutorial.
This tutorial covers SQL concepts such as window functions and CASE/WHEN statements – both of these come up routinely in SQL job interviews at Chindata.
In addition to SQL query questions, the other question categories to prepare for the Chindata Data Science Interview are:
The best way to prepare for Chindata Data Science interviews is by reading Ace the Data Science Interview. The book's got: