At Cloudflare, SQL is used across the company to analyze network security and web traffic data. The company even offers an external service called D1, which lets developers create serverless databases. Unsurprisingly, Cloudflare almost always asks SQL questions during interviews for Data Science, Data Engineering, and Data Analytics jobs.
So, to help you prepare for the Cloudflare SQL interview, we've curated 10 Cloudflare SQL interview questions – can you solve them?
Let's say you are given a table where each row includes a timestamp of a request reaching Cloudflare's edge, its status code, and the corresponding Cloudflare data center (POP). Write a SQL query to find the total number of requests from each POP over the last 30 days, along with the percentage of each status code (200, 403, 500, etc.) relative to that POP's total. Assume today is '2022-09-01'.
timestamp | pop | status_code |
---|---|---|
08/08/2022 09:10:13 | SFO | 200 |
08/13/2022 11:55:48 | LAX | 403 |
08/20/2022 16:31:07 | NYC | 500 |
08/28/2022 06:11:19 | SFO | 200 |
08/29/2022 23:55:55 | LAX | 500 |
08/25/2022 14:20:48 | NYC | 200 |
pop | request_count | status_200_perc | status_403_perc | status_500_perc |
---|---|---|---|---|
SFO | 2 | 100.00 | 0.00 | 0.00 |
LAX | 2 | 0.00 | 50.00 | 50.00 |
NYC | 2 | 50.00 | 0.00 | 50.00 |
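One possible PostgreSQL solution (the table name requests is assumed here, since it isn't given in the question):

```sql
-- 'requests' is an assumed table name; the columns come from the sample data above
WITH pop_totals AS (
  SELECT pop, COUNT(*) AS request_count
  FROM requests
  WHERE timestamp >= DATE '2022-09-01' - INTERVAL '30 days'
  GROUP BY pop
)
SELECT
  r.pop,
  t.request_count,
  ROUND(COUNT(CASE WHEN r.status_code = 200 THEN 1 END) * 100.0 / t.request_count, 2) AS status_200_perc,
  ROUND(COUNT(CASE WHEN r.status_code = 403 THEN 1 END) * 100.0 / t.request_count, 2) AS status_403_perc,
  ROUND(COUNT(CASE WHEN r.status_code = 500 THEN 1 END) * 100.0 / t.request_count, 2) AS status_500_perc
FROM requests r
JOIN pop_totals t ON t.pop = r.pop
WHERE r.timestamp >= DATE '2022-09-01' - INTERVAL '30 days'
GROUP BY r.pop, t.request_count;
```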
This query uses a common table expression (CTE) to get the total number of requests for each POP over the last 30 days. In the main query, we use conditional aggregation, placing a CASE expression inside the COUNT function, to count each status code. Each percentage is the ratio of a specific status code's count to the POP's total request count, multiplied by 100 and rounded to two decimal places for readability.
To practice a similar window function question on DataLemur's free interactive SQL code editor, try this Google SQL Interview Question.
Assume you are a Data Scientist at Cloudflare and have the following two tables: a users table and a page_views table.
Calculate the median page views per day for each country for the month of September 2022.
user_id | country | account_creation_date |
---|---|---|
1 | USA | 2022-01-01 |
2 | USA | 2022-02-01 |
3 | Germany | 2022-03-01 |
4 | Germany | 2022-01-01 |
5 | India | 2022-02-01 |
user_id | date | page_views |
---|---|---|
1 | 2022-09-01 | 3 |
2 | 2022-09-01 | 5 |
3 | 2022-09-01 | 2 |
1 | 2022-09-02 | 4 |
5 | 2022-09-02 | 1 |
4 | 2022-09-03 | 3 |
2 | 2022-09-03 | 6 |
1 | 2022-09-04 | 1 |
3 | 2022-09-04 | 2 |
5 | 2022-09-04 | 3 |
The solution requires joining the page_views table with the users table while aggregating the median page views per day by country for September 2022.
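A sketch of the query, assuming the tables are named users and page_views (those names aren't shown above):

```sql
SELECT
  u.country,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY daily.page_views) AS median_page_views
FROM (
  -- page views for each day by each user, limited to September 2022
  SELECT user_id, date, page_views
  FROM page_views
  WHERE date BETWEEN '2022-09-01' AND '2022-09-30'
) AS daily
JOIN users u ON u.user_id = daily.user_id
GROUP BY u.country;
```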
This SQL block first creates a subquery that represents the page views for each day by each user. It then joins this processed data with the users table to find each user's country. Finally, the PERCENTILE_CONT function calculates the median page views for each country. Because the median is robust to outliers, this approach stays accurate even when page-view counts are skewed across users and days.
There are several steps you can take to troubleshoot a slow SQL query.
First things first, figure out why the query is slow! You can use the EXPLAIN and EXPLAIN ANALYZE commands in PostgreSQL to identify performance bottlenecks. After that, you can start changing your query, depending on the source of the performance issue.
Generally, adding indexes can speed up queries. Denormalizing your tables might also help by removing slow joins.
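For example, on a hypothetical requests table (the table and column names here are just illustrative):

```sql
-- Show the query plan with actual run times and row counts:
EXPLAIN ANALYZE
SELECT * FROM requests WHERE pop = 'SFO';

-- If the plan shows a sequential scan over a large table,
-- an index on the filtered column may speed things up:
CREATE INDEX idx_requests_pop ON requests (pop);
```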
Lastly, you could always just upgrade your hardware! Sometimes the time and money spent tuning a query would be better spent on more powerful database servers.
While this is a surface-level approach to optimizing SQL query performance, it should do the trick for Data Analyst and Data Science interviews at Cloudflare. Data Engineers should dig deeper into query plans and database tuning before the interview.
Cloudflare is an American web-infrastructure and website-security company that provides content-delivery-network services, DDoS mitigation, Internet security, and distributed domain-name-server services. You are given a table named 'data_processed' that records the number of gigabytes of data processed by Cloudflare each day. Write a query to find the average amount of data processed per day over a given month.
id | date | data_gb |
---|---|---|
1 | 2020-07-01 | 50 |
2 | 2020-07-02 | 54 |
3 | 2020-07-03 | 70 |
4 | 2020-07-04 | 32 |
5 | 2020-07-05 | 64 |
6 | 2020-07-06 | 46 |
7 | 2020-07-07 | 57 |
8 | 2020-07-08 | 60 |
9 | 2020-07-09 | 55 |
10 | 2020-07-10 | 56 |
month | avg_data_gb |
---|---|
7 | 54.4 |
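One way to write this in PostgreSQL:

```sql
SELECT
  EXTRACT(MONTH FROM date) AS month,
  AVG(data_gb) AS avg_data_gb
FROM data_processed
WHERE EXTRACT(MONTH FROM date) = 7  -- limit the data to July
GROUP BY month;
```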
This PostgreSQL query first extracts the month from the date using the EXTRACT function. The condition in the WHERE clause limits the selection to the 7th month (July). The AVG function then averages the data_gb values, and the result is grouped by month, returning the average number of gigabytes processed per day for July.
To practice a very similar question, try this interactive Amazon Server Utilization Time Question, which is similar for calculating total data processing, or this Alibaba Compressed Mean Question, which is similar for calculating average data value.
Denormalization is a technique used to improve the read performance of a database, typically at the expense of some write performance.
By adding redundant copies of data or grouping data together in a way that does not follow normalization rules, denormalization improves the performance and scalability of a database by eliminating costly join operations, which is important for OLAP use cases that are read-heavy and have minimal updates/inserts.
Cloudflare is a web-infrastructure and website-security company, providing content-delivery-network services, DDoS mitigation, Internet security, and distributed domain-name-server services. Suppose you have been hired as a data analyst at Cloudflare, and your task for today is to calculate the average data traffic per service (product) on a monthly basis for the year 2022. This will provide important insights into the most data-intensive services Cloudflare provides.
For this task, assume we have a table of service usage records with the following columns:
service_id | user_id | data_usage (GB) | usage_date |
---|---|---|---|
101 | 111 | 10 | 02/01/2022 |
101 | 112 | 15 | 02/02/2022 |
102 | 111 | 200 | 03/15/2022 |
101 | 113 | 5 | 02/06/2022 |
101 | 111 | 15 | 03/01/2022 |
103 | 114 | 300 | 01/21/2022 |
102 | 115 | 100 | 01/11/2022 |
103 | 116 | 350 | 02/28/2022 |
Our goal is to create a result table with the following structure:
month | service_id | average_data (GB) |
---|---|---|
1 | 102 | 100.00 |
1 | 103 | 300.00 |
2 | 101 | 10.00 |
2 | 103 | 350.00 |
3 | 101 | 15.00 |
3 | 102 | 200.00 |
Given the problem, here's the PostgreSQL query we could use to calculate the average data traffic by service per month:
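The sketch below assumes the table is named service_usage and the usage column is data_usage, since those names aren't shown above:

```sql
SELECT
  EXTRACT(MONTH FROM usage_date) AS month,
  service_id,
  ROUND(AVG(data_usage), 2) AS average_data
FROM service_usage  -- assumed table name
WHERE EXTRACT(YEAR FROM usage_date) = 2022
GROUP BY month, service_id
ORDER BY month, service_id;
```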
This query first filters data from the year 2022. Then, using the GROUP BY clause, it aggregates data traffic per month and per service. The AVG function calculates the average data traffic, and the ORDER BY clause sorts the output by month and service_id.
To explain the difference between a primary key and foreign key, let's look at some data from Cloudflare's marketing analytics database, which stores data from its Google Ads campaigns:
ad_id | campaign_id | keyword | click_count |
---|---|---|---|
1 | 100 | Cloudflare pricing | 10 |
2 | 100 | Cloudflare reviews | 15 |
3 | 101 | Cloudflare alternatives | 7 |
4 | 101 | buy Cloudflare | 12 |
In this table, ad_id could be a primary key. It is unique for each row in the table and cannot contain null values.
campaign_id could be a foreign key. It references the campaign_id of the Google Ads campaign that each ad belongs to, establishing a relationship between the ads and their campaigns. This foreign key allows you to easily query the table to find out which ads belong to a specific campaign, or which campaign a specific ad belongs to.
It is also possible for a table to have multiple foreign keys that reference primary keys in different tables. For example, the table could have additional foreign keys for the id of the ad group that each ad belongs to, and the id of the Google Ads account that the campaigns belong to.
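To make this concrete, here's a minimal sketch of how these keys might be declared (the table and column names are illustrative, not Cloudflare's actual schema):

```sql
CREATE TABLE campaigns (
  campaign_id   INTEGER PRIMARY KEY,  -- unique, non-null identifier
  campaign_name TEXT
);

CREATE TABLE google_ads (
  ad_id       INTEGER PRIMARY KEY,          -- primary key: unique per ad
  campaign_id INTEGER REFERENCES campaigns, -- foreign key to campaigns
  keyword     TEXT,
  click_count INTEGER
);
```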
Cloudflare maintains a large database of customer records. Your task is to find all customers that are using Cloudflare services in cities starting with the letter 'S'.
customer_id | first_name | last_name | service_id | city |
---|---|---|---|---|
1 | John | Smith | 101 | San Francisco |
2 | Jane | Doe | 102 | Seattle |
3 | Tom | Brown | 103 | Sacramento |
4 | Emma | Johnson | 101 | New York |
5 | Harry | White | 102 | Boston |
service_id | service_name |
---|---|
101 | DNS |
102 | DDoS Protection |
103 | Content Delivery Network |
We want to find customers using Cloudflare services in cities starting with the letter 'S', so the output should be:
customer_id | first_name | last_name | service_name | city |
---|---|---|---|---|
1 | John | Smith | DNS | San Francisco |
2 | Jane | Doe | DDoS Protection | Seattle |
3 | Tom | Brown | Content Delivery Network | Sacramento |
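One way to write this query:

```sql
SELECT
  c.customer_id,
  c.first_name,
  c.last_name,
  s.service_name,
  c.city
FROM customers c
JOIN services s ON s.service_id = c.service_id
WHERE c.city LIKE 'S%';  -- matches any city starting with 'S'
```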
The above SQL query joins the 'customers' table with the 'services' table on 'service_id'. It then uses the LIKE keyword to filter for customers in cities that start with the letter 'S'. The '%' character in the LIKE pattern matches any sequence of characters, so 'S%' matches any city name beginning with 'S'.
Suppose you are given two tables, customers and products. The customers table contains information about the company's customers, including their customer_id, name, signup_date, and plan_id. The products table includes details about the products offered by the company, such as the product_id, plan_id, and product_name.
Your task is to write a SQL query that lists each customer along with the name of the product they are on.
customer_id | name | signup_date | plan_id |
---|---|---|---|
8732 | John Doe | 2021-11-03 | 1001 |
4756 | Jane Smith | 2021-03-05 | 1002 |
9782 | Mary Johnson | 2021-07-15 | 1001 |
1635 | James Brown | 2021-09-20 | 1003 |
product_id | plan_id | product_name |
---|---|---|
50001 | 1001 | Basic |
69852 | 1002 | Advanced |
92370 | 1003 | Premium |
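One way to write this query, assuming the tables are named customers and products:

```sql
SELECT
  c.customer_id,
  c.name,
  p.product_name
FROM customers c
JOIN products p ON p.plan_id = c.plan_id;
```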
The PostgreSQL query above uses the JOIN clause to combine rows from customers and products based on the field common to both tables, plan_id. The output is a new table that contains the customer_id and name (from the customers table) and the product_name (from the products table) for each customer.
Since joins come up so often during SQL interviews, try an interactive Spotify JOIN SQL question.
Database denormalization is when you add redundancy to a database, breaking the usual normalization rules (specified by the 1st, 2nd, and 3rd normal forms). There are a few reasons to denormalize a database:
Improved performance: Denormalization often reduces the number of costly join operations needed to retrieve data. This is helpful when the database is used for OLAP (Online Analytical Processing) workloads, as joins can be expensive and slow.
Simplification: Denormalization can also be used to simplify the design of a database by reducing the number of tables and relationships that need to be managed. This can make it easier to understand and maintain the database.
Ease of use: Denormalization can also make it easier for users to work with a database by providing them with a more intuitive and straightforward data model.
While denormalization can be a useful tool for improving performance and scalability, keep in mind that it can make update and delete operations more complex, because the duplicated data makes it harder to maintain data integrity. So, to avoid headaches, start with a well-normalized database design and consider denormalization only if it's truly necessary for your performance and scalability needs. In other words, denormalization can be a bit of a wild card, so handle it with care!
The best way to prepare for a SQL interview, besides making sure you have strong SQL fundamentals, is to practice a ton of real SQL questions that were asked in recent job interviews. Besides solving the above Cloudflare SQL interview questions, you should also solve the 200+ SQL questions on DataLemur which come from companies like Netflix, Google, and Amazon.
Each DataLemur SQL question has hints to guide you, detailed solutions, and most importantly, an online SQL coding environment so you can write your SQL query right in the browser and have it executed.
To prep for the Cloudflare SQL interview, it is also wise to solve SQL questions from other tech companies.
But if your SQL foundations are weak, don't jump right into solving questions – learn SQL with this SQL interview tutorial first.
This tutorial covers SQL concepts such as LEAD window function and math functions in SQL – both of these come up frequently during SQL job interviews at Cloudflare.
Beyond writing SQL queries, the Cloudflare Data Science Interview tests several other topics as well.
To prepare for Cloudflare Data Science interviews, read the book Ace the Data Science Interview, which covers these other interview topics in depth.