Twitter employees use SQL to analyze tweet patterns, identifying trending topics and user engagement metrics to enhance the user experience. They also extract relevant data on user demographics and interests to help business clients optimize their ad campaigns, which is why Twitter includes SQL questions during interviews for Data Science, Data Engineering, and Data Analytics jobs.
So, to help you study, here's 11 Twitter SQL interview questions – can you solve them?
This is the same question as problem #6 in the SQL Chapter of Ace the Data Science Interview!
Assume you're given a table Twitter tweet data, write a query to obtain a histogram of tweets posted per user in 2022. Output the tweet count per user as the bucket and the number of Twitter users who fall into that bucket.
In other words, group the users by the number of tweets they posted in 2022 and count the number of users in each group.
Column Name | Type |
---|---|
tweet_id | integer |
user_id | integer |
msg | string |
tweet_date | timestamp |
tweet_id | user_id | msg | tweet_date |
---|---|---|---|
214252 | 111 | Am considering taking Tesla private at $420. Funding secured. | 12/30/2021 00:00:00 |
739252 | 111 | Despite the constant negative press covfefe | 01/01/2022 00:00:00 |
846402 | 111 | Following @NickSinghTech on Twitter changed my life! | 02/14/2022 00:00:00 |
241425 | 254 | If the salary is so competitive why won’t you tell me what it is? | 03/01/2022 00:00:00 |
231574 | 148 | I no longer have a manager. I can't be managed | 03/23/2022 00:00:00 |
tweet_bucket | users_num |
---|---|
1 | 2 |
2 | 1 |
To see an additional answer go to DataLemur's FREE interactive coding environment!
This is the same question as problem #24 in the SQL Chapter of Ace the Data Science Interview!
Assume you are given the table containing Twitter user session activities.
Write a query that ranks users according to their total session durations (in minutes) in descending order for each session type between the start date (2022-01-01) and the end date (2022-02-01).
Output the user ID, session type, and the ranking of the total session duration.
Column Name | Type |
---|---|
session_id | integer |
user_id | integer |
session_type | string ("like", "reply", "retweet") |
duration | integer (in minutes) |
start_date | timestamp |
session_id | user_id | session_type | duration | start_date |
---|---|---|---|---|
6368 | 111 | like | 3 | 12/25/2021 12:00:00 |
1742 | 111 | retweet | 6 | 01/02/2022 12:00:00 |
8464 | 222 | reply | 8 | 01/16/2022 12:00:00 |
7153 | 111 | retweet | 5 | 01/28/2022 12:00:00 |
3252 | 333 | reply | 15 | 01/10/2022 12:00:00 |
user_id | session_type | ranking |
---|---|---|
333 | reply | 1 |
222 | reply | 2 |
111 | retweet | 1 |
To see how to solve this question using a subquery go check out this Twitter SQL Interview Question on Data Lemur!
A NULL value represents a missing or unknown value. It is different from a zero or a blank space, which are actual values. It is important to handle NULL values properly in SQL because they can cause unexpected results if not treated correctly.
You are given the following tables:
table that contains information of the users:
user_id | username |
---|---|
1 | @jack |
2 | @elonmusk |
3 | @billgates |
table that contains information of the tweets:
tweet_id | user_id | content | timestamp |
---|---|---|---|
101 | 1 | "Hello Twitter!" | 2022-08-01 10:00:00 |
102 | 2 | "Launch successful!" | 2022-08-01 11:00:00 |
103 | 1 | "Acquiring new company..." | 2022-08-02 09:30:00 |
104 | 3 | "Reading a great book." | 2022-08-02 15:00:00 |
105 | 2 | "Mars, here we come!" | 2022-08-03 13:00:00 |
Write a SQL query to find the total number of tweets and the rank of total tweets for each user as of a specific timestamp (for example, '2022-08-02 12:00:00'). Order results by the total number of tweets in descending order, and if two users have the same number of tweets, order them by username in ascending order.
username | num_tweets | rank |
---|---|---|
@jack | 2 | 1 |
@elonmusk | 1 | 2 |
@billgates | 1 | 3 |
In this query, we first join the table and table on and then filter for tweets with timestamp less than or equal to the given date and time. The result is then grouped by . Within the select clause, we count the number of tweets per user and calculate the rank of each user based on the number of tweets they made. If the number of tweets is the same, the users are ordered by their username in ascending order. Finally, we order the overall result by the number of tweets in descending order, and by username in ascending order if the number of tweets is the same.
Pro Tip: Window functions are a frequent SQL interview topic, so practice every window function problem on DataLemur
A foreign key is a field in a database table that serves as a reference to the primary key of another table, allowing for the creation of a relationship between the two tables.
For a concrete example, let's inspect employee data from Twitter's HR database:
employee_id | first_name | last_name | manager_id |
---|---|---|---|
1 | Aubrey | Graham | 3 |
2 | Marshal | Mathers | 3 |
3 | Dwayne | Carter | 4 |
4 | Shawn | Carter |
In this table, is the primary key, and is used to uniquely identify each row.
could be a foreign key. It references the of the manager of each employee in the table, establishing a relationship between the employees and their managers. This foreign key allows you to easily query the table to find out who a specific employee's manager is, or to find out which employees report to a specific manager.
It is possible for a table to have multiple foreign keys that reference primary keys in different tables. For instance, the table could have additional foreign keys for the of the department where an employee works and the of the employee's location.
For a social media platform like Twitter, one critical metric might be an understanding of how often tweets are retweeted. As part of your data analytics team, we want you to determine the average number of retweets for each user in our "Twitter" database. To simplify this, you can assume you're working with a month's data.
Sample data:
tweet_id | user_id | tweet_date | retweets |
---|---|---|---|
101 | 1 | 06/01/2022 10:00:00 | 5 |
102 | 1 | 06/05/2022 14:24:00 | 3 |
103 | 2 | 06/07/2022 18:00:00 | 12 |
104 | 3 | 06/12/2022 12:30:00 | 4 |
105 | 2 | 06/27/2022 19:45:00 | 8 |
user_id | avg_retweets |
---|---|
1 | 4.00 |
2 | 10.00 |
3 | 4.00 |
This query calculates the average number of retweets for each user by grouping on the column and then applying the AVG function to the retweets column. Grouping by ensures that the average calculation is performed separately for each Twitter user. The result is a table that shows the average number of retweets for each user.
To practice a very similar question try this interactive Twitter Histogram of Tweets Question which is similar for requiring analysis of user tweets or this Twitter Tweets' Rolling Averages Question which is similar for involving tweet data aggregation per user.
A non-correlated sub-query is not linked to the outer query and can be run on its own. It does not reference any columns from the outer query and is used to retrieve additional data for the outer query. On the other hand, a correlated sub-query is one that relies on the outer query and cannot be run independently. It uses the outer query to filter or transform data by referencing a column from it, while the outer query uses the results of the inner query.
Here is an example of a non-correlated sub-query:
The sub-query in this case is non-correlated, as it does not reference any columns from the outer query.
Here is an example of a correlated sub-query:
This query selects the and total sales of all Twitter customers in the sales table whose are greater than the average of their own customer group. The sub-query in this case is correlated with the outer query, as it references the column from the outer query ().
Non-correlated sub-queries are typically faster, as they only have to be executed once, wheareas correlated sub-queries are slower since they have to be re-evaluated for each row of the outer query.
Twitter uses advertisement campaigns to generate revenue and engage users. Assume that you are a data analyst at Twitter, and your task is to calculate the click-through rates for different advertisement campaigns. Specifically you need to find the , and for each campaign where is calculated as the total divided by the total multiplied by 100.
Here are the sample tables:
campaign_id | name |
---|---|
1 | Campaign A |
2 | Campaign B |
3 | Campaign C |
ad_id | campaign_id | clicks | number_of_impressions |
---|---|---|---|
101 | 1 | 450 | 2300 |
102 | 2 | 120 | 1900 |
103 | 3 | 200 | 2400 |
104 | 1 | 230 | 1800 |
105 | 3 | 190 | 2600 |
This query first joins and on . It then groups the result set by and , and for each group, it calculates the click-through-rate by summing the and dividing it by the sum of , and multiplies the result by 100 to get the rate in percentage. This result gives us the click-through rates for each campaign.
To solve a similar problem about calculating rates, solve this TikTok SQL Interview Question within DataLemur's interactive coding environment:
As a data analyst at Twitter, we would like you to find out which user has the most tweets in every month. For this exercise, assume that a table took the following form:
tweet_id | user_id | tweet_date | content |
---|---|---|---|
10981 | 123 | 06/08/2022 00:00:00 | Hello World |
26502 | 322 | 06/10/2022 00:00:00 | Good Morning |
72393 | 123 | 06/18/2022 00:00:00 | This is a tweet |
99532 | 987 | 07/26/2022 00:00:00 | Loving the weather |
76417 | 322 | 07/05/2022 00:00:00 | Happy Independence Day |
Here is a PostgreSQL query to solve the problem:
This will output a table where each row represents a month and displays who tweeted the most during that month along with the count of tweets by that user. If there is a tie in the maximum count of tweets by multiple users, all of those users will be returned by this query.
A database index is a data structure that improves the speed of data retrieval operations on a database table.
There are few different types of indexes that can be used in a database:
For a concrete example, say you had a table of Twitter customer payments with the following columns:
Here's what a clustered index on the column would look like:
A clustered index on the column would determine the physical order of the records in the table based on the . This means that the records with the earliest values would be stored together physically in the table, followed by records with later payment_date values. This speeds up queries that filter or sort the data based on the , as the records are already physically sorted in the table. For example, if you want to retrieve all the payments made in the month of June, the database can use the clustered index to quickly locate and retrieve the desired records without having to search through the entire table.
Given a table named that contains data of all Twitter users, find all usernames that start with the letter 'a'.
user_id | username | signup_date | followers |
---|---|---|---|
1 | apple | 2015-06-26 | 30420 |
2 | ball | 2016-07-12 | 50000 |
3 | antic | 2019-09-02 | 21000 |
4 | arc | 2018-03-14 | 35000 |
5 | zebra | 2019-11-30 | 20000 |
username |
---|
apple |
antic |
arc |
This retrieves all usernames from the table where the username starts with 'a'. The SQL keyword is used to filter records, and the '%' after 'a' is a wildcard that matches any sequence of characters. So 'a%' will match any string that starts with 'a'.
The key to acing a Twitter SQL interview is to practice, practice, and then practice some more! In addition to solving the earlier Twitter SQL interview questions, you should also solve the 200+ SQL exercises on DataLemur which come from companies like Google, Microsoft and Silicon Valley startups.
Each SQL question has hints to guide you, step-by-step solutions and best of all, there's an online SQL code editor so you can easily right in the browser your SQL query answer and have it graded.
To prep for the Twitter SQL interview it is also a great idea to practice interview questions from other tech companies like:
However, if your SQL foundations are weak, forget about diving straight into solving questions – refresh your SQL knowledge with this free SQL tutorial.
This tutorial covers things like CASE/WHEN/ELSE statements and functions like SUM()/COUNT()/AVG() – both of these show up routinely in Twitter SQL interviews.
Explore exciting career opportunities at Twitter and be part of shaping the future of social media!
Beyond writing SQL queries, the other topics covered in the Twitter Data Science Interview include:
To prepare for the Twitter Data Science interview have a strong understanding of the company's culture and values – this will be important for acing the behavioral interview. For the technical Data Science interviews, get ready by reading Ace the Data Science Interview. The book's got: