Histogram of Tweets
This is the same question as problem #6 in the SQL Chapter of Ace the Data Science Interview!
Assume you are given the table below containing tweet data. Write a query to obtain a histogram of tweets posted per user in 2022. Output the tweet count per user as the bucket, and then the number of Twitter users who fall into that bucket.
|214252||111||Am considering taking Tesla private at $420. Funding secured.||12/30/2021 00:00:00|
|739252||111||Despite the constant negative press covfefe||01/01/2022 00:00:00|
|846402||111||Following @NickSinghTech on Twitter changed my life!||02/14/2022 00:00:00|
|241425||254||If the salary is so competitive why won’t you tell me what it is?||03/01/2022 00:00:00|
|231574||148||I no longer have a manager. I can't be managed||03/23/2022 00:00:00|
Explanation: 2 users fall under the 1 tweet bucket whereas 1 user is in the 2 tweets bucket.
First, we obtain the number of tweets for each user in 2022 and group them by the user id.
We can interpret the output as in the year 2022, user 111 has tweeted twice and users 148 and 254 have tweeted only one time.
Next, we use the query above as a subquery, then we use the number of tweets field as the tweet bucket and retrieve the number of users.
Solution #1: Using Subquery
Below is another method to solve this question using CTE.
A CTE is a temporary data set to be used as part of a query. It exists during the execution of the query. A subquery is a nested query. It’s a query within a query and unlike CTE, it can be used within that query only.
The advantages of using CTE are it is reusable during the entire session and more readable, whereas a subquery can be used in the and clauses and act as a column with a single value.
Solution #2: Using CTE