At Kakao, SQL is often used to analyze user behavior data for targeted marketing and to dig into KakaoTalk product analytics data. That's why Kakao frequently asks SQL query questions during interviews for Data Science and Data Engineering positions.
To help you prepare for the Kakao SQL interview, this blog covers 10 Kakao SQL interview questions – how many can you solve?
As a representative for Kakao, you have been supplied with a dataset providing detailed information about the usage habits of your users over the past month. The dataset is contained in a PostgreSQL table named 'user_activity'. Given below is a sample of the data:
activity_id | user_id | activity_date | service_id | duration_seconds |
---|---|---|---|---|
7241 | 1001 | 2022-08-01 08:30:00 | K01 | 1800 |
8347 | 1002 | 2022-08-01 09:00:00 | K02 | 2400 |
6025 | 1001 | 2022-08-01 15:30:00 | K01 | 1800 |
7218 | 1002 | 2022-08-02 08:00:00 | K02 | 3600 |
8360 | 1003 | 2022-08-02 10:00:00 | K01 | 1200 |
Your task is to write a SQL query that finds, for each service and each day, the top 3 users with the longest total usage duration. You will need to make use of SQL window functions to solve this problem.
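One way to write this in PostgreSQL is sketched below. It works against the user_activity table shown above and derives the day by casting the activity_date timestamp to a date:

```sql
WITH daily_totals AS (
    -- total duration per user, per service, per day
    SELECT DISTINCT
           activity_date::date AS usage_date,
           service_id,
           user_id,
           SUM(duration_seconds) OVER (
               PARTITION BY activity_date::date, service_id, user_id
           ) AS total_duration
    FROM user_activity
),
ranked AS (
    -- rank users within each (day, service) partition by total duration
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY usage_date, service_id
               ORDER BY total_duration DESC
           ) AS usage_rank
    FROM daily_totals
)
SELECT usage_date, service_id, user_id, total_duration
FROM ranked
WHERE usage_rank <= 3
ORDER BY usage_date, service_id, usage_rank;
```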
In the answer, we use window functions to compute the total duration for each user per day for each service. Each window is defined by the PARTITION BY clause, which splits the table into partitions of rows sharing the same activity date and service_id.
Then the SUM function computes the total duration of use for each user_id within each partition. The ROW_NUMBER function assigns a unique rank to each row within a partition, ordered by the total duration in descending order.
Finally, we filter rows with a rank no higher than 3 (top 3). The result will be the top 3 users with the longest total duration of usage for each service each day.
To practice a related window function SQL problem on DataLemur's free interactive coding environment, try this Amazon BI Engineer interview question:
Kakao Corporation provides many services like messaging (KakaoTalk), music streaming (KakaoMusic), ride-hailing (KakaoT), and a digital wallet (KakaoPay). These services can affect each other's engagement. For instance, users who chat about a newly released album on KakaoTalk might decide to listen to it later on KakaoMusic.
You've been given two tables: one that logs each user interaction with a service, and one that maps each service_id to a service name. Their schemas are shown below:
interaction_id | user_id | service_id | interaction_date | interaction_duration(mins) |
---|---|---|---|---|
001 | 123 | A | 06/08/2022 | 10 |
002 | 265 | B | 06/10/2022 | 5 |
003 | 362 | A | 06/18/2022 | 15 |
004 | 192 | C | 07/26/2022 | 20 |
005 | 981 | B | 07/05/2022 | 5 |
service_id | service_name |
---|---|
A | KakaoTalk |
B | KakaoMusic |
C | KakaoT |
D | KakaoPay |
Assume all interactions are unique. Write a SQL query to find the average interaction duration for each service.
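A minimal sketch in PostgreSQL, assuming the two tables are named interactions and services and that the duration column is simply named interaction_duration:

```sql
SELECT s.service_name,
       AVG(i.interaction_duration) AS avg_duration_mins
FROM interactions AS i
JOIN services AS s
  ON i.service_id = s.service_id
GROUP BY s.service_name
ORDER BY avg_duration_mins DESC;
```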
This query first joins the interactions table with the services table on the service_id field. The query then groups the data by service_name. For each service grouping, it calculates the average duration of all interactions related to that service, providing useful insight into which services are more engaging to users.
The primary key of a table is a column or set of columns that serves as a unique identifier for each row. It ensures that all rows are distinct and does not allow null values.
For example, say you had stored some data on Facebook ad campaigns that Kakao ran:
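A sketch of what such a table might look like; the ad_campaigns table and its columns here are purely illustrative:

```sql
CREATE TABLE ad_campaigns (
    campaign_id   INTEGER PRIMARY KEY,   -- unique identifier for each campaign
    campaign_name VARCHAR(255),
    start_date    DATE,
    end_date      DATE,
    budget        NUMERIC(12, 2)
);
```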
The campaign_id column uniquely identifies each row in the table, and the PRIMARY KEY constraint ensures that no two rows have the same campaign_id. This helps maintain the integrity of the data in the table by preventing duplicate rows.
The primary key is also an important part of the table because it allows you to easily identify and reference specific campaigns in your Facebook Ad data. You can use it to join to other tables in the database, such as a table containing data on the results of the campaigns.
Kakao is interested in finding out which of their users are the most active on the platform. They would like a list of users who have sent more than 500 messages in the last 3 months and are part of at least 10 different chat groups.
Please use the following tables for this analysis:
message_id | user_id | send_date | group_id | content |
---|---|---|---|---|
6181 | 541 | 12/15/2022 | 7032 | Hello everyone |
9842 | 681 | 12/16/2022 | 4321 | What's the update on the project? |
5393 | 541 | 12/18/2022 | 7032 | We have a meeting at 5pm today |
8666 | 982 | 12/20/2022 | 8901 | Done with the task |
5758 | 541 | 12/22/2022 | 4321 | Here is the presentation for tomorrow |
user_id | group_id |
---|---|
541 | 7032 |
681 | 4321 |
982 | 8901 |
541 | 4321 |
541 | 8796 |
Write an SQL query to find all users who have sent more than 500 messages in the last 3 months and are part of at least 10 different chat groups.
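One way to write this in PostgreSQL is sketched below. The table names messages and group_members are assumptions, and "the last 3 months" is measured relative to the current date:

```sql
SELECT m.user_id
FROM messages AS m
JOIN group_members AS g
  ON m.user_id = g.user_id
WHERE m.send_date >= CURRENT_DATE - INTERVAL '3 months'
GROUP BY m.user_id
HAVING COUNT(DISTINCT m.message_id) > 500
   AND COUNT(DISTINCT g.group_id) >= 10;
```

Because the join repeats each message row once per group membership, both conditions use COUNT(DISTINCT ...) so the counts aren't inflated by the join.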
This query joins the messages table with the group membership table on user_id. It then filters to only the messages sent in the last 3 months. The GROUP BY clause groups the data by user_id, and the HAVING clause keeps only the users who have sent more than 500 messages and are part of at least 10 different groups.
Database denormalization is when you add redundancy to a database and break typical normalization rules (specified by the 1st, 2nd, and 3rd normal forms). There are a few reasons to denormalize a database:
Improved performance: Denormalization often reduces the number of costly join operations needed to retrieve data. This is helpful when the database is used for OLAP (Online Analytical Processing) use cases, since joins can be expensive and slow.
Simplification: Denormalization can also be used to simplify the design of a database by reducing the number of tables and relationships that need to be managed. This can make it easier to understand and maintain the database.
Ease of use: Denormalization can also make it easier for users to work with a database by providing them with a more intuitive and straightforward data model.
While denormalization can be a useful tool for improving performance and scalability, it's important to keep in mind that it can make update and delete operations more complex. This is because denormalization can create data duplicates, which can make it harder to maintain data integrity. So, to avoid any headaches, it's a good idea to start with a well-normalized database design and then consider denormalization only if it's absolutely necessary for your specific performance and scalability needs. In other words, denormalization can be a bit of a wild card, so it's best to handle it with care!
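For instance, here's a contrived sketch (illustrative table names, not Kakao's actual schema) of denormalizing a customer's city onto a purchases table so that a city-level revenue report can skip the join entirely:

```sql
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    city        VARCHAR(100)
);

CREATE TABLE purchases (
    purchase_id   INTEGER PRIMARY KEY,
    customer_id   INTEGER REFERENCES customers (customer_id),
    price         NUMERIC(12, 2),
    customer_city VARCHAR(100)  -- denormalized copy of customers.city
);

-- The reporting query no longer needs a join:
SELECT customer_city, SUM(price) AS total_revenue
FROM purchases
GROUP BY customer_city;
```

The trade-off: whenever a customer moves, customer_city has to be updated in both tables, which is exactly the kind of integrity risk described above.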
Kakao sells various digital products, and each product has multiple transactions every day. Can you write a SQL query to find the average monthly revenue of each product, based on the data in the "transactions" table?
Here is the sample data:
transaction_id | product_id | transaction_date | revenue |
---|---|---|---|
6171 | 50001 | 06/08/2022 00:00:00 | 500 |
7802 | 69852 | 06/10/2022 00:00:00 | 700 |
5293 | 50001 | 06/18/2022 00:00:00 | 550 |
6352 | 69852 | 07/26/2022 00:00:00 | 710 |
4517 | 69852 | 07/05/2022 00:00:00 | 450 |
product | month | avg_revenue |
---|---|---|
50001 | 6 | 525.00 |
69852 | 6 | 700.00 |
69852 | 7 | 580.00 |
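One way to write this in PostgreSQL is sketched below, using the transactions table shown above (ROUND is applied just to match the two-decimal formatting of the expected output):

```sql
SELECT product_id AS product,
       EXTRACT(MONTH FROM transaction_date) AS month,
       ROUND(AVG(revenue)::numeric, 2) AS avg_revenue
FROM transactions
GROUP BY product_id, EXTRACT(MONTH FROM transaction_date)
ORDER BY product, month;
```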
In this query, we use GROUP BY along with the AVG function, as the question asks. AVG computes the average 'revenue' for each 'product_id' in each month, and the EXTRACT function pulls the month out of 'transaction_date'. The result is ordered by 'product' and 'month'.
Database normalization is helpful because it improves the performance, flexibility, and scalability of a database. Some specific benefits of normalization include:
Reducing Redundancy: Normalization helps to minimize redundancy by breaking down a larger general table into smaller, more specific tables. This can often reduce the amount of data that needs to be accessed for particular queries.
Improving Data Integrity: Normalization helps to ensure the integrity of the data by minimizing the risk of data inconsistencies and anomalies. It does this by establishing clear relationships between the tables via primary and foreign keys.
Increasing Flexibility: Normalization makes it easier to modify the structure of the database, as there is less redundancy, so it allows you to make changes to one table without affecting others. This makes it easier to adapt the database to changing business needs (and change is basically constant at Kakao!)
Improving Performance: Normalization can improve the performance of the database by reducing the amount of data that needs to be stored and accessed in a single table. This can result in faster query times and better overall performance.
Easier Data Modeling: Normalization can make it easier to model the data in a logical and intuitive way, since less random data is commingled together, which improves the overall database design.
You're working with Kakao's customer records and you're asked to find all the users who live in Seoul and have an '@kakao.com' email address. For this problem, consider the following table schema for the 'users' table.
user_id | email | location |
---|---|---|
123 | john@kakao.com | Seoul |
265 | mary@gmail.com | Busan |
362 | paul@kakao.com | Seoul |
192 | linda@kakao.com | Daegu |
981 | kim@kakao.com | Seoul |
Write a SQL query that will filter down the 'users' table and return all records with a location of 'Seoul' and an email address that ends with '@kakao.com'.
Hint: Use the SQL keyword LIKE in your where clause to match the specific string pattern for the email address.
Here is a PostgreSQL query that can be used to solve this problem (sketched assuming the email column is simply named email):
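```sql
SELECT *
FROM users
WHERE location = 'Seoul'
  AND email LIKE '%@kakao.com';
```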
This query will return all user records where the location is 'Seoul' and the email address ends with '@kakao.com'. The '%' symbol in the LIKE clause acts as a wildcard that matches any sequence of characters. Thus, '%@kakao.com' will match any string that ends with '@kakao.com'.
Kakao wants to analyze customer purchases and the total revenue generated per city to create a growth strategy. Currently, all user data and purchase activity is stored in two separate tables: users and orders. The users table includes a unique user_id and the residing city of each user. The orders table stores all purchase data, with each order having a unique order_id, the user_id of the customer who made the purchase, the purchase_date, and the price of each purchase.
Write a SQL query to find the total revenue per city in June 2022.
user_id | city |
---|---|
123 | Seoul |
265 | Busan |
362 | Daegu |
192 | Incheon |
981 | Seoul |
order_id | user_id | purchase_date | price |
---|---|---|---|
5041 | 123 | 06/08/2022 00:00:00 | 50000 |
6712 | 265 | 06/10/2022 00:00:00 | 69800 |
3923 | 362 | 06/18/2022 00:00:00 | 85000 |
7352 | 192 | 07/26/2022 00:00:00 | 64800 |
8157 | 981 | 07/05/2022 00:00:00 | 68000 |
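One way to write this in PostgreSQL is sketched below; the table names users and orders are assumptions based on the description above:

```sql
SELECT u.city,
       SUM(o.price) AS total_revenue
FROM orders AS o
JOIN users AS u
  ON o.user_id = u.user_id
WHERE EXTRACT(YEAR FROM o.purchase_date) = 2022
  AND EXTRACT(MONTH FROM o.purchase_date) = 6
GROUP BY u.city
ORDER BY total_revenue DESC;
```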
This query first joins the users and orders tables using user_id. It then filters the data to only include orders made in June 2022 using the EXTRACT function on purchase_date. Finally, it groups by city and calculates the total revenue for each city using the SUM function.
Because joins come up so often during SQL interviews, try this interactive Snapchat JOIN SQL interview question:
While both types of indexes improve the performance of SQL queries by providing a faster way to lookup rows of data, a unique index enforces the uniqueness of the indexed columns while a non-unique index allows duplicate values in the indexed columns.
Suppose you had a table of Kakao employees. Here's an example of a unique index on the employee ID column (the table and column names below are illustrative):
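```sql
-- Illustrative names: a kakao_employees table with an employee_id column
CREATE UNIQUE INDEX idx_employee_id
ON kakao_employees (employee_id);
```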
This index would ensure that no two Kakao employees have the same employee_id, which could be used as a unique identifier for each employee.
Here's an example of a non-unique index on the job title column:
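```sql
-- Again with illustrative names: a non-unique index on job_title
CREATE INDEX idx_job_title
ON kakao_employees (job_title);
```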
This index would not enforce uniqueness, but it could be used to improve the performance of queries that filter or sort the data based on the job_title column. For example, if you want to quickly retrieve all Data Scientists, the database can use the index to efficiently locate the desired records without having to do a full table scan of all Kakao employees.
The best way to prepare for a SQL interview, besides making sure you have strong SQL fundamentals, is to practice a ton of real SQL questions that were asked in recent job interviews. Besides solving the above Kakao SQL interview questions, you should also solve the 200+ DataLemur SQL Interview Questions which come from companies like Microsoft, Google, Amazon, and tech startups.
Each interview question has hints to guide you, fully explained answers, and a discussion board to see how others solved it. Best of all, there's an interactive SQL code editor so you can write your SQL query answer right in the browser and have it checked.
To prep for the Kakao SQL interview, it would also be wise to solve SQL problems from other tech companies like:
But if your SQL skills are weak, don't worry about going right into solving questions – improve your SQL foundations with this SQL interview tutorial.
This tutorial covers SQL concepts such as WHERE with AND/OR/NOT and INTERSECT/EXCEPT – both of these topics pop up routinely in SQL job interviews at Kakao.
In addition to SQL interview questions, the other question categories covered in the Kakao Data Science Interview are:
The best way to prepare for Kakao Data Science interviews is by reading Ace the Data Science Interview. The book's got: