So far in the SQL tutorial we've only been analyzing data from one table at a time. However, in the real world, companies have databases containing thousands of tables. To combine multiple tables, and analyze their data at the same time, we can write a JOIN SQL query – the focus of this tutorial.
For example, let's pretend we worked at Spotify, were we'd have a table for albums, a table for artists, a table to represent users, and the list 🎵 goes on and on and on🎵 (sung to the tune of Don't Stop Believin').
As a Data Analyst or Data Scientist, you'll almost always be pulling data in from multiple of the these tables, which is why the SQL JOIN is crucial.
For example at Spotify you might analyze Drake 🐐 by writing queries which combine two tables like:
Now that you understand the real-world use case of a , let's write a specific SQL join query using Spotify data.
Suppose you worked at Spotify, and wanted to join information in the artists table with information in the songs table. Here's the schema for the two tables:
|101||Ed Sheeran||Warner Music Group|
|120||Drake||Warner Music Group|
|125||Bad Bunny||Rimas Entertainment|
|45202||101||Shape of You|
Notice how the two tables have the column in common:
Now, let's actually build our SQL JOIN query to combine the artists and songs table.
There are three key parts to get right when writing a in SQL. First, you need to specify what columns to show. For our artists x songs example, we'll just use for that.
Secondly, you need to specify what are the names of the two tables we are joining. Since we are joining the artists with songs table, our query so far is as follows:
The third and final in a SQL JOIN query is the clause, which explains to the RDBMS how the two tables relate to each other.
We need to explicitly write in the SQL query that the column in the artists table matches up against the column in the songs table with this clause:
Putting these 3 parts together, we get the following SQL JOIN query:
The above query combines info from the artists AND songs table into the result below:
|101||Ed Sheeran||Warner Music Group||55511||101||Perfect|
|101||Ed Sheeran||Warner Music Group||45202||101||Shape of You|
|120||Drake||Warner Music Group||22222||120||One Dance|
|120||Drake||Warner Music Group||19960||120||Hotline Bling|
|125||Bad Bunny||Rimas Entertainment||12636||125||Mia|
Now, let's practice writing a on a different dataset.
Suppose you work as a Data Scientist at the stock-trading app Robinhood. Assume you're given access to a table called which contains information about trades placed on the platform, and a table called which has information about a specific user.
Here's what the data looks like in both tables:
|111||San Franciscofirstname.lastname@example.org||08/03/2021 12:00:00|
|178||San Franciscoemail@example.com||01/05/2022 12:00:00|
|300||San Franciscofirstname.lastname@example.org||06/30/2022 12:00:00|
Write a SQL query to join the trades and users table.
The output should look something like this:
|100102||111||10||Completed||08/17/2022 12:00:00||10.00||111||San Franciscoemail@example.com||08/03/2021 12:00:00|
|100101||111||10||Cancelled||08/17/2022 12:00:00||9.80||111||San Franciscofirstname.lastname@example.org||08/03/2021 12:00:00|
|100900||148||50||Completed||07/14/2022 12:00:00||9.78||148||Bostonemail@example.com||08/20/2021 12:00:00|
|100259||148||35||Completed||08/25/2022 12:00:00||5.10||148||Bostonfirstname.lastname@example.org||08/20/2021 12:00:00|
|100264||148||40||Completed||08/26/2022 12:00:00||4.80||148||Bostonemail@example.com||08/20/2021 12:00:00|
|100777||178||60||Completed||07/25/2022 17:47:00||3.56||178||San Franciscofirstname.lastname@example.org||01/05/2022 12:00:00|
Be sure to complete this SQL JOIN practice exercise, as it's meant to warm you up before you solve a real SQL interview question using the same data in the next section.
A Data Scientist interviewing at Robinhood was asked this SQL interview question where you need to write a SQL JOIN query to find the top three cities that have the highest number of completed trade orders.
Your output should look like this:
Besides needing a , you'll also need and commands to solve this interview question.
So far in this tutorial, we've simply use the keyword, which is short for an . There's actually 4 different types of JOINS in SQL, which we'll now cover in detail.
|Type of Joins||Description|
|INNER JOIN||Returns only the rows with matching values from both tables.|
|LEFT JOIN||Returns all the rows from the left table and the matching rows from the right table.|
|RIGHT JOIN||Returns all the rows from the right table and the matching rows from the left table.|
|FULL OUTER JOIN||Returns all rows when there is a match in either the left or the right table. If there is no match, values are returned for columns from the table without a match.|
We'll be using the Goodreads Books dataset to demonstrate these different types of JOINs.
Picture this: you want to find out how many books priced at $20 and above have been ordered, and who the buyers are.
The returns only the rows where there is a matching book ID in both the and tables, focusing on books with prices $20 and above.
Next, let's explore how the works using our and tables.
Suppose we want to retrieve all the orders along with their corresponding deliveries information. Here's the query:
With , all rows from the left table () are fetched, along with matching rows from the right table (). If there is no matching data in the right table, the result will still include the left table's data, with values in the columns from the right table.
In our example, we're displaying a subset of 10 rows out of 20:
But what if we flip the tables? By swapping the positions of the tables in the , the same result can be achieved:
Notice how order IDs 2005, 2008, and 2010 are missing from the results? Since these order IDs are absent in the table, there are no corresponding matches in the table. This results in the values you see.
A is the opposite of a . It returns all rows from the right table and pairs them with matching rows from the left table. If there is no match, the columns from the left table have values.
Run the following query:
You can also swap the positions of the tables, and you'll get the exact same results as a .
💡A little tip: In the real world, is rarely used. Most people naturally think from left to right, making easier to understand and implement. So, if you find yourself in a situation where you might consider a , simply switch the positions of the tables and use a instead!
Now, let's explore the which allows you to bring unmatched rows into the results.
Ever wondered what happens when both tables have something to offer, but they're not a perfect match?
A returns all rows when there is a match in either the left or the right table. If there is no match, values are returned for columns from the table without a match.
Running this query is like uniting two tables: and . Each row represents an attempt at a match. If there's no match, you'll see values in the columns from the table without a match.
The first 20 rows are expected (we're just showing you the first 5 rows):
Scroll down, and you'll encounter rows with order IDs (marked in the red box). The ensures that even if there isn't an to match these rows, they're still include in the results:
💡 A little tip: Be careful with as it can introduce duplicate rows to your results without you realizing!
Now that we've learned to join tables, there's one last thing we need to cover in order to finish the intermediate SQL tutorial: handling dates & timestamps.
DATE FUNCTIONS 📅