A beginner-friendly SQL journey through real problems. It covers JOIN, GROUP BY, window functions, and how we actually think with data♦Early on, my learning as a data analyst was limited to Excel.
You know… just the basics like filtering rows, dragging formulas, and quickly creating pivot tables.
As the volume of data grew, everything started to slow down. Things got weirder and weirder. What used to be simple questions turned into complicated workarounds. Then suddenly, Excel just wasn’t enough anymore.
That’s when I started learning SQL for data analysis.
It felt simple. Just a single query… just at the beginning ;>
SELECT * FROM orders;
♦And.. Boommm, a lot of data showed.
But isn’t that just too much data to look at all at once?
It became difficult for me to understand patterns and relations between variables, or even just to see what was wrong with the data.
Where Did the Data Come From?Before jumping into the queries, fyi, I worked with a public Superstore sales dataset. This contains transactional data like orders, customers, products, and sales performance. This dataset is ideal for practicing SQL in data analysis because it reflects real-world business questions. But here, I converted the original flat table into three relational tables named customers (add new variables customer_id, orders, and products). For this project, I’m using PostgreSQL for the querying.
When you follow along with my learning process, you don’t need to use the same dataset as I did. More important is how we approach the business problems here that need to be solved.
It Started with a “SELECT”My first task actually seemed easy.
Can you get all orders from January?
SELECT *
FROM order
WHERE order_date >= '2013-01-01'
AND order_date < '2013-02-01';
♦As data analysts, we’re often asked to provide monthly sales details. In this case, our store has 709 sales records for January 2013.
When there is a lot of data, it can take a long time, so your boss might ask questions like“Why is this query taking so long?”.
I found out something new, SELECT * isn’t always a solution. And, avoid using unnecessary columns, so the database doesn’t have to read and send as much data, which can make queries run quicker.
So I changed it:
SELECT order_id, order_date, sales
FROM orders
WHERE order_date >= '2013-01-01'
AND order_date < '2013-02-01';
♦Same data, less load, and faster. As you can see, the query time went from 0.218 seconds to 0.166 seconds. It might not seem like a huge difference here, but imagine if we started with millions of records and needed to display thousands of them.
Because now, the database reads fewer columns (less disk I/O). And i felt like “Owhhh… I get it now, it makes my work more efficient.”
Then They Asked for “Just a Summary.”“Can you total the sales per month in 2014?”
Sounds normal, doesn’t it?
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(sales) AS total_sales
FROM orders
WHERE year = 2014
GROUP BY 1
ORDER BY 1;
♦When I first started learning SQL, I and maybe you too (right now hehe) might be a little confuse about what exactly DATE_TRUNCdoes?
It turns out it’s quite simple. Instead of storing the full date (like 2023-01-15), the DATE_TRUNC('month', order_date) function will truncate it to just the month, like for example, if a sale was made on 2014-01-15 it will be converted to the start of the month2014-01-01 to keep things clear and the format easy to read.
So now, all dates in January become the same value. Which means… SQL can group them together. And that’s exactly why GROUP BY works here and then adds them up by using SUM(sales) . Instead of raw transactions, we now receive total sales per month for 2014.
Then the Data Started Looks WeirdNext request.
“Can you include customer names?”
SELECT
o.order_id,
o.order_date,
o.sales,
c.customer_name,
c.segment,
c.country
FROM orders o
JOIN customers c
ON o.customer_id = c.customer_id;
♦We’re just combining two tables:
- orders → transactional data
- customers → customer details
And the JOIN condition o.customer_id = c.customer_id. Basically tells SQL “Match each order with its customer id”.
But “Why are the numbers duplicated?”
Because JOIN in SQL, it doesn’t just “add columns to” but combines rows. So if the join condition isn’t perfectly aligned, you might accidentally multiply rows.
And that means duplicate data, inflated totals, and… completely misleading insights. However, since our context here is only displaying orders by customer name, it’s safe because we need the details regardless of how many transactions that customer has. It would be a different story if we had to display total sales (because if we’re not careful, the data will just keep getting added over and over).
Then Came the Hard Question“What’s the top-selling product each month?”
Okay… this one will be tricky. My first instinct is
SELECT
DATE_TRUNC('month', order_date) AS month,
product_id,
MAX(sales)
FROM orders
GROUP BY 1, product_id;
♦It looks right. But it’s actually not. Let’s break it down:
- GROUP BY month, product_id→ groups data per product per month
- MAX(sales)→ gives the highest transaction value for each product
So what we actually get is the max sale per product. Not the top product per month.
And honestly, this is one of those moments where I go like “Hmmm wait… why is this wrong?”
This time for “Window Functions”This is where I discovered that one of the most powerful tools in SQL for data analysis is window functions.
SELECT *
FROM (
SELECT
DATE_TRUNC('month', order_date) AS month,
product_id,
SUM(sales) AS total_sales,
ROW_NUMBER() OVER (
PARTITION BY DATE_TRUNC('month', order_date)
ORDER BY SUM(sales) DESC
) AS rank
FROM orders
GROUP BY 1, product_id
) t
WHERE rank = 1;
♦Let’s discuss them one by one together.
Step 1. Aggregate firstSUM(sales)
GROUP BY month, product_id
Total sales per product each month.
Step 2. Rank within each monthROW_NUMBER() OVER (
PARTITION BY month
ORDER BY total_sales DESC
)
Here’s the main point of our query before.
- PARTITION BY month → reset ranking every month.
- ORDER BY total_sales DESC → highest sales = rank 1.
- ROW_NUMBER() → assigns ranking (1, 2, 3, etc).
Step 3 . Pick the top oneWHERE rank = 1
We got the Top-selling product for each month.
Then They Wanted Trends“Is this month better than last month?”
SELECT
month,
total_sales,
LAG(total_sales) OVER (ORDER BY month) AS prev_month_sales,
total_sales - LAG(total_sales) OVER (ORDER BY month) AS growth
FROM (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(sales) AS total_sales
FROM orders
GROUP BY 1
) t;
♦Let’s break this down too.
Step 1 . Monthly totalsSUM(sales) GROUP BY month
This is where we will sum up the sales and group them monthly.
Step 2. Look at the previous rowLAG(total_sales) OVER (ORDER BY month)
The LAG() in SQL is a window function that allows you to access data from the previous row in a dataset without using a self-join. It is commonly used in time-series analysis and trend calculations. This expression tells the database to look at the value total_sales from the previous row, based on the order of month.
Step 3 . Calculate growthtotal_sales - prev_month_sales
In this step calculates growth by subtracting previous month sales from the current total sales (total_sales - prev_month_sales) to measure month-over-month sales change.
Then My Queries Became a MessAt some point, I felt like my queries started to look like nested everywhere, hard to read, or slightly terrifying.
They worked… but next time it’s barely understandable.
That’s when I found CTEs (Common Table Expressions), a temporary result set that you can name and reuse inside our query.
WITH monthly_sales AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(sales) AS total_sales
FROM orders
GROUP BY 1
)
SELECT
month,
total_sales,
LAG(total_sales) OVER (ORDER BY month) AS prev_month
FROM monthly_sales;
♦Let’s see what changed. Instead of nesting everything inside one giant query, we split it into steps.
Step 1. The CTE partWITH monthly_sales AS (...)
This is the CTE. We’re basically saying, “Hey SQL, create a temporary table called monthly_sales using this query”.
For this part.
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(sales) AS total_sales
FROM orders
GROUP BY 1
So it becomes a cleaner and reusable query.
Step 2. Use it like a tableFROM monthly_sales
Instead of rewriting the same logic again (or nesting it deeply), we just use monthly_sales like a normal table.
With CTE we get same result, but with much better structure.
Woahh Done—?I used to think SQL was about writing queries. But now, I see it differently. Every step taught me something.
- Slow query → think about efficiency
- Wrong numbers → question your logic
- Hard questions → rethink your approach
- Messy queries → structure your thinking
From me,
SQL is just not about queries. But, it’s about how we think when solving problems with data in a more efficient and structured.
To explore my other projects, feel free to check out my Medium or GitHub. Stay tuned for more challenging projects~~
jihanKamilah - Overview
♦SQL Is More Than Just Queries: How to Think Like a Data Analyst was originally published in Code Like A Girl on Medium, where people are continuing the conversation by highlighting and responding to this story.