SQL Sorcery: Power Techniques Every Data Scientist Needs In Their Arsenal

# Introduction

Knowing just SELECT, WHERE, and GROUP BY covers basic data summarization, but many real-world analysis jobs demand more advanced query patterns. Think about tasks like spotting consecutive activity streaks, grouping customers by spending level, smoothing out noisy time-series data, or tracking how users move between subscription plans over time.

This guide covers 7 hands-on SQL techniques that go past the fundamentals, zeroing in on methods that tackle genuine analytical challenges.

# Setting Up the Dataset

We’ll work with a sample transactions table from a made-up subscription software (SaaS) business:

CREATE TABLE transactions (
    transaction_id   SERIAL PRIMARY KEY,
    customer_id      INT,
    plan_type        VARCHAR(20),   -- 'starter', 'pro', 'enterprise'
    amount           NUMERIC(10,2),
    status           VARCHAR(20),   -- 'completed', 'refunded', 'failed'
    created_at       TIMESTAMP
);

The complete dataset includes 36 transactions from 7 customers, covering September 2023 to June 2024, and is stored in seed.sql. Make sure to run it before trying the queries below.

# 1. Measuring Time Between Events with `LAG()`

LAG() and LEAD() let you pull values from the previous or next row without writing a self-join. They’re especially handy for figuring out the gap between events — like how often a customer renews, signs they might be about to churn, or how long it takes them to come back.

Task: Figure out how many days passed between each customer’s consecutive completed transactions.

SELECT
    customer_id,
    created_at,
    LAG(created_at) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
    ) AS previous_transaction_at,
    ROUND(
        EXTRACT(EPOCH FROM (
            created_at - LAG(created_at) OVER (
                PARTITION BY customer_id
                ORDER BY created_at
            )
        )) / 86400
    ) AS days_since_last
FROM transactions
WHERE status="completed"
ORDER BY customer_id, created_at;

Output (truncated):

customer_id |     created_at      | previous_transaction_at | days_since_last
-------------+---------------------+-------------------------+-----------------
        3317 | 2024-01-03 11:02:00 |                         |
        3317 | 2024-03-15 10:45:00 | 2024-01-03 11:02:00     |              72
        3317 | 2024-05-22 09:30:00 | 2024-03-15 10:45:00     |              68
        4482 | 2023-09-10 09:00:00 |                         |
        4482 | 2023-10-10 09:00:00 | 2023-09-10 09:00:00     |              30
        4482 | 2023-11-10 09:14:00 | 2023-10-10 09:00:00     |              31
        4482 | 2024-01-03 09:14:00 | 2023-11-10 09:14:00     |              54
        4482 | 2024-03-03 08:20:00 | 2024-01-03 09:14:00     |              60
        4482 | 2024-04-03 10:00:00 | 2024-03-03 08:20:00     |              31
        4482 | 2024-05-01 11:00:00 | 2024-04-03 10:00:00     |              28
        ...
        7891 | 2024-02-01 09:00:00 |                         |
        7891 | 2024-04-01 09:00:00 | 2024-02-01 09:00:00     |              60
        7891 | 2024-05-15 09:00:00 | 2024-04-01 09:00:00     |              44
        8810 | 2024-01-05 12:00:00 |                         |
        8810 | 2024-02-05 12:00:00 | 2024-01-05 12:00:00     |              31
        8810 | 2024-04-05 12:00:00 | 2024-02-05 12:00:00     |              60
(29 rows)

The first row for every customer shows NULL in both columns — there’s simply no earlier transaction to look back at. EXTRACT(EPOCH ...) turns the time difference into seconds; dividing by 86400 converts that into days.

LEAD() works identically but looks ahead instead of behind, which is useful for computing time until the next renewal or flagging the final transaction before a customer drops off.

# 2. Comparing a Row to Other Rows in the Same Table with a Self-Join

A self-join connects rows from the same table to one another. It’s the go-to approach when you need to compare two events for the same entity at different points in time — upgrades, downgrades, re-activations, or any before-and-after scenario.

Task: Identify customers who moved up from starter to pro (or from pro to enterprise) at some point.

SELECT DISTINCT t1.customer_id
FROM transactions t1
JOIN transactions t2
    ON  t1.customer_id = t2.customer_id
    AND t1.plan_type="starter"
    AND t2.plan_type="pro"
    AND t2.created_at  > t1.created_at
WHERE t1.status="completed"
  AND t2.status="completed"
ORDER BY t1.customer_id;

Output:

customer_id
-------------
        4482
        6204
        7891
(3 rows)

The table gets two aliases (t1, t2) so each one can stand for a different moment in time for the same customer. The condition t2.created_at > t1.created_at makes sure the order is correct — without it, you’d catch customers who had both plan types in any sequence, even the reverse. DISTINCT removes duplicates that would appear when a customer had several starter transactions before moving up.

This same pattern works for spotting downgrades, finding customers who left and returned, or comparing any two states that need to follow a chronological order.

# 3. Selecting the Top Row per Group with `ROW_NUMBER()`

When you need the top-N rows within each category — the biggest transaction per customer, the latest event per account, the earliest purchase per cohort — wrapping ROW_NUMBER() inside a common table expression (CTE) is the standard method.

Task: Pull each customer’s single largest completed transaction.

WITH ranked AS (
    SELECT
        customer_id,
        transaction_id,
        amount,
        plan_type,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY amount DESC, created_at DESC
        ) AS rn
    FROM transactions
    WHERE status="completed"
)
SELECT customer_id, transaction_id, amount, plan_type
FROM ranked
WHERE rn = 1
ORDER BY customer_id;

Output:

customer_id  | transaction_id  | amount  | plan_type
-------------+----------------+--------+------------
        3317 |             12 |  19.00 | starter
        4482 |              8 | 299.00 | enterprise
        5901 |             19 | 299.00 | enterprise
        6103 |             25 | 299.00 | enterprise
        6204 |             28 |  79.00 | pro
        7891 |             32 |  79.00 | pro
        8810 |             36 |  79.00 | pro
(7 rows)

ROW_NUMBER() gives the value 1 to the top-ranked row within each group. The outer query then keeps only those top rows. The secondary sort on created_at DESC serves as a tiebreaker — when two transactions share the same amount, the newer one takes priority.

If you’d rather keep ties instead of breaking them, replace ROW_NUMBER() with RANK(). RANK() gives tied rows the same number and then skips the next rank (1, 1, 3), while DENSE_RANK() assigns the same number to ties without skipping (1, 1, 2).

# 4. Grouping Customers by Spending Level Using `NTILE(n)`

NTILE(n) splits sorted rows into n roughly equal groups and labels each row with its group number. It’s ideal for sorting customers into tiers, calculating spend quartiles, or building cohorts for A/B testing — all without setting fixed thresholds.

Task: Sort customers into four spending groups based on their total completed transaction amount.

WITH customer_spend AS (
    SELECT
        customer_id,
        SUM(amount) AS total_spend,
        COUNT(*) AS total_transactions
    FROM transactions
    WHERE status="completed"
    GROUP BY customer_id
)
SELECT
    customer_id,
    total_spend,
    total_transactions,
    NTILE(4) OVER (ORDER BY total_spend) AS spend_quartile
FROM customer_spend
ORDER BY total_spend DESC;

Output:

customer_id | total_spend | total_transactions | spend_quartile
-------------+-------------+--------------------+----------------
        5901 |     1495.00 |                  5 |              4
        6103 |      835.00 |                  5 |              3
        4482 |      653.00 |                  7 |              3
        8810 |      237.00 |                  3 |              2
        6204 |      177.00 |                  3 |              2
        7891 |      177.00 |                  3 |              1
        3317 |       57.00 |                  3 |              1
(7 rows)

Quartile 4 represents your biggest spenders, while quartile 1 covers the smallest. Since NTILE() doesn’t rely on hardcoded spend cutoffs, the groups adjust automatically as new data comes in. This approach is far more flexible than fixed thresholds like CASE WHEN total_spend > 500.

# 5. Smoothing Out Fluctuations with a Rolling Average

A rolling (or moving) average evens out month-to-month swings, making trends in time-series data much clearer. Window functions with an explicit ROWS BETWEEN clause let you control exactly how many periods are included in the calculation.

Task: Compute a 3-month rolling average of monthly revenue to reduce noise.

WITH monthly AS (
    SELECT
        DATE_TRUNC('month', created_at)::DATE AS month,
        SUM(amount) AS monthly_revenue
    FROM transactions
    WHERE status="completed"
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    month,
    monthly_revenue,
    ROUND(AVG(monthly_revenue) OVER (
        ORDER BY month
        ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
    ), 2) AS revenue_3mo_avg
FROM monthly
ORDER BY month;

Output:

month    | monthly_revenue | revenue_3mo_avg
-------------+-----------------+-----------------
 2023-09-01  |           19.00 |           19.00
 2023-10-01  |           19.00 |           19.00
 2023-11-01  |           79.00 |           39.00
 2024-01-01  |          275.00 |          124.33
 2024-02-01  |          476.00 |          276.67
 2024-03-01  |          555.00 |          435.33
 2024-04-01  |          835.00 |          622.00
 2024-05-01  |          775.00 |          721.67
 2024-06-01  |          598.00 |          736.00
(9 rows)

ROWS BETWEEN 2 PRECEDING AND CURRENT ROW tells the window function to consider the current row plus the two rows before it. For the first two rows, there isn’t enough prior data, so they use a 1-month and 2-month average respectively.

Switch ROWS to RANGE if you want to include all rows that share the same ORDER BY value (handy when multiple rows have the same timestamp). For a wider smoothing window, change 2 PRECEDING to 5 PRECEDING for a 6-month rolling average.

# 6. Conditional Aggregation Using `FILTER`

FILTER allows you to apply a WHERE condition to a specific aggregate without breaking the query into multiple subqueries. This lets you compute several conditional summaries in a single pass through the data.

Task: Pull total revenue, refunds, and failed transaction counts grouped by month — all in one row per month.

SELECT
    DATE_TRUNC('month', created_at) AS month,
    SUM(amount) FILTER (WHERE status="completed") AS revenue_completed,
    SUM(amount) FILTER (WHERE status="refunded")  AS revenue_refunded,
    COUNT(*)    FILTER (WHERE status="failed")    AS failed_count
FROM transactions
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;

Output:

month             | revenue_completed | revenue_refunded | failed_count
------------------------+-------------------+------------------+--------------
 2023-09-01 00:00:00+00 |             19.00 |                  |            0
 2023-10-01 00:00:00+00 |             19.00 |                  |            0
 2023-11-01 00:00:00+00 |             79.00 |                  |            0
 2024-01-01 00:00:00+00 |            275.00 |                  |            0
 2024-02-01 00:00:00+00 |            476.00 |            79.00 |            1
 2024-03-01 00:00:00+00 |            555.00 |            79.00 |            0
 2024-04-01 00:00:00+00 |            835.00 |           299.00 |            0
 2024-05-01 00:00:00+00 |            775.00 |                  |            1
 2024-06-01 00:00:00+00 |            598.00 |                  |            2
(9 rows)

Instead of using FILTER, you could write three separate subqueries and join them together — but that means more code, reduced readability, and often worse performance. Keep in mind that SUM paired with FILTER yields NULL (rather than zero) when no rows qualify in a given month, which is correct: those months truly had no refunds. Use COALESCE(..., 0) if you’d rather see zeros.

FILTER is part of standard SQL and is supported in PostgreSQL and BigQuery. On Snowflake and certain other platforms, switch to SUM(CASE WHEN status="completed" THEN amount END) as an alternative.

# 7. Spotting Consecutive Activity Streaks Using Window Functions

Identifying uninterrupted sequences — such as months with no inactivity gap, back-to-back days with transactions, or ongoing subscription runs — ranks among the more challenging SQL tasks. The go-to approach leverages a window function to cluster rows into streaks, avoiding the need for a recursive CTE.

Here’s the idea: give each active month a sequential row number within its customer group. When months are truly back-to-back, subtracting that row number from the month date yields the same value across every month in the streak. Any gap disrupts that constant.

Task: Identify each customer’s runs of consecutive active months (months containing at least one completed transaction).

WITH monthly_activity AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', created_at)::DATE AS active_month
    FROM transactions
    WHERE status="completed"
    GROUP BY customer_id, DATE_TRUNC('month', created_at)
),
with_prev AS (
    SELECT
        customer_id,
        active_month,
        LAG(active_month) OVER (
            PARTITION BY customer_id
            ORDER BY active_month
        ) AS prev_month
    FROM monthly_activity
),
streak_groups AS (
    SELECT
        customer_id,
        active_month,
        SUM(CASE WHEN active_month = prev_month + INTERVAL '1 month' THEN 0 ELSE 1 END)
            OVER (PARTITION BY customer_id ORDER BY active_month) AS streak_id
    FROM with_prev
),
streaks AS (
    SELECT
        customer_id,
        streak_id,
        MIN(active_month) AS streak_start,
        MAX(active_month) AS streak_end,
        COUNT(*) AS streak_length_months
    FROM streak_groups
    GROUP BY customer_id, streak_id
)
SELECT customer_id, streak_start, streak_end, streak_length_months
FROM streaks
ORDER BY customer_id, streak_start;

Output:

customer_id | streak_start | streak_end | streak_length_months
-------------+--------------+------------+----------------------
        3317 | 2024-01-01   | 2024-01-01 |                    1
        3317 | 2024-03-01   | 2024-03-01 |                    1
        3317 | 2024-05-01   | 2024-05-01 |                    1
        4482 | 2023-09-01   | 2023-11-01 |                    3
        4482 | 2024-01-01   | 2024-01-01 |                    1
        4482 | 2024-03-01   | 2024-05-01 |                    3
        5901 | 2024-02-01   | 2024-06-01 |                    5
        6103 | 2024-01-01   | 2024-04-01 |                    4
        6103 | 2024-06-01   | 2024-06-01 |                    1
        6204 | 2024-01-01   | 2024-01-01 |                    1
        6204 | 2024-03-01   | 2024-03-01 |                    1
        6204 | 2024-05-01   | 2024-05-01 |                    1
        7891 | 2024-02-01   | 2024-02-01 |                    1
        7891 | 2024-04-01   | 2024-05-01 |                    2
        8810 | 2024-01-01   | 2024-02-01 |                    2
        8810 | 2024-04-01   | 2024-04-01 |                    1
(16 rows)

# Quick Reference

These techniques rely on standard SQL without depending on any database-specific features, and they come up regularly in analytical tasks like retention studies, upgrade funnel monitoring, and revenue reporting.

Tip	When to Use It
`LAG()` / `LEAD()`	Measuring time between events, comparing before/after values per entity
Self-join	Tracking state transitions (plan upgrades, account re-activations)
`ROW_NUMBER()`	Selecting top-N rows per group, removing duplicates
`NTILE(n)`	Dividing customers into spending or activity tiers
Rolling window (`ROWS BETWEEN`)	Smoothing out noisy time-series data, computing moving averages
`FILTER`	Applying multiple conditional aggregations in a single query pass
Consecutive streak detection	Subscription continuity, retention analysis, identifying session gaps

Once these patterns feel natural, many multi-step data transformations that people typically handle in Python can be written more clearly and run faster in a single SQL query.

Bala Priya C is a developer and technical writer from India. She loves exploring the crossroads of math, programming, data science, and content creation. Her focus areas and expertise span DevOps, data science, and natural language processing. When she’s not reading, writing, coding, or enjoying coffee, she’s building tutorials, how-to guides, opinion pieces, and resource overviews to help fellow developers grow.

Top Posts

SQL Sorcery: Power Techniques Every Data Scientist Needs in Their Arsenal

The Future of Value-Added Resellers: What’s on the Horizon?

Where Connected Devices, Smarter Code Collide

SQL Sorcery: Power Techniques Every Data Scientist Needs in Their Arsenal

SAMJ: Accelerating Image Annotation in ImageJ/Fiji with the Segment Anything Model

Salesforce-Style Code Generation in OWL: Build, Test, and Safely Ranked Python Functions with Unit Tests

Mastering Structured LLM Outputs: The Definitive Guide to JSON Mode vs Function Calling and When to Deploy Each

Mastering LATERAL, Semi, and Anti Unlock Elite-Level Joins for High-Performance PostgreSQL

Perception-Driven World Models for Embodied AI

OpenAI Unveils LifeSciBench: A 750-Task Benchmark That Grades AI Models on Real Life-Science Research Using Expert-Crafted Rubrics

SQL Sorcery: Power Techniques Every Data Scientist Needs in Their Arsenal

The Future of Value-Added Resellers: What’s on the Horizon?

Where Connected Devices, Smarter Code Collide

Understanding Loss Functions: How AI Learns From Its Mistakes

Aave Survived the Storm—But Its $8.45 Billion Stress Test Revealed Hidden Fault Lines

Operation Endgame Wipes Out SocGholish Servers and Purifies 14,971 WordPress Sites

Unlock Real-Time Web Search for AI Agents on Amazon Bedrock AgentCore

RADAR Secures $170M to Revolutionize Physical Retail with AI

Trending

SQL Sorcery: Power Techniques Every Data Scientist Needs in Their Arsenal

The Future of Value-Added Resellers: What’s on the Horizon?

Latest Posts

Not More Data, but Better World Models – Unite.AI

OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears

Subscribe to Updates

Top Posts

SQL Sorcery: Power Techniques Every Data Scientist Needs in Their Arsenal

# Introduction

# Setting Up the Dataset

# 1. Measuring Time Between Events with LAG()

# 2. Comparing a Row to Other Rows in the Same Table with a Self-Join

# 3. Selecting the Top Row per Group with ROW_NUMBER()

# 4. Grouping Customers by Spending Level Using NTILE(n)

# 5. Smoothing Out Fluctuations with a Rolling Average

# 6. Conditional Aggregation Using FILTER

# 7. Spotting Consecutive Activity Streaks Using Window Functions

# Quick Reference

Related Posts

# 1. Measuring Time Between Events with `LAG()`

# 3. Selecting the Top Row per Group with `ROW_NUMBER()`

# 4. Grouping Customers by Spending Level Using `NTILE(n)`

# 6. Conditional Aggregation Using `FILTER`