ECON250 - Big Data Analytics | Week 4
2026-02-25
Who watched the video on window functions?
What do they do — can someone describe it in one sentence?
Has anyone seen them used somewhere? A query at work, a BI tool, someone else’s code?
- PARTITION BY — splits rows into independent groups, like GROUP BY but without collapsing them
- ORDER BY — sets the sequence within each partition; required for rankings and running calculations
- ROWS BETWEEN — window width; has sensible defaults, rarely needs explicit specification

Not all three are needed for every function. Come back to this slide during practice.
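The three clauses combine like this (a sketch; the table name `sales_by_quarter` is an assumption, and with an ORDER BY present the frame below is already the default up to tie-handling):

```sql
-- running total within each category, in quarter order
SELECT
  category, quarter, sales,
  SUM(sales) OVER (
    PARTITION BY category    -- independent groups
    ORDER BY quarter         -- sequence within each group
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW  -- explicit window width
  ) AS running_sales
FROM sales_by_quarter
```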
What PARTITION BY does

Input
| category | quarter | sales |
|---|---|---|
| Furniture | Q1 | 22,657 |
| Furniture | Q2 | 28,064 |
| Technology | Q1 | 18,419 |
| Technology | Q2 | 35,043 |
Output — same rows, one new column
| category | quarter | sales | category_total |
|---|---|---|---|
| Furniture | Q1 | 22,657 | 50,721 |
| Furniture | Q2 | 28,064 | 50,721 |
| Technology | Q1 | 18,419 | 53,462 |
| Technology | Q2 | 35,043 | 53,462 |
No rows disappeared. Each row kept its data and gained the group total as a new column.
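One way to produce that output (a sketch; the table name `sales_by_quarter` is an assumption):

```sql
SELECT
  category, quarter, sales,
  -- one total per category, repeated on every row of that category
  SUM(sales) OVER (PARTITION BY category) AS category_total
FROM sales_by_quarter
```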
A payments table — p1 was retried:
| payment_id | user_id | amount | created_at |
|---|---|---|---|
| p1 | u1 | 100 | 08:00 ✗ |
| p2 | u2 | 200 | 09:00 |
| p1 | u1 | 100 | 08:05 ✓ |
| p3 | u3 | 150 | 10:00 |
The retry at 08:05 completed. Keep it, discard 08:00.
PARTITION BY payment_id — each ID forms one group.
ORDER BY created_at DESC — most recent gets rank 1.
WHERE rn = 1 — one row per group, the right one.
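Put together, the three steps read roughly like this (window functions can't appear in WHERE, hence the CTE):

```sql
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY payment_id    -- each ID forms one group
      ORDER BY created_at DESC   -- most recent gets rank 1
    ) AS rn
  FROM payments
)
SELECT payment_id, user_id, amount, created_at
FROM ranked
WHERE rn = 1                     -- one row per group, the right one
```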
The only things that change between use cases:
| Use case | PARTITION BY | ORDER BY |
|---|---|---|
| Latest status per order | order_id | status_changed_at DESC |
| First purchase per customer | user_id | created_at ASC |
| Best exam attempt | student_id, exam_id | score DESC |
| Most recent device login | user_id, device_id | login_at DESC |
The partition defines what counts as a duplicate.
The ORDER BY direction defines which copy survives.
Input — two customers, five orders
| user_id | order_id | created_at |
|---|---|---|
| u1 | o1 | Jan 5 |
| u2 | o2 | Jan 7 |
| u1 | o3 | Feb 10 |
| u2 | o4 | Feb 15 |
| u1 | o5 | Mar 20 |
Result
| user_id | order_id | order_seq |
|---|---|---|
| u1 | o1 | 1 |
| u2 | o2 | 1 |
| u1 | o3 | 2 |
| u2 | o4 | 2 |
| u1 | o5 | 3 |
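The order_seq column comes from ROW_NUMBER over each customer's orders (table name `orders` assumed):

```sql
SELECT
  user_id, order_id,
  ROW_NUMBER() OVER (
    PARTITION BY user_id    -- restart the count for each customer
    ORDER BY created_at     -- earliest order gets 1
  ) AS order_seq
FROM orders
```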
Aggregate by order_seq to see how average spend, return rate, or category preference evolves across a customer’s history.
You cleaned the flatfy dataset in Week 2. “Similar” means same region, same room count, same deal type.
SELECT
listing_id, region, room_count, deal_type, price_usd,
AVG(price_usd) OVER (
PARTITION BY region, room_count, deal_type
) AS comparable_avg,
(price_usd / AVG(price_usd) OVER (
PARTITION BY region, room_count, deal_type
) - 1) * 100 AS pct_vs_comparable
FROM `econ250-2026.course_data.flatfy_2019_cleaned`

pct_vs_comparable = +47 → this listing costs 47% more than similar apartments in the same area.
The three-column PARTITION BY is the analytical decision — it defines “comparable.” The SQL follows from that choice.
-- Percent of total
-- change OVER() to OVER(PARTITION BY x) to get percent-within-group instead
revenue * 100.0 / SUM(revenue) OVER () AS pct_of_total
-- Top-N per group
-- QUALIFY is BigQuery shorthand for the CTE + filter approach
QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) <= 5

One thing that catches people early:
Window functions are computed after WHERE. Filter on them in a CTE or with QUALIFY.
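The CTE equivalent of the QUALIFY shorthand, for engines without QUALIFY (table name `products` assumed):

```sql
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY category ORDER BY revenue DESC
    ) AS rn
  FROM products
)
SELECT * FROM ranked
WHERE rn <= 5   -- rn is an ordinary column here, so WHERE works
```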
You have raw events. A session is a sequence of events with no gap longer than 30 minutes.
Raw events
| user_id | event_time | page_url | event_type |
|---|---|---|---|
| u1 | 10:01 | /home | view |
| u1 | 10:14 | /products | click |
| u1 | 10:22 | /cart | view |
| u1 | 11:08 | /home | view |
| u1 | 11:19 | /products | click |
Gap between 10:22 and 11:08 is 46 minutes → new session.
What we want to produce
| user_id | event_time | page_url | session_id |
|---|---|---|---|
| u1 | 10:01 | /home | u1_10:01 |
| u1 | 10:14 | /products | u1_10:01 |
| u1 | 10:22 | /cart | u1_10:01 |
| u1 | 11:08 | /home | u1_11:08 |
| u1 | 11:19 | /products | u1_11:08 |
Session ID = user + start time of that session.
WITH gaps AS (
SELECT
    user_id, event_time, page_url,
TIMESTAMP_DIFF(
event_time,
LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
MINUTE
) AS minutes_since_prev
FROM events
),
flagged AS (
SELECT *,
CASE
WHEN minutes_since_prev IS NULL OR minutes_since_prev > 30 THEN 1
ELSE 0
END AS is_session_start
FROM gaps
),
numbered AS (
  SELECT *,
    -- integer session number per user: running count of session starts
    SUM(is_session_start)
      OVER (PARTITION BY user_id ORDER BY event_time) AS session_num
  FROM flagged
)
SELECT
  user_id, event_time, page_url, session_num,
  -- readable session ID: user + start time of that session
  CONCAT(user_id, '_',
    FORMAT_TIMESTAMP('%H:%M',
      FIRST_VALUE(event_time)
        OVER (PARTITION BY user_id, session_num ORDER BY event_time)
    )
  ) AS session_id
FROM numbered

| is_session_start | session_num | session_id |
|---|---|---|
| 1 | 1 | u1_10:01 |
| 0 | 1 | u1_10:01 |
| 0 | 1 | u1_10:01 |
| 1 | 2 | u1_11:08 |
| 0 | 2 | u1_11:08 |
The integer session_num is the mechanism; session_id is what you’d actually store.
“Who was active at least 4 days in a row?”
A streak continues as long as consecutive days are exactly 1 apart. Any gap breaks it.
| user_id | day | gap |
|---|---|---|
| u1 | Jan 1 | — |
| u1 | Jan 2 | 1 |
| u1 | Jan 3 | 1 |
| u1 | Jan 4 | 1 |
| u2 | Jan 1 | — |
| u2 | Jan 2 | 1 |
| u2 | Jan 4 | 2 |
| u2 | Jan 5 | 1 |
| u2 | Jan 6 | 1 |
u1: 4 consecutive days ✓
u2: missing Jan 3 breaks the streak; longest run is 3 days ✗
WITH daily AS (
SELECT user_id, DATE(event_time) AS day
FROM events
GROUP BY 1, 2
),
with_gaps AS (
SELECT user_id, day,
DATE_DIFF(day,
LAG(day) OVER (PARTITION BY user_id ORDER BY day),
DAY) AS gap
FROM daily
),
streak_groups AS (
SELECT user_id, day,
SUM(CASE WHEN gap != 1 OR gap IS NULL THEN 1 ELSE 0 END)
OVER (PARTITION BY user_id ORDER BY day) AS streak_id
FROM with_gaps
)
SELECT user_id, streak_id,
MIN(day) AS streak_start,
MAX(day) AS streak_end,
COUNT(*) AS streak_length
FROM streak_groups
GROUP BY user_id, streak_id
HAVING COUNT(*) >= 4

A full-refresh rebuild (CREATE OR REPLACE TABLE … AS SELECT …) scans the entire payments table every time it runs — even if only one new day of data arrived since last night.
If the table covers three years of transactions, you are reading three years of data to produce one new row.
Even if payments is partitioned by date, CREATE OR REPLACE has no WHERE clause — it reads every partition, every time.
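The full-refresh pattern being described looks roughly like this (a sketch; the column names mirror the daily_stats table used on the following slides):

```sql
-- rebuilds the whole table from all of history, every night
CREATE OR REPLACE TABLE daily_stats AS
SELECT
  DATE(payment_date) AS date,
  COUNT(*) AS transactions,
  SUM(amount) AS revenue
FROM payments
GROUP BY date
```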
| Table age | Partitions scanned | What actually changed |
|---|---|---|
| 1 month | 30 daily partitions | 1 new day |
| 1 year | 365 daily partitions | 1 new day |
| 3 years | 1,095 daily partitions | 1 new day |
The longer the table exists, the worse the ratio of work done to work needed. A query that was fine in month one becomes wasteful by year three — even if daily volume stays the same.
daily_stats (target) — partitioned by date
| date | transactions | revenue |
|---|---|---|
| Feb 23 | 1,205 | $24,100 |
| Feb 24 | 980 | $19,600 |
Feb 24 had late-arriving payments — the real count is 995. Feb 25 is new.
After MERGE:
| date | transactions | revenue |
|---|---|---|
| Feb 23 | 1,205 | $24,100 |
| Feb 24 | 995 | $19,850 |
| Feb 25 | 1,130 | $22,600 |
Feb 23 untouched (not in source). Feb 24 corrected. Feb 25 inserted.
MERGE daily_stats AS target
USING (
SELECT
DATE(payment_date) AS date,
COUNT(*) AS transactions,
SUM(amount) AS revenue
FROM payments
WHERE payment_date
>= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
GROUP BY date
) AS source
ON target.date = source.date
WHEN MATCHED THEN
UPDATE SET
transactions = source.transactions,
revenue = source.revenue
WHEN NOT MATCHED THEN
INSERT (date, transactions, revenue)
VALUES (source.date, source.transactions, source.revenue)

The source query reads only the last 2 days — not the full history. The target table must be partitioned by date for this to work efficiently.
Real payment systems have late-arriving data — a transaction on February 23 might not clear until February 25. When the February 23 partition is corrected and the pipeline re-runs, WHEN MATCHED overwrites the row with updated figures rather than creating a duplicate.
Idempotency: running the same pipeline twice produces the same result as running it once.
A nightly job that fails at 3 AM and retries at 5 AM must produce identical output either way. Pipelines that aren’t idempotent produce different results on retry, and those differences can be hard to trace.
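A plain INSERT of the same window is not idempotent, which is the contrast worth seeing (a sketch using the same daily_stats and payments tables):

```sql
-- NOT idempotent: a 3 AM failure followed by a 5 AM retry
-- that both reach this statement appends the same rows twice
INSERT INTO daily_stats (date, transactions, revenue)
SELECT
  DATE(payment_date), COUNT(*), SUM(amount)
FROM payments
WHERE payment_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
GROUP BY 1
```

MERGE avoids this because WHEN MATCHED overwrites the existing row instead of appending a duplicate.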
user_first_payment (target)
| user_id | first_payment_date |
|---|---|
| u1 | Jan 15 |
| u2 | Jan 22 |
| u3 | Feb 1 |
Feb 25 payments arrive: u2 paid again (existing), u4 paid for the first time (new).
After MERGE:
| user_id | first_payment_date |
|---|---|
| u1 | Jan 15 |
| u2 | Jan 22 |
| u3 | Feb 1 |
| u4 | Feb 25 |
u2 unchanged — first dates are historical fact.
MERGE user_first_payment AS target
USING (
SELECT user_id,
MIN(payment_date) AS first_payment_date
FROM payments
WHERE payment_date = CURRENT_DATE()
GROUP BY user_id
) AS source
ON target.user_id = source.user_id
WHEN NOT MATCHED THEN
INSERT (user_id, first_payment_date)
VALUES (source.user_id, source.first_payment_date)
-- no WHEN MATCHED: once written, first dates don't change

Only new users touch the target table. The full history is never scanned.
Your daily_stats table has total_revenue per day. A dashboard question arrives: “What is user u42’s total lifetime spending?”
The daily table cannot answer this — revenue is aggregated across all users. To compute per-user lifetime totals, you need the raw payments table and its full history.
The same applies to any question that cuts data differently from how the summary was built: revenue by product category, spending by acquisition cohort, retention rates — all require the raw data, not the daily rollup.
Summary tables speed up known queries but don’t replace the raw data. Design the summary for the questions you know you’ll ask repeatedly.
Tomorrow’s practices: deduplication, session detection, and the submission query — a monthly analytics view combining several of today’s patterns.
Week 5: cohort analysis — the same ROW_NUMBER and LAG from today, applied to “what fraction of users who joined in January were still active in March?”

Kyiv School of Economics