What Window Functions Are Actually For

ECON250 - Big Data Analytics | Week 4

Oleh Omelchenko

2026-02-25

Before we start


  • Who watched the video on window functions?

  • What do they do — can someone describe it in one sentence?

  • Has anyone seen them used somewhere? A query at work, a BI tool, someone else’s code?

Window function syntax

function() OVER (
  PARTITION BY ...   -- which group
  ORDER BY ...       -- order within the group
  ROWS BETWEEN ...   -- how wide the window (optional)
)


  • PARTITION BY — splits rows into independent groups, like GROUP BY but without collapsing them
  • ORDER BY — sets the sequence within each partition; required for rankings and running calculations
  • ROWS BETWEEN — window width; has sensible defaults, rarely needs explicit specification

Not all three are needed for every function. Come back to this slide during practice.

What PARTITION BY does

SUM(sales) OVER (PARTITION BY category) AS category_total

Input

category quarter sales
Furniture Q1 22,657
Furniture Q2 28,064
Technology Q1 18,419
Technology Q2 35,043

Output — same rows, one new column

category quarter sales category_total
Furniture Q1 22,657 50,721
Furniture Q2 28,064 50,721
Technology Q1 18,419 53,462
Technology Q2 35,043 53,462


No rows disappeared. Each row kept its data and gained the group total as a new column.
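The slides target BigQuery, but this behavior is standard SQL. A minimal local sanity check using Python's built-in sqlite3 (window functions need SQLite 3.25+); the table mirrors the toy data above:

```python
import sqlite3

# Toy data mirroring the slide; requires SQLite >= 3.25 for window functions
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (category TEXT, quarter TEXT, sales INTEGER);
INSERT INTO sales VALUES
  ('Furniture',  'Q1', 22657), ('Furniture',  'Q2', 28064),
  ('Technology', 'Q1', 18419), ('Technology', 'Q2', 35043);
""")

rows = con.execute("""
SELECT category, quarter, sales,
       SUM(sales) OVER (PARTITION BY category) AS category_total
FROM sales
ORDER BY category, quarter
""").fetchall()

for row in rows:
    print(row)  # every input row survives, each with its group total attached
```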

Act 1: Applications

Real data always accumulates duplicates

A payments table — p1 was retried:

payment_id user_id amount created_at
p1 u1 100 08:00 ✗
p2 u2 200 09:00
p1 u1 100 08:05 ✓
p3 u3 150 10:00


The retry at 08:05 completed. Keep it, discard 08:00.

WITH numbered AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY payment_id
      ORDER BY created_at DESC
    ) AS rn
  FROM payments
)
SELECT * EXCEPT(rn)
FROM numbered
WHERE rn = 1

PARTITION BY payment_id — each ID forms one group.

ORDER BY created_at DESC — the most recent attempt gets rank 1.

WHERE rn = 1 — one row per group, the right one.
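The same dedup query runs locally in SQLite via Python's sqlite3, minus `EXCEPT(rn)`, which is BigQuery-only, so the surviving columns are listed explicitly. The data mirrors the toy table above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments (payment_id TEXT, user_id TEXT, amount INTEGER, created_at TEXT);
INSERT INTO payments VALUES
  ('p1', 'u1', 100, '08:00'),
  ('p2', 'u2', 200, '09:00'),
  ('p1', 'u1', 100, '08:05'),  -- retry of p1
  ('p3', 'u3', 150, '10:00');
""")

deduped = con.execute("""
WITH numbered AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY payment_id
      ORDER BY created_at DESC
    ) AS rn
  FROM payments
)
-- SQLite has no SELECT * EXCEPT(...), so name the columns
SELECT payment_id, user_id, amount, created_at
FROM numbered
WHERE rn = 1
ORDER BY payment_id
""").fetchall()

for row in deduped:
    print(row)  # one row per payment_id; p1 keeps the 08:05 retry
```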

The dedup pattern generalizes

The only things that change between use cases:

Use case PARTITION BY ORDER BY
Latest status per order order_id status_changed_at DESC
First purchase per customer user_id created_at ASC
Best exam attempt student_id, exam_id score DESC
Most recent device login user_id, device_id login_at DESC

The partition defines what counts as a duplicate.

The ORDER BY direction defines which copy survives.


Sequencing a customer’s purchase history

Input — two customers, five orders

user_id order_id created_at
u1 o1 Jan 5
u2 o2 Jan 7
u1 o3 Feb 10
u2 o4 Feb 15
u1 o5 Mar 20

SELECT
  user_id,
  order_id,
  created_at,
  ROW_NUMBER() OVER (
    PARTITION BY user_id
    ORDER BY created_at
  ) AS order_seq
FROM orders

Each user gets their own independent sequence.

order_seq = 1 → first-ever order (acquisition).

order_seq = 2 → first repeat — the hardest conversion in most businesses.

What order_seq unlocks

Result

user_id order_id order_seq
u1 o1 1
u2 o2 1
u1 o3 2
u2 o4 2
u1 o5 3


Aggregate by order_seq to see how average spend, return rate, or category preference evolves across a customer’s history.

This directly solves the “new vs returning” problem from Week 3 — the one marked optional:

SELECT
  DATE_TRUNC(DATE(created_at), MONTH) AS month,
  COUNTIF(order_seq = 1) AS new_customers,
  COUNTIF(order_seq > 1) AS returning_customers
FROM (...) -- the sequenced query
GROUP BY month

Is this apartment priced fairly?

You cleaned the flatfy dataset in Week 2. “Similar” means same region, same room count, same deal type.

SELECT
  listing_id, region, room_count, deal_type, price_usd,
  AVG(price_usd) OVER (
    PARTITION BY region, room_count, deal_type
  ) AS comparable_avg,
  (price_usd / AVG(price_usd) OVER (
    PARTITION BY region, room_count, deal_type
  ) - 1) * 100 AS pct_vs_comparable
FROM `econ250-2026.course_data.flatfy_2019_cleaned`

pct_vs_comparable = +47 → this listing costs 47% more than similar apartments in the same area.

The three-column PARTITION BY is the analytical decision — it defines “comparable.” The SQL follows from that choice.
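A local sanity check of the same calculation with sqlite3. The regions and prices here are invented, only the three-column partition logic matters:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE listings (listing_id TEXT, region TEXT, room_count INTEGER,
                       deal_type TEXT, price_usd REAL);
INSERT INTO listings VALUES
  ('a', 'Podil',  2, 'rent',  800.0),
  ('b', 'Podil',  2, 'rent', 1200.0),
  ('c', 'Obolon', 1, 'rent',  500.0);
""")

rows = con.execute("""
SELECT listing_id, price_usd,
  AVG(price_usd) OVER (PARTITION BY region, room_count, deal_type) AS comparable_avg,
  (price_usd / AVG(price_usd) OVER (PARTITION BY region, room_count, deal_type)
   - 1) * 100 AS pct_vs_comparable
FROM listings
ORDER BY listing_id
""").fetchall()

for row in rows:
    print(row)  # 'a' sits 20% below its comparables, 'b' 20% above, 'c' has no peers
```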

Two common patterns — and one thing to know

-- Percent of total
-- change OVER() to OVER(PARTITION BY x) to get percent-within-group instead
revenue * 100.0 / SUM(revenue) OVER () AS pct_of_total

-- Top-N per group
-- QUALIFY is BigQuery shorthand for the CTE + filter approach
QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) <= 5


One thing that catches people early:

WHERE ROW_NUMBER() OVER (...) <= 5  -- doesn't work: window functions run after WHERE

Window functions are computed after WHERE. Filter on them in a CTE or with QUALIFY.
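The execution-order rule is not BigQuery-specific. SQLite also rejects a window function in WHERE, and the CTE workaround is identical; a quick demonstration via sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (category TEXT, revenue INTEGER);
INSERT INTO sales VALUES ('a', 10), ('a', 20), ('a', 30), ('b', 50);
""")

# Window function in WHERE: rejected by the engine
try:
    con.execute("SELECT * FROM sales WHERE ROW_NUMBER() OVER (ORDER BY revenue) <= 2")
    where_worked = True
except sqlite3.OperationalError:
    where_worked = False

# CTE + filter: the portable equivalent of QUALIFY
top2 = con.execute("""
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
  FROM sales
)
SELECT category, revenue FROM ranked
WHERE rn <= 2
ORDER BY category, revenue DESC
""").fetchall()

print(where_worked, top2)
```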

Session detection

You have raw events. A session is a sequence of events with no gap longer than 30 minutes.

Raw events

user_id event_time page_url event_type
u1 10:01 /home view
u1 10:14 /products click
u1 10:22 /cart view
u1 11:08 /home view
u1 11:19 /products click

Gap between 10:22 and 11:08 is 46 minutes → new session.

What we want to produce

user_id event_time page_url session_id
u1 10:01 /home u1_10:01
u1 10:14 /products u1_10:01
u1 10:22 /cart u1_10:01
u1 11:08 /home u1_11:08
u1 11:19 /products u1_11:08

Session ID = user + start time of that session.

Step 1: compute gaps, flag where sessions start

WITH gaps AS (
  SELECT
    user_id, event_time, page_url,
    TIMESTAMP_DIFF(
      event_time,
      LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
      MINUTE
    ) AS minutes_since_prev
  FROM events
),
flagged AS (
  SELECT *,
    CASE
      WHEN minutes_since_prev IS NULL OR minutes_since_prev > 30 THEN 1
      ELSE 0
    END AS is_session_start
  FROM gaps
)

Step 2: running sum of flags = session ID

-- continues the WITH clause from step 1
, sessioned AS (
  SELECT *,
    -- integer session number per user
    SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) AS session_num
  FROM flagged
)
SELECT
  user_id, event_time, page_url, is_session_start, session_num,
  -- readable session ID: user + start time of that session
  CONCAT(user_id, '_',
    FORMAT_TIMESTAMP('%H:%M',
      FIRST_VALUE(event_time) OVER (PARTITION BY user_id, session_num ORDER BY event_time)
    )
  ) AS session_id
FROM sessioned

session_num needs its own CTE: a window function can't PARTITION BY an alias computed in the same SELECT.

is_session_start session_num session_id
1 1 u1_10:01
0 1 u1_10:01
0 1 u1_10:01
1 2 u1_11:08
0 2 u1_11:08

The integer session_num is the mechanism; session_id is what you’d actually store.
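The whole pipeline is runnable locally with sqlite3. TIMESTAMP_DIFF becomes julianday arithmetic, and session_num gets its own CTE before it is used as a partition key; the events mirror the slide:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (user_id TEXT, event_time TEXT, page_url TEXT);
INSERT INTO events VALUES
  ('u1', '2026-02-25 10:01:00', '/home'),
  ('u1', '2026-02-25 10:14:00', '/products'),
  ('u1', '2026-02-25 10:22:00', '/cart'),
  ('u1', '2026-02-25 11:08:00', '/home'),
  ('u1', '2026-02-25 11:19:00', '/products');
""")

sessions = con.execute("""
WITH gaps AS (
  SELECT user_id, event_time, page_url,
    (julianday(event_time)
     - julianday(LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)))
    * 24 * 60 AS minutes_since_prev
  FROM events
),
flagged AS (
  SELECT *,
    CASE WHEN minutes_since_prev IS NULL OR minutes_since_prev > 30
         THEN 1 ELSE 0 END AS is_session_start
  FROM gaps
),
sessioned AS (
  SELECT *,
    SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) AS session_num
  FROM flagged
)
SELECT user_id, event_time, page_url, session_num
FROM sessioned
ORDER BY user_id, event_time
""").fetchall()

for row in sessions:
    print(row)  # the 46-minute gap before 11:08 starts session 2
```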

Streaks: the same reset logic

“Who was active at least 4 days in a row?”

A streak continues as long as consecutive days are exactly 1 apart. Any gap breaks it.

user_id day gap
u1 Jan 1
u1 Jan 2 1
u1 Jan 3 1
u1 Jan 4 1
u2 Jan 1
u2 Jan 2 1
u2 Jan 4 2
u2 Jan 5 1
u2 Jan 6 1

u1: 4 consecutive days ✓ u2: the missing Jan 3 breaks the streak, so the longest run is 3 days (Jan 4 to Jan 6) ✗

WITH daily AS (
  SELECT user_id, DATE(event_time) AS day
  FROM events
  GROUP BY 1, 2
),
with_gaps AS (
  SELECT user_id, day,
    DATE_DIFF(day,
      LAG(day) OVER (PARTITION BY user_id ORDER BY day),
    DAY) AS gap
  FROM daily
),
streak_groups AS (
  SELECT user_id, day,
    SUM(CASE WHEN gap != 1 OR gap IS NULL THEN 1 ELSE 0 END)
      OVER (PARTITION BY user_id ORDER BY day) AS streak_id
  FROM with_gaps
)
SELECT user_id, streak_id,
  MIN(day) AS streak_start,
  MAX(day) AS streak_end,
  COUNT(*) AS streak_length
FROM streak_groups
GROUP BY user_id, streak_id
HAVING COUNT(*) >= 4
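A runnable version of the streak query via sqlite3, with DATE_DIFF swapped for julianday arithmetic; the activity data matches the table above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE activity (user_id TEXT, day TEXT);
INSERT INTO activity VALUES
  ('u1','2026-01-01'), ('u1','2026-01-02'), ('u1','2026-01-03'), ('u1','2026-01-04'),
  ('u2','2026-01-01'), ('u2','2026-01-02'), ('u2','2026-01-04'),
  ('u2','2026-01-05'), ('u2','2026-01-06');
""")

streaks = con.execute("""
WITH with_gaps AS (
  SELECT user_id, day,
    CAST(julianday(day)
         - julianday(LAG(day) OVER (PARTITION BY user_id ORDER BY day)) AS INTEGER) AS gap
  FROM activity
),
streak_groups AS (
  SELECT user_id, day,
    SUM(CASE WHEN gap IS NULL OR gap != 1 THEN 1 ELSE 0 END)
      OVER (PARTITION BY user_id ORDER BY day) AS streak_id
  FROM with_gaps
)
SELECT user_id, MIN(day) AS streak_start, COUNT(*) AS streak_length
FROM streak_groups
GROUP BY user_id, streak_id
HAVING COUNT(*) >= 4
""").fetchall()

print(streaks)  # only u1 has 4 consecutive days
```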

Act 2: As Your Data Grows

You’ve been writing this

CREATE OR REPLACE TABLE daily_stats AS
SELECT
  DATE(payment_date)        AS date,
  COUNT(*)                  AS transaction_count,
  SUM(amount)               AS total_revenue,
  COUNT(DISTINCT user_id)   AS active_users
FROM payments
GROUP BY date


This query scans the entire payments table every time it runs — even if only one new day of data arrived since last night.

If the table covers three years of transactions, you are reading three years of data to produce one new row.

Rebuilding from scratch doesn’t scale


Even if payments is partitioned by date, CREATE OR REPLACE has no WHERE clause — it reads every partition, every time.

Table age Partitions scanned What actually changed
1 month 30 daily partitions 1 new day
1 year 365 daily partitions 1 new day
3 years 1,095 daily partitions 1 new day


The longer the table exists, the worse the ratio of work done to work needed. A query that was fine in month one becomes wasteful by year three — even if daily volume stays the same.

Daily aggregates: process only what changed

daily_stats (target) — partitioned by date

date transactions revenue
Feb 23 1,205 $24,100
Feb 24 980 $19,600

Feb 24 had late-arriving payments — the real count is 995. Feb 25 is new.

After MERGE:

date transactions revenue
Feb 23 1,205 $24,100
Feb 24 995 $19,850
Feb 25 1,130 $22,600

Feb 23 untouched (not in source). Feb 24 corrected. Feb 25 inserted.

MERGE daily_stats AS target
USING (
  SELECT
    DATE(payment_date) AS date,
    COUNT(*)           AS transactions,
    SUM(amount)        AS revenue
  FROM payments
  WHERE DATE(payment_date)
    >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
  GROUP BY date
) AS source
ON target.date = source.date
WHEN MATCHED THEN
  UPDATE SET
    transactions = source.transactions,
    revenue      = source.revenue
WHEN NOT MATCHED THEN
  INSERT (date, transactions, revenue)
  VALUES (source.date, source.transactions, source.revenue)

The source query reads only the last 2 days — not the full history. The target table must be partitioned by date for this to work efficiently.

Late-arriving data and idempotency


Real payment systems have late-arriving data — a transaction on February 23 might not clear until February 25. When the February 23 partition is corrected and the pipeline re-runs, WHEN MATCHED overwrites the row with updated figures rather than creating a duplicate.


Idempotency: running the same pipeline twice produces the same result as running it once.


A nightly job that fails at 3 AM and retries at 5 AM must produce identical output either way. Pipelines that aren’t idempotent produce different results on retry, and those differences can be hard to trace.
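SQLite has no MERGE, but the idempotency property itself can be demonstrated with its upsert (`INSERT ... ON CONFLICT DO UPDATE`), which plays the same role here: run the pipeline twice, end with the same table. The figures mirror the slide:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE daily_stats (date TEXT PRIMARY KEY, transactions INTEGER, revenue INTEGER)
""")
# Existing state: Feb 23 final, Feb 24 still missing late-arriving payments
con.executemany("INSERT INTO daily_stats VALUES (?, ?, ?)",
                [('2026-02-23', 1205, 24100), ('2026-02-24', 980, 19600)])

def run_pipeline():
    # Recompute only the last two days and upsert them (stand-in for MERGE)
    source = [('2026-02-24', 995, 19850), ('2026-02-25', 1130, 22600)]
    con.executemany("""
      INSERT INTO daily_stats (date, transactions, revenue) VALUES (?, ?, ?)
      ON CONFLICT(date) DO UPDATE SET
        transactions = excluded.transactions,
        revenue      = excluded.revenue
    """, source)

run_pipeline()
run_pipeline()  # the retry changes nothing: same output as running once

rows = con.execute("SELECT * FROM daily_stats ORDER BY date").fetchall()
print(rows)  # Feb 23 untouched, Feb 24 corrected, Feb 25 inserted
```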

Lookup table: append only

user_first_payment (target)

user_id first_payment_date
u1 Jan 15
u2 Jan 22
u3 Feb 1

Feb 25 payments arrive: u2 paid again (existing), u4 paid for the first time (new).

After MERGE:

user_id first_payment_date
u1 Jan 15
u2 Jan 22
u3 Feb 1
u4 Feb 25

u2 unchanged — first dates are historical fact.

MERGE user_first_payment AS target
USING (
  SELECT user_id,
    MIN(payment_date) AS first_payment_date
  FROM payments
  WHERE DATE(payment_date) = CURRENT_DATE()
  GROUP BY user_id
) AS source
ON target.user_id = source.user_id
WHEN NOT MATCHED THEN
  INSERT (user_id, first_payment_date)
  VALUES (source.user_id, source.first_payment_date)
-- no WHEN MATCHED: once written, first dates don't change

Only new users touch the target table. The full history is never scanned.

When aggregated tables aren’t enough


Your daily_stats table has total_revenue per day. A dashboard question arrives: “What is user u42’s total lifetime spending?”

The daily table cannot answer this — revenue is aggregated across all users. To compute per-user lifetime totals, you need the raw payments table and its full history.


The same applies to any question that cuts data differently from how the summary was built: revenue by product category, spending by acquisition cohort, retention rates — all require the raw data, not the daily rollup.


Summary tables speed up known queries but don’t replace the raw data. Design the summary for the questions you know you’ll ask repeatedly.

Looking ahead


Tomorrow’s practices: deduplication, session detection, and the submission query — a monthly analytics view combining several of today’s patterns.

Week 5: cohort analysis — the same ROW_NUMBER and LAG from today, applied to “what fraction of users who joined in January were still active in March?”