What Window Functions Are Actually For

ECON250 - Big Data Analytics | Week 4

Oleh Omelchenko

2026-02-25

Before we start


  • Who watched the video on window functions?

  • What do they do — can someone describe it in one sentence?

  • Has anyone seen them used somewhere? A query at work, a BI tool, someone else’s code?

Window function syntax

function() OVER (
  PARTITION BY ...   -- which group
  ORDER BY ...       -- order within the group
  ROWS BETWEEN ...   -- how wide the window (optional)
)


  • PARTITION BY — splits rows into independent groups, like GROUP BY but without collapsing them
  • ORDER BY — sets the sequence within each partition; required for rankings and running calculations
  • ROWS BETWEEN — window width; has sensible defaults, rarely needs explicit specification

Not all three are needed for every function. Come back to this slide during practice.

What PARTITION BY does

SUM(sales) OVER (PARTITION BY category) AS category_total

Input

category quarter sales
Furniture Q1 22,657
Furniture Q2 28,064
Technology Q1 18,419
Technology Q2 35,043

Output — same rows, one new column

category quarter sales category_total
Furniture Q1 22,657 50,721
Furniture Q2 28,064 50,721
Technology Q1 18,419 53,462
Technology Q2 35,043 53,462


No rows disappeared. Each row kept its data and gained the group total as a new column.
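The slides target BigQuery, but this behavior is standard SQL. A minimal local sanity check using Python's built-in sqlite3 (window functions need SQLite 3.25+); the table mirrors the toy data above:

```python
import sqlite3

# Toy data mirroring the slide; requires SQLite >= 3.25 for window functions
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (category TEXT, quarter TEXT, sales INTEGER);
INSERT INTO sales VALUES
  ('Furniture',  'Q1', 22657), ('Furniture',  'Q2', 28064),
  ('Technology', 'Q1', 18419), ('Technology', 'Q2', 35043);
""")

rows = con.execute("""
SELECT category, quarter, sales,
       SUM(sales) OVER (PARTITION BY category) AS category_total
FROM sales
ORDER BY category, quarter
""").fetchall()

for row in rows:
    print(row)  # every input row survives, each with its group total attached
```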

Act 1: Applications

Real data always accumulates duplicates

A payments table — p1 was retried:

payment_id user_id amount created_at
p1 u1 100 08:00 ✗
p2 u2 200 09:00
p1 u1 100 08:05 ✓
p3 u3 150 10:00


The retry at 08:05 completed. Keep it, discard 08:00.

WITH numbered AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY payment_id
      ORDER BY created_at DESC
    ) AS rn
  FROM payments
)
SELECT * EXCEPT(rn)
FROM numbered
WHERE rn = 1

PARTITION BY payment_id — each ID forms one group.

ORDER BY created_at DESC — the most recent attempt gets rank 1.

WHERE rn = 1 — one row per group, the right one.
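The same dedup query runs locally in SQLite via Python's sqlite3, minus `EXCEPT(rn)`, which is BigQuery-only, so the surviving columns are listed explicitly. The data mirrors the toy table above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE payments (payment_id TEXT, user_id TEXT, amount INTEGER, created_at TEXT);
INSERT INTO payments VALUES
  ('p1', 'u1', 100, '08:00'),
  ('p2', 'u2', 200, '09:00'),
  ('p1', 'u1', 100, '08:05'),  -- retry of p1
  ('p3', 'u3', 150, '10:00');
""")

deduped = con.execute("""
WITH numbered AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY payment_id
      ORDER BY created_at DESC
    ) AS rn
  FROM payments
)
-- SQLite has no SELECT * EXCEPT(...), so name the columns
SELECT payment_id, user_id, amount, created_at
FROM numbered
WHERE rn = 1
ORDER BY payment_id
""").fetchall()

for row in deduped:
    print(row)  # one row per payment_id; p1 keeps the 08:05 retry
```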

The dedup pattern generalizes

The only things that change between use cases:

Use case PARTITION BY ORDER BY
Latest status per order order_id status_changed_at DESC
First purchase per customer user_id created_at ASC
Best exam attempt student_id, exam_id score DESC
Most recent device login user_id, device_id login_at DESC

The partition defines what counts as a duplicate.

The ORDER BY direction defines which copy survives.


Sequencing a customer’s purchase history

Input — two customers, five orders

user_id order_id created_at
u1 o1 Jan 5
u2 o2 Jan 7
u1 o3 Feb 10
u2 o4 Feb 15
u1 o5 Mar 20

SELECT
  user_id,
  order_id,
  created_at,
  ROW_NUMBER() OVER (
    PARTITION BY user_id
    ORDER BY created_at
  ) AS order_seq
FROM orders

Each user gets their own independent sequence.

order_seq = 1 → first-ever order (acquisition).

order_seq = 2 → first repeat — the hardest conversion in most businesses.

What order_seq unlocks

Result

user_id order_id order_seq
u1 o1 1
u2 o2 1
u1 o3 2
u2 o4 2
u1 o5 3


Aggregate by order_seq to see how average spend, return rate, or category preference evolves across a customer’s history.

This directly solves the “new vs returning” problem from Week 3 — the one marked optional:

SELECT
  DATE_TRUNC(DATE(created_at), MONTH) AS month,
  COUNTIF(order_seq = 1) AS new_customers,
  COUNTIF(order_seq > 1) AS returning_customers
FROM (...) -- the sequenced query
GROUP BY month

Is this apartment priced fairly?

You cleaned the flatfy dataset in Week 2. “Similar” means same region, same room count, same deal type.

SELECT
  listing_id, region, room_count, deal_type, price_usd,
  AVG(price_usd) OVER (
    PARTITION BY region, room_count, deal_type
  ) AS comparable_avg,
  (price_usd / AVG(price_usd) OVER (
    PARTITION BY region, room_count, deal_type
  ) - 1) * 100 AS pct_vs_comparable
FROM `econ250-2026.course_data.flatfy_2019_cleaned`

pct_vs_comparable = +47 → this listing costs 47% more than similar apartments in the same area.

The three-column PARTITION BY is the analytical decision — it defines “comparable.” The SQL follows from that choice.
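A local sanity check of the same calculation with sqlite3. The regions and prices here are invented, only the three-column partition logic matters:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE listings (listing_id TEXT, region TEXT, room_count INTEGER,
                       deal_type TEXT, price_usd REAL);
INSERT INTO listings VALUES
  ('a', 'Podil',  2, 'rent',  800.0),
  ('b', 'Podil',  2, 'rent', 1200.0),
  ('c', 'Obolon', 1, 'rent',  500.0);
""")

rows = con.execute("""
SELECT listing_id, price_usd,
  AVG(price_usd) OVER (PARTITION BY region, room_count, deal_type) AS comparable_avg,
  (price_usd / AVG(price_usd) OVER (PARTITION BY region, room_count, deal_type)
   - 1) * 100 AS pct_vs_comparable
FROM listings
ORDER BY listing_id
""").fetchall()

for row in rows:
    print(row)  # 'a' sits 20% below its comparables, 'b' 20% above, 'c' has no peers
```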

Two common patterns — and one thing to know

-- Percent of total
-- change OVER() to OVER(PARTITION BY x) to get percent-within-group instead
revenue * 100.0 / SUM(revenue) OVER () AS pct_of_total

-- Top-N per group
-- QUALIFY is BigQuery shorthand for the CTE + filter approach
QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) <= 5


One thing that catches people early:

WHERE ROW_NUMBER() OVER (...) <= 5  -- doesn't work: window functions run after WHERE

Window functions are computed after WHERE. Filter on them in a CTE or with QUALIFY.
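The execution-order rule is not BigQuery-specific. SQLite also rejects a window function in WHERE, and the CTE workaround is identical; a quick demonstration via sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (category TEXT, revenue INTEGER);
INSERT INTO sales VALUES ('a', 10), ('a', 20), ('a', 30), ('b', 50);
""")

# Window function in WHERE: rejected by the engine
try:
    con.execute("SELECT * FROM sales WHERE ROW_NUMBER() OVER (ORDER BY revenue) <= 2")
    where_worked = True
except sqlite3.OperationalError:
    where_worked = False

# CTE + filter: the portable equivalent of QUALIFY
top2 = con.execute("""
WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rn
  FROM sales
)
SELECT category, revenue FROM ranked
WHERE rn <= 2
ORDER BY category, revenue DESC
""").fetchall()

print(where_worked, top2)
```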

Session detection

You have raw events. A session is a sequence of events with no gap longer than 30 minutes.

Raw events

user_id event_time page_url event_type
u1 10:01 /home view
u1 10:14 /products click
u1 10:22 /cart view
u1 11:08 /home view
u1 11:19 /products click

Gap between 10:22 and 11:08 is 46 minutes → new session.

What we want to produce

user_id event_time page_url session_id
u1 10:01 /home u1_10:01
u1 10:14 /products u1_10:01
u1 10:22 /cart u1_10:01
u1 11:08 /home u1_11:08
u1 11:19 /products u1_11:08

Session ID = user + start time of that session.

Step 1: compute gaps, flag where sessions start

WITH gaps AS (
  SELECT
    user_id, event_time, page_url,
    TIMESTAMP_DIFF(
      event_time,
      LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
      MINUTE
    ) AS minutes_since_prev
  FROM events
),
flagged AS (
  SELECT *,
    CASE
      WHEN minutes_since_prev IS NULL OR minutes_since_prev > 30 THEN 1
      ELSE 0
    END AS is_session_start
  FROM gaps
)

Step 2: running sum of flags = session ID

-- continues the WITH clause from step 1
, sessioned AS (
  SELECT *,
    -- integer session number per user
    SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) AS session_num
  FROM flagged
)
SELECT
  user_id, event_time, page_url, is_session_start, session_num,
  -- readable session ID: user + start time of that session
  CONCAT(user_id, '_',
    FORMAT_TIMESTAMP('%H:%M',
      FIRST_VALUE(event_time) OVER (PARTITION BY user_id, session_num ORDER BY event_time)
    )
  ) AS session_id
FROM sessioned

session_num needs its own CTE: a window function can't PARTITION BY an alias computed in the same SELECT.

is_session_start session_num session_id
1 1 u1_10:01
0 1 u1_10:01
0 1 u1_10:01
1 2 u1_11:08
0 2 u1_11:08

The integer session_num is the mechanism; session_id is what you’d actually store.
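The whole pipeline is runnable locally with sqlite3. TIMESTAMP_DIFF becomes julianday arithmetic, and session_num gets its own CTE before it is used as a partition key; the events mirror the slide:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events (user_id TEXT, event_time TEXT, page_url TEXT);
INSERT INTO events VALUES
  ('u1', '2026-02-25 10:01:00', '/home'),
  ('u1', '2026-02-25 10:14:00', '/products'),
  ('u1', '2026-02-25 10:22:00', '/cart'),
  ('u1', '2026-02-25 11:08:00', '/home'),
  ('u1', '2026-02-25 11:19:00', '/products');
""")

sessions = con.execute("""
WITH gaps AS (
  SELECT user_id, event_time, page_url,
    (julianday(event_time)
     - julianday(LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)))
    * 24 * 60 AS minutes_since_prev
  FROM events
),
flagged AS (
  SELECT *,
    CASE WHEN minutes_since_prev IS NULL OR minutes_since_prev > 30
         THEN 1 ELSE 0 END AS is_session_start
  FROM gaps
),
sessioned AS (
  SELECT *,
    SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY event_time) AS session_num
  FROM flagged
)
SELECT user_id, event_time, page_url, session_num
FROM sessioned
ORDER BY user_id, event_time
""").fetchall()

for row in sessions:
    print(row)  # the 46-minute gap before 11:08 starts session 2
```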

Streaks: the same reset logic

“Who was active at least 4 days in a row?”

A streak continues as long as consecutive days are exactly 1 apart. Any gap breaks it.

user_id day gap
u1 Jan 1
u1 Jan 2 1
u1 Jan 3 1
u1 Jan 4 1
u2 Jan 1
u2 Jan 2 1
u2 Jan 4 2
u2 Jan 5 1
u2 Jan 6 1

u1: 4 consecutive days ✓ u2: the missing Jan 3 breaks the streak, so the longest run is 3 days (Jan 4 to Jan 6) ✗

WITH daily AS (
  SELECT user_id, DATE(event_time) AS day
  FROM events
  GROUP BY 1, 2
),
with_gaps AS (
  SELECT user_id, day,
    DATE_DIFF(day,
      LAG(day) OVER (PARTITION BY user_id ORDER BY day),
    DAY) AS gap
  FROM daily
),
streak_groups AS (
  SELECT user_id, day,
    SUM(CASE WHEN gap != 1 OR gap IS NULL THEN 1 ELSE 0 END)
      OVER (PARTITION BY user_id ORDER BY day) AS streak_id
  FROM with_gaps
)
SELECT user_id, streak_id,
  MIN(day) AS streak_start,
  MAX(day) AS streak_end,
  COUNT(*) AS streak_length
FROM streak_groups
GROUP BY user_id, streak_id
HAVING COUNT(*) >= 4
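A runnable version of the streak query via sqlite3, with DATE_DIFF swapped for julianday arithmetic; the activity data matches the table above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE activity (user_id TEXT, day TEXT);
INSERT INTO activity VALUES
  ('u1','2026-01-01'), ('u1','2026-01-02'), ('u1','2026-01-03'), ('u1','2026-01-04'),
  ('u2','2026-01-01'), ('u2','2026-01-02'), ('u2','2026-01-04'),
  ('u2','2026-01-05'), ('u2','2026-01-06');
""")

streaks = con.execute("""
WITH with_gaps AS (
  SELECT user_id, day,
    CAST(julianday(day)
         - julianday(LAG(day) OVER (PARTITION BY user_id ORDER BY day)) AS INTEGER) AS gap
  FROM activity
),
streak_groups AS (
  SELECT user_id, day,
    SUM(CASE WHEN gap IS NULL OR gap != 1 THEN 1 ELSE 0 END)
      OVER (PARTITION BY user_id ORDER BY day) AS streak_id
  FROM with_gaps
)
SELECT user_id, MIN(day) AS streak_start, COUNT(*) AS streak_length
FROM streak_groups
GROUP BY user_id, streak_id
HAVING COUNT(*) >= 4
""").fetchall()

print(streaks)  # only u1 has 4 consecutive days
```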

Act 2: As Your Data Grows

You’ve been writing this

CREATE OR REPLACE TABLE daily_stats AS
SELECT
  DATE(payment_date)        AS date,
  COUNT(*)                  AS transaction_count,
  SUM(amount)               AS total_revenue,
  COUNT(DISTINCT user_id)   AS active_users
FROM payments
GROUP BY date


This query scans the entire payments table every time it runs — even if only one new day of data arrived since last night.

If the table covers three years of transactions, you are reading three years of data to produce one new row.

Rebuilding from scratch doesn’t scale


Even if payments is partitioned by date, CREATE OR REPLACE has no WHERE clause — it reads every partition, every time.

Table age Partitions scanned What actually changed
1 month 30 daily partitions 1 new day
1 year 365 daily partitions 1 new day
3 years 1,095 daily partitions 1 new day


The longer the table exists, the worse the ratio of work done to work needed. A query that was fine in month one becomes wasteful by year three — even if daily volume stays the same.

Daily aggregates: process only what changed

daily_stats (target) — partitioned by date

date transactions revenue
Feb 23 1,205 $24,100
Feb 24 980 $19,600

Feb 24 had late-arriving payments — the real count is 995. Feb 25 is new.

After MERGE:

date transactions revenue
Feb 23 1,205 $24,100
Feb 24 995 $19,850
Feb 25 1,130 $22,600

Feb 23 untouched (not in source). Feb 24 corrected. Feb 25 inserted.

MERGE daily_stats AS target
USING (
  SELECT
    DATE(payment_date) AS date,
    COUNT(*)           AS transactions,
    SUM(amount)        AS revenue
  FROM payments
  WHERE DATE(payment_date)
    >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
  GROUP BY date
) AS source
ON target.date = source.date
WHEN MATCHED THEN
  UPDATE SET
    transactions = source.transactions,
    revenue      = source.revenue
WHEN NOT MATCHED THEN
  INSERT (date, transactions, revenue)
  VALUES (source.date, source.transactions, source.revenue)

The source query reads only the last 2 days — not the full history. The target table must be partitioned by date for this to work efficiently.

Late-arriving data and idempotency


Real payment systems have late-arriving data — a transaction on February 23 might not clear until February 25. When the February 23 partition is corrected and the pipeline re-runs, WHEN MATCHED overwrites the row with updated figures rather than creating a duplicate.


Idempotency: running the same pipeline twice produces the same result as running it once.


A nightly job that fails at 3 AM and retries at 5 AM must produce identical output either way. Pipelines that aren’t idempotent produce different results on retry, and those differences can be hard to trace.
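SQLite has no MERGE, but the idempotency property itself can be demonstrated with its upsert (`INSERT ... ON CONFLICT DO UPDATE`), which plays the same role here: run the pipeline twice, end with the same table. The figures mirror the slide:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE daily_stats (date TEXT PRIMARY KEY, transactions INTEGER, revenue INTEGER)
""")
# Existing state: Feb 23 final, Feb 24 still missing late-arriving payments
con.executemany("INSERT INTO daily_stats VALUES (?, ?, ?)",
                [('2026-02-23', 1205, 24100), ('2026-02-24', 980, 19600)])

def run_pipeline():
    # Recompute only the last two days and upsert them (stand-in for MERGE)
    source = [('2026-02-24', 995, 19850), ('2026-02-25', 1130, 22600)]
    con.executemany("""
      INSERT INTO daily_stats (date, transactions, revenue) VALUES (?, ?, ?)
      ON CONFLICT(date) DO UPDATE SET
        transactions = excluded.transactions,
        revenue      = excluded.revenue
    """, source)

run_pipeline()
run_pipeline()  # the retry changes nothing: same output as running once

rows = con.execute("SELECT * FROM daily_stats ORDER BY date").fetchall()
print(rows)  # Feb 23 untouched, Feb 24 corrected, Feb 25 inserted
```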

Lookup table: append only

user_first_payment (target)

user_id first_payment_date
u1 Jan 15
u2 Jan 22
u3 Feb 1

Feb 25 payments arrive: u2 paid again (existing), u4 paid for the first time (new).

After MERGE:

user_id first_payment_date
u1 Jan 15
u2 Jan 22
u3 Feb 1
u4 Feb 25

u2 unchanged — first dates are historical fact.

MERGE user_first_payment AS target
USING (
  SELECT user_id,
    MIN(payment_date) AS first_payment_date
  FROM payments
  WHERE DATE(payment_date) = CURRENT_DATE()
  GROUP BY user_id
) AS source
ON target.user_id = source.user_id
WHEN NOT MATCHED THEN
  INSERT (user_id, first_payment_date)
  VALUES (source.user_id, source.first_payment_date)
-- no WHEN MATCHED: once written, first dates don't change

Only new users touch the target table. The full history is never scanned.

When aggregated tables aren’t enough


Your daily_stats table has total_revenue per day. A dashboard question arrives: “What is user u42’s total lifetime spending?”

The daily table cannot answer this — revenue is aggregated across all users. To compute per-user lifetime totals, you need the raw payments table and its full history.


The same applies to any question that cuts data differently from how the summary was built: revenue by product category, spending by acquisition cohort, retention rates — all require the raw data, not the daily rollup.


Summary tables speed up known queries but don’t replace the raw data. Design the summary for the questions you know you’ll ask repeatedly.

Looking ahead


Tomorrow’s practices: deduplication, session detection, and the submission query — a monthly analytics view combining several of today’s patterns.

Week 5: cohort analysis — the same ROW_NUMBER and LAG from today, applied to “what fraction of users who joined in January were still active in March?”