Week 4 Practice: Window Functions & Incremental Updates

ECON250: Big Data Analytics

Published: February 25, 2026

Modified: March 21, 2026

Your Submission

Deadline: 3 days after the practice sessions

Points: 2 points for satisfactory completion

As you work through the exercises, save each one as a table (MERGE exercises) or view (window function exercises) in your personal dataset. Name them week4_1, week4_2, etc.

What to submit: The link to the latest view or table you created during the practice. If you completed all exercises, that would be week4_7. Reaching week4_4 or beyond demonstrates satisfactory understanding.

How to Submit

  1. In BigQuery Explorer, find the latest view/table in your student__yourname dataset
  2. Click the three dots menu (⋮) next to its name
  3. Select “Copy link”
  4. Paste the link in Moodle

Before You Start

Required: BigQuery console open at console.cloud.google.com/bigquery?project=econ250-2026

What is Hacker News?

Hacker News (news.ycombinator.com) is a social news site run by Y Combinator, focused on technology, startups, and science. Users submit links or text posts, which are upvoted by the community. A story’s score is its net upvotes — most stories get 1–3, while a few go viral with scores in the thousands.

Dataset

econ250-2026.course_data.hn_stories — Hacker News story submissions from 2016 to present (~3M rows). Built from the BigQuery public hacker_news dataset: non-dead stories with scores, plus comment aggregates.

Column Type Notes
id INTEGER Unique story ID
author STRING Username of the poster
title STRING Story title
score INTEGER Upvotes (highly skewed — median 2, max ~6,000)
num_comments INTEGER Total comment count (NULL for some)
posted_at TIMESTAMP When the story was posted (UTC) — partition key via DATE(posted_at)
domain STRING Source domain extracted from URL (NULL for self-posts like Ask HN)
num_direct_replies INTEGER Top-level comments
num_unique_commenters INTEGER Distinct authors of direct replies
first_comment_at TIMESTAMP Earliest direct reply
last_comment_at TIMESTAMP Latest direct reply

Partitioned by DATE(posted_at) (daily), clustered by domain.

The full table scans ~400 MB — well within your daily quota. For exercises that focus on a recent period, we’ll use February 2026 data.
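Because the table is partitioned by DATE(posted_at), filtering on that expression prunes partitions so only the matching days are scanned. A minimal sketch (the row count shown in the editor's bytes-scanned estimate should drop well below the full ~400 MB):

```sql
-- Partition pruning: the DATE(posted_at) filter limits the scan
-- to February 2026 partitions only.
SELECT COUNT(*) AS feb_stories
FROM `econ250-2026.course_data.hn_stories`
WHERE DATE(posted_at) BETWEEN '2026-02-01' AND '2026-02-28'
```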

What you’ll practice

  • Incremental updates: MERGE for daily summaries and lookup tables
  • Aggregate windows: SUM() OVER (), COUNT() OVER (PARTITION BY ...)
  • Ranking: RANK, ROW_NUMBER, FIRST_VALUE
  • Filtering on window functions: CTE + WHERE, QUALIFY
  • Period comparisons: LAG
  • Running calculations: moving averages, frame specifications

Exploring the Data

Before computing anything, explore the dataset.

Quick stats

In BigQuery Explorer, find econ250-2026.course_data.hn_stories, click the table name, then open the Preview tab (free — no bytes scanned). Browse a few rows. Then run:

SELECT
  COUNT(*) AS total_stories,
  COUNT(DISTINCT author) AS unique_authors,
  MIN(DATE(posted_at)) AS earliest,
  MAX(DATE(posted_at)) AS latest,
  ROUND(AVG(score), 1) AS avg_score,
  APPROX_QUANTILES(score, 100)[OFFSET(50)] AS median_score
FROM `econ250-2026.course_data.hn_stories`

What you should notice: The average score is far higher than the median. Most stories get almost no engagement, while a small number go viral. You’ll see this skew throughout today’s exercises.

Date continuity check

Are there any missing dates in the dataset — days with zero stories? Use LAG to compare each date to its predecessor:

WITH daily AS (
  SELECT DISTINCT DATE(posted_at) AS posted_date
  FROM `econ250-2026.course_data.hn_stories`
),
with_prev AS (
  SELECT
    posted_date,
    ??? AS prev_date
  FROM daily
)
SELECT
  posted_date,
  prev_date,
  DATE_DIFF(posted_date, prev_date, DAY) AS gap_days
FROM with_prev
WHERE DATE_DIFF(posted_date, prev_date, DAY) > 1
ORDER BY posted_date

Replace ??? with the appropriate expression. LAG(column) OVER (ORDER BY column) looks back one row in the specified order.

Expected: If the query returns no rows, there are no gaps — every calendar day has at least one story.

Note: Why two CTEs?

You might wonder why we need the with_prev CTE — why not put the LAG and the WHERE in the same query? Window functions are computed after WHERE, so you can’t filter on a window function result in the same query that defines it. The CTE computes prev_date first; the outer query filters on it. You’ll see this pattern again in later exercises.


Incremental Updates with MERGE

When maintaining a summary table, rebuilding the whole table every day means rescanning the entire history — even if only one day of data is new. MERGE lets you update an existing table incrementally: insert new rows and update existing ones in a single statement.

Exercise 1: Daily Summary with MERGE

Step 1 — Create a daily summary table from all data through January 31, 2026. This table is partitioned by posted_date — the same concept you saw on the source table:

CREATE OR REPLACE TABLE `econ250-2026.student__yourname.week4_1`
PARTITION BY posted_date
AS
SELECT
  DATE(posted_at) AS posted_date,
  COUNT(*) AS stories,
  SUM(score) AS total_score,
  COUNT(DISTINCT author) AS unique_authors
FROM `econ250-2026.course_data.hn_stories`
WHERE DATE(posted_at) <= '2026-01-31'
GROUP BY posted_date

Verify how many rows the table has and the date range:

SELECT COUNT(*) AS days, MIN(posted_date) AS first_day, MAX(posted_date) AS last_day
FROM `econ250-2026.student__yourname.week4_1`

You should see ~3,680 rows — one per day from 2016 to January 31, 2026.

Step 2 — Write a MERGE to add all of 2026 data. Since January is already in the table, those rows will overlap (matched → updated). February rows are new (not matched → inserted).

The source subquery should have the same structure as Step 1, but filtered to DATE(posted_at) >= '2026-01-01'.

The MERGE structure:

MERGE target_table AS target
USING (source_query) AS source
ON target.key = source.key
WHEN MATCHED THEN UPDATE SET col1 = source.col1, ...
WHEN NOT MATCHED THEN INSERT (col1, ...) VALUES (source.col1, ...)
  • Match on posted_date
  • WHEN MATCHED: update all metric columns with the source values
  • WHEN NOT MATCHED: insert the new row

Expected: The output says “N affected rows” — this includes both the ~31 January rows (updated with identical values) and the ~25 February rows (newly inserted). Verify: the table should now have ~3,700+ rows and MAX(posted_date) should be in February 2026.

Step 3 — Run the exact same MERGE again. Check the row count: unchanged. This second run updates all 2026 rows with identical values rather than creating duplicates. This is idempotency — re-running produces the same result. If you had used INSERT INTO instead of MERGE, you’d get duplicates every time.

Note: Why WHEN MATCHED matters

In this exercise, January rows get “updated” with the same values — nothing visibly changes. But WHEN MATCHED is what makes the query safe to re-run. In production, scores on existing days might change (late upvotes), and the UPDATE clause handles that correctly. Always include both clauses unless you have a specific reason not to.

Exercise 2: Author Lookup with MERGE

Goal: Build an author lookup table from historical data, then incrementally add new authors from 2026.

Step 1 — Create the base table from all data before 2026:

CREATE OR REPLACE TABLE `econ250-2026.student__yourname.week4_2` AS
SELECT
  author,
  MIN(DATE(posted_at)) AS first_post_date,
  MAX(DATE(posted_at)) AS last_post_date,
  COUNT(*) AS total_stories
FROM `econ250-2026.course_data.hn_stories`
WHERE author IS NOT NULL
  AND DATE(posted_at) < '2026-01-01'
GROUP BY author

Verify: check the row count — this is the number of distinct authors who posted before 2026.

Step 2 — Write a MERGE using 2026 data as the source. But this time, you only want to add new authors — people who first appeared in 2026. Authors who already exist in the table should be left untouched.

This is an append-only MERGE — you only need a WHEN NOT MATCHED clause. Omit WHEN MATCHED entirely.

The source subquery has the same structure as Step 1, but filtered to DATE(posted_at) >= '2026-01-01'.

Expected: The output shows how many new authors were inserted — these are people who posted for the first time in 2026. The original rows are untouched.

Question to consider: What if you also wanted to update existing authors — refresh their last_post_date and add their 2026 stories to total_stories? How would you add a WHEN MATCHED clause? What expression would you use for total_stories to combine the old and new counts?


Aggregate Windows — Adding Context

Window functions let you add group-level statistics to individual rows without collapsing the result. In Week 3, computing percent of total required a CTE for the subtotals, another CTE for the grand total, and a CROSS JOIN. Window functions make this a one-liner.
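To see the one-liner concretely before the exercise, here is a minimal sketch on a hypothetical sales table (not part of the course dataset): each row keeps its detail, and the empty OVER () attaches the grand total to every row.

```sql
-- Percent of total in a single query: no second CTE, no CROSS JOIN.
-- `sales` with columns region and revenue is a made-up example table.
SELECT
  region,
  revenue,
  ROUND(revenue * 100.0 / SUM(revenue) OVER (), 2) AS pct_of_total
FROM sales
```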

Exercise 3: Domain Traction

Goal: For February 2026, find each domain’s share of total stories AND its share of total score. Compute a “traction ratio” — if a domain gets 2% of score but only 1% of stories, its ratio is 2.0, meaning each story gets disproportionate engagement.

Hints:

  • Start with a CTE that aggregates stories and total score per domain (filter to February 2026, exclude NULL domains, require at least 5 stories via HAVING)
  • In the outer query, SUM(total_score) OVER () gives the grand total across all rows — the empty OVER () means “the entire result set is one window”
  • Same pattern for stories: SUM(stories) OVER ()
  • Traction ratio = pct_of_score / pct_of_stories
WITH domain_stats AS (
  SELECT
    domain,
    COUNT(*) AS stories,
    SUM(score) AS total_score
  FROM `econ250-2026.course_data.hn_stories`
  WHERE DATE(posted_at) BETWEEN '2026-02-01' AND '2026-02-28'
    AND domain IS NOT NULL
  GROUP BY domain
  HAVING COUNT(*) >= 5
)
SELECT
  domain,
  stories,
  total_score,
  ROUND(stories * 100.0 / ???, 2) AS pct_of_stories,
  ROUND(total_score * 100.0 / ???, 2) AS pct_of_score,
  ROUND(??? / ???, 2) AS traction_ratio
FROM domain_stats
ORDER BY total_score DESC
LIMIT 20

Expected: ~20 rows. github.com likely leads in raw score. Look for domains with traction_ratio well above 1.0 — they get more engagement per story. Domains below 1.0 post often but get less engagement per story.

Important: Save as view
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_3` AS
-- Your query (without the LIMIT 20 — save the full result)

Exercise 4: Top Stories with Domain Context

Goal: Find the 30 highest-scoring stories in February 2026. For each story, add a column showing how many total stories that domain posted in the same period. You can see whether a top story came from a frequent source (like github.com) or a little-known domain.

Hints:

  • This needs only a single query — no CTE required
  • COUNT(*) OVER (PARTITION BY domain) computes the domain’s total story count across the result set, attached to every row from that domain
  • Filter to February 2026, non-null domains
  • ORDER BY score DESC, LIMIT 30

Expected: 30 rows. Some top stories come from domains with 100+ stories in the month; others come from domains with fewer than 5.

Follow-up — find the one-hit wonders: Now filter to stories that scored above 100 but came from domains that posted 3 or fewer stories total in the period. You can’t filter on a window function in WHERE (it’s computed after WHERE). Two options:

  • Wrap your query in a CTE, then filter domain_stories <= 3 in the outer query
  • Use QUALIFY to filter on the window function result directly
Important: Save as view

Save the follow-up query (the one-hit wonders) as your view:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_4` AS
-- Your one-hit wonders query

Rankings

Exercise 5: Weekday Posting Patterns

Goal: For each day of the week (Monday through Sunday), find: the average number of stories per day, the specific date with the most stories ever, and the specific date with the fewest stories. Use the full dataset but exclude the most recent date (it may have incomplete data). Your output should have exactly 7 rows.

Approach:

  1. Start with a CTE that counts stories per calendar date. Important: exclude the last date in the table — it may have incomplete data (the table is refreshed periodically, and the final day often has a partial count). Use a subquery: WHERE DATE(posted_at) < (SELECT MAX(DATE(posted_at)) FROM ...)
  2. Add the weekday number (EXTRACT(DAYOFWEEK FROM posted_date)) and name (FORMAT_DATE('%A', posted_date))
  3. Use window functions partitioned by weekday:
    • AVG(stories) OVER (PARTITION BY ...) for the average
    • FIRST_VALUE(posted_date) OVER (PARTITION BY ... ORDER BY stories DESC) for the busiest date
    • FIRST_VALUE(stories) OVER (PARTITION BY ... ORDER BY stories DESC) for that date’s count
    • Same pattern with ORDER BY stories ASC for the quietest date
  4. Collapse to one row per weekday using SELECT DISTINCT

Expected: 7 rows. Weekdays should have higher average story counts than weekends. The busiest and quietest dates should correspond to recognizable events or holidays.

Important: Save as view
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_5` AS
-- Your query

Exercise 6: Top N per Group

Goal: Find the top 3 stories by score for each week of February 2026.

Think through:

  • Which ranking function should you use — RANK() or ROW_NUMBER()? What happens if two stories tie for 3rd place?
  • What do you partition by to get separate rankings per week?
  • What do you order by within each partition?
  • Window functions are computed after WHERE — how do you keep only the top 3?

After you solve it using a CTE, try rewriting the query with QUALIFY — a BigQuery extension that filters on window function results directly, the same way HAVING filters on GROUP BY results. QUALIFY goes after WHERE/GROUP BY/HAVING, before ORDER BY.

Note: QUALIFY is not standard SQL

QUALIFY works in BigQuery and Snowflake but not in PostgreSQL, MySQL, or most other databases. The CTE approach works everywhere. Know both — use whichever is clearer for the query at hand.
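The general shape of the QUALIFY pattern, sketched on a hypothetical products table (not the course dataset — mapping it onto the exercise is still up to you):

```sql
-- Keep the two cheapest products per category. QUALIFY filters on the
-- window function result directly, with no wrapping CTE.
SELECT
  category,
  product,
  price
FROM products
QUALIFY ROW_NUMBER() OVER (PARTITION BY category ORDER BY price) <= 2
```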

Expected: Roughly 12–15 rows (3 per week, 4–5 weeks in February). You should recognize some of the story titles from recent tech news.

Important: Save as view
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_6` AS
-- Your query (either approach)

Running Calculations

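The techniques this section names — moving averages and frame specifications — can be sketched as follows. This is an illustrative example, not a graded exercise: a 7-day moving average of daily story counts, where the frame ROWS BETWEEN 6 PRECEDING AND CURRENT ROW means “this row and the six before it in date order.”

```sql
-- 7-day moving average of daily story counts.
WITH daily AS (
  SELECT DATE(posted_at) AS posted_date, COUNT(*) AS stories
  FROM `econ250-2026.course_data.hn_stories`
  GROUP BY posted_date
)
SELECT
  posted_date,
  stories,
  ROUND(AVG(stories) OVER (
    ORDER BY posted_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ), 1) AS stories_7day_avg
FROM daily
ORDER BY posted_date
```

Note that the first six rows average over fewer than seven days, since the frame is truncated at the start of the data.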
Going Further (optional)

If you finish early, try one of these — or come up with your own idea and give it a try:

  • Author career stats: For a specific prolific author, list their stories with a running post number (ROW_NUMBER), running average score across their career, and their all-time total. Can you spot when they had a viral hit?
  • Domain leaderboard over time: For the top 10 domains by total score, compute each domain’s monthly rank. Use LAG on the rank to show rank changes month-over-month. Which domains are rising vs falling?
  • Your idea: What other questions could you answer with window functions on this dataset?