Week 4 Practice: Window Functions & Incremental Updates
ECON250: Big Data Analytics
Your Submission
Deadline: 3 days after the practice sessions
Points: 2 points for satisfactory completion
As you work through the exercises, save each one as a table (MERGE exercises) or view (window function exercises) in your personal dataset. Name them week4_1, week4_2, etc.
What to submit: The link to the latest view or table you created during the practice. If you completed all exercises, that would be week4_7. Reaching week4_4 or beyond demonstrates satisfactory understanding.
How to Submit
- In BigQuery Explorer, find the latest view/table in your student__yourname dataset
- Click the three-dot menu (⋮) next to its name
- Select “Copy link”
- Paste the link in Moodle
Before You Start
Required: BigQuery console open at console.cloud.google.com/bigquery?project=econ250-2026
What is Hacker News?
Hacker News (news.ycombinator.com) is a social news site run by Y Combinator, focused on technology, startups, and science. Users submit links or text posts, which are upvoted by the community. A story’s score is its net upvotes — most stories get 1–3, while a few go viral with scores in the thousands.
Dataset
econ250-2026.course_data.hn_stories — Hacker News story submissions from 2016 to present (~3M rows). Built from the BigQuery public hacker_news dataset: non-dead stories with scores, plus comment aggregates.
| Column | Type | Notes |
|---|---|---|
| id | INTEGER | Unique story ID |
| author | STRING | Username of the poster |
| title | STRING | Story title |
| score | INTEGER | Upvotes (highly skewed — median 2, max ~6,000) |
| num_comments | INTEGER | Total comment count (NULL for some) |
| posted_at | TIMESTAMP | When the story was posted (UTC) — partition key via DATE(posted_at) |
| domain | STRING | Source domain extracted from URL (NULL for self-posts like Ask HN) |
| num_direct_replies | INTEGER | Top-level comments |
| num_unique_commenters | INTEGER | Distinct authors of direct replies |
| first_comment_at | TIMESTAMP | Earliest direct reply |
| last_comment_at | TIMESTAMP | Latest direct reply |
Partitioned by DATE(posted_at) (daily), clustered by domain.
The full table scans ~400 MB — well within your daily quota. For exercises that focus on a recent period, we’ll use February 2026 data.
What you’ll practice
- Incremental updates: `MERGE` for daily summaries and lookup tables
- Aggregate windows: `SUM() OVER ()`, `COUNT() OVER (PARTITION BY ...)`
- Ranking: `RANK`, `ROW_NUMBER`, `FIRST_VALUE`
- Filtering on window functions: CTE + WHERE, QUALIFY
- Period comparisons: `LAG`
- Running calculations: moving averages, frame specifications
Exploring the Data
Before computing anything, explore the dataset.
Quick stats
In BigQuery Explorer, find econ250-2026 → course_data → hn_stories, click the table name, then open the Preview tab (free — no bytes scanned). Browse a few rows. Then run:
```sql
SELECT
  COUNT(*) AS total_stories,
  COUNT(DISTINCT author) AS unique_authors,
  MIN(DATE(posted_at)) AS earliest,
  MAX(DATE(posted_at)) AS latest,
  ROUND(AVG(score), 1) AS avg_score,
  APPROX_QUANTILES(score, 100)[OFFSET(50)] AS median_score
FROM `econ250-2026.course_data.hn_stories`
```

What you should notice: The average score is far higher than the median. Most stories get almost no engagement, while a small number go viral. You’ll see this skew throughout today’s exercises.
Date continuity check
Are there any missing dates in the dataset — days with zero stories? Use LAG to compare each date to its predecessor:
```sql
WITH daily AS (
  SELECT DISTINCT DATE(posted_at) AS posted_date
  FROM `econ250-2026.course_data.hn_stories`
),
with_prev AS (
  SELECT
    posted_date,
    ??? AS prev_date
  FROM daily
)
SELECT
  posted_date,
  prev_date,
  DATE_DIFF(posted_date, prev_date, DAY) AS gap_days
FROM with_prev
WHERE DATE_DIFF(posted_date, prev_date, DAY) > 1
ORDER BY posted_date
```

Replace ??? with the appropriate expression. `LAG(column) OVER (ORDER BY column)` looks back one row in the specified order.
Expected: If the query returns no rows, there are no gaps — every calendar day has at least one story.
You might wonder why we need the with_prev CTE — why not put the LAG and the WHERE in the same query? Window functions are computed after WHERE, so you can’t filter on a window function result in the same query that defines it. The CTE computes prev_date first; the outer query filters on it. You’ll see this pattern again in later exercises.
Incremental Updates with MERGE
When maintaining a summary table, rebuilding the whole table every day means rescanning the entire history — even if only one day of data is new. MERGE lets you update an existing table incrementally: insert new rows and update existing ones in a single statement.
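As a sketch of the pattern (the project, dataset, table, and column names below are made up for illustration and are not part of the course dataset), a MERGE that keeps a per-day count current looks like this:

```sql
-- Hypothetical tables: daily_counts (event_date, events) is the summary,
-- events (event_at, ...) is the raw data. Only recent days are re-aggregated.
MERGE `myproject.mydataset.daily_counts` AS target
USING (
  SELECT DATE(event_at) AS event_date, COUNT(*) AS events
  FROM `myproject.mydataset.events`
  WHERE DATE(event_at) >= '2026-02-01'
  GROUP BY event_date
) AS source
ON target.event_date = source.event_date
WHEN MATCHED THEN
  UPDATE SET events = source.events    -- day already summarized: refresh it
WHEN NOT MATCHED THEN
  INSERT (event_date, events) VALUES (source.event_date, source.events)  -- new day
```

Because matched rows are overwritten rather than re-inserted, running this statement twice leaves the table in the same state.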
Exercise 1: Daily Summary with MERGE
Step 1 — Create a daily summary table from all data through January 31, 2026. This table is partitioned by posted_date — the same concept you saw on the source table:
```sql
CREATE OR REPLACE TABLE `econ250-2026.student__yourname.week4_1`
PARTITION BY posted_date
AS
SELECT
  DATE(posted_at) AS posted_date,
  COUNT(*) AS stories,
  SUM(score) AS total_score,
  COUNT(DISTINCT author) AS unique_authors
FROM `econ250-2026.course_data.hn_stories`
WHERE DATE(posted_at) <= '2026-01-31'
GROUP BY posted_date
```

Verify how many rows the table has and the date range:
```sql
SELECT COUNT(*) AS days, MIN(posted_date) AS first_day, MAX(posted_date) AS last_day
FROM `econ250-2026.student__yourname.week4_1`
```

You should see ~3,680 rows — one per day from 2016 to January 31, 2026.
Step 2 — Write a MERGE to add all of 2026 data. Since January is already in the table, those rows will overlap (matched → updated). February rows are new (not matched → inserted).
The source subquery should have the same structure as Step 1, but filtered to DATE(posted_at) >= '2026-01-01'.
The MERGE structure:
```sql
MERGE target_table AS target
USING (source_query) AS source
ON target.key = source.key
WHEN MATCHED THEN UPDATE SET col1 = source.col1, ...
WHEN NOT MATCHED THEN INSERT (col1, ...) VALUES (source.col1, ...)
```

- Match on posted_date
- WHEN MATCHED: update all metric columns with the source values
- WHEN NOT MATCHED: insert the new row
Expected: The output says “N affected rows” — this includes both the ~31 January rows (updated with identical values) and the ~25 February rows (newly inserted). Verify: the table should now have ~3,700+ rows and MAX(posted_date) should be in February 2026.
Step 3 — Run the exact same MERGE again. Check the row count: unchanged. The repeated run updates all 2026 rows with identical values rather than creating duplicates. This is idempotency — re-running produces the same result. If you had used INSERT INTO instead of MERGE, you’d get duplicates every time.
In this exercise, January rows get “updated” with the same values — nothing visibly changes. But WHEN MATCHED is what makes the query safe to re-run. In production, scores on existing days might change (late upvotes), and the UPDATE clause handles that correctly. Always include both clauses unless you have a specific reason not to.
Aggregate Windows — Adding Context
Window functions let you add group-level statistics to individual rows without collapsing the result. In Week 3, computing percent of total required a CTE for the subtotals, another CTE for the grand total, and a CROSS JOIN. Window functions make this a one-liner.
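For instance, a percent-of-total that needed two CTEs and a CROSS JOIN in Week 3 collapses to a single expression. A minimal sketch (the table and column names here are hypothetical, for illustration only):

```sql
-- Each row keeps its detail; SUM(amount) OVER () attaches the grand
-- total of the whole result set to every row.
SELECT
  region,
  amount,
  ROUND(amount * 100.0 / SUM(amount) OVER (), 1) AS pct_of_total
FROM `myproject.mydataset.sales`
```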
Exercise 3: Domain Traction
Goal: For February 2026, find each domain’s share of total stories AND its share of total score. Compute a “traction ratio” — if a domain gets 2% of score but only 1% of stories, its ratio is 2.0, meaning each story gets disproportionate engagement.
Hints:
- Start with a CTE that aggregates stories and total score per domain (filter to February 2026, exclude NULL domains, require at least 5 stories via HAVING)
- In the outer query, `SUM(total_score) OVER ()` gives the grand total across all rows — the empty `OVER ()` means “the entire result set is one window”
- Same pattern for stories: `SUM(stories) OVER ()`
- Traction ratio = pct_of_score / pct_of_stories
```sql
WITH domain_stats AS (
  SELECT
    domain,
    COUNT(*) AS stories,
    SUM(score) AS total_score
  FROM `econ250-2026.course_data.hn_stories`
  WHERE DATE(posted_at) BETWEEN '2026-02-01' AND '2026-02-28'
    AND domain IS NOT NULL
  GROUP BY domain
  HAVING COUNT(*) >= 5
)
SELECT
  domain,
  stories,
  total_score,
  ROUND(stories * 100.0 / ???, 2) AS pct_of_stories,
  ROUND(total_score * 100.0 / ???, 2) AS pct_of_score,
  ROUND(??? / ???, 2) AS traction_ratio
FROM domain_stats
ORDER BY total_score DESC
LIMIT 20
```

Expected: ~20 rows. github.com likely leads in raw score. Look for domains with traction_ratio well above 1.0 — they get more engagement per story. Domains below 1.0 post often but get less engagement per story.
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_3` AS
-- Your query (without the LIMIT 20 — save the full result)
```

Exercise 4: Top Stories with Domain Context
Goal: Find the 30 highest-scoring stories in February 2026. For each story, add a column showing how many total stories that domain posted in the same period, so you can see whether a top story came from a frequent source (like github.com) or a little-known domain.
Hints:
- This needs only a single query — no CTE required
- `COUNT(*) OVER (PARTITION BY domain)` computes the domain’s total story count across the result set, attached to every row from that domain
- Filter to February 2026, non-null domains
- ORDER BY score DESC, LIMIT 30
Expected: 30 rows. Some top stories come from domains with 100+ stories in the month; others come from domains with fewer than 5.
Follow-up — find the one-hit wonders: Now filter to stories that scored above 100 but came from domains that posted 3 or fewer stories total in the period. You can’t filter on a window function in WHERE (it’s computed after WHERE). Two options:
- Wrap your query in a CTE, then filter `domain_stories <= 3` in the outer query
- Use `QUALIFY` to filter on the window function result directly
Save the follow-up query (the one-hit wonders) as your view:
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_4` AS
-- Your one-hit wonders query
```

Rankings
Exercise 5: Weekday Posting Patterns
Goal: For each day of the week (Monday through Sunday), find: the average number of stories per day, the specific date with the most stories ever, and the specific date with the fewest stories. Use the full dataset but exclude the most recent date (it may have incomplete data). Your output should have exactly 7 rows.
Approach:
- Start with a CTE that counts stories per calendar date. Important: exclude the last date in the table — it may have incomplete data (the table is refreshed periodically, and the final day often has a partial count). Use a subquery: `WHERE DATE(posted_at) < (SELECT MAX(DATE(posted_at)) FROM ...)`
- Add the weekday number (`EXTRACT(DAYOFWEEK FROM posted_date)`) and name (`FORMAT_DATE('%A', posted_date)`)
- Use window functions partitioned by weekday:
  - `AVG(stories) OVER (PARTITION BY ...)` for the average
  - `FIRST_VALUE(posted_date) OVER (PARTITION BY ... ORDER BY stories DESC)` for the busiest date
  - `FIRST_VALUE(stories) OVER (PARTITION BY ... ORDER BY stories DESC)` for that date’s count
  - Same pattern with `ORDER BY stories ASC` for the quietest date
- Collapse to one row per weekday using `SELECT DISTINCT`
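The FIRST_VALUE + DISTINCT pattern from the approach above, sketched on a hypothetical per-date counts table (`grp`, `d`, and `n` are placeholder names, not course columns):

```sql
-- One output row per group: the partition-wide average plus the date
-- (and count) of the group's peak day. DISTINCT collapses the duplicate
-- rows the window functions would otherwise leave behind.
SELECT DISTINCT
  grp,
  AVG(n) OVER (PARTITION BY grp) AS avg_n,
  FIRST_VALUE(d) OVER (PARTITION BY grp ORDER BY n DESC) AS busiest_d,
  FIRST_VALUE(n) OVER (PARTITION BY grp ORDER BY n DESC) AS busiest_n
FROM `myproject.mydataset.daily_counts`
```

If two dates tie on `n`, FIRST_VALUE picks one arbitrarily; add a tiebreaker column to the ORDER BY if that matters.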
Expected: 7 rows. Weekdays should have higher average story counts than weekends. The busiest and quietest dates should correspond to recognizable events or holidays.
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_5` AS
-- Your query
```

Exercise 6: Top N per Group
Goal: Find the top 3 stories by score for each week of February 2026.
Think through:
- Which ranking function should you use — `RANK()` or `ROW_NUMBER()`? What happens if two stories tie for 3rd place?
- What do you partition by to get separate rankings per week?
- What do you order by within each partition?
- Window functions are computed after WHERE — how do you keep only the top 3?
After you solve it using a CTE, try rewriting the query with QUALIFY — a BigQuery extension that filters on window function results directly, the same way HAVING filters on GROUP BY results. QUALIFY goes after WHERE/GROUP BY/HAVING, before ORDER BY.
QUALIFY works in BigQuery and Snowflake but not in PostgreSQL, MySQL, or most other databases. The CTE approach works everywhere. Know both — use whichever is clearer for the query at hand.
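The two approaches, side by side on a hypothetical items table (top 2 per group; all names are placeholders):

```sql
-- Portable version: rank in a CTE, filter in the outer query.
WITH ranked AS (
  SELECT grp, item, score,
         ROW_NUMBER() OVER (PARTITION BY grp ORDER BY score DESC) AS rn
  FROM `myproject.mydataset.items`
)
SELECT grp, item, score
FROM ranked
WHERE rn <= 2;

-- BigQuery QUALIFY version: same result, no CTE.
SELECT grp, item, score
FROM `myproject.mydataset.items`
WHERE TRUE  -- BigQuery expects a WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY grp ORDER BY score DESC) <= 2;
```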
Expected: Roughly 12–15 rows (3 per week, 4–5 weeks in February). You should recognize some of the story titles from recent tech news.
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_6` AS
-- Your query (either approach)
```

Running Calculations
Exercise 7: Monthly Trends
Goal: Build a monthly summary of HN activity using the full dataset. For each month, compute the total score, then add month-over-month growth and a 12-month moving average to see long-term trends.
Required columns:
- `month` — first day of each month (DATE)
- `stories` — total stories posted
- `total_score` — sum of all scores
- `prev_month_score` — previous month’s total_score
- `mom_growth_pct` — month-over-month growth in total_score, rounded to 1 decimal
- `avg_score_12m` — 12-month moving average of total_score, rounded to 0 decimals
Structure:
- A CTE that aggregates stories and total_score by `DATE_TRUNC(DATE(posted_at), MONTH)`
- An outer query that adds the window function columns
Key hints:
- LAG: `LAG(total_score) OVER (ORDER BY month)` gives the previous month’s value
- MoM growth: `(current - previous) * 100.0 / previous`
- Moving average: you need an explicit frame — `AVG(total_score) OVER (ORDER BY month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW)`. Without `ROWS BETWEEN`, the default frame with ORDER BY is cumulative from the start — you’d get a cumulative average instead of a 12-month rolling window
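The default frame versus an explicit ROWS frame, illustrated on a hypothetical monthly totals table (placeholder names, not the exercise query):

```sql
SELECT
  month,
  total,
  -- Default frame with ORDER BY: first row through the current row,
  -- i.e. a cumulative average.
  AVG(total) OVER (ORDER BY month) AS cumulative_avg,
  -- Explicit frame: the current row plus the 2 before it,
  -- i.e. a 3-month rolling average.
  AVG(total) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS rolling_3m_avg
FROM `myproject.mydataset.monthly_totals`
```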
Expected: One row per month from 2016 to present (~120 rows). The first month has NULL for prev_month_score and mom_growth_pct. The first 11 months have a moving average based on fewer than 12 months (that’s correct — AVG uses whatever rows are available in the frame). The 12-month average should be much smoother than the raw monthly values — seasonal spikes and dips get absorbed.
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_7` AS
-- Your query
```

Going Further (optional)
If you finish early, try one of these — or come up with your own idea and give it a try:
- Author career stats: For a specific prolific author, list their stories with a running post number (`ROW_NUMBER`), running average score across their career, and their all-time total. Can you spot when they had a viral hit?
- Domain leaderboard over time: For the top 10 domains by total score, compute each domain’s monthly rank. Use LAG on the rank to show rank changes month-over-month. Which domains are rising vs falling?
- Your idea: What other questions could you answer with window functions on this dataset?