Week 4 Practice: Window Functions & Incremental Updates
ECON250: Big Data Analytics
Your Submission
Deadline: 3 days after the practice sessions
Points: 2 points for satisfactory completion
As you work through the exercises, save each one as a table (MERGE exercises) or view (window function exercises) in your personal dataset. Name them week4_1, week4_2, etc.
What to submit: The link to the latest view or table you created during the practice. If you completed all exercises, that would be week4_7. Reaching week4_4 or beyond demonstrates satisfactory understanding.
How to Submit
- In BigQuery Explorer, find the latest view/table in your student__yourname dataset
- Click the three-dot menu (⋮) next to its name
- Select “Copy link”
- Paste the link in Moodle
Before You Start
Required: BigQuery console open at console.cloud.google.com/bigquery?project=econ250-2026
What is Hacker News?
Hacker News (news.ycombinator.com) is a social news site run by Y Combinator, focused on technology, startups, and science. Users submit links or text posts, which are upvoted by the community. A story’s score is its net upvotes — most stories get 1–3, while a few go viral with scores in the thousands.
Dataset
econ250-2026.course_data.hn_stories — Hacker News story submissions from 2016 to present (~3M rows). Built from the BigQuery public hacker_news dataset: non-dead stories with scores, plus comment aggregates.
| Column | Type | Notes |
|---|---|---|
| id | INTEGER | Unique story ID |
| author | STRING | Username of the poster |
| title | STRING | Story title |
| score | INTEGER | Upvotes (highly skewed — median 2, max ~6,000) |
| num_comments | INTEGER | Total comment count (NULL for some) |
| posted_at | TIMESTAMP | When the story was posted (UTC) — partition key via DATE(posted_at) |
| domain | STRING | Source domain extracted from URL (NULL for self-posts like Ask HN) |
| num_direct_replies | INTEGER | Top-level comments |
| num_unique_commenters | INTEGER | Distinct authors of direct replies |
| first_comment_at | TIMESTAMP | Earliest direct reply |
| last_comment_at | TIMESTAMP | Latest direct reply |
Partitioned by DATE(posted_at) (daily), clustered by domain.
The full table scans ~400 MB — well within your daily quota. For exercises that focus on a recent period, we’ll use February 2026 data.
What you’ll practice
- Incremental updates: `MERGE` for daily summaries and lookup tables
- Aggregate windows: `SUM() OVER ()`, `COUNT() OVER (PARTITION BY ...)`
- Ranking: `RANK`, `ROW_NUMBER`, `FIRST_VALUE`
- Filtering on window functions: CTE + WHERE, QUALIFY
- Period comparisons: `LAG`
- Running calculations: moving averages, frame specifications
Exploring the Data
Before computing anything, explore the dataset.
Quick stats
In BigQuery Explorer, find econ250-2026 → course_data → hn_stories, click the table name, then open the Preview tab (free — no bytes scanned). Browse a few rows. Then run:
```sql
SELECT
  COUNT(*) AS total_stories,
  COUNT(DISTINCT author) AS unique_authors,
  MIN(DATE(posted_at)) AS earliest,
  MAX(DATE(posted_at)) AS latest,
  ROUND(AVG(score), 1) AS avg_score,
  APPROX_QUANTILES(score, 100)[OFFSET(50)] AS median_score
FROM `econ250-2026.course_data.hn_stories`
```

What you should notice: The average score is far higher than the median. Most stories get almost no engagement, while a small number go viral. You’ll see this skew throughout today’s exercises.
Date continuity check
Are there any missing dates in the dataset — days with zero stories? Use LAG to compare each date to its predecessor:
```sql
WITH daily AS (
  SELECT DISTINCT DATE(posted_at) AS posted_date
  FROM `econ250-2026.course_data.hn_stories`
),
with_prev AS (
  SELECT
    posted_date,
    ??? AS prev_date
  FROM daily
)
SELECT
  posted_date,
  prev_date,
  DATE_DIFF(posted_date, prev_date, DAY) AS gap_days
FROM with_prev
WHERE DATE_DIFF(posted_date, prev_date, DAY) > 1
ORDER BY posted_date
```

Replace ??? with the appropriate expression. `LAG(column) OVER (ORDER BY column)` looks back one row in the specified order.
Expected: If the query returns no rows, there are no gaps — every calendar day has at least one story.
You might wonder why we need the with_prev CTE — why not put the LAG and the WHERE in the same query? Window functions are computed after WHERE, so you can’t filter on a window function result in the same query that defines it. The CTE computes prev_date first; the outer query filters on it. You’ll see this pattern again in later exercises.
Incremental Updates with MERGE
When maintaining a summary table, rebuilding the whole table every day means rescanning the entire history — even if only one day of data is new. MERGE lets you update an existing table incrementally: insert new rows and update existing ones in a single statement.
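As a sketch of the pattern (the project, dataset, table, and column names below are made up for illustration and are not part of the course dataset), a MERGE that keeps a per-day count current looks like this:

```sql
-- Hypothetical tables: daily_counts (event_date, events) is the summary,
-- events (event_at, ...) is the raw data. Only recent days are re-aggregated.
MERGE `myproject.mydataset.daily_counts` AS target
USING (
  SELECT DATE(event_at) AS event_date, COUNT(*) AS events
  FROM `myproject.mydataset.events`
  WHERE DATE(event_at) >= '2026-02-01'
  GROUP BY event_date
) AS source
ON target.event_date = source.event_date
WHEN MATCHED THEN
  UPDATE SET events = source.events    -- day already summarized: refresh it
WHEN NOT MATCHED THEN
  INSERT (event_date, events) VALUES (source.event_date, source.events)  -- new day
```

Because matched rows are overwritten rather than re-inserted, running this statement twice leaves the table in the same state.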
Exercise 1: Daily Summary with MERGE
Step 1 — Create a daily summary table from all data through January 31, 2026. This table is partitioned by posted_date — the same concept you saw on the source table:
```sql
CREATE OR REPLACE TABLE `econ250-2026.student__yourname.week4_1`
PARTITION BY posted_date
AS
SELECT
  DATE(posted_at) AS posted_date,
  COUNT(*) AS stories,
  SUM(score) AS total_score,
  COUNT(DISTINCT author) AS unique_authors
FROM `econ250-2026.course_data.hn_stories`
WHERE DATE(posted_at) <= '2026-01-31'
GROUP BY posted_date
```

Verify how many rows the table has and the date range:
```sql
SELECT COUNT(*) AS days, MIN(posted_date) AS first_day, MAX(posted_date) AS last_day
FROM `econ250-2026.student__yourname.week4_1`
```

You should see ~3,680 rows — one per day from 2016 to January 31, 2026.
Step 2 — Write a MERGE to add all of 2026 data. Since January is already in the table, those rows will overlap (matched → updated). February rows are new (not matched → inserted).
The source subquery should have the same structure as Step 1, but filtered to DATE(posted_at) >= '2026-01-01'.
The MERGE structure:
```sql
MERGE target_table AS target
USING (source_query) AS source
ON target.key = source.key
WHEN MATCHED THEN UPDATE SET col1 = source.col1, ...
WHEN NOT MATCHED THEN INSERT (col1, ...) VALUES (source.col1, ...)
```

- Match on posted_date
- WHEN MATCHED: update all metric columns with the source values
- WHEN NOT MATCHED: insert the new row
Expected: The output says “N affected rows” — this includes both the ~31 January rows (updated with identical values) and the ~25 February rows (newly inserted). Verify: the table should now have ~3,700+ rows and MAX(posted_date) should be in February 2026.
Step 3 — Run the exact same MERGE again. Check the row count: unchanged. The repeated run updates all 2026 rows with identical values rather than creating duplicates. This is idempotency — re-running produces the same result. If you had used INSERT INTO instead of MERGE, you’d get duplicates every time.
In this exercise, January rows get “updated” with the same values — nothing visibly changes. But WHEN MATCHED is what makes the query safe to re-run. In production, scores on existing days might change (late upvotes), and the UPDATE clause handles that correctly. Always include both clauses unless you have a specific reason not to.
Aggregate Windows — Adding Context
Window functions let you add group-level statistics to individual rows without collapsing the result. In Week 3, computing percent of total required a CTE for the subtotals, another CTE for the grand total, and a CROSS JOIN. Window functions make this a one-liner.
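For instance, a percent-of-total that needed two CTEs and a CROSS JOIN in Week 3 collapses to a single expression. A minimal sketch (the table and column names here are hypothetical, for illustration only):

```sql
-- Each row keeps its detail; SUM(amount) OVER () attaches the grand
-- total of the whole result set to every row.
SELECT
  region,
  amount,
  ROUND(amount * 100.0 / SUM(amount) OVER (), 1) AS pct_of_total
FROM `myproject.mydataset.sales`
```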
Exercise 3: Domain Traction
Goal: For February 2026, find each domain’s share of total stories AND its share of total score. Compute a “traction ratio” — if a domain gets 2% of score but only 1% of stories, its ratio is 2.0, meaning each story gets disproportionate engagement.
Hints:
- Start with a CTE that aggregates stories and total score per domain (filter to February 2026, exclude NULL domains, require at least 5 stories via HAVING)
- In the outer query, `SUM(total_score) OVER ()` gives the grand total across all rows — the empty `OVER ()` means “the entire result set is one window”
- Same pattern for stories: `SUM(stories) OVER ()`
- Traction ratio = pct_of_score / pct_of_stories
```sql
WITH domain_stats AS (
  SELECT
    domain,
    COUNT(*) AS stories,
    SUM(score) AS total_score
  FROM `econ250-2026.course_data.hn_stories`
  WHERE DATE(posted_at) BETWEEN '2026-02-01' AND '2026-02-28'
    AND domain IS NOT NULL
  GROUP BY domain
  HAVING COUNT(*) >= 5
)
SELECT
  domain,
  stories,
  total_score,
  ROUND(stories * 100.0 / ???, 2) AS pct_of_stories,
  ROUND(total_score * 100.0 / ???, 2) AS pct_of_score,
  ROUND(??? / ???, 2) AS traction_ratio
FROM domain_stats
ORDER BY total_score DESC
LIMIT 20
```

Expected: ~20 rows. github.com likely leads in raw score. Look for domains with traction_ratio well above 1.0 — they get more engagement per story. Domains below 1.0 post often but get less engagement per story.
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_3` AS
-- Your query (without the LIMIT 20 — save the full result)
```

Exercise 4: Top Stories with Domain Context
Goal: Find the 30 highest-scoring stories in February 2026. For each story, add a column showing how many total stories that domain posted in the same period, so you can see whether a top story came from a frequent source (like github.com) or a little-known domain.
Hints:
- This needs only a single query — no CTE required
- `COUNT(*) OVER (PARTITION BY domain)` computes the domain’s total story count across the result set, attached to every row from that domain
- Filter to February 2026, non-null domains
- ORDER BY score DESC, LIMIT 30
Expected: 30 rows. Some top stories come from domains with 100+ stories in the month; others come from domains with fewer than 5.
Follow-up — find the one-hit wonders: Now filter to stories that scored above 100 but came from domains that posted 3 or fewer stories total in the period. You can’t filter on a window function in WHERE (it’s computed after WHERE). Two options:
- Wrap your query in a CTE, then filter `domain_stories <= 3` in the outer query
- Use `QUALIFY` to filter on the window function result directly
Save the follow-up query (the one-hit wonders) as your view:
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_4` AS
-- Your one-hit wonders query
```

Rankings
Exercise 5: Weekday Posting Patterns
Goal: For each day of the week (Monday through Sunday), find: the average number of stories per day, the specific date with the most stories ever, and the specific date with the fewest stories. Use the full dataset but exclude the most recent date (it may have incomplete data). Your output should have exactly 7 rows.
Approach:
- Start with a CTE that counts stories per calendar date. Important: exclude the last date in the table — it may have incomplete data (the table is refreshed periodically, and the final day often has a partial count). Use a subquery: `WHERE DATE(posted_at) < (SELECT MAX(DATE(posted_at)) FROM ...)`
- Add the weekday number (`EXTRACT(DAYOFWEEK FROM posted_date)`) and name (`FORMAT_DATE('%A', posted_date)`)
- Use window functions partitioned by weekday:
  - `AVG(stories) OVER (PARTITION BY ...)` for the average
  - `FIRST_VALUE(posted_date) OVER (PARTITION BY ... ORDER BY stories DESC)` for the busiest date
  - `FIRST_VALUE(stories) OVER (PARTITION BY ... ORDER BY stories DESC)` for that date’s count
  - Same pattern with `ORDER BY stories ASC` for the quietest date
- Collapse to one row per weekday using `SELECT DISTINCT`
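The FIRST_VALUE + DISTINCT pattern from the approach above, sketched on a hypothetical per-date counts table (`grp`, `d`, and `n` are placeholder names, not course columns):

```sql
-- One output row per group: the partition-wide average plus the date
-- (and count) of the group's peak day. DISTINCT collapses the duplicate
-- rows the window functions would otherwise leave behind.
SELECT DISTINCT
  grp,
  AVG(n) OVER (PARTITION BY grp) AS avg_n,
  FIRST_VALUE(d) OVER (PARTITION BY grp ORDER BY n DESC) AS busiest_d,
  FIRST_VALUE(n) OVER (PARTITION BY grp ORDER BY n DESC) AS busiest_n
FROM `myproject.mydataset.daily_counts`
```

If two dates tie on `n`, FIRST_VALUE picks one arbitrarily; add a tiebreaker column to the ORDER BY if that matters.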
Expected: 7 rows. Weekdays should have higher average story counts than weekends. The busiest and quietest dates should correspond to recognizable events or holidays.
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_5` AS
-- Your query
```

Exercise 6: Top N per Group
Goal: Find the top 3 stories by score for each week of February 2026.
Think through:
- Which ranking function should you use — `RANK()` or `ROW_NUMBER()`? What happens if two stories tie for 3rd place?
- What do you partition by to get separate rankings per week?
- What do you order by within each partition?
- Window functions are computed after WHERE — how do you keep only the top 3?
After you solve it using a CTE, try rewriting the query with QUALIFY — a BigQuery extension that filters on window function results directly, the same way HAVING filters on GROUP BY results. QUALIFY goes after WHERE/GROUP BY/HAVING, before ORDER BY.
QUALIFY works in BigQuery and Snowflake but not in PostgreSQL, MySQL, or most other databases. The CTE approach works everywhere. Know both — use whichever is clearer for the query at hand.
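The two approaches, side by side on a hypothetical items table (top 2 per group; all names are placeholders):

```sql
-- Portable version: rank in a CTE, filter in the outer query.
WITH ranked AS (
  SELECT grp, item, score,
         ROW_NUMBER() OVER (PARTITION BY grp ORDER BY score DESC) AS rn
  FROM `myproject.mydataset.items`
)
SELECT grp, item, score
FROM ranked
WHERE rn <= 2;

-- BigQuery QUALIFY version: same result, no CTE.
SELECT grp, item, score
FROM `myproject.mydataset.items`
WHERE TRUE  -- BigQuery expects a WHERE/GROUP BY/HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (PARTITION BY grp ORDER BY score DESC) <= 2;
```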
Expected: Roughly 12–15 rows (3 per week, 4–5 weeks in February). You should recognize some of the story titles from recent tech news.
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_6` AS
-- Your query (either approach)
```

Running Calculations
Exercise 7: Monthly Trends
Goal: Build a monthly summary of HN activity using the full dataset. For each month, compute the total score, then add month-over-month growth and a 12-month moving average to see long-term trends.
Required columns:
- `month` — first day of each month (DATE)
- `stories` — total stories posted
- `total_score` — sum of all scores
- `prev_month_score` — previous month’s total_score
- `mom_growth_pct` — month-over-month growth in total_score, rounded to 1 decimal
- `avg_score_12m` — 12-month moving average of total_score, rounded to 0 decimals
Structure:
- A CTE that aggregates stories and total_score by `DATE_TRUNC(DATE(posted_at), MONTH)`
- An outer query that adds the window function columns
Key hints:
- LAG: `LAG(total_score) OVER (ORDER BY month)` gives the previous month’s value
- MoM growth: `(current - previous) * 100.0 / previous`
- Moving average: you need an explicit frame — `AVG(total_score) OVER (ORDER BY month ROWS BETWEEN 11 PRECEDING AND CURRENT ROW)`. Without `ROWS BETWEEN`, the default frame with ORDER BY is cumulative from the start — you’d get a cumulative average instead of a 12-month rolling window
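The default frame versus an explicit ROWS frame, illustrated on a hypothetical monthly totals table (placeholder names, not the exercise query):

```sql
SELECT
  month,
  total,
  -- Default frame with ORDER BY: first row through the current row,
  -- i.e. a cumulative average.
  AVG(total) OVER (ORDER BY month) AS cumulative_avg,
  -- Explicit frame: the current row plus the 2 before it,
  -- i.e. a 3-month rolling average.
  AVG(total) OVER (
    ORDER BY month
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS rolling_3m_avg
FROM `myproject.mydataset.monthly_totals`
```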
Expected: One row per month from 2016 to present (~120 rows). The first month has NULL for prev_month_score and mom_growth_pct. The first 11 months have a moving average based on fewer than 12 months (that’s correct — AVG uses whatever rows are available in the frame). The 12-month average should be much smoother than the raw monthly values — seasonal spikes and dips get absorbed.
```sql
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week4_7` AS
-- Your query
```

Going Further (optional)
If you finish early, try one of these — or come up with your own idea and give it a try:
- Author career stats: For a specific prolific author, list their stories with a running post number (`ROW_NUMBER`), running average score across their career, and their all-time total. Can you spot when they had a viral hit?
- Domain leaderboard over time: For the top 10 domains by total score, compute each domain’s monthly rank. Use LAG on the rank to show rank changes month-over-month. Which domains are rising vs falling?
- Your idea: What other questions could you answer with window functions on this dataset?