Week 5 Practice: Air Raid Alerts & Cohort Retention
ECON250: Big Data Analytics
Your Submission
Deadline: 3 days after the practice session. Points: 2 points for satisfactory completion.
As you work through the exercises, save your key queries as views in your personal dataset. Each section tells you what to save and what to name it.
How to Submit
- In BigQuery Explorer, find your `w5_` views in your `student__yourname` dataset
- For each view: click the three-dot menu (⋮) → “Copy link”
- Paste all view links in Moodle
Practice 1: Air Raid Alerts — How Many Lectures Were Cancelled?
KSE policy: if an air raid alert is active in Kyiv at 07:30 local time, the first lecture (08:30) is cancelled because students cannot safely commute to campus.
Your goal: count how many mornings triggered this policy since the start of 2025. You’ll do this through three investigations.
The Data
econ250-2026.course_data.air_raid_alerts
~124K alerts covering 2022–present.
| Column | Type | Notes |
|---|---|---|
| oblast | STRING | Region name (e.g., “Kyiv City”, “Kharkivska oblast”) |
| raion | STRING | District name (NULL for oblast-level alerts) |
| hromada | STRING | Community name (NULL for oblast/raion-level) |
| level | STRING | oblast, raion, or hromada |
| started_at | TIMESTAMP | UTC start of alert |
| finished_at | TIMESTAMP | UTC end of alert |
Syntax you’ll need
| Pattern | Syntax |
|---|---|
| Timezone conversion | DATETIME(started_at, "Europe/Kiev") |
| Extract just the time part | TIME(datetime_value) |
| Alert duration in minutes | TIMESTAMP_DIFF(finished_at, started_at, MINUTE) |
| Generate a timestamp sequence | UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) AS ts |
| Count distinct values in an interval | COUNT(DISTINCT x) ... ON start <= ts AND end > ts |
Investigation 1: The 2025 Volume Spike
Build: Count monthly alerts across all years. Include level as a dimension.
Spot: 2025 appears to have 2-3x more alerts than any previous year. Is this real?
Pinpoint: Break the counts by level. What changed in the mix? You should find that the system shifted from primarily oblast-level alerts to raion/hromada-level alerts in mid-2025. The total number of incidents didn’t necessarily double — the recording became more granular.
This is a data collection anomaly: the metric spiked because the measurement changed, not because reality changed. For consistent time-series analysis, filter to level = 'oblast'.
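One way to sketch the monthly count with `level` as a dimension, assuming only the table and columns described above:

```sql
-- Monthly alert counts by level, in Kyiv local time.
-- Comparing levels across years reveals the mid-2025 granularity shift.
SELECT
  DATE_TRUNC(DATE(started_at, "Europe/Kiev"), MONTH) AS alert_month,
  level,
  COUNT(*) AS alert_count
FROM `econ250-2026.course_data.air_raid_alerts`
GROUP BY alert_month, level
ORDER BY alert_month, level;
```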
Save your final query as view w5_alert_volume.
Investigation 2: How Many Lectures Were Cancelled?
KSE policy: if an air raid alert is active in Kyiv at 07:30 local time, the first lecture is cancelled. An alert is “active at 07:30” if it started at or before 07:30 AND finished after 07:30.
Build: Start with a CTE that converts Kyiv City oblast-level alerts to local time. Then count how many alerts were active at 07:30 on their start date:
- Convert `started_at` and `finished_at` to local time using `DATETIME(..., "Europe/Kiev")`
- Filter to `oblast = 'Kyiv City'` and `level = 'oblast'` and 2025 onward
- Check: `TIME(started_local) < TIME "07:30:00" AND TIME(finished_local) > TIME "07:30:00"`
Spot: How many alerts satisfy this condition? Use COUNT(DISTINCT DATE(started_local)) to get the number of affected mornings. How many of those were weekdays?
Pinpoint: Aggregate by month. Which months had the most disrupted mornings?
Think about edge cases: What about an alert that starts at 22:00 and ends at 08:00 the next day? Your TIME check would miss it because TIME "22:00" < TIME "07:30" is false. Add an OR clause to handle cross-midnight alerts: if DATE(started_local) < DATE(finished_local) (the alert spans midnight), you only need TIME(finished_local) > TIME "07:30:00".
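Putting the steps and the edge case together, a sketch of the query (the compact WHERE clause below is equivalent to the two-case OR described above: same-day alerts need a pre-07:30 start, cross-midnight alerts only need a post-07:30 finish):

```sql
-- Mornings with an alert active at 07:30 Kyiv time, 2025 onward.
WITH kyiv_local AS (
  SELECT
    DATETIME(started_at, "Europe/Kiev") AS started_local,
    DATETIME(finished_at, "Europe/Kiev") AS finished_local
  FROM `econ250-2026.course_data.air_raid_alerts`
  WHERE oblast = 'Kyiv City'
    AND level = 'oblast'
    AND DATE(started_at, "Europe/Kiev") >= '2025-01-01'
)
SELECT COUNT(DISTINCT DATE(started_local)) AS cancelled_mornings
FROM kyiv_local
WHERE TIME(finished_local) > TIME '07:30:00'
  AND (TIME(started_local) < TIME '07:30:00'        -- same-day case
       OR DATE(started_local) < DATE(finished_local));  -- cross-midnight case
```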
Save your final query as view w5_cancelled_lectures.
Investigation 3: How Much of Ukraine Was Under Alert?
For each day in 2025, determine the maximum number of oblasts simultaneously under alert. This tells you the scale of each day’s threat — was it a localized alert or a nationwide one?
Build: Use a timestamp spine with hourly granularity to sample the alert state across the country:
- Generate an hourly spine: `UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) AS ts`
- `LEFT JOIN` the alerts table on the overlap condition: `started_at <= ts AND finished_at > ts`
- For each hour, count `COUNT(DISTINCT oblast)` — this naturally deduplicates when multiple raion-level alerts exist for the same oblast
- Aggregate to daily level: take the `MAX(...)` per day to get peak simultaneous coverage
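The steps above can be sketched as one query, assuming only the alerts table described earlier:

```sql
-- Peak number of oblasts simultaneously under alert, per day in 2025.
WITH spine AS (
  SELECT ts
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(),
                                       INTERVAL 1 HOUR)) AS ts
),
hourly AS (
  SELECT
    s.ts,
    COUNT(DISTINCT a.oblast) AS oblasts_under_alert  -- deduplicates raion rows
  FROM spine AS s
  LEFT JOIN `econ250-2026.course_data.air_raid_alerts` AS a
    ON a.started_at <= s.ts AND a.finished_at > s.ts
  GROUP BY s.ts
)
SELECT
  DATE(ts, "Europe/Kiev") AS day,
  MAX(oblasts_under_alert) AS peak_oblasts
FROM hourly
GROUP BY day
ORDER BY day;
```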
Spot: Which days had the highest peak? Are there days where 20+ oblasts were under alert simultaneously? What does the daily average look like — how does it trend over the year?
Pinpoint: Add a 7-day moving average using AVG(...) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). Are there periods of sustained high coverage? Do they correlate with known events?
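The moving average is a one-line window on top of the daily peaks. A sketch, assuming `daily_peaks` is a CTE or table holding the `day` / `peak_oblasts` output of the build step:

```sql
-- 7-day trailing average of peak coverage; daily_peaks is assumed to be
-- the (day, peak_oblasts) result computed in the build step.
SELECT
  day,
  peak_oblasts,
  AVG(peak_oblasts) OVER (
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS peak_7d_avg
FROM daily_peaks
ORDER BY day;
```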
Extension: Change INTERVAL 1 HOUR to INTERVAL 1 MINUTE for finer granularity. Does the peak change much? Why or why not?
Save your final query as view w5_alert_coverage.
Going Further
- Hourly heatmap: Build an hour × day-of-week count matrix for Kyiv City alerts. When do alerts cluster?
- Are alerts getting longer? Compute monthly average and median duration for Kyiv City. Is there a trend? Compare front-line vs. rear oblasts.
- Longest gap: What was the longest streak of consecutive days with no Kyiv City alerts? Use LAG on a date spine.
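For the longest-gap extension, a simpler variant lags the distinct alert dates directly rather than building a full date spine; the gap between consecutive alert days, minus one, is the number of fully quiet days:

```sql
-- Longest streak of consecutive alert-free days for Kyiv City (a sketch).
WITH alert_days AS (
  SELECT DISTINCT DATE(started_at, "Europe/Kiev") AS d
  FROM `econ250-2026.course_data.air_raid_alerts`
  WHERE oblast = 'Kyiv City' AND level = 'oblast'
)
SELECT
  LAG(d) OVER (ORDER BY d) AS last_alert_before_gap,
  d AS first_alert_after_gap,
  DATE_DIFF(d, LAG(d) OVER (ORDER BY d), DAY) - 1 AS quiet_days
FROM alert_days
QUALIFY quiet_days > 0
ORDER BY quiet_days DESC
LIMIT 1;
```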
Practice 2: Cohort Retention with StreamUA Subscriptions
Apply the three-step cohort retention pattern from lecture: assign cohorts, compute period offsets, aggregate.
The Data
StreamUA is a fictional online subscription service (Basic, Standard, Premium plans). The dataset is artificially generated but models realistic subscription patterns, including seasonal variation and several deliberate anomalies for you to find. Two tables:
econ250-2026.course_data.w5_customers
| Column | Type | Notes |
|---|---|---|
| customer_id | INTEGER | Unique customer ID |
| signup_date | STRING | YYYY-MM-DD — use DATE(signup_date) to cast |
| acquisition_channel | STRING | organic, paid_search, social, referral, email |
| country | STRING | UA, PL, DE, UK, US |
| signup_device | STRING | mobile, desktop, tablet |
econ250-2026.course_data.w5_payments
One row per monthly billing event. A customer who subscribes for 6 months has 6 payment rows. When a customer cancels, their final payment row has a cancellation_reason; all other rows have an empty string.
| Column | Type | Notes |
|---|---|---|
| payment_id | INTEGER | Unique payment ID |
| customer_id | INTEGER | FK to w5_customers |
| payment_date | STRING | YYYY-MM-DD, 1st of the billing month |
| plan | STRING | basic, standard, premium |
| amount | FLOAT | Monthly charge |
| cancellation_reason | STRING | Reason for leaving, or '' if not a final payment |
Build the Retention Query
You need three CTEs, then a final aggregation with a window function.
CTE 1 — customer_cohorts: For each customer, compute their cohort as the signup month using DATE_TRUNC. Dates are stored as strings, so cast first.
CTE 2 — monthly_active: Get distinct active months per customer from the payments table. Each payment row already represents one active month:
```sql
SELECT DISTINCT
  customer_id,
  DATE_TRUNC(DATE(payment_date), MONTH) AS active_month
FROM `econ250-2026.course_data.w5_payments`
```

CTE 3 — cohort_activity: Join customer_cohorts to monthly_active on customer_id. Compute months_since_signup using DATE_DIFF between the active month and the cohort month.
Aggregation: Group by cohort_month and months_since_signup. Count distinct customers (not rows — a resubscriber can appear twice in the same month). Use FIRST_VALUE partitioned by cohort, ordered by offset, to get the month-0 count as cohort_size. Compute retention_pct from that.
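Assembling the three CTEs and the aggregation gives a query along these lines (a sketch following the description above, assuming only the two tables documented earlier):

```sql
-- Three-step cohort retention: assign cohorts, compute offsets, aggregate.
WITH customer_cohorts AS (
  SELECT
    customer_id,
    DATE_TRUNC(DATE(signup_date), MONTH) AS cohort_month
  FROM `econ250-2026.course_data.w5_customers`
),
monthly_active AS (
  SELECT DISTINCT
    customer_id,
    DATE_TRUNC(DATE(payment_date), MONTH) AS active_month
  FROM `econ250-2026.course_data.w5_payments`
),
cohort_activity AS (
  SELECT
    c.cohort_month,
    c.customer_id,
    DATE_DIFF(a.active_month, c.cohort_month, MONTH) AS months_since_signup
  FROM customer_cohorts AS c
  JOIN monthly_active AS a USING (customer_id)
)
SELECT
  cohort_month,
  months_since_signup,
  -- FIRST_VALUE pulls the month-0 count without a self-join
  FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
    PARTITION BY cohort_month ORDER BY months_since_signup
  ) AS cohort_size,
  COUNT(DISTINCT customer_id) AS retained_users,
  ROUND(100 * COUNT(DISTINCT customer_id)
        / FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
            PARTITION BY cohort_month ORDER BY months_since_signup
          ), 1) AS retention_pct
FROM cohort_activity
GROUP BY cohort_month, months_since_signup
ORDER BY cohort_month, months_since_signup;
```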
Gotchas:
- Use `COUNT(DISTINCT customer_id)`, not `COUNT(*)` — otherwise retention can exceed 100%
- `DATE_DIFF(later_date, earlier_date, MONTH)` — argument order matters
Verify: retention_pct at month 0 should be 100.0 for every cohort. Month 2 should be around 92-93%.
Save as view: w5_retention
Wrap your query in `CREATE OR REPLACE VIEW econ250-2026.student__yourname.w5_retention`, filter to `months_since_signup` between 0 and 6, and exclude cohorts after June 2025 (too recent for 6 months of data). Replace `student__yourname` with your actual dataset name.
The view should have columns: cohort_month, months_since_signup, cohort_size, retained_users, retention_pct. At least 12 cohort months, ordered by cohort and offset.
Investigate: What Went Wrong at StreamUA?
Something changed in StreamUA’s business during 2024–2025. Three things went wrong at different times. Your job: find each one using the same pattern — build a metric → spot the anomaly → pinpoint the cause.
Start by computing monthly signups and monthly cancellations as your baselines. Then slice by different dimensions to isolate what changed.
Anomaly A: Where did the new customers go?
Build: Monthly signup counts, broken down by acquisition_channel.
Spot: Total signups dip sharply in certain months. Which months?
Pinpoint: Which channel explains the dip? Does it recover? Compute each channel’s share of signups per month to see the shift.
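A sketch of the build step, with each channel’s monthly share computed inline via a window function (assuming only the w5_customers table described above):

```sql
-- Monthly signups by channel, plus each channel's share of the month.
SELECT
  DATE_TRUNC(DATE(signup_date), MONTH) AS signup_month,
  acquisition_channel,
  COUNT(*) AS signups,
  ROUND(100 * COUNT(*)
        / SUM(COUNT(*)) OVER (
            PARTITION BY DATE_TRUNC(DATE(signup_date), MONTH)
          ), 1) AS channel_share_pct
FROM `econ250-2026.course_data.w5_customers`
GROUP BY signup_month, acquisition_channel
ORDER BY signup_month, acquisition_channel;
```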
Save your final query as view w5_anomaly_a.
Anomaly B: Why did so many people cancel?
Build: Monthly cancellation counts (rows where cancellation_reason != ''). Try breaking down by signup_device (requires joining w5_customers).
Spot: One month has a dramatic spike in cancellations. When?
Pinpoint: Which device shows the spike? Look at cancellation_reason for that device in that month — what reason dominates? What does that suggest about the likely cause?
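The build step with the device breakdown might look like this (a sketch; the join is needed because signup_device lives on w5_customers):

```sql
-- Monthly cancellations by signup device.
SELECT
  DATE_TRUNC(DATE(p.payment_date), MONTH) AS cancel_month,
  c.signup_device,
  COUNT(*) AS cancellations
FROM `econ250-2026.course_data.w5_payments` AS p
JOIN `econ250-2026.course_data.w5_customers` AS c USING (customer_id)
WHERE p.cancellation_reason != ''   -- final payments only
GROUP BY cancel_month, signup_device
ORDER BY cancel_month, signup_device;
```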
Save your final query as view w5_anomaly_b.
Further extensions
- Pivot the retention table into a wide matrix using conditional aggregation (`MAX(IF(...))`)
- Compare retention by plan or channel: add a dimension to your cohort CTE
- Churn rate: compute cancellations as a percentage of active subscribers per month — raw counts grow with the subscriber base, churn rate controls for that
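The pivot extension can be sketched like this, reading from the w5_retention view you saved (extend the pattern through month 6; `student__yourname` is your dataset placeholder):

```sql
-- Wide retention matrix: one row per cohort, one column per month offset.
SELECT
  cohort_month,
  MAX(IF(months_since_signup = 0, retention_pct, NULL)) AS m0,
  MAX(IF(months_since_signup = 1, retention_pct, NULL)) AS m1,
  MAX(IF(months_since_signup = 2, retention_pct, NULL)) AS m2,
  MAX(IF(months_since_signup = 3, retention_pct, NULL)) AS m3
FROM `econ250-2026.student__yourname.w5_retention`
GROUP BY cohort_month
ORDER BY cohort_month;
```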
Troubleshooting
| Problem | Solution |
|---|---|
| Air raid timezone issues | Use DATETIME(started_at, "Europe/Kiev"), not manual offset — handles DST |
| Checking TIME on UTC timestamps | Convert to local time first, then extract TIME |
| Missing cross-midnight alerts | Add OR clause: DATE(started) < DATE(finished) AND TIME(finished) > 07:30 |
| Inflated oblast counts in Inv. 3 | Use COUNT(DISTINCT oblast), not COUNT(*) — deduplicates raion-level rows |
| “not a DATE” error | Dates are strings — use DATE(signup_date) or DATE(payment_date) |
| Active subs in cancellation counts | Filter WHERE cancellation_reason != '' — non-final payments have empty string |
| months_since_signup negative | Check DATE_DIFF order: (active_month, cohort_month, MONTH) — later date first |
| retention_pct > 100% | Use COUNT(DISTINCT customer_id), not COUNT(*) |
Key Takeaways
- Timezone conversion via `DATETIME(ts, "Europe/Kiev")` handles DST automatically — always convert before extracting TIME
- Edge case reasoning: cross-midnight alerts require thinking about what TIME extraction loses (the date component)
- Timestamp spines (GENERATE_TIMESTAMP_ARRAY + LEFT JOIN) let you check what’s happening at regular intervals — useful for overlap/coverage analysis
- Cohort analysis follows a three-step pattern: assign cohorts, compute offsets, aggregate — works for any data with timestamps
- FIRST_VALUE retrieves cohort size without a self-join
- Build metric → spot anomaly → find cause — use this pattern for any analytical question