Week 5 Practice: Air Raid Alerts & Cohort Retention

ECON250: Big Data Analytics

Published: March 6, 2026

Modified: March 21, 2026

Your Submission

Deadline: 3 days after the practice sessions
Points: 2 points for satisfactory completion

As you work through the exercises, save your key queries as views in your personal dataset. Each section tells you what to save and what to name it.

How to Submit

  1. In BigQuery Explorer, find your w5_ views in your student__yourname dataset
  2. For each view: click the three dots menu (⋮) → “Copy link”
  3. Paste all view links in Moodle

Practice 1: Air Raid Alerts — How Many Lectures Were Cancelled?

KSE policy: if an air raid alert is active in Kyiv at 07:30 local time, the first lecture (08:30) is cancelled because students cannot safely commute to campus.

Your goal: count how many mornings triggered this policy since the start of 2025. You’ll do this through three investigations.

The Data

econ250-2026.course_data.air_raid_alerts

~124K alerts covering 2022–present.

Column Type Notes
oblast STRING Region name (e.g., “Kyiv City”, “Kharkivska oblast”)
raion STRING District name (NULL for oblast-level alerts)
hromada STRING Community name (NULL for oblast/raion-level)
level STRING oblast, raion, or hromada
started_at TIMESTAMP UTC start of alert
finished_at TIMESTAMP UTC end of alert

Syntax you’ll need

Pattern Syntax
Timezone conversion DATETIME(started_at, "Europe/Kiev")
Extract just the time part TIME(datetime_value)
Alert duration in minutes TIMESTAMP_DIFF(finished_at, started_at, MINUTE)
Generate a timestamp sequence UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) AS ts
Count distinct values in an interval COUNT(DISTINCT x) ... ON start <= ts AND end > ts

Investigation 1: The 2025 Volume Spike

Build: Count monthly alerts across all years. Include level as a dimension.

Spot: 2025 appears to have 2-3x more alerts than any previous year. Is this real?

Pinpoint: Break the counts by level. What changed in the mix? You should find that the system shifted from primarily oblast-level alerts to raion/hromada-level alerts in mid-2025. The total number of incidents didn’t necessarily double — the recording became more granular.

This is a data collection anomaly: the metric spiked because the measurement changed, not because reality changed. For consistent time-series analysis, filter to level = 'oblast'.
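A minimal sketch of the monthly breakdown (using only the table and columns described above; your grouping choices may differ):

```sql
-- Monthly alert counts by level. If the doc's hypothesis holds,
-- the 2025 "spike" should concentrate in the raion/hromada rows.
SELECT
  DATE_TRUNC(DATE(started_at), MONTH) AS month,
  level,
  COUNT(*) AS alerts
FROM `econ250-2026.course_data.air_raid_alerts`
GROUP BY month, level
ORDER BY month, level;
```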

Save your final query as view w5_alert_volume.

Investigation 2: How Many Lectures Were Cancelled?

KSE policy: if an air raid alert is active in Kyiv at 07:30 local time, the first lecture is cancelled. An alert is “active at 07:30” if it started at or before 07:30 AND finished after 07:30.

Build: Start with a CTE that converts Kyiv City oblast-level alerts to local time. Then count how many alerts were active at 07:30 on their start date:

  • Convert started_at and finished_at to local time using DATETIME(..., "Europe/Kiev")
  • Filter to oblast = 'Kyiv City' and level = 'oblast' and 2025 onward
  • Check: TIME(started_local) <= TIME "07:30:00" AND TIME(finished_local) > TIME "07:30:00" (the policy says "at or before 07:30", so use <=, not <)

Spot: How many alerts satisfy this condition? Use COUNT(DISTINCT DATE(started_local)) to get the number of affected mornings. How many of those were weekdays?

Pinpoint: Aggregate by month. Which months had the most disrupted mornings?

Think about edge cases: What about an alert that starts at 22:00 and ends at 08:00 the next day? Your TIME check would miss it because TIME "22:00" < TIME "07:30" is false. Add an OR clause to handle cross-midnight alerts: if DATE(started_local) < DATE(finished_local) (the alert spans midnight), you only need TIME(finished_local) > TIME "07:30:00".
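One possible shape for the full query, following the OR-clause heuristic above (a sketch, not the only solution):

```sql
-- Mornings with an alert active at 07:30 Kyiv time,
-- including cross-midnight alerts.
WITH kyiv_local AS (
  SELECT
    DATETIME(started_at, "Europe/Kiev")  AS started_local,
    DATETIME(finished_at, "Europe/Kiev") AS finished_local
  FROM `econ250-2026.course_data.air_raid_alerts`
  WHERE oblast = 'Kyiv City' AND level = 'oblast'
)
SELECT COUNT(DISTINCT
  -- for a cross-midnight alert the disrupted morning is the finish date
  IF(DATE(started_local) < DATE(finished_local),
     DATE(finished_local),
     DATE(started_local))) AS cancelled_mornings
FROM kyiv_local
WHERE DATE(started_local) >= '2025-01-01'
  AND (
    (TIME(started_local) <= TIME "07:30:00"
     AND TIME(finished_local) > TIME "07:30:00")
    OR (DATE(started_local) < DATE(finished_local)
        AND TIME(finished_local) > TIME "07:30:00")
  );
```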

Save your final query as view w5_cancelled_lectures.

Investigation 3: How Much of Ukraine Was Under Alert?

For each day in 2025, determine the maximum number of oblasts simultaneously under alert. This tells you the scale of each day’s threat — was it a localized alert or a nationwide one?

Build: Use a timestamp spine with hourly granularity to sample the alert state across the country:

  • Generate an hourly spine: UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) AS ts
  • LEFT JOIN the alerts table on the overlap condition: started_at <= ts AND finished_at > ts
  • For each hour, count COUNT(DISTINCT oblast) — this naturally deduplicates when multiple raion-level alerts exist for the same oblast
  • Aggregate to daily level: take the MAX(...) per day to get peak simultaneous coverage
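The four steps above can be sketched as one query (assumes only the spine pattern and columns already introduced):

```sql
-- Hourly spine joined to alerts, then daily peak simultaneous coverage.
WITH spine AS (
  SELECT ts
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(),
                                       INTERVAL 1 HOUR)) AS ts
),
hourly AS (
  SELECT
    s.ts,
    COUNT(DISTINCT a.oblast) AS oblasts_under_alert  -- dedupes raion rows
  FROM spine s
  LEFT JOIN `econ250-2026.course_data.air_raid_alerts` a
    ON a.started_at <= s.ts AND a.finished_at > s.ts
  GROUP BY s.ts
)
SELECT
  DATE(ts) AS day,
  MAX(oblasts_under_alert) AS peak_oblasts
FROM hourly
GROUP BY day
ORDER BY day;
```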

Spot: Which days had the highest peak? Are there days where 20+ oblasts were under alert simultaneously? What does the daily average look like — how does it trend over the year?

Pinpoint: Add a 7-day moving average using AVG(...) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). Are there periods of sustained high coverage? Do they correlate with known events?
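The window clause applied to the daily result might look like this (daily_peaks is a hypothetical name for your daily aggregate from the previous step):

```sql
-- 7-day trailing moving average over the daily peaks.
SELECT
  day,
  peak_oblasts,
  AVG(peak_oblasts) OVER (
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS peak_7d_avg
FROM daily_peaks  -- assumed: your daily MAX(...) result as a CTE or view
ORDER BY day;
```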

Extension: Change INTERVAL 1 HOUR to INTERVAL 1 MINUTE for finer granularity. Does the peak change much? Why or why not?

Save your final query as view w5_alert_coverage.

Going Further

  • Hourly heatmap: Build an hour × day-of-week count matrix for Kyiv City alerts. When do alerts cluster?
  • Are alerts getting longer? Compute monthly average and median duration for Kyiv City. Is there a trend? Compare front-line vs. rear oblasts.
  • Longest gap: What was the longest streak of consecutive days with no Kyiv City alerts? Use LAG on a date spine.
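For the longest-gap extension, one approach is to skip the date spine entirely and measure gaps between consecutive alert days directly with LAG (a sketch under that simplification):

```sql
-- Longest streak of alert-free days for Kyiv City, as the largest
-- gap between consecutive days that had at least one alert.
WITH alert_days AS (
  SELECT DISTINCT DATE(DATETIME(started_at, "Europe/Kiev")) AS day
  FROM `econ250-2026.course_data.air_raid_alerts`
  WHERE oblast = 'Kyiv City' AND level = 'oblast'
)
SELECT
  LAG(day) OVER (ORDER BY day)                       AS gap_start,
  day                                                AS gap_end,
  DATE_DIFF(day, LAG(day) OVER (ORDER BY day), DAY) - 1 AS quiet_days
FROM alert_days
ORDER BY quiet_days DESC
LIMIT 1;
```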

Practice 2: Cohort Retention with StreamUA Subscriptions

Apply the three-step cohort retention pattern from lecture: assign cohorts, compute period offsets, aggregate.

The Data

StreamUA is a fictional online subscription service (Basic, Standard, Premium plans). The dataset is artificially generated but models realistic subscription patterns, including seasonal variation and several deliberate anomalies for you to find. Two tables:

econ250-2026.course_data.w5_customers

Column Type Notes
customer_id INTEGER Unique customer ID
signup_date STRING YYYY-MM-DD — use DATE(signup_date) to cast
acquisition_channel STRING organic, paid_search, social, referral, email
country STRING UA, PL, DE, UK, US
signup_device STRING mobile, desktop, tablet

econ250-2026.course_data.w5_payments

One row per monthly billing event. A customer who subscribes for 6 months has 6 payment rows. When a customer cancels, their final payment row has a cancellation_reason; all other rows have an empty string.

Column Type Notes
payment_id INTEGER Unique payment ID
customer_id INTEGER FK to w5_customers
payment_date STRING YYYY-MM-DD, 1st of the billing month
plan STRING basic, standard, premium
amount FLOAT Monthly charge
cancellation_reason STRING Reason for leaving, or '' if not a final payment

Build the Retention Query

You need three CTEs, then a final aggregation with a window function.

CTE 1 — customer_cohorts: For each customer, compute their cohort as the signup month using DATE_TRUNC. Dates are stored as strings, so cast first.
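A sketch of this CTE (a fragment that goes inside your WITH clause):

```sql
-- CTE 1: cohort = signup month; signup_date is a STRING, so cast first.
customer_cohorts AS (
  SELECT
    customer_id,
    DATE_TRUNC(DATE(signup_date), MONTH) AS cohort_month
  FROM `econ250-2026.course_data.w5_customers`
)
```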

CTE 2 — monthly_active: Get distinct active months per customer from the payments table. Each payment row already represents one active month:

SELECT
  customer_id,
  DATE_TRUNC(DATE(payment_date), MONTH) AS active_month
FROM `econ250-2026.course_data.w5_payments`

CTE 3 — cohort_activity: Join customer_cohorts to monthly_active on customer_id. Compute months_since_signup using DATE_DIFF between the active month and the cohort month.

Aggregation: Group by cohort_month and months_since_signup. Count distinct customers (not rows — a resubscriber can appear twice in the same month). Use FIRST_VALUE partitioned by cohort, ordered by offset, to get the month-0 count as cohort_size. Compute retention_pct from that.
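The aggregation step might look like this (assumes cohort_activity from CTE 3 exposes cohort_month, months_since_signup, and customer_id; exact names are up to you):

```sql
-- Final aggregation: window functions run after GROUP BY, so
-- FIRST_VALUE over the per-group counts retrieves the month-0 size.
SELECT
  cohort_month,
  months_since_signup,
  COUNT(DISTINCT customer_id) AS retained_users,
  FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
    PARTITION BY cohort_month
    ORDER BY months_since_signup
  ) AS cohort_size,
  ROUND(100 * COUNT(DISTINCT customer_id)
        / FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
            PARTITION BY cohort_month
            ORDER BY months_since_signup), 1) AS retention_pct
FROM cohort_activity
GROUP BY cohort_month, months_since_signup
ORDER BY cohort_month, months_since_signup;
```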

Gotchas:

  • Use COUNT(DISTINCT customer_id), not COUNT(*) — otherwise retention can exceed 100%
  • DATE_DIFF(later_date, earlier_date, MONTH) — argument order matters

Verify: retention_pct at month 0 should be 100.0 for every cohort. Month 2 should be around 92-93%.

Save as view: w5_retention

Wrap your query in CREATE OR REPLACE VIEW `econ250-2026.student__yourname.w5_retention`, filter to months_since_signup between 0 and 6, and exclude cohorts after June 2025 (too recent for 6 months of data). Replace student__yourname with your actual dataset name.

The view should have columns: cohort_month, months_since_signup, cohort_size, retained_users, retention_pct. At least 12 cohort months, ordered by cohort and offset.

Investigate: What Went Wrong at StreamUA?

Something changed in StreamUA’s business during 2024–2025. Three things went wrong at different times. Your job: find each one using the same pattern — build a metric → spot the anomaly → pinpoint the cause.

Start by computing monthly signups and monthly cancellations as your baselines. Then slice by different dimensions to isolate what changed.

Anomaly A: Where did the new customers go?

Build: Monthly signup counts, broken down by acquisition_channel.

Spot: Total signups dip sharply in certain months. Which months?

Pinpoint: Which channel explains the dip? Does it recover? Compute each channel’s share of signups per month to see the shift.
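The share-of-signups computation can be sketched with a window over the grouped counts (one of several reasonable shapes):

```sql
-- Each channel's share of that month's total signups.
SELECT
  DATE_TRUNC(DATE(signup_date), MONTH) AS month,
  acquisition_channel,
  COUNT(*) AS signups,
  ROUND(100 * COUNT(*)
        / SUM(COUNT(*)) OVER (
            PARTITION BY DATE_TRUNC(DATE(signup_date), MONTH)), 1)
    AS share_pct
FROM `econ250-2026.course_data.w5_customers`
GROUP BY month, acquisition_channel
ORDER BY month, share_pct DESC;
```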

Save your final query as view w5_anomaly_a.

Anomaly B: Why did so many people cancel?

Build: Monthly cancellation counts (rows where cancellation_reason != ''). Try breaking down by signup_device (requires joining w5_customers).

Spot: One month has a dramatic spike in cancellations. When?

Pinpoint: Which device shows the spike? Look at cancellation_reason for that device in that month — what reason dominates? What does that suggest about the likely cause?

Save your final query as view w5_anomaly_b.

Anomaly C: What happened to the Premium plan?

Build: Monthly cancellation counts, broken down by plan.

Spot: One plan has an unexpected surge in cancellations starting at a specific point. Which plan? When?

Pinpoint: Check cancellation_reason for that plan — do you see a new reason appear? Check amount before and after the event. Can you find evidence of what the company did?

Save your final query as view w5_anomaly_c.

Further extensions

  • Pivot the retention table into a wide matrix using conditional aggregation (MAX(IF(...)))
  • Compare retention by plan or channel: add a dimension to your cohort CTE
  • Churn rate: compute cancellations as a percentage of active subscribers per month — raw counts grow with the subscriber base, churn rate controls for that
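For the pivot extension, the conditional-aggregation pattern looks like this (assumes the column names required of your w5_retention view):

```sql
-- Wide retention matrix: one row per cohort, one column per offset.
SELECT
  cohort_month,
  MAX(IF(months_since_signup = 0, retention_pct, NULL)) AS m0,
  MAX(IF(months_since_signup = 1, retention_pct, NULL)) AS m1,
  MAX(IF(months_since_signup = 2, retention_pct, NULL)) AS m2,
  MAX(IF(months_since_signup = 3, retention_pct, NULL)) AS m3,
  MAX(IF(months_since_signup = 4, retention_pct, NULL)) AS m4,
  MAX(IF(months_since_signup = 5, retention_pct, NULL)) AS m5,
  MAX(IF(months_since_signup = 6, retention_pct, NULL)) AS m6
FROM `econ250-2026.student__yourname.w5_retention`
GROUP BY cohort_month
ORDER BY cohort_month;
```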

Troubleshooting

Problem Solution
Air raid timezone issues Use DATETIME(started_at, "Europe/Kiev"), not manual offset — handles DST
Checking TIME on UTC timestamps Convert to local time first, then extract TIME
Missing cross-midnight alerts Add OR clause: DATE(started) < DATE(finished) AND TIME(finished) > 07:30
Inflated oblast counts in Inv. 3 Use COUNT(DISTINCT oblast), not COUNT(*) — deduplicates raion-level rows
“not a DATE” error Dates are strings — use DATE(signup_date) or DATE(payment_date)
Active subs in cancellation counts Filter WHERE cancellation_reason != '' — non-final payments have empty string
months_since_signup negative Check DATE_DIFF order: (active_month, cohort_month, MONTH) — later date first
retention_pct > 100% Use COUNT(DISTINCT customer_id), not COUNT(*)

Key Takeaways

  • Timezone conversion via DATETIME(ts, "Europe/Kiev") handles DST automatically — always convert before extracting TIME
  • Edge case reasoning: cross-midnight alerts require thinking about what TIME extraction loses (the date component)
  • Timestamp spines (GENERATE_TIMESTAMP_ARRAY + LEFT JOIN) let you check what’s happening at regular intervals — useful for overlap/coverage analysis
  • Cohort analysis follows a three-step pattern: assign cohorts, compute offsets, aggregate — works for any data with timestamps
  • FIRST_VALUE retrieves cohort size without a self-join
  • Build metric → spot anomaly → find cause — use this pattern for any analytical question