Week 5 Practice: Air Raid Alerts & Cohort Retention
ECON250: Big Data Analytics
Your Submission
Deadline: 3 days after the practice session. Points: 2 points for satisfactory completion.
As you work through the exercises, save your key queries as views in your personal dataset. Each section tells you what to save and what to name it.
How to Submit
- In BigQuery Explorer, find your `w5_` views in your `student__yourname` dataset
- For each view: click the three-dot menu (⋮) → “Copy link”
- Paste all view links in Moodle
Practice 1: Air Raid Alerts — How Many Lectures Were Cancelled?
KSE policy: if an air raid alert is active in Kyiv at 07:30 local time, the first lecture (08:30) is cancelled because students cannot safely commute to campus.
Your goal: count how many mornings triggered this policy since the start of 2025. You’ll do this through three investigations.
The Data
econ250-2026.course_data.air_raid_alerts
~124K alerts covering 2022–present.
| Column | Type | Notes |
|---|---|---|
| oblast | STRING | Region name (e.g., “Kyiv City”, “Kharkivska oblast”) |
| raion | STRING | District name (NULL for oblast-level alerts) |
| hromada | STRING | Community name (NULL for oblast/raion-level) |
| level | STRING | oblast, raion, or hromada |
| started_at | TIMESTAMP | UTC start of alert |
| finished_at | TIMESTAMP | UTC end of alert |
Syntax you’ll need
| Pattern | Syntax |
|---|---|
| Timezone conversion | DATETIME(started_at, "Europe/Kiev") |
| Extract just the time part | TIME(datetime_value) |
| Alert duration in minutes | TIMESTAMP_DIFF(finished_at, started_at, MINUTE) |
| Generate a timestamp sequence | UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) AS ts |
| Count distinct values in an interval | COUNT(DISTINCT x) ... ON start <= ts AND end > ts |
Investigation 1: The 2025 Volume Spike
Build: Count monthly alerts across all years. Include level as a dimension.
Spot: 2025 appears to have 2-3x more alerts than any previous year. Is this real?
Pinpoint: Break the counts by level. What changed in the mix? You should find that the system shifted from primarily oblast-level alerts to raion/hromada-level alerts in mid-2025. The total number of incidents didn’t necessarily double — the recording became more granular.
This is a data collection anomaly: the metric spiked because the measurement changed, not because reality changed. For consistent time-series analysis, filter to level = 'oblast'.
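One way to sketch the monthly count with `level` as a dimension, assuming only the table and columns described above:

```sql
-- Monthly alert counts by level, in Kyiv local time.
-- Comparing levels across years reveals the mid-2025 granularity shift.
SELECT
  DATE_TRUNC(DATE(started_at, "Europe/Kiev"), MONTH) AS alert_month,
  level,
  COUNT(*) AS alert_count
FROM `econ250-2026.course_data.air_raid_alerts`
GROUP BY alert_month, level
ORDER BY alert_month, level;
```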
Save your final query as view w5_alert_volume.
Investigation 2: How Many Lectures Were Cancelled?
KSE policy: if an air raid alert is active in Kyiv at 07:30 local time, the first lecture is cancelled. An alert is “active at 07:30” if it started at or before 07:30 AND finished after 07:30.
Build: Start with a CTE that converts Kyiv City oblast-level alerts to local time. Then count how many alerts were active at 07:30 on their start date:
- Convert `started_at` and `finished_at` to local time using `DATETIME(..., "Europe/Kiev")`
- Filter to `oblast = 'Kyiv City'` and `level = 'oblast'` and 2025 onward
- Check: `TIME(started_local) < TIME "07:30:00" AND TIME(finished_local) > TIME "07:30:00"`
Spot: How many alerts satisfy this condition? Use COUNT(DISTINCT DATE(started_local)) to get the number of affected mornings. How many of those were weekdays?
Pinpoint: Aggregate by month. Which months had the most disrupted mornings?
Think about edge cases: What about an alert that starts at 22:00 and ends at 08:00 the next day? Your TIME check would miss it because TIME "22:00" < TIME "07:30" is false. Add an OR clause to handle cross-midnight alerts: if DATE(started_local) < DATE(finished_local) (the alert spans midnight), you only need TIME(finished_local) > TIME "07:30:00".
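Putting the steps and the edge case together, a sketch of the query (the compact WHERE clause below is equivalent to the two-case OR described above: same-day alerts need a pre-07:30 start, cross-midnight alerts only need a post-07:30 finish):

```sql
-- Mornings with an alert active at 07:30 Kyiv time, 2025 onward.
WITH kyiv_local AS (
  SELECT
    DATETIME(started_at, "Europe/Kiev") AS started_local,
    DATETIME(finished_at, "Europe/Kiev") AS finished_local
  FROM `econ250-2026.course_data.air_raid_alerts`
  WHERE oblast = 'Kyiv City'
    AND level = 'oblast'
    AND DATE(started_at, "Europe/Kiev") >= '2025-01-01'
)
SELECT COUNT(DISTINCT DATE(started_local)) AS cancelled_mornings
FROM kyiv_local
WHERE TIME(finished_local) > TIME '07:30:00'
  AND (TIME(started_local) < TIME '07:30:00'        -- same-day case
       OR DATE(started_local) < DATE(finished_local));  -- cross-midnight case
```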
Save your final query as view w5_cancelled_lectures.
Investigation 3: How Much of Ukraine Was Under Alert?
For each day in 2025, determine the maximum number of oblasts simultaneously under alert. This tells you the scale of each day’s threat — was it a localized alert or a nationwide one?
Build: Use a timestamp spine with hourly granularity to sample the alert state across the country:
- Generate an hourly spine: `UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)) AS ts`
- `LEFT JOIN` the alerts table on the overlap condition: `started_at <= ts AND finished_at > ts`
- For each hour, count `COUNT(DISTINCT oblast)` — this naturally deduplicates when multiple raion-level alerts exist for the same oblast
- Aggregate to daily level: take the `MAX(...)` per day to get peak simultaneous coverage
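The steps above can be sketched as one query, assuming only the alerts table described earlier:

```sql
-- Peak number of oblasts simultaneously under alert, per day in 2025.
WITH spine AS (
  SELECT ts
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2025-01-01', CURRENT_TIMESTAMP(),
                                       INTERVAL 1 HOUR)) AS ts
),
hourly AS (
  SELECT
    s.ts,
    COUNT(DISTINCT a.oblast) AS oblasts_under_alert  -- deduplicates raion rows
  FROM spine AS s
  LEFT JOIN `econ250-2026.course_data.air_raid_alerts` AS a
    ON a.started_at <= s.ts AND a.finished_at > s.ts
  GROUP BY s.ts
)
SELECT
  DATE(ts, "Europe/Kiev") AS day,
  MAX(oblasts_under_alert) AS peak_oblasts
FROM hourly
GROUP BY day
ORDER BY day;
```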
Spot: Which days had the highest peak? Are there days where 20+ oblasts were under alert simultaneously? What does the daily average look like — how does it trend over the year?
Pinpoint: Add a 7-day moving average using AVG(...) OVER (ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). Are there periods of sustained high coverage? Do they correlate with known events?
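The moving average is a one-line window on top of the daily peaks. A sketch, assuming `daily_peaks` is a CTE or table holding the `day` / `peak_oblasts` output of the build step:

```sql
-- 7-day trailing average of peak coverage; daily_peaks is assumed to be
-- the (day, peak_oblasts) result computed in the build step.
SELECT
  day,
  peak_oblasts,
  AVG(peak_oblasts) OVER (
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS peak_7d_avg
FROM daily_peaks
ORDER BY day;
```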
Extension: Change INTERVAL 1 HOUR to INTERVAL 1 MINUTE for finer granularity. Does the peak change much? Why or why not?
Save your final query as view w5_alert_coverage.
Going Further
- Hourly heatmap: Build an hour × day-of-week count matrix for Kyiv City alerts. When do alerts cluster?
- Are alerts getting longer? Compute monthly average and median duration for Kyiv City. Is there a trend? Compare front-line vs. rear oblasts.
- Longest gap: What was the longest streak of consecutive days with no Kyiv City alerts? Use LAG on a date spine.
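For the longest-gap extension, a simpler variant lags the distinct alert dates directly rather than building a full date spine; the gap between consecutive alert days, minus one, is the number of fully quiet days:

```sql
-- Longest streak of consecutive alert-free days for Kyiv City (a sketch).
WITH alert_days AS (
  SELECT DISTINCT DATE(started_at, "Europe/Kiev") AS d
  FROM `econ250-2026.course_data.air_raid_alerts`
  WHERE oblast = 'Kyiv City' AND level = 'oblast'
)
SELECT
  LAG(d) OVER (ORDER BY d) AS last_alert_before_gap,
  d AS first_alert_after_gap,
  DATE_DIFF(d, LAG(d) OVER (ORDER BY d), DAY) - 1 AS quiet_days
FROM alert_days
QUALIFY quiet_days > 0
ORDER BY quiet_days DESC
LIMIT 1;
```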
Practice 2: Cohort Retention with StreamUA Subscriptions
Apply the three-step cohort retention pattern from lecture: assign cohorts, compute period offsets, aggregate.
The Data
StreamUA is a fictional online subscription service (Basic, Standard, Premium plans). The dataset is artificially generated but models realistic subscription patterns, including seasonal variation and several deliberate anomalies for you to find. Two tables:
econ250-2026.course_data.w5_customers
| Column | Type | Notes |
|---|---|---|
| customer_id | INTEGER | Unique customer ID |
| signup_date | STRING | YYYY-MM-DD — use DATE(signup_date) to cast |
| acquisition_channel | STRING | organic, paid_search, social, referral, email |
| country | STRING | UA, PL, DE, UK, US |
| signup_device | STRING | mobile, desktop, tablet |
econ250-2026.course_data.w5_payments
One row per monthly billing event. A customer who subscribes for 6 months has 6 payment rows. When a customer cancels, their final payment row has a cancellation_reason; all other rows have an empty string.
| Column | Type | Notes |
|---|---|---|
| payment_id | INTEGER | Unique payment ID |
| customer_id | INTEGER | FK to w5_customers |
| payment_date | STRING | YYYY-MM-DD, 1st of the billing month |
| plan | STRING | basic, standard, premium |
| amount | FLOAT | Monthly charge |
| cancellation_reason | STRING | Reason for leaving, or '' if not a final payment |
Build the Retention Query
You need three CTEs, then a final aggregation with a window function.
CTE 1 — customer_cohorts: For each customer, compute their cohort as the signup month using DATE_TRUNC. Dates are stored as strings, so cast first.
CTE 2 — monthly_active: Get distinct active months per customer from the payments table. Each payment row already represents one active month:
```sql
SELECT DISTINCT
  customer_id,
  DATE_TRUNC(DATE(payment_date), MONTH) AS active_month
FROM `econ250-2026.course_data.w5_payments`
```

CTE 3 — cohort_activity: Join customer_cohorts to monthly_active on customer_id. Compute months_since_signup using DATE_DIFF between the active month and the cohort month.
Aggregation: Group by cohort_month and months_since_signup. Count distinct customers (not rows — a resubscriber can appear twice in the same month). Use FIRST_VALUE partitioned by cohort, ordered by offset, to get the month-0 count as cohort_size. Compute retention_pct from that.
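Assembling the three CTEs and the aggregation gives a query along these lines (a sketch following the description above, assuming only the two tables documented earlier):

```sql
-- Three-step cohort retention: assign cohorts, compute offsets, aggregate.
WITH customer_cohorts AS (
  SELECT
    customer_id,
    DATE_TRUNC(DATE(signup_date), MONTH) AS cohort_month
  FROM `econ250-2026.course_data.w5_customers`
),
monthly_active AS (
  SELECT DISTINCT
    customer_id,
    DATE_TRUNC(DATE(payment_date), MONTH) AS active_month
  FROM `econ250-2026.course_data.w5_payments`
),
cohort_activity AS (
  SELECT
    c.cohort_month,
    c.customer_id,
    DATE_DIFF(a.active_month, c.cohort_month, MONTH) AS months_since_signup
  FROM customer_cohorts AS c
  JOIN monthly_active AS a USING (customer_id)
)
SELECT
  cohort_month,
  months_since_signup,
  -- FIRST_VALUE pulls the month-0 count without a self-join
  FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
    PARTITION BY cohort_month ORDER BY months_since_signup
  ) AS cohort_size,
  COUNT(DISTINCT customer_id) AS retained_users,
  ROUND(100 * COUNT(DISTINCT customer_id)
        / FIRST_VALUE(COUNT(DISTINCT customer_id)) OVER (
            PARTITION BY cohort_month ORDER BY months_since_signup
          ), 1) AS retention_pct
FROM cohort_activity
GROUP BY cohort_month, months_since_signup
ORDER BY cohort_month, months_since_signup;
```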
Gotchas:
- Use `COUNT(DISTINCT customer_id)`, not `COUNT(*)` — otherwise retention can exceed 100%
- `DATE_DIFF(later_date, earlier_date, MONTH)` — argument order matters
Verify: retention_pct at month 0 should be 100.0 for every cohort. Month 2 should be around 92-93%.
Save as view: w5_retention
Wrap your query in `CREATE OR REPLACE VIEW econ250-2026.student__yourname.w5_retention`, filter to `months_since_signup` between 0 and 6, and exclude cohorts after June 2025 (too recent for 6 months of data). Replace `student__yourname` with your actual dataset name.
The view should have columns: cohort_month, months_since_signup, cohort_size, retained_users, retention_pct. At least 12 cohort months, ordered by cohort and offset.
Investigate: What Went Wrong at StreamUA?
Something changed in StreamUA’s business during 2024–2025. Three things went wrong at different times. Your job: find each one using the same pattern — build a metric → spot the anomaly → pinpoint the cause.
Start by computing monthly signups and monthly cancellations as your baselines. Then slice by different dimensions to isolate what changed.
Anomaly A: Where did the new customers go?
Build: Monthly signup counts, broken down by acquisition_channel.
Spot: Total signups dip sharply in certain months. Which months?
Pinpoint: Which channel explains the dip? Does it recover? Compute each channel’s share of signups per month to see the shift.
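A sketch of the build step, with each channel’s monthly share computed inline via a window function (assuming only the w5_customers table described above):

```sql
-- Monthly signups by channel, plus each channel's share of the month.
SELECT
  DATE_TRUNC(DATE(signup_date), MONTH) AS signup_month,
  acquisition_channel,
  COUNT(*) AS signups,
  ROUND(100 * COUNT(*)
        / SUM(COUNT(*)) OVER (
            PARTITION BY DATE_TRUNC(DATE(signup_date), MONTH)
          ), 1) AS channel_share_pct
FROM `econ250-2026.course_data.w5_customers`
GROUP BY signup_month, acquisition_channel
ORDER BY signup_month, acquisition_channel;
```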
Save your final query as view w5_anomaly_a.
Anomaly B: Why did so many people cancel?
Build: Monthly cancellation counts (rows where cancellation_reason != ''). Try breaking down by signup_device (requires joining w5_customers).
Spot: One month has a dramatic spike in cancellations. When?
Pinpoint: Which device shows the spike? Look at cancellation_reason for that device in that month — what reason dominates? What does that suggest about the likely cause?
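The build step with the device breakdown might look like this (a sketch; the join is needed because signup_device lives on w5_customers):

```sql
-- Monthly cancellations by signup device.
SELECT
  DATE_TRUNC(DATE(p.payment_date), MONTH) AS cancel_month,
  c.signup_device,
  COUNT(*) AS cancellations
FROM `econ250-2026.course_data.w5_payments` AS p
JOIN `econ250-2026.course_data.w5_customers` AS c USING (customer_id)
WHERE p.cancellation_reason != ''   -- final payments only
GROUP BY cancel_month, signup_device
ORDER BY cancel_month, signup_device;
```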
Save your final query as view w5_anomaly_b.
Further extensions
- Pivot the retention table into a wide matrix using conditional aggregation (`MAX(IF(...))`)
- Compare retention by plan or channel: add a dimension to your cohort CTE
- Churn rate: compute cancellations as a percentage of active subscribers per month — raw counts grow with the subscriber base, churn rate controls for that
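The pivot extension can be sketched like this, reading from the w5_retention view you saved (extend the pattern through month 6; `student__yourname` is your dataset placeholder):

```sql
-- Wide retention matrix: one row per cohort, one column per month offset.
SELECT
  cohort_month,
  MAX(IF(months_since_signup = 0, retention_pct, NULL)) AS m0,
  MAX(IF(months_since_signup = 1, retention_pct, NULL)) AS m1,
  MAX(IF(months_since_signup = 2, retention_pct, NULL)) AS m2,
  MAX(IF(months_since_signup = 3, retention_pct, NULL)) AS m3
FROM `econ250-2026.student__yourname.w5_retention`
GROUP BY cohort_month
ORDER BY cohort_month;
```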
Troubleshooting
| Problem | Solution |
|---|---|
| Air raid timezone issues | Use DATETIME(started_at, "Europe/Kiev"), not manual offset — handles DST |
| Checking TIME on UTC timestamps | Convert to local time first, then extract TIME |
| Missing cross-midnight alerts | Add OR clause: DATE(started) < DATE(finished) AND TIME(finished) > 07:30 |
| Inflated oblast counts in Inv. 3 | Use COUNT(DISTINCT oblast), not COUNT(*) — deduplicates raion-level rows |
| “not a DATE” error | Dates are strings — use DATE(signup_date) or DATE(payment_date) |
| Active subs in cancellation counts | Filter WHERE cancellation_reason != '' — non-final payments have empty string |
| months_since_signup negative | Check DATE_DIFF order: (active_month, cohort_month, MONTH) — later date first |
| retention_pct > 100% | Use COUNT(DISTINCT customer_id), not COUNT(*) |
Key Takeaways
- Timezone conversion via `DATETIME(ts, "Europe/Kiev")` handles DST automatically — always convert before extracting TIME
- Edge case reasoning: cross-midnight alerts require thinking about what TIME extraction loses (the date component)
- Timestamp spines (GENERATE_TIMESTAMP_ARRAY + LEFT JOIN) let you check what’s happening at regular intervals — useful for overlap/coverage analysis
- Cohort analysis follows a three-step pattern: assign cohorts, compute offsets, aggregate — works for any data with timestamps
- FIRST_VALUE retrieves cohort size without a self-join
- Build metric → spot anomaly → find cause — use this pattern for any analytical question