Time Series & Anomaly Detection

ECON250 - Big Data Analytics | Week 5

Oleh Omelchenko

2026-03-05

Air raid alerts dataset

You have start time, end time, and region. What questions can you answer?

What kinds of questions are these?


Category                Examples
Counting & aggregating  Alerts per month, total hours per week
Comparing               Front-line vs western oblasts, weekday vs weekend
Finding extremes        Longest alert, biggest gap, most alerts in one day
Detecting patterns      Trends over time, seasonality, streaks


Most of the counting and comparing questions you can already answer with GROUP BY and window functions.

Today we tackle the last two categories: finding what’s unusual, and building the right expectations.

Date Spines

Daily alert count: what’s missing?

Kyiv City, oblast-level alerts, January 2025:

date alerts
Jan 1 3
Jan 2 2
Jan 3 1
Jan 5 4
Jan 6 2
Jan 8 1


Where is January 4th? January 7th?

Date spine: generate → LEFT JOIN → COALESCE

After the LEFT JOIN (gaps become NULL)

date   alerts
Jan 1  3
Jan 2  2
Jan 3  1
Jan 4  NULL
Jan 5  4
Jan 6  2
Jan 7  NULL
Jan 8  1

After COALESCE (NULLs become 0)

date   alerts
Jan 1  3
Jan 2  2
Jan 3  1
Jan 4  0
Jan 5  4
Jan 6  2
Jan 7  0
Jan 8  1

WITH date_spine AS (
  SELECT date FROM UNNEST(GENERATE_DATE_ARRAY('2025-01-01', CURRENT_DATE())) AS date
)
SELECT ds.date, COALESCE(da.alerts, 0) AS alerts
FROM date_spine ds
LEFT JOIN daily_alerts da ON ds.date = da.alert_date

Analyzing Time Series

Patterns by hour and day of week

hour Mon Tue Wed Thu Fri Sat Sun
00-03 12% 11% 13% 10% 14% 9% 8%
03-06 18% 16% 19% 17% 15% 11% 10%
06-09 5% 6% 4% 7% 5% 4% 3%
09-12 8% 9% 7% 8% 10% 8% 7%
12-15 14% 13% 15% 12% 11% 14% 12%
15-18 10% 11% 9% 12% 10% 11% 13%
18-21 15% 16% 14% 16% 17% 18% 20%
21-00 18% 18% 19% 18% 18% 25% 27%


Percentage = share of alerts starting in that time block. What patterns do you see?
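
A heatmap like this is one GROUP BY plus a percent-of-total window. A minimal sketch, assuming an alerts table with a start_time timestamp column (table and column names are illustrative):

```sql
-- Share of alerts starting in each 3-hour block, per day of week.
-- `alerts` and `start_time` are illustrative names; adjust to your schema.
WITH starts AS (
  SELECT
    EXTRACT(DAYOFWEEK FROM DATE(start_time)) AS dow_num,  -- 1 = Sunday
    FORMAT_DATE('%a', DATE(start_time)) AS dow,
    DIV(EXTRACT(HOUR FROM start_time), 3) * 3 AS block_start_hour
  FROM alerts
),
counts AS (
  SELECT dow_num, dow, block_start_hour, COUNT(*) AS n
  FROM starts
  GROUP BY dow_num, dow, block_start_hour
)
SELECT dow, block_start_hour,
  ROUND(n * 100.0 / SUM(n) OVER (PARTITION BY dow), 1) AS pct_of_day
FROM counts
ORDER BY dow_num, block_start_hour
```

Each weekday’s percentages sum to 100; pivoting the long result into the wide layout above is a presentation step.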

Simultaneous alerts

How many oblasts are under alert at the same time?

timestamp      oblasts under alert
Jan 15, 14:00  18
Jan 15, 14:30  22
Jan 15, 15:00  25
Jan 15, 15:30  19


How would you compute this from start_time and end_time?

Think about it: you don’t have a row per timestamp. You have a row per alert.
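
One standard answer is a sweep line: turn every alert into two events, +1 at its start and -1 at its end, sort them, and take a running sum. A sketch, assuming a table alerts(oblast, start_time, end_time) with illustrative names:

```sql
-- Sweep line: +1 at each alert start, -1 at each end; the running
-- total is the number of oblasts under alert at that moment.
WITH events AS (
  SELECT start_time AS ts, 1 AS delta FROM alerts
  UNION ALL
  SELECT end_time AS ts, -1 AS delta FROM alerts
)
SELECT ts,
  SUM(delta) OVER (ORDER BY ts, delta ROWS UNBOUNDED PRECEDING) AS oblasts_under_alert
FROM events
ORDER BY ts
```

Breaking ties by delta closes an ending alert before opening one that starts at the same instant. To sample at fixed 30-minute marks as in the table, join the running totals to a generated timestamp spine.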

How many lectures did we miss?

KSE policy: if an air raid alert is active at 7:30, the first class of the day is cancelled.


semester day  alert active at 7:30?  first class
Mon, Feb 3    no                     held
Tue, Feb 4    yes                    cancelled
Mon, Feb 10   no                     held
Tue, Feb 11   yes                    cancelled
Mon, Feb 17   yes                    cancelled


How many first classes were cancelled this semester? What if the window were 7:00? Or 8:00?
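
The check is a point-in-interval test per class day. A sketch, assuming alerts(start_time, end_time), an illustrative semester date range, and timestamps in a single timezone:

```sql
-- For each weekday in the semester, was any alert active at 07:30?
-- Table/column names and the date range are illustrative.
WITH class_days AS (
  SELECT d
  FROM UNNEST(GENERATE_DATE_ARRAY('2025-02-03', '2025-05-30')) AS d
  WHERE EXTRACT(DAYOFWEEK FROM d) BETWEEN 2 AND 6  -- Mon..Fri
)
SELECT cd.d AS class_day,
  COUNT(a.start_time) > 0 AS first_class_cancelled
FROM class_days cd
LEFT JOIN alerts a
  ON TIMESTAMP(DATETIME(cd.d, TIME '07:30:00'))
     BETWEEN a.start_time AND a.end_time
GROUP BY cd.d
ORDER BY cd.d
```

Changing TIME '07:30:00' answers the 7:00 and 8:00 what-ifs directly.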

Anomalies

What is an anomaly?




Let’s hear examples from you:


What makes something “anomalous”?

Five types of anomalies

Type                     Example                                            Is the data “wrong”?
Data quality issues      Negative order amount, timestamp from 1970         Yes
Genuine outliers         Front-line oblast with a 600-day continuous alert  No
Pattern breaks           Revenue drops 30% vs last month                    Maybe
Data collection changes  Alert count doubles overnight                      No
Distribution shifts      Mobile traffic jumps from 60% to 85%               Maybe


Only the first type is “wrong.” The rest need investigation before you act.

Can you name other types of anomalies not in this table?

Anomaly detection in practice


In a textbook: compute z-scores, flag outliers.

In practice:

  1. Build expectations for what the metric should look like
  2. Flag when reality diverges
  3. Investigate the flagged items


This is called sanity checking, and it’s what most monitoring systems do.

Sanity Checks

Pattern 1: Threshold monitoring

Compare each day to its own rolling baseline. Flag significant deviations.

date    revenue  30-day avg  ratio  status
Mar 8   $4,200   $4,050      1.04   normal
Mar 9   $3,800   $4,060      0.94   normal
Mar 10  $4,100   $4,055      1.01   normal
Mar 11  $1,900   $4,050      0.47   ALERT: low
Mar 12  $4,300   $4,020      1.07   normal


Thresholds come from domain knowledge, not statistics. You decide what “too different” means.

Pattern 1: SQL

WITH daily_revenue AS (
  SELECT DATE(created_at) AS order_date, SUM(sale_price) AS revenue
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  WHERE created_at >= '2025-01-01'
  GROUP BY order_date
),
with_baseline AS (
  SELECT *, AVG(revenue) OVER (
    ORDER BY order_date ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
  ) AS avg_30d
  FROM daily_revenue
)
SELECT *, ROUND(revenue / avg_30d, 2) AS ratio,
  CASE
    WHEN revenue < avg_30d * 0.5 THEN 'ALERT: low'
    WHEN revenue > avg_30d * 2.0 THEN 'ALERT: high'
    ELSE 'normal'
  END AS status
FROM with_baseline
WHERE avg_30d IS NOT NULL

Pattern 2: Data collection drops

Monthly alert count by oblast. Is every region still reporting?

oblast     month  alerts  prev_month  ratio
Kyiv City  Jan    42      38          1.11
Kyiv City  Feb    45      42          1.07
Lvivska    Jan    28      31          0.90
Lvivska    Feb    3       28          0.11


An 89% drop. Data pipeline issue? Or something real that needs investigation?
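
This check is Pattern 1 applied to row counts instead of revenue: compare each month to the previous one with LAG. A sketch, assuming a monthly_counts table with oblast, month, and alerts columns (illustrative names):

```sql
-- Flag oblasts whose monthly alert count collapsed vs the prior month.
-- monthly_counts(oblast, month, alerts) is an illustrative table.
WITH with_prev AS (
  SELECT oblast, month, alerts,
    LAG(alerts) OVER (PARTITION BY oblast ORDER BY month) AS prev_month
  FROM monthly_counts
)
SELECT oblast, month, alerts, prev_month,
  ROUND(alerts / prev_month, 2) AS ratio,
  CASE WHEN alerts < prev_month * 0.5 THEN 'ALERT: collection drop?'
       ELSE 'normal' END AS status
FROM with_prev
WHERE prev_month IS NOT NULL
```

As with Pattern 1, the 0.5 cutoff is a domain-knowledge choice, not a statistical one.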

Pattern 3: Distribution shifts

Payment method shares on Olist marketplace, monthly:

month credit_card boleto voucher debit
2017-07 73.8% 19.2% 5.4% 1.6%
2017-08 74.1% 18.9% 5.3% 1.7%
2017-09 74.0% 19.1% 5.2% 1.7%
2017-10 68.2% 24.5% 5.5% 1.8%


Individual values are fine. But the mix changed: credit card share dropped 6 points, boleto gained 5.

Why? A UI change? A promotion? Fraud? Seasonal effect?

Pattern 3: SQL

WITH monthly_payments AS (
  SELECT
    DATE_TRUNC(DATE(order_purchase_timestamp), MONTH) AS month,
    payment_type,
    COUNT(*) AS payment_count
  FROM olist_orders JOIN olist_order_payments USING (order_id)
  GROUP BY month, payment_type
),
with_share AS (
  SELECT *,
    ROUND(payment_count * 100.0
      / SUM(payment_count) OVER (PARTITION BY month), 1) AS pct_share
  FROM monthly_payments
)
SELECT * FROM with_share ORDER BY payment_type, month


SUM() OVER (PARTITION BY month) in the denominator gives percent-of-total within each month.

Drill down: broad to specific

When something looks unusual: start broad, then narrow down.


Step 1: yearly totals

year alerts
2022 8,400
2023 14,200
2024 15,800
2025 31,600


2025 has double the alerts. What happened?

Drill down: add the level column

year level alerts
2024 oblast 15,800
2025 oblast 15,900
2025 raion 15,700


Not more alerts. More granular recording.

The system started reporting at raion level in addition to oblast. The total doubled because there are now two rows per alert event, not because there are more alerts.
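
In SQL the drill-down is the same aggregate with one more grouping column. A sketch, assuming the alerts table carries a level column ('oblast' or 'raion'; names are illustrative):

```sql
-- Same yearly count, broken down by recording level.
SELECT EXTRACT(YEAR FROM start_time) AS year,
  level,
  COUNT(*) AS alerts
FROM alerts
GROUP BY year, level
ORDER BY year, level
```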


Always investigate before you remove or correct.

Where would you look first?

Pick a scenario. What’s your drill-down path?


Scenario                                    Starting signal
Marketing spend doubled but signups didn’t  Total signups flat vs. last month
E-commerce revenue dropped 20% on Monday    Daily revenue below threshold
Mobile app crashes spiked this week         Error count 3× normal
Support tickets doubled in February         Monthly ticket volume


For each: what dimension would you break down by first? Then what?

When groups aren’t obvious

So far we’ve broken data down by dimensions that already exist in the data: date, region, payment type.


But sometimes the most important grouping isn’t in your data — you have to construct it.


“Are January customers different from March customers?”

The group is defined by when someone first appeared — a value you compute, not one you read from a column.

Cohort Analysis

How would you measure “do customers come back?”


You run an online store. The CEO asks: “Are we keeping our customers?”


You have an orders table with user_id, created_at, sale_price.


How would you approach this? What do you need to compute?

Cohort analysis


A cohort = a group of entities that share a common starting event in a time period.


Most common use: retention — “Of the customers who first purchased in January, how many came back in February? In March?”


Same structure works for customers, sellers, app users, subscribers, or any entity with timestamps.

Three-step recipe

1. Assign cohort

user first_order
u1 Jan
u2 Jan
u3 Feb
u4 Feb
u5 Feb

MIN(created_at) per user → DATE_TRUNC to month

2. Period offset

user cohort order_month offset
u1 Jan Jan 0
u1 Jan Mar 2
u2 Jan Jan 0
u3 Feb Feb 0
u3 Feb Apr 2

DATE_DIFF from cohort to activity

3. Aggregate

cohort offset users
Jan 0 2
Jan 2 1
Feb 0 3
Feb 2 1

COUNT(DISTINCT user_id) per cohort × period

Retention query in SQL

-- Step 1: Assign cohort
WITH user_cohorts AS (
  SELECT user_id,
    DATE_TRUNC(DATE(MIN(created_at)), MONTH) AS cohort_month
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  GROUP BY user_id
),
-- Step 2: Period offset
cohort_activity AS (
  SELECT uc.user_id, uc.cohort_month,
    DATE_DIFF(DATE_TRUNC(DATE(oi.created_at), MONTH),
              uc.cohort_month, MONTH) AS months_since_first
  FROM user_cohorts uc
  JOIN `bigquery-public-data.thelook_ecommerce.order_items` oi USING (user_id)
),
-- Step 3: Aggregate
retention AS (
  SELECT cohort_month, months_since_first,
    COUNT(DISTINCT user_id) AS retained_users
  FROM cohort_activity GROUP BY 1, 2
)
-- Add retention rate
SELECT cohort_month, months_since_first, retained_users,
  FIRST_VALUE(retained_users) OVER (
    PARTITION BY cohort_month ORDER BY months_since_first
  ) AS cohort_size,
  ROUND(retained_users * 100.0 / FIRST_VALUE(retained_users) OVER (
    PARTITION BY cohort_month ORDER BY months_since_first
  ), 1) AS retention_pct
FROM retention
ORDER BY cohort_month, months_since_first

Reading a retention table

cohort   month 0  month 1  month 2  month 3  month 4
Jan ’25  100%     8.2%     5.1%     4.3%     3.8%
Feb ’25  100%     7.9%     4.8%     4.1%
Mar ’25  100%     8.5%     5.0%
Apr ’25  100%     7.6%


  • Month 0 is always 100% (by definition)
  • The triangle: newer cohorts have fewer periods of data
  • Steep initial drop, then flattening — a common pattern

Wrap-up

Today’s patterns


Pattern                What it catches
Date spine             Missing days that silently break trends
Threshold monitoring   Values that deviate from their own baseline
Collection drop check  Data that stopped arriving from a source
Distribution shift     Composition changes hidden inside stable totals
Drill-down             The real explanation behind a surface-level anomaly
Cohort retention       Whether the same users come back over time


The common thread: understand what “normal” looks like before deciding something is wrong.

Next week and announcements


Tomorrow’s practices:

  • Practice 1: Build cohort retention on thelook_ecommerce → submission view
  • Practice 2: Open-ended exploration of air raid alerts — the brainstorm questions from today are your starting point


Coming up:

  • Assignment 5 (complex analytical patterns) released this week
  • Group project: groups should be formed, start exploring your dataset
  • Next week: complex data structures — ARRAYs, STRUCTs, JSON, UNNEST