ECON250 - Big Data Analytics | Week 5
2026-03-05
You have start time, end time, and region. What questions can you answer?
| Category | Examples |
|---|---|
| Counting & aggregating | Alerts per month, total hours per week |
| Comparing | Front-line vs western oblasts, weekday vs weekend |
| Finding extremes | Longest alert, biggest gap, most alerts in one day |
| Detecting patterns | Trends over time, seasonality, streaks |
Most counting and comparing you can already solve with GROUP BY and window functions.
Today: the last two categories. Finding what’s unusual, and building the right expectations.
Kyiv City, oblast-level alerts, January 2025:
| date | alerts |
|---|---|
| Jan 1 | 3 |
| Jan 2 | 2 |
| Jan 3 | 1 |
| Jan 5 | 4 |
| Jan 6 | 2 |
| Jan 8 | 1 |
Where is January 4th? January 7th?
Without date spine
| date | alerts |
|---|---|
| Jan 1 | 3 |
| Jan 2 | 2 |
| Jan 3 | 1 |
| Jan 5 | 4 |
| Jan 6 | 2 |
| Jan 8 | 1 |
With date spine
| date | alerts |
|---|---|
| Jan 1 | 3 |
| Jan 2 | 2 |
| Jan 3 | 1 |
| Jan 4 | 0 |
| Jan 5 | 4 |
| Jan 6 | 2 |
| Jan 7 | 0 |
| Jan 8 | 1 |
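The same gap-filling idea as a quick Python sketch (in BigQuery you would typically LEFT JOIN onto a spine built with GENERATE_DATE_ARRAY; the data below is the toy example from the table, not real alert counts):

```python
from datetime import date, timedelta

# Sparse daily counts -- missing days simply have no key (Jan 4, Jan 7)
alerts = {date(2025, 1, 1): 3, date(2025, 1, 2): 2, date(2025, 1, 3): 1,
          date(2025, 1, 5): 4, date(2025, 1, 6): 2, date(2025, 1, 8): 1}

def with_date_spine(counts, start, end):
    """Return one (date, count) pair per calendar day, filling gaps with 0."""
    days = (end - start).days + 1
    return [(start + timedelta(d), counts.get(start + timedelta(d), 0))
            for d in range(days)]

spine = with_date_spine(alerts, date(2025, 1, 1), date(2025, 1, 8))
# Jan 4 and Jan 7 now appear explicitly with a count of 0
```

Without the spine, a rolling average or trend line silently skips the zero days and overstates activity.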
| month | revenue |
|---|---|
| January | $142,000 |
| February | $138,000 |
| March | $68,000 ← |
“March is down 52%!”
…but it’s March 15th. The month isn’t over yet.
Filter out incomplete periods before comparing: WHERE date < DATE_TRUNC(CURRENT_DATE(), MONTH)
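A minimal Python sketch of the same cutoff logic (the dates and amounts are illustrative, mirroring the table above):

```python
from datetime import date

def complete_month_cutoff(today):
    """First day of today's month. Keep only rows strictly before this date --
    the Python analogue of WHERE date < DATE_TRUNC(CURRENT_DATE(), MONTH)."""
    return today.replace(day=1)

cutoff = complete_month_cutoff(date(2025, 3, 15))  # -> date(2025, 3, 1)

rows = [(date(2025, 1, 10), 142_000), (date(2025, 2, 12), 138_000),
        (date(2025, 3, 14), 68_000)]
complete = [r for r in rows if r[0] < cutoff]  # the partial March row is dropped
```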
| hour | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|---|
| 00-03 | 12% | 11% | 13% | 10% | 14% | 9% | 8% |
| 03-06 | 18% | 16% | 19% | 17% | 15% | 11% | 10% |
| 06-09 | 5% | 6% | 4% | 7% | 5% | 4% | 3% |
| 09-12 | 8% | 9% | 7% | 8% | 10% | 8% | 7% |
| 12-15 | 14% | 13% | 15% | 12% | 11% | 14% | 12% |
| 15-18 | 10% | 11% | 9% | 12% | 10% | 11% | 13% |
| 18-21 | 15% | 16% | 14% | 16% | 17% | 18% | 20% |
| 21-00 | 18% | 18% | 19% | 18% | 18% | 25% | 27% |
Percentage = share of alerts starting in that time block. What patterns do you see?
How many oblasts are under alert at the same time?
| timestamp | oblasts under alert |
|---|---|
| Jan 15, 14:00 | 18 |
| Jan 15, 14:30 | 22 |
| Jan 15, 15:00 | 25 |
| Jan 15, 15:30 | 19 |
How would you compute this from start_time and end_time?
Think about it: you don’t have a row per timestamp. You have a row per alert.
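One standard answer is a sweep line over the interval endpoints: +1 at every start, −1 at every end. A hedged Python sketch with made-up timestamps (in SQL you would UNPIVOT starts and ends into events and take a running SUM):

```python
from datetime import datetime as dt

def concurrent_at(alerts, ts):
    """How many (start, end) intervals cover ts?
    Half-open convention: active at start, no longer active at end."""
    return sum(1 for start, end in alerts if start <= ts < end)

def concurrency_curve(alerts):
    """Sweep line: +1 at each start, -1 at each end, track the running total.
    Ties sort ends (-1) before starts (+1), matching the half-open convention."""
    events = sorted([(s, 1) for s, _ in alerts] + [(e, -1) for _, e in alerts])
    active, curve = 0, []
    for ts, delta in events:
        active += delta
        curve.append((ts, active))
    return curve

alerts = [(dt(2025, 1, 15, 13, 50), dt(2025, 1, 15, 15, 10)),
          (dt(2025, 1, 15, 14, 20), dt(2025, 1, 15, 16, 0))]
concurrent_at(alerts, dt(2025, 1, 15, 14, 30))  # -> 2
```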
KSE policy: if an air raid alert is active at 7:30, the first class of the day is cancelled.
| semester day | alert active at 7:30? | first class |
|---|---|---|
| Mon, Feb 3 | no | held |
| Tue, Feb 4 | yes | cancelled |
| Mon, Feb 10 | no | held |
| Tue, Feb 11 | yes | cancelled |
| Mon, Feb 17 | yes | cancelled |
How many first classes were cancelled this semester? What if the window were 7:00? Or 8:00?
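The core check is a point-in-interval test, one per semester day. A small sketch with hypothetical alert intervals (the `check` parameter makes the 7:00 and 8:00 variants one-line changes):

```python
from datetime import datetime, date, time

def first_class_cancelled(alerts, day, check=time(7, 30)):
    """Cancelled iff any alert interval (start, end) covers `day` at `check`.
    Half-open: an alert ending exactly at 7:30 does not cancel."""
    ts = datetime.combine(day, check)
    return any(start <= ts < end for start, end in alerts)

# One illustrative alert on the morning of Feb 4
alerts = [(datetime(2025, 2, 4, 5, 0), datetime(2025, 2, 4, 9, 0))]
first_class_cancelled(alerts, date(2025, 2, 4))  # -> True
first_class_cancelled(alerts, date(2025, 2, 3))  # -> False
```

Counting cancellations for the semester is then a sum of this predicate over the list of teaching days.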
Let’s hear examples from you:
What makes something “anomalous”?
| Type | Example | Is the data “wrong”? |
|---|---|---|
| Data quality issues | Negative order amount, timestamp from 1970 | Yes |
| Genuine outliers | Front-line oblast with a 600-day continuous alert | No |
| Pattern breaks | Revenue drops 30% vs last month | Maybe |
| Data collection changes | Alert count doubles overnight | No |
| Distribution shifts | Mobile traffic jumps from 60% to 85% | Maybe |
Only the first type is “wrong.” The rest need investigation before you act.
Could you name more possible types of anomalies not in this table?
In a textbook: compute z-scores, flag outliers.
In practice: compare each value to a simple expectation and ask “does this look reasonable?”
This is called sanity checking, and it’s what most monitoring systems do.
Compare each day to its own rolling baseline. Flag significant deviations.
| date | revenue | 30-day avg | ratio | status |
|---|---|---|---|---|
| Mar 8 | $4,200 | $4,050 | 1.04 | normal |
| Mar 9 | $3,800 | $4,060 | 0.94 | normal |
| Mar 10 | $4,100 | $4,055 | 1.01 | normal |
| Mar 11 | $1,900 | $4,050 | 0.47 | ALERT: low |
| Mar 12 | $4,300 | $4,020 | 1.07 | normal |
Thresholds come from domain knowledge, not statistics. You decide what “too different” means.
WITH daily_revenue AS (
SELECT DATE(created_at) AS order_date, SUM(sale_price) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2025-01-01'
GROUP BY order_date
),
with_baseline AS (
SELECT *, AVG(revenue) OVER (
ORDER BY order_date ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
) AS avg_30d
FROM daily_revenue
)
SELECT *, ROUND(revenue / avg_30d, 2) AS ratio,
CASE
WHEN revenue < avg_30d * 0.5 THEN 'ALERT: low'
WHEN revenue > avg_30d * 2.0 THEN 'ALERT: high'
ELSE 'normal'
END AS status
FROM with_baseline
WHERE avg_30d IS NOT NULL
Monthly alert count by oblast. Is every region still reporting?
| oblast | month | alerts | prev_month | ratio |
|---|---|---|---|---|
| Kyiv City | Jan | 42 | 38 | 1.11 |
| Kyiv City | Feb | 45 | 42 | 1.07 |
| Lvivska | Jan | 28 | 31 | 0.90 |
| Lvivska | Feb | 3 | 28 | 0.11 ← |
An 89% drop. Data pipeline issue? Or something real that needs investigation?
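The check itself is simple: compare each region-month to the previous month and flag large drops. A Python sketch using the numbers from the table (the 0.3 threshold is a judgment call, not a statistical constant):

```python
def flag_collection_drops(monthly_counts, threshold=0.3):
    """monthly_counts: {region: [count_m1, count_m2, ...]} in time order.
    Flag (region, month_index) where a count falls below `threshold`
    of the prior month's count."""
    flags = []
    for region, counts in monthly_counts.items():
        for i in range(1, len(counts)):
            if counts[i - 1] > 0 and counts[i] / counts[i - 1] < threshold:
                flags.append((region, i))
    return flags

flag_collection_drops({"Kyiv City": [38, 42, 45], "Lvivska": [31, 28, 3]})
# -> [("Lvivska", 2)]  (3/28 = 0.11, well below the threshold)
```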
Payment method shares on Olist marketplace, monthly:
| month | credit_card | boleto | voucher | debit |
|---|---|---|---|---|
| 2017-07 | 73.8% | 19.2% | 5.4% | 1.6% |
| 2017-08 | 74.1% | 18.9% | 5.3% | 1.7% |
| 2017-09 | 74.0% | 19.1% | 5.2% | 1.7% |
| 2017-10 | 68.2% | 24.5% | 5.5% | 1.8% |
Individual values are fine. But the mix changed: credit card share dropped 6 points, boleto gained 5.
Why? A UI change? A promotion? Fraud? Seasonal effect?
WITH monthly_payments AS (
SELECT
DATE_TRUNC(DATE(order_purchase_timestamp), MONTH) AS month,
payment_type,
COUNT(*) AS payment_count
FROM olist_orders JOIN olist_order_payments USING (order_id)
GROUP BY month, payment_type
),
with_share AS (
SELECT *,
ROUND(payment_count * 100.0
/ SUM(payment_count) OVER (PARTITION BY month), 1) AS pct_share
FROM monthly_payments
)
SELECT * FROM with_share ORDER BY payment_type, month
SUM() OVER (PARTITION BY month) in the denominator gives percent-of-total within each month.
When something looks unusual: start broad, then narrow down.
Step 1: yearly totals
| year | alerts |
|---|---|
| 2022 | 8,400 |
| 2023 | 14,200 |
| 2024 | 15,800 |
| 2025 | 31,600 ← |
2025 has double the alerts. What happened?
Step 2: break down by the level column
| year | level | alerts |
|---|---|---|
| 2024 | oblast | 15,800 |
| 2025 | oblast | 15,900 |
| 2025 | raion | 15,700 |
Not more alerts. More granular recording.
The system started reporting at raion level in addition to oblast. The total doubled because there are now two rows per alert event, not because there are more alerts.
Always investigate before you remove or correct.
Pick a scenario. What’s your drill-down path?
| Scenario | Starting signal |
|---|---|
| Marketing spend doubled but signups didn’t | Total signups flat vs. last month |
| E-commerce revenue dropped 20% on Monday | Daily revenue below threshold |
| Mobile app crashes spiked this week | Error count 3× normal |
| Support tickets doubled in February | Monthly ticket volume |
For each: what dimension would you break down by first? Then what?
So far we’ve broken data down by dimensions that already exist in the data: date, region, payment type.
But sometimes the most important grouping isn’t in your data — you have to construct it.
“Are January customers different from March customers?”
The group is defined by when someone first appeared — a value you compute, not one you read from a column.
You run an online store. The CEO asks: “Are we keeping our customers?”
You have an orders table with user_id, created_at, sale_price.
How would you approach this? What do you need to compute?
A cohort = a group of entities that share a common starting event in a time period.
Most common use: retention — “Of the customers who first purchased in January, how many came back in February? In March?”
Same structure works for customers, sellers, app users, subscribers, or any entity with timestamps.
1. Assign cohort
| user | first_order |
|---|---|
| u1 | Jan |
| u2 | Jan |
| u3 | Feb |
| u4 | Feb |
| u5 | Feb |
MIN(created_at) per user → DATE_TRUNC to month
2. Period offset
| user | cohort | order_month | offset |
|---|---|---|---|
| u1 | Jan | Jan | 0 |
| u1 | Jan | Mar | 2 |
| u2 | Jan | Jan | 0 |
| u3 | Feb | Feb | 0 |
| u3 | Feb | Apr | 2 |
DATE_DIFF from cohort to activity
3. Aggregate
| cohort | offset | users |
|---|---|---|
| Jan | 0 | 2 |
| Jan | 2 | 1 |
| Feb | 0 | 3 |
| Feb | 2 | 1 |
COUNT(DISTINCT user_id) per cohort × period
-- Step 1: Assign cohort
WITH user_cohorts AS (
SELECT user_id,
DATE_TRUNC(DATE(MIN(created_at)), MONTH) AS cohort_month
FROM `bigquery-public-data.thelook_ecommerce.order_items`
GROUP BY user_id
),
-- Step 2: Period offset
cohort_activity AS (
SELECT uc.user_id, uc.cohort_month,
DATE_DIFF(DATE_TRUNC(DATE(oi.created_at), MONTH),
uc.cohort_month, MONTH) AS months_since_first
FROM user_cohorts uc
JOIN `bigquery-public-data.thelook_ecommerce.order_items` oi USING (user_id)
),
-- Step 3: Aggregate
retention AS (
SELECT cohort_month, months_since_first,
COUNT(DISTINCT user_id) AS retained_users
FROM cohort_activity GROUP BY 1, 2
)
-- Add retention rate
SELECT cohort_month, months_since_first, retained_users,
FIRST_VALUE(retained_users) OVER (
PARTITION BY cohort_month ORDER BY months_since_first
) AS cohort_size,
ROUND(retained_users * 100.0 / FIRST_VALUE(retained_users) OVER (
PARTITION BY cohort_month ORDER BY months_since_first
), 1) AS retention_pct
FROM retention
ORDER BY cohort_month, months_since_first
| cohort | month 0 | month 1 | month 2 | month 3 | month 4 |
|---|---|---|---|---|---|
| Jan ’25 | 100% | 8.2% | 5.1% | 4.3% | 3.8% |
| Feb ’25 | 100% | 7.9% | 4.8% | 4.1% | |
| Mar ’25 | 100% | 8.5% | 5.0% | | |
| Apr ’25 | 100% | 7.6% | | | |
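The SQL above returns one row per (cohort, offset); the triangle layout is a pivot of that long format. A minimal Python sketch with illustrative numbers:

```python
def pivot_retention(rows):
    """rows: (cohort, offset, retention_pct) tuples in long format.
    Returns {cohort: {offset: pct}} -- the wide 'triangle' layout:
    younger cohorts simply have fewer offsets, hence the ragged right edge."""
    matrix = {}
    for cohort, offset, pct in rows:
        matrix.setdefault(cohort, {})[offset] = pct
    return matrix

rows = [("Jan", 0, 100.0), ("Jan", 1, 8.2), ("Jan", 2, 5.1),
        ("Feb", 0, 100.0), ("Feb", 1, 7.9)]
pivot_retention(rows)["Jan"][1]  # -> 8.2
```

In BigQuery the same pivot can be done with conditional aggregation or the PIVOT operator; in practice a BI tool often handles this step.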
| Pattern | What it catches |
|---|---|
| Date spine | Missing days that silently break trends |
| Threshold monitoring | Values that deviate from their own baseline |
| Collection drop check | Data that stopped arriving from a source |
| Distribution shift | Composition changes hidden inside stable totals |
| Drill-down | The real explanation behind a surface-level anomaly |
| Cohort retention | Whether the same users come back over time |
The common thread: understand what “normal” looks like before deciding something is wrong.
Tomorrow’s practices:
Coming up:

Kyiv School of Economics