ECON250 - Big Data Analytics | Week 5
2026-03-05
You have start time, end time, and region. What questions can you answer?
| Category | Examples |
|---|---|
| Counting & aggregating | Alerts per month, total hours per week |
| Comparing | Front-line vs western oblasts, weekday vs weekend |
| Finding extremes | Longest alert, biggest gap, most alerts in one day |
| Detecting patterns | Trends over time, seasonality, streaks |
Most counting and comparing you can already solve with GROUP BY and window functions.
Today: the last two categories. Finding what’s unusual, and building the right expectations.
Kyiv City, oblast-level alerts, January 2025:
| date | alerts |
|---|---|
| Jan 1 | 3 |
| Jan 2 | 2 |
| Jan 3 | 1 |
| Jan 5 | 4 |
| Jan 6 | 2 |
| Jan 8 | 1 |
Where is January 4th? January 7th?
Without date spine
| date | alerts |
|---|---|
| Jan 1 | 3 |
| Jan 2 | 2 |
| Jan 3 | 1 |
| Jan 5 | 4 |
| Jan 6 | 2 |
| Jan 8 | 1 |
With date spine
| date | alerts |
|---|---|
| Jan 1 | 3 |
| Jan 2 | 2 |
| Jan 3 | 1 |
| Jan 4 | 0 |
| Jan 5 | 4 |
| Jan 6 | 2 |
| Jan 7 | 0 |
| Jan 8 | 1 |
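The same gap-filling idea as a quick Python sketch (in BigQuery you would typically LEFT JOIN onto a spine built with GENERATE_DATE_ARRAY; the data below is the toy example from the table, not real alert counts):

```python
from datetime import date, timedelta

# Sparse daily counts -- missing days simply have no key (Jan 4, Jan 7)
alerts = {date(2025, 1, 1): 3, date(2025, 1, 2): 2, date(2025, 1, 3): 1,
          date(2025, 1, 5): 4, date(2025, 1, 6): 2, date(2025, 1, 8): 1}

def with_date_spine(counts, start, end):
    """Return one (date, count) pair per calendar day, filling gaps with 0."""
    days = (end - start).days + 1
    return [(start + timedelta(d), counts.get(start + timedelta(d), 0))
            for d in range(days)]

spine = with_date_spine(alerts, date(2025, 1, 1), date(2025, 1, 8))
# Jan 4 and Jan 7 now appear explicitly with a count of 0
```

Without the spine, a rolling average or trend line silently skips the zero days and overstates activity.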
| month | revenue |
|---|---|
| January | $142,000 |
| February | $138,000 |
| March | $68,000 ← |
“March is down 52%!”
…but it’s March 15th. The month isn’t over yet.
Filter out incomplete periods before comparing: WHERE date < DATE_TRUNC(CURRENT_DATE(), MONTH)
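A minimal Python sketch of the same cutoff logic (the dates and amounts are illustrative, mirroring the table above):

```python
from datetime import date

def complete_month_cutoff(today):
    """First day of today's month. Keep only rows strictly before this date --
    the Python analogue of WHERE date < DATE_TRUNC(CURRENT_DATE(), MONTH)."""
    return today.replace(day=1)

cutoff = complete_month_cutoff(date(2025, 3, 15))  # -> date(2025, 3, 1)

rows = [(date(2025, 1, 10), 142_000), (date(2025, 2, 12), 138_000),
        (date(2025, 3, 14), 68_000)]
complete = [r for r in rows if r[0] < cutoff]  # the partial March row is dropped
```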
| hour | Mon | Tue | Wed | Thu | Fri | Sat | Sun |
|---|---|---|---|---|---|---|---|
| 00-03 | 12% | 11% | 13% | 10% | 14% | 9% | 8% |
| 03-06 | 18% | 16% | 19% | 17% | 15% | 11% | 10% |
| 06-09 | 5% | 6% | 4% | 7% | 5% | 4% | 3% |
| 09-12 | 8% | 9% | 7% | 8% | 10% | 8% | 7% |
| 12-15 | 14% | 13% | 15% | 12% | 11% | 14% | 12% |
| 15-18 | 10% | 11% | 9% | 12% | 10% | 11% | 13% |
| 18-21 | 15% | 16% | 14% | 16% | 17% | 18% | 20% |
| 21-00 | 18% | 18% | 19% | 18% | 18% | 25% | 27% |
Percentage = share of alerts starting in that time block. What patterns do you see?
How many oblasts are under alert at the same time?
| timestamp | oblasts under alert |
|---|---|
| Jan 15, 14:00 | 18 |
| Jan 15, 14:30 | 22 |
| Jan 15, 15:00 | 25 |
| Jan 15, 15:30 | 19 |
How would you compute this from start_time and end_time?
Think about it: you don’t have a row per timestamp. You have a row per alert.
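One standard answer is a sweep line over the interval endpoints: +1 at every start, −1 at every end. A hedged Python sketch with made-up timestamps (in SQL you would UNPIVOT starts and ends into events and take a running SUM):

```python
from datetime import datetime as dt

def concurrent_at(alerts, ts):
    """How many (start, end) intervals cover ts?
    Half-open convention: active at start, no longer active at end."""
    return sum(1 for start, end in alerts if start <= ts < end)

def concurrency_curve(alerts):
    """Sweep line: +1 at each start, -1 at each end, track the running total.
    Ties sort ends (-1) before starts (+1), matching the half-open convention."""
    events = sorted([(s, 1) for s, _ in alerts] + [(e, -1) for _, e in alerts])
    active, curve = 0, []
    for ts, delta in events:
        active += delta
        curve.append((ts, active))
    return curve

alerts = [(dt(2025, 1, 15, 13, 50), dt(2025, 1, 15, 15, 10)),
          (dt(2025, 1, 15, 14, 20), dt(2025, 1, 15, 16, 0))]
concurrent_at(alerts, dt(2025, 1, 15, 14, 30))  # -> 2
```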
KSE policy: if an air raid alert is active at 7:30, the first class of the day is cancelled.
| semester day | alert active at 7:30? | first class |
|---|---|---|
| Mon, Feb 3 | no | held |
| Tue, Feb 4 | yes | cancelled |
| Mon, Feb 10 | no | held |
| Tue, Feb 11 | yes | cancelled |
| Mon, Feb 17 | yes | cancelled |
How many first classes were cancelled this semester? What if the window were 7:00? Or 8:00?
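The core check is a point-in-interval test, one per semester day. A small sketch with hypothetical alert intervals (the `check` parameter makes the 7:00 and 8:00 variants one-line changes):

```python
from datetime import datetime, date, time

def first_class_cancelled(alerts, day, check=time(7, 30)):
    """Cancelled iff any alert interval (start, end) covers `day` at `check`.
    Half-open: an alert ending exactly at 7:30 does not cancel."""
    ts = datetime.combine(day, check)
    return any(start <= ts < end for start, end in alerts)

# One illustrative alert on the morning of Feb 4
alerts = [(datetime(2025, 2, 4, 5, 0), datetime(2025, 2, 4, 9, 0))]
first_class_cancelled(alerts, date(2025, 2, 4))  # -> True
first_class_cancelled(alerts, date(2025, 2, 3))  # -> False
```

Counting cancellations for the semester is then a sum of this predicate over the list of teaching days.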
Let’s hear examples from you:
What makes something “anomalous”?
| Type | Example | Is the data “wrong”? |
|---|---|---|
| Data quality issues | Negative order amount, timestamp from 1970 | Yes |
| Genuine outliers | Front-line oblast with a 600-day continuous alert | No |
| Pattern breaks | Revenue drops 30% vs last month | Maybe |
| Data collection changes | Alert count doubles overnight | No |
| Distribution shifts | Mobile traffic jumps from 60% to 85% | Maybe |
Only the first type is “wrong.” The rest need investigation before you act.
Could you name more possible types of anomalies not in this table?
In a textbook: compute z-scores, flag outliers.
In practice: compare each value to a simple expectation and ask “does this look reasonable?”
This is called sanity checking, and it’s what most monitoring systems do.
Compare each day to its own rolling baseline. Flag significant deviations.
| date | revenue | 30-day avg | ratio | status |
|---|---|---|---|---|
| Mar 8 | $4,200 | $4,050 | 1.04 | normal |
| Mar 9 | $3,800 | $4,060 | 0.94 | normal |
| Mar 10 | $4,100 | $4,055 | 1.01 | normal |
| Mar 11 | $1,900 | $4,050 | 0.47 | ALERT: low |
| Mar 12 | $4,300 | $4,020 | 1.07 | normal |
Thresholds come from domain knowledge, not statistics. You decide what “too different” means.
WITH daily_revenue AS (
SELECT DATE(created_at) AS order_date, SUM(sale_price) AS revenue
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2025-01-01'
GROUP BY order_date
),
with_baseline AS (
SELECT *, AVG(revenue) OVER (
ORDER BY order_date ROWS BETWEEN 30 PRECEDING AND 1 PRECEDING
) AS avg_30d
FROM daily_revenue
)
SELECT *, ROUND(revenue / avg_30d, 2) AS ratio,
CASE
WHEN revenue < avg_30d * 0.5 THEN 'ALERT: low'
WHEN revenue > avg_30d * 2.0 THEN 'ALERT: high'
ELSE 'normal'
END AS status
FROM with_baseline
WHERE avg_30d IS NOT NULL
Monthly alert count by oblast. Is every region still reporting?
| oblast | month | alerts | prev_month | ratio |
|---|---|---|---|---|
| Kyiv City | Jan | 42 | 38 | 1.11 |
| Kyiv City | Feb | 45 | 42 | 1.07 |
| Lvivska | Jan | 28 | 31 | 0.90 |
| Lvivska | Feb | 3 | 28 | 0.11 ← |
An 89% drop. Data pipeline issue? Or something real that needs investigation?
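The check itself is simple: compare each region-month to the previous month and flag large drops. A Python sketch using the numbers from the table (the 0.3 threshold is a judgment call, not a statistical constant):

```python
def flag_collection_drops(monthly_counts, threshold=0.3):
    """monthly_counts: {region: [count_m1, count_m2, ...]} in time order.
    Flag (region, month_index) where a count falls below `threshold`
    of the prior month's count."""
    flags = []
    for region, counts in monthly_counts.items():
        for i in range(1, len(counts)):
            if counts[i - 1] > 0 and counts[i] / counts[i - 1] < threshold:
                flags.append((region, i))
    return flags

flag_collection_drops({"Kyiv City": [38, 42, 45], "Lvivska": [31, 28, 3]})
# -> [("Lvivska", 2)]  (3/28 = 0.11, well below the threshold)
```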
Payment method shares on Olist marketplace, monthly:
| month | credit_card | boleto | voucher | debit |
|---|---|---|---|---|
| 2017-07 | 73.8% | 19.2% | 5.4% | 1.6% |
| 2017-08 | 74.1% | 18.9% | 5.3% | 1.7% |
| 2017-09 | 74.0% | 19.1% | 5.2% | 1.7% |
| 2017-10 | 68.2% | 24.5% | 5.5% | 1.8% |
Individual values are fine. But the mix changed: credit card share dropped 6 points, boleto gained 5.
Why? A UI change? A promotion? Fraud? Seasonal effect?
WITH monthly_payments AS (
SELECT
DATE_TRUNC(DATE(order_purchase_timestamp), MONTH) AS month,
payment_type,
COUNT(*) AS payment_count
FROM olist_orders JOIN olist_order_payments USING (order_id)
GROUP BY month, payment_type
),
with_share AS (
SELECT *,
ROUND(payment_count * 100.0
/ SUM(payment_count) OVER (PARTITION BY month), 1) AS pct_share
FROM monthly_payments
)
SELECT * FROM with_share ORDER BY payment_type, month
SUM() OVER (PARTITION BY month) in the denominator gives percent-of-total within each month.
When something looks unusual: start broad, then narrow down.
Step 1: yearly totals
| year | alerts |
|---|---|
| 2022 | 8,400 |
| 2023 | 14,200 |
| 2024 | 15,800 |
| 2025 | 31,600 ← |
2025 has double the alerts. What happened?
Step 2: break down by the level column
| year | level | alerts |
|---|---|---|
| 2024 | oblast | 15,800 |
| 2025 | oblast | 15,900 |
| 2025 | raion | 15,700 |
Not more alerts. More granular recording.
The system started reporting at raion level in addition to oblast. The total doubled because there are now two rows per alert event, not because there are more alerts.
Always investigate before you remove or correct.
Pick a scenario. What’s your drill-down path?
| Scenario | Starting signal |
|---|---|
| Marketing spend doubled but signups didn’t | Total signups flat vs. last month |
| E-commerce revenue dropped 20% on Monday | Daily revenue below threshold |
| Mobile app crashes spiked this week | Error count 3× normal |
| Support tickets doubled in February | Monthly ticket volume |
For each: what dimension would you break down by first? Then what?
So far we’ve broken data down by dimensions that already exist in the data: date, region, payment type.
But sometimes the most important grouping isn’t in your data — you have to construct it.
“Are January customers different from March customers?”
The group is defined by when someone first appeared — a value you compute, not one you read from a column.
You run an online store. The CEO asks: “Are we keeping our customers?”
You have an orders table with user_id, created_at, sale_price.
How would you approach this? What do you need to compute?
A cohort = a group of entities that share a common starting event in a time period.
Most common use: retention — “Of the customers who first purchased in January, how many came back in February? In March?”
Same structure works for customers, sellers, app users, subscribers, or any entity with timestamps.
1. Assign cohort
| user | first_order |
|---|---|
| u1 | Jan |
| u2 | Jan |
| u3 | Feb |
| u4 | Feb |
| u5 | Feb |
MIN(created_at) per user → DATE_TRUNC to month
2. Period offset
| user | cohort | order_month | offset |
|---|---|---|---|
| u1 | Jan | Jan | 0 |
| u1 | Jan | Mar | 2 |
| u2 | Jan | Jan | 0 |
| u3 | Feb | Feb | 0 |
| u3 | Feb | Apr | 2 |
DATE_DIFF from cohort to activity
3. Aggregate
| cohort | offset | users |
|---|---|---|
| Jan | 0 | 2 |
| Jan | 2 | 1 |
| Feb | 0 | 3 |
| Feb | 2 | 1 |
COUNT(DISTINCT user_id) per cohort × period
-- Step 1: Assign cohort
WITH user_cohorts AS (
SELECT user_id,
DATE_TRUNC(DATE(MIN(created_at)), MONTH) AS cohort_month
FROM `bigquery-public-data.thelook_ecommerce.order_items`
GROUP BY user_id
),
-- Step 2: Period offset
cohort_activity AS (
SELECT uc.user_id, uc.cohort_month,
DATE_DIFF(DATE_TRUNC(DATE(oi.created_at), MONTH),
uc.cohort_month, MONTH) AS months_since_first
FROM user_cohorts uc
JOIN `bigquery-public-data.thelook_ecommerce.order_items` oi USING (user_id)
),
-- Step 3: Aggregate
retention AS (
SELECT cohort_month, months_since_first,
COUNT(DISTINCT user_id) AS retained_users
FROM cohort_activity GROUP BY 1, 2
)
-- Add retention rate
SELECT cohort_month, months_since_first, retained_users,
FIRST_VALUE(retained_users) OVER (
PARTITION BY cohort_month ORDER BY months_since_first
) AS cohort_size,
ROUND(retained_users * 100.0 / FIRST_VALUE(retained_users) OVER (
PARTITION BY cohort_month ORDER BY months_since_first
), 1) AS retention_pct
FROM retention
ORDER BY cohort_month, months_since_first
| cohort | month 0 | month 1 | month 2 | month 3 | month 4 |
|---|---|---|---|---|---|
| Jan ’25 | 100% | 8.2% | 5.1% | 4.3% | 3.8% |
| Feb ’25 | 100% | 7.9% | 4.8% | 4.1% | |
| Mar ’25 | 100% | 8.5% | 5.0% | | |
| Apr ’25 | 100% | 7.6% | | | |
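The SQL above returns one row per (cohort, offset); the triangle layout is a pivot of that long format. A minimal Python sketch with illustrative numbers:

```python
def pivot_retention(rows):
    """rows: (cohort, offset, retention_pct) tuples in long format.
    Returns {cohort: {offset: pct}} -- the wide 'triangle' layout:
    younger cohorts simply have fewer offsets, hence the ragged right edge."""
    matrix = {}
    for cohort, offset, pct in rows:
        matrix.setdefault(cohort, {})[offset] = pct
    return matrix

rows = [("Jan", 0, 100.0), ("Jan", 1, 8.2), ("Jan", 2, 5.1),
        ("Feb", 0, 100.0), ("Feb", 1, 7.9)]
pivot_retention(rows)["Jan"][1]  # -> 8.2
```

In BigQuery the same pivot can be done with conditional aggregation or the PIVOT operator; in practice a BI tool often handles this step.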
| Pattern | What it catches |
|---|---|
| Date spine | Missing days that silently break trends |
| Threshold monitoring | Values that deviate from their own baseline |
| Collection drop check | Data that stopped arriving from a source |
| Distribution shift | Composition changes hidden inside stable totals |
| Drill-down | The real explanation behind a surface-level anomaly |
| Cohort retention | Whether the same users come back over time |
The common thread: understand what “normal” looks like before deciding something is wrong.
Tomorrow’s practices:
Coming up:

Kyiv School of Economics