ECON250 - Big Data Analytics | Week 3
2026-02-17
Questions that sound simple are often not so easy to answer.
What do you need to know to answer the question correctly?
| Analyst | Definition | Result |
|---|---|---|
| A | All order_items.sale_price summed | $1,240,000 |
| B | Excluding returned items | $1,105,000 |
| C | After discounts and shipping | $980,000 |
The SQL was correct in all three cases — the disagreement was about what “revenue” means.
1. Clarify the definition — what exactly do we mean by this metric?
2. Identify the grain — one row per what?
3. Choose the aggregation type — SUM, COUNT, AVG, ratio? ← rest of today
4. Write the query
5. Validate — does the result make sense?
COUNTIF(condition) — BigQuery shorthand for SUM(CASE WHEN ... THEN 1 ELSE 0 END)

orders
| order_id | revenue |
|---|---|
| 1 | $50 |
| 2 | $30 |
| 3 | $20 |
Total revenue: $100
order_items (multiple items per order)
| order_id | item |
|---|---|
| 1 | Shirt |
| 1 | Pants |
| 2 | Shoes |
| 3 | Hat |
| 3 | Scarf |
| 3 | Gloves |
The total should be $100. What will a query that JOINs orders to order_items and then takes SUM(revenue) return?
After the JOIN, each order row gets duplicated for every matching item:
| order_id | revenue | item |
|---|---|---|
| 1 | $50 | Shirt |
| 1 | $50 | Pants |
| 2 | $30 | Shoes |
| 3 | $20 | Hat |
| 3 | $20 | Scarf |
| 3 | $20 | Gloves |
SUM(revenue) = $50 + $50 + $30 + $20 + $20 + $20 = $190 (wrong!)
Defensive pattern: compare COUNT(*) before and after the JOIN. If it increases, you have fan-out.
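Since the BigQuery tables aren't available here, the fan-out and the defensive row-count check can be sketched with Python's sqlite3 on the slide's toy data (table and column names mirror the slides):

```python
import sqlite3

# In-memory stand-ins for the slide's orders / order_items tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, revenue INTEGER);
INSERT INTO orders VALUES (1, 50), (2, 30), (3, 20);
CREATE TABLE order_items (order_id INTEGER, item TEXT);
INSERT INTO order_items VALUES
  (1, 'Shirt'), (1, 'Pants'), (2, 'Shoes'),
  (3, 'Hat'), (3, 'Scarf'), (3, 'Gloves');
""")

# Fan-out: the JOIN duplicates each order's revenue once per matching item.
joined_sum = con.execute("""
    SELECT SUM(o.revenue)
    FROM orders o JOIN order_items i USING (order_id)
""").fetchone()[0]
print(joined_sum)  # 190, not 100

# Defensive pattern: compare row counts before and after the JOIN.
before = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
after = con.execute("""
    SELECT COUNT(*) FROM orders o JOIN order_items i USING (order_id)
""").fetchone()[0]
print(before, after)  # 3 -> 6: fan-out detected
```

The same queries run unchanged in BigQuery against real tables; only the data loading differs.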
Consider these 5 rows from order_items:
| order_id | sale_price | shipped_at | returned_at |
|---|---|---|---|
| 1001 | $45 | 2025-01-03 | NULL |
| 1002 | $30 | 2025-01-04 | 2025-01-15 |
| 1003 | $60 | NULL | NULL |
| 1004 | $25 | 2025-01-06 | 2025-01-20 |
| 1005 | $80 | 2025-01-05 | NULL |
Predict: will these three numbers be the same or different?
COUNT(*) = 5 | COUNT(shipped_at) = 4 | COUNT(returned_at) = 2 — all different, because COUNT(column) skips NULL values.
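A quick way to verify this behavior is to load the five rows into sqlite3 (whose COUNT semantics match BigQuery's here) and run all three counts at once:

```python
import sqlite3

# The five order_items rows from the slide; None plays the role of NULL.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE order_items
               (order_id INT, sale_price REAL, shipped_at TEXT, returned_at TEXT)""")
con.executemany("INSERT INTO order_items VALUES (?,?,?,?)", [
    (1001, 45, '2025-01-03', None),
    (1002, 30, '2025-01-04', '2025-01-15'),
    (1003, 60, None,         None),
    (1004, 25, '2025-01-06', '2025-01-20'),
    (1005, 80, '2025-01-05', None),
])

# COUNT(*) counts rows; COUNT(column) counts only non-NULL values.
rows, shipped, returned = con.execute(
    "SELECT COUNT(*), COUNT(shipped_at), COUNT(returned_at) FROM order_items"
).fetchone()
print(rows, shipped, returned)  # 5 4 2
```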
I calculated daily unique customers for January and summed them up. Will this equal the monthly unique customer count?
Raw orders (January)
| day | user_id |
|---|---|
| Jan 5 | Alice |
| Jan 5 | Bob |
| Jan 12 | Carol |
| Jan 20 | Alice |
| Jan 20 | Carol |
Daily uniques
| day | unique_users |
|---|---|
| Jan 5 | 2 (Alice, Bob) |
| Jan 12 | 1 (Carol) |
| Jan 20 | 2 (Alice, Carol) |
Sum of daily: 5
Monthly unique users: 3 (Alice, Bob, Carol)
Alice and Carol each ordered on two different days — they get counted once per day but once for the whole month.
WITH daily AS (
  SELECT
    DATE(created_at) AS day,
    COUNT(DISTINCT user_id) AS unique_users
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
  GROUP BY day
)
SELECT
  SUM(unique_users) AS summed_daily_uniques,
  (SELECT COUNT(DISTINCT user_id)
   FROM `bigquery-public-data.thelook_ecommerce.order_items`
   WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
  ) AS monthly_uniques
FROM daily

A dashboard that sums daily active users to get monthly active users will overcount
“Unique visitors by region” columns won’t sum to “total unique visitors” (same visitor can appear in multiple regions)
This is why platforms report total views (additive — easy to sum) but are careful about unique reach (requires deduplication at each grain)
Rule: always compute distinct counts at the grain you need — don’t try to aggregate from a finer grain.
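The rule can be checked on the slide's five January orders with a small sqlite3 sketch (dates pre-truncated to days, so no DATE() call is needed):

```python
import sqlite3

# The slide's January orders: Alice and Carol each appear on two days.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (day TEXT, user_id TEXT)")
con.executemany("INSERT INTO orders VALUES (?,?)", [
    ('2025-01-05', 'Alice'), ('2025-01-05', 'Bob'),
    ('2025-01-12', 'Carol'),
    ('2025-01-20', 'Alice'), ('2025-01-20', 'Carol'),
])

# Summing per-day distinct counts double-counts repeat customers.
summed_daily = con.execute("""
    SELECT SUM(u) FROM (
        SELECT COUNT(DISTINCT user_id) AS u FROM orders GROUP BY day
    )
""").fetchone()[0]

# Distinct count computed directly at the monthly grain.
monthly = con.execute(
    "SELECT COUNT(DISTINCT user_id) FROM orders"
).fetchone()[0]
print(summed_daily, monthly)  # 5 vs 3
```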
“Unique customers” depends on the identifier you choose:
| Identifier | Counts |
|---|---|
| COUNT(DISTINCT user_id) | Unique accounts |
| COUNT(DISTINCT email) | Unique emails (some users have multiple accounts) |
| COUNT(DISTINCT session_id) | Unique visits (one user = many sessions) |
Different identifiers produce different numbers. The metric definition must specify which.
Think of 2–3 business metrics that are calculated as ratios or averages.
Every average is also a ratio: AVG(x) = SUM(x) / COUNT(x). Ratios have their own category of pitfalls.
“What share of revenue does each product category contribute?”
-- Step 1: Revenue per category
WITH category_revenue AS (
SELECT category, SUM(sale_price) AS revenue
FROM order_items
JOIN products USING (product_id)
GROUP BY category
),
-- Step 2: Compute the grand total
total AS (
SELECT SUM(revenue) AS grand_total FROM category_revenue
)
-- Step 3: Divide each category by the total
SELECT category, revenue,
ROUND(revenue * 100.0 / grand_total, 2) AS pct_of_total
FROM category_revenue CROSS JOIN total
ORDER BY revenue DESC

The CROSS JOIN attaches the single grand total to every row, so each row can compute its percentage.
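The CROSS JOIN pattern works the same way in sqlite3, so it can be sketched locally on two made-up category totals (the figures here are illustrative, not from thelook_ecommerce):

```python
import sqlite3

# Pre-aggregated stand-in for the category_revenue CTE.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE category_revenue (category TEXT, revenue REAL)")
con.executemany("INSERT INTO category_revenue VALUES (?,?)",
                [('Clothing', 10000), ('Electronics', 50000)])

# CROSS JOIN the one-row grand total onto every category row.
rows = con.execute("""
    WITH total AS (SELECT SUM(revenue) AS grand_total FROM category_revenue)
    SELECT category, revenue,
           ROUND(revenue * 100.0 / grand_total, 2) AS pct_of_total
    FROM category_revenue CROSS JOIN total
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # Electronics 83.33%, Clothing 16.67%
```

In BigQuery the same result can also be written with a window function, SUM(revenue) OVER (), avoiding the second CTE.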
Consider this data — two orders, five items total, one item returned:
| order_id | item | returned_at |
|---|---|---|
| 101 | Shirt | NULL |
| 101 | Pants | NULL |
| 101 | Belt | 2025-01-20 |
| 102 | Jacket | NULL |
| 102 | Shoes | NULL |
What’s the return rate? Can you find more than one way to calculate it?
Approach A: item-level
Approach B: order-level
-- Orders with returns / total orders
WITH order_returns AS (
  SELECT order_id,
    MAX(CASE WHEN returned_at IS NOT NULL THEN 1 ELSE 0 END) AS has_return
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  WHERE created_at >= '2025-01-01'
  GROUP BY order_id
)
SELECT ROUND(
  SUM(has_return) * 100.0 / COUNT(*), 2
) AS return_rate
FROM order_returns

Different denominators (items vs orders), different numbers. Neither is wrong — they answer different questions.
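Both approaches can be computed side by side on the slide's five rows with sqlite3 (same SQL apart from the table source):

```python
import sqlite3

# Two orders, five items, one returned item (order 101's Belt).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_items (order_id INT, item TEXT, returned_at TEXT)")
con.executemany("INSERT INTO order_items VALUES (?,?,?)", [
    (101, 'Shirt', None), (101, 'Pants', None), (101, 'Belt', '2025-01-20'),
    (102, 'Jacket', None), (102, 'Shoes', None),
])

# Approach A: returned items / total items (COUNT(col) skips NULLs).
item_rate = con.execute(
    "SELECT ROUND(COUNT(returned_at) * 100.0 / COUNT(*), 2) FROM order_items"
).fetchone()[0]

# Approach B: orders with at least one return / total orders.
order_rate = con.execute("""
    WITH order_returns AS (
        SELECT order_id,
               MAX(CASE WHEN returned_at IS NOT NULL THEN 1 ELSE 0 END) AS has_return
        FROM order_items GROUP BY order_id
    )
    SELECT ROUND(SUM(has_return) * 100.0 / COUNT(*), 2) FROM order_returns
""").fetchone()[0]
print(item_rate, order_rate)  # 20.0 vs 50.0
```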
| Category | Returned | Total | Return rate |
|---|---|---|---|
| Accessories | 1 | 2 | 50% |
| Electronics | 99 | 1,000 | 9.9% |
Overall: 100 returned out of 1,002 total
Naive average of rates: (50% + 9.9%) / 2 ≈ 30% — wrong
Actual overall rate: 100 / 1,002 ≈ 10%
Kimball’s design rule: store the additive components (returned count, total count), compute the ratio at query time at the grain you need.
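Kimball's rule is easy to demonstrate in plain Python: keep the additive components per category and divide only once, at the end:

```python
# Additive components per category: (returned_count, total_count), per the slide.
categories = {'Accessories': (1, 2), 'Electronics': (99, 1000)}

# Naive: average the pre-computed rates -- ignores the very different denominators.
naive = sum(r / t for r, t in categories.values()) / len(categories) * 100

# Kimball: sum numerators and denominators separately, then take the ratio.
returned = sum(r for r, _ in categories.values())
total = sum(t for _, t in categories.values())
overall = returned / total * 100

print(round(naive, 2), round(overall, 2))  # ~29.95 vs ~9.98
```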
This query counts items that are NOT in “Complete” status. The analyst says the count seems too low. What will each row evaluate to?
| order_id | status | status != 'Complete' | included? |
|---|---|---|---|
| 1001 | Complete | ??? | ??? |
| 1002 | Shipped | ??? | ??? |
| 1003 | Returned | ??? | ??? |
| 1004 | NULL | ??? | ??? |
| 1005 | Processing | ??? | ??? |
| order_id | status | status != 'Complete' | included? |
|---|---|---|---|
| 1001 | Complete | FALSE | no |
| 1002 | Shipped | TRUE | yes |
| 1003 | Returned | TRUE | yes |
| 1004 | NULL | NULL | no |
| 1005 | Processing | TRUE | yes |
Expected: 4 rows. Actual: 3 rows — row 1004 silently dropped because NULL != 'Complete' evaluates to NULL, not TRUE.
The same trap appears with NOT IN:
WHERE x NOT IN (1, 2, NULL) returns zero rows — any comparison with NULL produces NULL, and the entire expression evaluates to NULL.
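SQLite's three-valued NULL logic matches BigQuery's, so both traps can be reproduced locally:

```python
import sqlite3

# The five orders from the slide; order 1004 has a NULL status.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, status TEXT)")
con.executemany("INSERT INTO orders VALUES (?,?)", [
    (1001, 'Complete'), (1002, 'Shipped'), (1003, 'Returned'),
    (1004, None), (1005, 'Processing'),
])

# NULL != 'Complete' evaluates to NULL, so row 1004 is silently dropped.
not_complete = con.execute(
    "SELECT COUNT(*) FROM orders WHERE status != 'Complete'"
).fetchone()[0]
print(not_complete)  # 3, not the expected 4

# Fix: handle NULL explicitly.
fixed = con.execute("""
    SELECT COUNT(*) FROM orders
    WHERE status != 'Complete' OR status IS NULL
""").fetchone()[0]
print(fixed)  # 4

# NOT IN with a NULL in the list matches nothing at all.
none_at_all = con.execute(
    "SELECT COUNT(*) FROM orders WHERE order_id NOT IN (1, 2, NULL)"
).fetchone()[0]
print(none_at_all)  # 0
```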
AVG(x) = SUM(x) / COUNT(x)
COUNT(x) excludes NULLs — so AVG only considers rows where the value exists.
Five orders, two with discounts:
| order_id | discount |
|---|---|
| 1001 | 10 |
| 1002 | NULL |
| 1003 | NULL |
| 1004 | 30 |
| 1005 | NULL |
Will these two queries return the same number?
| Question | SQL |
|---|---|
| Average discount when offered | AVG(discount) |
| Average discount across all orders | AVG(COALESCE(discount, 0)) |
COALESCE(x, fallback) returns the first non-NULL value — use it inside the aggregate when NULLs should be treated as zeros.
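Running both aggregates over the five slide rows (again via sqlite3, whose AVG and COALESCE behave like BigQuery's) makes the difference concrete:

```python
import sqlite3

# Five orders, two with discounts; None stands in for NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, discount REAL)")
con.executemany("INSERT INTO orders VALUES (?,?)",
                [(1001, 10), (1002, None), (1003, None), (1004, 30), (1005, None)])

when_offered, across_all = con.execute("""
    SELECT AVG(discount),              -- (10 + 30) / 2 = 20: NULL rows excluded
           AVG(COALESCE(discount, 0))  -- (10 + 0 + 0 + 30 + 0) / 5 = 8
    FROM orders
""").fetchone()
print(when_offered, across_all)  # 20.0 vs 8.0
```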
I computed AOV per category, then averaged those. Will this match the overall AOV?
| Category | Revenue | Orders | AOV |
|---|---|---|---|
| Clothing | $10,000 | 500 | $20 |
| Electronics | $50,000 | 200 | $250 |
Naive average of the two AOVs: ($20 + $250) / 2 = $135
Actual overall AOV: $60,000 / 700 ≈ $85.71 — the same trap as averaging return rates: weight by the denominator instead of averaging the averages.
| Function | Purpose | When to use |
|---|---|---|
| COALESCE(x, 0) | First non-NULL value | NULLs should be treated as zeros in AVG/SUM |
| IFNULL(x, 0) | Two-argument shorthand | Same as COALESCE with one fallback |
| NULLIF(x, 'Unknown') | Return NULL if x = value | Convert sentinel values to proper NULLs |
COALESCE also accepts multiple arguments and checks them in order, returning the first non-NULL — useful beyond simple fallbacks.
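For example, picking the first available contact field per user — a hypothetical users table, not part of thelook_ecommerce, sketched here in sqlite3:

```python
import sqlite3

# Hypothetical contact table: some users lack a phone, some lack everything.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INT, phone TEXT, email TEXT)")
con.executemany("INSERT INTO users VALUES (?,?,?)", [
    (1, '555-0101', 'a@example.com'),
    (2, None, 'b@example.com'),
    (3, None, None),
])

# COALESCE walks the arguments left to right and returns the first non-NULL.
rows = con.execute(
    "SELECT user_id, COALESCE(phone, email, 'no contact') FROM users"
).fetchall()
print(rows)  # [(1, '555-0101'), (2, 'b@example.com'), (3, 'no contact')]
```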
| Type | Examples | Key property | Watch out for |
|---|---|---|---|
| Additive | Revenue, order count | Parts sum to the whole | Fan-out, COUNT(*) vs COUNT(col) |
| Distinct counts | Unique customers, sessions | Cannot sum across time | Non-additivity |
| Ratios / rates | Return rate, conversion | Derived from two metrics | Denominator choice, can’t average ratios |
| Averages | AOV, avg shipping time | Sensitive to inclusions | NULLs silently excluded; can’t average averages |
Definitional ambiguity — “revenue” means different things to different people
NULL behavior — aggregations handle NULLs differently, and comparisons with NULL are neither true nor false
Wrong granularity — computing at one grain and aggregating to another (fan-out, non-additivity, averaging ratios)
Most metric bugs are silent: the query runs fine, but the number it returns is wrong.
Practice 1: Calculate business KPIs from thelook_ecommerce, encountering these pitfalls firsthand
Practice 2: Diagnose and fix queries with metric bugs, then build a defensive metrics summary

Kyiv School of Economics