ECON250 - Big Data Analytics | Week 3
2026-02-17
Questions that sound simple are often not so easy to answer.
What do you need to know to answer the question correctly?
| Analyst | Definition | Result |
|---|---|---|
| A | All order_items.sale_price summed | $1,240,000 |
| B | Excluding returned items | $1,105,000 |
| C | After discounts and shipping | $980,000 |
The SQL was correct in all three cases — the disagreement was about what “revenue” means.
1. Clarify the definition — what exactly do we mean by this metric?
2. Identify the grain — one row per what?
3. Choose the aggregation type — SUM, COUNT, AVG, ratio? ← rest of today
4. Write the query
5. Validate — does the result make sense?
COUNTIF(condition) — BigQuery shorthand for SUM(CASE WHEN ... THEN 1 ELSE 0 END)

orders
| order_id | revenue |
|---|---|
| 1 | $50 |
| 2 | $30 |
| 3 | $20 |
Total revenue: $100
order_items (multiple items per order)
| order_id | item |
|---|---|
| 1 | Shirt |
| 1 | Pants |
| 2 | Shoes |
| 3 | Hat |
| 3 | Scarf |
| 3 | Gloves |
The total should be $100. What will a query that JOINs orders to order_items and then takes SUM(revenue) return?
After the JOIN, each order row gets duplicated for every matching item:
| order_id | revenue | item |
|---|---|---|
| 1 | $50 | Shirt |
| 1 | $50 | Pants |
| 2 | $30 | Shoes |
| 3 | $20 | Hat |
| 3 | $20 | Scarf |
| 3 | $20 | Gloves |
SUM(revenue) = $50 + $50 + $30 + $20 + $20 + $20 = $190 (wrong!)
Defensive pattern: compare COUNT(*) before and after the JOIN. If it increases, you have fan-out.
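Since the BigQuery tables aren't available here, the fan-out and the defensive row-count check can be sketched with Python's sqlite3 on the slide's toy data (table and column names mirror the slides):

```python
import sqlite3

# In-memory stand-ins for the slide's orders / order_items tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (order_id INTEGER, revenue INTEGER);
INSERT INTO orders VALUES (1, 50), (2, 30), (3, 20);
CREATE TABLE order_items (order_id INTEGER, item TEXT);
INSERT INTO order_items VALUES
  (1, 'Shirt'), (1, 'Pants'), (2, 'Shoes'),
  (3, 'Hat'), (3, 'Scarf'), (3, 'Gloves');
""")

# Fan-out: the JOIN duplicates each order's revenue once per matching item.
joined_sum = con.execute("""
    SELECT SUM(o.revenue)
    FROM orders o JOIN order_items i USING (order_id)
""").fetchone()[0]
print(joined_sum)  # 190, not 100

# Defensive pattern: compare row counts before and after the JOIN.
before = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
after = con.execute("""
    SELECT COUNT(*) FROM orders o JOIN order_items i USING (order_id)
""").fetchone()[0]
print(before, after)  # 3 -> 6: fan-out detected
```

The same queries run unchanged in BigQuery against real tables; only the data loading differs.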
Consider these 5 rows from order_items:
| order_id | sale_price | shipped_at | returned_at |
|---|---|---|---|
| 1001 | $45 | 2025-01-03 | NULL |
| 1002 | $30 | 2025-01-04 | 2025-01-15 |
| 1003 | $60 | NULL | NULL |
| 1004 | $25 | 2025-01-06 | 2025-01-20 |
| 1005 | $80 | 2025-01-05 | NULL |
Predict: will these three numbers be the same or different?
COUNT(*) = 5 | COUNT(shipped_at) = 4 | COUNT(returned_at) = 2 — all different, because COUNT(column) skips NULL values.
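A quick way to verify this behavior is to load the five rows into sqlite3 (whose COUNT semantics match BigQuery's here) and run all three counts at once:

```python
import sqlite3

# The five order_items rows from the slide; None plays the role of NULL.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE order_items
               (order_id INT, sale_price REAL, shipped_at TEXT, returned_at TEXT)""")
con.executemany("INSERT INTO order_items VALUES (?,?,?,?)", [
    (1001, 45, '2025-01-03', None),
    (1002, 30, '2025-01-04', '2025-01-15'),
    (1003, 60, None,         None),
    (1004, 25, '2025-01-06', '2025-01-20'),
    (1005, 80, '2025-01-05', None),
])

# COUNT(*) counts rows; COUNT(column) counts only non-NULL values.
rows, shipped, returned = con.execute(
    "SELECT COUNT(*), COUNT(shipped_at), COUNT(returned_at) FROM order_items"
).fetchone()
print(rows, shipped, returned)  # 5 4 2
```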
I calculated daily unique customers for January and summed them up. Will this equal the monthly unique customer count?
Raw orders (January)
| day | user_id |
|---|---|
| Jan 5 | Alice |
| Jan 5 | Bob |
| Jan 12 | Carol |
| Jan 20 | Alice |
| Jan 20 | Carol |
Daily uniques
| day | unique_users |
|---|---|
| Jan 5 | 2 (Alice, Bob) |
| Jan 12 | 1 (Carol) |
| Jan 20 | 2 (Alice, Carol) |
Sum of daily: 5
Monthly unique users: 3 (Alice, Bob, Carol)
Alice and Carol each ordered on two different days — they get counted once per day but once for the whole month.
WITH daily AS (
  SELECT
    DATE(created_at) AS day,
    COUNT(DISTINCT user_id) AS unique_users
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
  GROUP BY day
)
SELECT
  SUM(unique_users) AS summed_daily_uniques,
  (SELECT COUNT(DISTINCT user_id)
   FROM `bigquery-public-data.thelook_ecommerce.order_items`
   WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
  ) AS monthly_uniques
FROM daily

A dashboard that sums daily active users to get monthly active users will overcount
“Unique visitors by region” columns won’t sum to “total unique visitors” (same visitor can appear in multiple regions)
This is why platforms report total views (additive — easy to sum) but are careful about unique reach (requires deduplication at each grain)
Rule: always compute distinct counts at the grain you need — don’t try to aggregate from a finer grain.
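The rule can be checked on the slide's five January orders with a small sqlite3 sketch (dates pre-truncated to days, so no DATE() call is needed):

```python
import sqlite3

# The slide's January orders: Alice and Carol each appear on two days.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (day TEXT, user_id TEXT)")
con.executemany("INSERT INTO orders VALUES (?,?)", [
    ('2025-01-05', 'Alice'), ('2025-01-05', 'Bob'),
    ('2025-01-12', 'Carol'),
    ('2025-01-20', 'Alice'), ('2025-01-20', 'Carol'),
])

# Summing per-day distinct counts double-counts repeat customers.
summed_daily = con.execute("""
    SELECT SUM(u) FROM (
        SELECT COUNT(DISTINCT user_id) AS u FROM orders GROUP BY day
    )
""").fetchone()[0]

# Distinct count computed directly at the monthly grain.
monthly = con.execute(
    "SELECT COUNT(DISTINCT user_id) FROM orders"
).fetchone()[0]
print(summed_daily, monthly)  # 5 vs 3
```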
“Unique customers” depends on the identifier you choose:
| Identifier | Counts |
|---|---|
| COUNT(DISTINCT user_id) | Unique accounts |
| COUNT(DISTINCT email) | Unique emails (some users have multiple accounts) |
| COUNT(DISTINCT session_id) | Unique visits (one user = many sessions) |
Different identifiers produce different numbers. The metric definition must specify which.
Think of 2–3 business metrics that are calculated as ratios or averages.
Every average is also a ratio: AVG(x) = SUM(x) / COUNT(x). Ratios have their own category of pitfalls.
“What share of revenue does each product category contribute?”
-- Step 1: Revenue per category
WITH category_revenue AS (
SELECT category, SUM(sale_price) AS revenue
FROM order_items
JOIN products USING (product_id)
GROUP BY category
),
-- Step 2: Compute the grand total
total AS (
SELECT SUM(revenue) AS grand_total FROM category_revenue
)
-- Step 3: Divide each category by the total
SELECT category, revenue,
ROUND(revenue * 100.0 / grand_total, 2) AS pct_of_total
FROM category_revenue CROSS JOIN total
ORDER BY revenue DESC

The CROSS JOIN attaches the single grand total to every row, so each row can compute its percentage.
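The CROSS JOIN pattern works the same way in sqlite3, so it can be sketched locally on two made-up category totals (the figures here are illustrative, not from thelook_ecommerce):

```python
import sqlite3

# Pre-aggregated stand-in for the category_revenue CTE.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE category_revenue (category TEXT, revenue REAL)")
con.executemany("INSERT INTO category_revenue VALUES (?,?)",
                [('Clothing', 10000), ('Electronics', 50000)])

# CROSS JOIN the one-row grand total onto every category row.
rows = con.execute("""
    WITH total AS (SELECT SUM(revenue) AS grand_total FROM category_revenue)
    SELECT category, revenue,
           ROUND(revenue * 100.0 / grand_total, 2) AS pct_of_total
    FROM category_revenue CROSS JOIN total
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # Electronics 83.33%, Clothing 16.67%
```

In BigQuery the same result can also be written with a window function, SUM(revenue) OVER (), avoiding the second CTE.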
Consider this data — two orders, five items total, one item returned:
| order_id | item | returned_at |
|---|---|---|
| 101 | Shirt | NULL |
| 101 | Pants | NULL |
| 101 | Belt | 2025-01-20 |
| 102 | Jacket | NULL |
| 102 | Shoes | NULL |
What’s the return rate? Can you find more than one way to calculate it?
Approach A: item-level
Approach B: order-level
-- Orders with returns / total orders
WITH order_returns AS (
  SELECT order_id,
    MAX(CASE WHEN returned_at IS NOT NULL THEN 1 ELSE 0 END) AS has_return
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  WHERE created_at >= '2025-01-01'
  GROUP BY order_id
)
SELECT ROUND(
  SUM(has_return) * 100.0 / COUNT(*), 2
) AS return_rate
FROM order_returns

Different denominators (items vs orders), different numbers. Neither is wrong — they answer different questions.
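Both approaches can be computed side by side on the slide's five rows with sqlite3 (same SQL apart from the table source):

```python
import sqlite3

# Two orders, five items, one returned item (order 101's Belt).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_items (order_id INT, item TEXT, returned_at TEXT)")
con.executemany("INSERT INTO order_items VALUES (?,?,?)", [
    (101, 'Shirt', None), (101, 'Pants', None), (101, 'Belt', '2025-01-20'),
    (102, 'Jacket', None), (102, 'Shoes', None),
])

# Approach A: returned items / total items (COUNT(col) skips NULLs).
item_rate = con.execute(
    "SELECT ROUND(COUNT(returned_at) * 100.0 / COUNT(*), 2) FROM order_items"
).fetchone()[0]

# Approach B: orders with at least one return / total orders.
order_rate = con.execute("""
    WITH order_returns AS (
        SELECT order_id,
               MAX(CASE WHEN returned_at IS NOT NULL THEN 1 ELSE 0 END) AS has_return
        FROM order_items GROUP BY order_id
    )
    SELECT ROUND(SUM(has_return) * 100.0 / COUNT(*), 2) FROM order_returns
""").fetchone()[0]
print(item_rate, order_rate)  # 20.0 vs 50.0
```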
| Category | Returned | Total | Return rate |
|---|---|---|---|
| Accessories | 1 | 2 | 50% |
| Electronics | 99 | 1,000 | 9.9% |
Overall: 100 returned out of 1,002 total
Naive average of rates: (50% + 9.9%) / 2 ≈ 30% — wrong
Actual overall rate: 100 / 1,002 ≈ 10%
Kimball’s design rule: store the additive components (returned count, total count), compute the ratio at query time at the grain you need.
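Kimball's rule is easy to demonstrate in plain Python: keep the additive components per category and divide only once, at the end:

```python
# Additive components per category: (returned_count, total_count), per the slide.
categories = {'Accessories': (1, 2), 'Electronics': (99, 1000)}

# Naive: average the pre-computed rates -- ignores the very different denominators.
naive = sum(r / t for r, t in categories.values()) / len(categories) * 100

# Kimball: sum numerators and denominators separately, then take the ratio.
returned = sum(r for r, _ in categories.values())
total = sum(t for _, t in categories.values())
overall = returned / total * 100

print(round(naive, 2), round(overall, 2))  # ~29.95 vs ~9.98
```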
This query counts items that are NOT in “Complete” status. The analyst says the count seems too low. What will each row evaluate to?
| order_id | status | status != 'Complete' | included? |
|---|---|---|---|
| 1001 | Complete | ??? | ??? |
| 1002 | Shipped | ??? | ??? |
| 1003 | Returned | ??? | ??? |
| 1004 | NULL | ??? | ??? |
| 1005 | Processing | ??? | ??? |
| order_id | status | status != 'Complete' | included? |
|---|---|---|---|
| 1001 | Complete | FALSE | no |
| 1002 | Shipped | TRUE | yes |
| 1003 | Returned | TRUE | yes |
| 1004 | NULL | NULL | no |
| 1005 | Processing | TRUE | yes |
Expected: 4 rows. Actual: 3 rows — row 1004 silently dropped because NULL != 'Complete' evaluates to NULL, not TRUE.
The same trap appears with NOT IN:
WHERE x NOT IN (1, 2, NULL) returns zero rows — any comparison with NULL produces NULL, and the entire expression evaluates to NULL.
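SQLite's three-valued NULL logic matches BigQuery's, so both traps can be reproduced locally:

```python
import sqlite3

# The five orders from the slide; order 1004 has a NULL status.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, status TEXT)")
con.executemany("INSERT INTO orders VALUES (?,?)", [
    (1001, 'Complete'), (1002, 'Shipped'), (1003, 'Returned'),
    (1004, None), (1005, 'Processing'),
])

# NULL != 'Complete' evaluates to NULL, so row 1004 is silently dropped.
not_complete = con.execute(
    "SELECT COUNT(*) FROM orders WHERE status != 'Complete'"
).fetchone()[0]
print(not_complete)  # 3, not the expected 4

# Fix: handle NULL explicitly.
fixed = con.execute("""
    SELECT COUNT(*) FROM orders
    WHERE status != 'Complete' OR status IS NULL
""").fetchone()[0]
print(fixed)  # 4

# NOT IN with a NULL in the list matches nothing at all.
none_at_all = con.execute(
    "SELECT COUNT(*) FROM orders WHERE order_id NOT IN (1, 2, NULL)"
).fetchone()[0]
print(none_at_all)  # 0
```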
AVG(x) = SUM(x) / COUNT(x)
COUNT(x) excludes NULLs — so AVG only considers rows where the value exists.
Five orders, two with discounts:
| order_id | discount |
|---|---|
| 1001 | 10 |
| 1002 | NULL |
| 1003 | NULL |
| 1004 | 30 |
| 1005 | NULL |
Will these two queries return the same number?
| Question | SQL |
|---|---|
| Average discount when offered | AVG(discount) |
| Average discount across all orders | AVG(COALESCE(discount, 0)) |
COALESCE(x, fallback) returns the first non-NULL value — use it inside the aggregate when NULLs should be treated as zeros.
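Running both aggregates over the five slide rows (again via sqlite3, whose AVG and COALESCE behave like BigQuery's) makes the difference concrete:

```python
import sqlite3

# Five orders, two with discounts; None stands in for NULL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INT, discount REAL)")
con.executemany("INSERT INTO orders VALUES (?,?)",
                [(1001, 10), (1002, None), (1003, None), (1004, 30), (1005, None)])

when_offered, across_all = con.execute("""
    SELECT AVG(discount),              -- (10 + 30) / 2 = 20: NULL rows excluded
           AVG(COALESCE(discount, 0))  -- (10 + 0 + 0 + 30 + 0) / 5 = 8
    FROM orders
""").fetchone()
print(when_offered, across_all)  # 20.0 vs 8.0
```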
I computed AOV per category, then averaged those. Will this match the overall AOV?
| Category | Revenue | Orders | AOV |
|---|---|---|---|
| Clothing | $10,000 | 500 | $20 |
| Electronics | $50,000 | 200 | $250 |
Naive average of the two AOVs: ($20 + $250) / 2 = $135
Actual overall AOV: $60,000 / 700 ≈ $85.71 — the same trap as averaging return rates: weight by the denominator instead of averaging the averages.
| Function | Purpose | When to use |
|---|---|---|
| COALESCE(x, 0) | First non-NULL value | NULLs should be treated as zeros in AVG/SUM |
| IFNULL(x, 0) | Two-argument shorthand | Same as COALESCE with one fallback |
| NULLIF(x, 'Unknown') | Return NULL if x = value | Convert sentinel values to proper NULLs |
COALESCE also accepts multiple arguments and checks them in order, returning the first non-NULL — useful beyond simple fallbacks.
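For example, picking the first available contact field per user — a hypothetical users table, not part of thelook_ecommerce, sketched here in sqlite3:

```python
import sqlite3

# Hypothetical contact table: some users lack a phone, some lack everything.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INT, phone TEXT, email TEXT)")
con.executemany("INSERT INTO users VALUES (?,?,?)", [
    (1, '555-0101', 'a@example.com'),
    (2, None, 'b@example.com'),
    (3, None, None),
])

# COALESCE walks the arguments left to right and returns the first non-NULL.
rows = con.execute(
    "SELECT user_id, COALESCE(phone, email, 'no contact') FROM users"
).fetchall()
print(rows)  # [(1, '555-0101'), (2, 'b@example.com'), (3, 'no contact')]
```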
| Type | Examples | Key property | Watch out for |
|---|---|---|---|
| Additive | Revenue, order count | Parts sum to the whole | Fan-out, COUNT(*) vs COUNT(col) |
| Distinct counts | Unique customers, sessions | Cannot sum across time | Non-additivity |
| Ratios / rates | Return rate, conversion | Derived from two metrics | Denominator choice, can’t average ratios |
| Averages | AOV, avg shipping time | Sensitive to inclusions | NULLs silently excluded; can’t average averages |
Definitional ambiguity — “revenue” means different things to different people
NULL behavior — aggregations handle NULLs differently, and comparisons with NULL are neither true nor false
Wrong granularity — computing at one grain and aggregating to another (fan-out, non-additivity, averaging ratios)
Most metric bugs are silent: the query runs fine, but the number it returns is wrong.
Practice 1: Calculate business KPIs from thelook_ecommerce, encountering these pitfalls firsthand
Practice 2: Diagnose and fix queries with metric bugs, then build a defensive metrics summary

Kyiv School of Economics