Anatomy of a Business Metric

ECON250 - Big Data Analytics | Week 3

Oleh Omelchenko

2026-02-17

Metrics Are Definitions

“What was our revenue last month?”


Questions that sound simple are often not so easy to answer correctly.

What do you need to know to answer the question correctly?

Same question, different definitions


Analyst Definition Result
A All order_items.sale_price summed $1,240,000
B Excluding returned items $1,105,000
C After discounts and shipping $980,000


The SQL was correct in all three cases — the disagreement was about what “revenue” means.

Before you write any SQL


  1. Clarify the definition — what exactly do we mean by this metric?

  2. Identify the grain — one row per what?

  3. Choose the aggregation type — SUM, COUNT, AVG, ratio? ← rest of today

  4. Write the query

  5. Validate — does the result make sense?

Additive Metrics

Aggregation toolkit recap

SELECT
  COUNT(*) AS total_items,
  ROUND(SUM(sale_price), 2) AS total_revenue,
  COUNTIF(returned_at IS NOT NULL) AS returned_items
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2025-01-01'


  • COUNTIF(condition) — BigQuery shorthand for SUM(CASE WHEN ... THEN 1 ELSE 0 END)
  • These are all additive: slice by any dimension and the parts sum to the whole

Caveat: JOIN fan-out

orders

order_id revenue
1 $50
2 $30
3 $20

Total revenue: $100

order_items (multiple items per order)

order_id item
1 Shirt
1 Pants
2 Shoes
3 Hat
3 Scarf
3 Gloves
SELECT SUM(o.revenue) AS total_revenue
FROM orders o
JOIN order_items oi USING (order_id)

The total should be $100. What will this query return?

Fan-out: one-to-many JOINs duplicate rows

After the JOIN, each order row gets duplicated for every matching item:

order_id revenue item
1 $50 Shirt
1 $50 Pants
2 $30 Shoes
3 $20 Hat
3 $20 Scarf
3 $20 Gloves

SUM(revenue) = $50 + $50 + $30 + $20 + $20 + $20 = $190 (wrong!)


Defensive pattern: compare COUNT(*) before and after the JOIN. If it increases, you have fan-out.
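
A minimal sketch of that check, using the toy tables from this example — if joined_rows exceeds order_rows, the one-to-many JOIN has fanned out and any SUM over order-level columns will overcount:

SELECT
  (SELECT COUNT(*) FROM orders) AS order_rows,  -- row count before the JOIN
  COUNT(*) AS joined_rows                       -- row count after the JOIN
FROM orders o
JOIN order_items oi USING (order_id)

On the data above this returns order_rows = 3, joined_rows = 6 — clear evidence of fan-out.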

Caveat: COUNT(*) vs COUNT(column)

Consider these 5 rows from order_items:

order_id sale_price shipped_at returned_at
1001 $45 2025-01-03 NULL
1002 $30 2025-01-04 2025-01-15
1003 $60 NULL NULL
1004 $25 2025-01-06 2025-01-20
1005 $80 2025-01-05 NULL

Predict: will these three numbers be the same or different?

SELECT
  COUNT(*) AS total_rows,
  COUNT(returned_at) AS count_returned_at,
  COUNT(shipped_at) AS count_shipped_at
FROM `bigquery-public-data.thelook_ecommerce.order_items`

On the five sample rows: COUNT(*) = 5  |  COUNT(shipped_at) = 4  |  COUNT(returned_at) = 2 — all different, because COUNT(column) skips NULL values.

Distinct Counts

Summing daily uniques ≠ monthly uniques

I calculated daily unique customers for January and summed them up. Will this equal the monthly unique customer count?

Raw orders (January)

day user_id
Jan 5 Alice
Jan 5 Bob
Jan 12 Carol
Jan 20 Alice
Jan 20 Carol

Daily uniques

day unique_users
Jan 5 2 (Alice, Bob)
Jan 12 1 (Carol)
Jan 20 2 (Alice, Carol)

Sum of daily: 5

Monthly unique users: 3 (Alice, Bob, Carol)


Alice and Carol each ordered on two different days — they get counted once per day, but only once for the whole month.

The query that proves it

WITH daily AS (
  SELECT
    DATE(created_at) AS day,
    COUNT(DISTINCT user_id) AS unique_users
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
  GROUP BY day
)
SELECT
  SUM(unique_users) AS summed_daily_uniques,
  (SELECT COUNT(DISTINCT user_id)
   FROM `bigquery-public-data.thelook_ecommerce.order_items`
   WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
  ) AS monthly_uniques
FROM daily

Why this matters in production


  • A dashboard that sums daily active users to get monthly active users will overcount

  • “Unique visitors by region” columns won’t sum to “total unique visitors” (same visitor can appear in multiple regions)

  • This is why platforms report total views (additive — easy to sum) but are careful about unique reach (requires deduplication at each grain)


Rule: always compute distinct counts at the grain you need — don’t try to aggregate from a finer grain.
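
When several grains are needed at once, compute the distinct count directly at each one. As a sketch, BigQuery's GROUP BY ROLLUP produces both the per-day rows and a total row (the row with a NULL day) in a single pass, each deduplicated at its own grain:

SELECT
  DATE(created_at) AS day,
  COUNT(DISTINCT user_id) AS unique_users  -- deduplicated within each group
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01'
GROUP BY ROLLUP(DATE(created_at))          -- adds a grand-total row with day = NULL

The ROLLUP total is a fresh distinct count over the whole month, not a sum of the daily rows — which is exactly the point.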

What defines “distinct”?


“Unique customers” depends on the identifier you choose:

Identifier Counts
COUNT(DISTINCT user_id) Unique accounts
COUNT(DISTINCT email) Unique emails (some users have multiple accounts)
COUNT(DISTINCT session_id) Unique visits (one user = many sessions)


Different identifiers produce different numbers. The metric definition must specify which.

Ratios and Rates

Ratios are everywhere


Think of 2–3 business metrics that are calculated as ratios or averages.

  • AOV (average order value) — revenue / orders
  • Conversion rate — purchases / visits
  • Return rate — returned items / total items
  • GDP per capita — GDP / population

Every average is also a ratio: AVG(x) = SUM(x) / COUNT(x). Ratios have their own category of pitfalls.

Percent of total

“What share of revenue does each product category contribute?”

-- Step 1: Revenue per category
WITH category_revenue AS (
  SELECT category, SUM(sale_price) AS revenue
  FROM order_items
  JOIN products USING (product_id)
  GROUP BY category
),
-- Step 2: Compute the grand total
total AS (
  SELECT SUM(revenue) AS grand_total FROM category_revenue
)
-- Step 3: Divide each category by the total
SELECT category, revenue,
  ROUND(revenue * 100.0 / grand_total, 2) AS pct_of_total
FROM category_revenue CROSS JOIN total
ORDER BY revenue DESC

The CROSS JOIN attaches the single grand total to every row, so each row can compute its percentage.
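
The same result can be computed without the CTEs using a window aggregate — SUM(...) OVER () with an empty frame attaches the grand total to every output row:

SELECT category,
  SUM(sale_price) AS revenue,
  -- inner SUM aggregates per category; outer SUM(...) OVER () spans all categories
  ROUND(SUM(sale_price) * 100.0 / SUM(SUM(sale_price)) OVER (), 2) AS pct_of_total
FROM order_items
JOIN products USING (product_id)
GROUP BY category
ORDER BY revenue DESC

Both forms are equivalent here; the CTE version makes the grand total explicit, while the window version is shorter once you are comfortable reading nested aggregates.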

Caveat: the denominator changes the answer

Consider this data — two orders, five items total, one item returned:

order_id item returned_at
101 Shirt NULL
101 Pants NULL
101 Belt 2025-01-20
102 Jacket NULL
102 Shoes NULL

What’s the return rate? Can you find more than one way to calculate it?

Two definitions of “return rate”

Approach A: item-level

-- Returned items / total items
SELECT ROUND(
  COUNTIF(returned_at IS NOT NULL)
  * 100.0 / COUNT(*), 2
) AS return_rate
FROM `bigquery-public-data.thelook_ecommerce.order_items`
WHERE created_at >= '2025-01-01'

Approach B: order-level

-- Orders with returns / total orders
WITH order_returns AS (
  SELECT order_id,
    MAX(CASE WHEN returned_at
      IS NOT NULL THEN 1 ELSE 0
      END) AS has_return
  FROM `bigquery-public-data.thelook_ecommerce.order_items`
  WHERE created_at >= '2025-01-01'
  GROUP BY order_id
)
SELECT ROUND(
  SUM(has_return) * 100.0
  / COUNT(*), 2
) AS return_rate
FROM order_returns


Different denominators (items vs orders), different numbers. Neither is wrong — they answer different questions.

You can’t average ratios across groups


Category Returned Total Return rate
Accessories 1 2 50%
Electronics 99 1,000 9.9%

Overall: 100 returned out of 1,002 total

Naive average of rates: (50% + 9.9%) / 2 ≈ 30%

Actual overall rate: 100 / 1,002 = 10%

Recompute from totals, don’t average rates


Wrong

-- Average of per-category rates
WITH rates AS (
  SELECT category,
    COUNTIF(returned_at IS NOT NULL)
      / COUNT(*) AS rate
  FROM order_items
  GROUP BY category
)
SELECT AVG(rate)   -- wrong!
FROM rates

Correct

-- Overall rate from raw counts
SELECT
  COUNTIF(returned_at IS NOT NULL)
    * 100.0 / COUNT(*) AS rate
FROM order_items
WHERE created_at >= '2025-01-01'


Kimball’s design rule: store the additive components (returned count, total count), compute the ratio at query time at the grain you need.
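
A sketch of that pattern, reusing the return-rate example: keep the additive counts per category, then roll the ratio up from the stored components — never from the per-category rates:

WITH components AS (
  -- additive pieces: safe to re-aggregate at any coarser grain
  SELECT category,
    COUNTIF(returned_at IS NOT NULL) AS returned_items,
    COUNT(*) AS total_items
  FROM order_items
  JOIN products USING (product_id)
  GROUP BY category
)
SELECT ROUND(SUM(returned_items) * 100.0 / SUM(total_items), 2) AS overall_rate
FROM components

Because the numerator and denominator are summed separately, the final division is automatically weighted by category size — the 1,000-item Electronics category dominates the tiny Accessories category, as it should.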

Caveat: NULL in comparisons silently drops rows

This query counts items that are NOT in “Complete” status. The analyst says the count seems too low. What will each row evaluate to?

SELECT COUNT(*) AS not_complete
FROM order_items
WHERE status != 'Complete'

order_id status status != 'Complete' included?
1001 Complete ??? ???
1002 Shipped ??? ???
1003 Returned ??? ???
1004 NULL ??? ???
1005 Processing ??? ???

NULL comparison: the reveal

SELECT COUNT(*) AS not_complete
FROM order_items
WHERE status != 'Complete'

order_id status status != 'Complete' included?
1001 Complete FALSE no
1002 Shipped TRUE yes
1003 Returned TRUE yes
1004 NULL NULL no
1005 Processing TRUE yes

Expected: 4 rows. Actual: 3 rows — row 1004 silently dropped because NULL != 'Complete' evaluates to NULL, not TRUE.

NULL comparisons return NULL, not FALSE


-- Fix option 1: explicit NULL check
WHERE status != 'Complete' OR status IS NULL

-- Fix option 2: IS DISTINCT FROM (treats NULLs as a value)
WHERE status IS DISTINCT FROM 'Complete'


The same trap appears with NOT IN:

WHERE x NOT IN (1, 2, NULL) returns zero rows — NOT IN expands to x != 1 AND x != 2 AND x != NULL, and the x != NULL term evaluates to NULL, so the whole AND chain can never be TRUE.
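
The safe rewrites mirror the fixes above — drop the NULL from a literal list, and filter NULLs out of a subquery list explicitly (these are fragments, shown for the WHERE clause only):

-- Literal list: remove the NULL yourself
WHERE x NOT IN (1, 2)

-- Subquery list: a single NULL in t.y would empty the result,
-- so exclude NULLs before the comparison
WHERE x NOT IN (SELECT y FROM t WHERE y IS NOT NULL)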

Averages

How AVG works


AVG(x) = SUM(x) / COUNT(x)


COUNT(x) excludes NULLs — so AVG only considers rows where the value exists.

AVG ignores NULLs silently

Five orders, two with discounts:

order_id discount
1001 10
1002 NULL
1003 NULL
1004 30
1005 NULL

Will these two queries return the same number?

-- Query A
SELECT AVG(discount)
FROM orders

(10 + 30) / 2 = 20

-- Query B
SELECT SUM(discount) / COUNT(*)
FROM orders

(10 + 30) / 5 = 8

Which average do you want?


Question SQL
Average discount when offered AVG(discount)
Average discount across all orders AVG(COALESCE(discount, 0))


COALESCE(x, fallback) returns the first non-NULL value — use it inside the aggregate when NULLs should be treated as zeros.

Average of averages ≠ overall average

I computed AOV per category, then averaged those. Will this match the overall AOV?


Category Revenue Orders AOV
Clothing $10,000 500 $20
Electronics $50,000 200 $250


Average of the two AOVs: ($20 + $250) / 2 = $135

Actual overall AOV: $60,000 / 700 = $85.71
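
Computed from stored additive components, the correct figure falls out of a plain division. A sketch, assuming a hypothetical category_stats table with the revenue and orders columns from the table above:

-- Weighted by order volume, not by category count
SELECT ROUND(SUM(revenue) / SUM(orders), 2) AS overall_aov
FROM category_stats

Summing first and dividing once weights each category by its order count, which is why the result lands near Clothing's $20 rather than halfway between the two per-category AOVs.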

NULL-handling functions


Function Purpose When to use
COALESCE(x, 0) First non-NULL value NULLs should be treated as zeros in AVG/SUM
IFNULL(x, 0) Two-argument shorthand Same as COALESCE with one fallback
NULLIF(x, 'Unknown') Return NULL if x = value Convert sentinel values to proper NULLs


COALESCE checks multiple columns in order — useful beyond simple fallbacks:

SELECT COALESCE(mobile_phone, work_phone, email, 'no contact') AS best_contact

Wrap-up

Four types of metrics — reference card


Type Examples Key property Watch out for
Additive Revenue, order count Parts sum to the whole Fan-out, COUNT(*) vs COUNT(col)
Distinct counts Unique customers, sessions Cannot sum across time Non-additivity
Ratios / rates Return rate, conversion Derived from two metrics Denominator choice, can’t average ratios
Averages AOV, avg shipping time Sensitive to inclusions NULLs silently excluded, can’t average averages

Three sources of metric bugs


  1. Definitional ambiguity — “revenue” means different things to different people

  2. NULL behavior — aggregations handle NULLs differently, and comparisons with NULL are neither true nor false

  3. Wrong granularity — computing at one grain and aggregating to another (fan-out, non-additivity, averaging ratios)


Most metric bugs are silent: the query runs fine, but the number it returns is wrong.

What’s next


Practice 1: Calculate business KPIs from thelook_ecommerce, encountering these pitfalls firsthand

Practice 2: Diagnose and fix queries with metric bugs, then build a defensive metrics summary