Under the Hood & What Can Go Wrong

ECON250 - Big Data Analytics | Week 2

Oleh Omelchenko

2026-02-10

Recap

What happens when the data itself is wrong?


Last week: connected to BigQuery, ran queries, saw bytes scanned, created personal datasets, explored partitioning.


Today: how BigQuery works under the hood, and what goes wrong when data isn’t clean.

Key idea 1: Columnar storage


Row storage (traditional DB)

[Anna, 25, Kyiv, 45000]
[Boris, 31, Lviv, 52000]
[Clara, 28, Odesa, 48000]

To compute AVG(salary) — must read every column of every row

Column storage (BigQuery)

name:   [Anna, Boris, Clara]
age:    [25, 31, 28]
city:   [Kyiv, Lviv, Odesa]
salary: [45000, 52000, 48000]

To compute AVG(salary) — reads only the salary column

You already saw this


-- Scans ALL columns
SELECT * FROM `bigquery-public-data.thelook_ecommerce.order_items`
-- ~13 MB

-- Scans only 3 columns
SELECT order_id, sale_price, status
FROM `bigquery-public-data.thelook_ecommerce.order_items`
-- ~4.5 MB


With columnar storage, you only pay for the columns you read.

This is also why LIMIT doesn’t save money. BigQuery reads full columns first, then truncates.
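You can check this yourself in the query validator: the bytes-scanned estimate should be identical with and without LIMIT. A sketch using the same public table as above:

```sql
-- LIMIT truncates the result, not the scan: both queries read the
-- full sale_price column, so the bytes-scanned estimate is the same.
SELECT sale_price
FROM `bigquery-public-data.thelook_ecommerce.order_items`;

SELECT sale_price
FROM `bigquery-public-data.thelook_ecommerce.order_items`
LIMIT 10;
```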

Key idea 2: Distributed parallel execution


Traditional databases need indexes to avoid reading everything.

BigQuery does the opposite — distributes data across thousands of machines, scans everything in parallel.


Full table scans are not a problem in BigQuery — that’s the intended design.

Key idea 3: Compute and storage are separate


Traditional warehouse

Storage + compute on same machines

More data → need more compute

More users → everything slows down

BigQuery

Storage and compute are independent

Store petabytes, pay storage only

Query → BQ assigns compute on demand, then releases it

What this means for how you write SQL


| Traditional DB                 | BigQuery                           |
|--------------------------------|------------------------------------|
| Create indexes for speed       | No indexes                         |
| SELECT * is fine on small data | SELECT * wastes money              |
| LIMIT reduces work             | LIMIT doesn’t reduce bytes scanned |
| Optimize for row lookups       | Optimize for column selection      |

How data gets into BigQuery


Data arrives via pipelines (ETL, ELT, CSV uploads). BigQuery enforces schema and types at load time.


But BigQuery doesn’t validate whether the data makes sense — if the source data has problems, BigQuery stores them as-is. Your queries will run fine and return results, but those results can be wrong.

What Can Go Wrong

Data quality issues you already know


From stats and econometrics courses:

  • Missing values (NULLs, blanks)
  • Typos (“Kyiv” vs “Kyv”)
  • Wrong types (numbers as text)
  • Outliers (salary of 999,999,999)


With smaller datasets, you could fix these by opening the file, scrolling through, spotting the problem.

Let’s share experience


What’s the worst data problem you’ve dealt with?

What changes at scale


You can’t scroll through 1.4 million rows


At scale, new categories of problems appear that don’t exist in small datasets.

Small problems get buried


A 0.1% duplicate rate

  • In 1,000 rows: 1 duplicate — easy to spot
  • In 10,000,000 rows: 10,000 extra rows — invisible without a query


A misspelled category

  • In 200 rows: obvious when you scan the column
  • In 50,000 distinct values: buried

Where do problems come from?


From systems

Pipelines, ingestion processes, ETL jobs — automated processes that create errors on their own

From humans at scale

User-generated data: millions of people entering free-text values, each in their own way


Both are different from the small, hand-entered datasets you’ve cleaned before.

System problem: Ingestion duplicates


An ETL pipeline loads orders every night. On a network timeout, the pipeline retries automatically — and the same 500 orders get loaded twice.


  • Every column is identical — no way to tell which copy is “real”
  • Your revenue report double-counts those orders
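One way to measure the damage, sketched against a hypothetical `orders` table: since every column is identical, `SELECT DISTINCT` collapses the copies, and the row-count gap is the number of duplicates.

```sql
-- Compare raw row count with the count after collapsing
-- exact-duplicate rows; the difference is the duplicate count.
SELECT
  (SELECT COUNT(*) FROM orders) AS total_rows,
  (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM orders)) AS distinct_rows
```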

What happened here?


A company’s quarterly revenue report:

| Quarter | Revenue |
|---------|---------|
| 2023 Q1 | $4.2M   |
| 2023 Q2 | $4.5M   |
| 2023 Q3 | $4.8M   |
| 2023 Q4 | $5.1M   |
| 2024 Q1 | $4.3M   |
| 2024 Q2 | $4.1M   |


The CEO asks: “Why is revenue declining?” What data problems could explain this?

Many possible data explanations


| Explanation           | What went wrong                                            |
|-----------------------|------------------------------------------------------------|
| Ingestion duplicates  | 2023 had duplicate rows inflating totals, later fixed      |
| Schema drift          | Column meant gross in 2023, net in 2024 after a migration  |
| Source change         | A data feed was added or dropped between years             |
| Currency mixing       | Some rows switched from USD to local currency mid-2024     |
| Late-arriving refunds | 2024 includes retroactive returns; 2023 doesn’t yet        |


Same symptom, five different root causes — and each needs a different fix.

Example: Schema drift


A column called revenue existed since 2023.

In 2024, after a system migration, it started meaning net revenue (after returns). Before, it was gross.


  • 2023 revenue: gross (higher)
  • 2024 revenue: net (lower)
  • “Revenue dropped 15%” ← wrong
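A simple way to spot drift like this, assuming a table with `order_date` and `revenue` columns (both names are hypothetical): query the metric by month and look for a level shift at the migration date.

```sql
-- A sudden, persistent level shift in one month is a schema-drift
-- signal, not necessarily a real business change.
SELECT
  DATE_TRUNC(order_date, MONTH) AS month,
  AVG(revenue) AS avg_revenue
FROM revenue_table
GROUP BY month
ORDER BY month
```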

System problem: Type coercion at ingestion


CSV files have no type information — everything is text. When you load a CSV, BigQuery’s auto-detection infers column types, but it doesn’t always get them right.

You already experienced this


In practice, one group loaded a CSV where a numeric column contained "NA" values.

BigQuery couldn’t infer it as INT64 or FLOAT64 — so it loaded the entire column as STRING.


SELECT price * 1.2 AS with_vat FROM orders
-- Error: No matching signature for operator *
-- Found: STRING * FLOAT64


Fix: SAFE_CAST(price AS FLOAT64) — converts valid numbers, returns NULL for "NA".

Human problem: User-generated data at scale


Imagine a real estate platform — thousands of agents posting listings.


Each agent enters the address, price, and property details their own way:

  • Addresses: same region spelled differently depending on source language and abbreviation style
  • Prices: some in UAH, some in USD, some in EUR — all in one price column
  • Listing types: hundreds of distinct values mixing Ukrainian, Russian, and English

Human problem: Multi-source conflicts


Same region, same dataset, different conventions:

| Variant           | Source convention     |
|-------------------|-----------------------|
| Киевская область  | Full Russian name     |
| Киевская обл.     | Abbreviated Russian   |
| Київська область  | Full Ukrainian name   |
| Киевская          | Shortened form        |


One GROUP BY produces four rows instead of one. How many listings are you miscounting?

Late-arriving changes


Scenario: E-commerce platform. Order placed and shipped in January.

Customer requests a refund in March. Chargeback processed in April.


  • Your January revenue report was correct — at the time
  • After the refund, January’s numbers change retroactively
  • How far back do you restate?

Empty strings aren’t NULL


-- This finds NULLs
SELECT * FROM orders WHERE city IS NULL
-- Returns 0 rows ✓

-- But...
SELECT * FROM orders WHERE city = ''
-- Returns 2,847 rows!


Empty strings are not NULL. They pass IS NOT NULL checks.

Your completeness report says “no missing cities” — but 2,847 are blank.
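A completeness check that catches both, sketched against the same hypothetical `orders` table:

```sql
-- Count NULLs, empty strings, and whitespace-only values separately;
-- only the first shows up in a naive IS NULL check.
SELECT
  COUNTIF(city IS NULL) AS null_city,
  COUNTIF(city = '') AS empty_city,
  COUNTIF(TRIM(city) = '' AND city != '') AS whitespace_city
FROM orders
```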

Before you analyze anything


Every new dataset gets the same treatment:


  1. Profile — what do we actually have?
  2. Clean — fix what’s broken
  3. Analyze — answer the question


Skipping steps 1 and 2 is how wrong numbers end up in reports.

Profiling with SQL

Profiling = asking structured questions about the data


Six categories of profiling questions:

  1. Shape & completeness — what’s here, what’s missing?
  2. Distributions — what values exist, how often?
  3. Uniqueness & duplicates — is there a primary key?
  4. Types & formats — is the data what it claims to be?
  5. Value ranges & sentinels — are the numbers reasonable?
  6. Cross-column logic — do related columns agree?

1. Shape & completeness


-- How complete is each column?
SELECT
  COUNT(*) AS total,
  COUNTIF(price IS NULL) AS null_price,
  COUNTIF(area_total IS NULL) AS null_area,
  COUNTIF(area_kitchen IS NULL) AS null_kitchen,
  COUNTIF(ceiling_height IS NULL) AS null_ceiling
FROM listings


Not all columns have the same NULL rate — some are 0%, others can be over 90%. A column that’s almost entirely NULL may not be worth analyzing.


One query tells you which columns have gaps — and what kind of gaps.

2. Distributions


Run this on every categorical column:

SELECT
  currency,
  COUNT(*) AS frequency
FROM listings
GROUP BY currency
ORDER BY frequency DESC


What to look for:

  • Long tails — thousands of values appearing once (typos?)
  • Near-duplicates — “UAH” and “uah” and “грн”
  • Unexpected values — a currency of “test” or a blank
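A sketch for surfacing near-duplicates: group by a normalized form and collect the raw spellings behind each group.

```sql
-- Each output row shows one normalized value and all the raw
-- variants that collapse into it ("UAH", "uah", " UAH ", ...).
SELECT
  LOWER(TRIM(currency)) AS normalized,
  ARRAY_AGG(DISTINCT currency) AS raw_variants,
  COUNT(*) AS frequency
FROM listings
GROUP BY normalized
ORDER BY frequency DESC
```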

3. Uniqueness & duplicates


-- First signal: do these match?
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT listing_id) AS unique_listings
FROM listings
-- If total_rows > unique_listings → duplicates exist


-- Find them
SELECT listing_id, COUNT(*) AS copies
FROM listings
GROUP BY listing_id
HAVING COUNT(*) > 1


It gets harder when the same key has different values. Which row is correct?
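A sketch for isolating that harder case, using the same `listing_id` and `price` columns: keys that repeat and disagree.

```sql
-- Duplicated keys whose copies disagree on price: DISTINCT can't
-- help here, because the rows are not identical.
SELECT
  listing_id,
  COUNT(*) AS copies,
  COUNT(DISTINCT price) AS distinct_prices
FROM listings
GROUP BY listing_id
HAVING COUNT(*) > 1 AND COUNT(DISTINCT price) > 1
```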

4. Types & format checks


Check column types in the BigQuery web UI: click the table → Schema tab.


Then verify the data actually matches:

-- How many values fail to convert to numbers?
SELECT
  COUNTIF(SAFE_CAST(price AS FLOAT64) IS NULL
          AND price IS NOT NULL) AS bad_prices
FROM listings


SAFE_CAST returns NULL on failure instead of crashing. Good for diagnosis.

Find the bug


This query should count listings with invalid prices:

SELECT COUNT(*) AS bad_prices
FROM listings
WHERE CAST(price AS FLOAT64) IS NULL
  AND price IS NOT NULL


What’s wrong?

5. Value ranges & sentinels


SELECT MIN(price), MAX(price) FROM listings


What if MIN is 1 and MAX is in the billions?

  • Price = 1 often means “contact us” — the field was required, so the agent typed something
  • Price in the billions — typos, test data, or someone typing random digits
  • Area of 9,999,999,999 m² — a classic “fill with nines” placeholder


These pass IS NOT NULL, > 0, and even basic range checks.
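A sketch for counting sentinel candidates; the cutoffs below are illustrative and need domain judgment, they are not universal constants.

```sql
-- Flag suspicious placeholder values at both ends of the range.
SELECT
  COUNTIF(price <= 1) AS placeholder_prices,      -- "contact us" entries
  COUNTIF(price >= 1e9) AS absurd_prices,         -- typos / test data
  COUNTIF(area_total >= 1e6) AS absurd_areas      -- "fill with nines"
FROM listings
```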

6. Cross-column logic


-- Apartment on floor 8 of a 5-story building?
SELECT COUNT(*) FROM listings
WHERE floor > floor_count


-- Total area smaller than living + kitchen area?
SELECT COUNT(*) FROM listings
WHERE area_total < area_living + area_kitchen


Each column looks reasonable on its own — you only see the problem when you compare them.

What does this number actually mean?


SELECT AVG(price) FROM listings
-- Returns a number. What does it mean?


If the price column contains apartment sales in USD alongside monthly room rentals in UAH, the average is meaningless — it mixes currency, deal type, and property type in one AVG().


Profiling tells you what’s being mixed before you aggregate.
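A sketch of that pre-aggregation check; `deal_type` is a hypothetical column distinguishing sales from rentals.

```sql
-- Break the average down by what it's mixing before trusting it.
SELECT
  currency,
  deal_type,
  COUNT(*) AS n,
  AVG(price) AS avg_price
FROM listings
GROUP BY currency, deal_type
ORDER BY n DESC
```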

You just ran a profiling query. Now what?


| city | listings | avg_price |
|------|----------|-----------|
| Киев |    4,980 |    42,300 |
| Київ |    2,809 |    39,800 |
| киев |      109 |    41,500 |
| NULL |       40 |    35,200 |


One city, three spellings (Russian, Ukrainian, lowercase Russian), plus a NULL row. The GROUP BY splits what should be one group into several.

Cleaning Patterns

Cleaning tools


| Problem                        | Tool                        |
|--------------------------------|-----------------------------|
| Strings that should be numbers | CAST() / SAFE_CAST()        |
| Inconsistent categories        | CASE WHEN, LOWER(), TRIM()  |
| Missing values                 | COALESCE(), IFNULL()        |
| Duplicates                     | DISTINCT or ROW_NUMBER()    |
| Sentinel values (price = 1)    | WHERE + domain logic        |
| Cross-column conflicts         | WHERE floor > floor_count   |

SAFE_CAST: one bad row shouldn’t crash your query


-- CAST fails on ANY invalid value
SELECT CAST(price AS FLOAT64) FROM listings  -- Error on "NA"

-- SAFE_CAST returns NULL for invalid values
SELECT SAFE_CAST(price AS FLOAT64) FROM listings  -- Works


Remember the "NA" problem from practice? This is the fix.

CASE WHEN: standardize inconsistent values


SELECT
  CASE
    WHEN LOWER(TRIM(city)) IN ('kyiv', 'kiev', 'київ') THEN 'Kyiv'
    WHEN LOWER(TRIM(city)) IN ('lviv', 'львів', 'lwów') THEN 'Lviv'
    ELSE city
  END AS city_clean
FROM listings


LOWER() + TRIM() handles case and whitespace. CASE WHEN maps all the variants to one standard spelling.

COALESCE: pick the first non-NULL value


SELECT
  COALESCE(area_total, area_living + area_kitchen, 0) AS area
FROM listings


Tries each argument left to right. Returns the first non-NULL.


What to do with missing values:

  • Fill → sensible default exists
  • Filter → row is useless without the value
  • Flag → keep, but mark as incomplete
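The flag option is one line in BigQuery: a boolean expression can be selected directly as a new column.

```sql
-- Keep every row, but mark which ones have the gap so downstream
-- queries can filter or report on them explicitly.
SELECT
  *,
  area_total IS NULL AS area_missing
FROM listings
```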

Cleaning with CTEs


-- Each CTE handles one cleaning concern
WITH fix_types AS (
  SELECT *, SAFE_CAST(price AS FLOAT64) AS price_num FROM raw_listings
),
standardize AS (
  SELECT *,
    CASE WHEN LOWER(TRIM(city)) IN ('kyiv', 'kiev') THEN 'Kyiv'
         ELSE city END AS city_clean
  FROM fix_types
),
deduplicated AS (
  SELECT * EXCEPT(row_num) FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY listing_id ORDER BY insert_time DESC
    ) AS row_num
    FROM standardize
  ) WHERE row_num = 1
)
SELECT * FROM deduplicated
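A pipeline like this can be persisted as a view (as you’ll do in practice) by wrapping it in CREATE OR REPLACE VIEW; the dataset and view names below are placeholders.

```sql
-- Save the cleaned result as a view in your personal dataset;
-- the view re-runs the cleaning logic on every query.
CREATE OR REPLACE VIEW my_dataset.listings_clean AS
WITH fix_types AS (
  SELECT *, SAFE_CAST(price AS FLOAT64) AS price_num FROM raw_listings
)
SELECT * FROM fix_types
```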

Wrap-up

What we covered


BigQuery architecture:

  • Columnar storage → SELECT only what you need
  • Parallel execution → no indexes, full scans by design
  • Separate compute and storage → serverless, scales on demand


Data quality at scale:

  • Problems from systems (ingestion, schema drift) and from humans at scale (user-generated data)
  • You profile with SQL, not your eyes
  • Profile → Clean → Analyze

Practice sessions


You’ll work with a real dataset: ~1.4 million real estate listings from Flatfy.

User-generated data with real quality issues — your task is to profile and clean it.


Practice 1: Profile — what’s actually in this data?

Practice 2: Clean — build a CTE pipeline, save a cleaned view

Reminders


  • Assignment 1 deadline this week (8+2 pts)
  • Readings: BQ DG Ch3 (data types, functions), SQL DA Ch2 (data preparation)
  • Next week: Aggregations, business metrics, and NULL handling

Questions?


o_omelchenko@kse.org.ua