Under the Hood & What Can Go Wrong

ECON250 - Big Data Analytics | Week 2

Oleh Omelchenko

2026-02-10

Recap

What happens when the data itself is wrong?


Last week: connected to BigQuery, ran queries, saw bytes scanned, created personal datasets, explored partitioning.


Today: how BigQuery works under the hood, and what goes wrong when data isn’t clean.

Key idea 1: Columnar storage


Row storage (traditional DB)

[Anna, 25, Kyiv, 45000]
[Boris, 31, Lviv, 52000]
[Clara, 28, Odesa, 48000]

To compute AVG(salary) — must read every column of every row

Column storage (BigQuery)

name:   [Anna, Boris, Clara]
age:    [25, 31, 28]
city:   [Kyiv, Lviv, Odesa]
salary: [45000, 52000, 48000]

To compute AVG(salary) — reads only the salary column

You already saw this


-- Scans ALL columns
SELECT * FROM `bigquery-public-data.thelook_ecommerce.order_items`
-- ~13 MB

-- Scans only 3 columns
SELECT order_id, sale_price, status
FROM `bigquery-public-data.thelook_ecommerce.order_items`
-- ~4.5 MB


With columnar storage, you only pay for the columns you read.

This is also why LIMIT doesn’t save money. BigQuery reads full columns first, then truncates.
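You can check this yourself in the query validator: the bytes-scanned estimate should be identical with and without LIMIT. A sketch using the same public table as above:

```sql
-- LIMIT truncates the result, not the scan: both queries read the
-- full sale_price column, so the bytes-scanned estimate is the same.
SELECT sale_price
FROM `bigquery-public-data.thelook_ecommerce.order_items`;

SELECT sale_price
FROM `bigquery-public-data.thelook_ecommerce.order_items`
LIMIT 10;
```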

Key idea 2: Distributed parallel execution


Traditional databases need indexes to avoid reading everything.

BigQuery does the opposite — distributes data across thousands of machines, scans everything in parallel.


Full table scans are not a problem in BigQuery — that’s the intended design.

Key idea 3: Compute and storage are separate


Traditional warehouse

Storage + compute on same machines

More data → need more compute

More users → everything slows down

BigQuery

Storage and compute are independent

Store petabytes, pay storage only

Query → BQ assigns compute on demand, then releases it

What this means for how you write SQL


| Traditional DB                 | BigQuery                           |
|--------------------------------|------------------------------------|
| Create indexes for speed       | No indexes                         |
| SELECT * is fine on small data | SELECT * wastes money              |
| LIMIT reduces work             | LIMIT doesn’t reduce bytes scanned |
| Optimize for row lookups       | Optimize for column selection      |

How data gets into BigQuery


Data arrives via pipelines (ETL, ELT, CSV uploads). BigQuery enforces schema and types at load time.


But BigQuery doesn’t validate whether the data makes sense — if the source data has problems, BigQuery stores them as-is. Your queries will run fine and return results, but those results can be wrong.

What Can Go Wrong

Data quality issues you already know


From stats and econometrics courses:

  • Missing values (NULLs, blanks)
  • Typos (“Kyiv” vs “Kyv”)
  • Wrong types (numbers as text)
  • Outliers (salary of 999,999,999)


With smaller datasets, you could fix these by opening the file, scrolling through, spotting the problem.

Let’s share experience


What’s the worst data problem you’ve dealt with?

What changes at scale


You can’t scroll through 1.4 million rows


At scale, new categories of problems appear that don’t exist in small datasets.

Small problems get buried


A 0.1% duplicate rate

  • In 1,000 rows: 1 duplicate — easy to spot
  • In 10,000,000 rows: 10,000 extra rows — invisible without a query


A misspelled category

  • In 200 rows: obvious when you scan the column
  • In 50,000 distinct values: buried

Where do problems come from?


From systems

Pipelines, ingestion processes, ETL jobs — automated processes that create errors on their own

From humans at scale

User-generated data: millions of people entering free-text values, each in their own way


Both are different from the small, hand-entered datasets you’ve cleaned before.

System problem: Ingestion duplicates


An ETL pipeline loads orders every night. On a network timeout, the pipeline retries automatically — and the same 500 orders get loaded twice.


  • Every column is identical — no way to tell which copy is “real”
  • Your revenue report double-counts those orders
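One way to measure the damage, sketched against a hypothetical `orders` table: since every column is identical, `SELECT DISTINCT` collapses the copies, and the row-count gap is the number of duplicates.

```sql
-- Compare raw row count with the count after collapsing
-- exact-duplicate rows; the difference is the duplicate count.
SELECT
  (SELECT COUNT(*) FROM orders) AS total_rows,
  (SELECT COUNT(*) FROM (SELECT DISTINCT * FROM orders)) AS distinct_rows
```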

What happened here?


A company’s quarterly revenue report:

| Quarter | Revenue |
|---------|---------|
| 2023 Q1 | $4.2M   |
| 2023 Q2 | $4.5M   |
| 2023 Q3 | $4.8M   |
| 2023 Q4 | $5.1M   |
| 2024 Q1 | $4.3M   |
| 2024 Q2 | $4.1M   |


The CEO asks: “Why is revenue declining?” What data problems could explain this?

Many possible data explanations


| Explanation           | What went wrong                                            |
|-----------------------|------------------------------------------------------------|
| Ingestion duplicates  | 2023 had duplicate rows inflating totals, later fixed      |
| Schema drift          | Column meant gross in 2023, net in 2024 after a migration  |
| Source change         | A data feed was added or dropped between years             |
| Currency mixing       | Some rows switched from USD to local currency mid-2024     |
| Late-arriving refunds | 2024 includes retroactive returns; 2023 doesn’t yet        |


Same symptom, five different root causes — and each needs a different fix.

Example: Schema drift


A column called revenue existed since 2023.

In 2024, after a system migration, it started meaning net revenue (after returns). Before, it was gross.


  • 2023 revenue: gross (higher)
  • 2024 revenue: net (lower)
  • “Revenue dropped 15%” ← wrong
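A simple way to spot drift like this, assuming a table with `order_date` and `revenue` columns (both names are hypothetical): query the metric by month and look for a level shift at the migration date.

```sql
-- A sudden, persistent level shift in one month is a schema-drift
-- signal, not necessarily a real business change.
SELECT
  DATE_TRUNC(order_date, MONTH) AS month,
  AVG(revenue) AS avg_revenue
FROM revenue_table
GROUP BY month
ORDER BY month
```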

System problem: Type coercion at ingestion


CSV files have no type information — everything is text. When you load a CSV, BigQuery’s auto-detection infers column types, but it doesn’t always get them right.

You already experienced this


In practice, one group loaded a CSV where a numeric column contained "NA" values.

BigQuery couldn’t infer it as INT64 or FLOAT64 — so it loaded the entire column as STRING.


SELECT price * 1.2 AS with_vat FROM orders
-- Error: No matching signature for operator *
-- Found: STRING * FLOAT64


Fix: SAFE_CAST(price AS FLOAT64) — converts valid numbers, returns NULL for "NA".

Human problem: User-generated data at scale


Imagine a real estate platform — thousands of agents posting listings.


Each agent enters the address, price, and property details their own way:

  • Addresses: same region spelled differently depending on source language and abbreviation style
  • Prices: some in UAH, some in USD, some in EUR — all in one price column
  • Listing types: hundreds of distinct values mixing Ukrainian, Russian, and English

Human problem: Multi-source conflicts


Same region, same dataset, different conventions:

| Variant           | Source convention     |
|-------------------|-----------------------|
| Киевская область  | Full Russian name     |
| Киевская обл.     | Abbreviated Russian   |
| Київська область  | Full Ukrainian name   |
| Киевская          | Shortened form        |


One GROUP BY produces four rows instead of one. How many listings are you miscounting?

Late-arriving changes


Scenario: E-commerce platform. Order placed and shipped in January.

Customer requests a refund in March. Chargeback processed in April.


  • Your January revenue report was correct — at the time
  • After the refund, January’s numbers change retroactively
  • How far back do you restate?

Empty strings aren’t NULL


-- This finds NULLs
SELECT * FROM orders WHERE city IS NULL
-- Returns 0 rows ✓

-- But...
SELECT * FROM orders WHERE city = ''
-- Returns 2,847 rows!


Empty strings are not NULL. They pass IS NOT NULL checks.

Your completeness report says “no missing cities” — but 2,847 are blank.
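A completeness check that catches both, sketched against the same hypothetical `orders` table:

```sql
-- Count NULLs, empty strings, and whitespace-only values separately;
-- only the first shows up in a naive IS NULL check.
SELECT
  COUNTIF(city IS NULL) AS null_city,
  COUNTIF(city = '') AS empty_city,
  COUNTIF(TRIM(city) = '' AND city != '') AS whitespace_city
FROM orders
```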

Before you analyze anything


Every new dataset gets the same treatment:


  1. Profile — what do we actually have?
  2. Clean — fix what’s broken
  3. Analyze — answer the question


Skipping steps 1 and 2 is how wrong numbers end up in reports.

Profiling with SQL

Profiling = asking structured questions about the data


Six categories of profiling questions:

  1. Shape & completeness — what’s here, what’s missing?
  2. Distributions — what values exist, how often?
  3. Uniqueness & duplicates — is there a primary key?
  4. Types & formats — is the data what it claims to be?
  5. Value ranges & sentinels — are the numbers reasonable?
  6. Cross-column logic — do related columns agree?

1. Shape & completeness


-- How complete is each column?
SELECT
  COUNT(*) AS total,
  COUNTIF(price IS NULL) AS null_price,
  COUNTIF(area_total IS NULL) AS null_area,
  COUNTIF(area_kitchen IS NULL) AS null_kitchen,
  COUNTIF(ceiling_height IS NULL) AS null_ceiling
FROM listings


Not all columns have the same NULL rate — some are 0%, others can be over 90%. A column that’s almost entirely NULL may not be worth analyzing.


One query tells you which columns have gaps — and what kind of gaps.

2. Distributions


Run this on every categorical column:

SELECT
  currency,
  COUNT(*) AS frequency
FROM listings
GROUP BY currency
ORDER BY frequency DESC


What to look for:

  • Long tails — thousands of values appearing once (typos?)
  • Near-duplicates — “UAH” and “uah” and “грн”
  • Unexpected values — a currency of “test” or a blank
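A sketch for surfacing near-duplicates: group by a normalized form and collect the raw spellings behind each group.

```sql
-- Each output row shows one normalized value and all the raw
-- variants that collapse into it ("UAH", "uah", " UAH ", ...).
SELECT
  LOWER(TRIM(currency)) AS normalized,
  ARRAY_AGG(DISTINCT currency) AS raw_variants,
  COUNT(*) AS frequency
FROM listings
GROUP BY normalized
ORDER BY frequency DESC
```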

3. Uniqueness & duplicates


-- First signal: do these match?
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT listing_id) AS unique_listings
FROM listings
-- If total_rows > unique_listings → duplicates exist


-- Find them
SELECT listing_id, COUNT(*) AS copies
FROM listings
GROUP BY listing_id
HAVING COUNT(*) > 1


It gets harder when the same key has different values. Which row is correct?
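A sketch for isolating that harder case, using the same `listing_id` and `price` columns: keys that repeat and disagree.

```sql
-- Duplicated keys whose copies disagree on price: DISTINCT can't
-- help here, because the rows are not identical.
SELECT
  listing_id,
  COUNT(*) AS copies,
  COUNT(DISTINCT price) AS distinct_prices
FROM listings
GROUP BY listing_id
HAVING COUNT(*) > 1 AND COUNT(DISTINCT price) > 1
```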

4. Types & format checks


Check column types in the BigQuery web UI: click the table → Schema tab.


Then verify the data actually matches:

-- How many values fail to convert to numbers?
SELECT
  COUNTIF(SAFE_CAST(price AS FLOAT64) IS NULL
          AND price IS NOT NULL) AS bad_prices
FROM listings


SAFE_CAST returns NULL on failure instead of crashing. Good for diagnosis.

Find the bug


This query should count listings with invalid prices:

SELECT COUNT(*) AS bad_prices
FROM listings
WHERE CAST(price AS FLOAT64) IS NULL
  AND price IS NOT NULL


What’s wrong?

5. Value ranges & sentinels


SELECT MIN(price), MAX(price) FROM listings


What if MIN is 1 and MAX is in the billions?

  • Price = 1 often means “contact us” — the field was required, so the agent typed something
  • Price in the billions — typos, test data, or someone typing random digits
  • Area of 9,999,999,999 m² — a classic “fill with nines” placeholder


These pass IS NOT NULL, > 0, and even basic range checks.
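A sketch for counting sentinel candidates; the cutoffs below are illustrative and need domain judgment, they are not universal constants.

```sql
-- Flag suspicious placeholder values at both ends of the range.
SELECT
  COUNTIF(price <= 1) AS placeholder_prices,      -- "contact us" entries
  COUNTIF(price >= 1e9) AS absurd_prices,         -- typos / test data
  COUNTIF(area_total >= 1e6) AS absurd_areas      -- "fill with nines"
FROM listings
```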

6. Cross-column logic


-- Apartment on floor 8 of a 5-story building?
SELECT COUNT(*) FROM listings
WHERE floor > floor_count


-- Total area smaller than living + kitchen area?
SELECT COUNT(*) FROM listings
WHERE area_total < area_living + area_kitchen


Each column looks reasonable on its own — you only see the problem when you compare them.

What does this number actually mean?


SELECT AVG(price) FROM listings
-- Returns a number. What does it mean?


If the price column contains apartment sales in USD alongside monthly room rentals in UAH, the average is meaningless — it mixes currency, deal type, and property type in one AVG().


Profiling tells you what’s being mixed before you aggregate.
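A sketch of that pre-aggregation check; `deal_type` is a hypothetical column distinguishing sales from rentals.

```sql
-- Break the average down by what it's mixing before trusting it.
SELECT
  currency,
  deal_type,
  COUNT(*) AS n,
  AVG(price) AS avg_price
FROM listings
GROUP BY currency, deal_type
ORDER BY n DESC
```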

You just ran a profiling query. Now what?


| city | listings | avg_price |
|------|----------|-----------|
| Киев |    4,980 |    42,300 |
| Київ |    2,809 |    39,800 |
| киев |      109 |    41,500 |
| NULL |       40 |    35,200 |


One city, three spellings (Russian, Ukrainian, lowercase Russian), plus a NULL row. The GROUP BY splits what should be one group into several.

Cleaning Patterns

Cleaning tools


| Problem                        | Tool                        |
|--------------------------------|-----------------------------|
| Strings that should be numbers | CAST() / SAFE_CAST()        |
| Inconsistent categories        | CASE WHEN, LOWER(), TRIM()  |
| Missing values                 | COALESCE(), IFNULL()        |
| Duplicates                     | DISTINCT or ROW_NUMBER()    |
| Sentinel values (price = 1)    | WHERE + domain logic        |
| Cross-column conflicts         | WHERE floor > floor_count   |

SAFE_CAST: one bad row shouldn’t crash your query


-- CAST fails on ANY invalid value
SELECT CAST(price AS FLOAT64) FROM listings  -- Error on "NA"

-- SAFE_CAST returns NULL for invalid values
SELECT SAFE_CAST(price AS FLOAT64) FROM listings  -- Works


Remember the "NA" problem from practice? This is the fix.

CASE WHEN: standardize inconsistent values


SELECT
  CASE
    WHEN LOWER(TRIM(city)) IN ('kyiv', 'kiev', 'київ') THEN 'Kyiv'
    WHEN LOWER(TRIM(city)) IN ('lviv', 'львів', 'lwów') THEN 'Lviv'
    ELSE city
  END AS city_clean
FROM listings


LOWER() + TRIM() handles case and whitespace. CASE WHEN maps all the variants to one standard spelling.

COALESCE: pick the first non-NULL value


SELECT
  COALESCE(area_total, area_living + area_kitchen, 0) AS area
FROM listings


Tries each argument left to right. Returns the first non-NULL.


What to do with missing values:

  • Fill → sensible default exists
  • Filter → row is useless without the value
  • Flag → keep, but mark as incomplete
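The flag option is one line in BigQuery: a boolean expression can be selected directly as a new column.

```sql
-- Keep every row, but mark which ones have the gap so downstream
-- queries can filter or report on them explicitly.
SELECT
  *,
  area_total IS NULL AS area_missing
FROM listings
```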

Cleaning with CTEs


-- Each CTE handles one cleaning concern
WITH fix_types AS (
  SELECT *, SAFE_CAST(price AS FLOAT64) AS price_num FROM raw_listings
),
standardize AS (
  SELECT *,
    CASE WHEN LOWER(TRIM(city)) IN ('kyiv', 'kiev') THEN 'Kyiv'
         ELSE city END AS city_clean
  FROM fix_types
),
deduplicated AS (
  SELECT * EXCEPT(row_num) FROM (
    SELECT *, ROW_NUMBER() OVER (
      PARTITION BY listing_id ORDER BY insert_time DESC
    ) AS row_num
    FROM standardize
  ) WHERE row_num = 1
)
SELECT * FROM deduplicated
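A pipeline like this can be persisted as a view (as you’ll do in practice) by wrapping it in CREATE OR REPLACE VIEW; the dataset and view names below are placeholders.

```sql
-- Save the cleaned result as a view in your personal dataset;
-- the view re-runs the cleaning logic on every query.
CREATE OR REPLACE VIEW my_dataset.listings_clean AS
WITH fix_types AS (
  SELECT *, SAFE_CAST(price AS FLOAT64) AS price_num FROM raw_listings
)
SELECT * FROM fix_types
```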

Wrap-up

What we covered


BigQuery architecture:

  • Columnar storage → SELECT only what you need
  • Parallel execution → no indexes, full scans by design
  • Separate compute and storage → serverless, scales on demand


Data quality at scale:

  • Problems from systems (ingestion, schema drift) and from humans at scale (user-generated data)
  • You profile with SQL, not your eyes
  • Profile → Clean → Analyze

Practice sessions


You’ll work with a real dataset: ~1.4 million real estate listings from Flatfy.

User-generated data with real quality issues — your task is to profile and clean it.


Practice 1: Profile — what’s actually in this data?

Practice 2: Clean — build a CTE pipeline, save a cleaned view

Reminders


  • Assignment 1 deadline this week (8+2 pts)
  • Readings: BQ DG Ch3 (data types, functions), SQL DA Ch2 (data preparation)
  • Next week: Aggregations, business metrics, and NULL handling

Questions?


o_omelchenko@kse.org.ua