Key idea 1: Columnar storage
BigQuery stores each column separately. To compute AVG(salary), it reads only the salary column.
You already saw this:
```sql
-- Scans ALL columns (~13 MB)
SELECT *
FROM `bigquery-public-data.thelook_ecommerce.order_items`

-- Scans only 3 columns (~4.5 MB)
SELECT order_id, sale_price, status
FROM `bigquery-public-data.thelook_ecommerce.order_items`
```
With columnar storage, you only pay for the columns you read.
This is also why LIMIT doesn’t save money. BigQuery reads full columns first, then truncates.
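A minimal illustration of this point, using a hypothetical `employees` table:

```sql
-- Scans the entire salary column, then returns 10 rows.
-- Bytes billed are the same as without LIMIT.
SELECT salary
FROM employees  -- hypothetical table, for illustration only
LIMIT 10
```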
Key idea 2: Distributed parallel execution
Traditional databases need indexes to avoid reading everything.
BigQuery does the opposite — distributes data across thousands of machines, scans everything in parallel.
Full table scans are not a problem in BigQuery — that’s the intended design.
Key idea 3: Compute and storage are separate
Traditional warehouse
Storage + compute on same machines
More data → need more compute
More users → everything slows down
BigQuery
Storage and compute are independent
Store petabytes, pay storage only
Query → BQ assigns compute on demand, then releases it
What this means for how you write SQL
| Traditional DB | BigQuery |
|---|---|
| Create indexes for speed | No indexes |
| SELECT * is fine on small data | SELECT * wastes money |
| LIMIT reduces work | LIMIT doesn’t reduce bytes scanned |
| Optimize for row lookups | Optimize for column selection |
How data gets into BigQuery
Data arrives via pipelines (ETL, ELT, CSV uploads). BigQuery enforces schema and types at load time.
But BigQuery doesn’t validate whether the data makes sense — if the source data has problems, BigQuery stores them as-is. Your queries will run fine and return results, but those results can be wrong.
What Can Go Wrong
Data quality issues you already know
From stats and econometrics courses:
Missing values (NULLs, blanks)
Typos (“Kyiv” vs “Kyv”)
Wrong types (numbers as text)
Outliers (salary of 999,999,999)
With smaller datasets, you could fix these by opening the file, scrolling through, spotting the problem.
Let’s share experience
What’s the worst data problem you’ve dealt with?
What changes at scale
You can’t scroll through 1.4 million rows
At scale, new categories of problems appear that don’t exist in small datasets.
Small problems get buried
A 0.1% duplicate rate
In 1,000 rows: 1 duplicate — easy to spot
In 10,000,000 rows: 10,000 extra rows — invisible without a query
A misspelled category
In 200 rows: obvious when you scan the column
In 50,000 distinct values: buried
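The duplicate rate above is invisible to the eye but trivial to measure with one query. A sketch, assuming an `orders` table keyed by `order_id`:

```sql
-- Duplicate rate as a percentage of all rows
-- (orders and order_id are illustrative names)
SELECT
  COUNT(*) - COUNT(DISTINCT order_id) AS duplicate_rows,
  ROUND(100 * (COUNT(*) - COUNT(DISTINCT order_id)) / COUNT(*), 2) AS pct_duplicates
FROM orders
```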
Where do problems come from?
From systems
Pipelines, ingestion processes, ETL jobs — automated processes that create errors on their own
From humans at scale
User-generated data: millions of people entering free-text values, each in their own way
Both are different from the small, hand-entered datasets you’ve cleaned before.
System problem: Ingestion duplicates
An ETL pipeline loads orders every night. On a network timeout, the pipeline retries automatically — and the same 500 orders get loaded twice.
Every column is identical — no way to tell which copy is “real”
Your revenue report double-counts those orders
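When every column is identical, exact copies can still be surfaced by grouping on all columns. A sketch with a hypothetical `orders` table and column list:

```sql
-- Rows that are exact copies of another row
-- (column list is illustrative; group by every column in the table)
SELECT order_id, order_date, amount, COUNT(*) AS copies
FROM orders
GROUP BY order_id, order_date, amount
HAVING COUNT(*) > 1
```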
What happened here?
A company’s quarterly revenue report:
| Quarter | Revenue |
|---|---|
| 2023 Q1 | $4.2M |
| 2023 Q2 | $4.5M |
| 2023 Q3 | $4.8M |
| 2023 Q4 | $5.1M |
| 2024 Q1 | $4.3M |
| 2024 Q2 | $4.1M |
The CEO asks: “Why is revenue declining?” What data problems could explain this?
Many possible data explanations
| Explanation | What went wrong |
|---|---|
| Ingestion duplicates | 2023 had duplicate rows inflating totals, later fixed |
| Schema drift | Column meant gross in 2023, net in 2024 after a migration |
| Source change | A data feed was added or dropped between years |
| Currency mixing | Some rows switched from USD to local currency mid-2024 |
| Late-arriving refunds | 2024 includes retroactive returns; 2023 doesn’t yet |
Same symptom, five different root causes — and each needs a different fix.
Example: Schema drift
A column called revenue existed since 2023.
In 2024, after a system migration, it started meaning net revenue (after returns). Before, it was gross.
2023 revenue: gross (higher)
2024 revenue: net (lower)
“Revenue dropped 15%” ← wrong
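One way to spot drift like this is to look for a sudden, sustained level shift around a known migration date. A sketch, assuming an `orders` table with `order_date` and `revenue` columns:

```sql
-- Monthly averages: a step change at the migration month
-- is a schema-drift signal, not a business trend
SELECT
  DATE_TRUNC(order_date, MONTH) AS month,
  AVG(revenue) AS avg_revenue
FROM orders  -- hypothetical table
GROUP BY month
ORDER BY month
```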
System problem: Type coercion at ingestion
CSV files have no type information — everything is text. When you load a CSV, BigQuery’s auto-detection infers column types, but it doesn’t always get them right.
You already experienced this
In a practice session, one group loaded a CSV where a numeric column contained "NA" values.
BigQuery couldn’t infer it as INT64 or FLOAT64 — so it loaded the entire column as STRING.
```sql
SELECT price * 1.2 AS with_vat FROM orders
-- Error: No matching signature for operator *
-- Found: STRING * FLOAT64
```
Fix: SAFE_CAST(price AS FLOAT64) — converts valid numbers, returns NULL for "NA".
Human problem: User-generated data at scale
Imagine a real estate platform — thousands of agents posting listings.
Each agent enters the address, price, and property details their own way:
Addresses: same region spelled differently depending on source language and abbreviation style
Prices: some in UAH, some in USD, some in EUR — all in one price column
Listing types: hundreds of distinct values mixing Ukrainian, Russian, and English
Human problem: Multi-source conflicts
Same region, same dataset, different conventions:
| Variant | Source convention |
|---|---|
| Киевская область | Full Russian name |
| Киевская обл. | Abbreviated Russian |
| Київська область | Full Ukrainian name |
| Киевская | Shortened form |
One GROUP BY produces four rows instead of one. How many listings are you miscounting?
Late-arriving changes
Scenario: E-commerce platform. Order placed and shipped in January.
Customer requests a refund in March. Chargeback processed in April.
Your January revenue report was correct — at the time
After the refund, January’s numbers change retroactively
How far back do you restate?
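One common approach is to keep refunds in their own table and recompute past months net of later refunds. A sketch, assuming hypothetical `orders` and `refunds` tables joined on `order_id`:

```sql
-- January revenue restated net of refunds recorded later
-- (orders, refunds, and all column names are illustrative)
WITH refund_totals AS (
  SELECT order_id, SUM(amount) AS refunded
  FROM refunds
  GROUP BY order_id
)
SELECT SUM(o.amount - COALESCE(r.refunded, 0)) AS jan_net_revenue
FROM orders AS o
LEFT JOIN refund_totals AS r USING (order_id)
WHERE DATE_TRUNC(o.order_date, MONTH) = DATE '2024-01-01'
```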
Empty strings aren’t NULL
```sql
-- This finds NULLs
SELECT * FROM orders WHERE city IS NULL
-- Returns 0 rows ✓

-- But...
SELECT * FROM orders WHERE city = ''
-- Returns 2,847 rows!
```
Empty strings are not NULL. They pass IS NOT NULL checks.
Your completeness report says “no missing cities” — but 2,847 are blank.
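A completeness check that counts both cases separately, sketched on the same `orders` table:

```sql
-- NULLs and blank-but-not-NULL values are different problems
SELECT
  COUNTIF(city IS NULL) AS null_city,
  COUNTIF(city IS NOT NULL AND TRIM(city) = '') AS blank_city
FROM orders
```

TRIM also catches values that are only whitespace, which `city = ''` alone would miss.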
Before you analyze anything
Every new dataset gets the same treatment:
Profile — what do we actually have?
Clean — fix what’s broken
Analyze — answer the question
Skipping steps 1 and 2 is how wrong numbers end up in reports.
Profiling with SQL
Profiling = asking structured questions about the data
Types & formats — is the data what it claims to be?
Value ranges & sentinels — are the numbers reasonable?
Cross-column logic — do related columns agree?
1. Shape & completeness
```sql
-- How complete is each column?
SELECT
  COUNT(*) AS total,
  COUNTIF(price IS NULL) AS null_price,
  COUNTIF(area_total IS NULL) AS null_area,
  COUNTIF(area_kitchen IS NULL) AS null_kitchen,
  COUNTIF(ceiling_height IS NULL) AS null_ceiling
FROM listings
```
Not all columns have the same NULL rate — some are 0%, others can be over 90%. A column that’s almost entirely NULL may not be worth analyzing.
One query tells you which columns have gaps — and what kind of gaps.
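The same pattern also works as a rate, which is easier to compare across columns. A sketch on the same `listings` table:

```sql
-- NULL rate as a percentage; near-100% columns
-- are usually not worth analyzing
SELECT
  ROUND(100 * COUNTIF(ceiling_height IS NULL) / COUNT(*), 1) AS pct_null_ceiling
FROM listings
```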
2. Distributions
Run this on every categorical column:
```sql
SELECT currency, COUNT(*) AS frequency
FROM listings
GROUP BY currency
ORDER BY frequency DESC
```
What to look for:
Long tails — thousands of values appearing once (typos?)
Near-duplicates — “UAH” and “uah” and “грн”
Unexpected values — a currency of “test” or a blank
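Near-duplicates that differ only by case or whitespace can be surfaced directly. A sketch on the same `listings` table:

```sql
-- Normalized values that map to more than one raw spelling
SELECT
  LOWER(TRIM(currency)) AS normalized,
  COUNT(DISTINCT currency) AS variant_count,
  COUNT(*) AS rows_affected
FROM listings
GROUP BY normalized
HAVING COUNT(DISTINCT currency) > 1
```

This won’t catch cross-language variants like “UAH” vs “грн”; those still need a manual mapping.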
3. Uniqueness & duplicates
```sql
-- First signal: do these match?
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT listing_id) AS unique_listings
FROM listings
-- If total_rows > unique_listings → duplicates exist
```

```sql
-- Find them
SELECT listing_id, COUNT(*) AS copies
FROM listings
GROUP BY listing_id
HAVING COUNT(*) > 1
```
It gets harder when the same key has different values. Which row is correct?
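These conflicting duplicates can be separated from exact copies. A sketch, using `price` as the column to compare:

```sql
-- Duplicated keys whose other columns disagree
SELECT
  listing_id,
  COUNT(*) AS copies,
  COUNT(DISTINCT price) AS distinct_prices
FROM listings
GROUP BY listing_id
HAVING COUNT(*) > 1 AND COUNT(DISTINCT price) > 1
```

Exact copies can be dropped safely; conflicting ones need a rule, such as keeping the most recent row.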
4. Types & format checks
Check column types in the BigQuery web UI: click the table → Schema tab.
Then verify the data actually matches:
```sql
-- How many values fail to convert to numbers?
SELECT COUNTIF(SAFE_CAST(price AS FLOAT64) IS NULL
               AND price IS NOT NULL) AS bad_prices
FROM listings
```
SAFE_CAST returns NULL on failure instead of crashing. Good for diagnosis.
Find the bug
This query should count listings with invalid prices:
```sql
SELECT COUNT(*) AS bad_prices
FROM listings
WHERE CAST(price AS FLOAT64) IS NULL
  AND price IS NOT NULL
```
What’s wrong?
5. Value ranges & sentinels
```sql
SELECT MIN(price), MAX(price) FROM listings
```
What if MIN is 1 and MAX is in the billions?
Price = 1 often means “contact us” — the field was required, so the agent typed something
Price in the billions — typos, test data, or someone typing random digits
Area of 9,999,999,999 m² — a classic “fill with nines” placeholder
These pass IS NOT NULL, > 0, and even basic range checks.
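Counting suspected sentinels takes one query. The thresholds below are illustrative and must be tuned to the domain:

```sql
-- Sentinel suspects; SAFE_CAST handles price stored as STRING
SELECT
  COUNTIF(SAFE_CAST(price AS FLOAT64) <= 1) AS placeholder_low,     -- "contact us" prices
  COUNTIF(SAFE_CAST(price AS FLOAT64) >= 1e9) AS placeholder_high,  -- typos, test data
  COUNTIF(area_total >= 9999999) AS filled_with_nines               -- nines placeholder
FROM listings
```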
6. Cross-column logic
```sql
-- Apartment on floor 8 of a 5-story building?
SELECT COUNT(*) FROM listings
WHERE floor > floor_count
```

```sql
-- Total area smaller than living + kitchen area?
SELECT COUNT(*) FROM listings
WHERE area_total < area_living + area_kitchen
```
Each column looks reasonable on its own — you only see the problem when you compare them.
What does this number actually mean?
```sql
SELECT AVG(price) FROM listings
-- Returns a number. What does it mean?
```
If the price column contains apartment sales in USD alongside monthly room rentals in UAH, the average is meaningless — it mixes currency, deal type, and property type in one AVG().
Profiling tells you what’s being mixed before you aggregate.
Same city, three rows — Russian, Ukrainian, lowercase Russian. The GROUP BY splits what should be one group into three.
Cleaning Patterns
Cleaning tools
Problem
Tool
Strings that should be numbers
CAST() / SAFE_CAST()
Inconsistent categories
CASE WHEN, LOWER(), TRIM()
Missing values
COALESCE(), IFNULL()
Duplicates
DISTINCT or ROW_NUMBER()
Sentinel values (price = 1)
WHERE + domain logic
Cross-column conflicts
WHERE floor > floor_count
SAFE_CAST: one bad row shouldn’t crash your query
```sql
-- CAST fails on ANY invalid value
SELECT CAST(price AS FLOAT64) FROM listings       -- Error on "NA"

-- SAFE_CAST returns NULL for invalid values
SELECT SAFE_CAST(price AS FLOAT64) FROM listings  -- Works
```
Remember the "NA" problem from practice? This is the fix.
CASE WHEN: standardize inconsistent values
```sql
SELECT
  CASE
    WHEN LOWER(TRIM(city)) IN ('kyiv', 'kiev', 'київ') THEN 'Kyiv'
    WHEN LOWER(TRIM(city)) IN ('lviv', 'львів', 'lwów') THEN 'Lviv'
    ELSE city
  END AS city_clean
FROM listings
```
LOWER() + TRIM() handles case and whitespace. CASE WHEN maps all the variants to one standard spelling.
COALESCE: pick the first non-NULL value
```sql
SELECT COALESCE(area_total, area_living + area_kitchen, 0) AS area
FROM listings
```
Tries each argument left to right. Returns the first non-NULL.
What to do with missing values:
Fill → sensible default exists
Filter → row is useless without the value
Flag → keep, but mark as incomplete
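All three strategies in one sketch on the `listings` table (the 'UAH' default is illustrative, not a recommendation):

```sql
SELECT
  listing_id,
  COALESCE(currency, 'UAH') AS currency_filled,  -- fill: assume a default
  area_total IS NULL AS area_missing             -- flag: keep, but mark as incomplete
FROM listings
WHERE price IS NOT NULL                          -- filter: row is useless without a price
```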
Cleaning with CTEs
```sql
-- Each CTE handles one cleaning concern
WITH fix_types AS (
  SELECT *, SAFE_CAST(price AS FLOAT64) AS price_num
  FROM raw_listings
),
standardize AS (
  SELECT *,
    CASE
      WHEN LOWER(TRIM(city)) IN ('kyiv', 'kiev') THEN 'Kyiv'
      ELSE city
    END AS city_clean
  FROM fix_types
),
deduplicated AS (
  SELECT * EXCEPT (row_num)
  FROM (
    SELECT *,
      ROW_NUMBER() OVER (
        PARTITION BY listing_id
        ORDER BY insert_time DESC
      ) AS row_num
    FROM standardize
  )
  WHERE row_num = 1
)
SELECT * FROM deduplicated
Wrap-up
What we covered
BigQuery architecture:
Columnar storage → SELECT only what you need
Parallel execution → no indexes, full scans by design
Separate compute and storage → serverless, scales on demand
Data quality at scale:
Problems from systems (ingestion, schema drift) and from humans at scale (user-generated data)
You profile with SQL, not your eyes
Profile → Clean → Analyze
Practice sessions
You’ll work with a real dataset: ~1.4 million real estate listings from Flatfy.
User-generated data with real quality issues — your task is to profile and clean it.
Practice 1: Profile — what’s actually in this data?
Practice 2: Clean — build a CTE pipeline, save a cleaned view