Week 2 Practice: Data Profiling & Cleaning

ECON250: Big Data Analytics

Published: February 11, 2026

Modified: March 21, 2026

Your Submission

Deadline: 3 days after the practice sessions

Points: 2 points for satisfactory completion

You’ll create these objects progressively during today’s two practice sessions.

Required Submission Items

Create both of the following in your personal dataset (econ250-2026.student__yourname):


1. Table: week2_profiling_results

What to create: A table containing profiling statistics for each column in the flatfy_2019 dataset.

Required columns (for each data column):

  • nulls_<column> — count of NULL values
  • null_pct_<column> — percentage of NULL values
  • distinct_<column> — count of distinct values

Where to find it: Practice 1 Part 11 provides the template. You’ll extend your Part 2 completeness query.

Quality check: Table must include profiling statistics for all 11 data columns (excluding int64_field_0, insert_time, and download_time).


2. View: week2_cleaned_listings

What to create: A view containing cleaned listings using a multi-step CTE pipeline.

Required transformations (must include all):

  • Extract and standardize region from the geo column (handle top 5 regions)
  • Classify deal_type from section (sale/rental/daily_rental categories)
  • Convert ceiling_height from STRING to FLOAT64
  • Create price_usd via JOIN with exchange rate table
  • Filter out outliers (sentinel values like price = 1, extreme values)

Where to find it: Practice 2 Part 6 provides the pipeline template, Part 7 shows how to save it.

Quality check: View must run without errors and produce cleaned output with all transformations applied.


How to Submit

  1. In BigQuery Explorer, find each object in your student__yourname dataset
  2. Click the three dots menu (⋮) next to the object name
  3. Select “Copy link”
  4. Paste both links in Moodle

Working Process

As you work through the practice, save your intermediate queries as views in your dataset using descriptive names:

econ250-2026.student__yourname.week2_description

For example:

  • week2_completeness — your NULL analysis from Part 2
  • week2_section_dist — section value distribution from Part 3
  • week2_regions — region extraction from Part 4
  • week2_deal_types — deal type classification from Practice 2 Part 1
  • week2_region_standard — region standardization from Practice 2 Part 2

Why create intermediate views?

  • Builds your work incrementally (easier to debug)
  • Creates an audit trail of your profiling work
  • Lets you reference earlier queries in later parts
  • Helps with grading (shows you did the work, not just the final submission)

What to submit to Moodle: Only the two final objects (table + view) listed above. The intermediate views stay in your dataset as proof of work.


Before You Start

Required: BigQuery console open at console.cloud.google.com/bigquery?project=econ250-2026

Dataset: econ250-2026.course_data.flatfy_2019 — approximately 1.4 million real estate listings scraped from a Ukrainian property platform (Flatfy). Each row is a single listing with price, location, property characteristics, and timestamps.

This is real user-generated data. Thousands of real estate agents posted these listings, each entering information in their own way, and nobody cleaned it.

Note: Dataset Size

The full table is ~265 MB. A SELECT * scans all of it. For profiling, you’ll typically select specific columns, so most queries scan much less — around 10-40 MB per query.

Column reference

Before querying, look at the table schema in BigQuery Explorer: click flatfy_2019 in the Explorer panel, then the Schema tab.

Column          Type       Description
int64_field_0   INTEGER    Row index from the original CSV export
price           INTEGER    Listing price (but in what currency?)
currency        STRING     Currency code
room_count      FLOAT      Number of rooms
section         STRING     Listing category (deal type + property type)
floor_count     FLOAT      Total floors in the building
floor           FLOAT      Floor of the listing
area_total      FLOAT      Total area in m²
area_kitchen    FLOAT      Kitchen area in m²
area_living     FLOAT      Living area in m²
ceiling_height  STRING     Ceiling height (look at the type — does it seem right?)
insert_time     TIMESTAMP  When the listing was posted
download_time   TIMESTAMP  When the listing was scraped
geo             STRING     Location (region, city, district, street — all packed together)

Some of these types look wrong — keep that in mind as you explore.

What you’ll practice

Review (from Week 1): GROUP BY + COUNT(*), WHERE filters, basic aggregations

New this week:

  • COUNTIF — conditional counting without WHERE
  • SPLIT + SAFE_OFFSET — extracting parts from delimited strings
  • SAFE_CAST — type conversion that returns NULL instead of crashing
  • CASE WHEN — recoding values into clean categories
  • CTEs (WITH ... AS) — organizing multi-step transformations into a readable pipeline
  • PARSE_DATE — converting date strings to proper DATE values
  • LEFT JOIN for currency conversion using a reference table

Practice 1: Profiling

The goal is to figure out what this dataset actually contains before writing any analytical queries: what’s complete, what’s broken, and whether the data is trustworthy.

Parts 1-10 build your profiling skills and inform Part 11 (Submission Item 1: the profiling table). Practice 2 builds your cleaning pipeline toward its Part 7 (Submission Item 2: the cleaned view). Work through them in order during the session.

Part 1: Shape and size

Start with the basics: how many rows, and over what time period?

Write a query that returns:

  • Total row count
  • The earliest and latest insert_time (when listings were posted)
  • The earliest and latest download_time (when data was scraped)

What you should be able to answer: How many listings are there? Over how many months/years were they posted?
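A minimal sketch of this query, using the column names from the schema reference:

```sql
SELECT
  COUNT(*) AS total_rows,
  MIN(insert_time) AS first_posted,
  MAX(insert_time) AS last_posted,
  MIN(download_time) AS first_scraped,
  MAX(download_time) AS last_scraped
FROM `econ250-2026.course_data.flatfy_2019`
```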

Part 2: Completeness

Not every listing has every field filled in. Using the COUNTIF(column IS NULL) pattern from the lecture, write a single query that counts the number of NULL values in each column.

What you should be able to answer: Which columns are nearly complete (close to 0% NULL)? Which are mostly empty? Is there one column that’s almost entirely NULL?

Tip: From Counts to Percentages

Raw NULL counts are hard to interpret in a 1.4M-row table. To get the percentage, divide the COUNTIF result by COUNT(*) and multiply by 100. ROUND(..., 1) keeps it readable.

Once you have the percentages, think about what explains the patterns. For instance: why might area_kitchen be NULL much more often than area_total?

Tip: Save Your Work

Save this completeness analysis as a view for reference:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_completeness` AS
-- Your query with NULL counts and percentages

Part 3: Distributions of categorical columns

The universal profiling pattern is GROUP BY column, COUNT(*), ORDER BY count DESC. Run it on the categorical columns to see what values actually exist.

Currency

What you should be able to answer: How many distinct currencies are there? What percentage of listings is in each? What does this mean if someone computes AVG(price) without filtering by currency?
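One possible sketch: the window sum SUM(COUNT(*)) OVER () is a convenient way to turn group counts into percentages in a single query (running a separate query for the total and dividing by hand works just as well):

```sql
SELECT
  currency,
  COUNT(*) AS listing_count,
  -- share of all listings, as a percentage
  ROUND(COUNT(*) / SUM(COUNT(*)) OVER () * 100, 1) AS pct
FROM `econ250-2026.course_data.flatfy_2019`
GROUP BY currency
ORDER BY listing_count DESC
```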

Section

The section column describes what kind of listing this is.

What you should be able to answer: How many distinct section values exist? Can you spot groups of values that seem to describe the same thing in different ways?

Look for patterns: some values are in Ukrainian, some in Russian. Some use commas as separators, some don’t. How many variants can you find for “apartment sale” or “apartment rental”?

Tip: Counting Distinct Values

COUNT(DISTINCT column) returns the number of unique values in a column — useful when the GROUP BY output is too long to scan visually.

Tip: Save Your Work

This distribution will inform your deal_type classification in Practice 2. Save it:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_section_dist` AS
-- Your GROUP BY section query

Part 4: Working with the geo column

The geo column packs multiple location components into a single comma-separated string — region, city, district, and sometimes street.

Use SPLIT(geo, ',')[SAFE_OFFSET(0)] to extract the region and [SAFE_OFFSET(1)] for the city. Combine with GROUP BY + COUNT(*) to see the distribution of regions.

What you should be able to answer: How many different ways is the Kyiv region spelled? (Count both the number of variants and the total listings across all of them.) What about Odesa region?

Do the same for cities. Is “Kyiv” the city represented consistently?

Tip: Save Your Work

This region analysis will inform your standardization logic in Practice 2. Save it:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_regions` AS
SELECT
  SPLIT(geo, ',')[SAFE_OFFSET(0)] AS region_raw,
  SPLIT(geo, ',')[SAFE_OFFSET(1)] AS city_raw,
  COUNT(*) AS listing_count
FROM `econ250-2026.course_data.flatfy_2019`
WHERE geo IS NOT NULL
GROUP BY region_raw, city_raw
ORDER BY listing_count DESC

Part 5: Now look at NULLs again, but smarter

In Part 2 you counted NULLs across the whole table. Now that you know how to break down by categories, go deeper.

Pick one of these questions and write a query to answer it:

  • Does the NULL rate for floor depend on deal type? Group the NULL count and total count by a classification of section (sale vs. rental vs. other — use LIKE patterns). Do rentals have more floor information than sales, or vice versa?
  • Are area columns more complete for some regions than others? For the top 5 regions, what percentage of listings has area_total filled in?
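For the first question, a starting-point sketch (the LIKE patterns here are illustrative; refine them against what you saw in Part 3):

```sql
SELECT
  -- rough sale/rental/other grouping of section
  CASE
    WHEN LOWER(section) LIKE '%продаж%' THEN 'sale'
    WHEN LOWER(section) LIKE '%оренда%' OR LOWER(section) LIKE '%аренда%' THEN 'rental'
    ELSE 'other'
  END AS deal_group,
  COUNT(*) AS total_rows,
  COUNTIF(floor IS NULL) AS nulls_floor,
  ROUND(COUNTIF(floor IS NULL) / COUNT(*) * 100, 1) AS null_pct_floor
FROM `econ250-2026.course_data.flatfy_2019`
GROUP BY deal_group
ORDER BY total_rows DESC
```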

Part 6: Uniqueness

Is int64_field_0 a unique identifier? Compare COUNT(*) with COUNT(DISTINCT int64_field_0).

If they match, every row has a unique ID. But that doesn’t mean there are no business duplicates — the same property could appear multiple times with different IDs. Think about which columns you’d compare to detect that.
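The comparison itself is a one-liner:

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT int64_field_0) AS distinct_ids,
  -- 0 here means every row has a unique ID
  COUNT(*) - COUNT(DISTINCT int64_field_0) AS repeated_ids
FROM `econ250-2026.course_data.flatfy_2019`
```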

Part 7: Type checks

Look back at the schema. The ceiling_height column is stored as STRING, not FLOAT64. Is that a problem?

Use SAFE_CAST(ceiling_height AS FLOAT64) to test how many non-NULL values are valid numbers and how many would fail conversion. The COUNTIF pattern works well here: count values where the original is not NULL but the SAFE_CAST result is NULL — those are the ones that can’t be converted.

What you should be able to answer: How many non-NULL ceiling_height values exist? How many of them are valid numbers?
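A sketch of the COUNTIF pattern described above:

```sql
SELECT
  COUNTIF(ceiling_height IS NOT NULL) AS non_null_values,
  -- non-NULL originals whose cast fails: these can't be converted
  COUNTIF(ceiling_height IS NOT NULL
          AND SAFE_CAST(ceiling_height AS FLOAT64) IS NULL) AS unconvertible_values
FROM `econ250-2026.course_data.flatfy_2019`
```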

Also look at room_count — it’s stored as FLOAT. Run a distribution query. Are there fractional room counts, or are they all whole numbers? What’s the highest room count?

Part 8: Value ranges and sentinel values

For numeric columns, check MIN, MAX, and the median (use APPROX_QUANTILES(column, 100)[OFFSET(50)]).

Start with price:

What you should be able to answer: What’s the minimum price? The maximum? How many listings have a price of exactly 1? How many have prices over 100 million?
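One way to answer all four questions in a single pass (the 100 million cutoff mirrors the question above):

```sql
SELECT
  MIN(price) AS min_price,
  MAX(price) AS max_price,
  APPROX_QUANTILES(price, 100)[OFFSET(50)] AS median_price,  -- 50th percentile
  COUNTIF(price = 1) AS price_exactly_1,
  COUNTIF(price > 100000000) AS price_over_100m
FROM `econ250-2026.course_data.flatfy_2019`
```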

Think about what a price of 1 means on a real estate platform where the price field is mandatory. And what does a price in the hundreds of billions likely represent?

Now do the same for area_total. Are there any values that are physically impossible?

Part 9: Cross-column logic

Some columns have logical relationships that should always hold. Write queries to test these:

  • Can a listing be on floor 8 of a 5-story building? (Check floor > floor_count)
  • Can the total area be smaller than the sum of its parts? (Check area_total < area_living + area_kitchen)

What you should be able to answer: How many rows violate each rule?
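Both rules can be checked in one query. Note that when either side of a comparison is NULL, the comparison evaluates to NULL, so COUNTIF does not count those rows as violations:

```sql
SELECT
  COUNTIF(floor > floor_count) AS floor_above_building_top,
  COUNTIF(area_total < area_living + area_kitchen) AS area_less_than_parts
FROM `econ250-2026.course_data.flatfy_2019`
```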

Part 10: What does AVG(price) actually mean?

SELECT AVG(price) FROM `econ250-2026.course_data.flatfy_2019`

What you should be able to answer: What does this number mix together? Why is it meaningless?

Part 11: Save your profiling summary ⭐ SUBMISSION ITEM 1

Important: Required Submission

This part creates Submission Item 1: the week2_profiling_results table. See the Your Submission section at the top for requirements.

You already have the core query from Part 2 — NULL counts and percentages for each column. Extend it by adding COUNT(DISTINCT column) for each column, then save as a table.

CREATE OR REPLACE TABLE `econ250-2026.student__yourname.week2_profiling_results` AS
SELECT
  COUNT(*) AS total_rows,

  -- price
  COUNTIF(price IS NULL) AS nulls_price,
  ROUND(COUNTIF(price IS NULL) / COUNT(*) * 100, 1) AS null_pct_price,
  COUNT(DISTINCT price) AS distinct_price,

  -- currency
  COUNTIF(currency IS NULL) AS nulls_currency,
  ROUND(COUNTIF(currency IS NULL) / COUNT(*) * 100, 1) AS null_pct_currency,
  COUNT(DISTINCT currency) AS distinct_currency,

  -- room_count
  COUNTIF(room_count IS NULL) AS nulls_room_count,
  ROUND(COUNTIF(room_count IS NULL) / COUNT(*) * 100, 1) AS null_pct_room_count,
  COUNT(DISTINCT room_count) AS distinct_room_count,

  -- repeat the same 3-line pattern for: section, floor, floor_count,
  --   area_total, area_kitchen, area_living, ceiling_height, geo

FROM `econ250-2026.course_data.flatfy_2019`

Add the same three lines (nulls count, null percentage, distinct count) for each of the remaining 8 columns. Replace yourname with your actual dataset suffix.


Practice 2: Cleaning

Part 1: Classify deal types from section

The section column mixes deal type and property type in inconsistent formats. Your goal is to create a clean deal_type column with values like 'sale', 'rental', 'daily_rental', or 'other'.

Use CASE WHEN with LIKE patterns to match keywords in the section text. You saw these keywords when profiling: продаж (sale), оренда/аренда (rental), посуточн (daily).

Warning: Order Matters in CASE WHEN

CASE WHEN stops at the first match. If you check for '%оренда%' before '%посуточн%', daily rentals will be classified as regular rentals — because “посуточна оренда” contains “оренда”. Check the more specific pattern first.

Write a query that shows each distinct section value alongside your computed deal_type and the row count. Verify that the classification makes sense for the top 20 values.

What you should be able to answer: How many listings are in each deal type category? What percentage ends up as “other”?

Tip: Save Your Work

Save your deal type classification logic for use in the final pipeline:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_deal_types` AS
SELECT
  section,
  CASE
    WHEN LOWER(section) LIKE '%посуточн%' THEN 'daily_rental'
    WHEN LOWER(section) LIKE '%оренда%' OR LOWER(section) LIKE '%аренда%' THEN 'rental'
    WHEN LOWER(section) LIKE '%продаж%' THEN 'sale'
    ELSE 'other'
  END AS deal_type,
  COUNT(*) AS listing_count
FROM `econ250-2026.course_data.flatfy_2019`
GROUP BY section, deal_type
ORDER BY listing_count DESC

Part 2: Standardize regions

Your profiling showed that the same region appears under multiple spellings (Russian full name, Russian abbreviated, Ukrainian, shortened). Write a CASE WHEN that maps the variants to a single canonical form.

You don’t need to cover every region — focus on the top 5-6 by volume. Use your profiling results from Practice 1 Part 4 to identify which variants to consolidate. The ELSE clause keeps everything else unchanged.

Test your mapping by running it with GROUP BY + COUNT(*) on the result — verify that what was previously 4 rows for Kyiv region is now 1.

Tip: Save Your Work

Save your region standardization logic for use in the final pipeline:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_region_standard` AS
SELECT
  SPLIT(geo, ',')[SAFE_OFFSET(0)] AS region_raw,
  CASE
    WHEN SPLIT(geo, ',')[SAFE_OFFSET(0)] IN ('Київська обл.', 'Киевская обл.', 'Київська', 'Киевская') THEN 'Kyiv Oblast'
    WHEN SPLIT(geo, ',')[SAFE_OFFSET(0)] IN ('Одеська обл.', 'Одесская обл.') THEN 'Odesa Oblast'
    -- Add more mappings for top regions
    ELSE SPLIT(geo, ',')[SAFE_OFFSET(0)]
  END AS region_clean,
  COUNT(*) AS listing_count
FROM `econ250-2026.course_data.flatfy_2019`
WHERE geo IS NOT NULL
GROUP BY region_raw, region_clean
ORDER BY listing_count DESC

Part 3: Fix types

Convert ceiling_height from STRING to FLOAT64 using SAFE_CAST. Your profiling already showed how many non-NULL values are valid numbers, so this is a simple step.
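The conversion itself is a single expression; rows that fail the cast become NULL instead of stopping the query:

```sql
SELECT
  ceiling_height,                                              -- original STRING
  SAFE_CAST(ceiling_height AS FLOAT64) AS ceiling_height_num   -- NULL if unparseable
FROM `econ250-2026.course_data.flatfy_2019`
LIMIT 20
```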

Part 4: Normalize prices to a common currency

Prices are in three different currencies (USD, UAH, EUR). To make them comparable, convert to a common currency using exchange rates.

Explore the exchange rate table

There’s a second table available: econ250-2026.course_data.week2_exchange_rates — daily exchange rates from the National Bank of Ukraine (NBU). Take a look at it:

SELECT *
FROM `econ250-2026.course_data.week2_exchange_rates`
LIMIT 10

What you should notice:

  • The date column is a STRING in DD.MM.YYYY format — not a proper DATE type
  • The rate column is the UAH equivalent for the number of units specified in the units column (e.g., rate = 2412 for 100 USD means 1 USD = 24.12 UAH)
  • There are rows for both USD and EUR, one per day

The rates CTE (provided)

This CTE parses the date strings and pivots the table so each date has one row with both USD and EUR rates as separate columns. Use it as-is in your pipeline:

cte_rates AS (
  SELECT
    PARSE_DATE('%d.%m.%Y', date) AS rate_date,
    MAX(IF(currency = 'USD', rate / units, NULL)) AS uah_per_usd,
    MAX(IF(currency = 'EUR', rate / units, NULL)) AS uah_per_eur
  FROM `econ250-2026.course_data.week2_exchange_rates`
  GROUP BY rate_date
)

Try it standalone first: WITH cte_rates AS (...) SELECT * FROM cte_rates LIMIT 10.

Join and convert

Now join cte_rates with the listings on the listing date. Use DATE(insert_time) to extract the date from the listing timestamp, and match it to rate_date.

The conversion logic:

  • USD listings: price is already in USD — keep as-is
  • UAH listings: divide by uah_per_usd to get USD
  • EUR listings: multiply by uah_per_eur (to get UAH), then divide by uah_per_usd (to get USD)

Write a CASE WHEN on the currency column that applies the right formula for each case. Use ROUND(..., 2) to keep results clean.
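A sketch of the join and conversion, assuming your listings CTE is aliased s and the rates CTE is aliased r (matching the Part 6 template):

```sql
SELECT
  s.*,
  CASE s.currency
    WHEN 'USD' THEN s.price                                      -- already USD
    WHEN 'UAH' THEN ROUND(s.price / r.uah_per_usd, 2)            -- UAH -> USD
    WHEN 'EUR' THEN ROUND(s.price * r.uah_per_eur / r.uah_per_usd, 2)  -- EUR -> UAH -> USD
  END AS price_usd                                               -- NULL for anything else
FROM cte_standardized s
LEFT JOIN cte_rates r
  ON DATE(s.insert_time) = r.rate_date
```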

Tip: Keep the Original Currency

Keep the original price and currency columns alongside the new price_usd. The original currency tells you something about the listing — USD pricing is more common for sales, UAH for rentals. Dropping it loses that signal.

What you should be able to answer: After conversion, what’s the average price in USD for the entire dataset? How does it compare to the meaningless average from Practice 1?

Part 5: Filter outliers

Now that prices are in a common currency, your outlier filters are more meaningful. Define reasonable ranges for price_usd and area_total.

Consider:

  • What’s a reasonable minimum price in USD? (A listing priced at $1 or $2 is clearly a placeholder.)
  • What’s a reasonable maximum? (Where does “expensive property” end and “data entry error” begin?)
  • For areas: what’s physically plausible for the largest property in this dataset?

Use your findings from Practice 1 Part 8 to set boundaries. A starting point: filter prices below $5-10 and above $2-5M. For areas, think about physically plausible upper bounds. There’s no single correct threshold — the goal is to remove obvious junk.

Be careful with NULL columns — if area_total is NULL, you probably want to keep the row (some listings have no area information — check your Part 2 completeness results). The pattern area_total IS NULL OR area_total < threshold keeps NULLs while filtering bad values.
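Put together, the filter might look like this (the thresholds are illustrative starting points, not the "correct" answer; be ready to justify your own):

```sql
WHERE price_usd BETWEEN 10 AND 5000000          -- also drops rows where price_usd is NULL
  AND (area_total IS NULL OR area_total BETWEEN 5 AND 2000)  -- keep missing areas
```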

Part 6: Build the full CTE pipeline

Combine all your cleaning steps into a single query. Fill in the ??? placeholders with expressions from Parts 1-5:

WITH cte_rates AS (
  -- PROVIDED — paste the exchange rates CTE from Part 4
  SELECT
    PARSE_DATE('%d.%m.%Y', date) AS rate_date,
    MAX(IF(currency = 'USD', rate / units, NULL)) AS uah_per_usd,
    MAX(IF(currency = 'EUR', rate / units, NULL)) AS uah_per_eur
  FROM `econ250-2026.course_data.week2_exchange_rates`
  GROUP BY rate_date
),

cte_parsed AS (
  -- TODO: extract region and city from geo using SPLIT
  -- TODO: classify deal_type from section using CASE WHEN
  --       (remember: check посуточн before оренда — order matters)
  -- TODO: convert ceiling_height to FLOAT64 using SAFE_CAST
  SELECT
    *,
    ??? AS region_raw,
    ??? AS city_raw,
    ??? AS deal_type,
    ??? AS ceiling_height_num
  FROM `econ250-2026.course_data.flatfy_2019`
),

cte_standardized AS (
  -- TODO: map region name variants to canonical names
  -- * EXCEPT(region_raw) selects all columns except region_raw —
  -- we're replacing it with the cleaned version region_clean
  SELECT
    * EXCEPT(region_raw),
    ??? AS region_clean
  FROM cte_parsed
),

cte_with_prices AS (
  -- TODO: join with cte_rates on date
  -- TODO: compute price_usd using CASE WHEN on currency
  SELECT
    s.*,
    ??? AS price_usd
  FROM cte_standardized s
  LEFT JOIN cte_rates r ON ???
),

cte_filtered AS (
  -- TODO: filter out sentinel values and outliers
  SELECT *
  FROM cte_with_prices
  WHERE ???
)

SELECT
  int64_field_0 AS listing_id,
  price,
  currency,
  price_usd,
  deal_type,
  room_count,
  region_clean AS region,
  city_raw AS city,
  floor,
  floor_count,
  area_total,
  area_kitchen,
  area_living,
  ceiling_height_num AS ceiling_height,
  insert_time,
  download_time
FROM cte_filtered

Warning: Common CTE Mistakes
  • Missing comma: CTEs are separated by commas — don’t forget the comma after each closing )
  • Column name conflicts: if you create region_clean in one CTE and reference it in the next, make sure the name matches exactly
  • cte_rates is independent: it reads from the exchange rate table, not from cte_parsed — it gets joined in cte_with_prices

Run the query. Check how many rows survive. Spot-check a few rows to make sure the cleaning looks right.

Part 7: Save as a view ⭐ SUBMISSION ITEM 2

Important: Required Submission

This part creates Submission Item 2: the week2_cleaned_listings view. See the Your Submission section at the top for requirements.

Once your pipeline produces reasonable results, save it as a view:

CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_cleaned_listings` AS
-- Your full CTE pipeline here

Replace yourname with your actual dataset suffix.

Verify it works

Query your view to check that the results make sense:

  • Group by region and deal_type — you should see clean, consolidated groups without the variant fragmentation from profiling
  • Check that price_usd values look reasonable for different deal types (sales in the tens of thousands, rentals in the hundreds)
  • Verify that the original currency column is still present
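A quick check query along those lines (replace yourname as usual):

```sql
SELECT
  region,
  deal_type,
  COUNT(*) AS listings,
  ROUND(AVG(price_usd), 0) AS avg_price_usd
FROM `econ250-2026.student__yourname.week2_cleaned_listings`
GROUP BY region, deal_type
ORDER BY listings DESC
```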

Going Further (optional)

  • Add a city cleaning step (“Киев”/“Київ” → single name), similar to regions
  • Extract property type from section into a property_type column (квартира = apartment, будинок = house, ділянка = land plot)
  • Compute average price_usd per m² by region for apartment sales only
  • Group by currency and deal_type — are sales more likely in USD and rentals in UAH?

Troubleshooting

  • SPLIT returns NULL: the geo column itself may be NULL — filter with WHERE geo IS NOT NULL
  • CASE WHEN doesn’t match expected values: check for leading/trailing spaces by wrapping with TRIM(...)
  • JOIN produces more rows than expected: make sure the rates CTE has one row per date (use GROUP BY)
  • PARSE_DATE fails: verify the format string matches the data ('%d.%m.%Y' for DD.MM.YYYY)
  • View creation fails: verify you replaced yourname with your actual dataset suffix
  • “Table not found”: use the full path econ250-2026.course_data.flatfy_2019
  • SAFE_CAST returns all NULLs: double-check that you are casting the right column
  • CTE syntax error: check for missing commas between CTEs and that each CTE reads from the right source (remember that cte_rates is independent of the others)

SAFE_CAST vs CAST reminder: CAST crashes the entire query if any single value can’t convert. SAFE_CAST returns NULL for invalid values and lets the query finish. Always use SAFE_CAST when working with data you haven’t fully validated.

BigQuery SQL reference: The Standard SQL functions reference is a useful bookmark for looking up function syntax and examples.