Week 2 Practice: Data Profiling & Cleaning
ECON250: Big Data Analytics
Your Submission
Deadline: 3 days after the practice sessions
Points: 2 points for satisfactory completion
You’ll create these objects progressively during today’s two practice sessions.
Required Submission Items
Create both of the following in your personal dataset (econ250-2026.student__yourname):
1. Table: week2_profiling_results
What to create: A table containing profiling statistics for each column in the flatfy_2019 dataset.
Required columns (for each data column):
- `nulls_<column>` — count of NULL values
- `null_pct_<column>` — percentage of NULL values
- `distinct_<column>` — count of distinct values
Where to find it: Practice 1 Part 11 provides the template. You’ll extend your Part 2 completeness query.
Quality check: Table must include profiling statistics for all 11 data columns (excluding int64_field_0, insert_time, and download_time).
2. View: week2_cleaned_listings
What to create: A view containing cleaned listings using a multi-step CTE pipeline.
Required transformations (must include all):
- Extract and standardize region from the `geo` column (handle the top 5 regions)
- Classify `deal_type` from `section` (sale/rental/daily_rental categories)
- Convert `ceiling_height` from STRING to FLOAT64
- Create `price_usd` via a JOIN with the exchange rate table
- Filter out outliers (sentinel values like price = 1, extreme values)
Where to find it: Practice 2 Part 6 provides the pipeline template, Part 7 shows how to save it.
Quality check: View must run without errors and produce cleaned output with all transformations applied.
How to Submit
- In BigQuery Explorer, find each object in your `student__yourname` dataset
- Click the three dots menu (⋮) next to the object name
- Select “Copy link”
- Paste both links in Moodle
Working Process
As you work through the practice, save your intermediate queries as views in your dataset using descriptive names:
econ250-2026.student__yourname.week2_description
For example:
- `week2_completeness` — your NULL analysis from Part 2
- `week2_section_dist` — section value distribution from Part 3
- `week2_regions` — region extraction from Part 4
- `week2_deal_types` — deal type classification from Practice 2 Part 1
- `week2_region_standard` — region standardization from Practice 2 Part 2
Why create intermediate views?
- Builds your work incrementally (easier to debug)
- Creates an audit trail of your profiling work
- Lets you reference earlier queries in later parts
- Helps with grading (shows you did the work, not just the final submission)
What to submit to Moodle: Only the two final objects (table + view) listed above. The intermediate views stay in your dataset as proof of work.
Before You Start
Required: BigQuery console open at console.cloud.google.com/bigquery?project=econ250-2026
Dataset: econ250-2026.course_data.flatfy_2019 — approximately 1.4 million real estate listings scraped from a Ukrainian property platform (Flatfy). Each row is a single listing with price, location, property characteristics, and timestamps.
This is real user-generated data. Thousands of real estate agents posted these listings, each entering information in their own way, and nobody cleaned it.
The full table is ~265 MB. A SELECT * scans all of it. For profiling, you’ll typically select specific columns, so most queries scan much less — around 10-40 MB per query.
Column reference
Before querying, look at the table schema in BigQuery Explorer: click flatfy_2019 in the Explorer panel, then the Schema tab.
| Column | Type | Description |
|---|---|---|
| `int64_field_0` | INTEGER | Row index from the original CSV export |
| `price` | INTEGER | Listing price (but in what currency?) |
| `currency` | STRING | Currency code |
| `room_count` | FLOAT | Number of rooms |
| `section` | STRING | Listing category (deal type + property type) |
| `floor_count` | FLOAT | Total floors in the building |
| `floor` | FLOAT | Floor of the listing |
| `area_total` | FLOAT | Total area in m² |
| `area_kitchen` | FLOAT | Kitchen area in m² |
| `area_living` | FLOAT | Living area in m² |
| `ceiling_height` | STRING | Ceiling height (look at the type — does it seem right?) |
| `insert_time` | TIMESTAMP | When the listing was posted |
| `download_time` | TIMESTAMP | When the listing was scraped |
| `geo` | STRING | Location (region, city, district, street — all packed together) |
Some of these types look wrong — keep that in mind as you explore.
What you’ll practice
Review (from Week 1): GROUP BY + COUNT(*), WHERE filters, basic aggregations
New this week:
- `COUNTIF` — conditional counting without `WHERE`
- `SPLIT` + `SAFE_OFFSET` — extracting parts from delimited strings
- `SAFE_CAST` — type conversion that returns NULL instead of crashing
- `CASE WHEN` — recoding values into clean categories
- CTEs (`WITH ... AS`) — organizing multi-step transformations into a readable pipeline
- `PARSE_DATE` — converting date strings to proper DATE values
- `LEFT JOIN` — for currency conversion using a reference table
Practice 1: Profiling
The goal is to figure out what this dataset actually contains before writing any analytical queries: what’s complete, what’s broken, and whether the data is trustworthy.
Parts 1-10 build your profiling skills and inform Part 11 (Submission Item 1: the profiling table). Practice 2 builds your cleaning pipeline, culminating in Part 7 (Submission Item 2: the cleaned view). Work through the parts in order during the session.
Part 1: Shape and size
Start with the basics: how many rows, and over what time period?
Write a query that returns:
- Total row count
- The earliest and latest `insert_time` (when listings were posted)
- The earliest and latest `download_time` (when data was scraped)
What you should be able to answer: How many listings are there? Over how many months/years were they posted?
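A minimal sketch of this query (column names taken from the schema reference above):

```sql
-- Shape and time span of the dataset
SELECT
  COUNT(*) AS total_rows,
  MIN(insert_time) AS first_posted,
  MAX(insert_time) AS last_posted,
  MIN(download_time) AS first_scraped,
  MAX(download_time) AS last_scraped
FROM `econ250-2026.course_data.flatfy_2019`
```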
Part 2: Completeness
Not every listing has every field filled in. Using the COUNTIF(column IS NULL) pattern from the lecture, write a single query that counts the number of NULL values in each column.
What you should be able to answer: Which columns are nearly complete (close to 0% NULL)? Which are mostly empty? Is there one column that’s almost entirely NULL?
Raw NULL counts are hard to interpret in a 1.4M-row table. To get the percentage, divide the COUNTIF result by COUNT(*) and multiply by 100. ROUND(..., 1) keeps it readable.
Once you have the percentages, think about what explains the patterns. For instance: why might area_kitchen be NULL much more often than area_total?
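As a sketch, the per-column pattern looks like this — repeat the two `COUNTIF` lines for each column you profile:

```sql
-- NULL count and percentage for one column (repeat per column)
SELECT
  COUNT(*) AS total_rows,
  COUNTIF(area_kitchen IS NULL) AS nulls_area_kitchen,
  ROUND(COUNTIF(area_kitchen IS NULL) / COUNT(*) * 100, 1) AS null_pct_area_kitchen
FROM `econ250-2026.course_data.flatfy_2019`
```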
Save this completeness analysis as a view for reference:
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_completeness` AS
-- Your query with NULL counts and percentages

Part 3: Distributions of categorical columns
The universal profiling pattern is GROUP BY column, COUNT(*), ORDER BY count DESC. Run it on the categorical columns to see what values actually exist.
Currency
What you should be able to answer: How many distinct currencies are there? What percentage of listings is in each? What does this mean if someone computes AVG(price) without filtering by currency?
Section
The section column describes what kind of listing this is.
What you should be able to answer: How many distinct section values exist? Can you spot groups of values that seem to describe the same thing in different ways?
Look for patterns: some values are in Ukrainian, some in Russian. Some use commas as separators, some don’t. How many variants can you find for “apartment sale” or “apartment rental”?
COUNT(DISTINCT column) returns the number of unique values in a column — useful when the GROUP BY output is too long to scan visually.
This distribution will inform your deal_type classification in Practice 2. Save it:
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_section_dist` AS
-- Your GROUP BY section query

Part 4: Working with the geo column
The geo column packs multiple location components into a single comma-separated string — region, city, district, and sometimes street.
Use SPLIT(geo, ',')[SAFE_OFFSET(0)] to extract the region and [SAFE_OFFSET(1)] for the city. Combine with GROUP BY + COUNT(*) to see the distribution of regions.
What you should be able to answer: How many different ways is the Kyiv region spelled? (Count both the number of variants and the total listings across all of them.) What about Odesa region?
Do the same for cities. Is “Kyiv” the city represented consistently?
This region analysis will inform your standardization logic in Practice 2. Save it:
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_regions` AS
SELECT
SPLIT(geo, ',')[SAFE_OFFSET(0)] AS region_raw,
SPLIT(geo, ',')[SAFE_OFFSET(1)] AS city_raw,
COUNT(*) AS listing_count
FROM `econ250-2026.course_data.flatfy_2019`
WHERE geo IS NOT NULL
GROUP BY region_raw, city_raw
ORDER BY listing_count DESC

Part 5: Now look at NULLs again, but smarter
In Part 2 you counted NULLs across the whole table. Now that you know how to break down by categories, go deeper.
Pick one of these questions and write a query to answer it:
- Does the NULL rate for `floor` depend on deal type? Group the NULL count and total count by a classification of `section` (sale vs. rental vs. other — use `LIKE` patterns). Do rentals have more floor information than sales, or vice versa?
- Are area columns more complete for some regions than others? For the top 5 regions, what percentage of listings has `area_total` filled in?
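One possible sketch for the first question, using a rough version of the keyword classification you will formalize in Practice 2 Part 1:

```sql
-- NULL rate for floor, broken down by a rough deal-type classification
SELECT
  CASE
    WHEN LOWER(section) LIKE '%продаж%' THEN 'sale'
    WHEN LOWER(section) LIKE '%оренда%' OR LOWER(section) LIKE '%аренда%' THEN 'rental'
    ELSE 'other'
  END AS deal_type_rough,
  COUNT(*) AS total_listings,
  COUNTIF(floor IS NULL) AS nulls_floor,
  ROUND(COUNTIF(floor IS NULL) / COUNT(*) * 100, 1) AS null_pct_floor
FROM `econ250-2026.course_data.flatfy_2019`
GROUP BY deal_type_rough
ORDER BY total_listings DESC
```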
Part 6: Uniqueness
Is int64_field_0 a unique identifier? Compare COUNT(*) with COUNT(DISTINCT int64_field_0).
If they match, every row has a unique ID. But that doesn’t mean there are no business duplicates — the same property could appear multiple times with different IDs. Think about which columns you’d compare to detect that.
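The comparison can be done in one query — if `duplicated_ids` is 0, the column is unique:

```sql
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT int64_field_0) AS distinct_ids,
  COUNT(*) - COUNT(DISTINCT int64_field_0) AS duplicated_ids
FROM `econ250-2026.course_data.flatfy_2019`
```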
Part 7: Type checks
Look back at the schema. The ceiling_height column is stored as STRING, not FLOAT64. Is that a problem?
Use SAFE_CAST(ceiling_height AS FLOAT64) to test how many non-NULL values are valid numbers and how many would fail conversion. The COUNTIF pattern works well here: count values where the original is not NULL but the SAFE_CAST result is NULL — those are the ones that can’t be converted.
What you should be able to answer: How many non-NULL ceiling_height values exist? How many of them are valid numbers?
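A sketch of the conversion check described above:

```sql
SELECT
  COUNTIF(ceiling_height IS NOT NULL) AS non_null_values,
  COUNTIF(SAFE_CAST(ceiling_height AS FLOAT64) IS NOT NULL) AS valid_numbers,
  -- non-NULL originals whose cast fails: these can't be converted
  COUNTIF(ceiling_height IS NOT NULL
          AND SAFE_CAST(ceiling_height AS FLOAT64) IS NULL) AS failed_casts
FROM `econ250-2026.course_data.flatfy_2019`
```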
Also look at room_count — it’s stored as FLOAT. Run a distribution query. Are there fractional room counts, or are they all whole numbers? What’s the highest room count?
Part 8: Value ranges and sentinel values
For numeric columns, check MIN, MAX, and the median (use APPROX_QUANTILES(column, 100)[OFFSET(50)]).
Start with price:
What you should be able to answer: What’s the minimum price? The maximum? How many listings have a price of exactly 1? How many have prices over 100 million?
Think about what a price of 1 means on a real estate platform where the price field is mandatory. And what does a price in the hundreds of billions likely represent?
Now do the same for area_total. Are there any values that are physically impossible?
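For `price`, a sketch combining the range statistics with the sentinel counts asked about above (the 100-million cutoff is just the threshold from the question, not a recommended boundary):

```sql
-- Range, median, and sentinel counts for price
SELECT
  MIN(price) AS min_price,
  MAX(price) AS max_price,
  APPROX_QUANTILES(price, 100)[OFFSET(50)] AS median_price,
  COUNTIF(price = 1) AS price_exactly_1,
  COUNTIF(price > 100000000) AS price_over_100m
FROM `econ250-2026.course_data.flatfy_2019`
```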
Part 9: Cross-column logic
Some columns have logical relationships that should always hold. Write queries to test these:
- Can a listing be on floor 8 of a 5-story building? (Check
floor > floor_count) - Can the total area be smaller than the sum of its parts? (Check
area_total < area_living + area_kitchen)
What you should be able to answer: How many rows violate each rule?
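Both rules can be tested in one pass; note that rows where any involved column is NULL evaluate to NULL and are not counted by `COUNTIF`:

```sql
-- Count violations of each consistency rule
SELECT
  COUNTIF(floor > floor_count) AS floor_above_building,
  COUNTIF(area_total < area_living + area_kitchen) AS area_less_than_parts
FROM `econ250-2026.course_data.flatfy_2019`
```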
Part 10: What does AVG(price) actually mean?
SELECT AVG(price) FROM `econ250-2026.course_data.flatfy_2019`

What you should be able to answer: What does this number mix together? Why is it meaningless?
Part 11: Save your profiling summary ⭐ SUBMISSION ITEM 1
This part creates Submission Item 1: the week2_profiling_results table. See the Your Submission section at the top for requirements.
You already have the core query from Part 2 — NULL counts and percentages for each column. Extend it by adding COUNT(DISTINCT column) for each column, then save as a table.
CREATE OR REPLACE TABLE `econ250-2026.student__yourname.week2_profiling_results` AS
SELECT
COUNT(*) AS total_rows,
-- price
COUNTIF(price IS NULL) AS nulls_price,
ROUND(COUNTIF(price IS NULL) / COUNT(*) * 100, 1) AS null_pct_price,
COUNT(DISTINCT price) AS distinct_price,
-- currency
COUNTIF(currency IS NULL) AS nulls_currency,
ROUND(COUNTIF(currency IS NULL) / COUNT(*) * 100, 1) AS null_pct_currency,
COUNT(DISTINCT currency) AS distinct_currency,
-- room_count
COUNTIF(room_count IS NULL) AS nulls_room_count,
ROUND(COUNTIF(room_count IS NULL) / COUNT(*) * 100, 1) AS null_pct_room_count,
COUNT(DISTINCT room_count) AS distinct_room_count,
-- repeat the same 3-line pattern for: section, floor, floor_count,
-- area_total, area_kitchen, area_living, ceiling_height, geo
FROM `econ250-2026.course_data.flatfy_2019`

Add the same three lines (nulls count, null percentage, distinct count) for each of the remaining 8 columns. Replace yourname with your actual dataset suffix.
Practice 2: Cleaning
Part 1: Classify deal types from section
The section column mixes deal type and property type in inconsistent formats. Your goal is to create a clean deal_type column with values like 'sale', 'rental', 'daily_rental', or 'other'.
Use CASE WHEN with LIKE patterns to match keywords in the section text. You saw these keywords when profiling: продаж (sale), оренда/аренда (rental), посуточн (daily).
CASE WHEN stops at the first match. If you check for '%оренда%' before '%посуточн%', daily rentals will be classified as regular rentals — because “посуточна оренда” contains “оренда”. Check the more specific pattern first.
Write a query that shows each distinct section value alongside your computed deal_type and the row count. Verify that the classification makes sense for the top 20 values.
What you should be able to answer: How many listings are in each deal type category? What percentage ends up as “other”?
Save your deal type classification logic for use in the final pipeline:
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_deal_types` AS
SELECT
section,
CASE
WHEN LOWER(section) LIKE '%посуточн%' THEN 'daily_rental'
WHEN LOWER(section) LIKE '%оренда%' OR LOWER(section) LIKE '%аренда%' THEN 'rental'
WHEN LOWER(section) LIKE '%продаж%' THEN 'sale'
ELSE 'other'
END AS deal_type,
COUNT(*) AS listing_count
FROM `econ250-2026.course_data.flatfy_2019`
GROUP BY section, deal_type
ORDER BY listing_count DESC

Part 2: Standardize regions
Your profiling showed that the same region appears under multiple spellings (Russian full name, Russian abbreviated, Ukrainian, shortened). Write a CASE WHEN that maps the variants to a single canonical form.
You don’t need to cover every region — focus on the top 5-6 by volume. Use your profiling results from Practice 1 Part 4 to identify which variants to consolidate. The ELSE clause keeps everything else unchanged.
Test your mapping by running it with GROUP BY + COUNT(*) on the result — verify that what was previously 4 rows for Kyiv region is now 1.
Save your region standardization logic for use in the final pipeline:
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_region_standard` AS
SELECT
SPLIT(geo, ',')[SAFE_OFFSET(0)] AS region_raw,
CASE
WHEN SPLIT(geo, ',')[SAFE_OFFSET(0)] IN ('Київська обл.', 'Киевская обл.', 'Київська', 'Киевская') THEN 'Kyiv Oblast'
WHEN SPLIT(geo, ',')[SAFE_OFFSET(0)] IN ('Одеська обл.', 'Одесская обл.') THEN 'Odesa Oblast'
-- Add more mappings for top regions
ELSE SPLIT(geo, ',')[SAFE_OFFSET(0)]
END AS region_clean,
COUNT(*) AS listing_count
FROM `econ250-2026.course_data.flatfy_2019`
WHERE geo IS NOT NULL
GROUP BY region_raw, region_clean
ORDER BY listing_count DESC

Part 3: Fix types
Convert ceiling_height from STRING to FLOAT64 using SAFE_CAST. Your profiling already showed how many non-NULL values are valid numbers, so this is a simple step.
Part 4: Normalize prices to a common currency
Prices are in three different currencies (USD, UAH, EUR). To make them comparable, convert to a common currency using exchange rates.
Explore the exchange rate table
There’s a second table available: econ250-2026.course_data.week2_exchange_rates — daily exchange rates from the National Bank of Ukraine (NBU). Take a look at it:
SELECT *
FROM `econ250-2026.course_data.week2_exchange_rates`
LIMIT 10

What you should notice:
- The `date` column is a STRING in `DD.MM.YYYY` format — not a proper DATE type
- The `rate` column is the UAH equivalent for the number of units specified in the `units` column (e.g., rate = 2412 for 100 USD means 1 USD = 24.12 UAH)
- There are rows for both USD and EUR, one per day
The rates CTE (provided)
This CTE parses the date strings and pivots the table so each date has one row with both USD and EUR rates as separate columns. Use it as-is in your pipeline:
cte_rates AS (
SELECT
PARSE_DATE('%d.%m.%Y', date) AS rate_date,
MAX(IF(currency = 'USD', rate / units, NULL)) AS uah_per_usd,
MAX(IF(currency = 'EUR', rate / units, NULL)) AS uah_per_eur
FROM `econ250-2026.course_data.week2_exchange_rates`
GROUP BY rate_date
)

Try it standalone first: `WITH cte_rates AS (...) SELECT * FROM cte_rates LIMIT 10`.
Join and convert
Now join cte_rates with the listings on the listing date. Use DATE(insert_time) to extract the date from the listing timestamp, and match it to rate_date.
The conversion logic:
- USD listings: price is already in USD — keep as-is
- UAH listings: divide by `uah_per_usd` to get USD
- EUR listings: multiply by `uah_per_eur` (to get UAH), then divide by `uah_per_usd` (to get USD)
Write a CASE WHEN on the currency column that applies the right formula for each case. Use ROUND(..., 2) to keep results clean.
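A sketch of this expression, assuming the join aliases `s` (listings) and `r` (rates) used in the Part 6 pipeline, and assuming the currency codes are the three-letter strings you saw when profiling in Practice 1 Part 3:

```sql
ROUND(
  CASE s.currency
    WHEN 'USD' THEN s.price                                  -- already USD
    WHEN 'UAH' THEN s.price / r.uah_per_usd                  -- UAH -> USD
    WHEN 'EUR' THEN s.price * r.uah_per_eur / r.uah_per_usd  -- EUR -> UAH -> USD
    ELSE NULL
  END, 2) AS price_usd
```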
Keep the original price and currency columns alongside the new price_usd. The original currency tells you something about the listing — USD pricing is more common for sales, UAH for rentals. Dropping it loses that signal.
What you should be able to answer: After conversion, what’s the average price in USD for the entire dataset? How does it compare to the meaningless average from Practice 1?
Part 5: Filter outliers
Now that prices are in a common currency, your outlier filters are more meaningful. Define reasonable ranges for price_usd and area_total.
Consider:
- What’s a reasonable minimum price in USD? (A listing priced at $1 or $2 is clearly a placeholder.)
- What’s a reasonable maximum? (Where does “expensive property” end and “data entry error” begin?)
- For areas: what’s physically plausible for the largest property in this dataset?
Use your findings from Practice 1 Part 8 to set boundaries. A starting point: filter prices below $5-10 and above $2-5M. For areas, think about physically plausible upper bounds. There’s no single correct threshold — the goal is to remove obvious junk.
Be careful with NULL columns — if area_total is NULL, you probably want to keep the row (some listings have no area information — check your Part 2 completeness results). The pattern area_total IS NULL OR area_total < threshold keeps NULLs while filtering bad values.
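Putting both ideas together, a sketch of the filter clause — the thresholds here are example values only; justify your own from your Part 8 profiling:

```sql
-- Example thresholds only; pick and defend your own boundaries
WHERE price_usd BETWEEN 10 AND 5000000
  AND (area_total IS NULL OR area_total BETWEEN 5 AND 2000)
```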
Part 6: Build the full CTE pipeline
Combine all your cleaning steps into a single query. Fill in the ??? placeholders with expressions from Parts 1-5:
WITH cte_rates AS (
-- PROVIDED — paste the exchange rates CTE from Part 4
SELECT
PARSE_DATE('%d.%m.%Y', date) AS rate_date,
MAX(IF(currency = 'USD', rate / units, NULL)) AS uah_per_usd,
MAX(IF(currency = 'EUR', rate / units, NULL)) AS uah_per_eur
FROM `econ250-2026.course_data.week2_exchange_rates`
GROUP BY rate_date
),
cte_parsed AS (
-- TODO: extract region and city from geo using SPLIT
-- TODO: classify deal_type from section using CASE WHEN
-- (remember: check посуточн before оренда — order matters)
-- TODO: convert ceiling_height to FLOAT64 using SAFE_CAST
SELECT
*,
??? AS region_raw,
??? AS city_raw,
??? AS deal_type,
??? AS ceiling_height_num
FROM `econ250-2026.course_data.flatfy_2019`
),
cte_standardized AS (
-- TODO: map region name variants to canonical names
-- * EXCEPT(region_raw) selects all columns except region_raw —
-- we're replacing it with the cleaned version region_clean
SELECT
* EXCEPT(region_raw),
??? AS region_clean
FROM cte_parsed
),
cte_with_prices AS (
-- TODO: join with cte_rates on date
-- TODO: compute price_usd using CASE WHEN on currency
SELECT
s.*,
??? AS price_usd
FROM cte_standardized s
LEFT JOIN cte_rates r ON ???
),
cte_filtered AS (
-- TODO: filter out sentinel values and outliers
SELECT *
FROM cte_with_prices
WHERE ???
)
SELECT
int64_field_0 AS listing_id,
price,
currency,
price_usd,
deal_type,
room_count,
region_clean AS region,
city_raw AS city,
floor,
floor_count,
area_total,
area_kitchen,
area_living,
ceiling_height_num AS ceiling_height,
insert_time,
download_time
FROM cte_filtered

- Missing comma: CTEs are separated by commas — don't forget the comma after each closing `)`
- Column name conflicts: if you create `region_clean` in one CTE and reference it in the next, make sure the name matches exactly
- `cte_rates` is independent: it reads from the exchange rate table, not from `cte_parsed` — it gets joined in `cte_with_prices`
Run the query. Check how many rows survive. Spot-check a few rows to make sure the cleaning looks right.
Part 7: Save as a view ⭐ SUBMISSION ITEM 2
This part creates Submission Item 2: the week2_cleaned_listings view. See the Your Submission section at the top for requirements.
Once your pipeline produces reasonable results, save it as a view:
CREATE OR REPLACE VIEW `econ250-2026.student__yourname.week2_cleaned_listings` AS
-- Your full CTE pipeline here

Replace yourname with your actual dataset suffix.
Verify it works
Query your view to check that the results make sense:
- Group by `region` and `deal_type` — you should see clean, consolidated groups without the variant fragmentation from profiling
- Check that `price_usd` values look reasonable for different deal types (sales in the tens of thousands, rentals in the hundreds)
- Verify that the original `currency` column is still present
Going Further (optional)
- Add a city cleaning step (“Киев”/“Київ” → single name), similar to regions
- Extract property type from `section` (квартира = apartment, будинок = house, ділянка = land plot → `property_type` column)
- Compute average `price_usd` per m² by region for apartment sales only
- Group by `currency` and `deal_type` — are sales more likely in USD and rentals in UAH?
Troubleshooting
| Problem | Solution |
|---|---|
| SPLIT returns NULL | The geo column itself may be NULL — filter with WHERE geo IS NOT NULL |
| CASE WHEN doesn’t match expected values | Check for leading/trailing spaces: try wrapping with TRIM(...) |
| JOIN produces more rows than expected | Make sure the rates CTE has one row per date (use GROUP BY) |
| PARSE_DATE fails | Verify the format string matches the data: '%d.%m.%Y' for DD.MM.YYYY |
| View creation fails | Verify you replaced yourname with your actual dataset suffix |
| “Table not found” | Use the full path: econ250-2026.course_data.flatfy_2019 |
| SAFE_CAST returns all NULLs | Double-check the column name — are you casting the right column? |
| CTE syntax error | Check for missing commas between CTEs and that each CTE reads from the correct upstream CTE |
SAFE_CAST vs CAST reminder: CAST crashes the entire query if any single value can’t convert. SAFE_CAST returns NULL for invalid values and lets the query finish. Always use SAFE_CAST when working with data you haven’t fully validated.
BigQuery SQL reference: The Standard SQL functions reference is a useful bookmark for looking up function syntax and examples.