BigQuery Introduction & SQL Foundations

ECON250 - Big Data Analytics | Week 1

Oleh Omelchenko

2026-01-01

Course Introduction

What is this course about?


Analytical SQL — not database administration

Cloud-scale thinking — not local tools

Business questions → SQL → Insights

What this course is NOT


We cover                    | We don’t cover
Writing analytical queries  | Building databases
BigQuery, cloud warehouses  | PostgreSQL administration
Deriving insights from data | Building data pipelines
SQL patterns for analysis   | Software engineering

Why BigQuery?

Practical reasons

  • Cloud scale, no infrastructure
  • Free sandbox for learning
  • Industry standard SQL

Learning reasons

  • Immediate feedback loop
  • Skills transfer everywhere
  • Focus on analysis, not setup

Course structure


8 weeks:

1-2: Foundations (BigQuery, data types, cost awareness)

3-4: Core patterns (aggregations, window functions)

5-6: Advanced analysis (cohorts, complex structures)

7-8: Integration & projects

Weekly rhythm


Day   | Session             | Focus
Day 1 | Lecture (80 min)    | Concepts, demonstrations
Day 2 | Practice 1 (80 min) | Hands-on exercises
Day 2 | Practice 2 (80 min) | Continued work, submission

Assessment overview


Component            | Points | Notes
Weekly quizzes       | 12     | Weeks 2-7, at lecture start
Practice submissions | 14     | Satisfactory completion
Assignments          | 50     | 5 assignments, individual
Group project        | 35     | Groups of 3-4

The key message


Practical skills


You’ll leave with abilities you can immediately apply in internships and jobs.

What Makes Data “Big”?

A quick question


Raise your hand if you’ve ever had a file that was too big to analyze on your laptop.

A typical laptop’s limits


RAM: 8-16 GB (where analysis happens)

Storage: 256 GB - 1 TB (where files sit)

CPUs: 4-8 cores


A “big” Excel file: ~1M rows × 20 columns ≈ 200 MB

What happens when data doesn’t fit?


Your dataset is 50 GB. Your laptop has 16 GB RAM.

Now what?

“Just buy a bigger computer?”


Scale | Example                       | Fits on laptop?
MB    | Course grades spreadsheet     | ✓ Easy
GB    | University database           | ✓ Manageable
TB    | Large e-commerce transactions | ✗ Difficult
PB    | Spotify, Netflix              | ✗ Impossible

The scale of “big”


1 PB = 1,000,000 GB


At some point, buying a bigger computer stops working.

This is called: VOLUME


The size dimension of big data

When data is too large for a single machine:

  • Can’t load into memory
  • Can’t store on one disk
  • Can’t process in reasonable time

Solution: Distribute across many machines

Now imagine something different…


Rozetka on Black Friday

Millions of users. Thousands of orders per minute.


What challenges arise when data arrives faster than you can process it?

Speed matters differently


Processing type | Latency      | Example
Batch           | Hours-days   | Monthly reports
Near-real-time  | Minutes      | Dashboard updates
Real-time       | Milliseconds | Fraud detection

This is called: VELOCITY


The speed dimension of big data

When data arrives faster than batch processing allows:

  • Can’t wait for nightly jobs
  • Need streaming infrastructure
  • Different tools entirely (Kafka, etc.)


Most analytics (including this course) uses batch processing.

One more scenario…


What if your data isn’t a nice table?

Think about what companies actually store:

  • Customer support chat logs
  • Product images
  • Sensor readings from IoT devices
  • JSON from mobile apps

Data comes in many forms


Structured

  • Tables with defined columns
  • SQL databases
  • Clear schema

Semi/Unstructured

  • JSON, XML
  • Text documents
  • Images, video, audio


BigQuery handles structured and semi-structured (JSON, arrays).
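
For example, a repeated (array) column can be flattened with UNNEST, and a field can be pulled out of a JSON string with JSON_VALUE. A minimal sketch against a hypothetical events table (the table and columns are illustrative, not a course dataset):

-- Hypothetical events table: tags is an ARRAY column, payload is a JSON string
SELECT
  event_id,
  tag,
  JSON_VALUE(payload, '$.device.os') AS device_os
FROM events,
UNNEST(tags) AS tag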

This is called: VARIETY


The structure dimension of big data

When data isn’t rows and columns:

  • Need flexible schemas
  • Different storage approaches
  • Specialized processing tools

The 3 Vs Framework


VOLUME

Size dimension

“Data too big for one machine”

VELOCITY

Speed dimension

“Data arriving too fast”

VARIETY

Structure dimension

“Data that isn’t tables”

Real-world scale


Company  | Scale                  | Primary Challenge
Spotify  | 100+ PB, 600M users    | Volume + Velocity
Netflix  | Billions of events/day | Velocity
Google   | Exabytes               | All three
Monobank | Millions of txns/day   | Velocity (fraud)

The analyst’s reality


Data is getting bigger everywhere

  • Traditional industries adopting analytics
  • More sensors, more tracking, more logs
  • Economic research using satellite data, mobile data


You need to know how to work beyond your laptop

Where BigQuery Fits

The Modern Data Stack


BigQuery sits at: Storage + Transform + Analysis

Why start here: Most immediately applicable skill

Data Warehouse vs Data Lake


Data Warehouse          | Data Lake
Structured, cleaned     | Raw files
Optimized for analytics | Store everything
Schema-on-write         | Schema-on-read
BigQuery, Snowflake     | S3, GCS


For this course: We work in the warehouse layer.

The job landscape


Role               | Primary focus
Data Analyst       | Analysis + Consumption
Analytics Engineer | Transform + Analysis
Data Engineer      | Ingestion + Storage + Transform
Data Scientist     | Analysis + ML


This course: Data Analyst skills. You’ll know where everything else fits.

What we’re NOT covering


  • Hadoop, Spark (distributed processing frameworks)
  • Kafka (streaming)
  • dbt (transformation tool)
  • ML pipelines


Why? Different tools for different problems. SQL skills transfer everywhere — that’s the foundation.

The key message


SQL = Lingua franca of data


SQL skills compound. Every data tool speaks SQL.

SQL as an Analytical Tool

Let’s start with some questions


Imagine you work at an e-commerce company. Your CEO asks:


“How many customers purchased again this month?”


“What’s our revenue trend by region over 3 years?”


“Are we retaining our customers?”

These questions are hard


Not because the SQL syntax is complex…


But because they require:

  • Aggregating millions of rows
  • Comparing time periods
  • Tracking the same customers over time
  • Calculating percentages and trends

The database you learned SQL on


Designed for applications:

  • “Get this customer’s order”
  • “Update this user’s email”
  • “Insert a new transaction”


Optimized for: Finding one needle in the haystack

But these questions are different


They ask about the whole haystack:

  • “What patterns exist across all customers?”
  • “How does behavior change over time?”
  • “What’s the trend, not just the current state?”


This requires a different kind of database — and a different mindset.

Two mindsets for SQL


Transactional (OLTP)        | Analytical (OLAP)
Support applications        | Support decisions
Single-row lookups          | Aggregate millions of rows
“Get this customer’s order” | “What are our top products?”
CRUD operations             | Read-heavy analysis
Normalized schemas          | Denormalized for reading
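
To make the contrast concrete, here is a minimal sketch against a hypothetical orders table (table and column names are illustrative, not tied to any course dataset):

-- Transactional mindset: fetch one specific row for an application
SELECT order_id, status
FROM orders
WHERE order_id = 1042;

-- Analytical mindset: scan and summarize all rows to support a decision
SELECT category, SUM(amount) AS total_revenue
FROM orders
GROUP BY category
ORDER BY total_revenue DESC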

The mental shift


Transactional

Find a needle

What happened?

Analytical

Understand the haystack

Why did it happen?

What will happen?

The analytical toolkit


To answer business questions at scale, you need:

  1. Aggregations — collapsing rows into insights
  2. Window functions — calculations across rows without collapsing
  3. Time-based patterns — trends, seasonality, comparisons

Tool 1: Aggregations


Collapsing rows into insights


-- What's our revenue by product category per month?
SELECT
  category,
  DATE_TRUNC(order_date, MONTH) AS month,
  SUM(amount) AS revenue
FROM orders
GROUP BY category, month
ORDER BY month, revenue DESC

Tool 2: Window functions


Calculations across rows without collapsing


-- What percentage of total sales does each product represent?
SELECT
  product_name,
  amount,
  amount / SUM(amount) OVER () AS pct_of_total,
  SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders

Tool 3: Time-based analysis


Trends, seasonality, comparisons


-- Week-over-week growth rate
SELECT
  week,
  revenue,
  LAG(revenue) OVER (ORDER BY week) AS prev_week,
  (revenue - LAG(revenue) OVER (ORDER BY week)) /
    LAG(revenue) OVER (ORDER BY week) AS growth_rate
FROM weekly_sales
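
One caveat: if a previous week’s revenue is 0, the division above raises an error. BigQuery’s SAFE_DIVIDE returns NULL in that case, so a more defensive variant of the same query is:

-- Same logic, but SAFE_DIVIDE returns NULL instead of erroring on division by zero
SELECT
  week,
  revenue,
  LAG(revenue) OVER (ORDER BY week) AS prev_week,
  SAFE_DIVIDE(revenue - LAG(revenue) OVER (ORDER BY week),
              LAG(revenue) OVER (ORDER BY week)) AS growth_rate
FROM weekly_sales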

Back to the CEO’s question


“Are we retaining customers?”


  1. What does “retaining” mean? → Customers who purchased again
  2. What data do we need? → Customer IDs, purchase dates
  3. What patterns? → Window functions, date math, aggregations

The query structure

-- Step 1: Find when each customer first purchased
WITH customer_first_purchase AS (
  SELECT customer_id, MIN(order_date) AS first_date
  FROM orders GROUP BY customer_id
),
-- Step 2: Track activity relative to first purchase
customer_activity AS (
  SELECT ... -- join and calculate months since first
)
-- Step 3: Aggregate into cohort retention
SELECT
  cohort_month, months_since_first,
  COUNT(DISTINCT customer_id) AS retained
FROM customer_activity
GROUP BY 1, 2
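
For reference, here is one way Step 2 could be filled in. This is a sketch that assumes orders has a customer_id and a DATE column order_date; your own solution may look different:

-- Sketch of the full cohort query (assumed columns: customer_id, order_date)
WITH customer_first_purchase AS (
  SELECT customer_id, MIN(order_date) AS first_date
  FROM orders
  GROUP BY customer_id
),
customer_activity AS (
  SELECT
    o.customer_id,
    DATE_TRUNC(f.first_date, MONTH) AS cohort_month,
    DATE_DIFF(DATE_TRUNC(o.order_date, MONTH),
              DATE_TRUNC(f.first_date, MONTH), MONTH) AS months_since_first
  FROM orders AS o
  JOIN customer_first_purchase AS f USING (customer_id)
)
SELECT
  cohort_month,
  months_since_first,
  COUNT(DISTINCT customer_id) AS retained
FROM customer_activity
GROUP BY 1, 2
ORDER BY 1, 2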

The validation step


Always ask:

  • Does the number make sense?
  • What are the edge cases?
  • What happens with NULLs?
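
One quick habit: profile the underlying data before trusting the result. A minimal sketch, using the columns assumed in the examples above:

-- Quick sanity checks (assumed columns: customer_id, order_date)
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT customer_id) AS distinct_customers,
  COUNTIF(customer_id IS NULL) AS null_customer_ids,
  MIN(order_date) AS earliest_order,
  MAX(order_date) AS latest_order
FROM orders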


SQL can answer these questions. That’s what this course teaches.

What you’ll learn to build


Weeks 2-4

  • Efficient queries
  • Data type handling
  • Aggregations & grouping
  • Window functions

Weeks 5-7

  • Cohort analysis
  • Time series patterns
  • Complex structures
  • Real-world workflows

Q&A + Practice Preview

Questions?


  • About the course structure?
  • About what “big data” means?
  • About where we’re heading?

What’s next: Practice sessions


Practice 1 (80 min):

  • BigQuery console setup
  • First queries
  • Understanding costs


Practice 2 (80 min):

  • Personal datasets
  • Tables and views
  • Partitioning basics + submission

Before practice sessions


Bring your laptop

Make sure you can access:

console.cloud.google.com/bigquery

(Use your KSE Google account)


Project: econ250-2026

See you in practice!


Questions?


o_omelchenko@kse.org.ua