BigQuery Introduction & SQL Foundations

ECON250 - Big Data Analytics | Week 1

Oleh Omelchenko

2026-01-01

Course Introduction

What is this course about?


Analytical SQL — not database administration

Cloud-scale thinking — not local tools

Business questions → SQL → Insights

What this course is NOT


We cover                    | We don’t cover
Writing analytical queries  | Building databases
BigQuery, cloud warehouses  | PostgreSQL administration
Deriving insights from data | Building data pipelines
SQL patterns for analysis   | Software engineering

Why BigQuery?

Practical reasons

  • Cloud scale, no infrastructure
  • Free sandbox for learning
  • Industry standard SQL

Learning reasons

  • Immediate feedback loop
  • Skills transfer everywhere
  • Focus on analysis, not setup

Course structure


8 weeks:

1-2: Foundations (BigQuery, data types, cost awareness)

3-4: Core patterns (aggregations, window functions)

5-6: Advanced analysis (cohorts, complex structures)

7-8: Integration & projects

Weekly rhythm


Day   | Session             | Focus
Day 1 | Lecture (80 min)    | Concepts, demonstrations
Day 2 | Practice 1 (80 min) | Hands-on exercises
Day 2 | Practice 2 (80 min) | Continued work, submission

Assessment overview


Component            | Points | Notes
Weekly quizzes       | 12     | Weeks 2-7, at lecture start
Practice submissions | 14     | Satisfactory completion
Assignments          | 50     | 5 assignments, individual
Group project        | 35     | Groups of 3-4

The key message


Practical skills


You’ll leave with abilities you can immediately apply in internships and jobs.

What Makes Data “Big”?

A quick question


Raise your hand if you’ve ever had a file that was too big to analyze on your laptop.

A typical laptop’s limits


RAM: 8-16 GB (where analysis happens)

Storage: 256 GB - 1 TB (where files sit)

CPUs: 4-8 cores


A “big” Excel file: ~1M rows × 20 columns ≈ 200 MB

What happens when data doesn’t fit?


Your dataset is 50 GB. Your laptop has 16 GB RAM.

Now what?

“Just buy a bigger computer?”


Scale | Example                       | Fits on laptop?
MB    | Course grades spreadsheet     | ✓ Easy
GB    | University database           | ✓ Manageable
TB    | Large e-commerce transactions | ✗ Difficult
PB    | Spotify, Netflix              | ✗ Impossible

The scale of “big”


1 PB = 1,000,000 GB


At some point, buying a bigger computer stops working.

This is called: VOLUME


The size dimension of big data

When data is too large for a single machine:

  • Can’t load into memory
  • Can’t store on one disk
  • Can’t process in reasonable time

Solution: Distribute across many machines

Now imagine something different…


Rozetka on Black Friday

Millions of users. Thousands of orders per minute.


What challenges arise when data arrives faster than you can process it?

Speed matters differently


Processing type | Latency      | Example
Batch           | Hours-days   | Monthly reports
Near-real-time  | Minutes      | Dashboard updates
Real-time       | Milliseconds | Fraud detection

This is called: VELOCITY


The speed dimension of big data

When data arrives faster than batch processing allows:

  • Can’t wait for nightly jobs
  • Need streaming infrastructure
  • Different tools entirely (Kafka, etc.)


Most analytics (including this course) uses batch processing.

One more scenario…


What if your data isn’t a nice table?

Think about what companies actually store:

  • Customer support chat logs
  • Product images
  • Sensor readings from IoT devices
  • JSON from mobile apps

Data comes in many forms


Structured

  • Tables with defined columns
  • SQL databases
  • Clear schema

Semi/Unstructured

  • JSON, XML
  • Text documents
  • Images, video, audio


BigQuery handles structured and semi-structured (JSON, arrays).
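
For example, a repeated (array) column can be flattened with UNNEST, and a field can be pulled out of a JSON string with JSON_VALUE. A minimal sketch against a hypothetical events table (the table and columns are illustrative, not a course dataset):

-- Hypothetical events table: tags is an ARRAY column, payload is a JSON string
SELECT
  event_id,
  tag,
  JSON_VALUE(payload, '$.device.os') AS device_os
FROM events,
UNNEST(tags) AS tag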

This is called: VARIETY


The structure dimension of big data

When data isn’t rows and columns:

  • Need flexible schemas
  • Different storage approaches
  • Specialized processing tools

The 3 Vs Framework


VOLUME

Size dimension

“Data too big for one machine”

VELOCITY

Speed dimension

“Data arriving too fast”

VARIETY

Structure dimension

“Data that isn’t tables”

Real-world scale


Company  | Scale                  | Primary Challenge
Spotify  | 100+ PB, 600M users    | Volume + Velocity
Netflix  | Billions of events/day | Velocity
Google   | Exabytes               | All three
Monobank | Millions of txns/day   | Velocity (fraud)

The analyst’s reality


Data is getting bigger everywhere

  • Traditional industries adopting analytics
  • More sensors, more tracking, more logs
  • Economic research using satellite data, mobile data


You need to know how to work beyond your laptop

Where BigQuery Fits

The Modern Data Stack


BigQuery sits at: Storage + Transform + Analysis

Why start here: Most immediately applicable skill

Data Warehouse vs Data Lake


Data Warehouse          | Data Lake
Structured, cleaned     | Raw files
Optimized for analytics | Store everything
Schema-on-write         | Schema-on-read
BigQuery, Snowflake     | S3, GCS


For this course: We work in the warehouse layer.

The job landscape


Role               | Primary focus
Data Analyst       | Analysis + Consumption
Analytics Engineer | Transform + Analysis
Data Engineer      | Ingestion + Storage + Transform
Data Scientist     | Analysis + ML


This course: Data Analyst skills. You’ll know where everything else fits.

What we’re NOT covering


  • Hadoop, Spark (distributed processing frameworks)
  • Kafka (streaming)
  • dbt (transformation tool)
  • ML pipelines


Why? Different tools for different problems. SQL skills transfer everywhere — that’s the foundation.

The key message


SQL = Lingua franca of data


SQL skills compound. Every data tool speaks SQL.

SQL as an Analytical Tool

Let’s start with some questions


Imagine you work at an e-commerce company. Your CEO asks:


“How many customers purchased again this month?”


“What’s our revenue trend by region over 3 years?”


“Are we retaining our customers?”

These questions are hard


Not because the SQL syntax is complex…


But because they require:

  • Aggregating millions of rows
  • Comparing time periods
  • Tracking the same customers over time
  • Calculating percentages and trends

The database you learned SQL on


Designed for applications:

  • “Get this customer’s order”
  • “Update this user’s email”
  • “Insert a new transaction”


Optimized for: Finding one needle in the haystack

But these questions are different


They ask about the whole haystack:

  • “What patterns exist across all customers?”
  • “How does behavior change over time?”
  • “What’s the trend, not just the current state?”


This requires a different kind of database — and a different mindset.

Two mindsets for SQL


Transactional (OLTP)        | Analytical (OLAP)
Support applications        | Support decisions
Single-row lookups          | Aggregate millions of rows
“Get this customer’s order” | “What are our top products?”
CRUD operations             | Read-heavy analysis
Normalized schemas          | Denormalized for reading
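
To make the contrast concrete, here is a minimal sketch against a hypothetical orders table (table and column names are illustrative, not tied to any course dataset):

-- Transactional mindset: fetch one specific row for an application
SELECT order_id, status
FROM orders
WHERE order_id = 1042;

-- Analytical mindset: scan and summarize all rows to support a decision
SELECT category, SUM(amount) AS total_revenue
FROM orders
GROUP BY category
ORDER BY total_revenue DESC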

The mental shift


Transactional

Find a needle

What happened?

Analytical

Understand the haystack

Why did it happen?

What will happen?

The analytical toolkit


To answer business questions at scale, you need:

  1. Aggregations — collapsing rows into insights
  2. Window functions — calculations across rows without collapsing
  3. Time-based patterns — trends, seasonality, comparisons

Tool 1: Aggregations


Collapsing rows into insights


-- What's our revenue by product category per month?
SELECT
  category,
  DATE_TRUNC(order_date, MONTH) AS month,
  SUM(amount) AS revenue
FROM orders
GROUP BY category, month
ORDER BY month, revenue DESC

Tool 2: Window functions


Calculations across rows without collapsing


-- What percentage of total sales does each product represent?
SELECT
  product_name,
  amount,
  amount / SUM(amount) OVER () AS pct_of_total,
  SUM(amount) OVER (ORDER BY order_date) AS running_total
FROM orders

Tool 3: Time-based analysis


Trends, seasonality, comparisons


-- Week-over-week growth rate
SELECT
  week,
  revenue,
  LAG(revenue) OVER (ORDER BY week) AS prev_week,
  (revenue - LAG(revenue) OVER (ORDER BY week)) /
    LAG(revenue) OVER (ORDER BY week) AS growth_rate
FROM weekly_sales
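
One caveat: if a previous week’s revenue is 0, the division above raises an error. BigQuery’s SAFE_DIVIDE returns NULL in that case, so a more defensive variant of the same query is:

-- Same logic, but SAFE_DIVIDE returns NULL instead of erroring on division by zero
SELECT
  week,
  revenue,
  LAG(revenue) OVER (ORDER BY week) AS prev_week,
  SAFE_DIVIDE(revenue - LAG(revenue) OVER (ORDER BY week),
              LAG(revenue) OVER (ORDER BY week)) AS growth_rate
FROM weekly_sales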

Back to the CEO’s question


“Are we retaining customers?”


  1. What does “retaining” mean? → Customers who purchased again
  2. What data do we need? → Customer IDs, purchase dates
  3. What patterns? → Window functions, date math, aggregations

The query structure

-- Step 1: Find when each customer first purchased
WITH customer_first_purchase AS (
  SELECT customer_id, MIN(order_date) AS first_date
  FROM orders GROUP BY customer_id
),
-- Step 2: Track activity relative to first purchase
customer_activity AS (
  SELECT ... -- join and calculate months since first
)
-- Step 3: Aggregate into cohort retention
SELECT
  cohort_month, months_since_first,
  COUNT(DISTINCT customer_id) AS retained
FROM customer_activity
GROUP BY 1, 2
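
For reference, here is one way Step 2 could be filled in. This is a sketch that assumes orders has a customer_id and a DATE column order_date; your own solution may look different:

-- Sketch of the full cohort query (assumed columns: customer_id, order_date)
WITH customer_first_purchase AS (
  SELECT customer_id, MIN(order_date) AS first_date
  FROM orders
  GROUP BY customer_id
),
customer_activity AS (
  SELECT
    o.customer_id,
    DATE_TRUNC(f.first_date, MONTH) AS cohort_month,
    DATE_DIFF(DATE_TRUNC(o.order_date, MONTH),
              DATE_TRUNC(f.first_date, MONTH), MONTH) AS months_since_first
  FROM orders AS o
  JOIN customer_first_purchase AS f USING (customer_id)
)
SELECT
  cohort_month,
  months_since_first,
  COUNT(DISTINCT customer_id) AS retained
FROM customer_activity
GROUP BY 1, 2
ORDER BY 1, 2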

The validation step


Always ask:

  • Does the number make sense?
  • What are the edge cases?
  • What happens with NULLs?
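
One quick habit: profile the underlying data before trusting the result. A minimal sketch, using the columns assumed in the examples above:

-- Quick sanity checks (assumed columns: customer_id, order_date)
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT customer_id) AS distinct_customers,
  COUNTIF(customer_id IS NULL) AS null_customer_ids,
  MIN(order_date) AS earliest_order,
  MAX(order_date) AS latest_order
FROM orders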


SQL can answer these questions. That’s what this course teaches.

What you’ll learn to build


Weeks 2-4

  • Efficient queries
  • Data type handling
  • Aggregations & grouping
  • Window functions

Weeks 5-7

  • Cohort analysis
  • Time series patterns
  • Complex structures
  • Real-world workflows

Q&A + Practice Preview

Questions?


  • About the course structure?
  • About what “big data” means?
  • About where we’re heading?

What’s next: Practice sessions


Practice 1 (80 min):

  • BigQuery console setup
  • First queries
  • Understanding costs


Practice 2 (80 min):

  • Personal datasets
  • Tables and views
  • Partitioning basics + submission

Before practice sessions


Bring your laptop

Make sure you can access:

console.cloud.google.com/bigquery

(Use your KSE Google account)


Project: econ250-2026

See you in practice!


Questions?


o_omelchenko@kse.org.ua