1000usdinchina.com Dev Retrospective (3) - Travel Data ETL: 4.5 GB of Raw Data to Clean City JSON

Q: What is a travel data ETL pipeline?

A process that Extracts raw travel sources, Transforms them by cleaning and aggregating, and Loads the result into a clean format — here, one JSON file per city.

Q: How are missing prices handled?

They stay null and are skipped on export, or filled with a clearly modeled reference multiplier, never silently treated as zero, which would understate the budget.

The numbers in 1000usdinchina.com are only as good as the data behind them. Behind 100 cities sits a travel data ETL pipeline that ingests roughly 4.5 GB of raw sources — ~443K hotels, restaurant data for 60 cities, metro geometry for 57 cities, nationwide attractions, high-speed rail and flights — and emits one clean, compliant JSON file per city.

This is post 3 of the series.

What the raw data looks like
The pipeline, stage by stage
The compliance red line
Validation as a closed loop
Handling data gaps honestly
FAQ

What the raw data looks like

Raw travel data is big, messy, and inconsistent:

A single ~443K-row hotel spreadsheet.
Restaurant CSVs totaling 4–5 GB across 60 cities.
Metro station/line geometry CSVs for 57 cities.
Nationwide attractions, plus high-speed-rail and flight spreadsheets.

None of it is import-ready. Encodings differ, columns drift, prices are missing, and the same city appears under three spellings. The pipeline's job is to make it boring and uniform.

flowchart LR
    subgraph Raw[Raw sources ~4.5 GB]
      H[Hotels .xlsx ~443K]
      F[Restaurant CSV 60 cities]
      M[Metro geometry 57 cities]
      A[Attractions / rail / flights]
    end
    Raw --> X[Extract]
    X --> C[Clean / normalize]
    C --> G[Aggregate]
    G --> O[(data/cities/{slug}/*.json)]
    O --> V[etl-validate]
    V -->|fail| C
    V -->|pass| Ship[Ship to git]

The pipeline, stage by stage

Extract — read each source format (xlsx/csv) and pull only the fields the product needs.
Clean / normalize — unify city slugs, fix encodings, coerce numeric prices, drop rows with no usable signal.
Aggregate — collapse thousands of rows per city into a small set of summary figures: median meal price, lodging tiers, transit cost.
Emit — write data/cities/{slug}/*.json (and a transit matrix). These aggregated outputs are committed to git; the raw source datasets never are — both for compliance and because gigabytes don't belong in a repo.

A crucial detail: the raw datasets are explicitly git-ignored. Only the aggregated products ship. The repository stays small and contains nothing license-encumbered.

The compliance red line

The most important rule in the whole pipeline is about what not to output. The food data file (food.json) emits only aggregates — meal price ranges, cuisine styles, signature dishes. It is forbidden from containing any shop-level fields: no restaurant name, shop ID, phone number, rating, review count, photo, or store URL.

Kept (aggregate)	Dropped (shop-level)
`meal_budget_cny`, `meal_mid_cny`, `meal_high_cny`	restaurant name
`cuisine_styles`	shop ID / phone
`signature_dishes`	rating / review count
	photos / store URL

This isn't a nice-to-have. Aggregation is what makes the dataset usable and shareable without republishing someone else's per-shop listings. It's the difference between a data product and a scraped copy.

Validation as a closed loop

Because that red line is easy to cross by accident during a refactor, the pipeline ends in a validation step that runs every time the ETL changes. It checks the output schema and, critically, asserts that no forbidden shop-level field ever appears in food.json. If it fails, the data doesn't ship. ETL changes and validation are inseparable — you don't get to skip the self-check.

Handling data gaps honestly

Real datasets have holes, and pretending otherwise produces wrong budgets:

Missing meal prices for a handful of cities (e.g. Hong Kong, Macau, Taiwan) are left null and the export step skips them rather than inventing numbers.
Thin food coverage in some cities is filled by a reference multiplier off comparable cities, clearly modeled rather than guessed inline.
Missing attraction ticket prices were treated as a real risk: a null price silently counts as ¥0 and understates the budget, so prices were audited and patched rather than left to quietly bias the total low.

Surfacing gaps — instead of papering over them — is what keeps the estimate honest.

Key takeaways

A travel data ETL turns gigabytes of messy sources into small, uniform per-city JSON.
Commit the aggregated outputs, never the raw datasets — for compliance and repo hygiene.
Aggregate-only output (no shop-level fields) is the rule that makes the data a shareable product, enforced by an automated validation step.
Model data gaps explicitly; a silent null that reads as ¥0 quietly corrupts a budget.

FAQ

What is a travel data ETL pipeline? A process that Extracts raw travel sources (hotels, restaurants, transit), Transforms them by cleaning and aggregating, and Loads the result into a clean format — here, one JSON file per city.

Why aggregate instead of storing raw rows? Aggregation makes the data usable and compliant: it answers "what does a meal cost here?" without republishing individual restaurants' names, ratings, or contact details.

How are missing prices handled? They stay null and are skipped on export, or filled with a clearly modeled reference multiplier — never silently treated as zero, which would understate the budget.

Next → Hand-drawing an interactive 100-city SVG map of China