The numbers in 1000usdinchina.com are only as good as the data behind them. Behind 100 cities sits a travel data ETL pipeline that ingests roughly 4.5 GB of raw sources — ~443K hotels, restaurant data for 60 cities, metro geometry for 57 cities, nationwide attractions, high-speed rail and flights — and emits one clean, compliant JSON file per city.
This is post 3 of the series.
Table of contents
- What the raw data looks like
- The pipeline, stage by stage
- The compliance red line
- Validation as a closed loop
- Handling data gaps honestly
- FAQ
What the raw data looks like
Raw travel data is big, messy, and inconsistent:
- A single ~443K-row hotel spreadsheet.
- Restaurant CSVs totaling 4–5 GB across 60 cities.
- Metro station/line geometry CSVs for 57 cities.
- Nationwide attractions, plus high-speed-rail and flight spreadsheets.
None of it is import-ready. Encodings differ, columns drift, prices are missing, and the same city appears under three spellings. The pipeline's job is to make it boring and uniform.
flowchart LR
subgraph Raw[Raw sources ~4.5 GB]
H[Hotels .xlsx ~443K]
F[Restaurant CSV 60 cities]
M[Metro geometry 57 cities]
A[Attractions / rail / flights]
end
Raw --> X[Extract]
X --> C[Clean / normalize]
C --> G[Aggregate]
G --> O[(data/cities/{slug}/*.json)]
O --> V[etl-validate]
V -->|fail| C
V -->|pass| Ship[Ship to git]The pipeline, stage by stage
- Extract — read each source format (xlsx/csv) and pull only the fields the product needs.
- Clean / normalize — unify city slugs, fix encodings, coerce numeric prices, drop rows with no usable signal.
- Aggregate — collapse thousands of rows per city into a small set of summary figures: median meal price, lodging tiers, transit cost.
- Emit — write
data/cities/{slug}/*.json(and a transit matrix). These aggregated outputs are committed to git; the raw source datasets never are — both for compliance and because gigabytes don't belong in a repo.
A crucial detail: the raw datasets are explicitly git-ignored. Only the aggregated products ship. The repository stays small and contains nothing license-encumbered.
The compliance red line
The most important rule in the whole pipeline is about what not to output. The food data
file (food.json) emits only aggregates — meal price ranges, cuisine styles, signature
dishes. It is forbidden from containing any shop-level fields: no restaurant name, shop
ID, phone number, rating, review count, photo, or store URL.
| Kept (aggregate) | Dropped (shop-level) |
|---|---|
meal_budget_cny, meal_mid_cny, meal_high_cny |
restaurant name |
cuisine_styles |
shop ID / phone |
signature_dishes |
rating / review count |
| photos / store URL |
This isn't a nice-to-have. Aggregation is what makes the dataset usable and shareable without republishing someone else's per-shop listings. It's the difference between a data product and a scraped copy.
Validation as a closed loop
Because that red line is easy to cross by accident during a refactor, the pipeline ends in a
validation step that runs every time the ETL changes. It checks the output schema and,
critically, asserts that no forbidden shop-level field ever appears in food.json. If it
fails, the data doesn't ship. ETL changes and validation are inseparable — you don't get to
skip the self-check.
Handling data gaps honestly
Real datasets have holes, and pretending otherwise produces wrong budgets:
- Missing meal prices for a handful of cities (e.g. Hong Kong, Macau, Taiwan) are left
nulland the export step skips them rather than inventing numbers. - Thin food coverage in some cities is filled by a reference multiplier off comparable cities, clearly modeled rather than guessed inline.
- Missing attraction ticket prices were treated as a real risk: a
nullprice silently counts as ¥0 and understates the budget, so prices were audited and patched rather than left to quietly bias the total low.
Surfacing gaps — instead of papering over them — is what keeps the estimate honest.
Key takeaways
- A travel data ETL turns gigabytes of messy sources into small, uniform per-city JSON.
- Commit the aggregated outputs, never the raw datasets — for compliance and repo hygiene.
- Aggregate-only output (no shop-level fields) is the rule that makes the data a shareable product, enforced by an automated validation step.
- Model data gaps explicitly; a silent
nullthat reads as ¥0 quietly corrupts a budget.
FAQ
What is a travel data ETL pipeline? A process that Extracts raw travel sources (hotels, restaurants, transit), Transforms them by cleaning and aggregating, and Loads the result into a clean format — here, one JSON file per city.
Why aggregate instead of storing raw rows? Aggregation makes the data usable and compliant: it answers "what does a meal cost here?" without republishing individual restaurants' names, ratings, or contact details.
How are missing prices handled?
They stay null and are skipped on export, or filled with a clearly modeled reference
multiplier — never silently treated as zero, which would understate the budget.
Next → Hand-drawing an interactive 100-city SVG map of China