AGENTUPDATE JOURNAL

1000usdinchina.com Dev Retrospective (3) - Travel Data ETL: 4.5 GB of Raw Data to Clean City JSON

1000usdinchina.com Dev Retrospective (3) - Travel Data ETL: 4.5 GB of Raw Data to Clean City JSON
Table of Contents

The numbers in 1000usdinchina.com are only as good as the data behind them. Behind 100 cities sits a travel data ETL pipeline that ingests roughly 4.5 GB of raw sources — ~443K hotels, restaurant data for 60 cities, metro geometry for 57 cities, nationwide attractions, high-speed rail and flights — and emits one clean, compliant JSON file per city.

This is post 3 of the series.

Table of contents

What the raw data looks like

Raw travel data is big, messy, and inconsistent:

  • A single ~443K-row hotel spreadsheet.
  • Restaurant CSVs totaling 4–5 GB across 60 cities.
  • Metro station/line geometry CSVs for 57 cities.
  • Nationwide attractions, plus high-speed-rail and flight spreadsheets.

None of it is import-ready. Encodings differ, columns drift, prices are missing, and the same city appears under three spellings. The pipeline's job is to make it boring and uniform.

flowchart LR
    subgraph Raw[Raw sources ~4.5 GB]
      H[Hotels .xlsx ~443K]
      F[Restaurant CSV 60 cities]
      M[Metro geometry 57 cities]
      A[Attractions / rail / flights]
    end
    Raw --> X[Extract]
    X --> C[Clean / normalize]
    C --> G[Aggregate]
    G --> O[(data/cities/{slug}/*.json)]
    O --> V[etl-validate]
    V -->|fail| C
    V -->|pass| Ship[Ship to git]

The pipeline, stage by stage

  1. Extract — read each source format (xlsx/csv) and pull only the fields the product needs.
  2. Clean / normalize — unify city slugs, fix encodings, coerce numeric prices, drop rows with no usable signal.
  3. Aggregate — collapse thousands of rows per city into a small set of summary figures: median meal price, lodging tiers, transit cost.
  4. Emit — write data/cities/{slug}/*.json (and a transit matrix). These aggregated outputs are committed to git; the raw source datasets never are — both for compliance and because gigabytes don't belong in a repo.

A crucial detail: the raw datasets are explicitly git-ignored. Only the aggregated products ship. The repository stays small and contains nothing license-encumbered.

The compliance red line

The most important rule in the whole pipeline is about what not to output. The food data file (food.json) emits only aggregates — meal price ranges, cuisine styles, signature dishes. It is forbidden from containing any shop-level fields: no restaurant name, shop ID, phone number, rating, review count, photo, or store URL.

Kept (aggregate) Dropped (shop-level)
meal_budget_cny, meal_mid_cny, meal_high_cny restaurant name
cuisine_styles shop ID / phone
signature_dishes rating / review count
photos / store URL

This isn't a nice-to-have. Aggregation is what makes the dataset usable and shareable without republishing someone else's per-shop listings. It's the difference between a data product and a scraped copy.

Validation as a closed loop

Because that red line is easy to cross by accident during a refactor, the pipeline ends in a validation step that runs every time the ETL changes. It checks the output schema and, critically, asserts that no forbidden shop-level field ever appears in food.json. If it fails, the data doesn't ship. ETL changes and validation are inseparable — you don't get to skip the self-check.

Handling data gaps honestly

Real datasets have holes, and pretending otherwise produces wrong budgets:

  • Missing meal prices for a handful of cities (e.g. Hong Kong, Macau, Taiwan) are left null and the export step skips them rather than inventing numbers.
  • Thin food coverage in some cities is filled by a reference multiplier off comparable cities, clearly modeled rather than guessed inline.
  • Missing attraction ticket prices were treated as a real risk: a null price silently counts as ¥0 and understates the budget, so prices were audited and patched rather than left to quietly bias the total low.

Surfacing gaps — instead of papering over them — is what keeps the estimate honest.

Key takeaways

  • A travel data ETL turns gigabytes of messy sources into small, uniform per-city JSON.
  • Commit the aggregated outputs, never the raw datasets — for compliance and repo hygiene.
  • Aggregate-only output (no shop-level fields) is the rule that makes the data a shareable product, enforced by an automated validation step.
  • Model data gaps explicitly; a silent null that reads as ¥0 quietly corrupts a budget.

FAQ

What is a travel data ETL pipeline? A process that Extracts raw travel sources (hotels, restaurants, transit), Transforms them by cleaning and aggregating, and Loads the result into a clean format — here, one JSON file per city.

Why aggregate instead of storing raw rows? Aggregation makes the data usable and compliant: it answers "what does a meal cost here?" without republishing individual restaurants' names, ratings, or contact details.

How are missing prices handled? They stay null and are skipped on export, or filled with a clearly modeled reference multiplier — never silently treated as zero, which would understate the budget.


Next → Hand-drawing an interactive 100-city SVG map of China