retail data orchestration pipeline

unified retail data platform for ingestion, validation, enrichment, and brand mapping

oct 2025

designed and built a unified orchestration pipeline at keychain by merging two previously separate ingestion and validation workflows into one coordinated data platform. the system handles supplier feeds and 8gb+ retail datasets end-to-end: file intake, profiling, schema validation, flattening, enrichment, ai-assisted brand mapping, and database/search sync.

retail data orchestration architecture
end-to-end orchestration: intake, validation, transformation, intelligence, and delivery

orchestration starts with sftp delta sync and metadata capture (filename, modified time, size), then routes each file through column profiling and pre-validation gates. these gates separate critical schema failures from warning-level quality issues, so invalid files are quarantined early while valid files continue automatically.
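the delta-sync step above can be sketched as a pure comparison of remote file metadata against a manifest of previously seen (mtime, size) pairs; the names `RemoteFile` and `delta_sync` are illustrative, not the actual keychain code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RemoteFile:
    name: str
    mtime: int   # remote modified time (epoch seconds)
    size: int    # remote size in bytes

def delta_sync(remote: list[RemoteFile], manifest: dict[str, tuple[int, int]]) -> list[RemoteFile]:
    """Return only files that are new, or whose (mtime, size) changed since the last run."""
    return [f for f in remote if manifest.get(f.name) != (f.mtime, f.size)]
```

in practice the `remote` list would come from an sftp listing (e.g. paramiko's `listdir_attr`), and the manifest would be persisted between runs so unchanged supplier files are never re-downloaded.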

validation gate design
quality-control gates: quarantine critical failures, log warnings, and continue clean records
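the gate logic reduces to a small decision function: missing required columns are critical (quarantine), sparse columns are warnings (log and continue). the required-column set and null-ratio threshold here are assumptions for illustration, not keychain's actual schema:

```python
REQUIRED = {"upc", "brand", "price"}   # assumed required schema, for illustration

def validate(columns: set[str], null_ratio: dict[str, float]):
    """Route a file: quarantine on critical schema failures, pass with warnings otherwise."""
    critical = sorted(REQUIRED - columns)                      # missing required columns
    warnings = [c for c, r in null_ratio.items() if r > 0.2]   # mostly-null columns
    status = "quarantine" if critical else "continue"
    return status, critical, warnings
```

separating severities this way means one bad supplier file never blocks the batch, while quality drift still leaves a trail in the logs.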

for large retail payloads, the flattener streams deeply nested product json into relational-friendly csv columns (~20 key attributes including brand/subbrand, identifiers, dimensions, nutrition, categories, and media) without loading full files into memory. this kept memory usage flat while the pipeline processed multi-gigabyte files and ~2.8 million records.

the brand intelligence stage combines deterministic retailer/sub-brand pattern rules with openai-assisted classification. retailer-brand pairs are normalized, scored for private-label confidence, and cached to avoid duplicate api calls. this hybrid approach improved consistency while keeping model usage efficient at scale.
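the rules-first, model-second flow with caching can be sketched as below. the pattern table and function names are hypothetical, and the model call is injected as a plain callable standing in for an openai request:

```python
import re
from functools import lru_cache

# illustrative private-label patterns, not keychain's actual rule set
PRIVATE_LABEL = {"kirkland": "costco", "great value": "walmart"}

def normalize(retailer: str, brand: str) -> tuple[str, str]:
    """Lowercase and collapse whitespace so pairs compare consistently."""
    clean = lambda s: re.sub(r"\s+", " ", s.strip().lower())
    return clean(retailer), clean(brand)

def make_classifier(model_call):
    """Deterministic rules first; fall back to the (cached) model only on misses."""
    @lru_cache(maxsize=None)
    def classify(retailer: str, brand: str) -> str:
        r, b = normalize(retailer, brand)
        if PRIVATE_LABEL.get(b) == r:
            return "private_label"      # rule hit: no api call needed
        return model_call(r, b)         # e.g. an openai chat completion
    return classify
```

the cache means each distinct retailer-brand pair costs at most one model call across the entire run, which is what keeps model usage flat as record volume grows.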

orchestration logic also includes checkpointed dry-runs before database mutation. every proposed brand reassignment is materialized to audit csvs first, then executed in safe batches with commit-per-batch, progress logging, and retry-friendly boundaries. this reduced risk during large backfills and made rollouts reviewable.

the pipeline also orchestrates modular processors for upc checksum validation (mod-10), price normalization, ingredient formatting, duplicate detection, and scientific-notation anomaly checks. outputs from these modules feed a normalized dataset used by postgresql ingestion and typesense sync.
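two of those processors are small enough to show directly. the upc-a mod-10 rule is standard (3x the odd-position digits plus the even-position digits, total divisible by 10); the scientific-notation check flags values leaked from spreadsheet exports. function names are illustrative:

```python
import re

def valid_upc(code: str) -> bool:
    """UPC-A mod-10 check: 3x odd-position digits + even-position digits + check digit ≡ 0 (mod 10)."""
    if len(code) != 12 or not code.isdigit():
        return False
    digits = [int(c) for c in code]
    total = 3 * sum(digits[0:11:2]) + sum(digits[1:11:2]) + digits[11]
    return total % 10 == 0

def sci_notation_anomaly(value: str) -> bool:
    """Flag values like '1.2E+11' that leak from spreadsheet exports into text fields."""
    return bool(re.fullmatch(r"\d+(\.\d+)?[eE][+-]?\d+", value.strip()))
```

catching a bad check digit or an exponent-mangled upc at this stage keeps corrupted identifiers out of both postgresql and the typesense index.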

execution and audit flow
safe rollout path: dry run, audit artifacts, batched execution, and downstream sync

end-to-end, this merged orchestration processed 788,851+ products with structured logging, periodic progress reporting, timestamped artifacts, and clear stage boundaries. the core orchestration contribution was turning separate scripts into a controlled, auditable pipeline where each stage had explicit inputs, outputs, and failure handling.
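the stage-boundary idea reduces to a tiny contract: each stage takes one input, produces one output, and failures stop the run at a named boundary. this is a minimal sketch of that shape, not the actual orchestrator:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]   # explicit input -> explicit output

def run_pipeline(stages: list[Stage], payload, log=print):
    """Run stages in order; any exception halts the run at an explicit, logged boundary."""
    for stage in stages:
        try:
            payload = stage.run(payload)
            log(f"[ok] {stage.name}")
        except Exception as exc:
            log(f"[fail] {stage.name}: {exc}")
            return None
    return payload
```

with every stage logging its own success or failure, a run that dies mid-way tells you exactly which boundary to resume from.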

python · pandas · openai api · postgresql · threading · sftp · typesense · loguru