built a scalable batch pipeline that automatically tags manufacturing and packaging processes for cpg products across food & beverage, pet products, and beauty & healthcare. the system combines a canonical process registry, a knowledge graph, semantic embeddings, and llm inference to predict which processes (tempering, retort sterilization, form-fill-seal, etc.) apply to each product.

the data foundation merges three sources into a canonical registry of ~900 unique processes: a manufacturing_process database (~700 processes), expert-curated spreadsheets (~800 process names across 54 categories), and ~43,000 historically verified product-process mappings. expert names are normalized (abbreviation expansion, typo correction, plural resolution) and matched to master records by cosine similarity of their semantic embeddings, with three confidence tiers (HIGH ≥0.85 auto-accept, MEDIUM 0.70 to <0.85, LOW <0.70 manual review).
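
the tiering rule can be sketched as a small function. this is a minimal illustration, not the project's code: the thresholds come from the text above, while the function and constant names are hypothetical.

```python
# Thresholds are taken from the text; all names here are hypothetical.
HIGH_THRESHOLD = 0.85
MEDIUM_THRESHOLD = 0.70


def confidence_tier(similarity: float) -> str:
    """Map a cosine similarity score to one of the three confidence tiers."""
    if similarity >= HIGH_THRESHOLD:
        return "HIGH"      # auto-accept the match
    if similarity >= MEDIUM_THRESHOLD:
        return "MEDIUM"    # between the two thresholds
    return "LOW"           # route to manual review
```

a score of exactly 0.85 lands in HIGH here, following the "≥0.85" wording.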

semantic matching uses text-embedding-3-small (1536 dims) instead of fuzzy string matching. this is critical because "Breaking" and "Breading" are 88% string-similar but semantically different, while "Pasteurization" and "Pasteurizer" are only 85% string-similar but refer to the same process.
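
as a rough sketch of the matching step, the snippet below picks the master record whose vector is closest to a query vector by cosine similarity. the 3-d vectors are made-up stand-ins for real 1536-d embeddings, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_match(query_vec, master_vecs):
    """Return (name, score) for the most similar master record."""
    scored = ((name, cosine_similarity(query_vec, vec))
              for name, vec in master_vecs.items())
    return max(scored, key=lambda pair: pair[1])


# Toy vectors (illustrative values only, not real embeddings).
master = {
    "Pasteurization": [0.9, 0.1, 0.0],
    "Breading": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of the expert name "Pasteurizer"
name, score = best_match(query, master)
```

in a real run the vectors would come from the embedding model, and the resulting score would feed the confidence tiers.
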
the knowledge graph is a sql edge table with ~6,500–7,000 edges built in three passes: historical verified mappings aggregated to parent category level (~6,400), expert gap-fills (~200), and universal processes (Metal Detector, X-Ray, Palletizer, etc.) added to all 83 categories.
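
the three-pass build might look like the following sketch, using an in-memory sqlite database. the table name, column names, and the "Dry Dog Food" category are assumptions for illustration; the sample edges reuse process names mentioned in this write-up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE category_process_edge (
        category TEXT NOT NULL,
        process  TEXT NOT NULL,
        source   TEXT NOT NULL,
        PRIMARY KEY (category, process)
    )
    """
)


def add_edges(pairs, source):
    """Insert (category, process) edges; an existing edge keeps its first source."""
    conn.executemany(
        "INSERT OR IGNORE INTO category_process_edge VALUES (?, ?, ?)",
        [(category, process, source) for category, process in pairs],
    )


# Pass 1: historical verified mappings, aggregated to parent category.
add_edges([("Refrigerated Juices", "Filler"),
           ("Refrigerated Juices", "Labeler")], "historical")

# Pass 2: expert gap-fills (the duplicate Filler edge is ignored).
add_edges([("Refrigerated Juices", "High Pressure Processing"),
           ("Refrigerated Juices", "Filler")], "expert")

# Pass 3: universal processes attached to every category.
categories = ["Refrigerated Juices", "Dry Dog Food"]
universal = ["Metal Detector", "X-Ray", "Palletizer"]
add_edges([(c, p) for c in categories for p in universal], "universal")

edge_count = conn.execute(
    "SELECT COUNT(*) FROM category_process_edge"
).fetchone()[0]
```

running the passes in this order means an edge first seen in historical data keeps that provenance even if an expert or universal pass proposes it again.
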
the prediction pipeline batch-fetches products from sql, resolves parent categories, retrieves 25–50 candidate processes from the knowledge graph, then sends the product details and candidates to the llm. the model selects 5–15 processes with evidence-based reasoning and cannot invent processes outside the candidate list. an optional vision stage extracts packaging and manufacturing clues from product images.
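
one way to enforce the "no invented processes" rule is to validate the model's selection against the candidate list after inference. the text does not say how the constraint is actually enforced, so the sketch below is an assumption, with a hypothetical function name and a hypothetical out-of-list process ("Freeze Drying").

```python
def validate_selection(selected, candidates, min_count=5, max_count=15):
    """Split model output into allowed and rejected processes.

    Returns (kept, rejected, in_range), where in_range reports whether the
    number of kept processes falls inside the 5-15 window from the text.
    """
    allowed = set(candidates)
    kept, rejected = [], []
    for process in selected:
        if process not in allowed:
            rejected.append(process)   # not in the candidate list
        elif process not in kept:
            kept.append(process)       # keep order, drop duplicates
    in_range = min_count <= len(kept) <= max_count
    return kept, rejected, in_range


candidates = ["Mixer", "Cold Press", "High Pressure Processing",
              "Filler", "Labeler", "Palletizer"]
selected = ["Mixer", "Cold Press", "High Pressure Processing",
            "Filler", "Labeler", "Freeze Drying"]
kept, rejected, in_range = validate_selection(selected, candidates)
```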

for example, a Coldpress Strawberry & Banana Smoothie 250ml (Refrigerated Juices category, ingredients: apple juice, banana puree, strawberry puree, etc.) produces: lifecycle analysis — "refrigerated plastic-bottled smoothie requiring fruit puree/juice blending, cold non-thermal handling, and standard bottling operations"; pre-processing: Mixer; processing: Cold Press, High Pressure Processing; packaging: Filler, Labeler; avg confidence: 4.4/5. each product gets this structured breakdown with 5–15 processes split by manufacturing stage.
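
the per-product breakdown could be held in a structure like the one below. the class and field names are hypothetical; the values are copied from the smoothie example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProcessPrediction:
    """One product's predicted processes, grouped by manufacturing stage."""
    product_name: str
    category: str
    lifecycle_analysis: str
    stages: Dict[str, List[str]] = field(default_factory=dict)
    avg_confidence: float = 0.0

    def all_processes(self) -> List[str]:
        """Flatten the stage groups into a single ordered list."""
        return [p for processes in self.stages.values() for p in processes]


prediction = ProcessPrediction(
    product_name="Coldpress Strawberry & Banana Smoothie 250ml",
    category="Refrigerated Juices",
    lifecycle_analysis=(
        "refrigerated plastic-bottled smoothie requiring fruit puree/juice "
        "blending, cold non-thermal handling, and standard bottling operations"
    ),
    stages={
        "pre-processing": ["Mixer"],
        "processing": ["Cold Press", "High Pressure Processing"],
        "packaging": ["Filler", "Labeler"],
    },
    avg_confidence=4.4,
)
```
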
output uses a two-stage async pattern: batch creation submits requests to the inference api, and batch consumption (6–12h later) writes results to the database. full artifacts are persisted to s3 for auditability.
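
a minimal sketch of the two stages, assuming a client with submit and fetch methods. everything here is hypothetical: the fake client stands in for the real inference api, and plain dicts stand in for the database and the s3 bucket.

```python
import json


class FakeInferenceClient:
    """In-memory stand-in for a batch inference API (hypothetical)."""

    def __init__(self):
        self._batches = {}

    def submit(self, requests):
        batch_id = f"batch-{len(self._batches) + 1}"
        self._batches[batch_id] = requests
        return batch_id

    def fetch(self, batch_id):
        # A real API would return nothing until the batch finishes hours later.
        return [{"product_id": r["product_id"], "processes": []}
                for r in self._batches[batch_id]]


def create_batch(client, product_ids, pending):
    """Stage 1: submit requests and record the batch id for later pickup."""
    batch_id = client.submit([{"product_id": pid} for pid in product_ids])
    pending.append(batch_id)
    return batch_id


def consume_batches(client, pending, database, artifact_store):
    """Stage 2: run later; write results and archive the raw output."""
    for batch_id in list(pending):
        results = client.fetch(batch_id)
        if results is None:
            continue  # not ready yet; retry on the next run
        for row in results:
            database[row["product_id"]] = row["processes"]
        artifact_store[f"{batch_id}.json"] = json.dumps(results)
        pending.remove(batch_id)


client = FakeInferenceClient()
pending, database, artifacts = [], {}, {}
create_batch(client, [101, 102], pending)
consume_batches(client, pending, database, artifacts)
```

keeping the pending batch ids outside the process lets the consumption job run hours later without holding a connection open.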