2.7 Stage 4 — Canonical product matching
This stage collapses different surface forms of the same product into a single canonical identifier. For example:
COCA COLA 330ML KUTU
C.COLA 33CL TENEKE
COCA-COLA 0.33 L
COKA 330 ML
All four resolve to the same canonical_product_id. This resolution is a precondition for price memory and the B2B data product.
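One ingredient of this kind of matching is size normalisation: the four surface forms above write the same volume as 330ML, 33CL, 0.33 L, and 330 ML. The following is a minimal sketch of that single step, not the pipeline's actual resolver. The function name extract_size_ml, the regular expression, and the unit table are assumptions made for illustration.

```python
import re

# Hypothetical conversion factors to millilitres; not taken from the pipeline.
_TO_ML = {"ml": 1.0, "cl": 10.0, "l": 1000.0}

# A number (dot or comma decimal), optional whitespace, then a volume unit.
_SIZE_RE = re.compile(r"(\d+(?:[.,]\d+)?)\s*(ml|cl|l)\b", re.IGNORECASE)


def extract_size_ml(raw_text):
    """Return the first volume found in raw_text in millilitres, or None."""
    match = _SIZE_RE.search(raw_text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", "."))
    unit = match.group(2).lower()
    return round(value * _TO_ML[unit])


surface_forms = [
    "COCA COLA 330ML KUTU",
    "C.COLA 33CL TENEKE",
    "COCA-COLA 0.33 L",
    "COKA 330 ML",
]
sizes = [extract_size_ml(text) for text in surface_forms]
```

With this sketch, all four strings yield the same size in millilitres. Matching brand and package type would still need the embedding-based resolution described in the next subsection.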
Approach
Canonical resolution uses a multi-stage, embedding-based resolver with confidence-tiered disambiguation and a human review queue for ambiguous cases.
The exact similarity thresholds, embedding model, and disambiguation prompt are managed in the internal operations layer.
An unresolved line item is recorded with a null canonical reference. bINT for that line is calculated once the review queue has canonicalised it.
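The tiering idea can be sketched as follows. This is an illustration under stated assumptions, not the production resolver: the thresholds AUTO_ACCEPT and REVIEW_FLOOR, the tier names, the three-dimensional toy vectors, and the product identifiers are all hypothetical, and the real thresholds, embedding model, and disambiguation prompt are not published here.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds for illustration only; the real values are internal.
AUTO_ACCEPT = 0.90
REVIEW_FLOOR = 0.60


@dataclass
class Resolution:
    canonical_product_id: Optional[str]  # None models the null canonical reference
    tier: str                            # "auto", "disambiguate", or "queue"
    score: float


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve(query_vec, index):
    """Pick the nearest canonical product and assign a confidence tier."""
    best_id, best_score = None, 0.0
    for product_id, vec in index.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_id, best_score = product_id, score
    if best_id is not None and best_score >= AUTO_ACCEPT:
        return Resolution(best_id, "auto", best_score)
    if best_score >= REVIEW_FLOOR:
        # Middle tier: a disambiguation step would run here (omitted).
        return Resolution(None, "disambiguate", best_score)
    # Low confidence: leave the reference null and hand off to human review.
    return Resolution(None, "queue", best_score)


# Toy embeddings standing in for real model output.
index = {
    "prod_cola_330_can": [1.0, 0.0, 0.2],
    "prod_fanta_330_can": [0.0, 1.0, 0.2],
}
confident = resolve([0.98, 0.05, 0.2], index)
ambiguous = resolve([0.6, 0.6, 0.2], index)
unknown = resolve([0.0, 0.0, 1.0], index)
```

Only the top tier returns an identifier. The other two keep the canonical reference null, which matches how unresolved line items are recorded.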
Taxonomy structure
category > subcategory > brand > product > variant
Example:
Beverages > Carbonated Soft Drinks > Coca-Cola > Coca-Cola Classic > 330 ml can
Each canonical product carries normalised attributes: size_value, size_unit, package_type, brand_id, is_private_label, barcode_gtin (when available).
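A canonical product record with these attributes might be modelled as below. This is a sketch only: the class name CanonicalProduct, the taxonomy_path field, the field types, and the identifier values are assumptions, while the attribute names come from the list above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CanonicalProduct:
    canonical_product_id: str
    # category > subcategory > brand > product > variant
    taxonomy_path: Tuple[str, str, str, str, str]
    size_value: float
    size_unit: str
    package_type: str
    brand_id: str
    is_private_label: bool
    barcode_gtin: Optional[str] = None  # present only when available

    def path_string(self):
        """Render the taxonomy path in the 'a > b > c' notation."""
        return " > ".join(self.taxonomy_path)


example = CanonicalProduct(
    canonical_product_id="cp_000001",  # hypothetical identifier
    taxonomy_path=(
        "Beverages",
        "Carbonated Soft Drinks",
        "Coca-Cola",
        "Coca-Cola Classic",
        "330 ml can",
    ),
    size_value=330.0,
    size_unit="ml",
    package_type="can",
    brand_id="brand_coca_cola",  # hypothetical identifier
    is_private_label=False,
)
```

Making the record immutable (frozen=True) is a design choice for the sketch: a canonical entry is shared by many line items, so accidental in-place edits are worth preventing.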
Cold start
The canonical index is bootstrapped from open product datasets, licensed catalog partnerships, and seeded user uploads from the closed beta. The index grows organically as the canonicalisation queue is drained.
Pending canonicalisation queue
Ambiguous line items enter a review queue. The reviewer (initially the Yumo Yumo team, later a community pool earning PoC) either maps the raw text to an existing canonical product or creates a new one. This queue is a primary cost lever for the pipeline as it scales, and Section 08 lists it as a core operational risk.
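The two reviewer outcomes can be sketched as a single function that either maps a queued item to an existing canonical product or registers a new one. Everything here is hypothetical: the names LineItem and apply_review, the identifier format, the dictionary standing in for the canonical index, and the sample items.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineItem:
    raw_text: str
    canonical_product_id: Optional[str] = None  # null until resolved


def apply_review(item, index, existing_id=None, new_id=None, new_name=None):
    """Apply one reviewer decision: map to an existing product or create one."""
    if (existing_id is None) == (new_id is None):
        raise ValueError("choose exactly one of existing_id or new_id")
    if existing_id is not None:
        if existing_id not in index:
            raise KeyError(existing_id)
        item.canonical_product_id = existing_id
    else:
        index[new_id] = new_name  # the index grows as the queue is drained
        item.canonical_product_id = new_id
    return item


# Hypothetical index and queue contents.
index = {"cp_000001": "Coca-Cola Classic 330 ml can"}
queue = deque([LineItem("COKA 330 ML"), LineItem("AYRAN 200 ML")])

first = apply_review(queue.popleft(), index, existing_id="cp_000001")
second = apply_review(
    queue.popleft(), index, new_id="cp_000002", new_name="Ayran 200 ml"
)
```

Each drained item either reuses an entry or adds one, so the cost of the queue is driven by how many items fall below the auto-accept tier.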