PROJECT NEXUS / 01
The Data Scale Engine
Software Engineer · One Red Maple · 2021 — 2024
A distributed web scraping and matching infrastructure that tracked 1M+ products across 300+ e-commerce stores, powering a consumer price-comparison product on web, mobile, and browser extension.
▣ THE PROBLEM
Marketplace listings and local-store catalogs describe the same physical product with titles that differ by 90%+ as strings: long, SEO-stuffed marketplace titles versus terse local names. Comparing prices at city scale meant continuously crawling hundreds of hostile, anti-bot-protected storefronts and resolving product identity across catalogs with no shared keys.
▣ THE APPROACH
Decompose the crawl into queue-fed, per-store workers behind rotating proxy pools; normalize and deduplicate everything into a single index; and resolve product identity with a purpose-trained ML model that fuses string features with image similarity. Every stage is independently scalable and independently restartable.
Flow: Crawl Orchestrator → Azure Service Bus → Scraper Fleet → Normalize + Dedup → ML Match Engine → Azure Search Index → Consumer Surfaces
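A minimal sketch of the queue-fed, per-store worker idea in the flow above. It is an illustration, not the production design: an in-process `queue.Queue` stands in for Azure Service Bus, and `route_jobs`, `drain`, and the job fields (`store`, `url`) are hypothetical names.

```python
import queue
from collections import defaultdict


def route_jobs(jobs):
    """Group crawl jobs into one FIFO queue per store (hypothetical job shape)."""
    queues = defaultdict(queue.Queue)
    for job in jobs:
        queues[job["store"]].put(job)
    return queues


def drain(store_queue, scrape):
    """Run one store's worker until its queue is empty and collect the results.

    `scrape` is any callable that takes a job; here it is injected so the
    worker can be restarted or swapped without touching the other stores.
    """
    results = []
    while True:
        try:
            job = store_queue.get_nowait()
        except queue.Empty:
            break
        results.append(scrape(job))
    return results
```

Because each store has its own queue, a failing or rate-limited store only stalls its own worker, which is one way to read "independently scalable and independently restartable".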
Anti-bot resilience as infrastructure
Rotating proxy pools, configured per e-commerce platform, kept large-scale collection running without getting blocked; reliability was engineered into the transport layer rather than patched per incident.
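One way a per-platform rotating pool could look, sketched under assumptions: round-robin rotation, a manual "blocked" flag, and hypothetical names (`ProxyPool`, `POOLS`, the platform keys, and the `.example` proxy addresses). The real rotation policy is not described here.

```python
from itertools import cycle


class ProxyPool:
    """Round-robin proxy rotation that skips proxies marked as blocked."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("a pool needs at least one proxy")
        self._proxies = list(proxies)
        self._blocked = set()
        self._cycle = cycle(self._proxies)

    def mark_blocked(self, proxy):
        """Take a proxy out of rotation, e.g. after an anti-bot response."""
        self._blocked.add(proxy)

    def next_proxy(self):
        """Return the next usable proxy, checking each proxy at most once."""
        for _ in range(len(self._proxies)):
            proxy = next(self._cycle)
            if proxy not in self._blocked:
                return proxy
        raise RuntimeError("all proxies in this pool are blocked")


# Hypothetical configuration: one pool per e-commerce platform.
POOLS = {
    "platform_a": ProxyPool(["http://proxy-a1.example:8080", "http://proxy-a2.example:8080"]),
    "platform_b": ProxyPool(["http://proxy-b1.example:8080"]),
}
```

A scraper would call `POOLS[platform].next_proxy()` before each request and `mark_blocked` when a response looks like a block page, so block handling lives in the transport layer instead of in each store's scraping logic.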
Identity across 90%+ string drift
Trained a product + image matching model that resolves identical items across wildly inconsistent naming, letting users find cheaper local alternatives to marketplace listings.
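To make the fusion idea concrete, here is a simplified stand-in, not the trained model: a fixed-weight blend of a text signal (token containment, which tolerates a long title wrapping a short one) and an image signal (cosine similarity of embedding vectors). The function names, the `w_text` weight, and the embeddings are assumptions for illustration.

```python
import math
import re


def tokens(title):
    """Lowercase alphanumeric tokens of a product title."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def token_containment(title_a, title_b):
    """Fraction of the shorter title's tokens that also appear in the longer one."""
    a, b = tokens(title_a), tokens(title_b)
    if len(a) > len(b):
        a, b = b, a
    return len(a & b) / len(a) if a else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def match_score(title_a, title_b, img_a, img_b, w_text=0.5):
    """Fixed-weight fusion of text and image similarity, in [0, 1] for non-negative embeddings."""
    text_sim = token_containment(title_a, title_b)
    image_sim = cosine(img_a, img_b)
    return w_text * text_sim + (1 - w_text) * image_sim
```

Containment rather than plain edit distance is the point of the sketch: a terse local name can be fully contained in an SEO-stuffed marketplace title even when the two strings differ in most characters. A trained model would learn the weighting instead of fixing it.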
Sub-100ms at million-product scale
A pipeline that normalized, deduplicated, and indexed 5,000+ records a day into Azure kept search responses under 100ms while the catalog grew past 1M tracked products.
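A rough sketch of what a normalize-then-deduplicate step can look like before indexing. The record fields (`store`, `title`, `scraped_at`) and the "keep the latest record per store and normalized title" rule are assumptions for illustration; the actual schema and dedup key are not given here.

```python
import re
import unicodedata


def normalize_title(title):
    """Unicode-normalize, lowercase, and collapse punctuation and whitespace."""
    text = unicodedata.normalize("NFKC", title).lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())


def dedupe(records):
    """Keep only the most recently scraped record per (store, normalized title)."""
    latest = {}
    for rec in records:
        key = (rec["store"], normalize_title(rec["title"]))
        if key not in latest or rec["scraped_at"] > latest[key]["scraped_at"]:
            latest[key] = rec
    return list(latest.values())
```

Collapsing duplicates before indexing keeps the index small relative to the raw crawl volume, which is one plausible contributor to stable query latency as the catalog grows.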
Funnel-driven product iteration
GA4 event tracking exposed a 23% onboarding drop-off; the redesign it motivated lifted conversions 11% on a 30K+ download, 4.3★ Flutter app.
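For readers unfamiliar with funnel analysis, step-to-step drop-off is simple arithmetic over event counts. The sketch below uses made-up step names and counts chosen only to show the calculation; they are not the app's real GA4 events or figures.

```python
def step_drop_off(funnel):
    """Return (from_step, to_step, drop_off_fraction) for each consecutive pair.

    `funnel` is an ordered list of (step_name, user_count) tuples.
    """
    result = []
    for (step_a, count_a), (step_b, count_b) in zip(funnel, funnel[1:]):
        drop = 1 - count_b / count_a if count_a else 0.0
        result.append((step_a, step_b, round(drop, 4)))
    return result


# Hypothetical counts, for illustration only.
example_funnel = [
    ("app_open", 1000),
    ("onboarding_start", 900),
    ("onboarding_complete", 693),
]
```

Calling `step_drop_off(example_funnel)` reports the fraction of users lost between each pair of steps, which is how a single weak step in a funnel stands out.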
1M+ products at peak
300+ stores crawled
5K+ records normalized daily
<100ms search response