Pipeline Name: Customer Data Sync Pipeline
Author: Matthew Healey
Version: 1.0.0
This pipeline demonstrates a typical enterprise ETL workflow pattern for synchronising customer data from a Snowflake data warehouse to Salesforce CRM. The design prioritises reliability, observability, and error handling.
Database: ANALYTICS_DB
Schema: CUSTOMER_SCHEMA
View: VW_ACTIVE_CUSTOMERS
The source view aggregates customer data from multiple upstream tables, pre-filtering for active accounts with recent activity. This ensures we only sync relevant, current customers.
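Extraction from the view can be done with the Snowflake Python connector. The sketch below is illustrative only; the connection details (environment-variable names, warehouse/role settings) are assumptions and not part of the pipeline definition.

```python
# Minimal extraction sketch, assuming snowflake-connector-python and
# environment-variable credentials (variable names here are illustrative).
import os
import snowflake.connector

def extract_active_customers():
    """Read all rows from the pre-filtered source view as dicts keyed by column name."""
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        database="ANALYTICS_DB",
        schema="CUSTOMER_SCHEMA",
    )
    try:
        cur = conn.cursor(snowflake.connector.DictCursor)
        cur.execute("SELECT * FROM VW_ACTIVE_CUSTOMERS")
        return cur.fetchall()
    finally:
        conn.close()
```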
Before proceeding with the sync, the pipeline validates incoming data against quality rules:
| Check | Condition | Severity |
|---|---|---|
| Email Not Null | EMAIL IS NOT NULL | ERROR (blocks pipeline) |
| Customer ID Unique | No duplicate IDs | ERROR (blocks pipeline) |
| Valid Segment | Segment in allowed values | WARNING (logged only) |
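A sketch of how these checks might be applied in code is shown below. The column names and the set of allowed segment values are assumptions for illustration; only the check/severity behaviour comes from the table above.

```python
# Illustrative validation sketch based on the quality-rule table.
# ALLOWED_SEGMENTS and the column names are assumptions, not confirmed by the pipeline.
import logging

ALLOWED_SEGMENTS = {"ENTERPRISE", "SMB", "CONSUMER"}  # hypothetical allowed values

def validate(rows):
    """Raise on ERROR-severity failures (blocking the run); log WARNING-severity ones."""
    errors, seen_ids = [], set()
    for row in rows:
        cid = row.get("CUSTOMER_ID")
        if row.get("EMAIL") is None:
            errors.append(f"NULL email for customer {cid}")
        if cid in seen_ids:
            errors.append(f"Duplicate customer ID {cid}")
        seen_ids.add(cid)
        if row.get("SEGMENT") not in ALLOWED_SEGMENTS:
            logging.warning("Unexpected segment %r for customer %s", row.get("SEGMENT"), cid)
    if errors:
        # ERROR severity: abort before anything is written downstream
        raise ValueError(f"{len(errors)} quality check failures, e.g. {errors[:5]}")
```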
Transforms prepare the data to meet Salesforce's schema requirements (field API names, formats, and types).
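A minimal mapping sketch follows. Apart from External_ID__c, the Salesforce field API names and the source column names are hypothetical and would need to match the actual org and view.

```python
# Hypothetical field mapping; only External_ID__c is documented, the other
# field and column names are assumptions for illustration.
def to_salesforce_account(row):
    """Map one warehouse row to a Salesforce Account payload."""
    return {
        "External_ID__c": row["CUSTOMER_ID"],
        "Name": row["CUSTOMER_NAME"],            # assumed source column
        "Email__c": row["EMAIL"],                # assumed custom field
        "Customer_Segment__c": row["SEGMENT"],   # assumed custom field
    }
```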
Data is staged to S3 as a checkpoint and for audit/recovery purposes:
Bucket: data-pipeline-staging
Path: salesforce-sync/customers/
Format: CSV (GZIP compressed)
Retention: 30 days
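The staging step could be implemented with boto3 along these lines; the per-run file naming under the documented prefix is an assumption, and the 30-day retention would typically be enforced by an S3 lifecycle rule rather than in code.

```python
# Staging sketch using boto3: write a GZIP-compressed CSV checkpoint to S3.
import csv
import gzip
import io
from datetime import date
import boto3

def stage_to_s3(records, bucket="data-pipeline-staging"):
    """Serialise records to CSV, gzip the payload, and upload it as a checkpoint."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=sorted(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    body = gzip.compress(text.getvalue().encode("utf-8"))
    # Assumed key layout: one file per run date under the documented prefix.
    key = f"salesforce-sync/customers/{date.today().isoformat()}.csv.gz"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body)
    return key
```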
Operation: UPSERT (insert new, update existing)
Object: Account
External ID Field: External_ID__c
Batch Size: 200 records per API call
Using upsert with an external ID ensures idempotency — the pipeline can be safely re-run without creating duplicates.
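One way to implement this, consistent with the 200-records-per-call batch size, is the Salesforce sObject Collections REST API, which upserts up to 200 records per PATCH request keyed on an external ID field. The API version, instance URL, and access-token handling below are assumptions.

```python
# Upsert sketch against the sObject Collections REST endpoint
# (PATCH .../composite/sobjects/Account/External_ID__c), max 200 records per call.
import requests

def upsert_accounts(records, instance_url, access_token, api_version="v58.0"):
    """Upsert Account records in batches of 200, keyed on External_ID__c."""
    url = f"{instance_url}/services/data/{api_version}/composite/sobjects/Account/External_ID__c"
    headers = {"Authorization": f"Bearer {access_token}", "Content-Type": "application/json"}
    results = []
    for i in range(0, len(records), 200):
        batch = [{"attributes": {"type": "Account"}, **r} for r in records[i:i + 200]]
        resp = requests.patch(url, headers=headers, json={"allOrNone": False, "records": batch})
        resp.raise_for_status()
        results.extend(resp.json())  # one created/updated/error result per record
    return results
```

Because the upsert is keyed on External_ID__c, re-running a failed batch simply updates the records that were already written.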
The pipeline implements a robust error handling strategy.
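The document does not spell out the strategy here; a common pattern, sketched below purely as an assumption, is retrying transient failures with exponential backoff before surfacing the error to the scheduler.

```python
# Retry sketch for transient failures; the attempt count, backoff, and the decision to
# retry on any exception are assumptions, not taken from the pipeline definition.
import logging
import time

def with_retries(fn, *args, attempts=3, base_delay=5, **kwargs):
    """Call fn, retrying with exponential backoff, and re-raise once attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            if attempt == attempts:
                raise  # exhausted retries: let the scheduler mark the run as failed
            delay = base_delay * 2 ** (attempt - 1)
            logging.warning("Attempt %d/%d failed (%s); retrying in %ss", attempt, attempts, exc, delay)
            time.sleep(delay)
```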
Pipeline execution is monitored at each stage.
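The monitoring channels are not listed here; as one illustrative option, structured stage-level logging of status and duration gives operators a per-run audit trail. The log format and stage names below are assumptions.

```python
# Monitoring sketch: structured, stage-level logging of status and duration.
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

@contextmanager
def monitored_stage(name):
    """Log start, success/failure, and elapsed time for a pipeline stage."""
    start = time.monotonic()
    logging.info("stage=%s status=started", name)
    try:
        yield
        logging.info("stage=%s status=succeeded duration_s=%.1f", name, time.monotonic() - start)
    except Exception:
        logging.exception("stage=%s status=failed duration_s=%.1f", name, time.monotonic() - start)
        raise
```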
The pipeline runs daily at 06:00 UTC via cron: `0 6 * * *`
This timing ensures Salesforce is updated before the start of the UK/EU business day.
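The document specifies only the cron expression, not the orchestrator. If Apache Airflow (2.4+) were used, the schedule could be wired up as follows; the DAG id, start date, and single-task layout are assumptions.

```python
# Scheduling sketch under the assumption that Apache Airflow 2.4+ is the orchestrator.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def run_customer_sync():
    """Placeholder entry point: extract, validate, transform, stage to S3, upsert to Salesforce."""
    ...

with DAG(
    dag_id="customer_data_sync",
    schedule="0 6 * * *",   # daily at 06:00 UTC, before the UK/EU business day
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="sync_customers", python_callable=run_customer_sync)
```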