📊 Matillion ETL Pipeline Documentation

Pipeline Name: Customer Data Sync Pipeline
Author: Matthew Healey
Version: 1.0.0

Tags: Matillion ETL · Snowflake · AWS S3 · Salesforce · Data Engineering

Overview

This pipeline demonstrates a typical enterprise ETL workflow pattern for synchronising customer data from a Snowflake data warehouse to Salesforce CRM. The design prioritises reliability, observability, and error handling.

Business Purpose: Keep Salesforce CRM up-to-date with the latest customer information from the data warehouse, enabling sales and account teams to work with current, accurate data.

Pipeline Flow

Snowflake (Extract) → Validate (Quality Gates) → Transform (Map Fields) → S3 Staging (CSV Export) → Salesforce (Upsert) → Audit Log (Record)

Architecture Components

1. Source: Snowflake Data Warehouse

Database: ANALYTICS_DB
Schema: CUSTOMER_SCHEMA
View: VW_ACTIVE_CUSTOMERS

The source view aggregates customer data from multiple upstream tables, pre-filtering for active accounts with recent activity. This ensures we only sync relevant, current customers.
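The extract itself is performed by Matillion's Snowflake components; the snippet below is only a minimal sketch of the equivalent query using the Python Snowflake connector for illustration. The connection parameters are placeholders, not part of the actual pipeline definition.

```python
# Illustrative only: Matillion performs this extract natively.
# Connection details are placeholders, not real credentials.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",   # assumption: supplied from a secrets store in practice
    user="<service_user>",
    password="<service_password>",
    warehouse="<warehouse_name>",
)

try:
    cur = conn.cursor()
    # Pull active, recently-updated customers from the documented source view.
    cur.execute("SELECT * FROM ANALYTICS_DB.CUSTOMER_SCHEMA.VW_ACTIVE_CUSTOMERS")
    rows = cur.fetchall()
    print(f"Extracted {len(rows)} customer rows")
finally:
    conn.close()
```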

2. Data Quality Validation

Before proceeding with the sync, the pipeline validates incoming data against quality rules:

| Check | Condition | Severity |
| --- | --- | --- |
| Email Not Null | EMAIL IS NOT NULL | ERROR (blocks pipeline) |
| Customer ID Unique | No duplicate IDs | ERROR (blocks pipeline) |
| Valid Segment | Segment in allowed values | WARNING (logged only) |
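As a rough illustration of what these rules amount to, the sketch below applies the same three checks to an extracted batch using pandas. The real checks run inside Matillion components; the column names other than EMAIL and the allowed-segment values are assumptions for illustration.

```python
# Illustrative sketch of the documented quality rules, not the Matillion components themselves.
import pandas as pd

ALLOWED_SEGMENTS = {"Enterprise", "SMB", "Consumer"}  # assumption: example values only


def validate_customers(df: pd.DataFrame) -> list[str]:
    """Raise on blocking (ERROR) rules; return warning messages for WARNING rules."""
    # ERROR: every row must have an email.
    if df["EMAIL"].isnull().any():
        raise ValueError("Email Not Null check failed: NULL emails found")

    # ERROR: customer IDs must be unique.
    if df["CUSTOMER_ID"].duplicated().any():
        raise ValueError("Customer ID Unique check failed: duplicate IDs found")

    # WARNING: segments outside the allowed set are logged but do not block the run.
    warnings = []
    bad_segments = df.loc[~df["SEGMENT"].isin(ALLOWED_SEGMENTS), "SEGMENT"]
    if not bad_segments.empty:
        warnings.append(f"Valid Segment check: {len(bad_segments)} rows with unexpected segment")
    return warnings
```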

3. Transformation Layer

Transforms prepare the data to meet Salesforce's schema requirements, mapping warehouse columns to the corresponding Account fields (see the sketch below).
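The exact field mappings are defined in the Matillion job. The sketch below shows only the general shape; apart from External_ID__c, every mapping here is an assumption used for illustration.

```python
# Illustrative field mapping from warehouse columns to Salesforce Account fields.
# Only External_ID__c is documented; the other mappings are placeholders.
import pandas as pd

FIELD_MAP = {
    "CUSTOMER_ID": "External_ID__c",   # documented external-ID key used for the upsert
    "CUSTOMER_NAME": "Name",           # assumption
    "EMAIL": "Email__c",               # assumption
    "SEGMENT": "Customer_Segment__c",  # assumption
}


def map_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Rename and restrict columns to the Salesforce Account schema."""
    return df[list(FIELD_MAP)].rename(columns=FIELD_MAP)
```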

4. S3 Staging

Data is staged to S3 as a checkpoint and for audit/recovery purposes:

Bucket: data-pipeline-staging
Path: salesforce-sync/customers/
Format: CSV (GZIP compressed)
Retention: 30 days
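For reference, staging a gzip-compressed CSV extract to the documented bucket and prefix might look roughly like the boto3 sketch below; the file-naming convention is an assumption.

```python
# Illustrative S3 staging step; Matillion's S3 components handle this in the real pipeline.
import gzip
from datetime import datetime, timezone

import boto3

BUCKET = "data-pipeline-staging"
PREFIX = "salesforce-sync/customers/"


def stage_to_s3(csv_bytes: bytes) -> str:
    """Gzip the CSV payload and upload it as a dated object; returns the S3 key."""
    key = f"{PREFIX}customers_{datetime.now(timezone.utc):%Y%m%d}.csv.gz"  # assumption: one file per daily run
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=key,
        Body=gzip.compress(csv_bytes),
        ContentEncoding="gzip",
    )
    return key
```

The 30-day retention would typically be enforced by an S3 lifecycle rule on the prefix rather than by the pipeline itself.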

5. Salesforce Load

Operation: UPSERT (insert new, update existing)
Object: Account
External ID Field: External_ID__c
Batch Size: 200 records per API call

Using upsert with an external ID ensures idempotency — the pipeline can be safely re-run without creating duplicates.
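To illustrate that behaviour, the sketch below performs the same upsert via the simple_salesforce client library. This is an assumption for illustration only (the actual load uses Matillion's Salesforce output component), and the credentials and payload are placeholders.

```python
# Illustrative bulk upsert keyed on External_ID__c; the real pipeline uses Matillion's
# Salesforce output component rather than this client library.
from simple_salesforce import Salesforce

sf = Salesforce(
    username="<integration_user>",    # placeholders, not real credentials
    password="<password>",
    security_token="<security_token>",
)

records = [
    {"External_ID__c": "CUST-0001", "Name": "Example Ltd"},  # assumption: example payload
]

# Upsert against the Account object in batches of 200, matching the documented batch size.
results = sf.bulk.Account.upsert(records, "External_ID__c", batch_size=200)
print(results)
```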

Error Handling

The pipeline implements a robust error-handling strategy.
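The specific error-handling components are not reproduced here. Purely as an illustration of the kind of pattern such a strategy commonly involves, the sketch below wraps a batch operation in a retry with exponential backoff; every name in it is an assumption and it is not the pipeline's actual implementation.

```python
# Illustrative only: a generic retry-with-backoff wrapper, not the pipeline's
# actual error-handling components. All names here are assumptions.
import time


def with_retries(operation, max_attempts: int = 3, base_delay_s: float = 5.0):
    """Run `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # in practice, catch API/transport errors specifically
            if attempt == max_attempts:
                # Surface the failure so the audit log / alerting layer can record it.
                raise
            time.sleep(base_delay_s * 2 ** (attempt - 1))
```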

Monitoring & Alerting

Pipeline execution is monitored, with alerts raised on failure.

Schedule

The pipeline runs daily at 06:00 UTC via cron: 0 6 * * *

This timing ensures Salesforce is updated before the start of the UK/EU business day.

Security Considerations

Download Resources

📥 Download Workflow JSON
