📊 Matillion ETL Pipeline Documentation

Pipeline Name: Customer Data Sync Pipeline
Author: Matthew Healey
Version: 1.0.0

Tags: Matillion ETL · Snowflake · AWS S3 · Salesforce · Data Engineering

Overview

This pipeline demonstrates a typical enterprise ETL workflow pattern for synchronising customer data from a Snowflake data warehouse to Salesforce CRM. The design prioritises reliability, observability, and error handling.

Business Purpose: Keep Salesforce CRM up-to-date with the latest customer information from the data warehouse, enabling sales and account teams to work with current, accurate data.

Pipeline Flow

Snowflake (Extract) → Validate (Quality Gates) → Transform (Map Fields) → S3 Staging (CSV Export) → Salesforce (Upsert) → Audit Log (Record)

Architecture Components

1. Source: Snowflake Data Warehouse

Database: ANALYTICS_DB
Schema: CUSTOMER_SCHEMA
View: VW_ACTIVE_CUSTOMERS

The source view aggregates customer data from multiple upstream tables, pre-filtering for active accounts with recent activity. This ensures we only sync relevant, current customers.
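The extract itself is performed by Matillion's Snowflake components; the snippet below is only a minimal sketch of the equivalent query using the Python Snowflake connector for illustration. The connection parameters are placeholders, not part of the actual pipeline definition.

```python
# Illustrative only: Matillion performs this extract natively.
# Connection details are placeholders, not real credentials.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account_identifier>",   # assumption: supplied from a secrets store in practice
    user="<service_user>",
    password="<service_password>",
    warehouse="<warehouse_name>",
)

try:
    cur = conn.cursor()
    # Pull active, recently-updated customers from the documented source view.
    cur.execute("SELECT * FROM ANALYTICS_DB.CUSTOMER_SCHEMA.VW_ACTIVE_CUSTOMERS")
    rows = cur.fetchall()
    print(f"Extracted {len(rows)} customer rows")
finally:
    conn.close()
```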

2. Data Quality Validation

Before proceeding with the sync, the pipeline validates incoming data against quality rules:

| Check | Condition | Severity |
| --- | --- | --- |
| Email Not Null | EMAIL IS NOT NULL | ERROR (blocks pipeline) |
| Customer ID Unique | No duplicate IDs | ERROR (blocks pipeline) |
| Valid Segment | Segment in allowed values | WARNING (logged only) |
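As a rough illustration of what these rules amount to, the sketch below applies the same three checks to an extracted batch using pandas. The real checks run inside Matillion components; the column names other than EMAIL and the allowed-segment values are assumptions for illustration.

```python
# Illustrative sketch of the documented quality rules, not the Matillion components themselves.
import pandas as pd

ALLOWED_SEGMENTS = {"Enterprise", "SMB", "Consumer"}  # assumption: example values only


def validate_customers(df: pd.DataFrame) -> list[str]:
    """Raise on blocking (ERROR) rules; return warning messages for WARNING rules."""
    # ERROR: every row must have an email.
    if df["EMAIL"].isnull().any():
        raise ValueError("Email Not Null check failed: NULL emails found")

    # ERROR: customer IDs must be unique.
    if df["CUSTOMER_ID"].duplicated().any():
        raise ValueError("Customer ID Unique check failed: duplicate IDs found")

    # WARNING: segments outside the allowed set are logged but do not block the run.
    warnings = []
    bad_segments = df.loc[~df["SEGMENT"].isin(ALLOWED_SEGMENTS), "SEGMENT"]
    if not bad_segments.empty:
        warnings.append(f"Valid Segment check: {len(bad_segments)} rows with unexpected segment")
    return warnings
```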

3. Transformation Layer

Transforms prepare the data to meet Salesforce's schema requirements, mapping warehouse columns to the corresponding Account fields (see the sketch below).
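The exact field mappings are defined in the Matillion job. The sketch below shows only the general shape; apart from External_ID__c, every mapping here is an assumption used for illustration.

```python
# Illustrative field mapping from warehouse columns to Salesforce Account fields.
# Only External_ID__c is documented; the other mappings are placeholders.
import pandas as pd

FIELD_MAP = {
    "CUSTOMER_ID": "External_ID__c",   # documented external-ID key used for the upsert
    "CUSTOMER_NAME": "Name",           # assumption
    "EMAIL": "Email__c",               # assumption
    "SEGMENT": "Customer_Segment__c",  # assumption
}


def map_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Rename and restrict columns to the Salesforce Account schema."""
    return df[list(FIELD_MAP)].rename(columns=FIELD_MAP)
```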

4. S3 Staging

Data is staged to S3 as a checkpoint and for audit/recovery purposes:

Bucket: data-pipeline-staging
Path: salesforce-sync/customers/
Format: CSV (GZIP compressed)
Retention: 30 days
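For reference, staging a gzip-compressed CSV extract to the documented bucket and prefix might look roughly like the boto3 sketch below; the file-naming convention is an assumption.

```python
# Illustrative S3 staging step; Matillion's S3 components handle this in the real pipeline.
import gzip
from datetime import datetime, timezone

import boto3

BUCKET = "data-pipeline-staging"
PREFIX = "salesforce-sync/customers/"


def stage_to_s3(csv_bytes: bytes) -> str:
    """Gzip the CSV payload and upload it as a dated object; returns the S3 key."""
    key = f"{PREFIX}customers_{datetime.now(timezone.utc):%Y%m%d}.csv.gz"  # assumption: one file per daily run
    boto3.client("s3").put_object(
        Bucket=BUCKET,
        Key=key,
        Body=gzip.compress(csv_bytes),
        ContentEncoding="gzip",
    )
    return key
```

The 30-day retention would typically be enforced by an S3 lifecycle rule on the prefix rather than by the pipeline itself.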

5. Salesforce Load

Operation: UPSERT (insert new, update existing)
Object: Account
External ID Field: External_ID__c
Batch Size: 200 records per API call

Using upsert with an external ID ensures idempotency — the pipeline can be safely re-run without creating duplicates.
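To illustrate that behaviour, the sketch below performs the same upsert via the simple_salesforce client library. This is an assumption for illustration only (the actual load uses Matillion's Salesforce output component), and the credentials and payload are placeholders.

```python
# Illustrative bulk upsert keyed on External_ID__c; the real pipeline uses Matillion's
# Salesforce output component rather than this client library.
from simple_salesforce import Salesforce

sf = Salesforce(
    username="<integration_user>",    # placeholders, not real credentials
    password="<password>",
    security_token="<security_token>",
)

records = [
    {"External_ID__c": "CUST-0001", "Name": "Example Ltd"},  # assumption: example payload
]

# Upsert against the Account object in batches of 200, matching the documented batch size.
results = sf.bulk.Account.upsert(records, "External_ID__c", batch_size=200)
print(results)
```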

Error Handling

The pipeline implements a robust error-handling strategy.
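The specific error-handling components are not reproduced here. Purely as an illustration of the kind of pattern such a strategy commonly involves, the sketch below wraps a batch operation in a retry with exponential backoff; every name in it is an assumption and it is not the pipeline's actual implementation.

```python
# Illustrative only: a generic retry-with-backoff wrapper, not the pipeline's
# actual error-handling components. All names here are assumptions.
import time


def with_retries(operation, max_attempts: int = 3, base_delay_s: float = 5.0):
    """Run `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # in practice, catch API/transport errors specifically
            if attempt == max_attempts:
                # Surface the failure so the audit log / alerting layer can record it.
                raise
            time.sleep(base_delay_s * 2 ** (attempt - 1))
```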

Monitoring & Alerting

Pipeline execution is monitored, with alerts raised on failure.

Schedule

The pipeline runs daily at 06:00 UTC via cron: 0 6 * * *

This timing ensures Salesforce is updated before the start of the UK/EU business day.

Security Considerations

Download Resources

📥 Download Workflow JSON
