Data Migration

The problem

Your company is migrating from a legacy CRM to a new platform. The migration team has run the import scripts, and now you need to answer a simple question: did every customer record arrive correctly?

This is harder than it sounds. The legacy system and new platform use different schemas, different column names, and different conventions. The new system assigns its own timestamps. Some customers exist in multiple regions, so a single customer_id is not enough to uniquely identify a row — you need a composite key.

You cannot just diff the files line by line. Row order differs, columns are renamed, and timestamps will never match. You need a tool that understands structure.

The data exports

Before cutover, you export the full customer table from the legacy CRM. After import, you export the same data from the new platform.

legacy_customers.csv (source)

customer_id,region,first_name,last_name,email,plan,balance,created_at
1001,US,Alice,Smith,alice@example.com,pro,120.00,2023-06-15T09:30:00Z
1001,EU,Alice,Smith,alice.eu@example.com,basic,45.00,2023-06-15T09:30:00Z
1002,US,Bob,Jones,bob@corp.io,enterprise,950.00,2023-07-01T10:00:00Z
1003,US,Carol,Lee,carol@example.com,pro,0.00,2023-08-20T10:30:00Z
1004,US,Dan,Wu,dan@corp.io,basic,75.50,2023-09-10T11:00:00Z

new_customers.csv (target)

cust_id,region,name,email,tier,current_balance,migrated_at
1001,US,Alice Smith,alice@example.com,pro,120.00,2025-03-01T08:00:00Z
1001,EU,Alice Smith,alice.eu@example.com,basic,45.00,2025-03-01T08:00:00Z
1002,US,Bob Jones,bob@corp.io,enterprise,950.00,2025-03-01T08:05:00Z
1004,US,Dan Wu,dan@corp.io,basic,75.50,2025-03-01T08:20:00Z
1005,US,Eve Park,eve@newco.com,trial,0.00,2025-03-01T08:25:00Z

Differences you would expect to find:

  • Carol (1003/US) is missing from the new platform
  • Eve (1005/US) is unexpected — not in the legacy export
  • Column names differ — customer_id → cust_id, plan → tier, balance → current_balance
  • Names are split in legacy (first_name, last_name) but combined in the new platform (name)
  • Timestamps are irrelevant — created_at vs migrated_at will always differ
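Before reaching for the config, it helps to see what a structure-aware comparison actually does. Here is a minimal Python sketch of the keyed diff that finds the missing rows, with the two exports above inlined as strings for illustration (Reconlify's real implementation will differ):

```python
import csv
import io

# Inline copies of the two exports shown above.
legacy_csv = """customer_id,region,first_name,last_name,email,plan,balance,created_at
1001,US,Alice,Smith,alice@example.com,pro,120.00,2023-06-15T09:30:00Z
1001,EU,Alice,Smith,alice.eu@example.com,basic,45.00,2023-06-15T09:30:00Z
1002,US,Bob,Jones,bob@corp.io,enterprise,950.00,2023-07-01T10:00:00Z
1003,US,Carol,Lee,carol@example.com,pro,0.00,2023-08-20T10:30:00Z
1004,US,Dan,Wu,dan@corp.io,basic,75.50,2023-09-10T11:00:00Z
"""
new_csv = """cust_id,region,name,email,tier,current_balance,migrated_at
1001,US,Alice Smith,alice@example.com,pro,120.00,2025-03-01T08:00:00Z
1001,EU,Alice Smith,alice.eu@example.com,basic,45.00,2025-03-01T08:00:00Z
1002,US,Bob Jones,bob@corp.io,enterprise,950.00,2025-03-01T08:05:00Z
1004,US,Dan Wu,dan@corp.io,basic,75.50,2025-03-01T08:20:00Z
1005,US,Eve Park,eve@newco.com,trial,0.00,2025-03-01T08:25:00Z
"""

# Index each file by its composite key: (customer id, region).
legacy = {(r["customer_id"], r["region"]): r
          for r in csv.DictReader(io.StringIO(legacy_csv))}
new = {(r["cust_id"], r["region"]): r
       for r in csv.DictReader(io.StringIO(new_csv))}

missing_in_target = legacy.keys() - new.keys()   # in legacy, not in new
missing_in_source = new.keys() - legacy.keys()   # in new, not in legacy

print(missing_in_target)  # {('1003', 'US')}  -> Carol
print(missing_in_source)  # {('1005', 'US')}  -> Eve
```

Row order never matters here: both files are reduced to key-indexed dictionaries before any comparison happens.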

The validation config

type: tabular
source: legacy_customers.csv
target: new_customers.csv

keys:
  - customer_id
  - region

column_mapping:
  customer_id: cust_id
  full_name: name
  plan: tier
  balance: current_balance

ignore_columns:
  - created_at
  - first_name
  - last_name

normalization:
  full_name:
    - op: concat
      args: [first_name, " ", last_name]
    - op: trim

Why this config works

Composite keys — customer_id + region uniquely identifies each row. Alice has two records (US and EU), so a single-column key would create duplicates. Reconlify requires unique keys and will error if they are not.
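The uniqueness requirement is easy to verify by hand before running anything. A quick Python check over the key tuples from the legacy export:

```python
# Key tuples for the five legacy rows shown above.
rows = [("1001", "US"), ("1001", "EU"), ("1002", "US"),
        ("1003", "US"), ("1004", "US")]

single = [cid for cid, _ in rows]   # customer_id alone
composite = rows                    # customer_id + region

print(len(single) == len(set(single)))        # False: 1001 appears twice
print(len(composite) == len(set(composite)))  # True: every pair is unique
```

If this prints False for your chosen key columns, add more columns to the key until every tuple is unique.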

Column mapping — translates between legacy and new-platform column names. The config uses logical (source-side) names everywhere: keys says customer_id, not cust_id.

Normalization — generates a full_name column on the source side by concatenating first_name and last_name. This derived field is compared against the target's name column. Without this, Reconlify would have no way to compare split source names against the combined target field.
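The concat-then-trim pipeline can be sketched in a few lines of Python. This mimics the semantics of the two ops in the config; it is not Reconlify's actual implementation:

```python
def normalize_full_name(row):
    """Sketch of the config's normalization pipeline for full_name."""
    full = row["first_name"] + " " + row["last_name"]  # op: concat
    return full.strip()                                # op: trim

alice = {"first_name": "Alice", "last_name": "Smith"}
print(normalize_full_name(alice))  # Alice Smith
```

The trim step matters when source fields carry stray whitespace: " Carol" + "Lee " would otherwise fail to match a cleanly stored "Carol Lee" on the target side.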

ignore_columns — excludes created_at (timestamps always differ between exports), first_name, and last_name (replaced by the generated full_name).

Running the validation

reconlify run config.yaml

Reconlify prints a summary and writes the full report to report.json. The exit code is 1, meaning differences were found.

To save the report to a specific location:

reconlify run config.yaml --out migration-validation/round1.json

Interpreting the results

Open report.json. The summary tells you the migration status at a glance:

{
  "source_rows": 5,
  "target_rows": 5,
  "missing_in_target": 1,
  "missing_in_source": 1,
  "rows_with_mismatches": 0,
  "mismatched_cells": 0
}
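The summary can also be checked mechanically, which is useful if you want to gate a deployment on the result. A sketch that reads the counters (the JSON is inlined here; in practice you would load report.json from disk) and decides whether another round is needed:

```python
import json

report = json.loads("""{
  "source_rows": 5,
  "target_rows": 5,
  "missing_in_target": 1,
  "missing_in_source": 1,
  "rows_with_mismatches": 0,
  "mismatched_cells": 0
}""")

# A migration is clean only when every difference counter is zero.
clean = all(report[k] == 0 for k in
            ("missing_in_target", "missing_in_source", "rows_with_mismatches"))
print("clean" if clean else "needs another round")  # needs another round
```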

What each number means for your migration

missing_in_target: 1 — one legacy record did not make it to the new platform. The samples.missing_in_target section identifies it as Carol (1003/US). This is the most critical finding — it means data was lost. Investigate whether the import script failed, whether a filter excluded her, or whether her record was rejected by validation rules.

missing_in_source: 1 — one record exists in the new platform that was not in the legacy export. The samples identify Eve (1005/US). This could be test data, a record created directly in the new system, or a duplicate from a previous import run. Either way, it needs review.

rows_with_mismatches: 0 — every record that exists on both sides has identical values (after column mapping and normalization). Alice's two records, Bob, and Dan all match.
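For rows that exist on both sides, the cell-level comparison works column by column through the mapping. A sketch over one matched pair (Bob's row, with full_name already derived by normalization):

```python
# Source-side logical column -> target column, as in the config.
mapping = {"customer_id": "cust_id", "full_name": "name",
           "plan": "tier", "balance": "current_balance"}

source = {"customer_id": "1002", "region": "US", "full_name": "Bob Jones",
          "plan": "enterprise", "balance": "950.00"}
target = {"cust_id": "1002", "region": "US", "name": "Bob Jones",
          "tier": "enterprise", "current_balance": "950.00"}

# Collect the source columns whose mapped target value differs.
mismatches = [col for col, tcol in mapping.items()
              if source[col] != target[tcol]]
print(mismatches)  # []
```

An empty list here is what rows_with_mismatches: 0 means, applied to every matched key pair.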

The migration checklist

For a clean migration, you want:

Metric                 Expected   This run
missing_in_target      0          1
missing_in_source      0          1
rows_with_mismatches   0          0

This migration is not yet clean. Fix the Carol import issue, investigate the Eve record, re-export, and run the validation again.

The validation workflow

Migration validation is rarely one-and-done. The typical workflow is:

  1. Export both systems after import
  2. Run Reconlify with your validation config
  3. Review the report — focus on missing rows first, then mismatches
  4. Fix import issues or data quality problems
  5. Re-export and re-run until the report shows zero differences

Save each report with a timestamped filename so you have an audit trail of each validation round.
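One way to build such a filename, assuming a POSIX shell and the date utility (the path scheme is just an example, extending the --out usage shown earlier):

```shell
# Build a timestamped report path for this validation round (UTC).
out="migration-validation/round-$(date -u +%Y%m%dT%H%M%SZ).json"
echo "$out"
# Then: reconlify run config.yaml --out "$out"
```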

Common migration validation scenarios

  • Pre/post cutover — compare the last legacy export against the first new-system export
  • Phased migration — run periodic comparisons to track progress as batches are migrated
  • ETL pipeline regression — confirm that pipeline changes did not introduce data loss
  • SQL refactor validation — verify that rewritten queries produce the same output
  • Environment promotion — confirm staging and production datasets match after deployment