Normalization and Rules

Reconlify provides three levels of cleanup: global comparison settings that apply to every column, string rules that target specific columns, and normalization pipelines that create derived columns from source data.

This page focuses on practical usage of each feature. For the full list of options and their syntax, see the YAML Config Reference.

Global comparison settings

The compare section applies the same cleanup to all columns at once.

compare:
  trim_whitespace: true
  case_insensitive: false
  normalize_nulls: ["", "NULL", "null"]

  • trim_whitespace (default: true) — strips leading and trailing spaces
  • case_insensitive (default: false) — ignores case across all columns
  • normalize_nulls — treats listed string values as equivalent to null
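The combined effect of these settings can be sketched in Python. This is an illustration of the behavior, not Reconlify's implementation; the function name, argument names, and the order in which null normalization is applied after trimming are assumptions.

```python
# Illustrative sketch of the global compare settings -- not Reconlify's code.
def normalize_global(value, trim_whitespace=True, case_insensitive=False,
                     null_values=("", "NULL", "null")):
    if value is None:
        return None
    if trim_whitespace:
        value = value.strip()
    if value in null_values:
        return None          # listed strings are treated as null
    if case_insensitive:
        value = value.casefold()
    return value

print(normalize_global(" NULL "))  # None: trims to "NULL", then nulled
```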

Global settings are coarse. When you need different behavior per column, use string rules instead.

String rules

String rules apply per-column cleanup logic. They are defined under string_rules in the config, where each key is a column name (using logical source-side names) and the value is a list of rules to apply.

trim

Problem. Your bank exports counterparty names with padding spaces. The ERP stores the same names without padding. Every row flags as a mismatch.

bank.csv (source)

txn_id,counterparty,amount
T-01, Acme Corp  ,5000.00
T-02,  GlobalTech ,3200.00
T-03,Wayne Industries,1800.00

erp.csv (target)

txn_id,counterparty,amount
T-01,Acme Corp,5000.00
T-02,GlobalTech,3200.00
T-03,Wayne Industries,1800.00

Config

type: tabular
source: bank.csv
target: erp.csv
keys:
  - txn_id
 
string_rules:
  counterparty:
    - trim

Before vs after

txn_id | Source (raw)       | Target             | Without rule | With rule
T-01   | " Acme Corp  "     | "Acme Corp"        | mismatch     | match
T-02   | "  GlobalTech "    | "GlobalTech"       | mismatch     | match
T-03   | "Wayne Industries" | "Wayne Industries" | match        | match

The trim rule strips leading and trailing whitespace from the source value before comparison. T-01 and T-02 now match. T-03 already matched and is unaffected.

Note: compare.trim_whitespace does the same thing globally. Use the string rule when you only want trimming on specific columns — for example, when other columns contain significant whitespace that should be compared exactly.
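The rule's effect is equivalent to Python's str.strip() applied to the source value before the equality check. A minimal sketch (the function name is illustrative, not part of Reconlify):

```python
# Per-column trim before comparison -- illustrative sketch only.
def matches_with_trim(source, target):
    return source.strip() == target

print(matches_with_trim(" Acme Corp  ", "Acme Corp"))        # True
print(matches_with_trim("Wayne Industries", "Wayne Industries"))  # True
```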

case_insensitive

Problem. The legacy system stores status values in uppercase ("ACTIVE", "CLOSED"). The new platform uses title case ("Active", "Closed"). Every row with a status value flags as a mismatch.

legacy.csv (source)

account_id,status,balance
A-100,ACTIVE,12500.00
A-200,CLOSED,0.00
A-300,PENDING REVIEW,4800.00

platform.csv (target)

account_id,status,balance
A-100,Active,12500.00
A-200,Closed,0.00
A-300,Pending Review,4800.00

Config

type: tabular
source: legacy.csv
target: platform.csv
keys:
  - account_id
 
string_rules:
  status:
    - case_insensitive

Before vs after

account_id | Source           | Target           | Without rule | With rule
A-100      | "ACTIVE"         | "Active"         | mismatch     | match
A-200      | "CLOSED"         | "Closed"         | mismatch     | match
A-300      | "PENDING REVIEW" | "Pending Review" | mismatch     | match

The case_insensitive rule compares values without regard to letter casing. All three rows now match.

Use the string rule instead of compare.case_insensitive when casing matters on other columns. For example, product SKUs like "ABC-100" and "abc-100" may represent different items — you want case-insensitive comparison on status but exact comparison on sku.
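The comparison behaves like a case-folded equality check, sketched below (an illustration of the behavior, not the tool's code):

```python
# Case-insensitive comparison for a single column -- illustrative sketch.
def matches_case_insensitive(source, target):
    return source.casefold() == target.casefold()

print(matches_case_insensitive("PENDING REVIEW", "Pending Review"))  # True
print(matches_case_insensitive("ABC-100", "abc-100"))                # True
```

The second call shows why this rule is dangerous on columns like SKUs: values that differ only in case are treated as equal.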

regex_extract

Problem. The source system stores order references with a prefix and year ("ORD-2025-48301"). The target system stores only the numeric suffix ("48301"). The values represent the same order but never match as strings.

orders_source.csv (source)

order_ref,customer,total
ORD-2025-48301,Acme Corp,1500.00
ORD-2025-48302,Globex Inc,820.00
ORD-2024-91055,Initech Ltd,3400.00

orders_target.csv (target)

order_ref,customer,total
48301,Acme Corp,1500.00
48302,Globex Inc,820.00
91055,Initech Ltd,3400.00

Config

type: tabular
source: orders_source.csv
target: orders_target.csv
keys:
  - order_ref
 
string_rules:
  order_ref:
    - regex_extract:
        pattern: "ORD-\\d{4}-(\\d+)"
        group: 1

Before vs after

Source (raw)     | Extracted | Target  | Without rule | With rule
"ORD-2025-48301" | "48301"   | "48301" | mismatch     | match
"ORD-2025-48302" | "48302"   | "48302" | mismatch     | match
"ORD-2024-91055" | "91055"   | "91055" | mismatch     | match

The regex_extract rule applies the regex pattern to the source value and extracts the specified capture group. Group 1 captures the trailing digits after ORD-YYYY-. The extracted value is what gets compared against the target.

Because order_ref is also the key column, the extraction applies during key matching too — the source row ORD-2025-48301 is paired with the target row 48301.
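The extraction corresponds to a standard regex capture, sketched here with Python's re module. The pass-through behavior for values that do not match the pattern is an assumption for illustration, not documented Reconlify behavior.

```python
import re

# Sketch of regex_extract: apply the pattern and compare the capture group.
def regex_extract(value, pattern, group):
    m = re.search(pattern, value)
    return m.group(group) if m else value  # assumption: non-matches pass through

print(regex_extract("ORD-2025-48301", r"ORD-\d{4}-(\d+)", 1))  # 48301
```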

contains

Problem. The source system stores short product names ("Widget"). The target system stores extended descriptions that include the short name ("Premium Widget - Large"). The values refer to the same product but differ as strings.

inventory_source.csv (source)

sku,product_name,qty
W-100,Widget,250
W-200,Gadget,180
W-300,Sprocket,90

inventory_target.csv (target)

sku,product_name,qty
W-100,Premium Widget - Large,250
W-200,Gadget Pro,180
W-300,Industrial Sprocket v2,90

Config

type: tabular
source: inventory_source.csv
target: inventory_target.csv
keys:
  - sku
 
string_rules:
  product_name:
    - contains

Before vs after

sku   | Source     | Target                   | Without rule | With rule
W-100 | "Widget"   | "Premium Widget - Large" | mismatch     | match
W-200 | "Gadget"   | "Gadget Pro"             | mismatch     | match
W-300 | "Sprocket" | "Industrial Sprocket v2" | mismatch     | match

The contains rule matches if either value contains the other as a substring. "Widget" is contained in "Premium Widget - Large", so W-100 matches.

This rule is intentionally loose. Use it when one system stores an abbreviated version of a value that the other system stores in full. If you need more precision, use regex_extract to pull out the specific substring you want to compare.
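The symmetric substring check can be sketched in one line of Python (illustrative, not Reconlify's code):

```python
# contains: match if either value is a substring of the other.
def matches_contains(source, target):
    return source in target or target in source

print(matches_contains("Widget", "Premium Widget - Large"))  # True
print(matches_contains("Widget", "Gadget"))                  # False
```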

Combining rules

Rules can be stacked on a single column. They apply in order.

string_rules:
  vendor_name:
    - trim
    - case_insensitive

This first strips whitespace, then compares case-insensitively. A source value of " ACME CORP " matches a target value of "Acme Corp".

A common combination for messy string data:

string_rules:
  counterparty:
    - trim
    - case_insensitive
  reference_id:
    - regex_extract:
        pattern: "REF-(\\d+)"
        group: 1

Different columns get different treatment in a single config.
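Stacking behaves like a function pipeline: each rule transforms the value the previous rule produced. A minimal Python sketch, where case_insensitive is modeled as a casefold transform for demonstration purposes (Reconlify treats it as a comparison setting, not a transform):

```python
# Apply a list of transforms in order -- illustrative sketch only.
def apply_rules(value, rules):
    for rule in rules:
        value = rule(value)
    return value

print(apply_rules(" ACME CORP ", [str.strip, str.casefold]))  # acme corp
```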

Source-side normalization pipelines

When the source and target represent the same data in structurally different ways, string rules are not enough. Normalization pipelines create derived columns on the source side before comparison.

Each entry is a linear pipeline. The first step receives its inputs from args (column names or literals). Each subsequent step operates on the result of the previous one.

Combine split fields

The source stores names in two columns. The target has a single field.

normalization:
  full_name:
    - op: concat
      args: [first_name, " ", last_name]
    - op: trim
 
ignore_columns:
  - first_name
  - last_name

Reconlify generates full_name on the source side, then compares it against the target's full_name. The original columns are excluded since they have been replaced.
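The pipeline's effect on a single row can be sketched in Python; the sample row and values are hypothetical, and concat here simply joins its arguments as the op does with column values and literals:

```python
# Sketch of the concat + trim pipeline on one source row -- illustrative only.
row = {"first_name": " Ada", "last_name": "Lovelace"}

def concat(*parts):
    return "".join(parts)

full_name = concat(row["first_name"], " ", row["last_name"]).strip()
print(full_name)  # Ada Lovelace
```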

Map codes to labels

The source uses short codes. The target stores readable labels.

normalization:
  status:
    - op: map
      args: [status_code, "A", "ACTIVE", "I", "INACTIVE", "S", "SUSPENDED"]
 
ignore_columns:
  - status_code

The map operation takes the source column followed by key-value pairs. Each source code is replaced with its corresponding label.
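The pair-wise args are equivalent to a dictionary lookup, sketched below. Passing unknown codes through unchanged is an assumption for illustration, not documented behavior.

```python
# Sketch of the map operation: code/label pairs become a lookup table.
def map_codes(value, *pairs):
    table = dict(zip(pairs[::2], pairs[1::2]))
    return table.get(value, value)  # assumption: unknown codes pass through

print(map_codes("A", "A", "ACTIVE", "I", "INACTIVE", "S", "SUSPENDED"))  # ACTIVE
```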

Align date formats

The source uses ISO dates. The target uses a DD/MM/YYYY format.

normalization:
  event_date:
    - op: date_format
      args: [event_date, "%Y-%m-%d", "%d/%m/%Y"]

The date_format operation parses the source date with the first format string and outputs it in the second. Both sides now compare as "15/01/2025".
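The parse-then-reformat step maps directly onto Python's strptime/strftime pair, sketched here as an illustration of the behavior:

```python
from datetime import datetime

# Sketch of date_format: parse with the first format, emit with the second.
def date_format(value, in_fmt, out_fmt):
    return datetime.strptime(value, in_fmt).strftime(out_fmt)

print(date_format("2025-01-15", "%Y-%m-%d", "%d/%m/%Y"))  # 15/01/2025
```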

Other operations

Normalization supports additional operations for common patterns:

  • round — align decimal precision (e.g., 4 places to 2)
  • coalesce — replace nulls with a default value
  • add, sub, mul, div — compute derived numeric values
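For intuition, the numeric operations behave roughly like the following Python sketches; the function names and null-handling are illustrative assumptions, not Reconlify's API:

```python
# Illustrative sketches of round and coalesce -- not the tool's code.
def round_op(value, places):
    return round(float(value), places)

def coalesce(value, default):
    return default if value is None else value

print(round_op("4.5679", 2))      # 4.57
print(coalesce(None, "UNKNOWN"))  # UNKNOWN
```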

See the YAML Config Reference for the full list with argument syntax.

Limitations

  • Source-side only. Normalization pipelines run on source data. Target columns are not transformed — use column mapping for target-side renames.
  • Linear pipeline. Steps run in sequence. No branching or conditionals.
  • No cross-references. Generated columns cannot reference other generated columns. All args must refer to original source columns or literals.
  • No arbitrary expressions. For complex transformations, preprocess your data before running Reconlify.