# Normalization and Rules
Reconlify provides three levels of cleanup: global comparison settings that apply to every column, string rules that target specific columns, and normalization pipelines that create derived columns from source data.
This page focuses on practical usage of each feature. For the full list of options and their syntax, see the YAML Config Reference.
## Global comparison settings
The compare section applies the same cleanup to all columns at once.
```yaml
compare:
  trim_whitespace: true
  case_insensitive: false
  normalize_nulls: ["", "NULL", "null"]
```

- trim_whitespace (default: true) — strips leading and trailing whitespace
- case_insensitive (default: false) — ignores case across all columns
- normalize_nulls — treats the listed string values as equivalent to null
Global settings are coarse. When you need different behavior per column, use string rules instead.
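The global settings amount to a cleanup function applied to every cell before comparison. A minimal Python sketch of that behavior (function name and evaluation order are illustrative, not Reconlify's actual internals):

```python
def clean_cell(value, trim_whitespace=True, case_insensitive=False,
               normalize_nulls=("", "NULL", "null")):
    """Apply global compare settings to a single cell value (sketch)."""
    if value is None:
        return None
    if trim_whitespace:
        value = value.strip()
    if value in normalize_nulls:
        return None  # listed strings compare equal to null
    if case_insensitive:
        value = value.casefold()
    return value

print(clean_cell("  Acme Corp  "))  # → "Acme Corp"
print(clean_cell("NULL"))           # → None
```

Two cells match when their cleaned values are equal, so padding and null spellings stop producing false mismatches.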
## String rules
String rules apply per-column cleanup logic. They are defined under
string_rules in the config, where each key is a column name (using logical
source-side names) and the value is a list of rules to apply.
### trim
Problem. Your bank exports counterparty names with padding spaces. The ERP stores the same names without padding. Every row flags as a mismatch.
bank.csv (source)

```csv
txn_id,counterparty,amount
T-01, Acme Corp ,5000.00
T-02, GlobalTech ,3200.00
T-03,Wayne Industries,1800.00
```

erp.csv (target)

```csv
txn_id,counterparty,amount
T-01,Acme Corp,5000.00
T-02,GlobalTech,3200.00
T-03,Wayne Industries,1800.00
```

Config

```yaml
type: tabular
source: bank.csv
target: erp.csv
keys:
  - txn_id
string_rules:
  counterparty:
    - trim
```

Before vs after

| txn_id | Source (raw) | Target | Without rule | With rule |
|---|---|---|---|---|
| T-01 | " Acme Corp " | "Acme Corp" | mismatch | match |
| T-02 | " GlobalTech " | "GlobalTech" | mismatch | match |
| T-03 | "Wayne Industries" | "Wayne Industries" | match | match |
The trim rule strips leading and trailing whitespace from the source value
before comparison. T-01 and T-02 now match. T-03 already matched and is
unaffected.
Note: compare.trim_whitespace does the same thing globally. Use the string
rule when you only want trimming on specific columns — for example, when other
columns contain significant whitespace that should be compared exactly.
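The comparison the trim rule performs reduces to stripping the source value before the equality check. A hedged sketch of that semantics (not Reconlify's API):

```python
def trim_match(source, target):
    """Strip leading/trailing whitespace from the source, then compare exactly."""
    return source.strip() == target

print(trim_match("  Acme Corp  ", "Acme Corp"))            # True  (T-01)
print(trim_match("Wayne Industries", "Wayne Industries"))  # True  (T-03)
```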
### case_insensitive
Problem. The legacy system stores status values in uppercase ("ACTIVE",
"CLOSED"). The new platform uses title case ("Active", "Closed"). Every
row with a status value flags as a mismatch.
legacy.csv (source)

```csv
account_id,status,balance
A-100,ACTIVE,12500.00
A-200,CLOSED,0.00
A-300,PENDING REVIEW,4800.00
```

platform.csv (target)

```csv
account_id,status,balance
A-100,Active,12500.00
A-200,Closed,0.00
A-300,Pending Review,4800.00
```

Config

```yaml
type: tabular
source: legacy.csv
target: platform.csv
keys:
  - account_id
string_rules:
  status:
    - case_insensitive
```

Before vs after

| account_id | Source | Target | Without rule | With rule |
|---|---|---|---|---|
| A-100 | "ACTIVE" | "Active" | mismatch | match |
| A-200 | "CLOSED" | "Closed" | mismatch | match |
| A-300 | "PENDING REVIEW" | "Pending Review" | mismatch | match |
The case_insensitive rule compares values without regard to letter casing.
All three rows now match.
Use the string rule instead of compare.case_insensitive when casing matters
on other columns. For example, product SKUs like "ABC-100" and "abc-100"
may represent different items — you want case-insensitive comparison on
status but exact comparison on sku.
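The rule's behavior is equivalent to folding the case of both sides before comparing. A minimal sketch (`casefold` is used here because it handles Unicode more robustly than `lower`; this is an assumption about the semantics, not Reconlify's code):

```python
def case_insensitive_match(source, target):
    """Compare two values ignoring letter casing."""
    return source.casefold() == target.casefold()

print(case_insensitive_match("ACTIVE", "Active"))                  # True
print(case_insensitive_match("PENDING REVIEW", "Pending Review"))  # True
```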
### regex_extract
Problem. The source system stores order references with a prefix and year
("ORD-2025-48301"). The target system stores only the numeric suffix
("48301"). The values represent the same order but never match as strings.
orders_source.csv (source)

```csv
order_ref,customer,total
ORD-2025-48301,Acme Corp,1500.00
ORD-2025-48302,Globex Inc,820.00
ORD-2024-91055,Initech Ltd,3400.00
```

orders_target.csv (target)

```csv
order_ref,customer,total
48301,Acme Corp,1500.00
48302,Globex Inc,820.00
91055,Initech Ltd,3400.00
```

Config

```yaml
type: tabular
source: orders_source.csv
target: orders_target.csv
keys:
  - order_ref
string_rules:
  order_ref:
    - regex_extract:
        pattern: "ORD-\\d{4}-(\\d+)"
        group: 1
```

Before vs after

| Source (raw) | Extracted | Target | Without rule | With rule |
|---|---|---|---|---|
| "ORD-2025-48301" | "48301" | "48301" | mismatch | match |
| "ORD-2025-48302" | "48302" | "48302" | mismatch | match |
| "ORD-2024-91055" | "91055" | "91055" | mismatch | match |
The regex_extract rule applies the regex pattern to the source value and
extracts the specified capture group. Group 1 captures the trailing digits
after ORD-YYYY-. The extracted value is what gets compared against the
target.
Because order_ref is also the key column, the extraction applies during key
matching too — the source row ORD-2025-48301 is paired with the target row
48301.
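The extraction step can be sketched with Python's `re` module. The fall-through behavior on a non-matching value (returning the raw string) is an assumption for illustration:

```python
import re

def regex_extract(value, pattern, group):
    """Return the requested capture group if the pattern matches, else the raw value."""
    m = re.search(pattern, value)
    return m.group(group) if m else value

print(regex_extract("ORD-2025-48301", r"ORD-\d{4}-(\d+)", 1))  # → "48301"
```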
### contains
Problem. The source system stores short product names ("Widget"). The
target system stores extended descriptions that include the short name
("Premium Widget - Large"). The values refer to the same product but differ
as strings.
inventory_source.csv (source)

```csv
sku,product_name,qty
W-100,Widget,250
W-200,Gadget,180
W-300,Sprocket,90
```

inventory_target.csv (target)

```csv
sku,product_name,qty
W-100,Premium Widget - Large,250
W-200,Gadget Pro,180
W-300,Industrial Sprocket v2,90
```

Config

```yaml
type: tabular
source: inventory_source.csv
target: inventory_target.csv
keys:
  - sku
string_rules:
  product_name:
    - contains
```

Before vs after

| sku | Source | Target | Without rule | With rule |
|---|---|---|---|---|
| W-100 | "Widget" | "Premium Widget - Large" | mismatch | match |
| W-200 | "Gadget" | "Gadget Pro" | mismatch | match |
| W-300 | "Sprocket" | "Industrial Sprocket v2" | mismatch | match |
The contains rule matches if either value contains the other as a substring.
"Widget" is contained in "Premium Widget - Large", so W-100 matches.
This rule is intentionally loose. Use it when one system stores an abbreviated
version of a value that the other system stores in full. If you need more
precision, use regex_extract to pull out the specific substring you want to
compare.
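The symmetric substring check described above can be sketched in a couple of lines (illustrative only):

```python
def contains_match(source, target):
    """Match if either value contains the other as a substring."""
    return source in target or target in source

print(contains_match("Widget", "Premium Widget - Large"))  # True
print(contains_match("Gadget", "Gizmo Pro"))               # False
```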
### Combining rules
Rules can be stacked on a single column. They apply in order.
```yaml
string_rules:
  vendor_name:
    - trim
    - case_insensitive
```

This first strips whitespace, then compares case-insensitively. A source value of " ACME CORP " matches a target value of "Acme Corp".

A common combination for messy string data:

```yaml
string_rules:
  counterparty:
    - trim
    - case_insensitive
  reference_id:
    - regex_extract:
        pattern: "REF-(\\d+)"
        group: 1
```

Different columns get different treatment in a single config.
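Stacked rules behave like function composition: each rule transforms the running value before the next one runs. A minimal sketch, assuming both sides are normalized the same way:

```python
def apply_rules(value, rules):
    """Apply string rules left to right; each one transforms the running value."""
    for rule in rules:
        value = rule(value)
    return value

rules = [str.strip, str.casefold]  # trim, then case_insensitive
print(apply_rules("  ACME CORP  ", rules))  # → "acme corp"
print(apply_rules("Acme Corp", rules))      # → "acme corp"
```

Because "  ACME CORP  " and "Acme Corp" normalize to the same string, the pair matches.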
## Source-side normalization pipelines
When the source and target represent the same data in structurally different ways, string rules are not enough. Normalization pipelines create derived columns on the source side before comparison.
Each entry is a linear pipeline. The first step receives its inputs from args
(column names or literals). Each subsequent step operates on the result of the
previous one.
### Combine split fields
The source stores names in two columns. The target has a single field.
```yaml
normalization:
  full_name:
    - op: concat
      args: [first_name, " ", last_name]
    - op: trim

ignore_columns:
  - first_name
  - last_name
```

Reconlify generates full_name on the source side, then compares it against the target's full_name. The original columns are excluded since they have been replaced.
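The pipeline above can be sketched as a per-row function. The convention that an arg which is not a column name is treated as a literal is an assumption based on the config shown:

```python
def concat(row, args):
    """concat step: column names resolve to values, other args are literals."""
    return "".join(row.get(a, a) for a in args)

# Hypothetical row; the trailing .strip() plays the role of the trim step.
row = {"first_name": " Ada", "last_name": "Lovelace"}
derived = concat(row, ["first_name", " ", "last_name"]).strip()
print(derived)  # → "Ada Lovelace"
```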
### Map codes to labels
The source uses short codes. The target stores readable labels.
```yaml
normalization:
  status:
    - op: map
      args: [status_code, "A", "ACTIVE", "I", "INACTIVE", "S", "SUSPENDED"]

ignore_columns:
  - status_code
```

The map operation takes the source column followed by key-value pairs. Each source code is replaced with its corresponding label.
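The key-value pairing can be sketched by zipping alternating args into a dict. Pass-through of unmapped codes is an assumption for illustration:

```python
def map_codes(value, pairs):
    """Replace a code with its label; unknown codes pass through unchanged."""
    mapping = dict(zip(pairs[0::2], pairs[1::2]))
    return mapping.get(value, value)

pairs = ["A", "ACTIVE", "I", "INACTIVE", "S", "SUSPENDED"]
print(map_codes("A", pairs))  # → "ACTIVE"
print(map_codes("X", pairs))  # → "X" (assumed pass-through)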
### Align date formats
The source uses ISO dates. The target uses a DD/MM/YYYY format.
```yaml
normalization:
  event_date:
    - op: date_format
      args: [event_date, "%Y-%m-%d", "%d/%m/%Y"]
```

The date_format operation parses the source date with the first format string and outputs it in the second. Both sides now compare as "15/01/2025".
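The parse-then-reformat behavior maps directly onto `strptime`/`strftime`, and the format strings in the config look like standard strftime codes. A sketch of the equivalent transformation:

```python
from datetime import datetime

def date_format(value, in_fmt, out_fmt):
    """Parse the value with the input format, re-emit it in the output format."""
    return datetime.strptime(value, in_fmt).strftime(out_fmt)

print(date_format("2025-01-15", "%Y-%m-%d", "%d/%m/%Y"))  # → "15/01/2025"
```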
### Other operations
Normalization supports additional operations for common patterns:
- round — align decimal precision (e.g., 4 places to 2)
- coalesce — replace nulls with a default value
- add, sub, mul, div — compute derived numeric values
See the YAML Config Reference for the full list with argument syntax.
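As a rough illustration of what two of these operations do (semantics assumed from the descriptions above, not from Reconlify's implementation):

```python
def round_step(value, places):
    """round: align decimal precision before comparison (e.g. 4 places to 2)."""
    return round(float(value), places)

def coalesce(value, default):
    """coalesce: replace a null with a default value."""
    return default if value is None else value

print(round_step("4.5000", 2))  # → 4.5
print(coalesce(None, "N/A"))    # → "N/A"
```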
## Limitations
- Source-side only. Normalization pipelines run on source data. Target columns are not transformed — use column mapping for target-side renames.
- Linear pipeline. Steps run in sequence. No branching or conditionals.
- No cross-references. Generated columns cannot reference other generated columns. All args must refer to original source columns or literals.
- No arbitrary expressions. For complex transformations, preprocess your data before running Reconlify.
## Related docs
- Column Mapping — align columns when source and target use different names
- Financial Reconciliation — example using trim and case_insensitive rules with tolerance
- Data Migration — example using normalization pipelines for a CRM migration
- YAML Config Reference — full configuration reference