Why diff fails for CSV comparison
When you need to compare two CSV files, the first instinct is usually to run a diff.
Something like:
diff file_before.csv file_after.csv
Or maybe a visual diff tool.
At first glance this seems reasonable. CSV files are just text files, so a text diff should work.
But in practice, diff produces results that are often noisy, misleading, or completely unusable.
This is especially common when validating:
- data migrations
- SQL pipeline refactors
- report regressions
- system exports
The problem is that CSV files represent structured datasets, while diff only understands lines of text.
Once you start comparing real-world datasets, several issues quickly appear.
1. Row order often changes
Many datasets do not have deterministic ordering.
For example, a database export might return rows in a different order after a migration or query change.
Consider this dataset:
id,name,amount
1,Alice,100
2,Bob,200
3,Carol,300
After a migration, the export might look like this:
id,name,amount
3,Carol,300
1,Alice,100
2,Bob,200
The data is identical.
But a line-by-line diff will show almost every row as changed.
From diff's perspective, the entire file has been modified.
From a data perspective, nothing actually changed.
2. Column order can differ
Another common case happens when column order changes.
For example:
id,name,amount
1,Alice,100
vs
name,id,amount
Alice,1,100
Again, the underlying data is the same.
But because diff compares text positions rather than fields, it treats the entire row as different.
This is extremely common when comparing exports from different systems or different versions of the same pipeline.
3. Formatting noise creates false positives
Real-world data often contains small formatting differences.
Examples include:
- extra whitespace
- different casing
- NULL vs empty string
- timestamp formatting
For example:
Alice vs Alice
Or:
NULL vs null
These differences usually do not matter from a business or data perspective, but diff reports them as changes.
When you are validating thousands of rows, this noise quickly becomes overwhelming.
4. Numeric rounding differences
Another very common case appears in financial or analytical datasets.
For example:
100.00 vs 100
Or:
19.999999 vs 20.00
In many pipelines, these small differences are expected due to floating point behavior or formatting rules.
But diff cannot distinguish between a meaningful difference and a harmless rounding artifact.
5. What you actually want is semantic comparison
When people compare datasets, the real question is usually simple:
Did the data actually change?
Answering this requires understanding the structure of the data.
For example:
- rows should be matched by key
- column names may differ
- ordering may not matter
- small numeric tolerances may be acceptable
- certain columns may be ignored
In other words, the comparison needs to be semantic, not textual.
6. Example: migration validation
Imagine validating a system migration.
You export transactions from the old system:
txn_id,customer,amount
1001,Alice,100.00
1002,Bob,250.50
And from the new system:
id,client,total_amount
1001,Alice,100
1002,Bob,250.50
A diff tool sees completely different rows.
But a semantic comparison would recognize:
txn_idmaps toidcustomermaps toclientamountmaps tototal_amount100and100.00are equivalent within tolerance
From that perspective, the datasets are identical.
7. Why tools for this are surprisingly rare
Despite how common this problem is, many teams still rely on:
- Excel comparisons
- ad-hoc Python scripts
- SQL queries
- manual inspection
These approaches work, but they are usually fragile and hard to reuse.
8. A different approach
Instead of comparing files line by line, Reconlify compares datasets.
It supports concepts like:
- key-based row matching
- column mapping
- normalization rules
- numeric tolerance
- structured audit reports
This makes it easier to answer the real question:
Did the data change in a meaningful way?
Learn more
Reconlify is an open-source CLI for semantic dataset comparison.
GitHub: https://github.com/testuteab/reconlify-cli
Docs and examples: https://reconlify.com