Adapting Dac-Man's CSV plug-in to custom data

This example illustrates how to implement a change analysis tailored to specific features of the data being analyzed, by creating a minimal extension of the included CSV plug-in.

A complete, runnable version of the script is available at the end of this section and in the examples/csv directory of the Dac-Man source code repository, together with the example data (shown immediately below) as two separate CSV files.

Test data

In this example, we perform an analysis of the changes between the files A.csv and B.csv:

A.csv:

# a single commented line
site_id,message,date,temperature
C,Some text,8/15/2019,24.1
A,Some other text,3/2/2019,35.3
B,,1/1/2018,42.8

B.csv:

# a 
# longer
# comment
# spanning
# several
# lines
site_id,temperature,message,day
B,42.1,Some new text,01/01/2018
# some random comments in the middle
A,35.3,Some other text,03/02/2019
C,24.1,Some text,09/15/2019

Since they contain only a few lines, we can notice some of the differences just by looking at the two files. Even though the tabular data itself is very similar, there are several differences in the format and structure of the two files:

  • B contains additional comments
  • Corresponding rows are not in the same order
  • The order of values in the same row is different
  • The values appearing under the field date in A appear under day in B
  • The values for date/day use different conventions to represent date values (with/without leading zeros)

By running dacman diff A.csv B.csv with the default options, we obtain these change statistics:

percent change: 87.50%
      frac_changed  n_added  n_changed  n_deleted  n_modified  n_unchanged  \
name                                                                         
0              0.5        0          2          0           2            2   
1              1.0        1          4          0           3            0   
2              1.0        0          4          0           4            0   
3              1.0        0          4          0           4            0   

        status  
name            
0     MODIFIED  
1     MODIFIED  
2     MODIFIED  
3     MODIFIED  

The complete output contains more information about each part of the analysis.

The file section contains change metrics based on information about the source files themselves. These metrics are calculated in the CSVPlugin.compare_file() method.

Dac-Man detected changes in the filename, n_lines, n_lines_commented, and n_chars metadata properties, while the path_parent and n_lines_uncommented are the same in both files.

{
    "file": {
        "status": "MODIFIED",
        "metadata": {
            "common": {
                "path_parent": "/tmp",
                "n_lines_uncommented": 4
            },
            "changes": {
                "filename": {
                    "a": "A.csv",
                    "b": "B.csv"
                },
                "n_lines": {
                    "a": 5,
                    "b": 11
                },
                "n_lines_commented": {
                    "a": 1,
                    "b": 7
                },
                "n_chars": {
                    "a": 134,
                    "b": 215
                }
            }
        }
    }
}

The change percentage is high compared to our preliminary inspection of the sources, so it is possible that some of the changes detected with the most general options are false positives. We will repeat the analysis, tweaking the plug-in options based on our knowledge of the structure of the source data.

Creating a specialized comparator class

Tip

This part of the example focuses on MyCSVPlugin, a simple extension of Dac-Man's included CSVPlugin comparator class and its options, described here. A complete runnable analysis script is available at the end of this section.

Tip

For more information about Dac-Man plug-ins, refer to the Using plug-ins and Plug-in API sections.

We start by creating the comparator class that implements our customizations, extending the CSVPlugin class:

from dacman.plugins.csv import CSVPlugin

class MyCSVPlugin(CSVPlugin):
    pass

Setting column names from the header

We begin by selecting the position of the row containing the header, from which the column names will be obtained.

This is done by setting the value of header_pos in the calc_options plug-in attribute, which will be 0, since the header is the first row (counting from 0) in both sources:

class MyCSVPlugin(CSVPlugin):

    calc_options = {
        'header_pos': 0,
    }
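Conceptually, header_pos plays the same role as the header argument of pandas.read_csv; this standalone sketch (not Dac-Man's internal code) shows how a header row position determines the column names:

```python
import io

import pandas as pd

# miniature excerpt of A.csv for illustration
csv_text = """site_id,message,date,temperature
C,Some text,8/15/2019,24.1
A,Some other text,3/2/2019,35.3
"""

# header=0: row 0 provides the column names, so values can be
# compared by column name rather than by position
df = pd.read_csv(io.StringIO(csv_text), header=0)
print(list(df.columns))  # ['site_id', 'message', 'date', 'temperature']
```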

After running dacman, the change percentage is slightly lower, accounting for the fact that the positions of the temperature and message columns are switched in the sources:

percent change: 80.00%
             frac_changed  n_added  n_changed  n_deleted  n_modified  \
name                                                                   
site_id          0.666667        0          2          0           2   
message          0.666667        1          2          0           1   
date             1.000000        0          3          3           0   
temperature      0.666667        0          2          0           2   
day              1.000000        3          3          0           0   

             n_unchanged    status  
name                                
site_id                1  MODIFIED  
message                1  MODIFIED  
date                   0  MODIFIED  
temperature            1  MODIFIED  
day                    0  MODIFIED  

We can verify this by checking the section dedicated to column metadata, in particular the colidx property, in the complete change analysis output:

{
    "site_id": {
        "status": "MODIFIED",
        "metadata": {
            "common": {
                "name": "site_id",
                "name_header": "site_id",
                "colidx": 0,
                "dtype": "object"
            },
            "changes": {}
        },
        "values": {
            "n_unchanged": 1,
            "n_added": 0,
            "n_deleted": 0,
            "n_modified": 2,
            "n_changed": 2,
            "frac_changed": 0.6666666666666666
        }
    },
    "message": {
        "status": "MODIFIED",
        "metadata": {
            "common": {
                "name": "message",
                "name_header": "message",
                "dtype": "object"
            },
            "changes": {
                "colidx": {
                    "a": 1,
                    "b": 2
                }
            }
        },
        "values": {
            "n_unchanged": 1,
            "n_added": 1,
            "n_deleted": 0,
            "n_modified": 1,
            "n_changed": 2,
            "frac_changed": 0.6666666666666666
        }
    },
    "date": {
        "status": "MODIFIED",
        "metadata": {
            "common": {},
            "changes": {}
        },
        "values": {
            "n_unchanged": 0,
            "n_added": 0,
            "n_deleted": 3,
            "n_modified": 0,
            "n_changed": 3,
            "frac_changed": 1.0
        }
    },
    "temperature": {
        "status": "MODIFIED",
        "metadata": {
            "common": {
                "name": "temperature",
                "name_header": "temperature",
                "dtype": "object"
            },
            "changes": {
                "colidx": {
                    "a": 3,
                    "b": 1
                }
            }
        },
        "values": {
            "n_unchanged": 1,
            "n_added": 0,
            "n_deleted": 0,
            "n_modified": 2,
            "n_changed": 2,
            "frac_changed": 0.6666666666666666
        }
    },
    "day": {
        "status": "MODIFIED",
        "metadata": {
            "common": {},
            "changes": {}
        },
        "values": {
            "n_unchanged": 0,
            "n_added": 3,
            "n_deleted": 0,
            "n_modified": 0,
            "n_changed": 3,
            "frac_changed": 1.0
        }
    }
}

Setting the index

Table values are compared row-by-row, according to the row labels, or, in equivalent terms, according to which column is set as the table index. The default option uses the original row index, which results in an erroneous correspondence between rows because the rows are ordered differently in the sources.

The values under site_id are a good choice for an index, since they are unique within each table, and are the same in both sources.

To set the index, we set the value of index in calc_options:

class MyCSVPlugin(CSVPlugin):

    calc_options = {
        'header_pos': 0,
        'index': 'site_id',
    }
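The effect of choosing site_id as the index can be pictured with plain pandas. This is an illustrative sketch using a hypothetical miniature of the two sources (same site_id values, rows in a different order, one temperature value actually modified), not the plug-in's internal code:

```python
import pandas as pd

a = pd.DataFrame({"site_id": ["C", "A", "B"], "temperature": [24.1, 35.3, 42.8]})
b = pd.DataFrame({"site_id": ["B", "C", "A"], "temperature": [42.1, 24.1, 35.3]})

# Matching rows by position: every value looks changed
by_position = int((a["temperature"].values != b["temperature"].values).sum())

# Matching rows by the site_id label, regardless of row order in the files
a_idx = a.set_index("site_id").sort_index()
b_idx = b.set_index("site_id").sort_index()
by_label = int((a_idx["temperature"] != b_idx["temperature"]).sum())

print(by_position, by_label)  # 3 1
```

With the index set, only the genuinely modified value (42.8 vs 42.1 for site B) is counted as changed.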

After running dacman, the results are:

percent change: 66.67%
             frac_changed  n_added  n_changed  n_deleted  n_modified  \
name                                                                   
message          0.333333        1          1          0           0   
date             1.000000        0          3          3           0   
temperature      0.333333        0          1          0           1   
day              1.000000        3          3          0           0   

             n_unchanged    status  
name                                
message                2  MODIFIED  
date                   0  MODIFIED  
temperature            2  MODIFIED  
day                    0  MODIFIED  

The change percentage is lower, accounting for the fact that the rows are now associated correctly, regardless of the order in which they appear in the CSV files.

We can verify this by comparing the complete output before and after setting index = site_id, e.g. for the values of the message column.

Before:

{
    "n_unchanged": 1,
    "n_added": 1,
    "n_deleted": 0,
    "n_modified": 1,
    "n_changed": 2,
    "frac_changed": 0.6666666666666666
}

After:

{
    "n_unchanged": 2,
    "n_added": 1,
    "n_deleted": 0,
    "n_modified": 0,
    "n_changed": 1,
    "frac_changed": 0.3333333333333333
}

For the sake of comparison, if we used index = lineno, rows would be matched by the line number at which they appear in the source file. Comments are also counted in lineno, causing the rows to be completely misaligned:

percent change: 96.67%
             frac_changed  n_added  n_changed  n_deleted  n_modified  \
name                                                                   
site_id          1.000000        3          6          3           0   
message          0.833333        3          5          2           0   
date             1.000000        0          3          3           0   
temperature      1.000000        3          6          3           0   
day              1.000000        3          3          0           0   

             n_unchanged    status  
name                                
site_id                0  MODIFIED  
message                1  MODIFIED  
date                   0  MODIFIED  
temperature            0  MODIFIED  
day                    0  MODIFIED  

For reference, this is what the calc tables generated from A.csv and B.csv look like in the three cases:

{
    "header_pos": 0
}
__rowidx site_id message date temperature __lineno __rowidx
1 C Some text 8/15/2019 24.1 3 1
2 A Some other text 3/2/2019 35.3 4 2
3 B nan 1/1/2018 42.8 5 3
__rowidx site_id temperature message day __lineno __rowidx
1 B 42.1 Some new text 01/01/2018 8 1
2 A 35.3 Some other text 03/02/2019 10 2
3 C 24.1 Some text 09/15/2019 11 3
{
    "header_pos": 0,
    "index": "site_id"
}
site_id message date temperature __lineno __rowidx
A Some other text 3/2/2019 35.3 4 2
B nan 1/1/2018 42.8 5 3
C Some text 8/15/2019 24.1 3 1
site_id temperature message day __lineno __rowidx
A 35.3 Some other text 03/02/2019 10 2
B 42.1 Some new text 01/01/2018 8 1
C 24.1 Some text 09/15/2019 11 3
{
    "header_pos": 0,
    "index": "lineno"
}
__lineno site_id message date temperature __lineno __rowidx
3 C Some text 8/15/2019 24.1 3 1
4 A Some other text 3/2/2019 35.3 4 2
5 B nan 1/1/2018 42.8 5 3
__lineno site_id temperature message day __lineno __rowidx
8 B 42.1 Some new text 01/01/2018 8 1
10 A 35.3 Some other text 03/02/2019 10 2
11 C 24.1 Some text 09/15/2019 11 3

Matching columns with different names

From the change summary, we see that the two columns date and day are detected as DELETED and ADDED, respectively.

We know that the two columns are actually the same column appearing with different names in the sources, so we can set the column_renames value using a dict to match the column names used for comparison.

Detail

The order of the mapping (i.e. turning day into date, or vice versa) is not relevant, since the original names are stored and compared as column metadata in any case.

class MyCSVPlugin(CSVPlugin):

    calc_options = {
        'header_pos': 0,
        'index': 'site_id',
        'column_renames': {'day': 'date'},
    }
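Under the hood, this kind of matching can be pictured as a pandas column rename. This is an illustrative sketch on a hypothetical one-row slice of B.csv, not the plug-in's actual code path:

```python
import pandas as pd

# one-row slice of B.csv, where the date column is named "day"
b = pd.DataFrame({"site_id": ["B"], "day": ["01/01/2018"]})

# Renaming 'day' to 'date' lets the column line up with its counterpart
# in A.csv; the direction of the mapping is irrelevant for the comparison
b = b.rename(columns={"day": "date"})
print(list(b.columns))  # ['site_id', 'date']
```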

After running dacman, the change percentage is lower, since the date column doesn't appear twice under different names:

percent change: 55.56%
             frac_changed  n_added  n_changed  n_deleted  n_modified  \
name                                                                   
message          0.333333        1          1          0           0   
date             1.000000        0          3          0           3   
temperature      0.333333        0          1          0           1   

             n_unchanged    status  
name                                
message                2  MODIFIED  
date                   0  MODIFIED  
temperature            2  MODIFIED  

From the complete output, we can verify that the column name obtained from the header (recorded under the metadata property name_header) differs between the sources:

{
    "status": "MODIFIED",
    "metadata": {
        "common": {
            "name": "date",
            "dtype": "object"
        },
        "changes": {
            "name_header": {
                "a": "date",
                "b": "day"
            },
            "colidx": {
                "a": 2,
                "b": 3
            }
        }
    },
    "values": {
        "n_unchanged": 0,
        "n_added": 0,
        "n_deleted": 0,
        "n_modified": 3,
        "n_changed": 3,
        "frac_changed": 1.0
    }
}

Using value-specific data types instead of text

Despite matching the date column across both sources correctly, all values in the date column still appear as modified.

We can get more details from the values_by_column section of the complete output, by comparing the entries under the a and b keys for each value. We can see that only one of the values is actually modified, while the others refer to the same date in both sources, stored as text in different formats.

{
    "UNCHANGED": {
        "count": 0
    },
    "ADDED": {
        "count": 0,
        "values": []
    },
    "DELETED": {
        "count": 0,
        "values": []
    },
    "MODIFIED": {
        "count": 3,
        "values": [
            {
                "a": {
                    "original": "3/2/2019",
                    "calculated_as": "3/2/2019",
                    "loc": {
                        "line": 4
                    }
                },
                "b": {
                    "original": "03/02/2019",
                    "calculated_as": "03/02/2019",
                    "loc": {
                        "lineno": 10,
                        "colidx": {
                            "a": 2,
                            "b": 3
                        },
                        "colname": {
                            "a": "date",
                            "b": "day"
                        }
                    }
                },
                "delta": NaN
            },
            {
                "a": {
                    "original": "1/1/2018",
                    "calculated_as": "1/1/2018",
                    "loc": {
                        "line": 5
                    }
                },
                "b": {
                    "original": "01/01/2018",
                    "calculated_as": "01/01/2018",
                    "loc": {
                        "lineno": 8,
                        "colidx": {
                            "a": 2,
                            "b": 3
                        },
                        "colname": {
                            "a": "date",
                            "b": "day"
                        }
                    }
                },
                "delta": NaN
            },
            {
                "a": {
                    "original": "8/15/2019",
                    "calculated_as": "8/15/2019",
                    "loc": {
                        "line": 3
                    }
                },
                "b": {
                    "original": "09/15/2019",
                    "calculated_as": "09/15/2019",
                    "loc": {
                        "lineno": 11,
                        "colidx": {
                            "a": 2,
                            "b": 3
                        },
                        "colname": {
                            "a": "date",
                            "b": "day"
                        }
                    }
                },
                "delta": NaN
            }
        ]
    }
}

By comparing the date values as actual datetime objects, rather than text, we can address these false positives caused by the different date format.

By setting the dtype option to True, Dac-Man will attempt to convert values from text to more specific data types (datetime or numeric) automatically. This relies on pandas's robust data converter functionality, and should work without needing further manual adjustments in most cases.

class MyCSVPlugin(CSVPlugin):

    calc_options = {
        'header_pos': 0,
        'index': 'site_id',
        'column_renames': {'day': 'date'},
        'dtype': True
    }
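The effect of the conversion can be reproduced with pandas' conversion functions. This is an illustrative sketch of the principle; Dac-Man's actual conversion logic may differ:

```python
import pandas as pd

# As plain text, the same date in two formats compares as different
assert "1/1/2018" != "01/01/2018"

# Converted to datetime, both spellings parse to the same value
assert pd.to_datetime("1/1/2018") == pd.to_datetime("01/01/2018")

# Numeric text gains a meaningful delta after conversion
delta = pd.to_numeric("42.8") - pd.to_numeric("42.1")
print(round(float(delta), 2))  # 0.7
```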

After running dacman again, the result is:

percent change: 33.33%
             frac_changed  n_added  n_changed  n_deleted  n_modified  \
name                                                                   
message          0.333333        1          1          0           0   
date             0.333333        0          1          0           1   
temperature      0.333333        0          1          0           1   

             n_unchanged    status  
name                                
message                2  MODIFIED  
date                   2  MODIFIED  
temperature            2  MODIFIED  

We can observe the effect of the data type conversion by comparing the values_by_column section for the date column with the result from the previous run:

  • The two values representing the same dates are classified as UNCHANGED rather than MODIFIED
  • For the value that was actually different, it is now possible to calculate the delta between the two sources expressed as a meaningful quantity: in this case, since the values are of type datetime, delta is of type timedelta
{
    "UNCHANGED": {
        "count": 2
    },
    "ADDED": {
        "count": 0,
        "values": []
    },
    "DELETED": {
        "count": 0,
        "values": []
    },
    "MODIFIED": {
        "count": 1,
        "values": [
            {
                "a": {
                    "original": "8/15/2019",
                    "calculated_as": "2019-08-15 00:00:00",
                    "loc": {
                        "line": 3
                    }
                },
                "b": {
                    "original": "09/15/2019",
                    "calculated_as": "2019-09-15 00:00:00",
                    "loc": {
                        "lineno": 11,
                        "colidx": {
                            "a": 2,
                            "b": 3
                        },
                        "colname": {
                            "a": "date",
                            "b": "day"
                        }
                    }
                },
                "delta": "-31 days +00:00:00"
            }
        ]
    }
}
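The timedelta reported as delta above (computed as a minus b, judging from the sign) can be reproduced directly with pandas:

```python
import pandas as pd

# 2019-08-15 minus 2019-09-15 yields a negative timedelta of one month
delta = pd.to_datetime("8/15/2019") - pd.to_datetime("09/15/2019")
print(delta)  # -31 days +00:00:00
```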

Since the data type conversion was applied automatically, deltas are available for all columns with numeric values, such as temperature:

{
    "UNCHANGED": {
        "count": 2
    },
    "ADDED": {
        "count": 0,
        "values": []
    },
    "DELETED": {
        "count": 0,
        "values": []
    },
    "MODIFIED": {
        "count": 1,
        "values": [
            {
                "a": {
                    "original": "42.8",
                    "calculated_as": 42.8,
                    "loc": {
                        "line": 5
                    }
                },
                "b": {
                    "original": "42.1",
                    "calculated_as": 42.1,
                    "loc": {
                        "lineno": 8,
                        "colidx": {
                            "a": 3,
                            "b": 1
                        },
                        "colname": {
                            "a": "temperature",
                            "b": "temperature"
                        }
                    }
                },
                "delta": 0.6999999999999957
            }
        ]
    }
}

Complete change analysis script

Available at examples/csv/my_csv_ana.py.

#!/usr/bin/env python3
import sys

import dacman
from dacman.plugins.csv import CSVPlugin


class MyCSVPlugin(CSVPlugin):

    calc_options = {
        'header_pos': 0,
        'index': 'site_id',
        'column_renames': {'day': 'date'},
        'dtype': True
    }


def run_my_change_analysis(file_a, file_b):
    comparisons = [(file_a, file_b)]
    differ = dacman.DataDiffer(comparisons, dacman.Executor.DEFAULT)
    differ.use_plugin(MyCSVPlugin)
    differ.start()


if __name__ == '__main__':
    cli_args = sys.argv[1:]
    print(f'cli_args={cli_args}')
    file_a, file_b = cli_args[0], cli_args[1]
    run_my_change_analysis(file_a, file_b)