Extending the CSV plug-in¶
This example illustrates how to implement a change analysis for CSV files tailored to specific features of the data being analyzed, by creating a minimal extension of the included CSV plug-in.
A complete runnable version of the entire script is available at the end of this section, and in the examples/csv directory of the Dac-Man source code repository, together with the example data shown below as two separate CSV files.
Creating the main block of the change analysis script¶
We start by creating the skeleton of the analysis script in a Python file, e.g. /home/user/my_csv_ana.py.
The first line, starting with #!..., is necessary to make the script executable.
Dac-Man analysis scripts must accept two command-line arguments, which are the paths to the files that will be compared.
    #!/usr/bin/env python3

    import sys


    if __name__ == '__main__':
        cli_args = sys.argv[1:]
        print(f'cli_args={cli_args}')
        file_a, file_b = cli_args[0], cli_args[1]
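The skeleton above indexes into the argument list directly, so a missing argument surfaces as an IndexError. A slightly more defensive variant could check the argument count first. The following is only a sketch: the helper name parse_file_args is hypothetical and not part of Dac-Man, and the example call uses a hand-written list in place of sys.argv[1:].

```python
def parse_file_args(argv):
    """Return the first two entries of argv as (file_a, file_b).

    Illustrative helper only; it is not part of Dac-Man's API.
    """
    if len(argv) < 2:
        raise ValueError(f'expected 2 file paths, got {len(argv)}')
    return argv[0], argv[1]


# In the real script, the argument would be sys.argv[1:].
file_a, file_b = parse_file_args(['A.csv', 'B.csv'])
print(f'file_a={file_a}, file_b={file_b}')
```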
Implementing the change analysis with Dac-Man's API¶
Next, we create a Python function that takes the two files as arguments and implements the custom change analysis using Dac-Man's API. This will allow us to integrate our customized comparator class while reusing much of the functionality provided by Dac-Man.
    import sys

    import dacman


    def run_my_change_analysis(file_a, file_b):
        comparisons = [(file_a, file_b)]
        differ = dacman.DataDiffer(comparisons, dacman.Executor.DEFAULT)


    if __name__ == '__main__':
        cli_args = sys.argv[1:]
        print(f'cli_args={cli_args}')
        file_a, file_b = cli_args[0], cli_args[1]
        run_my_change_analysis(file_a, file_b)
Creating a specialized comparator class¶
In this example, we perform an analysis of the changes between the files A.csv and B.csv:
    # a single commented line
    site_id,message,date,temperature
    C,Some text,8/15/2019,24.1
    A,Some other text,3/2/2019,35.3
    B,,1/1/2018,42.8
    # a
    # longer
    # comment
    # spanning
    # several
    # lines
    site_id,temperature,message,day
    B,42.1,Some new text,01/01/2018
    # some random comments in the middle
    A,35.3,Some other text,03/02/2019
    C,24.1,Some text,09/15/2019
Since they contain only a few lines, we can spot some of the differences just by looking at the two files. Even though the tabular data itself is very similar, there are several differences in the format and structure of the two files:
- B contains additional comments
- Corresponding rows are not in the same order
- The order of values in the same row is different
- The values appearing under the field date in A appear under day in B
- The values for date/day use different conventions to represent date values (with/without leading zeros)
In the following, we'll build an extension of the CSV plug-in tailored to the structure of this source data, and show how to integrate it in a custom analysis script.
We start by creating the comparator subclass that will hold our customizations, by extending the CSVPlugin class:
    from dacman.plugins.csv import CSVPlugin


    class MyCSVPlugin(CSVPlugin):
        pass
Setting column names from the header¶
Our first step in adapting MyCSVPlugin to the input data is to specify the position of the row containing the header, from which the table column names will be obtained.
This is done by setting the value of header_pos in the calc_options plug-in attribute. The value is 0 here, since the header is the first row (counting from 0) in both sources:
    class MyCSVPlugin(CSVPlugin):

        calc_options = {
            'header_pos': 0,
        }
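To make the idea of a header position concrete, the sketch below parses the contents of A.csv with the standard library only. It is not Dac-Man's implementation: the function read_table and its comment_char parameter are hypothetical, and the sketch assumes that comment lines are dropped before the header position is counted.

```python
import csv

A_CSV = """\
# a single commented line
site_id,message,date,temperature
C,Some text,8/15/2019,24.1
A,Some other text,3/2/2019,35.3
B,,1/1/2018,42.8
"""


def read_table(text, header_pos=0, comment_char='#'):
    """Split CSV text into (column_names, data_rows), ignoring comment lines."""
    lines = [
        line for line in text.splitlines()
        if line and not line.startswith(comment_char)
    ]
    rows = list(csv.reader(lines))
    # The header is the row at header_pos among the non-comment rows;
    # everything after it is treated as data.
    return rows[header_pos], rows[header_pos + 1:]


header, data = read_table(A_CSV)
print(header)     # ['site_id', 'message', 'date', 'temperature']
print(len(data))  # 3
```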
Setting the index¶
Table values are compared row-by-row, according to the row labels, or, in equivalent terms, according to which column is set as the table index. The default option uses the original row index, which results in an erroneous correspondence between rows because the rows are ordered differently in the two sources.
The values under site_id are a good choice for an index, since they are unique within each table, and are the same in both sources.
To set the index, we set the value of index in calc_options:
    class MyCSVPlugin(CSVPlugin):

        calc_options = {
            'header_pos': 0,
            'index': 'site_id',
        }
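The effect of the index choice can be illustrated without Dac-Man. The sketch below uses the site_id and temperature values from the two example files and compares them first by row position, then by site_id. It only demonstrates the general idea and makes no claim about how the plug-in performs the comparison internally.

```python
# (site_id, temperature) pairs, in the order they appear in each file.
rows_a = [('C', '24.1'), ('A', '35.3'), ('B', '42.8')]
rows_b = [('B', '42.1'), ('A', '35.3'), ('C', '24.1')]

# Positional comparison: the i-th row of A against the i-th row of B.
changed_by_position = [a[0] for a, b in zip(rows_a, rows_b) if a != b]

# Keyed comparison: rows are paired by their site_id value.
temps_a = dict(rows_a)
temps_b = dict(rows_b)
changed_by_site = sorted(
    site for site in temps_a if temps_a[site] != temps_b.get(site)
)

print(changed_by_position)  # ['C', 'B']
print(changed_by_site)      # ['B']
```

Pairing rows by position reports two changed rows only because the files are sorted differently; pairing by site_id reports the single temperature that actually differs.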
Matching columns with different names¶
In the input data, the two columns date in A.csv and day in B.csv refer to the same data.
Since their names differ between the two versions, they will be detected as:
- date: DELETED from A.csv, because it only appears in the first file;
- day: ADDED to B.csv, because it only appears in the second file.
We know that the two columns are actually the same column appearing with different names in the sources, so we can set the column_renames value using a dict to match the column names used for comparison.
    class MyCSVPlugin(CSVPlugin):

        calc_options = {
            'header_pos': 0,
            'index': 'site_id',
            'column_renames': {'day': 'date'},
        }
Detail
The order of the mapping (i.e. turning day into date, or the other way around) is not relevant, as the original names are in any case stored and compared as column metadata.
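As a rough illustration of what a rename mapping achieves, the sketch below applies a dict of the same shape to the header of B.csv and checks which column names remain unmatched. The helper rename_columns is hypothetical and is not how Dac-Man applies column_renames internally.

```python
def rename_columns(header, column_renames):
    """Return a copy of header with names replaced according to the mapping."""
    return [column_renames.get(name, name) for name in header]


header_a = ['site_id', 'message', 'date', 'temperature']
header_b = ['site_id', 'temperature', 'message', 'day']

# Without renaming, each file has one column the other lacks.
print(sorted(set(header_a) - set(header_b)))  # ['date']
print(sorted(set(header_b) - set(header_a)))  # ['day']

# After renaming, both files expose the same set of column names.
renamed_b = rename_columns(header_b, {'day': 'date'})
print(renamed_b)                         # ['site_id', 'temperature', 'message', 'date']
print(set(header_a) == set(renamed_b))   # True
```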
Using value-specific data types instead of text¶
Even though the date column is now matched correctly across both sources, all values in the date column will be interpreted as modified.
This is because the same dates are written as text in different formats.
By comparing the date values as actual datetime objects, rather than text, we can eliminate these false positives caused by the different date formats.
By setting the dtype option to True, Dac-Man will attempt to convert values from text to more specific data types (datetime or numeric) automatically.
This relies on pandas's robust data converter functionality, and should work without needing further manual adjustments in most cases.
    class MyCSVPlugin(CSVPlugin):

        calc_options = {
            'header_pos': 0,
            'index': 'site_id',
            'column_renames': {'day': 'date'},
            'dtype': True
        }
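The difference between comparing dates as text and as datetime objects can be seen with the values from the two example files. The sketch below uses the standard library's datetime.strptime with an explicit month/day/year format, rather than the pandas converters that Dac-Man relies on, so it only illustrates the principle.

```python
from datetime import datetime

# date/day values from A.csv and B.csv, keyed by site_id.
dates_a = {'C': '8/15/2019', 'A': '3/2/2019', 'B': '1/1/2018'}
dates_b = {'B': '01/01/2018', 'A': '03/02/2019', 'C': '09/15/2019'}


def parse_date(text):
    """Parse a month/day/year string, with or without leading zeros."""
    return datetime.strptime(text, '%m/%d/%Y')


changed_as_text = sorted(
    site for site in dates_a if dates_a[site] != dates_b[site]
)
changed_as_dates = sorted(
    site for site in dates_a
    if parse_date(dates_a[site]) != parse_date(dates_b[site])
)

print(changed_as_text)   # ['A', 'B', 'C']
print(changed_as_dates)  # ['C']
```

Compared as text, every date looks modified. Compared as dates, only site C remains, where the month itself differs between the two files.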
Integrating our custom comparator in the change analysis¶
The next step is to add the code for our custom comparator class MyCSVPlugin and set it as the plug-in to use for the comparison:
    import sys

    import dacman
    from dacman.plugins.csv import CSVPlugin


    class MyCSVPlugin(CSVPlugin):

        calc_options = {
            'header_pos': 0,
            'index': 'site_id',
            'column_renames': {'day': 'date'},
            'dtype': True
        }


    def run_my_change_analysis(file_a, file_b):
        comparisons = [(file_a, file_b)]
        differ = dacman.DataDiffer(comparisons, dacman.Executor.DEFAULT)
        differ.use_plugin(MyCSVPlugin)
        differ.start()


    if __name__ == '__main__':
        cli_args = sys.argv[1:]
        print(f'cli_args={cli_args}')
        file_a, file_b = cli_args[0], cli_args[1]
        run_my_change_analysis(file_a, file_b)
Running the custom change analysis¶
The complete code for this custom analysis script is:
    #!/usr/bin/env python3

    import sys

    import dacman
    from dacman.plugins.csv import CSVPlugin


    class MyCSVPlugin(CSVPlugin):

        calc_options = {
            'header_pos': 0,
            'index': 'site_id',
            'column_renames': {'day': 'date'},
            'dtype': True
        }


    def run_my_change_analysis(file_a, file_b):
        comparisons = [(file_a, file_b)]
        differ = dacman.DataDiffer(comparisons, dacman.Executor.DEFAULT)
        differ.use_plugin(MyCSVPlugin)
        differ.start()


    if __name__ == '__main__':
        cli_args = sys.argv[1:]
        print(f'cli_args={cli_args}')
        file_a, file_b = cli_args[0], cli_args[1]
        run_my_change_analysis(file_a, file_b)
To test this change analysis script with Dac-Man, add executable permissions to the my_csv_ana.py Python file using e.g. the chmod command:
chmod +x /home/user/my_csv_ana.py
Then, navigate to the examples/csv directory and run:
dacman diff A.csv B.csv --script /home/user/my_csv_ana.py
Tip
A complete runnable copy of this file is already available as examples/csv/my_csv_ana.py