This repository contains the code for building the knowledge graph TaxGraph. For more information on TaxGraph see taxgraph.informatik.uni-mannheim.de.
Paper: A Knowledge Graph for Assessing Aggressive Tax Planning Strategies
A dump of the knowledge graph can be downloaded here.
Five path variables have to be specified in createRDF.py before running the file to build the knowledge graph.
path_lei_data: This path points to a Golden Copy File containing information about legal entities. These files are published
by GLEIF and can be downloaded from
here. Download the LEI-CDF v2.1 file.
The code expects the file to be in CSV format. Our knowledge graph was built with the file from 2019-10-09 08:00.
path_relationship_data: This path points to a Golden Copy File containing information about the relationships between legal
entities. These files are published by GLEIF and can be downloaded from
here. Download the RR-CDF v1.1 file.
The code expects the file to be in CSV format. Our knowledge graph was built with the file from 2019-10-09 08:00.
path_wikidata_cities: This path points to a CSV file containing combinations of Wikidata entity ID, postal code and label.
A compressed version of the file that we used can be found under data/wikidataCityData/wikidata_cities.csv.gz.
The file can be decompressed by running gzip -dk wikidata_cities.csv.gz.
path_additonal_data: This path points to a file containing additional data retrieved from the
World Bank, the OECD and Wikidata. This
file can be created by running createAdditionalDataSets.py. The file that we used for building our version of the
knowledge graph can be found under data/additionalData/2020-03-17_00:56:06_df.pkl.
graph_storage_folder: This path points to the folder in which to store the final knowledge graph as an RDF file.
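As a rough illustration, the five variables could be collected and checked before starting a build. The variable names are the ones listed above (path_additonal_data is spelled as shown there). All path values except the two files shipped under data/ are placeholders, and the helper find_missing_paths is hypothetical and not part of this repository.

```python
from pathlib import Path

# Placeholder values (assumptions): point these at your own downloads.
paths = {
    "path_lei_data": "data/gleif/lei-cdf-golden-copy.csv",
    "path_relationship_data": "data/gleif/rr-cdf-golden-copy.csv",
    "path_wikidata_cities": "data/wikidataCityData/wikidata_cities.csv",
    "path_additonal_data": "data/additionalData/2020-03-17_00:56:06_df.pkl",
    "graph_storage_folder": "output/",
}


def find_missing_paths(paths):
    """Return the names of all variables whose path does not exist."""
    return [name for name, value in paths.items() if not Path(value).exists()]


if __name__ == "__main__":
    missing = find_missing_paths(paths)
    if missing:
        print("Missing paths:", ", ".join(missing))
    else:
        print("All five paths exist.")
```

Running such a check first avoids starting a long, memory-intensive build that fails late because of a wrong path.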
The build process has a high memory footprint, as most of the data processing is performed in-memory. We built the knowledge graph on a machine with 32 GB of memory. By optimizing the code and rewriting the data processing to be performed on disk, it should be possible to reduce the memory footprint considerably.
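One simple way to lower memory use is to stream a CSV file row by row instead of loading it completely. The following is a minimal sketch of that idea using only the Python standard library. It is not the code used in createRDF.py, it shows streaming rather than full on-disk processing, and the function name count_values_streaming is hypothetical.

```python
import csv
from collections import Counter


def count_values_streaming(csv_path, column):
    """Count the values of one column while reading the file row by row."""
    counts = Counter()
    with open(csv_path, newline="", encoding="utf-8") as handle:
        # DictReader yields one row at a time, so the file is never
        # held in memory as a whole.
        for row in csv.DictReader(handle):
            counts[row[column]] += 1
    return counts
```

For example, count_values_streaming(path_lei_data, "Entity.LegalAddress.Country") would count legal entities per country, assuming the downloaded file has a column with that name.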