Hello all,
As far as I understand, the WARCHdfsBolt produces a continuous stream of records in WARC format. The resulting WARC files are written to a storage backend, e.g. an S3-compatible store, according to a configured RotationPolicy and FilenameFormat. Within the Storm topology, the WARCHdfsBolt is a dead end: it does not emit any tuples.
However, we are particularly interested in knowing which file (filename) a given web page / WARC record was written to, and we would like to forward this information to the index, e.g. an OpenSearch/Elasticsearch instance.
In the end we would like to know: https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz
Is this reasonable and technically possible? Presumably it would only work if the WARCHdfsBolt emitted the corresponding information to the StatusUpdaterBolt, so that it is no longer a dead end.
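
To make the idea more concrete, here is a minimal sketch of what we have in mind. It is plain Java and deliberately does not use the Storm or StormCrawler API: `RotatingWriter`, `Location`, the count-based rotation and the file naming scheme are all hypothetical stand-ins, not the actual behaviour of WARCHdfsBolt. The only point is that the writer knows the current filename at write time and could hand a (URL, filename) pair downstream.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, NOT StormCrawler code: it only illustrates that a
// rotating writer knows the current filename when it writes a record.
public class WarcLocationSketch {

    /** The (URL, file) pair we would like a non-dead-end bolt to emit. */
    static final class Location {
        final String url;
        final String warcFile;

        Location(String url, String warcFile) {
            this.url = url;
            this.warcFile = warcFile;
        }

        @Override
        public String toString() {
            return url + " --is_stored_in--> " + warcFile;
        }
    }

    /** Stand-in for a writer with an assumed count-based rotation policy. */
    static final class RotatingWriter {
        private final String prefix;
        private final int recordsPerFile;
        private int fileIndex = 0;
        private int recordsInCurrentFile = 0;

        RotatingWriter(String prefix, int recordsPerFile) {
            this.prefix = prefix;
            this.recordsPerFile = recordsPerFile;
        }

        /** Name of the file the next record goes into (assumed naming scheme). */
        String currentFile() {
            return String.format(Locale.ROOT, "%s%04d.warc.gz", prefix, fileIndex);
        }

        /** "Writes" a record and returns where it ended up instead of returning nothing. */
        Location write(String url) {
            if (recordsInCurrentFile == recordsPerFile) {
                // Rotate: start a new file once the current one is full.
                fileIndex++;
                recordsInCurrentFile = 0;
            }
            recordsInCurrentFile++;
            return new Location(url, currentFile());
        }
    }

    public static void main(String[] args) {
        RotatingWriter writer = new RotatingWriter("s3://path/to/file/WARC_file_", 2);
        List<String> urls = List.of(
                "https://stormcrawler.net/",
                "https://stormcrawler.net/faq/",
                "https://stormcrawler.net/docs/");
        for (String url : urls) {
            // In a real topology this pair would be emitted as a tuple
            // towards the status updater or indexer rather than printed.
            System.out.println(writer.write(url));
        }
    }
}
```

In a real topology, the returned pair would presumably have to be emitted as a tuple on some output stream, which is exactly the part we are unsure about.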