WARCHdfsBolt forwarding WARC file path to StatusUpdaterBolt #1044

@michaeldinzinger

Hello all,
as far as I understand, the WARCHdfsBolt produces a continuous stream of records in WARC format. The resulting WARC files are written to a storage backend, e.g. an S3-compatible store, according to a configured RotationPolicy and FilenameFormat. Within the Storm topology, the WARCHdfsBolt is a dead end and does not emit any tuples.
However, we are particularly interested in knowing which file (filename) a given web page / WARC record was written to, and we would like to forward this information to the index, e.g. an OpenSearch/Elasticsearch instance.
In the end, we would then know: https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz
Is this reasonable and technically possible? Presumably it would only work if the WARCHdfsBolt emitted the corresponding information to the StatusUpdaterBolt and were no longer a dead end.
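
To make the idea more concrete, here is a rough sketch in plain Java of the bookkeeping I have in mind. It is not based on the actual WARCHdfsBolt code: `WarcPathTracker`, its methods, and the size-based rotation rule are all hypothetical and only model a writer that rotates files and remembers which file each URL ended up in. The returned path is what a non-dead-end bolt could attach to the metadata it forwards to the StatusUpdaterBolt.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of StormCrawler or Storm): models a writer
 * that rotates its output file by size and remembers, for every URL, the
 * path of the file its record was written to.
 */
class WarcPathTracker {
    private final String prefix;   // assumed naming scheme: prefix + rotation index + ".warc.gz"
    private final long maxBytes;   // assumed rule: rotate once the current file reaches this size
    private long currentBytes = 0;
    private int rotation = 0;
    private final Map<String, String> urlToFile = new LinkedHashMap<>();

    WarcPathTracker(String prefix, long maxBytes) {
        this.prefix = prefix;
        this.maxBytes = maxBytes;
    }

    /** Path of the file that the next record will be written to. */
    String currentFile() {
        return prefix + rotation + ".warc.gz";
    }

    /**
     * Registers that a record of recordBytes bytes was written for url and
     * returns the path of the file that received it. The file is rotated
     * after the write if the size threshold has been reached.
     */
    String write(String url, long recordBytes) {
        String file = currentFile();
        urlToFile.put(url, file);
        currentBytes += recordBytes;
        if (currentBytes >= maxBytes) {
            rotation++;
            currentBytes = 0;
        }
        return file;
    }

    /** Looks up the file a URL was written to, or null if it is unknown. */
    String fileFor(String url) {
        return urlToFile.get(url);
    }
}
```

With a tracker created as `new WarcPathTracker("s3://path/to/file/WARC_file_", 100)`, each call to `write(url, size)` returns the path of the file that received the record, and that value is the `--is_stored_in-->` target from the example above. One open point in this sketch is timing: a path is known as soon as the record is written, but it might be safer to forward it only once the file has been rotated and closed.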