Add unique and deterministic id generation #1993

Open: oyilmaz-nvidia wants to merge 5 commits into NVIDIA-NeMo:main from oyilmaz-nvidia:onur/unique-deterministic-ids
@oyilmaz-nvidia
Contributor

Deterministic task identifiers via DAG-path lineage

Why

Task._uuid is generated by uuid.uuid4() on construction, so running the same pipeline twice on the same inputs produces a completely different set of IDs. That makes any kind of caching, checkpointing, or "resume from intermediate output" impossible — and it already affects real code: dedup stages name Parquet files after _uuid (minhash.py:309, buckets_to_edges.py:83, kmeans.py:235), and AddId prefixes document IDs with it (add_id.py:71).

What changed

Two new fields on Task, derived from the task's path through the pipeline DAG:

| Field | Meaning |
| --- | --- |
| _uuid | Unchanged, still uuid.uuid4(). Existing call sites keep working. |
| _lineage_path | Underscore-separated index path through the DAG. Propagated to children. E.g. "3_0_7" = 4th root, 1st child, 8th grandchild. |
| _udid | sha256(_lineage_path)[:32]: 32-char hex, deterministic across runs. Use this for cache keys / output filenames. |
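
As a rough sketch of the derivation described in the table, the following builds a lineage path and hashes it using only `hashlib`. The helper names `child_lineage_path` and `udid_from_lineage` are illustrative and do not come from the PR; the path-building line mirrors the `_set_lineage` body quoted in the review threads.

```python
import hashlib


def child_lineage_path(parent_paths: list[str], child_index: int) -> str:
    """Join non-empty parent paths and the child index with underscores."""
    parts = [*[p for p in parent_paths if p], str(child_index)]
    return "_".join(parts)


def udid_from_lineage(lineage_path: str) -> str:
    """First 32 hex characters of the SHA-256 digest of the path."""
    return hashlib.sha256(lineage_path.encode()).hexdigest()[:32]


root = child_lineage_path([], 3)             # "3": 4th root
child = child_lineage_path([root], 0)        # "3_0": its 1st child
grandchild = child_lineage_path([child], 7)  # "3_0_7": 8th grandchild

for path in (root, child, grandchild):
    print(path, udid_from_lineage(path))
```

Because the ID depends only on the path string, two runs that produce the same path produce the same ID.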

Lineage assignment happens at one chokepoint — the default ProcessingStage.process_batch calls a new module-level helper, nemo_curator.stages.base.assign_child_lineage(parent_paths, result), on the output of each process() call. Stages that override process_batch are responsible for calling the helper themselves (documented in the docstring); if they forget, outputs have empty _lineage_path/_udid — a loud, easy-to-catch failure rather than a silent mis-id.
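
A minimal sketch of what that override contract might look like in practice. `Task` and `assign_child_lineage` here are simplified stand-ins written for this example, not the real nemo_curator implementations, and `SplitWordsStage` is a hypothetical stage.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Task:
    """Stand-in for the real Task; only the lineage fields are modeled."""

    data: str = ""
    _lineage_path: str = field(init=False, default="")
    _udid: str = field(init=False, default="")

    def _set_lineage(self, parent_paths: list[str], child_index: int) -> bool:
        if self._udid:  # keep lineage that was already assigned
            return False
        parts = [*[p for p in parent_paths if p], str(child_index)]
        self._lineage_path = "_".join(parts)
        self._udid = hashlib.sha256(self._lineage_path.encode()).hexdigest()[:32]
        return True


def assign_child_lineage(parent_paths: list[str], result) -> list[Task]:
    """Simplified stand-in: normalise None/single/list, then assign lineage."""
    if result is None:
        return []
    children = result if isinstance(result, list) else [result]
    for i, child in enumerate(children):
        child._set_lineage(parent_paths, i)
    return children


class SplitWordsStage:
    """Hypothetical stage that overrides process_batch."""

    def process(self, task: Task) -> list[Task]:
        return [Task(data=word) for word in task.data.split()]

    def process_batch(self, tasks: list[Task]) -> list[Task]:
        outputs: list[Task] = []
        for task in tasks:
            raw = self.process(task)
            # The override calls the helper itself, once per parent task.
            outputs.extend(assign_child_lineage([task._lineage_path], raw))
        return outputs


root = Task(data="a b c")
root._set_lineage([], 3)
children = SplitWordsStage().process_batch([root])
print([c._lineage_path for c in children])  # ['3_0', '3_1', '3_2']
```

If the `assign_child_lineage` call were dropped from `process_batch`, the children would keep the empty defaults for both fields, which is the failure mode the description calls easy to catch.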

Examples

| _lineage_path | _udid |
| --- | --- |
| "3" (root task, 4th output of first stage) | 4e07408562bedb8b60ce05c1decfe3ad |
| "3_0" (its single child) | sha256("3_0")[:32] |
| "3_0_7" (further descendant) | sha256("3_0_7")[:32] |
| "0_0_1_0_2_0_0" (fan-in of 3 parents + idx 0) | sha256(...)[:32] |

Files

| File | Change |
| --- | --- |
| nemo_curator/tasks/tasks.py | Add _lineage_path and _udid fields + _set_lineage(parents, idx) helper on Task. _uuid untouched. |
| nemo_curator/stages/base.py | Add module-level assign_child_lineage(parent_paths, result) helper; default process_batch delegates to it; docstring updates the override contract. |
| tests/tasks/test_tasks.py | Added test_lineage_path_and_udid_format, test_fanout_udid_from_empty_root, test_udid_deterministic_across_runs. Original _uuid uniqueness test kept as-is. |
| tests/pipelines/test_pipelines.py | Added test_pipeline_udid_deterministic_across_runs and test_pipeline_udid_fanout_passthrough_fanin_passthrough (4-stage topology: fan-out → passthrough → fan-in → passthrough; exercises the multi-parent code path). |

Trade-offs documented in code

  • Hyperparameter changes do not invalidate _udid. Same DAG shape ⇒ same _udid. Clear the cache or version output dirs when changing stage config.
  • Stages overriding process_batch must call assign_child_lineage themselves. No magic safety net.
  • _uuid stays random. Anything reading _uuid today keeps the same per-run behaviour; new code that needs stability reads _udid.
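
One possible way to "version output dirs" is to put a fingerprint of the stage config into the directory name, so a config change lands in a fresh directory even though `_udid` stays the same. This is only a sketch: `config_version`, `output_path`, the `/cache/minhash` directory, and the `num_bands` parameter are all hypothetical and not part of the PR.

```python
import hashlib
import json
from pathlib import PurePosixPath


def config_version(stage_config: dict) -> str:
    """Short, stable fingerprint of a JSON-serialisable stage config."""
    canonical = json.dumps(stage_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:8]


def output_path(base_dir: str, stage_config: dict, udid: str) -> PurePosixPath:
    """Place each task's output under a directory named after the config."""
    return PurePosixPath(base_dir) / config_version(stage_config) / f"{udid}.parquet"


# Same DAG position, hence the same deterministic id in both cases.
udid = hashlib.sha256(b"3_0").hexdigest()[:32]
old = output_path("/cache/minhash", {"num_bands": 20}, udid)
new = output_path("/cache/minhash", {"num_bands": 25}, udid)
print(old)
print(new)
```

The file name is identical in both paths; only the directory component changes with the config.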

Test plan

  • New unit tests cover root tasks, single-parent fanout, multi-parent join, and end-to-end determinism across two independent runs of the same pipeline.
  • Original test_fanout_tasks_have_unique_uuid still passes (back-compat for _uuid confirmed).
  • Writer tests in tests/stages/text/io/writer/ untouched and still pass — their mock_uuid4.call_count assertions remain valid because _uuid is unchanged.

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia oyilmaz-nvidia requested a review from a team as a code owner May 16, 2026 07:00
@oyilmaz-nvidia oyilmaz-nvidia requested review from meatybobby and removed request for a team May 16, 2026 07:00
@copy-pr-bot

copy-pr-bot Bot commented May 16, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@oyilmaz-nvidia oyilmaz-nvidia requested review from VibhuJawa and praateekmahajan and removed request for meatybobby May 16, 2026 07:00
@greptile-apps
Contributor

greptile-apps Bot commented May 16, 2026

Greptile Summary

This PR introduces _lineage_path and _udid on Task to give every task a deterministic, DAG-position-based identifier that is stable across pipeline runs — addressing the inability to cache or checkpoint pipelines whose _uuid was previously random. The lineage is assigned at one choke point (assign_child_lineage in process_batch) and propagated to children as a _-separated index path hashed into a 32-char hex ID.

  • Task._set_lineage(parent_paths, idx) builds and stores _lineage_path and _udid; the if self._udid: return False guard preserves existing lineage for in-place stage returns.
  • assign_child_lineage in base.py normalises None/single/list returns and wires lineage assignment into the default process_batch; stages overriding process_batch must call it themselves.
  • tests/pipelines/test_pipelines.py imports assign_root_lineage from nemo_curator.stages.base, but that symbol is never defined — this ImportError breaks every test in the file at collection time.

Confidence Score: 2/5

Not safe to merge — the test file has a broken import that prevents all pipeline tests from running, and the root-lineage assignment is fragile under task reuse.

The test file imports assign_root_lineage from nemo_curator.stages.base, a function that does not exist anywhere in the codebase. This causes an ImportError at collection time, silently disabling every test in tests/pipelines/test_pipelines.py — including the pre-existing ones — making the test suite unable to guard against regressions in the new feature. Additionally, root lineage is assigned inline in Pipeline.run() with no re-entry protection, so any task object passed to run() a second time retains its first-run position and silently emits the wrong _udid for all downstream children.

tests/pipelines/test_pipelines.py (broken import) and nemo_curator/pipeline/pipeline.py (inline root-lineage loop that is neither exported nor guarded against reuse)

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/tasks/tasks.py | Adds _lineage_path, _udid fields and _set_lineage() helper to Task. The guard logic that prevents re-assignment (checking _udid truthiness) is correct for in-place stages, but becomes a silent hazard when the same task object is reused across pipeline runs. |
| nemo_curator/stages/base.py | Adds assign_child_lineage() helper and wires it into the default process_batch. The helper correctly normalizes None/single/list returns and skips tasks with existing _udid. Does not export assign_root_lineage, which tests depend on. |
| nemo_curator/pipeline/pipeline.py | Assigns root lineage inline in run() before handing off to the executor. The root-lineage logic is not exposed as a reusable helper, causing the test-file import of assign_root_lineage to fail. |
| tests/pipelines/test_pipelines.py | Imports the non-existent assign_root_lineage from nemo_curator.stages.base, causing an ImportError that prevents all tests in the file from running, including pre-existing ones. |
| tests/tasks/test_tasks.py | Adds unit tests for _set_lineage, idempotency, and determinism. Does not import assign_root_lineage, so these tests should work independently of the broken import in test_pipelines.py. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Pipeline
    participant Task
    participant ProcessingStage
    participant assign_child_lineage

    Caller->>Pipeline: run(initial_tasks)
    Pipeline->>Pipeline: build()
    loop each root task[i]
        Pipeline->>Task: _set_lineage([], i)
        Task-->>Pipeline: "_lineage_path="i", _udid=sha256("i")[:32]"
    end
    Pipeline->>Executor: execute(stages, initial_tasks)

    loop each stage
        Executor->>ProcessingStage: process_batch(tasks)
        loop each task
            ProcessingStage->>ProcessingStage: process(task) → result
            ProcessingStage->>assign_child_lineage: ([task._lineage_path], result)
            loop each child[i] where _udid is empty
                assign_child_lineage->>Task: _set_lineage(parent_paths, i)
                Task-->>assign_child_lineage: _lineage_path, _udid set
            end
            assign_child_lineage-->>ProcessingStage: [children with lineage]
        end
        ProcessingStage-->>Executor: results
    end
    Executor-->>Caller: final tasks

Reviews (5): Last reviewed commit: "Assign ids at for initial tasks"

Comment thread nemo_curator/tasks/tasks.py Outdated
Comment on lines +55 to +59
def _set_lineage(self, parent_lineage_paths: list[str], child_index: int) -> None:
    # DAG structure does. Clear cache directories when changing config.
    parts = [*[p for p in parent_lineage_paths if p], str(child_index)]
    self._lineage_path = "_".join(parts)
    self._udid = hashlib.sha256(self._lineage_path.encode()).hexdigest()[:32]
Contributor

P1 Lineage path collision: _ separator is ambiguous for multi-parent joins

_set_lineage joins all parent paths with _ and then appends the child index with _. Because _ is also the separator used within a parent's own path, the resulting string is not injective.

Concrete collision — both of these produce "3_0_7":

  • Single-parent chain: parent with path "3_0" at child index 7 → _set_lineage(["3_0"], 7) → parts = ["3_0", "7"] → "3_0_7"
  • Fan-in: two parents with paths ["3", "0"] at child index 7 → _set_lineage(["3", "0"], 7) → parts = ["3", "0", "7"] → "3_0_7"

The PR description's own example ("3_0_7" = 4th root, 1st child, 8th grandchild) is itself ambiguous by this encoding. Any pipeline that combines deep sequential chains with multi-parent fan-in stages is at risk of two structurally distinct tasks sharing the same _udid, directly undermining the caching and file-naming goals this PR is designed to serve. A fix would use an unambiguous encoding for the multi-parent case — for example, hashing the ordered list of parent paths with a null-byte separator before appending the child index, rather than flattening them all with _.
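
The collision can be reproduced in a few lines. `flat_path` below copies the joining logic quoted in this thread; `joined_path` is a hypothetical variant loosely following the suggestion above (hash the ordered parent list with a null-byte separator, then append the child index). It is a sketch of one possible encoding, not the fix adopted in the PR.

```python
import hashlib


def flat_path(parent_paths: list[str], child_index: int) -> str:
    """Encoding quoted in the review: everything flattened with '_'."""
    parts = [*[p for p in parent_paths if p], str(child_index)]
    return "_".join(parts)


def joined_path(parent_paths: list[str], child_index: int) -> str:
    """Hypothetical variant: collapse multiple parents into one hashed token."""
    parents = [p for p in parent_paths if p]
    if len(parents) > 1:
        # NUL cannot occur inside a path, so the ordered list is unambiguous.
        digest = hashlib.sha256("\x00".join(parents).encode()).hexdigest()[:16]
        parents = [f"h{digest}"]
    return "_".join([*parents, str(child_index)])


chain = flat_path(["3_0"], 7)
fan_in = flat_path(["3", "0"], 7)
print(chain, fan_in, chain == fan_in)  # both are "3_0_7"

print(joined_path(["3_0"], 7))     # single-parent chains keep the readable form
print(joined_path(["3", "0"], 7))  # fan-in gets a distinct, hashed prefix
```

In this variant the `h` prefix keeps hashed fan-in tokens apart from purely numeric index tokens, at the cost of a less readable path for joins.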

Comment thread nemo_curator/tasks/tasks.py Outdated
Comment on lines +55 to +57
def _set_lineage(self, parent_lineage_paths: list[str], child_index: int) -> None:
    # DAG structure does. Clear cache directories when changing config.
    parts = [*[p for p in parent_lineage_paths if p], str(child_index)]
Contributor

P2 The comment is clearly truncated — "DAG structure does." reads as the tail of a sentence, not the beginning. Based on the PR description the intended message is that hyperparameter changes don't change _udid, only structural (DAG-shape) changes do.

Suggested change
def _set_lineage(self, parent_lineage_paths: list[str], child_index: int) -> None:
# DAG structure does. Clear cache directories when changing config.
parts = [*[p for p in parent_lineage_paths if p], str(child_index)]
def _set_lineage(self, parent_lineage_paths: list[str], child_index: int) -> None:
# Hyperparameter changes do not invalidate `_udid` — only DAG structure does.
# Clear cache directories when changing stage config.
parts = [*[p for p in parent_lineage_paths if p], str(child_index)]

Comment thread nemo_curator/stages/base.py Outdated
Comment on lines +44 to +48
Each surviving ``children[i]`` gets ``_lineage_path`` and ``_uuid`` derived
from ``(parent_paths, i)`` so that the same pipeline run twice on the same
inputs produces byte-identical task IDs. Call this from any custom
``process_batch`` override to keep outputs consistent with the rest of the
pipeline.
Contributor

P2 The docstring mentions _uuid but the function never touches _uuid — it only sets _lineage_path and _udid. Referencing _uuid here will mislead implementers of custom process_batch overrides who are deciding which field to use for stable identifiers.

Suggested change
- Each surviving ``children[i]`` gets ``_lineage_path`` and ``_uuid`` derived
- from ``(parent_paths, i)`` so that the same pipeline run twice on the same
- inputs produces byte-identical task IDs. Call this from any custom
- ``process_batch`` override to keep outputs consistent with the rest of the
- pipeline.
+ Each surviving ``children[i]`` gets ``_lineage_path`` and ``_udid`` derived
+ from ``(parent_paths, i)`` so that the same pipeline run twice on the same
+ inputs produces byte-identical task IDs. Call this from any custom
+ ``process_batch`` override to keep outputs consistent with the rest of the
+ pipeline.

Comment thread nemo_curator/stages/base.py Outdated
outputs.extend(assign_child_lineage([task._lineage_path], raw))
return outputs

Outputs that skip this step will carry empty ``_uuid``/``_lineage_path``.
Contributor

P2 Same _uuid / _udid mix-up: _uuid is always a random uuid4 and is never empty; the field that will be empty when lineage is skipped is _udid.

Suggested change
- Outputs that skip this step will carry empty ``_uuid``/``_lineage_path``.
+ Outputs that skip this step will carry empty ``_udid``/``_lineage_path``.

@oyilmaz-nvidia
Contributor Author

/ok to test c1eb896

@oyilmaz-nvidia
Contributor Author

@VibhuJawa @praateekmahajan @sarahyurick Could you please review this PR when you get a chance?

Contributor

@VibhuJawa VibhuJawa left a comment

The logic looks sound to me, but I will really lean on @praateekmahajan because I am not familiar with this side of the codebase.

@oyilmaz-nvidia
Contributor Author

/claude review

results.extend(result)
else:
results.append(result)
results.extend(assign_child_lineage([task._lineage_path], result))
Contributor

Bug: _udid collisions when multiple root tasks enter a 1:1 first stage.

All root tasks start with _lineage_path = "". The _set_lineage helper filters out empty parent paths, so every 1:1 child gets _lineage_path = "0" regardless of which root task it came from:

# Two root tasks (both _lineage_path=""), 1:1 stage:
# root_a → assign_child_lineage([""], child_a) → child_a._lineage_path = "0"
# root_b → assign_child_lineage([""], child_b) → child_b._lineage_path = "0"  # collision!

The PR description's own example ("3_0_7" = "4th root task, …") implies root tasks should carry their batch index, but nothing assigns it. The executors pass initial_tasks straight into the first stage.

Fix: assign root lineage before the first stage runs, e.g. in Pipeline.run or in each executor:

for i, task in enumerate(initial_tasks):
    task._set_lineage([], i)

The tests only cover single-root-task scenarios, so this isn't caught today. A test with 2+ root tasks through a 1:1 stage would reproduce it immediately.
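
A small reproduction of the scenario described above, using only the path-joining logic from `_set_lineage`. The function names `child_path` and `run_one_to_one_stage` are made up for this sketch; paths are modeled as plain strings rather than real Task objects.

```python
def child_path(parent_paths: list[str], child_index: int) -> str:
    """Same joining rule as _set_lineage: empty parent paths are dropped."""
    parts = [*[p for p in parent_paths if p], str(child_index)]
    return "_".join(parts)


def run_one_to_one_stage(root_paths: list[str]) -> list[str]:
    """Each root yields exactly one child, so every child has index 0."""
    return [child_path([root], 0) for root in root_paths]


# Without root lineage, both roots carry "" and their children collide.
unassigned_roots = ["", ""]
print(run_one_to_one_stage(unassigned_roots))  # ['0', '0']

# With the proposed loop, roots get their batch index first.
assigned_roots = [child_path([], i) for i in range(2)]
print(assigned_roots)                          # ['0', '1']
print(run_one_to_one_stage(assigned_roots))    # ['0_0', '1_0']
```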

Contributor

I am also a bit tripped up about the counting part, but I think the concern about the collision is wrong. I don't see a collision happening because either:

  1. we start with a [EmptyTask] and so it produces e.g., a list of FileGroupTasks which will be handled correctly
  2. we start with something else, in which case it is assigned a root lineage in pipeline.py

But perhaps I am thinking too narrowly about the possibilities of scenario 1.

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
@oyilmaz-nvidia
Contributor Author

/ok to test f806a55

Contributor

@sarahyurick sarahyurick left a comment

Logic LGTM. Left a minor comment.

Comment thread nemo_curator/stages/base.py Outdated
@oyilmaz-nvidia
Contributor Author

@sarahyurick could you please approve? @praateekmahajan could you please review?

# It is propagated to children and hashed into `_udid`, the deterministic
# task id.
_lineage_path: str = field(init=False, default="")
_udid: str = field(init=False, default="")
Contributor

Instead of adding udid on top of uuid and task_id, I think we can just keep one of them and remove the others. I believe uuid is only used by add id, and I'm pretty sure it'll be unaffected too.

Contributor Author

Some of the stages use the uuid so I couldn't touch it. Task_id is an integer number and not sure if we should touch it.

Contributor

Can you share which ones? I can help decide if we can yank them. I would prefer not to add new _id fields. Similarly for task_id. I would want to avoid, as much as possible, creating new fields without deprecating similar old fields.

Contributor Author

Ok, will use task_id and won't add the _udid. Will update the code.

Comment thread nemo_curator/tasks/tasks.py Outdated
Co-authored-by: Praateek Mahajan <praateekmahajan@users.noreply.github.com>
Signed-off-by: Onur Yilmaz <35306097+oyilmaz-nvidia@users.noreply.github.com>
]


def _drive(pipeline: Pipeline, initial_tasks: list[Task]) -> list[Task]:
Contributor

I think this test might better live inside tests/backends/test_integration.py within backends, or you can create a new one.

Instead of having drive as the functionality, let's just run the pipeline the way users run it, without interfacing with this lower-level functionality, primarily because if the contract changes tomorrow, we don't want to be testing things with a lower-level utility.

def test_pipeline_udid_deterministic_across_runs():
    def run_once() -> tuple[list[str], list[str]]:
        pipeline = Pipeline(name="det", stages=[_Repeat(times=2), _Repeat(times=3)])
        pipeline.build()
Contributor

build() is a no-op so far; AI continues to consider it important. Same as above, let's test how users run our pipelines, i.e. Pipeline(...).run(), and then check the output tasks.


Pipeline topology:

Input ─▶ FanOut(3) ─▶ Passthrough ─▶ FanIn ─▶ Passthrough ─▶ Output
Contributor

super nit : How is ruff passing on these characters?
Can we keep it to Input -> FanOut(3) -> Passthrough -> FanIn...

Contributor Author

Not sure but it's passing :D

Comment on lines +250 to +254
# Drive stage-by-stage so we can inspect each intermediate set of tasks.
after_fanout = BaseStageAdapter(pipeline.stages[0]).process_batch([root])
after_passthrough_1 = BaseStageAdapter(pipeline.stages[1]).process_batch(after_fanout)
after_fanin = BaseStageAdapter(pipeline.stages[2]).process_batch(after_passthrough_1)
after_passthrough_2 = BaseStageAdapter(pipeline.stages[3]).process_batch(after_fanin)
Contributor

I think there is a bug here. Your FanIn stage, if you do pipeline.run(), isn't technically a fan-in.

Once you modify your tests as I shared above, you'll see that the output tasks wouldn't be what you expect.

For it to be a fan-in, you need to define the batch_size property on the stage. So that is a good reason to move the tests to follow a pattern similar to tests/backends/test_integration.py.
I could be wrong, but some of these fan-outs and fan-ins might already be there with some other properties being tested. So if we can reuse existing tests while covering the same functionality, I'd prefer that 100%.

return False
parts = [*[p for p in parent_lineage_paths if p], str(child_index)]
self._lineage_path = "_".join(parts)
self._udid = hashlib.sha256(self._lineage_path.encode()).hexdigest()[:32]
Contributor

Maybe move this to tasks/utils.py (or a separate staticmethod here in this class), and then also have a test in test_tasks.py that checks that we indeed do the hashing we expect and that we pass lineage_path to it.

Comment thread tests/tasks/test_tasks.py
Comment on lines +76 to +77
def _sha256_32(s: str) -> str:
return hashlib.sha256(s.encode()).hexdigest()[:32]
Contributor

Related to my comment on a staticmethod for the hashing.

Comment thread nemo_curator/pipeline/pipeline.py Outdated
results.extend(result)
else:
results.append(result)
results.extend(assign_child_lineage([task._lineage_path], result))
Contributor

Can we not have this functionality inside stages/base.py:process_batch?
Users often override it for batch-style stages, so they will be at risk of not calling assign_child_lineage.

I believe another place, nemo_curator/backends/base.py:process_batch, would work.

Contributor Author

@praateekmahajan Not sure how another process_batch would work here. That function is called by the executors and there is a contract between the user stages and executors.

This is the reason I have this assign_child_lineage as a function and if the user doesn't call it, the _udid won't be generated hence I can warn the user if the resumability is enabled later on.

Contributor Author

@oyilmaz-nvidia oyilmaz-nvidia May 19, 2026

Oh, I misunderstood it. The reason I couldn't do it is that base stage class returns the flattened list of tasks and I lose the parent child relationship.

So, I need an update in the ProcessingStage.process_batch and I don't think we'll be able to do it since many users have already been using it.

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
from nemo_curator.backends.base import BaseStageAdapter
from nemo_curator.pipeline.pipeline import Pipeline
- from nemo_curator.stages.base import ProcessingStage
+ from nemo_curator.stages.base import ProcessingStage, assign_child_lineage, assign_root_lineage
Contributor

P0 assign_root_lineage is imported but never defined

nemo_curator.stages.base exports only assign_child_lineage; assign_root_lineage does not exist anywhere in the codebase. This import fails at collection time with ImportError: cannot import name 'assign_root_lineage' from 'nemo_curator.stages.base', which prevents every test in this file (including the pre-existing ones) from running. The root-lineage assignment lives inline in Pipeline.run() via a bare loop over task._set_lineage([], i), but it is never extracted into an exported helper function that tests can call directly.

Comment on lines +215 to +218
if initial_tasks:
# Assign deterministic root-level lineage to initial pipeline tasks
for i, task in enumerate(initial_tasks):
task._set_lineage([], i)
Contributor

P1 Root lineage assignment only in run() — re-entrant pipelines silently corrupt lineage

pipeline.run() calls build() internally (line 187), which replaces self.stages with the decomposed execution stages. If a caller passes the same Task objects to run() a second time (e.g. for a retry or second pipeline sharing inputs), the if self._udid: return False guard in _set_lineage silently preserves the lineage from the first run. The tasks keep their first-run position even if their index in initial_tasks changed, silently producing wrong _udid values for all downstream children without any error.
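
The reuse hazard can be illustrated with a stand-in Task that has the same guard. `Task` here is a simplified model written for this example, and `assign_roots` is a hypothetical name for the inline loop in Pipeline.run() quoted above.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Task:
    """Stand-in for the real Task; only the lineage fields are modeled."""

    name: str = ""
    _lineage_path: str = field(init=False, default="")
    _udid: str = field(init=False, default="")

    def _set_lineage(self, parent_paths: list[str], child_index: int) -> bool:
        if self._udid:  # the guard discussed above
            return False
        parts = [*[p for p in parent_paths if p], str(child_index)]
        self._lineage_path = "_".join(parts)
        self._udid = hashlib.sha256(self._lineage_path.encode()).hexdigest()[:32]
        return True


def assign_roots(initial_tasks: list[Task]) -> None:
    """Mirror of the inline root-lineage loop."""
    for i, task in enumerate(initial_tasks):
        task._set_lineage([], i)


a, b = Task(name="a"), Task(name="b")

assign_roots([a, b])  # first run: a is at index 0, b at index 1
first = (a._lineage_path, b._lineage_path)

assign_roots([b, a])  # second run: order swapped, but the guard returns early
second = (b._lineage_path, a._lineage_path)

print(first)   # ('0', '1')
print(second)  # ('1', '0'), although b is now at index 0 and a at index 1
```

No exception is raised in the second call, which is why the stale positions go unnoticed.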
