fix: prevent temp file leak and race condition in save_formatted_audio#1905
Open
sajisanchu1913-source wants to merge 10 commits into
Open
Conversation
…dup and harm categories
… fix imports and ordering
- _RemoteDatasetLoader._fetch_zip_from_url:
- keyword-only args (source, inner_files, cache)
- streams download (requests stream=True + iter_content) to avoid
double-buffering large archives
- md5-keyed disk cache under DB_DATA_PATH / seed-prompt-entries when
cache=True; named temp file otherwise (cleaned up after parse)
- validates each inner_files extension against FILE_TYPE_HANDLERS;
raises ValueError with a member preview if an inner file is missing
- parses inner files via FILE_TYPE_HANDLERS and returns parsed dicts,
so the open ZipFile never escapes the worker thread
- adds the missing import zipfile that broke the previous commit
- _MICDataset:
- drops unused io / json / requests imports (helper handles them)
- delegates download + parse to the helper; only owns the seed
construction loop
- guards non-string Q values (in addition to NaN moral values)
- forwards cache from fetch_dataset_async to the helper
- factors authors into AUTHORS class constant
- Tests:
- test_moral_integrity_corpus_dataset.py: stops mocking requests.get
directly; patches _fetch_zip_from_url to return parsed dicts so
tests don't depend on the helper's internal shape
- adds test_fetch_dataset_non_string_q and
test_fetch_dataset_passes_cache_flag
- hoists imports into the right groups so ruff I001 stops firing
- removes trailing whitespace / extra newlines
- test_remote_dataset_loader.py: adds TestFetchZipFromUrl covering
happy path, on-disk caching (hits 1 network call across 2 fetches),
cache=False does not persist, missing inner file raises ValueError,
unsupported extension raises ValueError
Verified live against the real MIC.zip: 35,408 unique seeds across
all 6 moral foundations in ~2.4s cold / ~1.3s warm. All 559 dataset
unit tests pass; ruff clean.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use tempfile.NamedTemporaryFile instead of fixed temp_audio.wav to prevent concurrent call collisions - Wrap Azure upload in try/finally to ensure temp file is always deleted even when upload fails - Add regression test to verify cleanup on upload failure Fixes microsoft#1894
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
DataTypeSerializer.save_formatted_audiohad two bugs in the Azure storage branch:File leak:
os.remove(local_temp_path)was not in afinallyblock,so if the Azure upload raised an exception,
temp_audio.wavwas never deleted.Race condition: The temp file always used the fixed name
temp_audio.wav,so concurrent calls would clobber each other's WAV before upload.
Fix
temp_audio.wavwithtempfile.NamedTemporaryFileto giveeach call a unique path
try/finallyso the temp file is always deletedTests
no new
.wavfiles remain inDB_DATA_PATHFixes #1894