Sample Platform Initial Assessment
This issue documents a comprehensive reliability assessment of the Sample Platform and tracks progress on improvements.
Executive Summary
The Sample Platform has fundamental architectural issues causing:
- Test timeouts (4+ hours)
- Tests stuck in "queued" state indefinitely
- False "PR closed or updated" errors
- Silent failures leaving tests in limbo
The root cause is the lack of a proper task queue system combined with missing timeout handling and no error recovery mechanisms.
Technology Stack
| Component | Current | Issue |
|---|---|---|
| Backend | Python Flask | Synchronous, blocking request handlers |
| Task Queue | None | ❌ Critical gap - uses cron polling instead |
| Job Trigger | Cron jobs | Only runs every N minutes, can crash silently |
| VM Management | GCP Compute Engine | Blocking API calls with no timeouts |
| Database | MySQL/SQLAlchemy | No transaction management, no connection timeouts |
| GitHub Integration | PyGithub | API calls without timeouts |
Critical Issues
1. No Task Queue System
The platform processes tests via cron jobs instead of a proper task queue (e.g. Celery or RQ):

```
GitHub Webhook → DB Insert → Wait for Cron → Process Test
```
Problems:
- Tests accumulate while waiting for next cron cycle
- If cron crashes, no tests run until manually restarted
- No parallel processing capability
- No retry mechanism for failed tasks
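The webhook-to-cron gap above can be closed even without adopting Celery or RQ: a minimal in-process producer/consumer sketch, where the webhook handler pushes a job onto a queue and long-lived workers pick it up immediately. This is an illustration of the pattern, not a production design (a real deployment wants a broker-backed queue so jobs survive restarts).

```python
import queue
import threading

task_queue = queue.Queue()

def worker():
    # Pull jobs as soon as they arrive instead of waiting for a cron tick.
    while True:
        job = task_queue.get()
        if job is None:            # sentinel: shut this worker down
            task_queue.task_done()
            return
        try:
            job()                  # e.g. the start_test(...) logic
        except Exception:
            # A real system would retry or mark the test failed; the point
            # is that one bad job never kills the consumer.
            pass
        task_queue.task_done()

def start_workers(n=2):
    threads = [threading.Thread(target=worker, daemon=True) for _ in range(n)]
    for t in threads:
        t.start()
    return threads
```

The webhook handler then does `task_queue.put(lambda: start_test(...))` and returns immediately.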
2. Infinite Polling Loop (No Timeout)
File: `mod_ci/controllers.py` lines 563-591

```python
def wait_for_operation(compute, project, zone, operation):
    while True:  # NO TIMEOUT OR MAX ITERATIONS!
        result = compute.zoneOperations().get(
            project=project, zone=zone, operation=operation
        ).execute()
        if result['status'] == 'DONE':
            return result
        time.sleep(1)  # Loops forever if GCP operation stalls
```
Impact: A single stuck GCP operation blocks the entire cron job for hours.
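A bounded version of the loop above is a small change: poll against a wall-clock deadline and raise if the operation never completes. This is a sketch, not the platform's code; the `timeout` and `poll_interval` values are illustrative, and `compute` is assumed to expose the same `zoneOperations().get(...).execute()` chain used today.

```python
import time

def wait_for_operation(compute, project, zone, operation,
                       timeout=300, poll_interval=1):
    """Poll a GCP zone operation, but give up after `timeout` seconds.

    Sketch only: raises TimeoutError instead of looping forever, so a
    stalled operation fails one test rather than blocking the whole cron job.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = compute.zoneOperations().get(
            project=project, zone=zone, operation=operation
        ).execute()
        if result['status'] == 'DONE':
            return result
        time.sleep(poll_interval)
    raise TimeoutError(f"Operation {operation!r} not DONE after {timeout}s")
```

The caller can then catch `TimeoutError`, mark the test failed, and move on to the next pending test.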
3. Missing Timeouts on External API Calls
File: `mod_ci/controllers.py`

| Line | Call | Timeout |
|---|---|---|
| 239 | `repository.get_pull(test.pr_nr)` | ❌ None |
| 244 | `repository.get_commit(test.commit)` | ❌ None |
| 386 | `repository.get_artifacts()` | ❌ None |
| 396 | `requests.get(artifact_url)` | ❌ None |
| 1669 | `requests.get(api_url)` | ✅ 10 seconds |
Only 1 out of 5+ external calls has a timeout. Any network issue causes indefinite hangs.
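For the `requests` calls the fix is a one-liner (`requests.get(url, timeout=10)`), and PyGithub accepts a `timeout` argument when constructing the `Github` client. For any remaining library call that offers no timeout parameter, a generic deadline wrapper is one option; this is a sketch using a thread pool, with the caveat noted in the docstring.

```python
import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def call_with_deadline(fn, *args, timeout=10, **kwargs):
    """Run fn in a worker thread and raise if it exceeds the deadline.

    Caveat: a truly hung call still occupies a worker thread (Python
    cannot kill it), but the caller is no longer blocked indefinitely.
    """
    future = _pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; only works if the call hasn't started
        raise
```

Usage would be, e.g., `call_with_deadline(repository.get_pull, test.pr_nr, timeout=15)`.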
4. Silent Failures
File: `mod_ci/controllers.py` lines 398-413

```python
except Exception as e:
    log.critical("Could not fetch artifact, request timed out")
    return  # Silent return - test stays stuck in "preparation" forever!
if r.status_code != 200:
    log.critical(f"Could not fetch artifact, response code: {r.status_code}")
    return  # Silent return - no status update, no retry
```
Impact: Test remains in "preparation" state indefinitely. User sees test as "running" but nothing happens.
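The corrected pattern is that every early return also updates the test's status. The sketch below is illustrative, not the platform's code: `fetch` stands in for the `requests.get` call, `test` for the ORM row, and `log` for the logger.

```python
def fetch_artifact(fetch, test, log):
    """Fetch an artifact, marking the test failed on any error (sketch).

    The key change vs. the code above: no code path leaves the test
    stuck in "preparation" - failures are recorded on the test itself.
    """
    try:
        r = fetch()
    except Exception as e:
        test.status = 'failed'  # surface the failure instead of a silent return
        log(f"Could not fetch artifact: {e}")
        return None
    if r['status_code'] != 200:
        test.status = 'failed'
        log(f"Could not fetch artifact, response code: {r['status_code']}")
        return None
    return r['content']
```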
5. Race Condition: "PR closed or updated"
File: `mod_ci/controllers.py` lines 238-251

```python
test_pr = repository.get_pull(test.pr_nr)
if test.commit != test_pr.head.sha:
    deschedule_test(gh_commit, message="PR closed or updated", test=test)
    continue
```
Scenario:
1. User opens PR with commit `abc123`
2. Test queued with `commit = abc123`
3. User pushes fix (new commit `def456`)
4. Cron runs 5 minutes later
5. Check fails: `abc123 != def456`
6. Test cancelled with "PR closed or updated"
User experience: "I just opened a PR and it immediately shows an error"
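One way to remove the false error is to treat a superseded commit as normal churn rather than a failure. The decision helper below is hypothetical (`action_for_pending_test` does not exist in the codebase); it sketches a policy of re-queueing on the new head instead of surfacing "PR closed or updated".

```python
def action_for_pending_test(test_commit, pr_head_sha, pr_is_open):
    """Decide what to do with a queued test (hypothetical helper).

    A pushed fix is not an error: follow the new head instead of
    cancelling with a message the user experiences as a failure.
    """
    if not pr_is_open:
        return ('cancel', None)           # PR really is gone
    if test_commit != pr_head_sha:
        return ('requeue', pr_head_sha)   # superseded: re-queue on new head
    return ('run', test_commit)
```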
6. No Concurrency Control
File: `mod_ci/controllers.py` lines 225-252

```python
pending_tests = Test.query.filter(
    Test.id.notin_(finished_tests),
    Test.id.notin_(running_tests),
    Test.platform == platform
)
for test in pending_tests:
    start_test(...)  # No lock! Two cron processes can start the same test
```
Impact: Duplicate VM creation, wasted resources, inconsistent results.
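A guarded UPDATE is the simplest fix: the database atomically lets exactly one process flip a test from "queued" to "running", and the loser sees a rowcount of 0 and skips the test. Sketched here against sqlite for self-containment; the same pattern works on MySQL (or use `SELECT ... FOR UPDATE` via SQLAlchemy's `with_for_update()`). The `tests` table schema is illustrative.

```python
import sqlite3

def claim_test(conn, test_id):
    """Atomically move a test from 'queued' to 'running'.

    Returns True only for the one process whose UPDATE matched the row;
    a concurrent second claimer matches nothing and gets False.
    """
    cur = conn.execute(
        "UPDATE tests SET status = 'running' "
        "WHERE id = ? AND status = 'queued'",
        (test_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```

The cron loop would then call `start_test(...)` only when `claim_test(...)` returns True.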
7. No Database Transaction Management
Multiple places call `g.db.commit()` without:
- Try-except blocks
- Rollback on error
- Connection timeout configuration
Example locations: lines 182-183, 735-736, and many others.
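A small context manager centralizes the commit/rollback discipline so each call site stops repeating it. This sketch assumes a SQLAlchemy-style session exposing `commit()` and `rollback()`; usage would be `with transaction(g.db): ...`.

```python
from contextlib import contextmanager

@contextmanager
def transaction(session):
    """Commit on success, roll back on any failure, then re-raise."""
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise
```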
8. Blocking Operations in Request Handlers
File: `mod_ci/controllers.py` lines 1149-1195 (`progress_reporter` endpoint)
The progress reporting endpoint performs synchronous database operations and file I/O, which can cause request timeouts when the VM reports progress.
9. Resource Leaks
File: `mod_ci/controllers.py` line 404

```python
open(os.path.join(base_folder, 'ccextractor.zip'), 'wb').write(r.content)
```
- File handle not explicitly closed
- Entire artifact loaded into memory (no streaming)
- Failed downloads leave temp files behind
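A streaming replacement fixes both leaks: the `with` block guarantees the handle is closed even if a write fails, and only one chunk is in memory at a time. Sketched with a generic file-like source; with requests you would pass `r.raw` from `requests.get(url, stream=True, timeout=30)`. The helper name and chunk size are illustrative.

```python
import shutil

def save_artifact(src, dest_path, chunk_size=64 * 1024):
    """Stream a file-like source to disk in fixed-size chunks (sketch)."""
    with open(dest_path, 'wb') as out:
        shutil.copyfileobj(src, out, chunk_size)
```

A cleanup step (deleting `dest_path` when an exception escapes) would also address the leftover temp files from failed downloads.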
Symptom-to-Cause Mapping
| Symptom | Root Cause |
|---|---|
| 4+ hour timeouts | `wait_for_operation()` infinite loop; GitHub/GCP API hangs without timeout |
| Tests stuck "queued" | Cron crashes on exception; silent failures in `start_test()`; no state recovery |
| "PR closed or updated" errors | Race condition between PR update and cron execution |
| Generic errors | Silent failures with no retry logic; exceptions logged but test status not updated |
| Inconsistent results | No concurrency control; duplicate test execution possible |
Recommended Plan of Action
Phase 1: Critical Fixes (Quick Wins)
- Add timeouts to all external API calls (GitHub, GCP, requests)
- Add timeout/max iterations to `wait_for_operation()`
- Update test status on failure: change silent `return` statements to mark the test as "failed"
- Add basic locking

Phase 2: Improved Error Handling
- Implement retry logic with exponential backoff
- Add database transaction management
- Fix the race condition for PR updates

Phase 3: Architectural Improvements
- Implement a proper task queue
- Add a test state machine
- Add health monitoring

Phase 4: Infrastructure
- Stream artifact downloads
- Add a cleanup job
- Improve logging
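Retry with exponential backoff (part of improved error handling) is small enough to sketch directly; the attempt count and delays below are placeholder defaults, and adding jitter avoids synchronized retry storms.

```python
import random
import time

def retry(fn, attempts=4, base_delay=0.5, max_delay=30.0):
    """Call fn, retrying on any exception with exponential backoff + jitter (sketch)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay + random.uniform(0, delay / 2))
```

A production version would retry only on transient errors (timeouts, 5xx responses) rather than on every exception.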
Progress Tracking
This section will be updated as PRs are submitted.
| PR | Description | Status |
|---|---|---|
| - | - | - |
References
Assessment conducted: 2025-12-22