UN-2972 [FEAT] Implement API deployment rate limiting with Django cache and per-org locks#1649
Conversation
… per-org locks This commit implements a comprehensive rate limiting system for API deployments with atomic operations, caching, and management commands. Features: - Organization-level and global concurrent request limits - Per-organization Redis distributed locks to prevent race conditions - Django cache integration for org limits (~95% reduction in DB queries) - Automatic cache invalidation on limit updates - Centralized constants to eliminate magic strings - Management commands for easy administration Technical Details: - Uses Redis ZSET for tracking active executions with TTL-based cleanup - Per-org locks ensure atomic check-and-acquire operations - Global limit enforced with eventual consistency (acceptable tolerance) - Rate limiter fails open on Redis errors for system reliability - Retry-After header in 429 responses for proper client backoff New Files: - backend/api_v2/rate_limit_constants.py: Centralized keys and constants - backend/api_v2/rate_limiter.py: Core rate limiting logic - backend/api_v2/migrations/0003_add_organization_rate_limit.py: DB migration - backend/api_v2/management/commands/set_org_rate_limit.py: Set org limits - backend/api_v2/management/commands/get_org_rate_limit.py: View org limits - backend/api_v2/management/commands/list_org_rate_limits.py: List all limits Modified Files: - backend/api_v2/models.py: OrganizationRateLimit model with auto cache clearing - backend/api_v2/deployment_helper.py: Use atomic check_and_acquire - backend/api_v2/admin.py: Admin interface for OrganizationRateLimit - backend/api_v2/exceptions.py: RateLimitExceeded exception - backend/backend/exceptions.py: Add Retry-After header to 429 responses - backend/workflow_manager/workflow_v2/models/execution.py: Auto-release on completion - backend/backend/settings/base.py: Rate limiting configuration - backend/sample.env: Configuration documentation Configuration (environment variables): - API_DEPLOYMENT_DEFAULT_RATE_LIMIT: Default per-org limit (default: 5) - API_DEPLOYMENT_GLOBAL_RATE_LIMIT: System-wide limit (default: 50) - API_DEPLOYMENT_RATE_LIMIT_TTL_HOURS: ZSET entry TTL (default: 6 hours) - API_DEPLOYMENT_RATE_LIMIT_CACHE_TTL: Cache TTL (default: 1 hour) - API_DEPLOYMENT_RATE_LIMIT_LOCK_TIMEOUT: Lock timeout (default: 2 seconds) - API_DEPLOYMENT_RATE_LIMIT_LOCK_BLOCKING_TIMEOUT: Lock wait (default: 5 seconds) - API_DEPLOYMENT_RATE_LIMIT_RETRY_AFTER: Retry-After value (default: 300 seconds) Management Commands: - python manage.py set_org_rate_limit <org-id-or-name> <limit> - python manage.py get_org_rate_limit <org-id-or-name> [--clear-cache] - python manage.py list_org_rate_limits [--with-usage] Performance Impact: - DB queries for org limits: ~95% reduction (cached) - Rate limit check latency: 10-20ms (acceptable tradeoff for no race conditions) - Lock contention: Minimal (per-org, not global) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Summary by CodeRabbit
WalkthroughAdds Redis-backed atomic rate limiting for API deployments with per-organization and global concurrent limits; introduces OrganizationRateLimit model, migration and admin; new rate limiter, constants, exception, view/workflow integration with execution_id handling and slot release, management commands, settings, and documentation. (50 words) Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant View as API Deployment View
participant Limiter as APIDeploymentRateLimiter
participant ExecHelper as DeploymentHelper.execute_workflow
participant Workflow as WorkflowExecution
participant Redis as Redis/ZSET
participant Cache as DjangoCache
Client->>View: POST /deploy
View->>View: generate execution_id (UUID)
rect rgb(220,240,255)
Note over View,Limiter: Atomic check-and-acquire
View->>Limiter: check_and_acquire(org, execution_id)
Limiter->>Cache: read org limit cache
Limiter->>Redis: acquire org lock
Redis-->>Limiter: lock acquired
Limiter->>Redis: cleanup expired ZSET entries
Limiter->>Redis: read org/global counts
alt below limits
Limiter->>Redis: add execution_id to org & global ZSETs
Limiter->>Redis: release org lock
Limiter-->>View: acquired (usage, limit)
else limit exceeded
Limiter->>Redis: release org lock
Limiter-->>View: denied (usage, limit) -> RateLimitExceeded
end
end
rect rgb(240,255,240)
Note over View,ExecHelper: create execution with execution_id
View->>ExecHelper: execute_workflow(..., execution_id)
ExecHelper->>Workflow: create_workflow_execution(..., execution_id)
ExecHelper-->>View: response (accepted)
end
rect rgb(255,240,240)
Note over Workflow,Limiter: On final state, release slot
Workflow->>Limiter: release_slot(org_id, execution_id)
Limiter->>Redis: remove execution_id from org/global ZSETs
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas to focus review on:
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
for more information, see https://pre-commit.ci
Fix Ruff linting errors by moving Django signal imports to the top of the file according to PEP 8 conventions. Changes: - Moved `from django.db.models.signals import post_delete` to line 7 - Moved `from django.dispatch import receiver` to line 8 - Removed duplicate imports from line 268-269 (bottom of file) - Signal handler function remains at bottom unchanged Fixes Ruff errors: - E402: Module level import not at top of file (line 266) - E402: Module level import not at top of file (line 267) Also installed pre-commit hooks locally for future commits. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
…stack/unstract into UN-2972-api-deployment-rate-limiting
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
backend/api_v2/management/commands/set_org_rate_limit.py (1)
41-53: Idempotent operation with clear feedback.The use of
update_or_createensures the command is safe to run multiple times. Cache invalidation happens automatically via the model's save method.To silence the static analysis warning about the unused variable, prefix it with an underscore:
- org_rate_limit, created = OrganizationRateLimit.objects.update_or_create( + _org_rate_limit, created = OrganizationRateLimit.objects.update_or_create( organization=organization, defaults={"concurrent_request_limit": limit} )backend/workflow_manager/workflow_v2/models/execution.py (1)
323-342: Defensive rate limit release with room for improvement.The method safely releases rate limit slots with proper existence checks. The broad exception catch is appropriate for defensive programming here.
Use
logging.exceptionto automatically include the traceback for easier debugging:except Exception as e: # Log but don't fail the execution update for rate limit release errors - logger.error( + logger.exception( f"Failed to release rate limit slot for execution {self.id}: {e}" )backend/api_v2/management/commands/list_org_rate_limits.py (1)
36-44: Consider usinglogging.exceptionfor better debugging.The error handling correctly allows the command to continue when usage fetch fails. For easier debugging, consider using
logger.exceptioninstead ofself.stdout.writeto capture the full traceback.try: usage = APIDeploymentRateLimiter.get_current_usage(org) pct = (usage["org_count"] / limit * 100) if limit > 0 else 0 self.stdout.write( f' Usage: {usage["org_count"]}/{limit} ({pct:.1f}%)' ) except Exception as e: + import logging + logger = logging.getLogger(__name__) + logger.exception(f"Failed to fetch usage for org {org.organization_id}") self.stdout.write(f" Usage: Error - {e}")backend/api_v2/management/commands/get_org_rate_limit.py (1)
26-35: Use explicit exception chaining for cleaner error handling.When raising a new exception inside an exception handler, use
from None(to suppress the original exception) orfrom err(to show the chain) per PEP 409.except (ValueError, Organization.DoesNotExist): try: organization = Organization.objects.get(name=org_id) except Organization.DoesNotExist: - raise CommandError(f'Organization "{org_id}" not found') + raise CommandError(f'Organization "{org_id}" not found') from Nonebackend/api_v2/rate_limiter.py (2)
140-295: Solid atomic implementation with good fail-open behavior.The per-org locking strategy correctly prevents race conditions within organizations, and the fail-open approach ensures service resilience. The detailed comment about future global lock enhancement (lines 150-176) is helpful for maintainers.
Consider using
logger.exception()in exception handlers to capture full tracebacks for easier debugging:except Exception as e: - logger.error(f"Error in rate limit check for org {org_id}: {e}") + logger.exception(f"Error in rate limit check for org {org_id}") # Fail open: allow request on errors return True, None
103-137: Consider adding error handling for consistency.While the docstring mentions "Raises: Exception: If Redis operations fail," other methods in this class use fail-open error handling. For consistency and resilience, consider wrapping this in try-except to handle Redis failures gracefully.
@classmethod def get_current_usage(cls, organization: Organization) -> dict: """Get current usage statistics for an organization. Returns: dict: { 'org_count': int - current org-level concurrent requests, 'global_count': int - current system-wide concurrent requests, 'org_limit': int - org-level limit, 'global_limit': int - global limit } - Raises: - Exception: If Redis operations fail """ org_key = cls._get_org_key(str(organization.organization_id)) + try: # Cleanup expired entries before counting cls._cleanup_expired_entries(org_key) cls._cleanup_expired_entries(RateLimitKeys.GLOBAL_EXECUTIONS_KEY) org_count = redis_cache.zcard(org_key) global_count = redis_cache.zcard(RateLimitKeys.GLOBAL_EXECUTIONS_KEY) org_limit = cls._get_org_limit(organization) global_limit = getattr( settings, "API_DEPLOYMENT_GLOBAL_RATE_LIMIT", RateLimitDefaults.DEFAULT_GLOBAL_LIMIT, ) return { "org_count": org_count, "global_count": global_count, "org_limit": org_limit, "global_limit": global_limit, } + except Exception as e: + logger.exception(f"Error fetching usage for org {organization.organization_id}") + # Return safe defaults on error + return { + "org_count": 0, + "global_count": 0, + "org_limit": cls._get_org_limit(organization), + "global_limit": getattr(settings, "API_DEPLOYMENT_GLOBAL_RATE_LIMIT", RateLimitDefaults.DEFAULT_GLOBAL_LIMIT), + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (14)
backend/api_v2/admin.py(1 hunks)backend/api_v2/deployment_helper.py(4 hunks)backend/api_v2/exceptions.py(1 hunks)backend/api_v2/management/commands/get_org_rate_limit.py(1 hunks)backend/api_v2/management/commands/list_org_rate_limits.py(1 hunks)backend/api_v2/management/commands/set_org_rate_limit.py(1 hunks)backend/api_v2/migrations/0003_add_organization_rate_limit.py(1 hunks)backend/api_v2/models.py(4 hunks)backend/api_v2/rate_limit_constants.py(1 hunks)backend/api_v2/rate_limiter.py(1 hunks)backend/backend/exceptions.py(1 hunks)backend/backend/settings/base.py(1 hunks)backend/sample.env(1 hunks)backend/workflow_manager/workflow_v2/models/execution.py(3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
backend/workflow_manager/workflow_v2/models/execution.py
338-338: Do not catch blind exception: Exception
(BLE001)
340-342: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
backend/api_v2/management/commands/set_org_rate_limit.py
18-18: Unused method argument: args
(ARG002)
24-24: Avoid specifying long messages outside the exception class
(TRY003)
37-39: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
37-39: Avoid specifying long messages outside the exception class
(TRY003)
43-43: Unpacked variable org_rate_limit is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
73-73: Do not catch blind exception: Exception
(BLE001)
backend/api_v2/admin.py
8-13: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
14-14: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
15-15: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
16-16: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
backend/api_v2/management/commands/list_org_rate_limits.py
16-16: Unused method argument: args
(ARG002)
43-43: Do not catch blind exception: Exception
(BLE001)
backend/api_v2/management/commands/get_org_rate_limit.py
21-21: Unused method argument: args
(ARG002)
35-35: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
35-35: Avoid specifying long messages outside the exception class
(TRY003)
98-98: Do not catch blind exception: Exception
(BLE001)
backend/api_v2/models.py
187-192: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
269-269: Unused function argument: sender
(ARG001)
269-269: Unused function argument: kwargs
(ARG001)
backend/api_v2/rate_limiter.py
282-282: Consider moving this statement to an else block
(TRY300)
284-284: Do not catch blind exception: Exception
(BLE001)
285-285: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
292-292: Do not catch blind exception: Exception
(BLE001)
338-338: Consider moving this statement to an else block
(TRY300)
339-339: Do not catch blind exception: Exception
(BLE001)
341-344: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
385-385: Consider moving this statement to an else block
(TRY300)
386-386: Do not catch blind exception: Exception
(BLE001)
387-387: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
410-410: Do not catch blind exception: Exception
(BLE001)
411-411: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
backend/api_v2/migrations/0003_add_organization_rate_limit.py
10-13: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
15-61: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (23)
backend/backend/exceptions.py (1)
42-46: LGTM! Clean rate limit header integration.The Retry-After header is correctly set from the exception's
retry_after_secondsattribute. The import placement inside the function body is appropriate for avoiding potential circular import issues.backend/sample.env (1)
113-127: Well-documented configuration with sensible defaults.The rate limiting configuration is comprehensive and the default values are reasonable:
- Per-org limit (5) is conservative for initial rollout
- Global limit (50) provides system-wide protection
- TTL values balance tracking accuracy vs resource usage
- Lock timeouts prevent deadlocks while allowing reasonable acquisition time
backend/api_v2/deployment_helper.py (3)
2-2: LGTM! Required imports for rate limiting.The new imports support the rate limiting functionality correctly.
Also applies to: 34-34, 38-38
185-203: Atomic rate limit check with clear error handling.The upfront execution ID generation and atomic
check_and_acquirecall prevent race conditions. The exception includes all necessary context (usage, limit, retry delay, type) for proper client handling.
215-220: The execution ID flow is correct and necessary.The
create_workflow_executionfunction uses the passedexecution_idas the primary key when creating a newWorkflowExecutionobject. Line 220's reassignment is not redundant—it's necessary because the function can return an existing execution (line 139-140 of execution.py) rather than always creating a new one. This pattern correctly handles both scenarios: new creation and existing execution retrieval. The code is sound.backend/api_v2/exceptions.py (1)
61-91: Well-designed rate limit exception with clear messaging.The exception properly uses HTTP 429 status and provides comprehensive context. The conditional message construction distinguishes between organization-level and global limits, making it clear to API consumers which limit they've hit.
backend/api_v2/management/commands/set_org_rate_limit.py (3)
10-16: LGTM! Clear argument definitions.The command accepts both UUID and organization name for flexibility, making it user-friendly for administrators.
18-39: Robust organization resolution with good UX.The command tries UUID first, then falls back to organization name, making it flexible for administrators. The validation logic is sound.
55-74: Helpful usage reporting with appropriate error handling.Showing current usage after setting the limit and warning when it exceeds the new limit provides valuable feedback. The broad exception catch is acceptable here since displaying usage is non-critical and the command should succeed even if the usage check fails.
backend/workflow_manager/workflow_v2/models/execution.py (1)
286-321: Clean integration of rate limit release.The flag-based approach ensures rate limit slots are released exactly once when executions reach final states. Releasing after
save()ensures the state is persisted before freeing the slot, and thepipeline_idcheck correctly limits this to API deployment executions.backend/api_v2/admin.py (1)
3-34: Well-structured admin interface with good UX.The admin configuration provides comprehensive management capabilities with searchable organizations and appropriate field groupings. The static analysis warnings about
ClassVarare false positives for Django admin classes where these attributes are part of the framework's metaclass-based configuration.backend/api_v2/models.py (4)
155-192: Clean model design with proper constraints.The model structure follows Django best practices with appropriate constraints, defaults, and metadata. The unique constraint ensures one rate limit per organization.
167-181: Automatic cache invalidation ensures consistency.The save hook automatically clears cached rate limits, preventing stale data. Local imports avoid circular dependencies, and using the centralized key construction ensures consistency across the codebase.
267-278: Proper cache cleanup on deletion.The signal handler ensures cache consistency when rate limit records are deleted. The
senderandkwargsparameters are required by Django's signal framework and cannot be removed despite being unused.
267-278: No duplicate signal handlers found. Review comment is incorrect.Verification shows only one
@receiver(post_delete, sender=OrganizationRateLimit)handler in the file. The cache-clearing logic at line 179 is part of a model method (_clear_cache()), not a duplicate signal handler. These serve different purposes: the model method clears cache on save, while the signal handler clears cache on delete.Likely an incorrect or invalid review comment.
backend/api_v2/management/commands/list_org_rate_limits.py (1)
9-14: LGTM!Good UX design warning users that
--with-usagewill be slower due to Redis queries.backend/api_v2/management/commands/get_org_rate_limit.py (2)
40-63: LGTM!Cache management logic is clear and provides good feedback to operators.
78-96: Nice UX with color-coded capacity warnings.The threshold-based styling (90% error, 70% warning) provides clear visual feedback to operators.
backend/backend/settings/base.py (1)
111-136: LGTM!The rate limiting configuration is well-structured with sensible defaults and clear documentation. The naming convention is consistent and the values are appropriate for production use.
backend/api_v2/rate_limit_constants.py (1)
1-120: Excellent centralization of constants.This module provides a clean separation of Redis key patterns, default values, and user-facing messages. The consistent naming conventions and comprehensive documentation make it easy to maintain and extend.
backend/api_v2/rate_limiter.py (3)
50-89: LGTM!The cache-aside pattern is correctly implemented with proper fallback handling. This will significantly reduce database load as documented in the PR objectives (≈95% DB query reduction).
297-345: Deprecation clearly documented.Good practice keeping this method for backward compatibility while clearly documenting the race condition and recommending the atomic
check_and_acquire()alternative.
391-411: LGTM!The atomic removal from both ZSETs using a pipeline is correct. Error handling appropriately logs failures without disrupting the caller.
The Organization.organization_id field is a CharField (not UUIDField), storing custom string IDs like "org_qijtoAkJNhznYhNt". The previous implementation attempted UUID parsing which failed for these values. Changes: - Remove UUID validation that assumes organization_id is a UUID - Try organization_id lookup first (handles both UUID and string formats) - Fall back to name lookup if organization_id lookup fails - Update help text to remove "UUID" reference This allows commands to work with both: - Organization ID: python manage.py set_org_rate_limit org_qijtoAkJNhznYhNt 10 - Organization name: python manage.py set_org_rate_limit zipstack 10 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
backend/api_v2/management/commands/set_org_rate_limit.py (1)
52-72: Consider catching more specific exceptions.Catching the generic
Exceptionclass is overly broad. Consider catching specific exceptions that might occur duringget_current_usage()(e.g., Redis connection errors, rate limiter errors) or document why broad exception handling is acceptable here.For example:
except Exception as e: - self.stdout.write(self.style.WARNING(f"Could not fetch current usage: {e}")) + # Gracefully handle any errors fetching usage from Redis/rate limiter + self.stdout.write( + self.style.WARNING(f"Could not fetch current usage: {e}") + )Or if specific exceptions are known:
except (RedisError, ConnectionError) as e: self.stdout.write( self.style.WARNING(f"Could not fetch current usage: {e}") )backend/api_v2/management/commands/get_org_rate_limit.py (1)
61-96: Consider moving import to module level.The usage display logic is sound with proper division-by-zero protection (lines 77-79) and helpful color-coded status output. The broad exception catch on line 95 is acceptable for a management command to provide user-friendly error messages.
However, the import on line 62 is placed inside the method, which is unconventional. Unless this is done to avoid circular imports, consider moving it to the top of the file for consistency with Python conventions.
Apply this diff to move the import:
+from api_v2.rate_limiter import APIDeploymentRateLimiter from account_v2.models import Organization from django.conf import settings from django.core.cache import cacheThen remove it from line 62:
# Get current usage - from api_v2.rate_limiter import APIDeploymentRateLimiter - try:
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
backend/api_v2/management/commands/get_org_rate_limit.py(1 hunks)backend/api_v2/management/commands/set_org_rate_limit.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
backend/api_v2/management/commands/get_org_rate_limit.py
21-21: Unused method argument: args
(ARG002)
32-32: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
32-32: Avoid specifying long messages outside the exception class
(TRY003)
95-95: Do not catch blind exception: Exception
(BLE001)
backend/api_v2/management/commands/set_org_rate_limit.py
18-18: Unused method argument: args
(ARG002)
24-24: Avoid specifying long messages outside the exception class
(TRY003)
34-36: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
34-36: Avoid specifying long messages outside the exception class
(TRY003)
40-40: Unpacked variable org_rate_limit is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
70-70: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (6)
backend/api_v2/management/commands/set_org_rate_limit.py (3)
1-5: LGTM!The imports are clean and well-organized.
7-16: LGTM!The command structure and argument definitions follow Django conventions correctly.
18-24: LGTM!The limit validation is appropriate and the error message is clear for CLI users.
backend/api_v2/management/commands/get_org_rate_limit.py (3)
1-11: LGTM!The imports and command class setup follow Django management command conventions correctly.
13-19: LGTM!The argument definitions are clear and properly configured. The
org_idparameter accepting either organization ID or name provides good flexibility.
36-59: LGTM!The cache management, database retrieval, and cache status display logic is well-structured. The fallback to default settings when no database limit exists is handled correctly, and the user feedback is clear and informative.
New command allows administrators to remove custom organization rate limits, reverting organizations back to the system default limit. Features: - Delete custom rate limit for an organization (by ID or name) - Confirmation prompt (can be skipped with --force) - Shows default limit that will be used after deletion - Warns if current usage exceeds default - Automatic cache clearing via post_delete signal Also improved error logging in WorkflowExecution to use logger.exception() instead of logger.error() for better stack trace visibility. Usage: python manage.py delete_org_rate_limit org_qijtoAkJNhznYhNt python manage.py delete_org_rate_limit zipstack --force 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
backend/api_v2/management/commands/delete_org_rate_limit.py (2)
22-29: Consider adding exception chaining for better error context.The fallback logic from organization_id to name is well-designed. However, when raising the final
CommandError, consider using exception chaining (from errorfrom None) to provide better debugging context.Apply this diff:
except Organization.DoesNotExist: try: organization = Organization.objects.get(name=org_id) except Organization.DoesNotExist: - raise CommandError(f'Organization "{org_id}" not found') + raise CommandError(f'Organization "{org_id}" not found') from None
70-82: Consider moving imports to the top of the file.While late imports are sometimes necessary in Django commands to avoid circular dependencies, these particular imports (settings and RateLimitDefaults) could likely be moved to the top of the file for better code organization and early error detection.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
backend/api_v2/management/commands/delete_org_rate_limit.py(1 hunks)backend/workflow_manager/workflow_v2/models/execution.py(3 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
backend/api_v2/management/commands/delete_org_rate_limit.py
18-18: Unused method argument: args
(ARG002)
29-29: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
29-29: Avoid specifying long messages outside the exception class
(TRY003)
100-100: Do not catch blind exception: Exception
(BLE001)
backend/workflow_manager/workflow_v2/models/execution.py
341-341: Redundant exception object included in logging.exception call
(TRY401)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (6)
backend/api_v2/management/commands/delete_org_rate_limit.py (3)
1-16: LGTM: Clean command structure.The imports, class definition, and argument parsing are well-structured and follow Django management command conventions.
31-68: LGTM: Well-designed deletion flow with proper safeguards.The logic correctly checks for the existence of a custom limit, provides clear user feedback, implements a confirmation prompt (bypassable with
--force), and documents that cache invalidation happens automatically via signals.
103-103: LGTM: Clear final confirmation message.The final message appropriately confirms that cache invalidation is automatic.
backend/workflow_manager/workflow_v2/models/execution.py (3)
299-300: LGTM!The flag initialization is clear and follows a standard pattern for tracking whether rate limit cleanup is needed.
304-310: LGTM!The logic correctly identifies terminal execution states and sets the flag appropriately. All final states (COMPLETED, ERROR, STOPPED) are covered.
319-321: LGTM!The conditional check correctly limits rate limit release to pipeline-based executions, and the call is appropriately placed after
save()to ensure the execution state is persisted.
New Features:
1. Management command to clear organization rate limit cache
- Clear specific org: --org-id <org_id>
- Clear all with custom limits: (default)
- Clear ALL orgs: --all
- Uses Redis pattern deletion (rate_limit:cache:org_limit:*)
for performance, falls back to individual deletion
2. Reduced cache TTL from 1 hour to 10 minutes
- Allows faster pickup of default limit changes from ENV
- Still provides significant DB query reduction
3. Implement cache TTL refresh on every read
- TTL is extended by 10 minutes on every cache hit
- Frequently-used orgs stay cached indefinitely
- Inactive orgs expire after 10 minutes
- LRU-like behavior without explicit LRU cache
Usage:
# Clear specific organization cache
python manage.py clear_org_rate_limit_cache --org-id org_qijtoAkJNhznYhNt
# Clear cache for all orgs with custom limits
python manage.py clear_org_rate_limit_cache
# Clear cache for ALL organizations (uses pattern deletion if Redis)
python manage.py clear_org_rate_limit_cache --all
This is useful when changing API_DEPLOYMENT_DEFAULT_RATE_LIMIT to ensure
organizations pick up the new default value immediately.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (5)
backend/api_v2/rate_limiter.py (5)
72-76: Consider removing TTL refresh on cache hits for better performance.The cache write on line 75 adds overhead on every cache hit (within the rate-limit critical path). Since the cache TTL is already 10 minutes (default), refreshing it on every read may not be necessary.
If you want to remove the TTL refresh overhead, apply this diff:
- # Try cache first cached_limit = cache.get(cache_key) if cached_limit is not None: - # Refresh TTL on cache hit (extends TTL for frequently-used orgs) - cache.set(cache_key, cached_limit, cache_ttl) return cached_limitHowever, if extending TTL for frequently-used orgs is important, the current design is acceptable.
142-298: Well-designed atomic check-and-acquire with appropriate fail-open strategy.The per-organization lock prevents race conditions within each org, and the fail-open behavior ensures availability when Redis has issues. The documented global limit race condition (1-2% over-limit possible) is an acceptable trade-off.
For better debugging, consider using
logging.exceptioninstead oflogging.erroron line 288 to capture stack traces:except Exception as e: - logger.error(f"Error in rate limit check for org {org_id}: {e}") + logger.exception(f"Error in rate limit check for org {org_id}") # Fail open: allow request on errors return True, NoneThis helps diagnose unexpected Redis failures without changing behavior.
299-348: Deprecation notice is clear.Good documentation that this method has a race condition and users should migrate to
check_and_acquire().For consistency with the main method, consider using
logging.exceptionon line 344:except Exception as e: # If Redis fails, allow the request to proceed (fail open) - logger.error( - f"Rate limit check failed for org {organization.organization_id}: {e}. " - "Allowing request to proceed." - ) + logger.exception( + f"Rate limit check failed for org {organization.organization_id}. " + "Allowing request to proceed." + ) return True, None
350-391: Deprecated method retained for backward compatibility.The atomic pipeline operations are correct.
For better debugging, use
logging.exceptionon line 390:except Exception as e: - logger.error(f"Failed to acquire rate limit slot: {e}") + logger.exception("Failed to acquire rate limit slot") return False
393-414: Atomic slot release with pipeline is correct.For consistency, use
logging.exceptionon line 414:except Exception as e: - logger.error(f"Failed to release rate limit slot: {e}") + logger.exception("Failed to release rate limit slot")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
backend/api_v2/management/commands/clear_org_rate_limit_cache.py(1 hunks)backend/api_v2/rate_limit_constants.py(1 hunks)backend/api_v2/rate_limiter.py(1 hunks)backend/sample.env(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/sample.env
🧰 Additional context used
🪛 Ruff (0.14.3)
backend/api_v2/management/commands/clear_org_rate_limit_cache.py
26-26: Unused method argument: args
(ARG002)
126-126: Consider moving this statement to an else block
(TRY300)
127-127: Do not catch blind exception: Exception
(BLE001)
backend/api_v2/rate_limiter.py
285-285: Consider moving this statement to an else block
(TRY300)
287-287: Do not catch blind exception: Exception
(BLE001)
288-288: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
295-295: Do not catch blind exception: Exception
(BLE001)
341-341: Consider moving this statement to an else block
(TRY300)
342-342: Do not catch blind exception: Exception
(BLE001)
344-347: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
388-388: Consider moving this statement to an else block
(TRY300)
389-389: Do not catch blind exception: Exception
(BLE001)
390-390: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
413-413: Do not catch blind exception: Exception
(BLE001)
414-414: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (6)
backend/api_v2/management/commands/clear_org_rate_limit_cache.py (4)
26-38: LGTM! Clear command flow with good separation of concerns.The three-mode approach (specific org → all orgs → custom limits only) provides flexibility for different operational needs.
Note: The static analysis warning about unused
argsis a false positive—it's part of the standard DjangoBaseCommand.handle()signature.
40-61: Excellent UX with double fallback lookup.The ability to specify either organization ID or name makes this command more user-friendly for operators.
111-129: Pattern deletion with fail-safe fallback is well-designed.The bare
Exceptioncatch on line 127 is appropriate here since this is an optional optimization—if pattern deletion fails for any reason, the method correctly falls back to individual deletion.Note: Static analysis warnings (TRY300, BLE001) are false positives in this context.
131-158: Good safety measure with confirmation prompt.The confirmation for large batches (>50 orgs) prevents accidental mass cache invalidation.
backend/api_v2/rate_limit_constants.py (1)
1-120: Excellent centralization of constants and key patterns.The three-class structure (Keys, Defaults, Messages) provides clear separation of concerns and makes maintenance easier. The consistent naming conventions and sensible default values demonstrate good system design.
backend/api_v2/rate_limiter.py (1)
17-17: Module-levelget_redis_connection()call is safe and will not cause import failures.The
get_redis_connection("default")call returns a redis-py client object immediately without opening a TCP connection, and actual socket connections are created lazily on demand when the first Redis command is executed. The module will load successfully regardless of Redis availability. All usages ofredis_cacheare within methods that execute at runtime, where try-except blocks already implement the fail-open strategy (allowing requests to proceed if Redis operations fail). This is a standard, safe pattern used elsewhere in the codebase (e.g.,cache_service.py).Likely an incorrect or invalid review comment.
chandrasekharan-zipstack
left a comment
There was a problem hiding this comment.
@ritwik-g we might have to add such similar rate limits for ETL / TASK pipelines as well. I think we provide an option to trigger them via an API as well
Not the intention of this PR currently. Also in ETL not a high concern as API. But definitely something we can consider. Please run this by team and if required raise a task ticket |
Documentation: - Created docs/API_DEPLOYMENT_RATE_LIMITING.md with: - Architecture overview (per-org + global limits, Redis locks, cache) - All 7 ENV variables with detailed explanations - All 5 management commands with usage examples - Usage scenarios and best practices - Troubleshooting guide - Performance characteristics and security considerations - Confirms rate limiting ONLY affects API deployments (not ETL/tasks) Default Limit Updates (3 locations): - Organization limit: 5 → 20 concurrent requests - Global limit: 50 → 100 concurrent requests - Updated in: 1. backend/api_v2/rate_limit_constants.py (RateLimitDefaults) 2. backend/sample.env (environment variable defaults) 3. backend/backend/settings/base.py (Django setting defaults) Also updated cache TTL default in base.py: 3600s → 600s (10 minutes) to match the change made in previous commit. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
backend/api_v2/rate_limit_constants.py(1 hunks)backend/backend/settings/base.py(1 hunks)backend/sample.env(1 hunks)docs/API_DEPLOYMENT_RATE_LIMITING.md(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- backend/sample.env
- backend/backend/settings/base.py
- backend/api_v2/rate_limit_constants.py
🧰 Additional context used
🪛 LanguageTool
docs/API_DEPLOYMENT_RATE_LIMITING.md
[grammar] ~5-~5: Use a hyphen to join words.
Context: ...erview This document describes the rate limiting implementation for API deployme...
(QB_NEW_EN_HYPHEN)
[grammar] ~119-~119: Use a hyphen to join words.
Context: ...equests** to proceed. This prevents rate limiting infrastructure issues from bloc...
(QB_NEW_EN_HYPHEN)
[grammar] ~588-~588: Use a hyphen to join words.
Context: ...al of Service (DoS) Prevention The rate limiting system itself is designed to pr...
(QB_NEW_EN_HYPHEN)
[grammar] ~589-~589: Use a hyphen to join words.
Context: ... DoS: - Fail-open strategy prevents rate limiting infrastructure from becoming at...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
docs/API_DEPLOYMENT_RATE_LIMITING.md
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
183-183: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
208-208: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
232-232: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
261-261: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
290-290: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
627-627: Bare URL used
(MD034, no-bare-urls)
628-628: Bare URL used
(MD034, no-bare-urls)
629-629: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (2)
docs/API_DEPLOYMENT_RATE_LIMITING.md (2)
1-630: Documentation is comprehensive but verify implementation alignment.Overall, this is well-structured and thorough operational documentation. The architecture is clearly explained, management commands are well-documented with examples, and best practices/troubleshooting sections are helpful. However, the critical default value discrepancies (flagged separately) must be resolved before this documentation is accurate.
Once defaults are corrected, consider a follow-up verification that:
- Cache TTL values (10 min for org limits, 6 hours for ZSET) match actual implementation
- Lock timeouts and blocking timeouts match configured values in code
- All code examples actually work with corrected defaults
128-154: ****The review comment incorrectly identifies a discrepancy. The documentation accurately represents two distinct and intentional defaults serving different purposes:
Environment variable defaults (lines 128-154):
API_DEPLOYMENT_DEFAULT_RATE_LIMIT=20andAPI_DEPLOYMENT_GLOBAL_RATE_LIMIT=100are application-wide configuration defaults used when no per-organization limit is set—verified inbackend/backend/settings/base.pyandbackend/api_v2/rate_limit_constants.py.Model field default (line 554):
concurrent_request_limit = IntegerField(default=5)is the database model default when creating a newOrganizationRateLimitrecord—a separate concern from environment configuration.These are not conflicting values; they address different layers. Additionally, the claim that "PR objectives state defaults are 5/50" cannot be verified—the PR template is empty and no PR description containing such objectives was found in the codebase. The documentation is internally consistent and correctly reflects the implementation.
Likely an incorrect or invalid review comment.
Issues Fixed:
1. RateLimitExceeded was not using centralized RateLimitMessages constants
2. Response format documentation didn't match actual drf-standardized-errors format
3. Retry-After header and retry_after_seconds were not useful (hard to predict accurately)
Changes:
- RateLimitExceeded now uses RateLimitMessages.get_org_limit_exceeded_message()
and get_global_limit_exceeded_message() for consistent error messages
- Removed retry_after_seconds parameter from RateLimitExceeded.__init__()
- Removed retry_after_seconds from rate_limiter.py return dicts
- Removed retry_after_seconds from deployment_helper.py exception call
- Removed dead Retry-After header code from backend/exceptions.py
- Removed API_DEPLOYMENT_RATE_LIMIT_RETRY_AFTER from:
- sample.env
- settings/base.py
- rate_limit_constants.py (DEFAULT_RETRY_AFTER_SECONDS)
- Updated documentation to show actual drf-standardized-errors response format:
{
"type": "client_error",
"errors": [{"code": "error", "detail": "...", "attr": null}]
}
- Removed all Retry-After references from documentation
Clients should implement their own retry logic with exponential backoff.
Rate limits are released when active requests complete.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/api_v2/deployment_helper.py (1)
186-268: Release rate-limit slot on early failures.We acquire the Redis slot (
check_and_acquire) before any workflow side effects, but every failure path before the async job is successfully enqueued (e.g.create_workflow_execution, file staging, queue dispatch) exits without callingAPIDeploymentRateLimiter.release_slot. Each failure pins the token in Redis until the 6‑hour TTL expires, so a handful of transient errors will exhaust the org/global quota and block legitimate traffic. Please release the slot whenever setup fails.A minimal fix looks like this:
- execution_id = uuid.uuid4() - - # Atomically check rate limit and acquire slot - can_proceed, limit_info = APIDeploymentRateLimiter.check_and_acquire( - organization, str(execution_id) - ) + execution_uuid = uuid.uuid4() + rate_limit_token = str(execution_uuid) + + # Atomically check rate limit and acquire slot + can_proceed, limit_info = APIDeploymentRateLimiter.check_and_acquire( + organization, rate_limit_token + ) if not can_proceed: ... - workflow_execution = WorkflowExecutionServiceHelper.create_workflow_execution( - workflow_id=workflow_id, - pipeline_id=pipeline_id, - execution_id=execution_id, - ... - ) + try: + workflow_execution = WorkflowExecutionServiceHelper.create_workflow_execution( + workflow_id=workflow_id, + pipeline_id=pipeline_id, + execution_id=execution_uuid, + ... + ) + except Exception: + APIDeploymentRateLimiter.release_slot( + str(organization.organization_id), rate_limit_token + ) + raise execution_id = workflow_execution.id - hash_values_of_files = SourceConnector.add_input_file_to_api_storage(...) + try: + hash_values_of_files = SourceConnector.add_input_file_to_api_storage(...) + except Exception: + APIDeploymentRateLimiter.release_slot( + str(organization.organization_id), rate_limit_token + ) + raise try: result = WorkflowHelper.execute_workflow_async(...) except Exception as error: + APIDeploymentRateLimiter.release_slot( + str(organization.organization_id), rate_limit_token + ) ...That keeps the slot consistent with the token we added and frees it immediately on any setup error.
🧹 Nitpick comments (2)
docs/API_DEPLOYMENT_RATE_LIMITING.md (2)
26-291: Add language identifiers to fenced code blocks.markdownlint is still flagging the code fences without a language (e.g., the flow diagram, command snippets, outputs around Lines 26, 179, 204, 228, 257, 286). Please tag them with something appropriate (
text,bash,http,python, etc.) so lint passes and readers get syntax highlighting.
625-627: Wrap bare URLs in Markdown links.The references section still has naked URLs. Please wrap each in
[descriptive text](url)form to comply with MD034 and keep the style consistent.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (7)
backend/api_v2/deployment_helper.py(4 hunks)backend/api_v2/exceptions.py(1 hunks)backend/api_v2/rate_limit_constants.py(1 hunks)backend/api_v2/rate_limiter.py(1 hunks)backend/backend/settings/base.py(1 hunks)backend/sample.env(1 hunks)docs/API_DEPLOYMENT_RATE_LIMITING.md(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- backend/sample.env
- backend/backend/settings/base.py
🧰 Additional context used
🪛 LanguageTool
docs/API_DEPLOYMENT_RATE_LIMITING.md
[grammar] ~5-~5: Use a hyphen to join words.
Context: ...erview This document describes the rate limiting implementation for API deployme...
(QB_NEW_EN_HYPHEN)
[grammar] ~119-~119: Use a hyphen to join words.
Context: ...equests** to proceed. This prevents rate limiting infrastructure issues from bloc...
(QB_NEW_EN_HYPHEN)
[grammar] ~586-~586: Use a hyphen to join words.
Context: ...al of Service (DoS) Prevention The rate limiting system itself is designed to pr...
(QB_NEW_EN_HYPHEN)
[grammar] ~587-~587: Use a hyphen to join words.
Context: ... DoS: - Fail-open strategy prevents rate limiting infrastructure from becoming at...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
docs/API_DEPLOYMENT_RATE_LIMITING.md
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
179-179: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
204-204: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
228-228: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
257-257: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
286-286: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
625-625: Bare URL used
(MD034, no-bare-urls)
626-626: Bare URL used
(MD034, no-bare-urls)
627-627: Bare URL used
(MD034, no-bare-urls)
🪛 Ruff (0.14.3)
backend/api_v2/rate_limiter.py
273-273: Consider moving this statement to an else block
(TRY300)
275-275: Do not catch blind exception: Exception
(BLE001)
276-276: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
283-283: Do not catch blind exception: Exception
(BLE001)
327-327: Consider moving this statement to an else block
(TRY300)
328-328: Do not catch blind exception: Exception
(BLE001)
330-333: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
374-374: Consider moving this statement to an else block
(TRY300)
375-375: Do not catch blind exception: Exception
(BLE001)
376-376: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
399-399: Do not catch blind exception: Exception
(BLE001)
400-400: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
- Move APIDeploymentRateLimiter import to module top (PEP 8 compliance) - Fix off-by-one warning bug: change > to >= to match rate limiter blocking logic - Affects: delete_org_rate_limit, set_org_rate_limit, get_org_rate_limit, list_org_rate_limits The rate limiter blocks when usage >= limit, so warnings must also use >= for consistency. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 0
♻️ Duplicate comments (6)
backend/api_v2/management/commands/set_org_rate_limit.py (2)
35-37: Preserve exception chain or explicitly suppress it.This issue was flagged in a previous review. When re-raising within an except clause, use
raise ... from Noneto suppress the intermediate exception orraise ... from errto preserve the chain.
41-41: Prefix unused variable with underscore.This issue was flagged in a previous review. The
org_rate_limitvariable is unused; prefix it with an underscore to follow Python conventions.backend/api_v2/management/commands/get_org_rate_limit.py (1)
33-33: Improve exception chaining.This issue was flagged in a previous review. Use
raise ... from Noneto suppress the intermediate exception for a cleaner user experience.backend/api_v2/management/commands/delete_org_rate_limit.py (3)
30-30: Improve exception chaining.This issue was flagged in a previous review. Use
raise ... from Noneto suppress the intermediate exception for a cleaner user experience.
72-74: Move imports to module top.As flagged in a previous review, these imports should be moved to the top of the file rather than imported late within the function.
92-92: Fix comparison logic to match rate limiter behavior.As identified in a previous review, the comparison should use
>=instead of>to align with the rate limiter's blocking behavior (which blocks when usage equals the limit).
🧹 Nitpick comments (3)
backend/api_v2/management/commands/set_org_rate_limit.py (1)
27-37: Address the duplicated organization lookup pattern.This exact organization lookup logic (try by
organization_id, fallback toname) is duplicated across three management commands (set_org_rate_limit.py,get_org_rate_limit.py,delete_org_rate_limit.py). Consider extracting it into a shared helper function to improve maintainability.For example, add a helper in a shared utilities module:
def get_organization_by_id_or_name(org_id: str) -> Organization: """Get organization by organization_id or name.""" try: return Organization.objects.get(organization_id=org_id) except Organization.DoesNotExist: try: return Organization.objects.get(name=org_id) except Organization.DoesNotExist: raise CommandError(f'Organization "{org_id}" not found') from NoneThen use it in all three commands:
- # Get organization (try organization_id first, then name) - try: - organization = Organization.objects.get(organization_id=org_id) - except Organization.DoesNotExist: - # Try by name - try: - organization = Organization.objects.get(name=org_id) - except Organization.DoesNotExist: - raise CommandError( - f'Organization with ID or name "{org_id}" does not exist' - ) + organization = get_organization_by_id_or_name(org_id)backend/api_v2/management/commands/get_org_rate_limit.py (1)
27-33: Address duplicated organization lookup pattern.This is the same duplicated lookup pattern mentioned in
set_org_rate_limit.py. See the comment on that file for the suggested helper function approach.backend/api_v2/management/commands/delete_org_rate_limit.py (1)
23-30: Address duplicated organization lookup pattern.This is the same duplicated lookup pattern mentioned in
set_org_rate_limit.py. See the comment on that file for the suggested helper function approach.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (4)
backend/api_v2/management/commands/delete_org_rate_limit.py(1 hunks)backend/api_v2/management/commands/get_org_rate_limit.py(1 hunks)backend/api_v2/management/commands/list_org_rate_limits.py(1 hunks)backend/api_v2/management/commands/set_org_rate_limit.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
backend/api_v2/management/commands/list_org_rate_limits.py
17-17: Unused method argument: args
(ARG002)
42-42: Do not catch blind exception: Exception
(BLE001)
backend/api_v2/management/commands/get_org_rate_limit.py
22-22: Unused method argument: args
(ARG002)
33-33: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
33-33: Avoid specifying long messages outside the exception class
(TRY003)
94-94: Do not catch blind exception: Exception
(BLE001)
backend/api_v2/management/commands/set_org_rate_limit.py
19-19: Unused method argument: args
(ARG002)
25-25: Avoid specifying long messages outside the exception class
(TRY003)
35-37: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
35-37: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Unpacked variable org_rate_limit is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
69-69: Do not catch blind exception: Exception
(BLE001)
backend/api_v2/management/commands/delete_org_rate_limit.py
19-19: Unused method argument: args
(ARG002)
30-30: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
30-30: Avoid specifying long messages outside the exception class
(TRY003)
99-99: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (4)
backend/api_v2/management/commands/set_org_rate_limit.py (1)
69-70: Bare exception catch is acceptable here.The broad
Exceptioncatch on line 69 is appropriate in this context since the usage reporting is supplementary and should not prevent the main operation (setting the rate limit) from completing successfully. The error is logged for visibility.backend/api_v2/management/commands/list_org_rate_limits.py (1)
17-45: LGTM! Clean implementation with appropriate error handling.The command correctly:
- Uses
select_relatedto optimize database queries- Provides an optional
--with-usageflag for performance-conscious users- Handles usage retrieval errors gracefully without blocking the listing operation
- The bare
Exceptioncatch on line 42 is appropriate here since usage reporting is supplementarybackend/api_v2/management/commands/get_org_rate_limit.py (1)
37-92: Well-structured command with helpful diagnostics.The implementation provides comprehensive visibility into:
- Database vs. cached limits
- Cache invalidation control
- Current usage with color-coded capacity warnings
The bare
Exceptioncatch on line 94 is appropriate since usage reporting should not block the command.backend/api_v2/management/commands/delete_org_rate_limit.py (1)
46-69: Well-implemented confirmation flow with appropriate safety checks.The confirmation prompt (unless
--force) provides good UX by showing the user what will change before executing the deletion. The cache is automatically cleared via the post_delete signal.
**Problem:** Review bot identified critical bug where early failures (DB, file staging, queue dispatch) would leak rate limit slots for 6 hours, potentially exhausting org/global quotas. **Root Cause:** - Rate limiting was in deployment_helper.py - Helper has internal try-catch that swallows exceptions and returns error response - Exceptions that don't propagate can't be caught by outer handlers - No slot release on these failures → orphaned slots **Solution:** Move rate limiting to view layer with dual release strategy: 1. **View Layer (api_deployment_views.py)** - Acquires rate limit slot before calling execute_workflow() - Wraps execute_workflow() in try-catch - Releases slot if exception propagates (early setup failures) 2. **Helper Layer (deployment_helper.py)** - Accepts optional execution_id parameter - Existing exception handler now also releases slot - Handles failures in async dispatch that don't propagate 3. **Signal (workflow completion)** - Unchanged - releases slot when async job completes successfully **Coverage:** ✅ Lines 197-241 failures (Tag creation, WorkflowExecution, file staging) → View catches & releases ✅ Lines 243-289 failures (Async dispatch, config checks) → Helper catches & releases ✅ Async job completion → Signal releases ✅ No double-release (each path has one release point) ✅ No orphaned slots on any error path **Files Changed:** - api_deployment_views.py: Add rate limit check + blanket exception handler - deployment_helper.py: Accept execution_id param, add slot release to exception handler - API_DEPLOYMENT_RATE_LIMITING.md: Update architecture diagram and code locations 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (3)
backend/api_v2/api_deployment_views.py(4 hunks)backend/api_v2/deployment_helper.py(6 hunks)docs/API_DEPLOYMENT_RATE_LIMITING.md(1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/API_DEPLOYMENT_RATE_LIMITING.md
[grammar] ~5-~5: Use a hyphen to join words.
Context: ...erview This document describes the rate limiting implementation for API deployme...
(QB_NEW_EN_HYPHEN)
[grammar] ~147-~147: Use a hyphen to join words.
Context: ...equests** to proceed. This prevents rate limiting infrastructure issues from bloc...
(QB_NEW_EN_HYPHEN)
[grammar] ~614-~614: Use a hyphen to join words.
Context: ...al of Service (DoS) Prevention The rate limiting system itself is designed to pr...
(QB_NEW_EN_HYPHEN)
[grammar] ~615-~615: Use a hyphen to join words.
Context: ... DoS: - Fail-open strategy prevents rate limiting infrastructure from becoming at...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
docs/API_DEPLOYMENT_RATE_LIMITING.md
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
207-207: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
232-232: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
256-256: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
285-285: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
314-314: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
654-654: Bare URL used
(MD034, no-bare-urls)
655-655: Bare URL used
(MD034, no-bare-urls)
656-656: Bare URL used
(MD034, no-bare-urls)
🪛 Ruff (0.14.3)
backend/api_v2/api_deployment_views.py
131-131: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (3)
backend/api_v2/api_deployment_views.py (1)
93-108: Rate limiting integration is well-implemented.The check-and-acquire flow correctly:
- Generates a unique execution_id for tracking
- Performs atomic rate limit checks via APIDeploymentRateLimiter
- Raises RateLimitExceeded with detailed context (current_usage, limit, limit_type)
- Logs the organization ID for debugging
This aligns with the documented dual-layer rate limiting architecture.
backend/api_v2/deployment_helper.py (2)
163-190: Backward compatibility handled well.The execution_id parameter with a None default and fallback UUID generation ensures existing callers continue to work while enabling the new rate limiting flow. The docstring appropriately notes that rate limiting is handled at the view layer.
202-208: Execution ID is properly respected - no critical issue found.The original concern is based on a false premise. The
create_workflow_executionmethod usesid=execution_id if execution_id else uuid.uuid4(), meaning it respects the providedexecution_idparameter as the primary key.When
deployment_helper.pyline 202 passesexecution_id="abc-123", the workflow execution is created with that same ID. Line 207's reassignment is redundant but harmless—it assigns the same value back to the variable.The rate limit acquired in the view layer and all subsequent operations use the same ID consistently. No slot leak occurs.
Likely an incorrect or invalid review comment.
…ion handler Changed logger.error() to logger.exception() in api_deployment_views.py exception handler to automatically include the full stack trace. This provides better debugging information when workflow execution fails during rate limit protected operations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/api_v2/api_deployment_views.py (1)
135-140: Release rate-limit slot when the workflow returns an error payload
WhenDeploymentHelper.execute_workflowreturns an error dictionary (no exception), we exit without freeing the acquired slot. That leaks the per-org/global slot until Redis TTL expiry, effectively blocking further deployments after only a few soft failures. Please release the slot before returning the 422 response.# Success - signal will handle slot release when workflow completes if "error" in response and response["error"]: logger.error("API deployment execution failed") + APIDeploymentRateLimiter.release_slot(organization, execution_id) return Response( {"message": response}, status=status.HTTP_422_UNPROCESSABLE_ENTITY, )
🧹 Nitpick comments (1)
backend/api_v2/api_deployment_views.py (1)
128-132: Letlogger.exceptioncapture the traceback without interpolating the error
logger.exceptionalready records the current exception and traceback; interpolating{error}is redundant and Ruff flags it (TRY401). Please log a static message instead.- logger.exception(f"Workflow execution failed: {error}") + logger.exception("Workflow execution failed")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
backend/api_v2/api_deployment_views.py(4 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
backend/api_v2/api_deployment_views.py
131-131: Redundant exception object included in logging.exception call
(TRY401)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (2)
docs/API_DEPLOYMENT_RATE_LIMITING.md (2)
26-26: Add language identifiers to fenced code blocks for syntax highlighting.Multiple code blocks are missing language identifiers. Update them as follows:
-``` +```text ┌─────────────────────────────────────────────────────────────┐ │ API Request Flow (View Layer) │For the Python configuration block (line 183):
-``` +```python API_DEPLOYMENT_DEFAULT_RATE_LIMIT = int(os.environ.get("API_DEPLOYMENT_DEFAULT_RATE_LIMIT", 20))For all management command output examples (lines 207, 232, 256, 285, 314):
-``` +```bash Created rate limit for organization "acme-corp" (org_a1b2c3d4e5f6g7h8): 50For the HTTP response example (line 394):
-``` +```http HTTP/1.1 200 OKAlso applies to: 183-184, 207-208, 232-233, 256-257, 285-286, 314-315, 394-395
654-656: Wrap bare URLs in markdown link syntax.Convert the bare URLs to proper markdown links:
-- Redis ZSET documentation: https://redis.io/docs/data-types/sorted-sets/ -- Django cache framework: https://docs.djangoproject.com/en/stable/topics/cache/ -- HTTP 429 status code: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429 +- [Redis ZSET documentation](https://redis.io/docs/data-types/sorted-sets/) +- [Django cache framework](https://docs.djangoproject.com/en/stable/topics/cache/) +- [HTTP 429 status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/429)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
docs/API_DEPLOYMENT_RATE_LIMITING.md(1 hunks)
🧰 Additional context used
🪛 LanguageTool
docs/API_DEPLOYMENT_RATE_LIMITING.md
[grammar] ~5-~5: Use a hyphen to join words.
Context: ...erview This document describes the rate limiting implementation for API deployme...
(QB_NEW_EN_HYPHEN)
[grammar] ~147-~147: Use a hyphen to join words.
Context: ...equests** to proceed. This prevents rate limiting infrastructure issues from bloc...
(QB_NEW_EN_HYPHEN)
[grammar] ~614-~614: Use a hyphen to join words.
Context: ...al of Service (DoS) Prevention The rate limiting system itself is designed to pr...
(QB_NEW_EN_HYPHEN)
[grammar] ~615-~615: Use a hyphen to join words.
Context: ... DoS: - Fail-open strategy prevents rate limiting infrastructure from becoming at...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
docs/API_DEPLOYMENT_RATE_LIMITING.md
26-26: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
207-207: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
232-232: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
256-256: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
285-285: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
314-314: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
654-654: Bare URL used
(MD034, no-bare-urls)
655-655: Bare URL used
(MD034, no-bare-urls)
656-656: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
1. Remove deprecated acquire_slot() method (PR comment #1) - Method marked as deprecated, not used anywhere in codebase - check_and_acquire() is the only method used (atomic operation) - Dead code removal improves maintainability 2. Add detailed comment explaining TTL vs manual cleanup (PR comment #2) - Clarifies why _cleanup_expired_entries() is needed even with TTL - Redis ZSET TTL expires entire key, not individual entries - Manual cleanup (ZREMRANGEBYSCORE) removes stale entries within ZSET - Both mechanisms work together for accurate rate limiting 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
Test ResultsSummary
Runner Tests - Full Report
SDK1 Tests - Full Report
|
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
backend/api_v2/rate_limiter.py (4)
203-203: Consider moving import to module level.The
Lockimport is inside the method. While this works, moving it to the top-level imports would be more conventional and slightly more efficient since the import only needs to happen once per module load.Apply this diff to move the import:
from django_redis import get_redis_connection +from redis.lock import Lock from api_v2.models import OrganizationRateLimitAnd remove line 203:
- from redis.lock import Lock - lock_timeout = getattr(
282-285: Uselogging.exceptionfor better debugging.When logging errors in exception handlers,
logging.exceptionautomatically includes the stack trace, which aids debugging without requiring manual traceback formatting.Apply this diff:
except Exception as e: - logger.error(f"Error in rate limit check for org {org_id}: {e}") + logger.exception(f"Error in rate limit check for org {org_id}: {e}") # Fail open: allow request on errors return True, None
294-341: Deprecation is well-documented; consider improved logging.The deprecation notice clearly explains the race condition and directs users to
check_and_acquire(). This is good practice for backward compatibility.Minor improvement: Use
logging.exceptionin the exception handler (lines 337-340) for better debugging information.Apply this diff:
except Exception as e: # If Redis fails, allow the request to proceed (fail open) - logger.error( + logger.exception( f"Rate limit check failed for org {organization.organization_id}: {e}. " "Allowing request to proceed." )
343-364: Correct atomic slot release; improve error logging.The use of a pipeline to atomically remove from both ZSETs is correct. Consider using
logging.exceptionin the error handler for better debugging.Apply this diff:
except Exception as e: - logger.error(f"Failed to release rate limit slot: {e}") + logger.exception(f"Failed to release rate limit slot: {e}")
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
backend/api_v2/rate_limiter.py(1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.3)
backend/api_v2/rate_limiter.py
280-280: Consider moving this statement to an else block
(TRY300)
282-282: Do not catch blind exception: Exception
(BLE001)
283-283: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
290-290: Do not catch blind exception: Exception
(BLE001)
334-334: Consider moving this statement to an else block
(TRY300)
335-335: Do not catch blind exception: Exception
(BLE001)
337-340: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
363-363: Do not catch blind exception: Exception
(BLE001)
364-364: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (6)
backend/api_v2/rate_limiter.py (6)
1-18: LGTM: Clean module setup.The imports are well-organized, and initializing
redis_cacheat the module level is efficient for reuse across all class methods.
20-47: LGTM: Well-structured helper methods.The helper methods are clean and use appropriate Redis operations. Good use of
getattrwith defaults for configuration values.
94-103: LGTM: Cache clearing is straightforward.
105-147: LGTM: Excellent explanation of TTL vs manual cleanup.The detailed comment (lines 122-129) clearly explains why both manual cleanup and key TTL are necessary, addressing the previous review question. The implementation correctly cleans expired entries before counting to maintain accuracy.
149-293: Strong implementation with atomic guarantees and clear documentation.The
check_and_acquiremethod correctly implements atomic rate limiting with per-organization locking. The detailed explanation (lines 155-186) of the trade-off between per-org locking and potential global limit overage is excellent documentation for future maintainers.Key strengths:
- Proper use of Redis Lock for per-org atomicity
- Cleanup of expired entries before checks
- Pipeline for atomic multi-key updates
- Fail-open strategy for resilience
- Clear logging at appropriate levels
49-92: Cache invalidation is properly configured; no issues found.The verification confirms that cache invalidation signals are correctly implemented:
- The
save()method calls_clear_cache()which explicitly deletes the cache entry (lines 167-170)- A
post_deletesignal handler also clears the cache on deletion- The TTL refresh pattern on read is intentional and safe because any model update or deletion immediately invalidates the cache
The implementation is correct as designed.



Summary
Implements a comprehensive rate limiting system for API deployments with atomic operations, Django caching, and management commands to prevent race conditions and reduce database load.
Key Features
✅ Race Condition Fix: Per-organization Redis distributed locks ensure atomic check-and-acquire
✅ Django Cache Integration: ~95% reduction in DB queries for org rate limits (1-hour cache TTL)
✅ Centralized Constants: All Redis/cache keys in one module, no magic strings
✅ Auto Cache Invalidation: Automatic cache clearing on model save/delete via signals
✅ Management Commands: Easy admin tools for setting/viewing rate limits
✅ Production Ready: Fail-open strategy, comprehensive error handling, configurable timeouts
Technical Implementation
Architecture
Performance Impact
Files Changed
New Files
backend/api_v2/rate_limit_constants.py- Centralized keys, defaults, and messagesbackend/api_v2/rate_limiter.py- Core rate limiting logic with locks and cachingbackend/api_v2/migrations/0003_add_organization_rate_limit.py- Database migrationbackend/api_v2/management/commands/set_org_rate_limit.py- Set/update org limitsbackend/api_v2/management/commands/get_org_rate_limit.py- View limits and usagebackend/api_v2/management/commands/list_org_rate_limits.py- List all custom limitsModified Files
backend/api_v2/models.py- OrganizationRateLimit model with auto cache clearingbackend/api_v2/deployment_helper.py- Use atomic check_and_acquire()backend/api_v2/admin.py- Admin interface for rate limitsbackend/api_v2/exceptions.py- RateLimitExceeded exceptionbackend/backend/exceptions.py- Add Retry-After header to 429 responsesbackend/workflow_manager/workflow_v2/models/execution.py- Auto-release slots on completionbackend/backend/settings/base.py- Rate limiting configurationbackend/sample.env- Configuration documentationConfiguration
Environment Variables
Management Commands
Set Organization Rate Limit
View Organization Rate Limit
List All Custom Rate Limits
Cache Invalidation Flow
Testing Checklist
python manage.py migrateFuture Enhancements (Optional)
If strict global limit enforcement is needed (no tolerance), the code includes pseudo-code for adding a global lock. See
rate_limiter.py:150-175for implementation guide.Related Issues
Closes UN-2972
🤖 Generated with Claude Code
Co-Authored-By: Claude noreply@anthropic.com