feat: add strategy support to lite mode #538
For the second issue, I used it myself and made sure everything was working. Are you sure the count isn't working correctly?
It does work, and it correctly sends connections to the backend with the lowest connection count. But when a player disconnects from the backend, the count is not updated, so the proxy still thinks there are players there and will send connections to a backend that may actually have more players.
I also changed it so that, instead of counting a connection when the backend with the least connections is selected, it counts when a connection is forwarded and a player actually joins.
Pretty solid doc and code. Could you send a friend request to my Discord so we can discuss this further? Discord: okeanosthedev
@okeanosthedev @robinbraemer Should we also cache backend checks? How long should the cache entries live?
I don't think it is needed that much; it won't make a significant change.
…one of the strategies in the example
- Add null checks and initialization in the main Forward function
- Prevents panic when counter doesn't exist in the map
- Matches the initialization pattern used in leastConnectionsNextBackend
- Add tests for configuration validation of strategies
- Test atomic counter increment/decrement logic for least-connections
- Verify strategy constants are correctly defined
- Add documentation about network connectivity requirements for full testing
The connection counting logic is correct:
- Uses atomic counters for thread safety
- Increments on connection establishment
- Decrements on connection close via defer
- Properly initializes counter map and individual counters
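The counting logic listed in this commit could look roughly like the following. This is a minimal sketch, not Gate's actual code: the `connCounters` type and its `counter`, `forward`, and `active` methods are hypothetical names chosen for illustration. Only the increment on establishment, the deferred decrement, and the lazy map initialization come from the commit message.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// connCounters tracks active connections per backend address.
// Hypothetical type for illustration; not Gate's actual API.
type connCounters struct {
	mu       sync.Mutex
	counters map[string]*atomic.Uint32
}

// counter returns the counter for addr, lazily initializing both the map
// and the individual entry so callers never hit a nil counter.
func (c *connCounters) counter(addr string) *atomic.Uint32 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.counters == nil {
		c.counters = make(map[string]*atomic.Uint32)
	}
	ctr, ok := c.counters[addr]
	if !ok {
		ctr = new(atomic.Uint32)
		c.counters[addr] = ctr
	}
	return ctr
}

// forward counts the connection for as long as serve runs.
func (c *connCounters) forward(addr string, serve func()) {
	ctr := c.counter(addr)
	ctr.Add(1)                // connection established
	defer ctr.Add(^uint32(0)) // adding MaxUint32 wraps around, i.e. decrements by one
	serve()
}

// active reports the current number of connections for addr.
func (c *connCounters) active(addr string) uint32 {
	return c.counter(addr).Load()
}

func main() {
	var c connCounters
	c.forward("10.0.0.1:25565", func() {
		fmt.Println("during:", c.active("10.0.0.1:25565")) // during: 1
	})
	fmt.Println("after:", c.active("10.0.0.1:25565")) // after: 0
}
```

Because the decrement is deferred, the count drops even if `serve` returns early, which is the disconnect case raised earlier in the conversation.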
Addresses all review feedback from robinbraemer:
🔧 **Global State Elimination**
- Remove all global variables (roundRobinIndex, leastConnectionCounter, latencyCache)
- Create StrategyManager per proxy instance to avoid multi-instance conflicts
- Use atomic operations and proper synchronization
🏗️ **Architecture Improvements**
- StrategyManager encapsulates all strategy state per proxy instance
- Pass StrategyManager instead of proxyId to functions
- Proper separation of concerns between strategies and forwarding logic
📊 **Better Latency Measurement**
- Move latency measurement from dial time to status response time
- Status ping latency is more representative of actual server performance
- Measures full request-response cycle instead of just connection setup
🧵 **Concurrency Safety**
- Use atomic counters for connection tracking
- Proper mutex protection for shared maps
- Thread-safe strategy state management
✅ **All Review Issues Addressed**
- No more global vars that break multiple Gate instances
- Atomic operations prevent race conditions
- Reuse status pings for latency measurement (no separate dials)
- Clean architecture with proper encapsulation
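To illustrate why per-instance state avoids the multi-instance conflict, here is a hedged sketch of a manager that keeps a round-robin index per route. The `StrategyManager` name comes from the commit message, but the `roundRobin` field, the `NewStrategyManager` constructor, and the `RoundRobin` method are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// StrategyManager holds strategy state for one proxy instance.
// The fields and methods are a hypothetical sketch, not Gate's implementation.
type StrategyManager struct {
	mu         sync.Mutex
	roundRobin map[string]*atomic.Uint32 // next index, keyed by route
}

// NewStrategyManager returns a manager with its own isolated state.
func NewStrategyManager() *StrategyManager {
	return &StrategyManager{roundRobin: make(map[string]*atomic.Uint32)}
}

// RoundRobin returns the next backend for route, cycling through backends in order.
func (m *StrategyManager) RoundRobin(route string, backends []string) (string, bool) {
	if len(backends) == 0 {
		return "", false
	}
	m.mu.Lock() // the mutex protects the map; the index itself is atomic
	idx, ok := m.roundRobin[route]
	if !ok {
		idx = new(atomic.Uint32)
		m.roundRobin[route] = idx
	}
	m.mu.Unlock()
	n := idx.Add(1) - 1
	return backends[int(n%uint32(len(backends)))], true
}

func main() {
	backends := []string{"b1:25565", "b2:25565"}
	first, second := NewStrategyManager(), NewStrategyManager()

	a1, _ := first.RoundRobin("example.com", backends)
	a2, _ := first.RoundRobin("example.com", backends)
	s1, _ := second.RoundRobin("example.com", backends)

	fmt.Println(a1, a2) // b1:25565 b2:25565
	fmt.Println(s1)     // b1:25565 (the second instance starts from its own index)
}
```

With a package-level index, the second instance would have continued where the first left off; keeping the map on the struct makes each instance independent.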
🏗️ **Clean Architecture**
- Create Lite struct to encapsulate all lite mode functionality
- StrategyManager is now contained within Lite for better abstraction
- Proxy.Lite() provides clean access to lite mode features
- Prepares for future lite mode extensions (caching, metrics, etc.)
✅ **Benefits**
- Better separation of concerns
- Extensible design for future lite mode features
- Cleaner API surface on Proxy struct
- Proper encapsulation of lite mode state
🧪 **Testing**
- Add tests for Lite abstraction and instance isolation
- Verify strategy managers are properly isolated between instances
- Test connection counter isolation between proxy instances
- Remove proxy.id field and getId() method
- Simplify architecture by leveraging per-instance StrategyManager isolation
- Each proxy instance automatically has isolated strategy state
- No need for explicit proxy identification for round-robin or other strategies
- Cleaner, simpler design without unnecessary complexity
🔧 **Type Safety Improvements**
- Create Strategy type instead of plain string for better type safety
- Add comprehensive documentation for each strategy constant
- Improve validation error messages with allowed values
- Better IDE support and code completion
📚 **Enhanced Documentation**
- Complete rewrite of strategy documentation in lite.md
- Add detailed comparison table with use cases and algorithms
- Provide real-world configuration examples (gaming networks, dev setups)
- Explain how strategies affect both connections and status pings
- Add health checking and integration details
🎯 **Better User Experience**
- Clear guidance on which strategy to use when
- Comprehensive examples for different deployment scenarios
- Detailed explanations of each strategy's behavior
- Integration tips for health checking and optimization
The documentation now properly explains:
- When to use each strategy
- How they work under the hood
- Real-world configuration examples
- Performance characteristics and trade-offs
🚀 **Performance Improvements**
- Remove checkBackend() function that was dialing every backend on every request
- Eliminate 5-second TCP dial timeouts blocking strategy selection
- Strategies now return immediately instead of waiting for health checks
- Up to 25+ second improvement for routes with 5+ backends
🏗️ **Better Design**
- Use lazy failure detection via existing tryBackends error handling
- Failed connections naturally retry next backend without pre-checking
- Simpler, faster code path for all strategies
- Remove unnecessary network overhead
📚 **Accurate Documentation**
- Update docs to reflect 'lazy failure detection' instead of 'health-aware'
- Explain natural failover behavior through connection failures
- Remove misleading claims about aggressive health checking
- More accurate performance characteristics
✅ **Maintains Reliability**
- Failure handling still works through tryBackends mechanism
- Fast failover when backends are unreachable
- No loss of functionality, only improved performance
- Remove emoji icons from strategy section headings
- Replace emoji checkmarks with clean bullet points
- Maintain professional tone while preserving readability
- Keep essential information without visual clutter
- More appropriate for enterprise and professional environments
- Add links to comprehensive strategy guide in config.yml and config-lite.yml
- Reference https://gate.minekube.com/guide/lite#load-balancing-strategies
- Update main lite section to reference full documentation
- Improve strategy comment clarity and formatting
- Help users discover detailed strategy configuration options
- Reduce documentation length by ~50% while maintaining all essential information
- Replace verbose strategy detail sections with concise code group examples
- Use inline YAML comments to explain strategy behavior directly in config
- Add practical 'Mixed Strategies' example for real-world usage
- Consolidate performance notes into focused tip section
- Eliminate duplication between table and detail sections
- Improve scannability and practical usability for users
@robinbraemer although the altered backend-check logic is much faster, it reintroduces a couple of problems that were fixed before:
Maybe there is a better way to handle this. A middle ground would be to use a cache instead of checking every backend on every request. Or we could reuse the connection that is created when the backend is checked, instead of discarding it. This better, but slower, method could be an extra config option for users whose backends are frequently unreachable.
The original implementation before PR #538 was much cleaner and simpler. This reverts to the elegant pop-first approach while keeping only the log verbosity fix.
**What was reverted:**
- Complex strategy manager backend selection
- Search-and-remove logic with normalization
- O(n) backend removal loops
**What we kept:**
- Simple pop-first approach: tryBackends[0] then tryBackends[1:]
- Sequential order (predictable, no duplicates)
- O(1) backend removal
- Log verbosity fix (V(1) for failed backends)
**Benefits of simple approach:**
✅ No duplicates - guaranteed by pop-first
✅ Tries all backends in order
✅ Clean, readable code
✅ Same behavior as before PR #538
**Tests updated:**
- Simplified all tests to match the pop-first logic
- Removed complex strategy manager interactions
- Tests now verify sequential backend selection
- All tests still pass and cover the actual logic
This maintains the fix for issues #2 and #3 while being much simpler.
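The pop-first approach named in this commit (take `tryBackends[0]`, then continue with `tryBackends[1:]`) can be sketched as a small closure. The `popFirst` helper is a hypothetical name for this example, not the function in Gate.

```go
package main

import "fmt"

// popFirst returns a closure that hands out backends in order by taking
// tryBackends[0] and reslicing to tryBackends[1:]. Reslicing is O(1) and
// an already-returned backend can never be returned again.
func popFirst(tryBackends []string) func() (string, bool) {
	return func() (string, bool) {
		if len(tryBackends) == 0 {
			return "", false // all backends have been tried
		}
		backend := tryBackends[0]
		tryBackends = tryBackends[1:]
		return backend, true
	}
}

func main() {
	next := popFirst([]string{"a:25565", "b:25565", "c:25565"})
	for backend, ok := next(); ok; backend, ok = next() {
		fmt.Println(backend) // prints a:25565, b:25565, c:25565 in order
	}
}
```

Reslicing only changes the closure's own slice header, so the caller's slice keeps all of its elements.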
…ault (#565)
* fix: reduce log verbosity for failed backend connections in lite mode
This minimal fix addresses issue #564 by simply increasing the log verbosity level for failed backend connection attempts. This prevents console spam when backends are unreachable while maintaining the simple retry behavior that has always worked.
Changes:
- Failed backend connections now log at V(1) debug level instead of info
- Fallback status messages also use V(1) to reduce spam
- No complex health tracking or caching (keeps it simple)
- Preserves existing retry behavior without marking backends unhealthy
Fixes #564
* fix: address all three lite mode issues from #564
This commit provides a complete fix for all three issues:
1. **Log spam reduction** (✓ Fixed)
   - Failed backend logs now use V(1) debug level
   - Fallback messages also use V(1) to reduce verbosity
2. **Backend cycling** (✓ Fixed)
   - When a backend fails, it's removed from the retry list
   - Strategy manager properly cycles through all available backends
   - No duplicate attempts on the same backend
3. **Fallback response** (✓ Fixed)
   - Fallback properly shown when all backends are unreachable
   - Already worked, just needed reduced log verbosity
Changes:
- Modified tryBackends logging to use V(1) for failed attempts
- Fixed nextBackend to remove tried backends from the list
- Added comprehensive tests for all three issues
- Maintains simple retry behavior without complex health tracking
Tests added:
- TestBackendSelection_TriesAllBackends
- TestBackendSelection_SucceedsOnSecondBackend
- TestBackendSelection_NoDuplicateAttempts
- TestFallbackResponse_UsedWhenAllBackendsFail
- TestLogSpamReduction
All existing tests continue to pass.
* refactor: improve testability and add real integration tests
- Extracted handleFallbackResponse for better testability
- Removed useless tests that didn't test real functionality:
  * TestResolveStatusResponseIntegration (just logged messages)
  * TestBackendRemovalFromList (tested list ops, not real code)
- Added meaningful integration tests:
  * TestNextBackendFunctionality - tests actual nextBackend implementation
  * TestFallbackResponseWithRealRoute - tests real fallback scenarios
  * TestLogVerbosityActuallyWorks - verifies log.V(1) behavior
- All tests now exercise actual production code paths
- Better test coverage of the real functionality
* format
* simplify: revert to original pop-first backend selection logic
The original implementation before PR #538 was much cleaner and simpler. This reverts to the elegant pop-first approach while keeping only the log verbosity fix.
**What was reverted:**
- Complex strategy manager backend selection
- Search-and-remove logic with normalization
- O(n) backend removal loops
**What we kept:**
- Simple pop-first approach: tryBackends[0] then tryBackends[1:]
- Sequential order (predictable, no duplicates)
- O(1) backend removal
- Log verbosity fix (V(1) for failed backends)
**Benefits of simple approach:**
✅ No duplicates - guaranteed by pop-first
✅ Tries all backends in order
✅ Clean, readable code
✅ Same behavior as before PR #538
**Tests updated:**
- Simplified all tests to match the pop-first logic
- Removed complex strategy manager interactions
- Tests now verify sequential backend selection
- All tests still pass and cover the actual logic
This maintains the fix for issues #2 and #3 while being much simpler.
* revert: restore original smart error verbosity system
* fix: treat connection refused as debug level to prevent spam
Connection refused errors are common when backends are down and should not spam the console at INFO level. This adds smart detection of connection refused errors in dialRoute to use verbosity 1 (debug level).
Before: Connection refused → Verbosity 0 → INFO level → console spam
After: Connection refused → Verbosity 1 → DEBUG level → quiet
This preserves the smart verbosity system while fixing the specific case of connection refused errors that were causing spam.
* undo config
* format
* add comprehensive strategy behavior tests
Previously there were NO tests for individual strategy behaviors, only validation tests. This adds complete test coverage for all four load balancing strategies.
Tests added:
✅ TestRandomStrategy - verifies random distribution
✅ TestRoundRobinStrategy - verifies sequential cycling
✅ TestRoundRobinStrategy_DifferentRoutes - verifies route isolation
✅ TestLeastConnectionsStrategy - verifies connection-based selection
✅ TestLowestLatencyStrategy - verifies latency-based selection
✅ TestStrategyWithEmptyBackends - edge case handling
✅ TestStrategyWithSingleBackend - single backend behavior
✅ TestGetNextBackendStrategyRouting - integration testing
Each test verifies actual strategy behavior and ensures the algorithms work correctly according to their specifications.
* remove bad test
* feat: add sequential strategy and make it the default
Added explicit sequential strategy enum and made it the default behavior when no strategy is configured, providing clarity and consistency.
Changes:
✅ Added StrategySequential enum to config
✅ Added sequentialNextBackend implementation
✅ Default empty strategy now uses sequential (not random)
✅ Updated documentation to reflect sequential as default
✅ Updated config files to list sequential as first option
✅ Added comprehensive tests for sequential strategy
Strategy behavior:
- Empty strategy → sequential (default)
- strategy: sequential → explicit sequential
- strategy: random → random selection
- strategy: round-robin → round-robin cycling
- strategy: least-connections → connection-based
- strategy: lowest-latency → latency-based
All existing strategies continue to work exactly as before. Sequential is now clearly documented and properly tested.
* remove unused
Overview
Add comprehensive load balancing strategies to Gate Lite mode, enabling intelligent traffic distribution across multiple backend servers.
Co-authored-by: @okeanosthedev
Co-authored-by: @robinbraemer
Features Added
Load Balancing Strategies
Configuration
Major Improvements Made
Architecture Refactoring (Addressed Review Feedback)
Problem: Original implementation used global state variables that would break multiple Gate instances.
Solution: Complete architecture overhaul
- Removed global variables (roundRobinIndex, leastConnectionCounter, latencyCache)
- Lite struct to encapsulate all lite mode functionality
- StrategyManager for isolated state management
Performance Optimizations
Problem: Aggressive health checking was causing 25+ second delays.
Solution: Smart lazy failure detection
- Removed checkBackend() - eliminated 5-second TCP timeouts per backend
- tryBackends handles failures efficiently
Better Latency Measurement
Problem: Using dial time for latency measurement was inaccurate.
Solution: Status ping latency measurement
Thread Safety Improvements
Problem: Race conditions in connection counting.
Solution: Proper atomic operations
- go test -race
Type Safety
Problem: Strategy configuration used plain strings.
Solution: Typed constants with validation
- Strategy type instead of plain strings
Professional Documentation
Problem: Basic documentation with incorrect grammar.
Solution: Comprehensive professional guide
- config.yml and config-lite.yml
Technical Details
Strategy Selection Flow
- StrategyManager.GetNextBackend() called
- tryBackends handles actual dialing
Connection Tracking (Least-Connections)
- counter.Add(1) when connection established
- defer counter.Add(^uint32(0)) when connection closes
Latency Measurement (Lowest-Latency)
Testing
Breaking Changes
None - All changes are backward compatible. Existing configurations continue to work with random strategy as default.
Migration
No migration required - new strategy field is optional and defaults to random behavior.