[refactor] 🔧 Semantic Function Clustering Analysis: Refactoring Opportunities #3119

@ghost

Description

🔧 Semantic Function Clustering Analysis

Analysis of repository: githubnext/gh-aw

Executive Summary

Analysis of 180 non-test Go files across the pkg/ directory revealed several refactoring opportunities through semantic function clustering and duplicate detection. The codebase is generally well-organized with clear package boundaries, but there are opportunities to improve code organization by consolidating validation functions, eliminating duplicate code, and centralizing scattered utilities.

Key Findings:

  • 180 files analyzed across 7 packages (workflow: 108, cli: 60, parser: 6, console: 3, others: 3)
  • 9 files with validation functions scattered outside validation.go
  • At least one JavaScript function duplicated verbatim in 2 files (sanitizeLabelContent)
  • Multiple parsing functions distributed across 15+ files
  • Scattered helper/utility files across packages

Full Report Details

Function Inventory

Package Distribution

| Package | File Count | Primary Purpose |
| --- | --- | --- |
| pkg/workflow/ | 108 | Core workflow compilation, engines, safe outputs |
| pkg/cli/ | 60 | CLI commands, MCP management, tooling |
| pkg/parser/ | 6 | YAML/JSON parsing, GitHub API, frontmatter |
| pkg/console/ | 3 | Terminal UI rendering |
| pkg/constants/ | 1 | Shared constants |
| pkg/logger/ | 1 | Logging utilities |
| pkg/timeutil/ | 1 | Time formatting |

Key File Organization

The repository follows Go best practices with feature-based file organization:

  • Engine files: claude_engine.go, copilot_engine.go, codex_engine.go, etc.
  • Create operations: create_issue.go, create_pr.go, create_discussion.go
  • Specialized functionality per file

Identified Issues

1. Validation Functions Scattered Across Multiple Files

Issue: Validation functions exist in 9+ files outside the dedicated validation.go file, violating the single responsibility principle for validation logic.

Files with Misplaced Validation Functions:

  1. pkg/workflow/compiler.go - Contains validation logic that should be in validation.go
  2. pkg/workflow/docker.go:86 - validateDockerImage() function
  3. pkg/workflow/engine.go:261,315 - validateEngine(), validateSingleEngineSpecification()
  4. pkg/workflow/expression_safety.go:66,141 - validateExpressionSafety(), validateSingleExpression()
  5. pkg/workflow/mcp-config.go:1034,1046 - validateStringProperty(), validateMCPRequirements()
  6. pkg/workflow/npm.go:45 - validateNpxPackages()
  7. pkg/workflow/pip.go:49,84,113,174 - Multiple Python package validation functions
  8. pkg/workflow/strict_mode.go:43,72,94,115,155 - Five strict mode validation functions
  9. pkg/workflow/template.go:54 - validateNoIncludesInTemplateRegions()

Current State:

  • validation.go has 30+ validation functions (primary validation file)
  • 9 other files contain 20+ additional validation functions

Recommendation:

  • Keep domain-specific validation in its domain file (e.g., Docker validation can stay in docker.go)
  • Move general validation functions to validation.go
  • Consider creating validation sub-files if validation.go becomes too large (e.g., validation_packages.go, validation_strict_mode.go)

Estimated Impact: Medium - Improved code organization and easier testing of validation logic
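
The inventory of scattered validation functions can be reproduced approximately with a small script. The following is a minimal sketch, not the method this analysis used: it assumes the Go sources have already been read into an object keyed by file path, and it recognizes only plain `func` declarations and methods with a receiver (generic functions are not handled). The helper name `findScatteredValidators` and the function `validateWorkflow` in the sample input are hypothetical.

```javascript
// Sketch: list validate* functions declared outside validation.go.
// `sources` maps a file path to its Go source text (hypothetical input shape).
const path = require("node:path");

// Matches top-level Go declarations such as
//   func validateDockerImage(...)            (plain function)
//   func (c *Compiler) validateEngine(...)   (method with a receiver)
const VALIDATE_FUNC = /^func\s+(?:\([^)]*\)\s*)?([Vv]alidate\w*)\s*\(/gm;

function findScatteredValidators(sources, primaryFile = "validation.go") {
  const outliers = [];
  for (const [file, text] of Object.entries(sources)) {
    // Functions in the primary validation file are where they belong.
    if (path.basename(file) === primaryFile) continue;
    for (const match of text.matchAll(VALIDATE_FUNC)) {
      // Convert the match offset into a 1-based line number.
      const line = text.slice(0, match.index).split("\n").length;
      outliers.push({ file, name: match[1], line });
    }
  }
  return outliers;
}

// Sample input: validateWorkflow is a made-up name used only for illustration.
const example = findScatteredValidators({
  "pkg/workflow/validation.go":
    "package workflow\n\nfunc validateWorkflow() error { return nil }\n",
  "pkg/workflow/docker.go":
    "package workflow\n\nfunc validateDockerImage(image string) error { return nil }\n",
});
console.log(example);
```

The result only says where a function is declared; whether it is domain-specific (and can stay) or general (and should move) still needs a manual decision.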


2. Exact Duplicate JavaScript Function: sanitizeLabelContent

Issue: The sanitizeLabelContent function appears identically in two JavaScript files.

Duplicate Occurrences:

Occurrence 1: pkg/workflow/js/create_issue.cjs:4-17

function sanitizeLabelContent(content) {
  if (!content || typeof content !== "string") {
    return "";
  }
  let sanitized = content.trim();
  sanitized = sanitized.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "");
  sanitized = sanitized.replace(/\x1b\[[0-9;]*[mGKH]/g, "");
  sanitized = sanitized.replace(
    /(^|[^\w`])@([A-Za-z0-9](?:[A-Za-z0-9-]{0,37}[A-Za-z0-9])?(?:\/[A-Za-z0-9._-]+)?)/g,
    (_m, p1, p2) => `${p1}\`@${p2}\``
  );
  sanitized = sanitized.replace(/[<>&'"]/g, "");
  return sanitized.trim();
}

Occurrence 2: pkg/workflow/js/add_labels.cjs:4-17 (identical implementation)

Code Similarity: 100% identical (14 lines)
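
One simple way to confirm an exact match like this is to fingerprint each function's source after collapsing whitespace. The sketch below is an illustration under stated assumptions, not the detection method used for this report: the `{ file, name, source }` record shape and the helper names `fingerprint` and `findExactDuplicates` are hypothetical, and the whitespace normalization is naive (it would also collapse whitespace inside string literals).

```javascript
// Sketch: flag functions whose source is identical after whitespace normalization.
const { createHash } = require("node:crypto");

// Collapse whitespace runs so formatting differences do not hide a duplicate.
function fingerprint(source) {
  const normalized = source.replace(/\s+/g, " ").trim();
  return createHash("sha256").update(normalized).digest("hex");
}

// `functions` is a list of { file, name, source } records (hypothetical shape).
function findExactDuplicates(functions) {
  const byHash = new Map();
  for (const fn of functions) {
    const key = fingerprint(fn.source);
    if (!byHash.has(key)) byHash.set(key, []);
    byHash.get(key).push(`${fn.file}:${fn.name}`);
  }
  // Keep only fingerprints that occur more than once.
  return [...byHash.values()].filter((group) => group.length > 1);
}

// Two copies that differ only in indentation, plus one unrelated function.
const body = 'function f(x) {\n  return x ? x.trim() : "";\n}';
const duplicates = findExactDuplicates([
  { file: "create_issue.cjs", name: "f", source: body },
  { file: "add_labels.cjs", name: "f", source: body.replace(/\n\s*/g, "\n") },
  { file: "sanitize.cjs", name: "g", source: "function g(x) { return x; }" },
]);
console.log(duplicates);
```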

Recommendation:

  • Extract to shared utility module (e.g., pkg/workflow/js/shared_utils.cjs)
  • Import in both files instead of duplicating
  • Benefits: Single source of truth, easier maintenance, reduced code size

Estimated Impact: Low effort, high maintainability benefit
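
The extraction could look like the following minimal sketch. The file name shared_utils.cjs is the example name from the recommendation above, and the function body is copied unchanged from create_issue.cjs. Whether these scripts can load a sibling module with `require` at run time is an assumption; if they are inlined into generated workflows, a bundling step would be needed instead.

```javascript
// shared_utils.cjs (example file name from the recommendation above)

// Single shared copy of the label sanitizer, unchanged from create_issue.cjs.
function sanitizeLabelContent(content) {
  if (!content || typeof content !== "string") {
    return "";
  }
  let sanitized = content.trim();
  sanitized = sanitized.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]/g, "");
  sanitized = sanitized.replace(/\x1b\[[0-9;]*[mGKH]/g, "");
  sanitized = sanitized.replace(
    /(^|[^\w`])@([A-Za-z0-9](?:[A-Za-z0-9-]{0,37}[A-Za-z0-9])?(?:\/[A-Za-z0-9._-]+)?)/g,
    (_m, p1, p2) => `${p1}\`@${p2}\``
  );
  sanitized = sanitized.replace(/[<>&'"]/g, "");
  return sanitized.trim();
}

module.exports = { sanitizeLabelContent };

// In create_issue.cjs and add_labels.cjs, the local copy would be replaced by:
//   const { sanitizeLabelContent } = require("./shared_utils.cjs");
```

With a single copy, any later change to the sanitization rules applies to both issue creation and label handling at once.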


3. Duplicate Sanitization Functions (Go and JavaScript)

Issue: Multiple sanitization functions exist across different languages and files with similar purposes.

Sanitization Functions Found:

Go Functions:

  • pkg/workflow/strings.go:75 - SanitizeName()
  • pkg/workflow/strings.go:157 - SanitizeWorkflowName()
  • pkg/workflow/workflow_name.go:12 - SanitizeIdentifier()

JavaScript Functions:

  • pkg/workflow/js/sanitize.cjs:14 - sanitizeContent()
  • pkg/workflow/js/parse_firewall_logs.cjs:242 - sanitizeWorkflowName()
  • pkg/workflow/js/create_issue.cjs:4 - sanitizeLabelContent()
  • pkg/workflow/js/add_labels.cjs:4 - sanitizeLabelContent() (duplicate)

Analysis:

  • Go sanitization is reasonably consolidated in strings.go and workflow_name.go
  • JavaScript sanitization is scattered and duplicated
  • Some naming inconsistency (SanitizeIdentifier vs SanitizeWorkflowName)

Recommendation:

  • Go side is acceptable - Keep as is
  • JavaScript side needs consolidation - Create shared sanitization utilities module

Estimated Impact: Medium - Reduces JavaScript code duplication


4. Parsing Functions Distributed Across 15+ Files

Issue: Parsing functions are distributed across many files instead of being centralized in parser-related files or following a clear pattern.

Files with Parse Functions:

  1. pkg/workflow/time_delta.go - Multiple date/time parsing functions
  2. pkg/workflow/comment.go - ParseCommandEvents()
  3. pkg/workflow/dependabot.go - parseNpmPackage(), parsePipPackage(), parseGoPackage()
  4. pkg/workflow/create_discussion.go - parseDiscussionsConfig()
  5. pkg/workflow/create_pr_review_comment.go - parsePullRequestReviewCommentsConfig()
  6. pkg/workflow/threat_detection.go - parseThreatDetectionConfig()
  7. pkg/workflow/expressions.go - Expression parsing
  8. pkg/workflow/frontmatter_extraction.go - Frontmatter parsing
  9. And 7+ more files...

Analysis:

  • Domain-specific parsing (e.g., time parsing in time_delta.go) ✅ Good organization
  • Config parsing (e.g., parseDiscussionsConfig()) ✅ Acceptable in feature files
  • Generic parsing utilities scattered across multiple files ⚠️ Could be improved

Recommendation:

  • Keep domain-specific parsers in their respective files (time, expressions, etc.)
  • Keep config parsers in feature-specific files (create_discussion.go, etc.)
  • ⚠️ Consider extracting common parsing patterns if code duplication is found

Estimated Impact: Low priority - Current organization is mostly acceptable


5. Helper/Utility File Organization

Issue: Helper and utility files exist in both pkg/cli/ and pkg/workflow/ with varying naming conventions.

Current Helper Files:

CLI Package:

  • frontmatter_utils.go - Frontmatter manipulation utilities
  • repeat_utils.go - Retry logic utilities
  • shared_utils.go - General shared utilities

Workflow Package:

  • engine_helpers.go - Engine-specific helper functions
  • prompt_step_helper.go - Prompt step utilities
  • strings.go - String manipulation utilities
  • safe_outputs_env_test_helpers.go - Test helper (appropriately named)

Analysis:

  • Good: Naming convention with _utils and _helpers suffixes
  • Good: Test helpers clearly identified
  • ⚠️ Mixed: Some utilities specific to domain (good), others generic (could be consolidated)

Recommendation:

  • Keep current organization - It's reasonable and follows Go conventions
  • Consider documenting the distinction between "utils" and "helpers" in contribution guidelines
  • Monitor for utility function sprawl in future

Estimated Impact: Very low - Current state is acceptable


Semantic Function Clustering Results

Cluster 1: Validation Functions ⚠️ (Scattered)

Pattern: validate* functions
Total Found: 50+ validation functions
Primary File: validation.go (30+ functions)
Scattered Across: 9 additional files

Analysis: A primary validation file exists, but 20+ validation functions live in 9 other files, which makes the validation logic harder to locate and maintain.


Cluster 2: Sanitization Functions ⚠️ (Partially Consolidated)

Pattern: sanitize* or Sanitize* functions
Total Found: 10+ functions (Go + JavaScript)
Files:

  • Go: strings.go, workflow_name.go (good consolidation)
  • JavaScript: 4+ files with duplicates (needs improvement)

Analysis: Go side is well-organized, JavaScript side has duplicates.


Cluster 3: Parsing Functions ✅ (Acceptable)

Pattern: parse* or Parse* functions
Total Found: 40+ parsing functions
Distribution: Spread across 15+ files based on domain

Analysis: Most parsing functions are appropriately placed in domain-specific files. This is good organization.


Cluster 4: Rendering Functions ✅ (Well Organized)

Pattern: render* or Render* functions
Total Found: 30+ rendering functions
Organization: Test files + engine_helpers.go + specific engine files

Analysis: Rendering logic is appropriately distributed. No consolidation needed.


Cluster 5: Formatting Functions ✅ (Well Organized)

Pattern: format* or Format* functions
Total Found: 20+ formatting functions
Key Files: engine_helpers.go, js.go, permissions_validator.go

Analysis: Formatting functions are reasonably organized by purpose.
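
The name-pattern side of the five clusters above can be approximated with a prefix match. This is a rough sketch under stated assumptions, not the Serena-based analysis itself: the `{ file, name }` record shape and the helper names `clusterByPrefix` and `filesPerCluster` are hypothetical, and the sample records reuse function names cited earlier in this report.

```javascript
// Sketch: group function names into the five clusters by verb prefix.
const CLUSTERS = ["validate", "sanitize", "parse", "render", "format"];

// `functions` is a list of { file, name } records (hypothetical shape).
function clusterByPrefix(functions) {
  const clusters = new Map(CLUSTERS.map((prefix) => [prefix, []]));
  for (const fn of functions) {
    // Lowercase first so exported (Sanitize*) and unexported (sanitize*) names match.
    const lower = fn.name.toLowerCase();
    const prefix = CLUSTERS.find((p) => lower.startsWith(p));
    if (prefix) clusters.get(prefix).push(fn);
  }
  return clusters;
}

// Number of distinct files each cluster touches: a rough "scatter" measure.
function filesPerCluster(clusters) {
  const counts = {};
  for (const [prefix, fns] of clusters) {
    counts[prefix] = new Set(fns.map((fn) => fn.file)).size;
  }
  return counts;
}

const clusters = clusterByPrefix([
  { file: "docker.go", name: "validateDockerImage" },
  { file: "npm.go", name: "validateNpxPackages" },
  { file: "strings.go", name: "SanitizeName" },
  { file: "strings.go", name: "SanitizeWorkflowName" },
  { file: "comment.go", name: "ParseCommandEvents" },
]);
console.log(filesPerCluster(clusters));
```

A high file count for a cluster is only a signal to look closer; as the parsing cluster shows, spreading functions across domain files can be the right organization.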


Refactoring Recommendations

Priority 1: High Impact, Low Effort

1.1 Consolidate Duplicate JavaScript Function

Task: Extract sanitizeLabelContent to shared utility module

Steps:

  1. Create pkg/workflow/js/label_utils.cjs with the sanitization function
  2. Update create_issue.cjs to import from shared module
  3. Update add_labels.cjs to import from shared module
  4. Add tests for the shared function

Estimated Effort: 1-2 hours
Benefits:

  • Eliminates 14 lines of duplicate code
  • Single source of truth for label sanitization
  • Easier to test and maintain

1.2 Review and Document Validation Function Organization

Task: Create guidelines for where validation functions should live

Steps:

  1. Document validation function placement rules in CONTRIBUTING.md:
    • Domain-specific validations → domain files (e.g., Docker validation in docker.go)
    • General workflow validations → validation.go
    • Complex validation logic → consider sub-files
  2. Review the 9 files with scattered validation functions
  3. Move or document exceptions

Estimated Effort: 2-3 hours
Benefits:

  • Clear guidelines for contributors
  • Prevents future validation sprawl
  • Improves code discoverability

Priority 2: Medium Impact, Medium Effort

2.1 Consolidate JavaScript Sanitization Utilities

Task: Create shared JavaScript sanitization module

Steps:

  1. Create pkg/workflow/js/sanitize_shared.cjs
  2. Move sanitizeLabelContent (from Priority 1)
  3. Consider consolidating other JS sanitization functions
  4. Update imports in dependent files
  5. Add comprehensive tests

Estimated Effort: 3-4 hours
Benefits:

  • Centralized JavaScript sanitization logic
  • Reduced duplication
  • Easier to apply consistent sanitization rules

2.2 Consider Validation Sub-Files

Task: Split validation.go if it becomes too large

Approach: Only if validation.go exceeds 1000 lines or has distinct validation domains

Suggested Split (if needed):

  • validation.go - Core workflow validations
  • validation_packages.go - Package validation (npm, pip, etc.)
  • validation_strict_mode.go - Strict mode validations
  • validation_features.go - Repository feature validations

Estimated Effort: 4-6 hours (only if needed)
Benefits:

  • Easier navigation of validation logic
  • Logical grouping of related validations

Priority 3: Long-term Improvements

3.1 Monitor for Utility Function Sprawl

Task: Establish guidelines for when to create new utility files

Guidelines:

  • Functions used in 3+ files → move to utility file
  • Domain-specific utilities → keep in domain file
  • Test helpers → suffix with _test_helpers.go

Estimated Effort: Ongoing code review discipline
Benefits: Prevents future utility sprawl
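
The "used in 3+ files" guideline could be checked mechanically during review. The sketch below is one possible approximation, with assumptions labeled: `sources` (file path to source text) and the helper name `widelyUsed` are hypothetical, a call site is detected by a plain text match on the identifier followed by an opening parenthesis, and the file that declares the function is therefore counted as well.

```javascript
// Sketch: find functions referenced from at least `threshold` distinct files.

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// `sources` maps file path -> source text; `names` lists candidate functions.
function widelyUsed(sources, names, threshold = 3) {
  const result = [];
  for (const name of names) {
    // A call site is approximated as the bare identifier followed by "(".
    const callSite = new RegExp(`\\b${escapeRegExp(name)}\\s*\\(`);
    const files = Object.keys(sources).filter((file) => callSite.test(sources[file]));
    if (files.length >= threshold) result.push({ name, files });
  }
  return result;
}

// Illustrative input; the file names and call sites are made up.
const usage = widelyUsed(
  {
    "a.go": "x := SanitizeName(raw)",
    "b.go": "return SanitizeName(id)",
    "c.go": "name = SanitizeName(title)",
    "d.go": "v := SanitizeIdentifier(s)",
  },
  ["SanitizeName", "SanitizeIdentifier"]
);
console.log(usage.map((entry) => entry.name));
```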


File Organization Assessment

Well-Organized Areas ✅

  1. Engine Architecture: Each engine has its own file (claude, copilot, codex)
  2. Create Operations: Separate files for each creation type (issue, PR, discussion)
  3. String Utilities: Consolidated in strings.go
  4. Test Organization: Clear _test.go suffix convention

Areas for Improvement ⚠️

  1. Validation Functions: Too scattered (9+ files)
  2. JavaScript Duplicates: Exact duplicates exist
  3. Sanitization (JS): Could be more consolidated

Implementation Checklist

  • P1.1: Extract sanitizeLabelContent to shared JS utility
  • P1.2: Document validation function placement guidelines
  • P2.1: Create JavaScript sanitization shared module
  • P2.2: Evaluate if validation.go needs splitting
  • P3.1: Establish utility function placement guidelines
  • Verify no functionality broken after changes
  • Update tests to reflect refactoring
  • Update CONTRIBUTING.md with new guidelines

Analysis Metadata

  • Total Go Files Analyzed: 180 (excluding test files)
  • Total Functions Cataloged: 500+ functions across all files
  • Function Clusters Identified: 5 major clusters (validation, sanitization, parsing, rendering, formatting)
  • Outliers Found: 20+ validation functions outside validation.go
  • Exact Duplicates Detected: at least one JavaScript function duplicated in 2 files (100% match)
  • Near-Duplicates Detected: Multiple sanitization functions with similar purpose
  • Detection Method: Serena semantic code analysis + grep pattern analysis + manual review
  • Analysis Date: 2025-11-04
  • Packages Analyzed: cli (60 files), workflow (108 files), parser (6 files), console (3 files), others (3 files)

Conclusion

The gh-aw codebase demonstrates generally good organization with clear package boundaries and feature-based file structure. The primary opportunities for improvement are:

  1. ⚠️ High Priority: Eliminate JavaScript code duplication (quick win)
  2. ⚠️ Medium Priority: Consolidate scattered validation functions
  3. Low Priority: Current helper organization is acceptable

The refactoring recommendations focus on high-impact, low-effort improvements that will enhance maintainability without requiring extensive restructuring. Most of the codebase follows Go best practices effectively.


Note: This analysis focused on non-test Go files (.go excluding *_test.go) and associated JavaScript files in the pkg/ directory. The findings represent refactoring opportunities discovered through semantic function clustering, naming pattern analysis, and duplicate detection using Serena's code analysis tools.

AI generated by Semantic Function Refactoring
