Merged Functional Analysis and Specifications

Project Overview

The project is a Java-based document validation system built with Maven, Apache PDFBox 3.x, LanguageTool, and optionally Tint NLP. The system verifies document templates (PDF or Word) and compiled PDFs, each containing two contract copies (e.g., Edenred and client). It checks template correctness, ensures compiled documents match the template, verifies consistency between the two contract copies, and generates detailed reports. The template’s first page and all footers are excluded from analysis, and recipient-specific differences between copies are tolerated.

Objectives

  1. Verify Template Correctness: Check grammar, spelling, and paragraph numbering, ignoring placeholders, the first page, and footers.
  2. Verify Contract Copies: Ensure the two contract copies (e.g., Edenred and client) within the template (post-first page) and compiled PDFs are identical except for recipient-specific fields.
  3. Verify Compiled Document Consistency: Confirm compiled PDFs match the template in text, layout, and style, excluding footers.
  4. Generate Reports: Produce a CSV report with precise issue locations and an annotated PDF highlighting problematic words in the content area.

Use Cases

  • PDF Template: Process directly using PDFBox 3.x.
  • Word Template: Convert to PDF, preserving layout, then process.
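
One possible way to handle the Word use case is to call LibreOffice in headless mode and let it write the PDF. The sketch below only builds the command line and predicts the output path. The class name `WordToPdfCommand` is hypothetical, and it assumes a `soffice` executable is available on the PATH; how well the layout is preserved depends on the fonts installed on the machine doing the conversion.

```java
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper: builds a headless LibreOffice call that converts a Word file to PDF. */
final class WordToPdfCommand {

    private WordToPdfCommand() {}

    /** Argument list for: soffice --headless --convert-to pdf --outdir OUTDIR INPUT */
    static List<String> build(Path input, Path outDir) {
        return List.of(
                "soffice",                      // assumption: executable is on the PATH
                "--headless",
                "--convert-to", "pdf",
                "--outdir", outDir.toString(),
                input.toString());
    }

    /** LibreOffice names the output after the input file, with a .pdf extension. */
    static Path expectedOutput(Path input, Path outDir) {
        String name = input.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return outDir.resolve(base + ".pdf");
    }
}
```

The returned list can be passed to `new ProcessBuilder(...)`; the caller should check the exit code and confirm that the expected output file exists before handing it to the PDF pipeline.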

Technologies

  • Java: Core programming language.
  • Maven: Build and dependency management.
  • Apache PDFBox 3.x: PDF processing and text extraction.
  • LanguageTool: Grammar and spelling verification for non-placeholder text.
  • Tint NLP: Optional for advanced linguistic analysis.
  • Allure: HTML reporting.

System Architecture

The system is structured into modular components, drawing from Claude’s detailed architecture while incorporating specific requirements from Grok and OpenAI-4.5:

  1. Document Ingestion and Preprocessing:
    • Tasks:
      • Handle PDF and Word templates.
      • Convert Word to PDF (e.g., using Apache POI or LibreOffice).
      • Exclude the first page of the template (Grok, OpenAI-4.5).
      • Exclude footers (e.g., text whose y-coordinate, measured from the bottom edge of the page, is below 10% of the page height, per Grok).
      • Split documents into two contract copies based on headers (e.g., "COPIA EDENRED", "COPIA CLIENTE" from Grok).
      • Extract text, positions, styles, and paragraph numbers using PDFBox.
      • Identify placeholders with configurable patterns (e.g., [Placeholder] from Gemini/Grok, «Placeholder» from OpenAI-4.5).
    • Purpose: Prepares documents for analysis by normalizing and structuring content.
  2. Validation Engine:
    • Tasks:
      • Check non-placeholder text with LanguageTool for grammar and spelling (all documents).
      • Validate paragraph numbering sequence (all documents).
    • Purpose: Ensures template text and structure are correct, excluding placeholders.
  3. Comparison Engine:
    • Tasks:
      • Compare the two contract copies within the template (post-first page), allowing differences only in recipient-specific fields (Grok, OpenAI-4.5).
      • Compare the two contract copies within the compiled PDF, allowing differences only in recipient-specific fields (all documents).
      • Compare each compiled contract copy with the corresponding template copy, verifying fixed text, styles, and placeholder replacement (all documents).
    • Purpose: Ensures consistency within and across documents.
  4. Reporting Engine:
    • Tasks:
      • Generate a CSV report with columns: Document, Copy (e.g., Edenred/Client), Page, Paragraph, Line, Position, Issue, Word (merged from all).
      • Generate an Allure HTML report with the CSV results attached.
    • Purpose: Provides actionable output for users to review discrepancies.
  5. Annotating Engine:
    • Tasks:
      • Create an annotated PDF for each comparison, highlighting issues in the content area (all documents).
    • Purpose: Lets reviewers locate each reported issue directly in the PDF.
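
The preprocessing tasks of the first component (first-page exclusion, footer exclusion, and splitting by header) can be sketched on already-extracted text lines. This is a minimal illustration, not the project's implementation: the `Preprocessor` class, the `TextLine` record, 1-based page numbers, and a y-coordinate measured from the bottom edge of the page are all assumptions. If the extraction library reports y from the top of the page, the value has to be converted first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical preprocessing sketch operating on already-extracted text lines. */
final class Preprocessor {

    /** One extracted line; page is 1-based, yFromBottom is measured from the bottom edge. */
    record TextLine(int page, float yFromBottom, float pageHeight, String text) {}

    private Preprocessor() {}

    /** Drops the first page when requested and any line in the bottom footerFraction of its page. */
    static List<TextLine> contentLines(List<TextLine> lines, boolean skipFirstPage, float footerFraction) {
        List<TextLine> kept = new ArrayList<>();
        for (TextLine line : lines) {
            boolean onFirstPage = line.page() == 1;
            boolean inFooter = line.yFromBottom() < footerFraction * line.pageHeight();
            if (!(skipFirstPage && onFirstPage) && !inFooter) {
                kept.add(line);
            }
        }
        return kept;
    }

    /** Groups lines under the most recent header seen; lines before any header are ignored. */
    static Map<String, List<TextLine>> splitByHeader(List<TextLine> lines, List<String> headers) {
        Map<String, List<TextLine>> copies = new LinkedHashMap<>();
        String current = null;
        for (TextLine line : lines) {
            String trimmed = line.text().trim();
            if (headers.contains(trimmed)) {
                current = trimmed;
                copies.putIfAbsent(current, new ArrayList<>());
            } else if (current != null) {
                copies.get(current).add(line);
            }
        }
        return copies;
    }
}
```

For a template, `contentLines(lines, true, 0.10f)` applies both exclusions; for a compiled PDF, `skipFirstPage` would be `false`. The header strings are passed in rather than hard-coded so they can stay configurable.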

Functional Requirements

  1. Template Processing:
    • Convert Word templates to PDF.
    • Exclude the first page and footers.
    • Split into Edenred and client copies by headers.
    • Extract text, positions, styles, and paragraph numbers.
    • Identify placeholders using configurable patterns.
    • Check non-placeholder text with LanguageTool.
    • Validate paragraph numbering sequence.
    • Compare Edenred and client copies, allowing differences only in recipient fields.
  2. Compiled Document Processing:
    • Exclude footers from all pages.
    • Split compiled PDF into Edenred and client copies by headers.
    • Extract text, positions, styles, and paragraph numbers.
    • Compare Edenred and client copies within the compiled PDF, allowing differences only in recipient fields.
    • Compare each copy with the corresponding template copy (post-first page), verifying fixed text, styles, and placeholder replacement.
    • Check paragraph numbering consistency.
  3. Document Comparison:
    • Identify recipient-specific fields from placeholders.
    • Ensure copies are identical except for these fields, both within the template (post-first page) and compiled PDF.
  4. Reporting:
    • Generate a CSV report detailing issues with precise locations (e.g., "Spelling error: 'recieve'").
    • Create an annotated PDF highlighting only problematic words in the content area.
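
The "validate paragraph numbering sequence" requirement can be illustrated with a small check over the extracted paragraph texts. The sketch assumes top-level paragraphs start with a number followed by `.` or `)` and a space (for example `3. Payment terms`); sub-numbering such as `2.1` and unnumbered text are simply skipped. The class name `ParagraphNumbering` and the message format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical check that top-level paragraph numbers increase by one without gaps. */
final class ParagraphNumbering {

    // Assumed numbering style: "3. Text" or "3) Text"; "2.1 Text" does not match.
    private static final Pattern NUMBER = Pattern.compile("^\\s*(\\d{1,4})[.)]\\s+");

    private ParagraphNumbering() {}

    /** Returns one message per paragraph whose number breaks the expected sequence. */
    static List<String> check(List<String> paragraphs) {
        List<String> issues = new ArrayList<>();
        int expected = 1;
        for (int i = 0; i < paragraphs.size(); i++) {
            Matcher m = NUMBER.matcher(paragraphs.get(i));
            if (!m.find()) {
                continue; // unnumbered or sub-numbered text is not part of this check
            }
            int actual = Integer.parseInt(m.group(1));
            if (actual != expected) {
                issues.add("Paragraph index " + i + ": expected " + expected + " but found " + actual);
            }
            expected = actual + 1; // resynchronise so one gap yields one issue
        }
        return issues;
    }
}
```

Resynchronising after a mismatch keeps a single missing paragraph from cascading into an issue on every later paragraph.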

Specifications

  • Placeholder Identification: Configurable regex patterns (e.g., \[.*?\] from Gemini/Grok, «.*?» from OpenAI-4.5) to support flexibility across document types.
  • Text Extraction: Use PDFBox for PDFs (word-level positions and styles) and Apache POI for Word documents (all documents).
  • Linguistic Checking: Apply LanguageTool to non-placeholder text (all documents).
  • Comparison Logic: Implement a diff algorithm treating placeholders as wildcards; compare text and styles (Gemini, Claude, Grok).
  • Paragraph and Line Numbering: Parse paragraph numbers from text; determine lines via y-coordinates (all documents).
  • Issue Reporting: CSV with detailed locations (Document, Copy, Page, Paragraph, Line, Position, Issue, Word) (merged from all).
  • Annotated PDF: Highlight problematic words in the content area using PDFBox (all documents).
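
The "placeholders as wildcards" comparison can be approximated by turning a template paragraph into a regular expression: fixed text is quoted literally and each placeholder becomes a capturing wildcard. This is a simplified stand-in for a full diff algorithm and covers text only, not styles. The class name `TemplateMatcher` is hypothetical, and the two placeholder syntaxes (square brackets and guillemets) are assumed defaults that would come from configuration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical text-only comparison that treats template placeholders as wildcards. */
final class TemplateMatcher {

    // Assumed placeholder syntaxes: [Name] and «Name» (\u00AB and \u00BB are the guillemets).
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\[[^\\]]*\\]|\u00AB[^\u00BB]*\u00BB");

    private TemplateMatcher() {}

    /** Builds a regex in which fixed text is literal and each placeholder matches any non-empty text. */
    static Pattern toPattern(String templateText) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(templateText);
        int last = 0;
        while (m.find()) {
            regex.append(Pattern.quote(templateText.substring(last, m.start())));
            regex.append("(.+?)");
            last = m.end();
        }
        regex.append(Pattern.quote(templateText.substring(last)));
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True when the compiled text equals the template except where placeholders were filled in. */
    static boolean matches(String templateText, String compiledText) {
        return toPattern(templateText).matcher(compiledText).matches();
    }

    /** True when the compiled text still contains something that looks like a placeholder. */
    static boolean hasUnreplacedPlaceholder(String compiledText) {
        return PLACEHOLDER.matcher(compiledText).find();
    }
}
```

Because the wildcard also matches a literal `[Name]`, placeholder replacement has to be verified separately, which is what `hasUnreplacedPlaceholder` is for. The same pattern can serve the copy-to-copy comparison if recipient-specific fields are marked as placeholders.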

Assumptions

  • The template’s first page is introductory and excluded from analysis (Grok, OpenAI-4.5).
  • Footers sit at the bottom of the page (e.g., y-coordinate < 10% of page height, measured from the bottom edge) and are excluded (Grok, OpenAI-4.5).
  • Placeholders use consistent, configurable patterns (all documents).
  • Paragraphs are explicitly numbered (all documents).
  • Templates and compiled PDFs contain two contract copies, identifiable by headers (Grok, OpenAI-4.5).
  • Recipient-specific fields correspond to placeholders (all documents).
  • Documents are digitally generated (no OCR needed) (OpenAI-4.5).

Deliverables

  • Java application implementing the specified functionalities.
  • CSV report detailing issues.
  • Annotated PDF highlighting issues in the content area.
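
As an illustration of the CSV deliverable, the sketch below formats one issue as a row using the column order given in the specifications (Document, Copy, Page, Paragraph, Line, Position, Issue, Word). The `IssueCsv` class and `Issue` record are hypothetical names, and the quoting follows the common RFC 4180 convention rather than anything the specification mandates.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical formatter for the issue report; column order follows the specification. */
final class IssueCsv {

    static final String HEADER = "Document,Copy,Page,Paragraph,Line,Position,Issue,Word";

    /** One reported issue; paragraph is a string so values such as "4.2" are allowed. */
    record Issue(String document, String copy, int page, String paragraph,
                 int line, int position, String issue, String word) {}

    private IssueCsv() {}

    /** Quotes a field when it contains a comma, a double quote, or a line break. */
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    /** Formats one issue as a single CSV row without a trailing line break. */
    static String toRow(Issue i) {
        return List.of(i.document(), i.copy(), String.valueOf(i.page()), i.paragraph(),
                        String.valueOf(i.line()), String.valueOf(i.position()),
                        i.issue(), i.word())
                .stream()
                .map(IssueCsv::escape)
                .collect(Collectors.joining(","));
    }
}
```

Writing `HEADER` followed by one `toRow` result per issue yields a file that spreadsheet tools can open directly; the same rows could also be attached to the HTML report.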

Additional Considerations

  • Error Handling: Gracefully handle malformed documents and processing failures.
  • Security: Ensure document content remains secure during processing (Claude).
  • Testing: Include unit tests for components, integration tests for workflows, and acceptance tests for overall functionality (Claude).