Update PG16 prior to 1.7.0 part 2 #2375

Merged
jrgemignani merged 6 commits into apache:PG16 from MuhammadTahaNaveed:PG16_1_7_0_remaining
Apr 8, 2026

@MuhammadTahaNaveed
Member

The following commits are cherry-picked from master. This PR is an extension of PR #2358.

b29ca5e Replace libcsv with pg COPY for csv loading (#2310)
a1f472d Fix and improve index.sql addendum (#2301)
7beb653 Fix and improve index.sql regression test coverage (#2300)
48fca83 Restrict age_load commands (#2274)
0ea9464 Fix possible memory and file descriptors leaks (#2258)
5aed9ec Add index on id columns (#2117)

Please use Rebase and Merge for this PR to maintain commit history.

MuhammadTahaNaveed and others added 6 commits April 8, 2026 17:40
- Whenever a label is created, indices on its id columns are now
  created by default. For a vertex label, a unique index is created on
  the id column, which also serves as a unique constraint. For an edge
  label, a non-unique index is created on the start_id and end_id
  columns.

- This change is expected to improve the performance of queries that
  involve joins. In some performance tests, query performance improved
  considerably.

- The loader was updated to insert tuples into the indices as well. This
  slows the loader down somewhat, but it was necessary.

- A bug related to command ids in cypher_delete executor was also fixed.
- Used postgres memory allocation functions instead of standard ones.
- Wrapped main loop of csv loader in PG_TRY block for better error handling.
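
PostgreSQL's PG_TRY/PG_CATCH macros are built on sigsetjmp/siglongjmp, so the cleanup pattern described above can be sketched in plain C with setjmp/longjmp. This is only an illustration and not the loader's code: `Resource`, `process_row`, and `load_rows` are hypothetical names, and where the real loader would re-throw the error after cleanup, this sketch simply returns false.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a resource such as an open file descriptor. */
typedef struct Resource
{
    bool open;
} Resource;

/* Plays the role of PostgreSQL's exception stack in this sketch. */
static jmp_buf error_jump;

/* Hypothetical row processor that "raises" an error on one chosen row. */
static void process_row(size_t row, size_t bad_row)
{
    if (row == bad_row)
        longjmp(error_jump, 1);     /* loosely analogous to ereport(ERROR) */
}

/*
 * Process nrows rows. Returns true on success and false if a row failed.
 * In both cases the resource is released before returning, which is the
 * point of wrapping the main loop in a try block.
 */
bool load_rows(Resource *res, size_t nrows, size_t bad_row)
{
    volatile bool ok = true;

    assert(res != NULL);
    res->open = true;               /* acquire */

    if (setjmp(error_jump) == 0)
    {
        /* "try" branch: the main loop */
        for (size_t row = 0; row < nrows; row++)
            process_row(row, bad_row);
    }
    else
    {
        /* "catch" branch: reached only through longjmp */
        ok = false;
    }

    res->open = false;              /* cleanup runs on both paths */
    return ok;
}
```

Without the guard, an error raised in the middle of the loop would skip the cleanup line and leave the resource open.
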
This PR applies restrictions to the following age_load commands:

    load_labels_from_file()
    load_edges_from_file()

They are now tied to a specific root directory and are required to have a
specific file extension to eliminate any attempts to force them to access
any other files.

Nothing else has changed with the actual command formats or parameters,
only that they work out of the /tmp/age directory and only access files
with an extension of .csv.
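
The restriction described above can be sketched as a purely lexical path check. This is a minimal illustration under stated assumptions, not the code in age_load.c: the function name `is_allowed_load_path` and the macros are hypothetical, the check assumes it receives a full path, and it does not resolve symlinks, which a real sandbox may also need to consider.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define LOAD_ROOT "/tmp/age/"   /* root directory named in the commit message */
#define LOAD_EXT  ".csv"        /* required file extension */

/*
 * Hypothetical helper: returns true only for paths that sit under
 * LOAD_ROOT, end in LOAD_EXT, and contain no ".." path component.
 */
bool is_allowed_load_path(const char *path)
{
    size_t root_len = strlen(LOAD_ROOT);
    size_t ext_len = strlen(LOAD_EXT);
    size_t len;

    assert(path != NULL);
    len = strlen(path);

    /* There must be at least one character of file name. */
    if (len <= root_len + ext_len)
        return false;

    /* Must start with the root directory, including the trailing slash. */
    if (strncmp(path, LOAD_ROOT, root_len) != 0)
        return false;

    /* Must end with the required extension. */
    if (strcmp(path + len - ext_len, LOAD_EXT) != 0)
        return false;

    /* Reject any ".." component that could climb out of the root. */
    for (const char *p = path; (p = strstr(p, "..")) != NULL; p++)
    {
        bool at_start = (p == path) || (p[-1] == '/');
        bool at_end = (p[2] == '/') || (p[2] == '\0');

        if (at_start && at_end)
            return false;
    }

    return true;
}
```

Under these assumptions, `/tmp/age/cities.csv` is accepted, while `/etc/passwd`, `/tmp/age/cities.txt`, and `/tmp/age/../etc/x.csv` are rejected.
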

Added regression tests and updated the location of the csv files for
those regression tests.

modified:   regress/expected/age_load.out
modified:   regress/sql/age_load.sql
modified:   src/backend/utils/load/age_load.c
NOTE: This PR was created with AI tools and a human.

- Remove unused copy command (leftover from deleted agload_test_graph test)
- Replace broken Section 4 that referenced non-existent graph with
  comprehensive WHERE clause tests covering string, int, bool, and float
  properties with AND/OR/NOT operators
- Add EXPLAIN tests to verify index usage:
  - Section 3: Validate GIN indices (load_city_gin_idx, load_country_gin_idx)
    show Bitmap Index Scan for property matching
  - Section 4: Validate all expression indices (city_country_code_idx,
    city_id_idx, city_west_coast_idx, country_life_exp_idx) show Index Scan
    for WHERE clause filtering

All indices now have EXPLAIN verification confirming they are used as expected.

modified:   regress/expected/index.out
modified:   regress/sql/index.sql
NOTE: This PR was created with the help of AI tools and a human.

Added the additional regression tests that were requested:

- EXPLAIN for a pattern with a WHERE clause
- EXPLAIN for a pattern with filters on both country and city

modified:   regress/expected/index.out
modified:   regress/sql/index.sql
- Commit also adds permission checks
- Resolves a critical memory spike issue when loading large files
- Use pg's COPY infrastructure (BeginCopyFrom, NextCopyFromRawFields)
  for 64KB buffered CSV parsing instead of libcsv
- Add byte based flush threshold (64KB) matching COPY behavior for memory safety
- Use heap_multi_insert with BulkInsertState for optimized batch inserts
- Add per batch memory context to prevent memory growth during large loads
- Remove libcsv dependency (libcsv.c, csv.h)
- Improves loading performance by 15-20%
- No previous regression tests were impacted
- Added regression tests for permissions and RLS
Assisted-by AI
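
The byte-based flush threshold mentioned in the list above can be sketched as a small accumulator that flushes a batch once it holds too many bytes or too many rows. This is an illustration only: `Batch`, `batch_add`, and `batch_flush` are hypothetical names, the 64KB byte limit comes from the commit message, and the 1000-row limit is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX_BYTES (64 * 1024) /* byte threshold from the commit message */
#define BATCH_MAX_ROWS  1000        /* assumed row limit, for illustration */

/* Hypothetical bookkeeping for one batch of buffered rows. */
typedef struct Batch
{
    size_t rows;     /* rows currently buffered */
    size_t bytes;    /* bytes currently buffered */
    size_t flushes;  /* number of flushes performed so far */
} Batch;

/*
 * Flush the batch. A real loader would insert the buffered tuples and
 * reset its per-batch memory here; the sketch only resets the counters.
 */
void batch_flush(Batch *b)
{
    assert(b != NULL);
    if (b->rows == 0)
        return;                     /* nothing buffered, nothing to do */

    b->rows = 0;
    b->bytes = 0;
    b->flushes++;
}

/*
 * Buffer one row of row_bytes bytes. Returns true if adding the row
 * pushed the batch over a threshold and triggered a flush.
 */
bool batch_add(Batch *b, size_t row_bytes)
{
    assert(b != NULL);
    b->rows++;
    b->bytes += row_bytes;

    if (b->rows >= BATCH_MAX_ROWS || b->bytes >= BATCH_MAX_BYTES)
    {
        batch_flush(b);
        return true;
    }
    return false;
}
```

A caller would invoke `batch_add` once per parsed row and call `batch_flush` once more after the last row to write out any remainder. Bounding the batch by bytes as well as by row count keeps memory use flat even when individual rows are large.
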

Copilot AI left a comment

Pull request overview

This PR cherry-picks a set of loader/indexing/security changes onto the PG16 branch ahead of the 1.7.0 release, including migrating CSV loading to PostgreSQL’s COPY infrastructure, tightening file access, and improving index-related behavior and regression coverage.

Changes:

  • Replace the libcsv-based loader with COPY-based CSV parsing and batch insertion, plus sandboxing (/tmp/age) and privilege/RLS checks.
  • Create default indexes on label id columns (and edge start/end ids) to improve join performance, and expand index regression assertions.
  • Misc. memory/FD safety improvements (e.g., replacing strdup/free patterns with pstrdup/pnstrdup, adding repalloc helper), plus regression output updates reflecting new plan/order behavior.

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 6 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/include/utils/load/csv.h | Removes libcsv header (dependency cleanup). |
| src/backend/utils/load/libcsv.c | Removes libcsv implementation. |
| Makefile | Drops libcsv object from build. |
| src/include/utils/load/age_load.h | Updates loader API/types for COPY-based batching and buffering. |
| src/include/utils/load/ag_load_labels.h | Updates vertex CSV loader interface/docs for COPY-based load. |
| src/include/utils/load/ag_load_edges.h | Updates edge CSV loader interface/docs for COPY-based load. |
| src/backend/utils/load/age_load.c | Adds sandboxing/permission/RLS checks; adds batch insert helpers; updates insert paths for indexes. |
| src/backend/utils/load/ag_load_labels.c | Reimplements vertex CSV loading via COPY raw-field parsing + batch inserts. |
| src/backend/utils/load/ag_load_edges.c | Reimplements edge CSV loading via COPY raw-field parsing + batch inserts. |
| src/include/utils/agtype.h | Adds repalloc_check declaration. |
| src/backend/utils/adt/agtype.c | Implements repalloc_check; replaces strdup/strndup with pstrdup/pnstrdup and removes corresponding frees. |
| src/backend/utils/adt/age_global_graph.c | Switches to pnstrdup and removes manual free. |
| src/backend/executor/cypher_delete.c | Updates command id fields after delete to keep executor state consistent. |
| src/backend/commands/label_commands.c | Adds automatic indexes on id/start_id/end_id at label creation; adjusts vertex id constraints. |
| regress/sql/index.sql | Updates index tests and adds EXPLAIN assertions for index usage. |
| regress/expected/index.out | Expected output updates for index tests. |
| regress/sql/age_load.sql | Updates load tests for /tmp/age sandbox and adds permission/RLS/constraint scenarios. |
| regress/expected/age_load.out | Expected output updates for new sandbox/security behavior. |
| regress/expected/map_projection.out | Expected output changes reflecting different row order. |
| regress/expected/graph_generation.out | Expected output ordering updates. |
| regress/expected/expr.out | Expected output ordering updates. |
| regress/expected/cypher_vle.out | Expected output ordering updates. |
| regress/expected/cypher_merge.out | Expected output ordering updates. |
| regress/expected/cypher_match.out | Expected output ordering updates. |

@jrgemignani merged commit 8c74fd2 into apache:PG16 on Apr 8, 2026
10 checks passed