Reuse thread-local IODebugContext in BlockFetcher to avoid per-read mutex construction #14531
damahua wants to merge 1 commit into facebook:main
Conversation
Summary:
BlockFetcher constructs an IODebugContext on the stack for every block
read (3 call sites). IODebugContext contains a std::shared_mutex, a
std::map, and a std::any, all of which are constructed and then
destructed without being used — PosixRandomAccessFile::Read ignores the
IODebugContext parameter entirely.
On every block read this means:
- pthread_rwlock_init + pthread_rwlock_destroy (from std::shared_mutex)
- std::map default construction + destruction
- std::any default construction + destruction
This change introduces a thread-local IODebugContext that is reset
cheaply between uses, eliminating the repeated construction and
destruction of synchronization primitives on the hot read path.
Benchmark results (db_bench readrandom, all data cached, 30-60s
duration, warmup run discarded):
macOS arm64 (Apple Silicon, 16 cores), N=3:
Baseline: 4,628,149 ops/sec (stddev 86K)
Optimized: 4,789,664 ops/sec (stddev 35K)
Delta: +3.49% (non-overlapping distributions)
Linux x86_64 (Ubuntu 24.04, 4 cores, jemalloc), N=3:
Baseline: 1,437,385 ops/sec (stddev 12.7K)
Optimized: 1,458,909 ops/sec (stddev 10.8K)
Delta: +1.50% (non-overlapping distributions)
The improvement is larger on macOS because _tlv_get_addr (macOS TLS)
and pthread_rwlock_init are more expensive there than on Linux.
Reproduction:
# Build
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DWITH_SNAPPY=1 -DWITH_LZ4=1 \
-DWITH_ZSTD=1 -DWITH_GFLAGS=1 -DPORTABLE=1 -DWITH_TESTS=0 \
-DWITH_BENCHMARK_TOOLS=1
make -j$(nproc) db_bench
# Populate
./db_bench --db=/tmp/rocksdb_bench --benchmarks=fillrandom \
--num=1000000 --value_size=100 --threads=1 --compression_type=snappy
# Warmup (discard)
./db_bench --db=/tmp/rocksdb_bench --benchmarks=readrandom \
--duration=10 --value_size=100 --threads=16 --use_existing_db \
--cache_size=536870912
# Measure (repeat 3x)
./db_bench --db=/tmp/rocksdb_bench --benchmarks=readrandom \
--duration=30 --value_size=100 --threads=16 --use_existing_db \
--cache_size=536870912
Summary
`BlockFetcher` constructs an `IODebugContext` on the stack for every block read (3 call sites in `block_fetcher.cc`). `IODebugContext` contains a `std::shared_mutex`, a `std::map<std::string, uint64_t>`, and a `std::any`, all constructed and destructed on every read, even though `PosixRandomAccessFile::Read` ignores the `IODebugContext` parameter entirely (the parameter name is commented out: `IODebugContext* /*dbg*/`). This creates unnecessary overhead on the hot read path:

- `pthread_rwlock_init` + `pthread_rwlock_destroy` per block read (from `std::shared_mutex`)
- `std::map` + `std::any` default construction/destruction per block read

This PR introduces a thread-local `IODebugContext` with a lightweight reset between uses, eliminating ~131K unnecessary mutex construction/destruction cycles per second during cached reads.

Benchmark Results

db_bench readrandom — all data cached (CPU-bound), warmup run discarded, N=3:

| Platform | Baseline (ops/sec) | Optimized (ops/sec) | Delta |
|---|---|---|---|
| macOS arm64 (Apple Silicon, 16 cores) | 4,628,149 (stddev 86K) | 4,789,664 (stddev 35K) | +3.49% |
| Linux x86_64 (Ubuntu 24.04, 4 cores, jemalloc) | 1,437,385 (stddev 12.7K) | 1,458,909 (stddev 10.8K) | +1.50% |
The improvement is larger on macOS because `_tlv_get_addr` (macOS TLS access) and `pthread_rwlock_init`/`destroy` are more expensive than their Linux equivalents. `readwhilewriting` (I/O-bound) showed +0.55%; the improvement is masked by disk I/O, as expected.
Profiling Evidence

A CPU profile (macOS `sample`, 10s) showed `IODebugContext::~IODebugContext()` → `pthread_cond_destroy` and `pthread_mutex_destroy` in the block read call stack. Custom instrumentation confirmed ~484K cache misses per second, each triggering a `BlockFetcher::ReadBlock` with a fresh `IODebugContext` construction.

Memory allocation tracing (`MallocStackLogging` + `malloc_history -callTree`) showed 65,657 block reads in 15 seconds, each constructing and destructing `IODebugContext`.

Reproduction

Same commands as in the commit message above.
Notes
- `IODebugContext` is reset (fields cleared) before each use, preserving correctness for any `FileSystem` implementation that does read `IODebugContext`.
- There are other `IODebugContext dbg;` stack constructions across the codebase (mostly in `composite_env.cc`). This PR focuses on the 3 in `block_fetcher.cc`, which are on the hottest read path. The pattern could be extended to other hot paths if the approach is accepted.