binutils-gdb/gdbsupport/parallel-for.h
Simon Marchi a01cb764bd gdbsupport: use dynamic partitioning in gdb::parallel_for_each
gdb::parallel_for_each uses static partitioning of the workload, meaning
that each worker thread receives a similar number of work items.  Change
it to use dynamic partitioning, where worker threads pull work items
from a shared work queue when they need to.
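
As a minimal sketch of what "pulling work from a shared queue" can look
like when the queue is just a shared cursor: the helper below claims up
to batch_size items from the index range [0, total) with a
compare-and-swap loop.  The name claim_batch and the use of plain
indices instead of iterators are assumptions made for illustration, not
part of the patch.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>

/* Hypothetical helper: claim up to BATCH_SIZE items from [0, TOTAL).
   Return the claimed half-open range, or an empty range (first ==
   second) once every item has been handed out.  */
std::pair<std::size_t, std::size_t>
claim_batch (std::atomic<std::size_t> &next, std::size_t total,
             std::size_t batch_size)
{
  assert (batch_size > 0);

  /* Snapshot of the shared cursor.  */
  std::size_t begin = next.load ();

  for (;;)
    {
      assert (begin <= total);

      if (begin == total)
        return {total, total};

      const std::size_t end = begin + std::min (batch_size, total - begin);

      /* If another thread moved NEXT in the meantime (or the weak CAS
         failed spuriously), BEGIN is refreshed with the current value
         of NEXT and we retry.  */
      if (next.compare_exchange_weak (begin, end))
        return {begin, end};
    }
}
```

Each worker thread would call such a helper in a loop and stop when it
gets an empty range back, so a thread that finishes a cheap batch early
simply comes back for more.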

Note that gdb::parallel_for_each is currently only used for processing
minimal symbols in GDB.  I am looking at improving the startup
performance of GDB, of which minimal symbol processing is one step.

With static partitioning, there is a risk of workload imbalance if some
threads receive "easier" work than others.  Some threads sit idle while
others finish working on their share of the work.  This is not
desirable, because a gdb::parallel_for_each call takes as long as its
slowest thread.

When loading a file with a lot of minimal symbols (~600k) in GDB, with
"maint set per-command time on", I observe some imbalance:

    Time for "minsyms install worker": wall 0.732, user 0.550, sys 0.041, user+sys 0.591, 80.7 % CPU
    Time for "minsyms install worker": wall 0.881, user 0.722, sys 0.071, user+sys 0.793, 90.0 % CPU
    Time for "minsyms install worker": wall 2.107, user 1.804, sys 0.147, user+sys 1.951, 92.6 % CPU
    Time for "minsyms install worker": wall 2.351, user 2.003, sys 0.151, user+sys 2.154, 91.6 % CPU
    Time for "minsyms install worker": wall 2.611, user 2.322, sys 0.235, user+sys 2.557, 97.9 % CPU
    Time for "minsyms install worker": wall 3.074, user 2.729, sys 0.203, user+sys 2.932, 95.4 % CPU
    Time for "minsyms install worker": wall 3.486, user 3.074, sys 0.260, user+sys 3.334, 95.6 % CPU
    Time for "minsyms install worker": wall 3.927, user 3.475, sys 0.336, user+sys 3.811, 97.0 % CPU
                                              ^
                                          ----´

The fastest thread took 0.732 seconds to complete its work (and then sat
still), while the slowest took 3.927 seconds.  This means the
parallel_for_each took a bit less than 4 seconds.

Even if the number of minimal symbols assigned to each worker is the
same, I suppose that some symbols (e.g. those that need demangling) take
longer to process, which could explain the imbalance.
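
To illustrate the imbalance argument, here is a small deterministic
model (not GDB code, and not how the numbers above were obtained).  It
assumes a hypothetical vector where cost[i] is the time needed to
process item i, and it ignores synchronization overhead entirely.  The
function names static_makespan and dynamic_makespan are made up for this
sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

/* Static partitioning: each worker gets one contiguous slice holding a
   similar number of items.  Return the time taken by the slowest
   worker.  */
double
static_makespan (const std::vector<double> &cost, std::size_t n_workers)
{
  assert (n_workers > 0);

  const std::size_t n = cost.size ();
  double slowest = 0;

  for (std::size_t w = 0; w < n_workers; ++w)
    {
      const std::size_t begin = n * w / n_workers;
      const std::size_t end = n * (w + 1) / n_workers;
      double total = 0;

      for (std::size_t i = begin; i < end; ++i)
        total += cost[i];

      slowest = std::max (slowest, total);
    }

  return slowest;
}

/* Dynamic partitioning: batches are handed out in order, each one to
   whichever worker becomes idle first.  Return the time at which the
   last worker finishes.  */
double
dynamic_makespan (const std::vector<double> &cost, std::size_t n_workers,
                  std::size_t batch_size)
{
  assert (n_workers > 0 && batch_size > 0);

  std::vector<double> busy_until (n_workers, 0.0);

  for (std::size_t begin = 0; begin < cost.size (); begin += batch_size)
    {
      const std::size_t end = std::min (begin + batch_size, cost.size ());
      auto idle = std::min_element (busy_until.begin (), busy_until.end ());

      for (std::size_t i = begin; i < end; ++i)
        *idle += cost[i];
    }

  return *std::max_element (busy_until.begin (), busy_until.end ());
}
```

With costs {1, 1, 1, 1, 3, 3, 3, 3} and two workers, the static split
finishes at time 12 (one worker gets all the expensive items), while
the dynamic version with a batch size of 1 finishes at time 8.  With a
batch size of 8, the dynamic version degrades to time 16, since a
single worker ends up with everything.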

With this patch, things are much more balanced:

    Time for "minsym install worker": wall 2.807, user 2.222, sys 0.144, user+sys 2.366, 84.3 % CPU
    Time for "minsym install worker": wall 2.808, user 2.073, sys 0.131, user+sys 2.204, 78.5 % CPU
    Time for "minsym install worker": wall 2.804, user 1.994, sys 0.151, user+sys 2.145, 76.5 % CPU
    Time for "minsym install worker": wall 2.808, user 1.977, sys 0.135, user+sys 2.112, 75.2 % CPU
    Time for "minsym install worker": wall 2.808, user 2.061, sys 0.142, user+sys 2.203, 78.5 % CPU
    Time for "minsym install worker": wall 2.809, user 2.012, sys 0.146, user+sys 2.158, 76.8 % CPU
    Time for "minsym install worker": wall 2.809, user 2.178, sys 0.137, user+sys 2.315, 82.4 % CPU
    Time for "minsym install worker": wall 2.820, user 2.141, sys 0.170, user+sys 2.311, 82.0 % CPU
                                              ^
                                          ----´

In this version, the parallel_for_each took about 2.8 seconds,
representing a reduction of ~1.2 seconds for this step.  Not
life-changing, but it's still good I think.

Note that this patch helps when loading big programs.  My go-to test
program for this is telegram-desktop that I built from source.  For
small programs (including loading gdb itself), it makes no perceptible
difference.

Now the technical bits:

 - One impact that this change has on the minimal symbol processing
   specifically is that not all calls to compute_and_set_names (a
   critical region guarded by a mutex) are done at the end of each
   worker thread's task anymore.

   Before this patch, each thread would compute the names and hash values for
   all the minimal symbols it has been assigned, and then would call
   compute_and_set_names for all of them, while holding the mutex (thus
   preventing other threads from doing this same step).

   With the shared work queue approach, each thread grabs a batch of
   minimal symbols, computes the names and hash values for them, and
   then calls compute_and_set_names (with the mutex held) for this batch
   only.  It then repeats that until the work queue is empty.

   There are therefore more small and spread out compute_and_set_names
   critical sections, instead of just one per worker thread at the end.
   Given that before this patch the work was not well balanced among worker
   threads, I guess that threads would enter that critical region at
   different times, causing little contention.

   In the "with this patch" results, the CPU utilization numbers are not
   as good, suggesting that there is some contention.  But I don't know
   if it's contention due to the compute_and_set_names critical section
   or the shared work queue critical section.  That can be investigated
   later.  In any case, what ultimately counts is the wall time, which
   improves.

 - One choice I had to make was to decide how many work items (in this
   case minimal symbols) each worker should pop when getting work from
   the shared queue.  The general wisdom is that:

   - if workers pop too few items, the synchronization overhead becomes
     significant, and the total processing time increases
   - if workers pop too many items, we get some imbalance back, and the
     total processing time increases again

   I experimented using a dynamic batch size proportional to the number
   of remaining work items.  It worked well in some cases but not
   always.  So I decided to keep it simple, with a fixed batch size.
   That can always be tweaked later.

 - I want to still be able to use scoped_time_it to measure the time
   that each worker thread spent working on the task.  I find it really
   handy when measuring the performance impact of changes.

   Unfortunately, the current interface of gdb::parallel_for_each, which
   receives a simple callback, is not well-suited for that, once I
   introduce the dynamic partitioning.  The callback would get called
   once for each work item batch (multiple times for each worker thread),
   so it's not possible to maintain a per-worker thread object for the
   duration of the parallel for.

   To allow this, I changed gdb::parallel_for_each to receive a worker
   type as a template parameter.  Each worker thread creates one local
   instance of that type, and calls operator() on it for each work item
   batch.  By having a scoped_time_it object as a field of that worker,
   we can get the timings per worker thread.

   The drawback of this approach is that we must now define the
   parallel task in a separate class and manually capture any context we
   need as fields of that class.
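
The points above can be sketched in a simplified, standalone form.  This
is not the GDB implementation: it uses std::thread instead of GDB's
thread pool, takes the batch size as a run-time argument, and hands out
batches with fetch_add rather than a compare-and-swap loop.  The names
sum_worker and run_batches are hypothetical.  What it tries to show is
one worker object per thread, operator() called once per batch, a short
critical section per batch, and per-worker state that survives across
batches.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

/* Hypothetical worker type.  One instance lives on each worker thread
   for the whole loop, so its fields persist across batches.  The
   context it needs is captured manually as fields.  */
struct sum_worker
{
  sum_worker (long &total, long &n_batches, std::mutex &lock)
    : m_total (total), m_n_batches (n_batches), m_lock (lock)
  {}

  /* Publish the per-worker state once, when the worker goes away.  */
  ~sum_worker ()
  {
    std::lock_guard<std::mutex> guard (m_lock);
    m_n_batches += m_my_batches;
  }

  /* Called once per batch.  */
  void operator() (const int *first, const int *last)
  {
    /* Unsynchronized part of the work.  */
    long batch_sum = 0;
    for (; first != last; ++first)
      batch_sum += *first;
    ++m_my_batches;

    /* Short critical section, entered once per batch.  */
    std::lock_guard<std::mutex> guard (m_lock);
    m_total += batch_sum;
  }

private:
  long &m_total;
  long &m_n_batches;
  std::mutex &m_lock;
  long m_my_batches = 0;
};

/* Simplified driver: N_THREADS threads each build one Worker from ARGS
   and feed it batches until the range is exhausted.  */
template<typename Worker, typename... Args>
void
run_batches (const int *first, const int *last, std::size_t batch_size,
             unsigned n_threads, Args &...args)
{
  assert (first <= last && batch_size > 0 && n_threads > 0);

  const std::size_t total = static_cast<std::size_t> (last - first);
  std::atomic<std::size_t> next {0};

  auto task = [&] ()
    {
      Worker worker (args...);

      for (;;)
        {
          const std::size_t begin = next.fetch_add (batch_size);
          if (begin >= total)
            break;

          const std::size_t end = std::min (begin + batch_size, total);
          worker (first + begin, first + end);
        }
    };

  std::vector<std::thread> threads;
  for (unsigned i = 0; i < n_threads; ++i)
    threads.emplace_back (task);
  for (auto &t : threads)
    t.join ();
}
```

For example, summing 1000 ones with a batch size of 64 on 4 threads
gives a total of 1000 over 16 batches, regardless of how the batches end
up distributed among the threads.  A timing object held as a field of
the worker would behave like m_my_batches here: created once per thread
and reported when the worker is destroyed.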

Change-Id: Ibf1fea65c91f76a95b9ed8f706fd6fa5ef52d9cf
Approved-By: Tom Tromey <tom@tromey.com>
2025-09-30 19:37:20 +00:00


/* Parallel for loops

   Copyright (C) 2019-2025 Free Software Foundation, Inc.

   This file is part of GDB.

   This program is free software; you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation; either version 3 of the License, or
   (at your option) any later version.

   This program is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */

#ifndef GDBSUPPORT_PARALLEL_FOR_H
#define GDBSUPPORT_PARALLEL_FOR_H

#include <algorithm>
#include <atomic>
#include <tuple>
#include "gdbsupport/thread-pool.h"

namespace gdb
{

/* A "parallel-for" implementation using a shared work queue.  Work items get
   popped in batches of size up to BATCH_SIZE from the queue and handed out to
   worker threads.

   Each worker thread instantiates an object of type Worker, forwarding ARGS to
   its constructor.  The Worker object can be used to keep some per-worker
   thread state.

   Worker threads call Worker::operator() repeatedly until the queue is
   empty.  */

template<std::size_t batch_size, class RandomIt, class Worker,
         class... WorkerArgs>
void
parallel_for_each (const RandomIt first, const RandomIt last,
                   WorkerArgs &&...worker_args)
{
  /* If enabled, print debug info about how the work is distributed across
     the threads.  */
  const bool parallel_for_each_debug = false;

  gdb_assert (first <= last);

  if (parallel_for_each_debug)
    {
      debug_printf ("Parallel for: n elements: %zu\n",
                    static_cast<std::size_t> (last - first));
      debug_printf ("Parallel for: batch size: %zu\n", batch_size);
    }

  const size_t n_worker_threads
    = std::max<size_t> (thread_pool::g_thread_pool->thread_count (), 1);
  std::vector<gdb::future<void>> results;

  /* The next item to hand out.  */
  std::atomic<RandomIt> next = first;

  /* The worker thread task.

     We need to capture args as a tuple, because it's not possible to capture
     the parameter pack directly in C++17.  Once we migrate to C++20, the
     capture can be simplified to:

       ... args = std::forward<Args>(args)

     and `args` can be used as-is in the lambda.  */
  auto args_tuple
    = std::forward_as_tuple (std::forward<WorkerArgs> (worker_args)...);
  auto task = [&next, first, last, n_worker_threads, &args_tuple] ()
    {
      /* Instantiate the user-defined worker.  */
      auto worker = std::make_from_tuple<Worker> (args_tuple);

      for (;;)
        {
          /* Grab a snapshot of NEXT.  */
          auto local_next = next.load ();
          gdb_assert (local_next <= last);

          /* Number of remaining items.  */
          auto n_remaining = last - local_next;
          gdb_assert (n_remaining >= 0);

          /* Are we done?  */
          if (n_remaining == 0)
            break;

          const auto this_batch_size
            = std::min<std::size_t> (batch_size, n_remaining);

          /* The range to process in this iteration.  */
          const auto this_batch_first = local_next;
          const auto this_batch_last = local_next + this_batch_size;

          /* Update NEXT.  If the current value of NEXT doesn't match
             LOCAL_NEXT, it means another thread updated it concurrently,
             restart.  */
          if (!next.compare_exchange_weak (local_next, this_batch_last))
            continue;

          if (parallel_for_each_debug)
            debug_printf ("Processing %zu items, range [%zu, %zu[\n",
                          this_batch_size,
                          static_cast<size_t> (this_batch_first - first),
                          static_cast<size_t> (this_batch_last - first));

          worker (this_batch_first, this_batch_last);
        }
    };

  /* Start N_WORKER_THREADS tasks.  */
  for (int i = 0; i < n_worker_threads; ++i)
    results.push_back (gdb::thread_pool::g_thread_pool->post_task (task));

  /* Wait for all of them to be finished.  */
  for (auto &fut : results)
    fut.get ();
}

/* A sequential drop-in replacement of parallel_for_each.  This can be useful
   when debugging multi-threading behavior, and you want to limit
   multi-threading in a fine-grained way.  */

template<class RandomIt, class Worker, class... WorkerArgs>
void
sequential_for_each (RandomIt first, RandomIt last, WorkerArgs &&...worker_args)
{
  if (first == last)
    return;

  Worker (std::forward<WorkerArgs> (worker_args)...) (first, last);
}

}

#endif /* GDBSUPPORT_PARALLEL_FOR_H */