An OpenMP Parallel Data Access Pattern in Faiss


This is a quick note on how to use OpenMP, or really any multithreading library, to divide up the underlying data. Let's use some code snippets from Faiss.

Usually you would use an omp parallel for, or an omp parallel followed by an omp for, to split a sequence across threads. For example:

        int nt = std::min(omp_get_max_threads(), int(n));

#pragma omp parallel for if (nt > 1)
        for (idx_t slice = 0; slice < nt; slice++) {
            IndexIVFStats local_stats;
            idx_t i0 = n * slice / nt;
            idx_t i1 = n * (slice + 1) / nt;
            // ... each thread then processes its own slice [i0, i1) ...
        }

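For contrast, here is a minimal, self-contained sketch of that slice-based pattern (this is not Faiss code; parallel_sum, data, and partial are made-up names for illustration). Each thread owns one contiguous slice [i0, i1) and writes only its own slot of a per-slice result array, so no locking is needed beyond the implicit barrier at the end of the loop:

#include <omp.h>

#include <algorithm>
#include <cstdint>
#include <vector>

double parallel_sum(const std::vector<float>& data) {
    using idx_t = int64_t;
    idx_t n = idx_t(data.size());
    int nt = std::min(omp_get_max_threads(), int(n));
    if (nt < 1) {
        return 0.0;
    }
    std::vector<double> partial(nt, 0.0);

#pragma omp parallel for if (nt > 1)
    for (idx_t slice = 0; slice < nt; slice++) {
        // each thread owns the contiguous slice [i0, i1) of the input
        idx_t i0 = n * slice / nt;
        idx_t i1 = n * (slice + 1) / nt;
        double local = 0.0;
        for (idx_t i = i0; i < i1; i++) {
            local += data[i];
        }
        // each slice writes its own slot, so no lock is needed
        partial[slice] = local;
    }

    double total = 0.0;
    for (double s : partial) {
        total += s;
    }
    return total;
}
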
I came across a different use case that is noteworthy:

    DirectMapAdd dm_adder(direct_map, n, xids);

#pragma omp parallel reduction(+ : nadd)
    {
        int nt = omp_get_num_threads();
        int rank = omp_get_thread_num();

        // each thread takes care of a subset of lists
        for (size_t i = 0; i < n; i++) {
            idx_t list_no = coarse_idx[i];
            if (list_no >= 0 && list_no % nt == rank) {
                idx_t id = xids ? xids[i] : ntotal + i;
                size_t ofs = invlists->add_entry(
                        list_no, id, flat_codes.get() + i * code_size);

                dm_adder.add(i, list_no, ofs);

                nadd++;
            } else if (rank == 0 && list_no == -1) {
                dm_adder.add(i, -1, 0);
            }
        }
    }

Here you use only parallel, without for. Instead, each thread asks for the total number of threads nt and its own rank. Every thread then runs the entire for loop, iterating over every element. However, the conditional guarantees that each element is processed by exactly one thread: the modulo test list_no % nt == rank assigns each list to a single owning thread, so all elements that fall into the same list_no are handled by that one thread, serially, with no possible race condition. The beauty of it is that you get to accumulate state per accumulator (the inverted list identified by list_no in this case) without ever needing a lock, while the per-thread nadd counters are combined at the end by the reduction(+ : nadd) clause. It is very efficient for similar scenarios.
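As a minimal, self-contained sketch of the same idea (not Faiss code; the key array, the bucket vectors, and all names here are made up for illustration), every thread scans all items but only appends to the buckets it owns, so the buckets never need a lock and the count is merged by the reduction clause:

#include <omp.h>

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // toy data: element i belongs to the bucket given by key[i]
    const size_t n = 1000;
    const int n_buckets = 16;
    std::vector<int> key(n);
    for (size_t i = 0; i < n; i++) {
        key[i] = int((i * 7) % n_buckets);
    }

    std::vector<std::vector<size_t>> buckets(n_buckets);
    int64_t nadd = 0;

#pragma omp parallel reduction(+ : nadd)
    {
        int nt = omp_get_num_threads();
        int rank = omp_get_thread_num();

        // every thread scans all elements, but only touches buckets it owns
        for (size_t i = 0; i < n; i++) {
            int bucket_no = key[i];
            if (bucket_no % nt == rank) {
                // this bucket is only ever written by this thread: no lock
                buckets[bucket_no].push_back(i);
                nadd++;
            }
        }
    }

    printf("added %lld elements into %d buckets\n", (long long)nadd, n_buckets);
    return 0;
}

One thing to watch for is load balance: if a few buckets receive most of the elements, the threads that own them end up doing most of the work.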

I also happened to encounter this pattern in a write-up: check out its "Using the intrinsic parallelism of the parallel_flat_hash_map to insert values from multiple threads, lock free" section. BTW, you NEED to check out parallel-hashmap.