Abstract:
This note makes an observation that significantly simplifies a number of previous parallel, two-way merge algorithms based on binary search and sequential merge in parallel. First, it is shown that the additional merge step of distinguished elements as found in previous algorithms is not necessary, thus simplifying the implementation and reducing constant factors. Second, by fixating the requirements to the binary search, the merge algorithm becomes stable, provided that the sequential merge subroutine is stable. The stable, parallel merge algorithm can easily be used to implement a stable, parallel merge sort. For ordered sequences with $n$ and $m$ elements, $m\leq n$, the simplified merge algorithm runs in $O(n/p+\log n)$ operations using $p$ processing elements. It can be implemented on an EREW PRAM, but since it requires only a single synchronization step, it is also a candidate for implementation on other parallel, shared-memory computers.

Abstract:
This note recapitulates an algorithmic observation for ordered Depth-First Search (DFS) in directed graphs that immediately leads to a parallel algorithm with linear speed-up for a range of processors for non-sparse graphs. The note extends the approach to ordered Breadth-First Search (BFS). With $p$ processors, both DFS and BFS algorithms run in $O(m/p+n)$ time steps on a shared-memory parallel machine allowing concurrent reading of locations, e.g., a CREW PRAM, and have linear speed-up for $p\leq m/n$. Both algorithms need $n$ synchronization steps.

Abstract:
We show that the following variation of the single-source shortest path problem is NP-complete. Let a weighted, directed, acyclic graph $G=(V,E,w)$ with source and sink vertices $s$ and $t$ be given. Let in addition a mapping $f$ on $E$ be given that associate information with the edges (e.g., a pointer), such that $f(e)=f(e')$ means that edges $e$ and $e'$ carry the same information; for such edges it is required that $w(e)=w(e')$. The length of a simple $st$ path $U$ is the sum of the weights of the edges on $U$ but edges with $f(e)=f(e')$ are counted only once. The problem is to determine a shortest such $st$ path. We call this problem the \emph{edge information reuse shortest path problem}. It is NP-complete by reduction from PARTITION.

Abstract:
We show how to extend classical work-stealing to deal also with data parallel tasks that can require any number of threads r >= 1 for their execution. We explain in detail the so introduced idea of work-stealing with deterministic team-building which in a natural way generalizes classical work-stealing. A prototype C++ implementation of the generalized work-stealing algorithm has been given and is briefly described. Building on this, a serious, well-known contender for a best parallel Quicksort algorithm has been implemented, which naturally relies on both task and data parallelism. For instance, sorting 2^27-1 randomly generated integers we could improve the speed-up from 5.1 to 8.7 on a 32-core Intel Nehalem EX system, being consistently better than the tuned, task-parallel Cilk++ system.

Abstract:
We present a simple, work-optimal and synchronization-free solution to the problem of stably merging in parallel two given, ordered arrays of m and n elements into an ordered array of m+n elements. The main contribution is a new, simple, fast and direct algorithm that determines, for any prefix of the stably merged output sequence, the exact prefixes of each of the two input sequences needed to produce this output prefix. More precisely, for any given index (rank) in the resulting, but not yet constructed output array representing an output prefix, the algorithm computes the indices (co-ranks) in each of the two input arrays representing the required input prefixes without having to merge the input arrays. The co-ranking algorithm takes O(log min(m,n)) time steps. The algorithm is used to devise a perfectly load-balanced, stable, parallel merge algorithm where each of p processing elements has exactly the same number of input elements to merge. Compared to other approaches to the parallel merge problem, our algorithm is considerably simpler and can be faster up to a factor of two. Compared to previous algorithms for solving the co-ranking problem, the algorithm given here is direct and maintains stability in the presence of repeated elements at no extra space or time cost. When the number of processing elements p does not exceed (m+n)/log min(m,n), the parallel merge algorithm has optimal speedup. It is easy to implement on both shared and distributed memory parallel systems.

Abstract:
Computer science is also an experimental science. This is particularly the case for parallel computing, which is in a total state of flux, and where experiments are necessary to substantiate, complement, and challenge theoretical modeling and analysis. Here, experimental work is as important as are advances in theory, that are indeed often driven by the experimental findings. In parallel computing, scientific contributions presented in research articles are therefore often based on experimental data, with a substantial part devoted to presenting and discussing the experimental findings. As in all of experimental science, experiments must be presented in a way that makes reproduction by other researchers possible, in principle. Despite appearance to the contrary, we contend that reproducibility plays a small role, and is typically not achieved. As can be found, articles often do not have a sufficiently detailed description of their experiments, and do not make available the software used to obtain the claimed results. As a consequence, parallel computational results are most often impossible to reproduce, often questionable, and therefore of little or no scientific value. We believe that the description of how to reproduce findings should play an important part in every serious, experiment-based parallel computing research article. We aim to initiate a discussion of the reproducibility issue in parallel computing, and elaborate on the importance of reproducible research for (1) better and sounder technical/scientific papers, (2) a sounder and more efficient review process and (3) more effective collective work. This paper expresses our current view on the subject and should be read as a position statement for discussion and future work. We do not consider the related (but no less important) issue of the quality of the experimental design.

Abstract:
There has recently been much progress on exact algorithms for the (un)weighted graph (bi)partitioning problem using branch-and-bound and related methods. In this note we present and improve an easily computable, purely combinatorial lower bound for the weighted bipartitioning problem. The bound is computable in $O(n\log n+m)$ time steps for weighted graphs with $n$ vertices and $m$ edges. In the branch-and-bound setting, the bound for each new subproblem can be updated in $O(n+(m/n)\log n)$ time steps amortized over a series of $n$ branching steps; a rarely triggered tightening of the bound requires search on the graph of unassigned vertices and can take from $O(n+m)$ to $O(nm+n^2\log n)$ steps depending on implementation and possible bound quality. Representing a subproblem uses $O(n)$ space. Although the bound is weak, we believe that it can be advantageous in a parallel setting to be able to generate many subproblems fast, possibly out-weighting the advantages of tighter, but much more expensive (algebraic, spectral, flow) lower bounds. We use a recent priority task-scheduling framework for giving a parallel implementation, and show the relative improvements in bound quality and solution speed by the different contributions of the lower bound. A detailed comparison with standardized input graphs to other lower bounds and frameworks is pending. Detailed investigations of branching and subproblem selection rules are likewise not the focus here, but various options are discussed.

Abstract:
Work-stealing systems are typically oblivious to the nature of the tasks they are scheduling. For instance, they do not know or take into account how long a task will take to execute or how many subtasks it will spawn. Moreover, the actual task execution order is typically determined by the underlying task storage data structure, and cannot be changed. There are thus possibilities for optimizing task parallel executions by providing information on specific tasks and their preferred execution order to the scheduling system. We introduce scheduling strategies to enable applications to dynamically provide hints to the task-scheduling system on the nature of specific tasks. Scheduling strategies can be used to independently control both local task execution order as well as steal order. In contrast to conventional scheduling policies that are normally global in scope, strategies allow the scheduler to apply optimizations on individual tasks. This flexibility greatly improves composability as it allows the scheduler to apply different, specific scheduling choices for different parts of applications simultaneously. We present a number of benchmarks that highlight diverse, beneficial effects that can be achieved with scheduling strategies. Some benchmarks (branch-and-bound, single-source shortest path) show that prioritization of tasks can reduce the total amount of work compared to standard work-stealing execution order. For other benchmarks (triangle strip generation) qualitatively better results can be achieved in shorter time. Other optimizations, such as dynamic merging of tasks or stealing of half the work, instead of half the tasks, are also shown to improve performance. Composability is demonstrated by examples that combine different strategies, both within the same kernel (prefix sum) as well as when scheduling multiple kernels (prefix sum and unbalanced tree search).

Abstract:
Priority queues are data structures which store keys in an ordered fashion to allow efficient access to the minimal (maximal) key. Priority queues are essential for many applications, e.g., Dijkstra's single-source shortest path algorithm, branch-and-bound algorithms, and prioritized schedulers. Efficient multiprocessor computing requires implementations of basic data structures that can be used concurrently and scale to large numbers of threads and cores. Lock-free data structures promise superior scalability by avoiding blocking synchronization primitives, but the \emph{delete-min} operation is an inherent scalability bottleneck in concurrent priority queues. Recent work has focused on alleviating this obstacle either by batching operations, or by relaxing the requirements to the \emph{delete-min} operation. We present a new, lock-free priority queue that relaxes the \emph{delete-min} operation so that it is allowed to delete \emph{any} of the $\rho+1$ smallest keys, where $\rho$ is a runtime configurable parameter. Additionally, the behavior is identical to a non-relaxed priority queue for items added and removed by the same thread. The priority queue is built from a logarithmic number of sorted arrays in a way similar to log-structured merge-trees. We experimentally compare our priority queue to recent state-of-the-art lock-free priority queues, both with relaxed and non-relaxed semantics, showing high performance and good scalability of our approach.

Abstract:
We show that the problem of constructing tree-structured descriptions of data layouts that are optimal with respect to space or other criteria from given sequences of displacements, can be solved in polynomial time. The problem is relevant for efficient compiler and library support for communication of noncontiguous data, where tree-structured descriptions with low-degree nodes and small index arrays are beneficial for the communication soft- and hardware. An important example is the Message-Passing Interface (MPI) which has a mechanism for describing arbitrary data layouts as trees using a set of increasingly general constructors. Our algorithm shows that the so-called MPI datatype reconstruction problem by trees with the full set of MPI constructors can be solved optimally in polynomial time, refuting previous conjectures that the problem is NP-hard. Our algorithm can handle further, natural constructors, currently not found in MPI. Our algorithm is based on dynamic programming, and requires the solution of a series of shortest path problems on an incrementally built, directed, acyclic graph. The algorithm runs in $O(n^4)$ time steps and requires $O(n^2)$ space for input displacement sequences of length $n$.