All listed articles are free for downloading (OA Articles)
Microgrid - The microthreaded many-core architecture  [PDF]
Irfan Uddin
Computer Science , 2013,
Abstract: Traditional processors use the von Neumann execution model, while some other processors have used the dataflow execution model. Combinations of the von Neumann and dataflow models have also been tried in the past, and the resulting model is referred to as the hybrid dataflow execution model. We describe a hybrid dataflow model known as microthreading. It provides constructs for the creation of, synchronization of, and communication between threads in an intermediate language. The microthreading model is an abstract programming and machine model for many-core architectures. A particular instance of this model is the microthreaded architecture, or the Microgrid. This architecture implements all the concurrency constructs of the microthreading model in hardware and also manages them in hardware.
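As a software illustration of these concurrency constructs, a family of microthreads with create/sync operations and communication through shared storage might be sketched as follows. This is an analogue only, with hypothetical names `create_family` and `sync`; on the Microgrid these operations are hardware-managed instructions, not library calls.

```python
import threading

def create_family(n, body):
    """Analogue of the model's 'create' construct: spawn a family of
    n microthreads, each running body(index). Returns a handle."""
    threads = [threading.Thread(target=body, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    return threads

def sync(family):
    """Analogue of the 'sync' construct: block until the whole
    family of microthreads has completed."""
    for t in family:
        t.join()

# Communication: each microthread writes its result to a shared slot.
results = [0] * 4
family = create_family(4, lambda i: results.__setitem__(i, i * i))
sync(family)
print(sum(results))  # 0 + 1 + 4 + 9 = 14
```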
Design space exploration in the microthreaded many-core architecture  [PDF]
Irfan Uddin
Computer Science , 2013,
Abstract: Design space exploration is commonly performed for embedded systems, where the architecture is a complicated piece of engineering. With the current trend towards many-core systems, design space exploration in general-purpose computers can no longer be avoided. The Microgrid is a complicated architecture, and we therefore need to perform design space exploration for it. Generally, simulators are used for the design space exploration of an architecture, with differing levels of complexity, simulation time and accuracy. Simulators with low complexity, short simulation times and reasonable accuracy are desirable for design space exploration; these are referred to as high-level simulators and are commonly used in the design of embedded systems. However, the use of high-level simulation for design space exploration in general-purpose computers is a relatively new area of research.
A polyphase filter for many-core architectures  [PDF]
Karel Adámek,Jan Novotny,Wes Armour
Computer Science , 2015,
Abstract: In this article we discuss our implementation of a polyphase filter for real-time data processing in radio astronomy. We describe in detail our implementation of the polyphase filter algorithm and its behaviour on three generations of NVIDIA GPU cards, on dual Intel Xeon CPUs and on the Intel Xeon Phi (Knights Corner) platform. All of our implementations aim to exploit the potential for data reuse that the algorithm offers. Our GPU implementations explore two different methods for achieving this: the first makes use of L1/Texture cache, the second uses shared memory. We discuss the usability of each of our implementations along with their behaviours. We measure performance in execution time, which is a critical factor for real-time systems; we also present results in terms of bandwidth (GB/s), compute (GFlop/s) and type conversions (GTc/s). We include a presentation of our results in terms of the sample rate which can be processed in real-time by a chosen platform, which more intuitively describes the expected performance in a signal processing setting. Our findings show that, for the GPUs considered, the performance of our polyphase filter when using lower precision input data is limited by type conversions rather than device bandwidth. We compare these results to an implementation on the Xeon Phi. We show that our Xeon Phi implementation has a performance that is 1.47x to 1.95x greater than our CPU implementation, but is still not sufficient to compete with the performance of the GPUs. We conclude with a comparison of our best performing code to two other implementations of the polyphase filter, showing that our implementation is faster in nearly all cases. This work forms part of the Astro-Accelerate project, a many-core accelerated real-time data processing library for digital signal processing of time-domain radio astronomy data.
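For readers unfamiliar with the algorithm, one output block of a polyphase filter bank (PFB) can be sketched in a few lines of NumPy. This is a minimal serial sketch, not the paper's code: the name `polyphase_channelize` is ours, and the per-channel weighted sums shown here are exactly the data-reuse opportunity the many-core implementations exploit via cache or shared memory.

```python
import numpy as np

def polyphase_channelize(x, coeffs, n_channels):
    """One output block of a polyphase filter bank.

    x      : input samples, length n_channels * n_taps
    coeffs : prototype FIR window, same length as x
    Returns the n_channels complex spectral outputs.
    """
    n_taps = len(x) // n_channels
    # Reshape so each column holds the taps feeding one channel.
    xs = x.reshape(n_taps, n_channels)
    hs = coeffs.reshape(n_taps, n_channels)
    summed = (xs * hs).sum(axis=0)   # weighted sum per channel
    return np.fft.fft(summed)        # FFT across the channels

rng = np.random.default_rng(0)
n_channels, n_taps = 8, 4
x = rng.standard_normal(n_channels * n_taps)
h = np.hamming(n_channels * n_taps)
spectrum = polyphase_channelize(x, h, n_channels)
print(spectrum.shape)  # (8,)
```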
A novel and scalable Multigrid algorithm for many-core architectures  [PDF]
Julian Becerra-Sagredo,Carlos Malaga,Francisco Mandujano
Physics , 2011,
Abstract: Multigrid algorithms are among the fastest iterative methods known today for solving large linear and some non-linear systems of equations. Although greatly optimized for serial operation, they still have great potential for parallelism that is not yet fully realized. In this work, we present a novel multigrid algorithm designed to work entirely inside many-core architectures like graphics processing units (GPUs), without memory transfers between the GPU and the central processing unit (CPU), avoiding low-bandwidth communications. The algorithm is denoted as the high occupancy multigrid (HOMG) because it makes use of entire grid operations with interpolations and relaxations fused into one task, providing useful work for every thread in the grid. For a given accuracy, the number of operations scales linearly with the total number of nodes in the grid. Perfect scalability is observed for a large number of processors.
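HOMG fuses interpolation and relaxation into one task; the classical V-cycle it builds on can be sketched for the 1D Poisson equation as follows. This is a minimal serial sketch of a textbook V-cycle (weighted Jacobi smoothing, injection restriction, linear interpolation), not the HOMG algorithm itself.

```python
import numpy as np

def v_cycle(u, f, n_smooth=3):
    """One classical V-cycle for the 1D Poisson problem -u'' = f with
    zero boundary values; len(u) must be 2**k + 1."""
    def smooth(u, f, h):
        for _ in range(n_smooth):  # weighted Jacobi damps high-frequency error
            u[1:-1] += (2/3) * (0.5*(u[:-2] + u[2:] + h*h*f[1:-1]) - u[1:-1])
        return u

    h = 1.0 / (len(u) - 1)
    u = smooth(u, f, h)
    if len(u) <= 3:
        return u
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] + (u[:-2] - 2*u[1:-1] + u[2:]) / h**2  # residual
    ec = v_cycle(np.zeros(len(u)//2 + 1), r[::2])            # coarse correction
    e = np.zeros_like(u)
    e[::2] = ec                                              # prolongation:
    e[1:-1:2] = 0.5 * (ec[:-1] + ec[1:])                     # inject + interpolate
    return smooth(u + e, f, h)

# Solve -u'' = pi^2 sin(pi x); the exact solution is u = sin(pi x).
n = 65
x = np.linspace(0.0, 1.0, n)
f = np.pi**2 * np.sin(np.pi * x)
u = np.zeros(n)
for _ in range(10):
    u = v_cycle(u, f)
err = np.max(np.abs(u - np.sin(np.pi * x)))
print(err)
```

Note the linear cost per cycle: each level does O(nodes) work and the grids shrink geometrically, which is the scaling the abstract describes.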
An FMM Based on Dual Tree Traversal for Many-core Architectures  [PDF]
Rio Yokota
Computer Science , 2012,
Abstract: The present work attempts to integrate the independent efforts in the fast N-body community to create the fastest N-body library for many-core and heterogeneous architectures. Focus is placed on low accuracy optimizations, in response to recent interest in using the FMM as a preconditioner for sparse linear solvers. A direct comparison with other state-of-the-art fast N-body codes demonstrates that orders of magnitude increase in performance can be achieved by careful selection of the optimal algorithm and low-level optimization of the code. The current N-body solver uses a fast multipole method with an efficient strategy for finding the list of cell-cell interactions by a dual tree traversal. A task-based threading model is used to maximize thread-level parallelism and intra-node load-balancing. In order to extract the full potential of the SIMD units on the latest CPUs, the inner kernels are optimized using AVX instructions. Our code -- exaFMM -- is an order of magnitude faster than the current state-of-the-art FMM codes, which are themselves an order of magnitude faster than the average FMM code.
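The dual tree traversal mentioned above walks the target and source trees together: well-separated cell pairs are approximated, two leaves interact directly, and otherwise the larger cell is split. A minimal 1D sketch with a monopole-only approximation (far simpler than exaFMM's multipole expansions, and with our own illustrative names) looks like this:

```python
import numpy as np

class Cell:
    """Node of a binary tree over sorted 1D points with unit charges."""
    def __init__(self, pts, idx, leaf_size=8):
        self.pts, self.idx = pts, idx
        self.center = 0.5 * (pts[0] + pts[-1])
        self.radius = 0.5 * (pts[-1] - pts[0])
        self.mass = len(pts)   # unit charges: the monopole is the count
        mid = len(pts) // 2
        self.children = ([] if len(pts) <= leaf_size else
                         [Cell(pts[:mid], idx[:mid]),
                          Cell(pts[mid:], idx[mid:])])

def dual_traversal(a, b, phi, theta=0.2):
    """Walk target cell a and source cell b together."""
    d = abs(a.center - b.center)
    if d > 0 and (a.radius + b.radius) < theta * d:
        phi[a.idx] += b.mass / d            # well separated: approximate (M2L)
    elif not a.children and not b.children:
        for i, xi in zip(a.idx, a.pts):     # two leaves: direct sum (P2P)
            for xj in b.pts:
                if xi != xj:
                    phi[i] += 1.0 / abs(xi - xj)
    elif a.children and (a.radius >= b.radius or not b.children):
        for c in a.children:                # split the larger cell
            dual_traversal(c, b, phi, theta)
    else:
        for c in b.children:
            dual_traversal(a, c, phi, theta)

rng = np.random.default_rng(0)
pts = np.sort(rng.random(200))
root = Cell(pts, np.arange(200))
phi = np.zeros(200)
dual_traversal(root, root, phi)   # potentials for all 200 targets
```

Starting the traversal with the root paired against itself covers the full target-source product space, which is what makes the interaction-list construction implicit.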
Optimised hybrid parallelisation of a CFD code on Many Core architectures  [PDF]
Adrian Jackson,M. Sergio Campobasso
Computer Science , 2013,
Abstract: COSA is a novel CFD system based on the compressible Navier-Stokes model for unsteady aerodynamics and aeroelasticity of fixed structures, rotary wings and turbomachinery blades. It includes a steady, time domain, and harmonic balance flow solver. COSA has primarily been parallelised using MPI, but there is also a hybrid parallelisation that adds OpenMP functionality to the MPI parallelisation, both to enable larger numbers of cores to be utilised for a given simulation (the MPI parallelisation is limited to the number of geometric partitions, or blocks, in the simulation) and to exploit multi-threaded hardware where appropriate. This paper outlines the work undertaken to optimise these two parallelisation strategies, improving the efficiency of both and therefore reducing the computational time required to compute simulations. We also analyse the power consumption of the code on a range of leading HPC systems to further understand the performance of the code.
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
Stefan Marr,Michael Haupt,Stijn Timbermont,Bram Adams
Electronic Proceedings in Theoretical Computer Science , 2010, DOI: 10.4204/eptcs.17.6
Abstract: The upcoming many-core architectures require software developers to exploit concurrency to utilize available computational power. Today's high-level language virtual machines (VMs), which are a cornerstone of software development, do not provide sufficient abstraction for concurrency concepts. We analyze concrete and abstract concurrency models and identify the challenges they impose for VMs. To provide sufficient concurrency support in VMs, we propose to integrate concurrency operations into VM instruction sets. Since there will always be VMs optimized for special purposes, our goal is to develop a methodology to design instruction sets with concurrency support. Therefore, we also propose a list of trade-offs that have to be investigated to guide the design of such instruction sets. As a first experiment, we implemented one instruction set extension for shared memory and one for non-shared memory concurrency. From our experimental results, we derived a list of requirements for a fully-fledged experimental environment for further research.
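The idea of putting concurrency operations into a VM instruction set can be made concrete with a toy interpreter. The sketch below is our own illustration, not the paper's instruction set: a tiny stack bytecode where SPAWN starts a new interpreter on a sub-program and SEND/RECV pass values over a channel, making message-passing concurrency part of the instruction set rather than a library.

```python
import queue
import threading

def run_vm(program, inbox=None):
    """Interpret a tiny stack bytecode with concurrency instructions."""
    stack, chan = [], inbox or queue.Queue()
    for op, *args in program:
        if op == "PUSH":
            stack.append(args[0])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "SPAWN":              # args[0] is a sub-program
            threading.Thread(target=run_vm, args=(args[0], chan)).start()
        elif op == "SEND":
            chan.put(stack.pop())        # pass a value to the channel
        elif op == "RECV":
            stack.append(chan.get())     # blocks until a value arrives
    return stack

# A spawned worker computes 2 + 3 and SENDs it; the parent RECVs it.
worker = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("SEND",)]
main = [("SPAWN", worker), ("RECV",)]
print(run_vm(main))  # [5]
```

Because RECV blocks on the channel, the parent synchronizes with the spawned interpreter without any explicit lock, which is the kind of abstraction-level concurrency support the paper argues VMs should expose.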
A Fast and Scalable Graph Coloring Algorithm for Multi-core and Many-core Architectures  [PDF]
Georgios Rokos,Gerard Gorman,Paul H J Kelly
Computer Science , 2015,
Abstract: Irregular computations on unstructured data are an important class of problems for parallel programming. Graph coloring is often an important preprocessing step, e.g. as a way to perform dependency analysis for safe parallel execution. The total run time of a coloring algorithm adds to the overall parallel overhead of the application, whereas the number of colors used determines the amount of exposed parallelism. A fast and scalable coloring algorithm using as few colors as possible is vital for the overall parallel performance and scalability of many irregular applications that depend upon runtime dependency analysis. Catalyurek et al. have proposed a graph coloring algorithm which relies on speculative, local assignment of colors. In this paper we present an improved version which runs even more optimistically, with less thread synchronization and a reduced number of conflicts compared to Catalyurek et al.'s algorithm. We show that the new technique scales better on multi-core and many-core systems and performs up to 1.5x faster than its predecessor on graphs with high-degree vertices, while keeping the number of colors at the same near-optimal levels.
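The speculative scheme referred to above alternates two phases: every uncolored vertex optimistically takes the smallest color not used by its neighbours, then conflicting neighbours are detected and only one endpoint of each conflict retries. The sequential sketch below simulates this (the snapshot plays the role of the stale colors concurrent threads would read); it illustrates the idea, not the paper's exact algorithm.

```python
def speculative_coloring(adj):
    """Speculative greedy coloring: optimistic assignment, then
    conflict detection and recoloring of only the conflicted set."""
    color = {v: -1 for v in adj}
    pending = list(adj)
    while pending:
        snapshot = dict(color)        # colors as concurrent threads see them
        for v in pending:             # phase 1: optimistic assignment
            used = {snapshot[u] for u in adj[v]}
            color[v] = next(c for c in range(len(adj) + 1) if c not in used)
        pending = [v for v in pending  # phase 2: larger endpoint retries
                   if any(color[u] == color[v] and u < v for u in adj[v])]
    return color

# Triangle (0,1,2) plus a pendant vertex 3: needs exactly 3 colors.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
colors = speculative_coloring(adj)
assert all(colors[u] != colors[v] for u in adj for v in adj[u])
print(max(colors.values()) + 1)  # 3
```

Termination is guaranteed because the lowest-numbered pending vertex can never conflict with an already-finalized neighbour, so each round finalizes at least one vertex.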
Accounting for Secondary Uncertainty: Efficient Computation of Portfolio Risk Measures on Multi and Many Core Architectures  [PDF]
Blesson Varghese,Andrew Rau-Chaplin
Computer Science , 2013,
Abstract: Aggregate Risk Analysis is a computationally intensive and a data intensive problem, thereby making the application of high-performance computing techniques interesting. In this paper, the design and implementation of a parallel Aggregate Risk Analysis algorithm on multi-core CPU and many-core GPU platforms are explored. The efficient computation of key risk measures, including Probable Maximum Loss (PML) and the Tail Value-at-Risk (TVaR) in the presence of both primary and secondary uncertainty for a portfolio of property catastrophe insurance treaties, is considered. Primary uncertainty is the uncertainty associated with whether a catastrophe event occurs or not in a simulated year, while secondary uncertainty is the uncertainty in the amount of loss when the event occurs. A number of statistical algorithms are investigated for computing secondary uncertainty. Numerous challenges, such as loading large data onto hardware with limited memory and organising it, are addressed. The results obtained from experimental studies are encouraging. Consider, for example, an aggregate risk analysis involving 800,000 trials, with 1,000 catastrophic events per trial, a million locations, and a complex contract structure taking into account secondary uncertainty: the analysis can be performed in just 41 seconds on a GPU, which is 24x faster than the sequential counterpart on a fast multi-core CPU. The results indicate that GPUs can be used to efficiently accelerate aggregate risk analysis even in the presence of secondary uncertainty.
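The two risk measures named in the abstract are simple to compute once the per-trial annual losses exist; the heavy lifting the paper parallelises is producing those losses. The sketch below shows the measures on a toy loss model of our own choosing (Poisson event counts for primary uncertainty, lognormal severities for secondary uncertainty), not the paper's portfolio model.

```python
import numpy as np

def risk_measures(year_losses, p=0.99):
    """PML and TVaR at exceedance probability 1 - p, from simulated
    annual aggregate losses (one value per trial / simulated year)."""
    losses = np.asarray(year_losses)
    pml = np.quantile(losses, p)      # Probable Maximum Loss = VaR at p
    tail = losses[losses >= pml]
    tvar = tail.mean()                # mean loss in the tail beyond PML
    return pml, tvar

rng = np.random.default_rng(42)
n_trials = 20_000
# Primary uncertainty: does an event occur (and how many) in a year?
n_events = rng.poisson(3.0, n_trials)
# Secondary uncertainty: how large is the loss given that it occurs?
losses = np.array([rng.lognormal(0.0, 1.0, k).sum() for k in n_events])
pml, tvar = risk_measures(losses, 0.99)
print(tvar >= pml)  # True: TVaR always dominates the PML
```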
High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures  [PDF]
Daehyun Kim,Joshua Trzasko,Mikhail Smelyanskiy,Clifton Haider,Pradeep Dubey,Armando Manduca
International Journal of Biomedical Imaging , 2011, DOI: 10.1155/2011/473128
Abstract: Compressive sensing (CS) describes how sparse signals can be accurately reconstructed from many fewer samples than required by the Nyquist criterion. Since MRI scan duration is proportional to the number of acquired samples, CS has been gaining significant attention in MRI. However, the computationally intensive nature of CS reconstructions has precluded their use in routine clinical practice. In this work, we investigate how different throughput-oriented architectures can benefit one CS algorithm and what levels of acceleration are feasible on different modern platforms. We demonstrate that a CUDA-based code running on an NVIDIA Tesla C2050 GPU can reconstruct a 256 × 160 × 80 volume from an 8-channel acquisition in 19 seconds, which is in itself a significant improvement over the state of the art. We then show that Intel's Knights Ferry can perform the same 3D MRI reconstruction in only 12 seconds, bringing CS methods even closer to clinical viability.

1. Introduction and Motivation

Magnetic resonance imaging (MRI) is a noninvasive medical imaging modality commonly used to investigate soft tissues in the human body. Clinically, MRI is attractive as it offers flexibility, superior contrast resolution, and use of only nonionizing radiation. However, as the duration of a scan is directly proportional to the number of investigated spectral indices, obtaining high-resolution images under standard acquisition and reconstruction protocols can require a significant amount of time. Prolonged scan duration poses a number of challenges in a clinical setting. For example, during long examinations, patients often exhibit involuntary (e.g., respiration) and/or voluntary motion (e.g., active response to discomfort), both of which can impart spatial blurring that may compromise diagnosis. Also, high temporal resolution is often needed to accurately depict physiological processes.
Under standard imaging protocols, spatial resolution must unfortunately be sacrificed to permit quicker scan termination or more frequent temporal updates. Rather than executing a low spatial resolution exam, contemporary MRI protocols often acquire only a subset of the samples associated with a high-resolution exam and attempt to recover the image using alternative reconstruction methods such as homodyne detection [1] or compressive sensing (CS). CS theory asserts that the number of samples needed to form an accurate approximation of an image is largely determined by the image's underlying complexity [2, 3]. Thus, if there exists a means of transforming the image into a more efficient
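The core of a CS reconstruction is an iterative sparsity-promoting solver. A minimal 1D sketch using iterative soft-thresholding (ISTA), one of the simplest such algorithms, is shown below; it is not the paper's 3D MRI pipeline, and the problem sizes and parameters are illustrative assumptions.

```python
import numpy as np

def ista(A, y, lam=0.01, n_iter=2000):
    """Iterative soft-thresholding for the lasso problem
    min_x 0.5 * ||Ax - y||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = x - A.T @ (A @ x - y) / L        # gradient step on the data term
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrink
    return x

rng = np.random.default_rng(1)
n, m, k = 200, 80, 5                         # signal length, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k) + 3.0
A = rng.standard_normal((m, n)) / np.sqrt(m)  # random sensing matrix
y = A @ x_true                                # 80 samples of a 200-long signal
x_rec = ista(A, y)
rel_err = np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true)
print(round(rel_err, 3))
```

Each iteration is dominated by two matrix-vector products, which is exactly the kind of regular, bandwidth-bound work that maps well onto the throughput-oriented architectures the article evaluates.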
Copyright © 2008-2017 Open Access Library. All rights reserved.