Abstract:
Based on fast FIR algorithms (FFAs), we propose new parallel FIR filter architectures based on distributed arithmetic, which are beneficial to symmetric convolutions in terms of hardware cost. Multipliers account for the major portion of the hardware cost in parallel FIR filter implementations. The proposed structures exploit the symmetry of odd-length coefficient sets and further reduce the number of multipliers required at the expense of additional adders. Exchanging multipliers for adders is advantageous because adders occupy less silicon area than multipliers. In addition, the overhead from the extra adders in the preprocessing and postprocessing blocks stays fixed and does not grow with the length of the FIR filter, whereas the number of multipliers saved increases with the filter length.
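
The multiplier saving from coefficient symmetry can be illustrated with a minimal sketch. It shows only the basic pre-addition trick for a single odd-length symmetric filter, not the FFA-based parallel architecture proposed in the paper; the function names and the example coefficients are assumptions chosen for illustration.

```python
def fir_direct(h, x):
    """Direct-form FIR: one multiplication per tap per output sample."""
    y = []
    for n in range(len(x)):
        acc = 0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def fir_symmetric_odd(h, x):
    """Odd-length symmetric FIR: add mirrored samples first, multiply once."""
    n_taps = len(h)
    assert n_taps % 2 == 1 and list(h) == list(reversed(h))
    mid = n_taps // 2

    def sample(i):
        # Samples before the start of the input are treated as zero.
        return x[i] if i >= 0 else 0

    y = []
    for n in range(len(x)):
        acc = h[mid] * sample(n - mid)           # centre tap: 1 multiplier
        for k in range(mid):                     # each mirrored pair: 1 multiplier
            acc += h[k] * (sample(n - k) + sample(n - (n_taps - 1 - k)))
        y.append(acc)
    return y


def multipliers_per_output(n_taps, symmetric):
    """Multiplier count per output sample for an odd-length filter."""
    return n_taps // 2 + 1 if symmetric else n_taps


h = [1, 2, 3, 2, 1]            # hypothetical symmetric coefficients
x = [1, 0, 0, 0, 0, 0]         # unit impulse
assert fir_direct(h, x) == fir_symmetric_odd(h, x)
```

For the 5-tap example, the symmetric form needs 3 multiplications per output instead of 5, at the cost of 2 extra pre-additions.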

Abstract:
We describe an efficient parallel implementation of the selected inversion algorithm for distributed memory computer systems, which we call \texttt{PSelInv}. The \texttt{PSelInv} method computes selected elements of a general sparse matrix $A$ that can be decomposed as $A = LU$, where $L$ is lower triangular and $U$ is upper triangular. The implementation described in this paper focuses on the case of sparse symmetric matrices. It contains an interface that is compatible with the distributed memory parallel sparse direct factorization \texttt{SuperLU\_DIST}. However, the underlying data structure and design of \texttt{PSelInv} allow it to be easily combined with other factorization routines such as \texttt{PARDISO}. We discuss general parallelization strategies such as data and task distribution schemes. In particular, we describe how to exploit the concurrency exposed by the elimination tree associated with the $LU$ factorization of $A$. We demonstrate the efficiency and accuracy of \texttt{PSelInv} by presenting a number of numerical experiments. In particular, we show that \texttt{PSelInv} can run efficiently on more than $4{,}000$ cores for a modestly sized matrix. We also demonstrate how \texttt{PSelInv} can be used to accelerate large-scale electronic structure calculations.

Abstract:
Cellular automata (CA) are a discrete computing model that provides a simple, flexible, and efficient platform for simulating complicated systems and performing complex computation based on neighborhood information. A CA consists of two components: (1) a set of cells and (2) a set of rules. Programmable cellular automata (PCA) employ control signals on a CA structure. PCA have been successfully applied to the simulation of biological and physical systems and, more recently, to the design of parallel and distributed algorithms for solving task density and synchronization problems. In this paper, PCA are applied to develop cryptographic algorithms. The paper presents a parallel AES encryption algorithm based on programmable cellular automata. The proposed algorithm is based on a symmetric-key system.
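
The two components of a CA and the role of control signals can be illustrated with a minimal sketch. It is not the AES construction of the paper: the choice of elementary rules 90 and 150, the zero boundary, and the names `rule_table` and `pca_step` are assumptions made for illustration.

```python
def rule_table(rule_number):
    """Map each (left, self, right) neighborhood to the next cell state."""
    return {
        (l, c, r): (rule_number >> (l * 4 + c * 2 + r)) & 1
        for l in (0, 1) for c in (0, 1) for r in (0, 1)
    }


RULE_90 = rule_table(90)     # next state = left XOR right
RULE_150 = rule_table(150)   # next state = left XOR self XOR right


def pca_step(cells, control):
    """One synchronous update of all cells with zero boundaries.

    control[i] is the control signal of cell i:
    0 selects rule 90, 1 selects rule 150.
    """
    n = len(cells)
    nxt = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0
        right = cells[i + 1] if i < n - 1 else 0
        table = RULE_150 if control[i] else RULE_90
        nxt.append(table[(left, cells[i], right)])
    return nxt


state = [0, 0, 1, 0, 0]
assert pca_step(state, [0, 0, 0, 0, 0]) == [0, 1, 0, 1, 0]   # all rule 90
assert pca_step(state, [1, 1, 1, 1, 1]) == [0, 1, 1, 1, 0]   # all rule 150
```

Because every cell reads only its immediate neighbors, all cells can be updated at the same time, which is what makes the model attractive for parallel hardware.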

Abstract:
Distributed generation (DG) is a promising solution to many power system problems, such as voltage regulation and power loss. The location of DG placement in the power system is found to be very important. The strategy for placing additional DG is also found to depend largely on the total capacity and location of the DG already installed in the system. In this paper, a design strategy based on a proposed “critical bus tracking” method for Proton Exchange Membrane Fuel Cell (PEMFC) DG is tested on a modified IEEE 14 bus test case. Matlab Distributed Computing System (MDCS) is applied to reduce computation time. A program for contingency analysis is also implemented in MDCS to test the design strategy. Tests are conducted on the modified IEEE 14 bus and 300 bus test cases to study the efficiency of the parallel algorithm for DG placement design and contingency analysis.

Abstract:
Resource allocation in heterogeneous parallel and distributed computing systems is the process of allocating user tasks to processing elements for execution such that some performance objective is optimized. In this paper, a new resource allocation algorithm for the computing grid environment is proposed. It takes into account the heterogeneity of the computational resources and resolves the single point of failure problem from which many current algorithms suffer. In this algorithm, any site manager receives two kinds of tasks: remote tasks arriving from its associated local grid manager, and local tasks submitted directly to the site manager by local users in its domain. It allocates the grid workload based on the resource occupation ratio and the communication cost. The grid overall mean task response time is considered the main performance metric that needs to be minimized. The simulation results show that the proposed resource allocation algorithm improves the grid overall mean task response time.

Abstract:
In this paper we propose a parallel coordinate descent algorithm for solving smooth convex optimization problems with separable constraints that may arise e.g. in distributed model predictive control (MPC) for linear network systems. Our algorithm is based on block coordinate descent updates in parallel and has a very simple iteration. We prove (sub)linear rate of convergence for the new algorithm under standard assumptions for smooth convex optimization. Further, our algorithm uses local information and thus is suitable for distributed implementations. Moreover, it has low iteration complexity, which makes it appropriate for embedded control. An MPC scheme based on this new parallel algorithm is derived, for which every subsystem in the network can compute feasible and stabilizing control inputs using distributed and cheap computations. For ensuring stability of the MPC scheme, we use a terminal cost formulation derived from a distributed synthesis. Preliminary numerical tests show better performance for our optimization algorithm than other existing methods.
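
The idea of parallel block coordinate updates using only local information can be illustrated with a minimal sketch. It is a generic Jacobi-style variant for a box-constrained convex quadratic, with scalar blocks and an averaging factor 1/n, and is not necessarily the exact iteration analyzed in the paper; the function name, the test problem, and the iteration count are assumptions.

```python
def parallel_coordinate_descent(Q, q, lower, upper, x0, iters=200):
    """Minimize 0.5*x'Qx + q'x subject to lower <= x <= upper.

    Assumes Q is symmetric positive definite and x0 is feasible.
    """
    n = len(x0)
    x = list(x0)
    for _ in range(iters):
        # Gradient of the quadratic at the current point.
        grad = [sum(Q[i][j] * x[j] for j in range(n)) + q[i] for i in range(n)]
        # Each coordinate independently takes a projected gradient step
        # with step size 1/L_i, where L_i = Q[i][i]. These n updates do
        # not depend on each other and could run in parallel.
        v = [min(max(x[i] - grad[i] / Q[i][i], lower[i]), upper[i])
             for i in range(n)]
        # Averaging: move a fraction 1/n towards the coordinate minimizers.
        # By convexity this never increases the objective.
        x = [x[i] + (v[i] - x[i]) / n for i in range(n)]
    return x


Q = [[2.0, 1.0], [1.0, 2.0]]
q = [-4.0, -4.0]

# Constraints inactive: the solution solves Qx = -q, i.e. x = (4/3, 4/3).
x_free = parallel_coordinate_descent(Q, q, [0.0, 0.0], [5.0, 5.0], [0.0, 0.0])
assert all(abs(xi - 4.0 / 3.0) < 1e-9 for xi in x_free)

# Tighter box: both coordinates end at the upper bound 1.
x_box = parallel_coordinate_descent(Q, q, [0.0, 0.0], [1.0, 1.0], [0.0, 0.0])
assert all(abs(xi - 1.0) < 1e-9 for xi in x_box)
```

Every iterate is a convex combination of feasible points, so feasibility is preserved at each step, which is the property an MPC scheme needs in order to stop early with a usable input.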

Abstract:
Data-intensive, graph-based computations are pervasive in several scientific applications, and are known to be quite challenging to implement on distributed memory systems. In this work, we explore the design space of parallel algorithms for Breadth-First Search (BFS), a key subroutine in several graph algorithms. We present two highly tuned parallel approaches for BFS on large parallel systems: a level-synchronous strategy that relies on a simple vertex-based partitioning of the graph, and a two-dimensional sparse matrix-partitioning-based approach that mitigates parallel communication overhead. For both approaches, we also present hybrid versions with intra-node multithreading. Our novel hybrid two-dimensional algorithm reduces communication times by up to a factor of 3.5, relative to a common vertex-based approach. Our experimental study identifies execution regimes in which these approaches will be competitive, and we demonstrate extremely high performance on leading distributed-memory parallel systems. For instance, for a 40,000-core parallel execution on Hopper, an AMD Magny-Cours based system, we achieve a BFS performance rate of 17.8 billion edge visits per second on an undirected graph of 4.3 billion vertices and 68.7 billion edges with skewed degree distribution.
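
The level-synchronous strategy with vertex-based partitioning can be illustrated with a minimal single-process sketch. The ranks and the all-to-all exchange are simulated with plain lists, and the modulo ownership rule, the function name, and the example graph are assumptions; the sketch does not reflect the tuned implementation described in the paper.

```python
def level_synchronous_bfs(adj, source, num_ranks=2):
    """Return {vertex: BFS level} for all vertices reachable from source.

    adj maps each vertex (an int) to a list of neighbors.
    Vertex v is owned by simulated rank v % num_ranks.
    """
    def owner(v):
        return v % num_ranks

    level = {source: 0}
    frontier = [[] for _ in range(num_ranks)]
    frontier[owner(source)].append(source)
    depth = 0

    while any(frontier):
        # Expansion: each rank scans the edges of its local frontier and
        # buckets the neighbors by owning rank (stands in for an all-to-all).
        inbox = [[] for _ in range(num_ranks)]
        for rank in range(num_ranks):
            for u in frontier[rank]:
                for v in adj[u]:
                    inbox[owner(v)].append(v)

        # Synchronization point: every rank has finished the current level.
        depth += 1

        # Update: each rank labels the unvisited vertices it owns.
        frontier = [[] for _ in range(num_ranks)]
        for rank in range(num_ranks):
            for v in inbox[rank]:
                if v not in level:
                    level[v] = depth
                    frontier[rank].append(v)
    return level


graph = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
assert level_synchronous_bfs(graph, 0) == {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

In this one-dimensional scheme every rank may send to every other rank at each level, which is the communication cost a two-dimensional partitioning aims to reduce.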

Abstract:
The paper first introduces the C++ language, parallelism, and some parallel models. The emphasis is then placed on the design and implementation of C++ parallel/distributed systems.

Abstract:
This paper describes the architecture of MOSE (My Own Search Engine), a scalable parallel and distributed engine for searching the web. MOSE was specifically designed to efficiently exploit affordable parallel architectures, such as clusters of workstations. Its modular and scalable architecture can easily be tuned to fulfill the bandwidth requirements of the application at hand. Both task-parallel and data-parallel approaches are exploited within MOSE in order to increase throughput and make efficient use of communication, storage, and computational resources. We used a collection of HTML documents as a benchmark, and conducted preliminary experiments on a cluster of three SMP Linux PCs.