Abstract:
GPUs deliver significantly higher performance in single precision than in double precision. Hence, it is important to take maximal advantage of single precision in the CG inverter by using the mixed-precision method. We have implemented a mixed-precision algorithm in our multi-GPU conjugate gradient solver. A single-precision calculation uses half the memory of a double-precision calculation, which allows data transfer in memory I/O to be twice as fast. In addition, floating-point calculations are 8 times faster in single precision than in double precision. The overall performance of our CUDA code for CG is 145 gigaflops per GPU (GTX480), not including the InfiniBand network communication. If we include the InfiniBand communication, the overall performance is 36 gigaflops per GPU (GTX480).
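
The mixed-precision idea can be illustrated with a minimal NumPy sketch. It is not the authors' CUDA code: it assumes a dense symmetric positive-definite matrix and a defect-correction outer loop, and the names `cg`, `mixed_precision_solve`, `inner_tol` and `max_outer` are hypothetical. The inner CG runs entirely in single precision, while the true residual and the accumulated solution are kept in double precision.

```python
import numpy as np

def cg(A, b, tol, max_iter=1000):
    """Plain CG for a symmetric positive-definite A, run in the dtype of A and b."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rr = np.dot(r, r)
    b_norm = np.sqrt(np.dot(b, b))
    for _ in range(max_iter):
        if np.sqrt(rr) <= tol * b_norm:
            break
        Ap = A @ p
        alpha = rr / np.dot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = np.dot(r, r)
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def mixed_precision_solve(A, b, tol=1e-12, inner_tol=1e-4, max_outer=50):
    """Defect correction: single-precision inner solves, double-precision residuals."""
    A32 = A.astype(np.float32)          # single-precision copy used by the inner CG
    x = np.zeros_like(b)                # solution accumulated in double precision
    b_norm = np.linalg.norm(b)
    for _ in range(max_outer):
        r = b - A @ x                   # true residual, computed in double precision
        r_norm = np.linalg.norm(r)
        if r_norm <= tol * b_norm:
            break
        # Normalise the residual so the single-precision solve sees O(1) numbers.
        e = cg(A32, (r / r_norm).astype(np.float32), inner_tol)
        x += r_norm * e.astype(np.float64)
    return x
```

Each outer step reduces the residual by roughly the inner tolerance, so a few cheap single-precision solves are enough to reach double-precision accuracy in this toy setting.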

Abstract:
We present the first GPU-based conjugate gradient (CG) solver for lattice QCD with domain-wall fermions (DWF). It is well known that CG is the most time-consuming part of the Hybrid Monte Carlo simulation of unquenched lattice QCD, and it becomes even more computationally demanding for lattice QCD with exact chiral symmetry. We have designed a mixed-precision CG solver for the general 5-dimensional DWF operator on the NVIDIA CUDA architecture, using the defect correction as well as the reliable updates algorithms. We optimize our computation by even-odd preconditioning in the 4D space-time lattice, plus several innovative techniques for CUDA kernels. For NVIDIA GeForce GTX 285/480, our CG solver attains 180/233 Gflops (sustained).
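
As an illustration of even-odd preconditioning, the following NumPy sketch applies the idea to a toy operator, not to the DWF operator itself. It assumes a periodic 1D lattice with an even number of sites and an operator of the form mass term plus nearest-neighbour hopping; `hopping_matrix` and `even_odd_solve` are hypothetical names. Because hopping only connects sites of opposite parity, the odd-odd block is diagonal and the system reduces to a Schur complement on the even sites, which has half the size.

```python
import numpy as np

def hopping_matrix(n):
    """Nearest-neighbour hopping on a periodic 1D lattice with n sites (n even)."""
    H = np.zeros((n, n))
    for i in range(n):
        H[i, (i + 1) % n] = -0.5
        H[i, (i - 1) % n] = -0.5
    return H

def even_odd_solve(M, b):
    """Solve M x = b via the Schur complement on the even sites."""
    n = M.shape[0]
    even = np.arange(0, n, 2)
    odd = np.arange(1, n, 2)
    M_ee = M[np.ix_(even, even)]
    M_eo = M[np.ix_(even, odd)]
    M_oe = M[np.ix_(odd, even)]
    M_oo = M[np.ix_(odd, odd)]
    # Assumption of this toy model: M_oo is diagonal, so its inverse is trivial.
    M_oo_inv = np.diag(1.0 / np.diag(M_oo))
    schur = M_ee - M_eo @ M_oo_inv @ M_oe
    rhs = b[even] - M_eo @ (M_oo_inv @ b[odd])
    x = np.empty_like(b)
    # A production solver would use CG here; a direct solve keeps the sketch short.
    x[even] = np.linalg.solve(schur, rhs)
    # Reconstruct the odd sites from the even-site solution.
    x[odd] = M_oo_inv @ (b[odd] - M_oe @ x[even])
    return x
```

For example, with `M = 1.5 * np.eye(16) + hopping_matrix(16)` the result agrees with a direct solve of the full system, while the iterative part only has to handle the 8 even sites.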

Abstract:
Complete spectra of the staggered Dirac operator $\Dirac$ are determined in four-dimensional $SU(2)$ gauge fields with and without dynamical fermions. An attempt is made to relate the performance of multigrid and conjugate gradient algorithms for propagators with the distribution of the eigenvalues of~$\Dirac$.

Abstract:
Results on the computational efficiency of 2-flavor staggered Wilson fermions compared to usual Wilson fermions in a quenched lattice QCD simulation on a $16^3\times32$ lattice at $\beta=6$ are reported. We compare the cost of inverting the Dirac matrix on a source by the conjugate gradient (CG) method for both of these fermion formulations, at the same pion masses, and without preconditioning. We find that the number of CG iterations required for convergence, averaged over the ensemble, is smaller by a factor of almost 2 for staggered Wilson fermions, with only a mild dependence on the pion mass. We also compute the condition number of the fermion matrix and find that it is smaller by a factor of 4 for staggered Wilson fermions. The cost per CG iteration, dominated by the cost of matrix-vector multiplication for the Dirac matrix, is known from previous work to be smaller by a factor of 2-3 for staggered Wilson fermions compared to usual Wilson fermions. Thus we conclude that staggered Wilson fermions are 4-6 times cheaper for inverting the Dirac matrix on a source in the quenched backgrounds of our study.
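
The final estimate is the product of the two quoted factors. Written out, with symbols that are introduced here only for illustration ($N_{\rm iter}$ for the average number of CG iterations, $c$ for the cost per iteration, W for usual Wilson and SW for staggered Wilson fermions):

```latex
\[
\frac{\mathrm{cost}_{\rm W}}{\mathrm{cost}_{\rm SW}}
  = \frac{N_{\rm iter}^{\rm W}}{N_{\rm iter}^{\rm SW}}
    \times \frac{c_{\rm W}}{c_{\rm SW}}
  \approx 2 \times (2\text{--}3)
  = 4\text{--}6 .
\]
```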

Abstract:
Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in "flat" three dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units have been shown to be highly efficient for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. We describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the algorithm using loop fusion and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth.
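
A minimal NumPy sketch of a matrix-free PCG of this kind is given below. It is a toy model under stated assumptions, not the GPU implementation described above: the operator is a 7-point stencil with homogeneous Dirichlet boundaries, the vertical coupling `lam2` is assumed much larger than the horizontal one, and the preconditioner is assumed to be an exact tridiagonal solve along each vertical column. The names `apply_operator`, `vertical_line_solve` and `pcg` are hypothetical.

```python
import numpy as np

def apply_operator(u, lam2, delta):
    """Matrix-free 7-point stencil: coefficients are recomputed, no matrix is stored."""
    p = np.pad(u, 1)  # zero halo implements homogeneous Dirichlet boundaries
    horiz = (4.0 * u - p[2:, 1:-1, 1:-1] - p[:-2, 1:-1, 1:-1]
                     - p[1:-1, 2:, 1:-1] - p[1:-1, :-2, 1:-1])
    vert = lam2 * (2.0 * u - p[1:-1, 1:-1, 2:] - p[1:-1, 1:-1, :-2])
    return horiz + vert + delta * u

def vertical_line_solve(r, lam2, delta):
    """Thomas algorithm for the vertical tridiagonal systems, all columns at once."""
    nz = r.shape[2]
    diag = 4.0 + delta + 2.0 * lam2   # diagonal of the full operator
    off = -lam2                        # vertical off-diagonal coupling
    c = np.empty(nz)                   # modified upper diagonal
    d = np.empty_like(r)               # modified right-hand side
    c[0] = off / diag
    d[:, :, 0] = r[:, :, 0] / diag
    for k in range(1, nz):             # forward elimination
        denom = diag - off * c[k - 1]
        c[k] = off / denom
        d[:, :, k] = (r[:, :, k] - off * d[:, :, k - 1]) / denom
    z = np.empty_like(r)
    z[:, :, -1] = d[:, :, -1]
    for k in range(nz - 2, -1, -1):    # back substitution
        z[:, :, k] = d[:, :, k] - c[k] * z[:, :, k + 1]
    return z

def pcg(b, lam2, delta, tol=1e-8, max_iter=500):
    """Preconditioned CG; returns the solution and the number of iterations used."""
    x = np.zeros_like(b)
    r = b.copy()
    z = vertical_line_solve(r, lam2, delta)
    p = z.copy()
    rz = np.sum(r * z)
    b_norm = np.linalg.norm(b)
    for it in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, it
        Ap = apply_operator(p, lam2, delta)
        alpha = rz / np.sum(p * Ap)
        x += alpha * p
        r -= alpha * Ap
        z = vertical_line_solve(r, lam2, delta)
        rz_new = np.sum(r * z)
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter
```

In this sketch each vector update is a separate pass over the data; on a GPU, fusing such updates into fewer kernels is what reduces global memory traffic.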

Abstract:
Multigrid (MG) methods for the computation of propagators of staggered fermions in non-Abelian gauge fields are discussed. MG could work in principle in arbitrarily disordered systems. The practical variational MG methods tested so far with a ``Laplacian choice'' for the restriction operator are not competitive with the conjugate gradient algorithm on lattices up to $18^4$. Numerical results are presented for propagators in $SU(2)$ gauge fields.

Abstract:
Complete spectra of the staggered Dirac operator $\Dirac$ are determined in quenched four-dimensional $SU(2)$ gauge fields, and also in the presence of dynamical fermions. Periodic as well as antiperiodic boundary conditions are used. An attempt is made to relate the performance of multigrid (MG) and conjugate gradient (CG) algorithms for propagators with the distribution of the eigenvalues of~$\Dirac$. The convergence of the CG algorithm is determined only by the condition number~$\kappa$ and by the lattice size. Since~$\kappa$ does not vary significantly when quarks become dynamical, CG convergence in unquenched fields can be predicted from quenched simulations. On the other hand, MG convergence is not affected by~$\kappa$ but depends on the spectrum in a more subtle way.
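
For orientation, the standard textbook bound that links CG convergence to the condition number is reproduced below. It is not taken from the paper: it assumes that $\kappa$ is the condition number of the Hermitian positive-definite matrix $A$ to which CG is applied, with $x$ the exact solution, $x_k$ the $k$-th iterate and $\epsilon$ the requested reduction of the error in the $A$-norm. Being an upper bound, it gives a worst-case estimate rather than the observed iteration count.

```latex
\[
\| x_k - x \|_A \;\le\; 2 \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^{k} \| x_0 - x \|_A ,
\qquad
k_\epsilon \;\le\; \left\lceil \tfrac{1}{2} \sqrt{\kappa}\, \ln\frac{2}{\epsilon} \right\rceil .
\]
```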

Abstract:
We present our implementation of the RHMC algorithm for staggered fermions on Graphics Processing Units using the NVIDIA CUDA programming language. While previous studies exclusively deal with the Dirac matrix inversion problem, our code performs the complete MD trajectory on the GPU. After pointing out the main bottlenecks and how to circumvent them, we discuss the performance of our code.

Abstract:
An explanation is proposed for the fact that Lepage--Mackenzie tadpole improvement does not work well for staggered fermions. The idea appears to work for all renormalization constants which appear in the staggered fermion self-energy. Wilson fermions are also discussed.

Abstract:
Staggered Domain Wall Fermions (SDWF) combine the attractive chiral properties of staggered fermions with those of domain wall fermions. SDWF describe four flavors with exact U(1)xU(1) flavor chiral symmetry. An extra lattice dimension is introduced and the full SU(4)xSU(4) flavor chiral symmetry is recovered as its size is increased. Here, the free theory of SDWF is described and a preliminary discussion of the interacting case is presented. SDWF may be well suited for numerical simulation of lattice QCD thermodynamics.