Abstract:
We calculate on GPUs the disconnected diagrams associated with the nucleon form factors and moments of generalized parton distributions using $N_f=2+1+1$ twisted mass fermions. We employ the truncated solver method (TSM) for estimating the all-to-all propagators. Because the TSM involves many low-precision stochastic estimators, GPUs are essential for performing the contractions and the inversions efficiently.
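The idea behind the TSM can be sketched on a toy problem. The example below is a minimal illustration, not the authors' GPU code. It assumes a small symmetric positive-definite matrix in place of the twisted mass Dirac operator, Z2 noise vectors, and a plain conjugate gradient solver whose tolerance plays the role of the precision setting. All function names and parameters (`cg_solve`, `tsm_trace_inverse`, `demo_matrix`, `n_hp`, `n_lp`) are hypothetical. Many cheap low-precision solves provide the bulk estimate of the trace of the inverse, and a few high-precision solves correct its bias.

```python
import random

def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cg_solve(A, b, tol, max_iter=1000):
    """Conjugate gradient for symmetric positive-definite A.
    Stops once |r| <= tol * |b|; a loose tol gives a truncated solve."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    stop = tol ** 2 * dot(b, b)
    for _ in range(max_iter):
        if rr <= stop:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def tsm_trace_inverse(A, n_hp, n_lp, lp_tol=1e-2, hp_tol=1e-10, seed=0):
    """Stochastic estimate of Tr(A^-1) in the spirit of the TSM (toy version)."""
    rng = random.Random(seed)
    n = len(A)

    def z2_noise():
        return [rng.choice((-1.0, 1.0)) for _ in range(n)]

    # Bias correction: few noise vectors, each solved at high AND low precision.
    correction = 0.0
    for _ in range(n_hp):
        eta = z2_noise()
        correction += dot(eta, cg_solve(A, eta, hp_tol)) - dot(eta, cg_solve(A, eta, lp_tol))

    # Bulk estimate: many noise vectors, low-precision solves only.
    bulk = 0.0
    for _ in range(n_lp):
        eta = z2_noise()
        bulk += dot(eta, cg_solve(A, eta, lp_tol))

    return correction / n_hp + bulk / n_lp

def demo_matrix(n):
    """Hypothetical test matrix: tridiagonal with 4 on the diagonal, -1 off it."""
    return [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    A = demo_matrix(8)
    print(tsm_trace_inverse(A, n_hp=20, n_lp=400))
```

Because the low-precision solve is a deterministic function of the noise vector, its contribution cancels in expectation between the two terms, so the estimator stays unbiased while most of the work is done at low precision.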

Abstract:
We present results on the disconnected contributions to three-point functions entering in studies of hadron structure. We use $N_f = 2+1+1$ twisted mass fermions and give a detailed description of the results for the nucleon $\sigma$-terms, the isoscalar axial charge and the first moments of bare parton distributions for a range of pion masses. In addition we give the $\sigma$-terms. The computations are performed using the QUDA code implemented on GPUs.

Abstract:
A number of stochastic methods developed for the calculation of fermion loops are investigated and compared, in particular with respect to their efficiency when implemented on Graphics Processing Units (GPUs). We assess the performance of the various methods by studying the convergence and statistical accuracy obtained for observables that require a large number of stochastic noise vectors, such as the isoscalar nucleon axial charge. The various methods are also examined for the evaluation of sigma-terms, where noise reduction techniques specific to the twisted mass formulation can be utilized, thus reducing the required number of stochastic noise vectors.

Abstract:
Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for the overlapping of communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
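The mixed precision idea mentioned above can be illustrated with a defect-correction loop. This is a minimal sketch under stated assumptions and does not reproduce QUDA's actual solvers or API: single precision is emulated by rounding through `struct`, the operator is a small hypothetical tridiagonal matrix, and every function name is invented for the example. The inner conjugate gradient runs in emulated single precision to a loose tolerance, while the residual and the accumulated solution are kept in double precision.

```python
import struct

def to_f32(v):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def round_vec(x):
    return [to_f32(v) for v in x]

def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cg_low_precision(A, b, tol, max_iter=200):
    """Conjugate gradient with all vectors rounded to single precision."""
    x = [0.0] * len(b)
    r = round_vec(b)
    p = list(r)
    rr = dot(r, r)
    stop = tol * tol * rr
    for _ in range(max_iter):
        if rr <= stop:
            break
        Ap = round_vec(matvec(A, p))
        alpha = to_f32(rr / dot(p, Ap))
        x = round_vec([xi + alpha * pi for xi, pi in zip(x, p)])
        r = round_vec([ri - alpha * api for ri, api in zip(r, Ap)])
        rr_new = dot(r, r)
        beta = to_f32(rr_new / rr)
        p = round_vec([ri + beta * pi for ri, pi in zip(r, p)])
        rr = rr_new
    return x

def mixed_precision_solve(A, b, tol=1e-12, inner_tol=1e-3, max_outer=50):
    """Defect correction: double-precision residual, low-precision inner solve.
    Returns the solution and the number of outer corrections applied."""
    x = [0.0] * len(b)
    bnorm2 = dot(b, b)
    for outer in range(max_outer):
        # True residual, computed in double precision.
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        if dot(r, r) <= tol * tol * bnorm2:
            return x, outer
        # Cheap correction: solve A e = r approximately in low precision.
        e = cg_low_precision(A, r, inner_tol)
        x = [xi + ei for xi, ei in zip(x, e)]
    return x, max_outer

def demo_matrix(n):
    """Hypothetical test matrix: tridiagonal with 4 on the diagonal, -1 off it."""
    return [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]

if __name__ == "__main__":
    A = demo_matrix(16)
    x, n_outer = mixed_precision_solve(A, [1.0] * 16)
    print(n_outer, max(abs(v - 1.0) for v in matvec(A, x)))
```

The design point is that the final accuracy is set by the double-precision residual, so the bulk of the arithmetic can run at a lower precision without limiting the accuracy of the result.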

Abstract:
We discuss an extension of the QUDA library for the Wilson twisted mass operator. A performance analysis is presented for both degenerate and non-degenerate flavor doublets. The degenerate twisted mass fermion operator runs at up to 190, 487 and 856 Gflops for double, single and half precision, respectively, on recent NVIDIA Kepler GPUs, while our implementation for the non-degenerate flavor doublet reaches 163, 516 and 879 Gflops, respectively. The code is currently in production for the hadron structure study.

Abstract:
Motivated by the application of L\"uscher's finite volume method to the study of the lightest scalar resonance in the $\pi\pi \to \pi\pi$ isoscalar channel, in this article we describe our studies of multi-pion correlation functions computed using stochastic propagators in quenched lattice QCD, harnessing GPUs for acceleration. We consider two methods for constructing the correlation functions. The first, an "outer product" approach, becomes quite expensive at large lattice extent $L$, having an ${\cal O}(L^7)$ scaling. The second, a "stochastic operator" approach, scales as ${\cal O}(N_r^2 L^4)$, where $N_r$ is the number of random sources; it becomes the more efficient of the two when variance reduction techniques are used and the volume is fairly large. It is also found that correlations between stochastic propagators appearing in the same diagram, when a single set of random source vectors is used, lead to much larger errors than if separate random sources are used for each propagator. The calculations involve states with quantum numbers of the vacuum, so all-to-all propagators must be computed. For this reason, GPUs are ideally suited to accelerating the calculation. For this work we have integrated the Columbia Physics System (CPS) and the QUDA GPU inversion library, in the case of clover fermions. Finally, we show that the completely quark disconnected diagram is crucial to the results, and that neglecting it would lead to answers which are far from the true value for the effective mass in this channel. This is unfortunate because, as we also show, this diagram has very large errors and in fact dominates the error budget.
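The two scalings quoted above can be compared with a few lines of arithmetic. The sketch below assumes that both prefactors equal one, which the abstract does not state, so the numbers indicate only where the asymptotic behaviours cross and not actual run times. The function names are hypothetical.

```python
def outer_product_cost(L):
    """Asymptotic cost of the "outer product" approach, O(L^7), prefactor assumed 1."""
    return L ** 7

def stochastic_operator_cost(L, n_r):
    """Asymptotic cost of the "stochastic operator" approach, O(N_r^2 L^4), prefactor assumed 1."""
    return n_r ** 2 * L ** 4

def crossover_extent(n_r):
    """Smallest lattice extent L at which the stochastic operator approach is cheaper,
    i.e. the smallest L with L^3 > N_r^2 under the unit-prefactor assumption."""
    L = 1
    while stochastic_operator_cost(L, n_r) >= outer_product_cost(L):
        L += 1
    return L

if __name__ == "__main__":
    print(crossover_extent(8))   # → 5, since 5^3 = 125 > 8^2 = 64
    print(crossover_extent(27))  # → 10, since 10^3 = 1000 > 27^2 = 729
```

Fewer random sources move the crossover to smaller volumes. This is consistent with the statement that variance reduction, which lowers the $N_r$ needed for a given accuracy, favours the stochastic operator approach.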

Abstract:
Kepler GTX Titan Black and Kepler Tesla K40 are still the best GPUs for high performance computing, although Maxwell GPUs such as the GTX 980 are available on the market. Hence, we measure the performance of our lattice QCD codes using the Kepler GPUs. We also upgrade our code to use the latest CPS (Columbia Physics System) library along with the most recent QUDA (QCD CUDA) library for lattice QCD. These new libraries improve the performance of our conjugate gradient (CG) inverter so that it runs twice as fast as before. We also investigate the performance of the Xeon Phi 7120P coprocessor. In principle, its computing power is similar to that of the Kepler GPUs. However, its performance for our CG code is significantly inferior to that of the GTX Titan Black GPUs at present.

Abstract:
Disconnected diagrams give crucial contributions to the physics of flavor-singlet hadrons and to scalar form factors of non-singlet hadrons. A naive lattice calculation of the disconnected diagrams, however, requires a huge number of fermion matrix inversions and hence a prohibitively large computational cost. In this article, we present recent studies of the flavor-singlet meson spectrum and nucleon strange quark content using the all-to-all propagator to calculate the relevant disconnected diagrams.

Abstract:
We compare several methods for computing disconnected fermion loops contributing to nucleon three-point functions. The comparison is carried out using one ensemble of $N_f=2+1+1$ twisted mass fermions with a pion mass of 373 MeV. The complete set of operators up to one derivative is examined using optimized code developed for multiple GPUs. Simple guidelines are given as to the preferable method for each class of operators.

Abstract:
Quadratic discriminant analysis (QDA) is a standard tool for classification due to its simplicity and flexibility. Because the number of its parameters scales quadratically with the number of variables, however, QDA is not practical when the dimensionality is relatively large. To address this, we propose a novel procedure named QUDA for QDA in analyzing high-dimensional data. Formulated in a simple and coherent framework, QUDA aims to directly estimate the key quantities in the Bayes discriminant function, including quadratic interactions and a linear index of the variables for classification. Under appropriate sparsity assumptions, we establish consistency results for estimating the interactions and the linear index, and further demonstrate that the misclassification rate of our procedure converges to the optimal Bayes risk, even when the dimensionality is exponentially high with respect to the sample size. An efficient algorithm based on the alternating direction method of multipliers (ADMM) is developed for finding interactions, which is much faster than its competitor in the literature. The promising performance of QUDA is illustrated via extensive simulation studies and the analysis of two datasets.
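For orientation, the Bayes discriminant function referred to above can be written out for two Gaussian classes. The notation is an assumption made here and is not taken from the abstract: class means $\mu_1, \mu_2$, covariances $\Sigma_1, \Sigma_2$ and prior probabilities $\pi_1, \pi_2$. Taking twice the difference of the two Gaussian log-posteriors and expanding around the midpoint $\mu$ gives:

```latex
\begin{align*}
D(z) &= (z-\mu)^{T}\,\Omega\,(z-\mu) + \delta^{T}(z-\mu) + \eta, \\
\mu &= \tfrac{1}{2}(\mu_1+\mu_2), \qquad
\Omega = \Sigma_2^{-1}-\Sigma_1^{-1}, \qquad
\delta = \left(\Sigma_1^{-1}+\Sigma_2^{-1}\right)(\mu_1-\mu_2), \\
\eta &= 2\log\frac{\pi_1}{\pi_2}
      + \tfrac{1}{4}(\mu_1-\mu_2)^{T}\,\Omega\,(\mu_1-\mu_2)
      + \log|\Sigma_2| - \log|\Sigma_1| .
\end{align*}
```

An observation $z$ is assigned to class 1 when $D(z) > 0$. In this form $\Omega$ carries the quadratic interactions and $\delta$ the linear index. For $p$ variables the symmetric matrix $\Omega$ has $p(p+1)/2$ free entries, which is the quadratic growth in parameters that the abstract mentions.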