The main contribution of this paper is to present an efficient hardware algorithm for RSA encryption/decryption based on Montgomery multiplication. Modern FPGAs have a number of embedded DSP blocks (DSP48E1) and embedded memory blocks (BRAM). Our hardware algorithm supporting 2048-bit RSA encryption/decryption is designed to be implemented using one DSP48E1, one BRAM and few logic blocks (slices) in the Xilinx Virtex-6 family FPGA. The implementation results showed that our RSA module for 2048-bit RSA encryption/decryption runs in 277.26ms. Quite surprisingly, the multiplier in DSP48E1 used to compute Montgomery multiplication works in more than 97% clock cycles over all clock cycles. Hence, our implementation is close to optimal in the sense that it has only less than 3% overhead in multiplication and no further improvement is possible as long as Montgomery multiplication based algorithm is used. Also, since our circuit uses only one DSP48E1 block and one Block RAM, we can implement a number of RSA modules in an FPGA that can work in parallel to attain high throughput RSA encryption/decryption.