Abstract:
This work presents a flexible VLSI architecture to compute the -point DCT. Since HEVC supports different block sizes for the computation of the DCT, that is, up to , a flexible architecture that supports all of them helps reduce the area overhead of hardware implementations. The hardware proposed in this work is partially folded to save area while remaining fast enough for large video sequence sizes. The architecture relies on decomposing the DCT matrices into sparse submatrices in order to reduce the number of multiplications. Finally, multiplications are completely eliminated using the lifting scheme. The proposed architecture sustains real-time processing of a 1080p HD video codec running at 150 MHz.

1. Introduction
As technology evolves, hardware keeps shrinking while storage capacity increases. High-end video applications have become common in daily life, for example, watching movies, video conferencing, and creating and saving videos with high-definition video cameras. A single device, such as a new high-end mobile phone or smartphone, can now support all the multimedia applications that once seemed out of reach. As a consequence, new highly efficient video coders are of paramount importance. However, high efficiency comes at the expense of computational complexity. As pointed out in [1, 2], several blocks of video codecs, including the transform stage [3], motion estimation and entropy coding [4], are responsible for this high complexity. As an example, the discrete cosine transform (DCT), which is used in several standards for image and video compression, is a computation-intensive operation. In particular, a direct implementation requires a large number of additions and multiplications. HEVC, the brand-new and yet-to-be-released video coding standard, targets highly efficient video coding.
One of the tools employed to improve coding efficiency is the DCT with different transform sizes. As an example, the 16-point DCT of HEVC is shown in [5]. In video compression, the DCT is widely used because it compacts the image energy at the low frequencies, making it easy to discard the high-frequency components. To meet the requirement of real-time processing, hardware implementations of the 2-D DCT/inverse DCT (IDCT) are adopted, for example, [6]. The 2-D DCT/IDCT can be implemented with the 1-D DCT/IDCT and a transpose memory in a row-column decomposition manner. In the direct implementation of the DCT, floating-point multiplications have to be tackled, which cause precision problems in

Abstract:
When implementing real-time DSP algorithms on digital circuits, the system is always constrained by limited speed, accuracy and roundoff noise. These limitations must be taken into account during the design and implementation stages. Doubling the dynamic rate of the analog DCT is expensive, whereas in digital DCT an addition of 1 bit in the data path is adequate. This paper proposes a novel analog CMOS implementation technique for Digital Signal Processing (DSP) algorithms to reduce the area and power requirements of existing digital CMOS implementations. A Discrete Cosine Transform (DCT) with signed coefficients has been designed and implemented in this paper. The problems of digital DCTs, viz., quantization error, round-off noise, high power consumption and large area, are overcome by the proposed implementation. It can be used to develop the architecture design of DFT, DST and DHT.

Abstract:
A generic multiplication scheme for the low-power VLSI implementation of the DCT is described in this paper. The scheme concurrently processes blocks of cosine coefficient and pixel values during the multiplication procedure, with the aim of reducing the total switched capacitance within the multiplier circuit. The cosine coefficients within each block are manipulated such that some are processed using shift operations only. The remaining coefficients are presented to the multiplier inputs as a sequence, ordered according to the bit correlation between successive cosine coefficients. The paper describes the multiplication scheme and the power evaluation environment used, and presents results on a number of standard benchmark examples, demonstrating up to 50% power saving.

Abstract:
The discrete cosine transform (DCT) and inverse DCT (IDCT) have been widely used in many image processing systems and in real-time computation of nonlinear time series. In this paper, a novel linear array of DCT and IDCT is derived from the data flow of subband decompositions representing the factorized coefficient matrices in the matrix formulation of the recursive algorithm. To increase the throughput as well as decrease the hardware cost, the input and output data are reordered. The proposed 8-point DCT/IDCT processor with four multipliers, simple adders, and fewer registers and ROM storing the immediate results and coefficients, respectively, has been implemented on FPGA (field programmable gate array) and SoC (system on chip). The linear-array DCT/IDCT processor with the computation complexity (5/8) and hardware complexity (5/8) is fully pipelined and scalable for variable-length DCT/IDCT computations.

Abstract:
Image compression is an important topic in the digital world. It is the art of representing information in a compact form. This paper deals with the implementation of a low-power VLSI architecture for an image compression system using the DCT. The Discrete Cosine Transform (DCT) is the most widely used technique for image compression of JPEG images [5] and is a lossy compression method. The architecture of the DCT is based on the Loeffler method [1], which is a fast and low-complexity algorithm. In the proposed architecture of the DCT, multipliers are replaced with adders and shifters. Low-power approaches such as canonic signed digit representation for constant coefficients and sub-expression elimination methods have been used. The 2D DCT is performed on an 8x8 image matrix using two 1D DCT blocks and a transposition block. Similar to the DCT, the IDCT is also implemented using the Loeffler algorithm for the IDCT. Verilog HDL is used to implement the design. ISIM of XILINX is used for the simulation of the design. CADENCE RTL compiler is used to synthesize the design and obtain detailed power and area reports. MATLAB is used as the support tool to obtain the input pixel values of the image, and the results from both ISIM and MATLAB are compared.
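
The row-column structure mentioned above (two 1D DCT passes with a transposition in between) can be illustrated in software. The following is only a minimal floating-point sketch, not the paper's Loeffler-based multiplierless hardware: the function names `dct_1d`, `transpose`, and `dct_2d` are hypothetical, and the 1D pass uses the textbook orthonormal DCT-II definition rather than a fast algorithm.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a sequence, computed directly from the definition."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out


def transpose(block):
    """Software stand-in for the transposition block."""
    return [list(col) for col in zip(*block)]


def dct_2d(block):
    """2D DCT as two 1D passes separated by a transposition."""
    row_pass = [dct_1d(row) for row in block]             # first 1D DCT: rows
    col_pass = [dct_1d(col) for col in transpose(row_pass)]  # second 1D DCT: columns
    return transpose(col_pass)                            # restore row-major layout


# A flat 8x8 block: all of the energy should land in the DC coefficient.
flat = [[100] * 8 for _ in range(8)]
coeffs = dct_2d(flat)
```

For a constant block every coefficient except `coeffs[0][0]` is zero up to floating-point rounding, which is the energy-compaction property that makes the DCT useful for compression.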

Abstract:
This paper presents stable, radix-2, completely recursive discrete cosine transformation algorithms DCT-I and DCT-III solely based on DCT-I, DCT-II, DCT-III, and DCT-IV having sparse and orthogonal factors. Error bounds for computing the completely recursive DCT-I, DCT-II, DCT-III, and DCT-IV algorithms having sparse and orthogonal factors are addressed. Image compression results are presented based on the recursive 2D DCT-II and DCT-IV algorithms for image size $512 \times 512$ pixels with transfer block sizes $8 \times 8$, $16 \times 16$, and $32 \times 32$ with $93.75\%$ absence of coefficients in each transfer block. Finally signal flow graphs are demonstrated based on the completely recursive DCT-I, DCT-II, DCT-III, and DCT-IV algorithms having orthogonal factors.

Abstract:
Two multiplierless algorithms are proposed for the 4x4 approximate DCT for transform coding in digital video. Computational architectures for 1-D/2-D realisations are implemented using Xilinx FPGA devices. CMOS synthesis at the 45 nm node indicates real-time operation at 1 GHz, yielding 4x4 block rates of 125 MHz at less than 120 mW of dynamic power consumption.

Abstract:
Image data compression has been an active research area in image processing over the last decade [1] and has been used in a variety of applications. This paper investigates the implementation of a low-power VLSI architecture for image compression, which uses the Variable Length Coding method to compress JPEG signals [1]. The architecture is proposed for the quantized DCT output [5]. The proposed architecture consists of three optimized blocks, viz., zigzag scanning, run-length coding and Huffman coding [17]. In the proposed architecture, the zigzag scanner uses two RAM memories in parallel to make the scanning faster. The run-length coder in the architecture counts the number of intermediate zeros between successive non-zero DCT coefficients, unlike the traditional run-length coder, which counts the repeating string of coefficients to compress data [20]. The complexity of the Huffman coder is reduced by making use of a lookup table formed by arranging the {run, value} combinations in order of decreasing probability with associated variable length codes [14]. The VLSI architecture of the design is implemented [12] using Verilog HDL with low-power approaches. The proposed hardware architecture for image compression was synthesized using RTL Compiler from CADENCE and mapped using 90nm standard cells. The simulation is done using Modelsim. The back-end design, such as layout, is done using IC Compiler. Power consumptions of the variable length encoder and decoder are limited to 0.798mW and 0.884mW with minimum area. The experimental results confirm that a 53% power saving is achieved in the dynamic power of Huffman decoding [6] by including the lookup table approach, and that a 27% power saving is achieved in the RL-Huffman encoder [8].
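
The zigzag scan and the zero-run counting described above can be sketched in a few lines. This is an illustrative software model only, not the paper's two-RAM hardware: the helper names `zigzag_order` and `run_length_encode`, the `EOB` end-of-block marker `(0, 0)`, and the 4x4 sample block are all assumptions made for the example.

```python
EOB = (0, 0)  # hypothetical end-of-block marker for this sketch


def zigzag_order(n):
    """Return the (row, col) pairs of an n x n block in JPEG-style zigzag order."""
    order = []
    for s in range(2 * n - 1):  # s indexes the anti-diagonals
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even anti-diagonals are walked with the row index decreasing,
        # odd ones with the row index increasing.
        order.extend(reversed(diag) if s % 2 == 0 else diag)
    return order


def run_length_encode(seq):
    """Emit (run, value) pairs, where run counts the zeros before each non-zero value."""
    pairs = []
    run = 0
    for value in seq:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append(EOB)  # trailing zeros are summarized by the marker
    return pairs


# Made-up quantized block, 4x4 for brevity.
block = [
    [12, 5, 0, 0],
    [-3, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
scanned = [block[r][c] for r, c in zigzag_order(4)]
pairs = run_length_encode(scanned)
```

Each resulting `(run, value)` pair is the kind of symbol a Huffman lookup table would then map to a variable-length code.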

Abstract:
An algebraic integer (AI) based time-multiplexed row-parallel architecture and two final-reconstruction step (FRS) algorithms are proposed for the implementation of the bivariate AI-encoded 2-D discrete cosine transform (DCT). The architecture directly realizes an error-free 2-D DCT without using FRSs between row-column transforms, leading to an 8$\times$8 2-D DCT which is entirely free of quantization errors in the AI basis. As a result, the user-selectable accuracy for each of the coefficients in the FRS allows each of the 64 coefficients to have its precision set independently of the others, avoiding the leakage of quantization noise between channels that occurs in published DCT designs. The proposed FRS uses two approaches based on (i) optimized Dempster-Macleod multipliers and (ii) expansion factor scaling. This architecture enables low-noise, high-dynamic-range applications in digital video processing that require full control of the finite-precision computation of the 2-D DCT. The proposed architectures and FRS techniques are experimentally verified and validated using hardware implementations that are physically realized and verified on an FPGA chip. Six designs, for 4- and 8-bit input word sizes, using the two proposed FRS schemes, have been designed, simulated, physically implemented and measured. The maximum clock rate and block rate achieved among the 8-bit input designs are 307.787 MHz and 38.47 MHz, respectively, implying a pixel rate of 8$\times$307.787 MHz $\approx$ 2.462 GHz if eventually embedded in a real-time video-processing system. The equivalent frame rate is about 1187.35 Hz for an image size of 1920$\times$1080. All implementations are functional on a Xilinx Virtex-6 XC6VLX240T FPGA device.
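
The throughput figures quoted above follow from simple arithmetic, reproduced here as a sketch. It assumes that the row-parallel design accepts one 8-sample row per clock cycle, so an 8$\times$8 block occupies 8 cycles; that reading is an inference from the quoted numbers rather than something the abstract states. All variable names are illustrative.

```python
clock_hz = 307.787e6        # maximum clock rate quoted for the 8-bit designs
samples_per_clock = 8       # assumption: one 8-sample row enters per cycle
rows_per_block = 8          # an 8x8 block then occupies 8 cycles

block_rate_hz = clock_hz / rows_per_block        # blocks per second
pixel_rate_hz = clock_hz * samples_per_clock     # pixels per second
frame_rate_hz = pixel_rate_hz / (1920 * 1080)    # full-HD frames per second

print(f"block rate: {block_rate_hz / 1e6:.2f} MHz")
print(f"pixel rate: {pixel_rate_hz / 1e9:.3f} GHz")
print(f"frame rate: {frame_rate_hz:.1f} Hz")
```

The block rate and pixel rate match the quoted 38.47 MHz and 2.462 GHz. The frame rate comes out within about 0.1 Hz of the quoted 1187.35 Hz, with the small difference depending on how the pixel rate is rounded before dividing.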

Abstract:
The canonical signed digit (CSD) representation of constant coefficients is a unique signed data representation containing the fewest number of nonzero bits. Consequently, for constant multipliers, the number of additions and subtractions is minimized by the CSD representation of constant coefficients. This technique is mainly used for finite impulse response (FIR) filters, where it reduces the number of partial products. In this paper, we use CSD with a novel common subexpression elimination (CSE) scheme on the optimal Loeffler algorithm for the computation of the discrete cosine transform (DCT). To meet the challenges of low-power and high-speed processing, we present an optimized image compression scheme based on the two-dimensional DCT. Finally, a novel and simple reconfigurable quantization method combined with DCT computation is presented to effectively reduce the computational complexity. We present here a new DCT architecture based on the proposed technique. From the experimental results obtained from the FPGA prototype, we find that the proposed design has several advantages in terms of power reduction, speed performance, and saving of silicon area, along with PSNR improvement over the existing designs as well as the Xilinx core.

1. Introduction
Many applications, such as video surveillance and patient monitoring systems, require many cameras for effective tracking of living and nonliving objects. To manage the huge amount of data generated by several cameras, we proposed an optical implementation of image compression based on the DCT algorithm in [1]. But this solution suffers from poor image quality and higher material complexity. Following this optical implementation, in this paper we propose a digital realization of an optimized VLSI architecture for an image compression system. This paper is an extension of our prior work [2–4] with a new compression scheme along with supplementary simulations and an FPGA implementation followed by performance analysis.
More recent video encoders such as H.263 [5] and MPEG-4 Part 2 [6] use DCT-based image compression along with additional algorithms for motion estimation (ME). A simplified block diagram of the encoder is presented in Figure 1. The 2D DCT of blocks of the image is performed to decorrelate each block of input pixels. The DCT coefficients are then quantized, using a quantization matrix, to represent them in a reduced range of values. Finally, the quantized components are scanned in zigzag order, and the encoder employs run-length encoding (RLE) and Huffman coding or binary arithmetic coding (BAC)-based algorithms for entropy coding.
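
The CSD idea from the abstract (recoding a constant so that it has as few nonzero digits as possible, then multiplying with shifts, additions and subtractions only) can be sketched as follows. This is a generic illustration using the standard non-adjacent-form recoding, not the paper's CSE scheme; the function names `to_csd` and `csd_multiply` and the sample coefficient 119 are hypothetical.

```python
def to_csd(value):
    """CSD digits of a non-negative integer, least significant first, each in {-1, 0, 1}."""
    digits = []
    while value != 0:
        if value & 1:
            # value % 4 == 1 gives digit +1, value % 4 == 3 gives digit -1;
            # this choice guarantees that the next digit is zero.
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(x, coeff):
    """Multiply x by a constant using only shifts, additions and subtractions."""
    acc = 0
    for shift, digit in enumerate(to_csd(coeff)):
        if digit == 1:
            acc += x << shift
        elif digit == -1:
            acc -= x << shift
    return acc


coeff = 119                        # made-up constant, binary 1110111 (six ones)
digits = to_csd(coeff)             # 128 - 8 - 1 needs only three nonzero digits
nonzero = sum(1 for d in digits if d != 0)
product = csd_multiply(13, coeff)  # same result as 13 * 119
```

Each nonzero CSD digit costs one adder or subtractor in a constant multiplier, so going from six nonzero bits to three nonzero digits roughly halves the add/subtract count for this particular constant.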