ASIC Implementation of A High Throughput, Low Latency, Memory Optimized FFT Processor
The rapid advancements in semiconductor technology have led to constant shrinking of transistor sizes as per Moore's Law. Wireless communications is one field which has seen explosive growth, thanks to the cramming of more transistors into a single chip. Design of these systems involve trade-offs between performance, area and power. Fast Fourier Transform is an important component in most of the wireless communication systems. FFTs are widely used in applications like OFDM transceivers, Spectrum sensing in Cognitive Radio, Image Processing, Radar Signal Processing etc. FFT is the most compute intensive and time consuming operation in most of the above applications. It is always a challenge to develop an architecture which gives high throughput while reducing the latency without much area overhead. Next generation wireless systems demand high transmission efficiency and hence FFT processor should be capable of doing computations much faster. Architectures based on smaller radices for computing longer FFTs are inefficient. In this thesis, a fully parallel unrolled FFT architecture based on novel radix-4 engine is proposed which is catered for wide range of applications. The radix-4 butterfly unit takes all four inputs in parallel and can selectively produce one out of the four outputs. The proposed architecture uses Radix-4^3 and Radix-4^4 algorithms for computation of various FFTs. The Radix-4^4 block can take all 256 inputs in parallel and can use the select control signals to generate one out of the 256 outputs. In existing Cooley-Tukey architectures, the output from each stage has to be reordered before the next stage can start computation. This needs intermediate storage after each stage. In our architecture, each stage can directly generate the reordered outputs and hence reduce these buffers. A solution for output reordering problem in Radix-4^3 and Radix-4^4 FFT architectures are also discussed in this work. Although the hardware complexity in terms of adders and multipliers are increased in our architecture, a significant reduction in intermediate memory requirement is achieved. FFTs of varying sizes starting from 64 point to 64K point have been implemented in ASIC using UMC 130nm CMOS technology. The data representation used in this work is fixed point format and selected word length is 16 bits to get maximum Signal to Quantization Noise Ratio (SQNR). The architecture has been found to be more suitable for computing FFT of large sizes. For 4096 point and 64K point FFTs, this design gives comparable throughput with considerable reduction in area and latency when compared to the state-of-art implementations. The 64K point FFT architecture resulted in a throughput of 1332 mega samples per second with an area of 171.78 mm^2 and total power of 10.7W at 333 MHz.