fpga-fft

A highly optimized streaming FFT core based on Bailey's 4-step large FFT algorithm: https://www.nas.nasa.gov/assets/pdf/techreports/1989/rnr-89-004.pdf

Data input/output are continuous with no gaps between frames.

Currently only supporting power-of-two sizes and fixed point data.

Resource usage is on par with Xilinx FFT IP core, and Fmax is up to 30% higher for common sizes.

Zynq-7000

Name	Configuration	Device	LUTs	FFs	RAMB36	DSP48E1	Fmax
fft1024	24b data, 17b twiddle, rounded	XC7Z010-1	1648	4087	2	16	350 MHz
fft1024_wide	32b data, 24b twiddle, rounded	XC7Z010-1	2508	6096	3	32	310 MHz
fft1024_spdf_wide	32b data, 24b twiddle, rounded	XC7Z010-1	3259	7101	4	32	310 MHz
fft4096	24b data, 17b twiddle, rounded	XC7Z010-1	2073	5016	6	20	343 MHz

Kintex-7

Name	Configuration	Device	LUTs	FFs	RAMB36	DSP48E1	Fmax
fft1024	24b data, 17b twiddle, rounded	XC7K160T-1	1653	4087	2	16	458 MHz⁽¹⁾
fft4096	24b data, 17b twiddle, rounded	XC7K160T-1	2098	4942	6	20	458 MHz⁽¹⁾
fft8192	24b data, 17b twiddle, rounded	XC7K160T-1	2279	5760	14	24	458 MHz⁽¹⁾
fft16k_2	24b data, 17b twiddle, rounded	XC7K160T-1	2428	6563	29	28	458 MHz⁽¹⁾
fft32k_wide	32b data, 24b twiddle, rounded	XC7K160T-1	3982	10245	73	56	458 MHz⁽¹⁾

Kintex Ultrascale

Name	Configuration	Device	LUTs	RAMB36	DSP48E1	Fmax
fft4096	24b data, 17b twiddle, rounded	XCKU025-1	2071	9	20	525 MHz⁽¹⁾⁽²⁾
fft8192_wide	24b data, 24b twiddle, rounded	XCKU025-1	3137	16.5	48	525 MHz⁽¹⁾⁽²⁾
fft32k_wide	32b data, 24b twiddle, rounded	XCKU025-1	4222	76	56	501 MHz⁽²⁾

⁽¹⁾ Bottlenecked by block ram maximum frequency.

⁽²⁾ Additional contraints are required on BRAM synthesis. See below.

⁽³⁾ Fmax numbers are based on Vivado (2019.1) timing analysis with "Performance_Explore" synthesis strategy.

Architecture

The basic architecture is based on subdividing a size N = N1*N2 FFT into N2 FFTs of size N1 followed by reordering and multiplication by twiddle factors, then N1 FFTs of size N2.

Usage

Top level VHDL code is generated by the script codegen/gen_fft.py. The VHDL sub-blocks in this repository are referenced by the generated code.

To generate a custom FFT size, edit the FFT layout definitions in codegen/gen_fft_layouts.py.

A layout definition looks like this:

fft4096 = \
	FFT4Step(4096,
		FFT4Step(64,
			FFT4Step(16, 
				fft4_scale_none,
				fft4_scale_none),
			fft4_scale_none),
		FFT4Step(64,
			FFT4Step(16, 
				fft4_scale_div_n,
				fft4_scale_div_n),
			fft4_scale_div_n),
		16); # twiddleBits

fft4_scale_none is a base FFT instance (butterfly), and FFT4Step represents the combination of two sub-FFTs to form a larger FFT of size N1*N2. See user guide for more details.

To use a generated FFT core it is necessary to generate all the twiddle ROM sizes used (twiddle ROM size is equal to N). For N <= 32 use gen_twiddle_rom_simple.py, and gen_twiddle_rom.py otherwise. On Linux the shell script gen_roms.sh will generate all common twiddle bit sizes and depths.

Data input and output order are described as an address bit permutation. The exact permutation varies by layout and can be found in the comments at the top of generated files. The do_gen_fft script will generate reorderers and natural order wrappers in addition to the FFT core.

When composing a FFT layout prefer 4-FFT butterflies over 2-FFT. Several 4-FFT butterfly implementations are provided, each with a different tradeoff between area and speed. A 2-FFT butterfly can be used for FFT sizes that are not perfect square.

Timing constraints

A few multi-cycle timing constraints are required (because some inner butterfly implementations deserialize the data and present data every 2 or 4 cycles to the butterfly implementation):

set_multicycle_path -setup -start -from [get_pins -hierarchical *fftIn_mCycle*/C] -to [get_pins -hierarchical *fftOut_mCycle*/D] 4
set_multicycle_path -hold -start -from [get_pins -hierarchical *fftIn_mCycle*/C] -to [get_pins -hierarchical *fftOut_mCycle*/D] 3

set_multicycle_path -setup -start -from [get_pins -hierarchical *fftIn_2Cycle*/C] -to [get_pins -hierarchical *fftOut_2Cycle*/D] 2
set_multicycle_path -hold -start -from [get_pins -hierarchical *fftIn_2Cycle*/C] -to [get_pins -hierarchical *fftOut_2Cycle*/D] 1

For Ultrascale parts you may also need to force the use of width expansion rather than depth expansion of BRAMs to meet timing. After running implementation look for failed timing paths that pass through a series of BRAMs, and constrain those BRAMs to disable cascading. The constraints will look like this:

set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_0]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_1]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_2]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_3]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_4]
set_property CASCADE_ORDER_B NONE [get_cells fft/top_core/transp/ram/ram1_reg_bram_5]

The exact instance names will need to be adjusted based on the timing report. See https://forums.xilinx.com/t5/UltraScale-Architecture/Prevent-Block-Ram-Cascade-Chain/td-p/645310

Name		Name	Last commit message	Last commit date
Latest commit History 158 Commits
axi-util @ cfcb30f		axi-util @ cfcb30f
codegen		codegen
docs		docs
generated/fft1024_wide		generated/fft1024_wide
gowin		gowin
sw		sw
tests		tests
xilinx		xilinx
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
barrel_shifter.vhd		barrel_shifter.vhd
complex_multiply.vhd		complex_multiply.vhd
complex_multiply2.vhd		complex_multiply2.vhd
complex_multiply_large.vhd		complex_multiply_large.vhd
complex_ram.vhd		complex_ram.vhd
complex_ram_delay.vhd		complex_ram_delay.vhd
complex_ram_lut.vhd		complex_ram_lut.vhd
fft2_no_pipeline.vhd		fft2_no_pipeline.vhd
fft2_serial.vhd		fft2_serial.vhd
fft2_serial2.vhd		fft2_serial2.vhd
fft4.vhd		fft4.vhd
fft4_no_pipeline.vhd		fft4_no_pipeline.vhd
fft4_serial3.vhd		fft4_serial3.vhd
fft4_serial4.vhd		fft4_serial4.vhd
fft4_serial4_bf.vhd		fft4_serial4_bf.vhd
fft4_serial5.vhd		fft4_serial5.vhd
fft4_serial5_natural.vhd		fft4_serial5_natural.vhd
fft4_serial6.vhd		fft4_serial6.vhd
fft4_serial7.vhd		fft4_serial7.vhd
fft4_serial8.vhd		fft4_serial8.vhd
fft4_serial8_bf.vhd		fft4_serial8_bf.vhd
fft4_serial9.vhd		fft4_serial9.vhd
fft4_serial9_bf.vhd		fft4_serial9_bf.vhd
fft_spdf_halfstage.vhd		fft_spdf_halfstage.vhd
fft_spdf_stage.vhd		fft_spdf_stage.vhd
fft_types.vhd		fft_types.vhd
multiply_add.vhd		multiply_add.vhd
reorder_buffer.vhd		reorder_buffer.vhd
spdf_butterfly.vhd		spdf_butterfly.vhd
sr.vhd		sr.vhd
transposer.vhd		transposer.vhd
transposer4.vhd		transposer4.vhd
transposer_addr_gen.vhd		transposer_addr_gen.vhd
twiddle_addr_gen.vhd		twiddle_addr_gen.vhd
twiddle_addr_gen_large.vhd		twiddle_addr_gen_large.vhd
twiddle_generator.vhd		twiddle_generator.vhd
twiddle_generator_1m.vhd		twiddle_generator_1m.vhd
twiddle_generator_large.vhd		twiddle_generator_large.vhd
twiddle_multiplier.vhd		twiddle_multiplier.vhd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

fpga-fft

Architecture

Usage

About

Releases

Packages

Languages

License

owocomm-0/fpga-fft

Folders and files

Latest commit

History

Repository files navigation

fpga-fft

Architecture

Usage

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages