A permutation-free mixed-radix FFT library in C with hand-tuned AVX2/AVX-512 codelets.
Beats or matches Intel MKL and FFTW on every tested size. No external dependencies.
Platform: Intel Core i9-14900KF, 48 KB L1d, DDR5, AVX2, single-threaded
Competitors: FFTW 3.3.10 (FFTW_MEASURE), Intel MKL (sequential)
| N | K | vs FFTW | vs MKL |
|---|---|---|---|
| 60 | 256 | 4.15x | 4.86x |
| 49 | 32 | 4.73x | 4.19x |
| 200 | 256 | 3.13x | 3.30x |
| 1000 | 32 | 2.49x | 2.63x |
| 10000 | 256 | 6.83x | 1.38x |
| 50000 | 256 | 12.82x | 1.42x |
| 100000 | 32 | 6.92x | 1.54x |
42 wins, 1 tie, 0 losses against Intel MKL across 43 test cases (N=49 to 100,000, K=32 to 1024).
All three libraries achieve comparable accuracy against a brute-force O(N^2) DFT reference. VectorFFT's errors are 1.3-2.5x higher than FFTW's/MKL's because the multi-stage stride-based decomposition performs more intermediate twiddle multiplications, but they remain well within double-precision tolerance and follow the theoretical O(N * epsilon * log N) bound.
Roundtrip error (fwd + bwd / N) is at machine epsilon (~1e-16) for all sizes — the permutation-free architecture guarantees perfect cancellation.
Run `vfft_bench` to see accuracy results for your hardware.
VectorFFT uses a permutation-free stride-based Cooley-Tukey architecture:
- DIT forward + DIF backward -- roundtrip cancels digit-reversal permutation
- Zero permutation passes, zero scratch buffers -- fully in-place, single buffer
- 18 hand-optimized radixes (2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 16, 17, 19, 20, 25, 32, 64)
- Method C fused twiddles -- bakes common factor into per-leg twiddle table at plan time
- Log3 twiddle derivation -- derives W^2..W^(R-1) from W^1 via cmul chain when table overflows L1
- Split-complex layout -- separate `re[]`/`im[]` arrays, SIMD-friendly
All codelets are generated by Python scripts with ISA-specific scheduling (register allocation, spill management, FMA pipelining). Prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman optimal register allocation with explicit spill management to fit AVX2's 16 YMM registers.
```sh
git clone https://github.com/Tugbars/VectorFFT.git
cd VectorFFT

# Configure
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release

# Generate codelets (requires Python 3)
cmake --build . --target generate_codelets

# Build everything
cmake --build . --config Release
```

```c
#include "planner.h"

// Initialize
stride_registry_t reg;
stride_registry_init(&reg);

// Plan (heuristic -- instant)
stride_plan_t *plan = stride_auto_plan(N, K, &reg);

// Execute (in-place, split-complex)
stride_execute_fwd(plan, re, im); // forward
stride_execute_bwd(plan, re, im); // backward (divide output by N to normalize)

stride_plan_destroy(plan);
```

Exhaustive search finds the optimal factorization for each (N, K) pair:
```c
stride_wisdom_t wis;
stride_wisdom_init(&wis);
stride_wisdom_load(&wis, "vfft_wisdom.txt");

// Use cached optimal plan, or fall back to heuristic
stride_plan_t *plan = stride_wise_plan(N, K, &reg, &wis);

// Calibrate new (N, K) pairs
stride_wisdom_calibrate(&wis, N, K, &reg);
stride_wisdom_save(&wis, "vfft_wisdom.txt");
```

```
src/stride-fft/
  core/           Runtime engine (header-only)
    executor.h    In-place stride-based executor (Method C)
    planner.h     Top-level API: auto_plan, wise_plan, exhaustive_plan
    registry.h    ISA-aware codelet registry (AVX-512 > AVX2 > scalar)
    factorizer.h  CPU-aware heuristic factorizer
    exhaustive.h  Exhaustive factorization search
    compat.h      Portable timer + aligned alloc
  codelets/       Generated SIMD headers (~150k lines)
    avx2/         47 AVX2 codelet headers
    avx512/       47 AVX-512 codelet headers
    scalar/       47 scalar fallback headers
  generators/     Python codelet generators
    gen_radix2.py .. gen_radix64.py
    generate_all.bat
  bench/
    bench_planner.c  Full benchmark (vs FFTW + MKL)
    plot_results.py  Comparison graphs
```
Traditional FFT libraries (FFTW, MKL) use Cooley-Tukey with a digit-reversal permutation pass -- an O(N) memory shuffle that doesn't compute anything but costs cache misses.
VectorFFT eliminates this entirely:
- Forward (DIT): stages process data at increasing strides, output lands in digit-reversed order
- Backward (DIF): stages process in reverse order, naturally undoing the permutation
- Roundtrip (fwd + bwd): produces natural-order output with zero permutation overhead
This saves one full pass over the data per transform; at large N, the avoided pass and its cache misses account for much of the up-to-13x advantage over FFTW.
- FFTW by Matteo Frigo and Steven G. Johnson -- the gold standard for decades. VectorFFT's prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman register allocation with explicit spill management to minimize register pressure on AVX2 (16 YMM) and AVX-512 (32 ZMM).
- VkFFT by Dmitrii Tolmachev -- inspiration for the benchmarking methodology and presentation style.


