Tuning VkFFT

This is a quick demonstration of how to tune low-level VkFFT parameters to achieve the best possible performance, illustrated here on an Apple M1 Pro GPU.

Remember: this is only useful for intensive applications, e.g. when using FFTs during a long iterative process. Otherwise, tuning is usually overkill!

Imports & test data

Let’s try a 2D transform of a (250,250,250) array, i.e. a batch of 250 2D transforms over the last two axes:

[1]:
import timeit
import numpy as np
import pyopencl as cl
import pyopencl.array as cla
from pyvkfft.fft import fftn, ifftn
from pyvkfft.opencl import VkFFTApp
from pyvkfft.benchmark import bench_pyvkfft_opencl
[2]:
ctx = cl.create_some_context()
gpu_name = ctx.devices[0].name
print("GPU:", gpu_name)
cq = cl.CommandQueue(ctx)

n = 250
GPU: Apple M1 Pro

Using the benchmark function

This function runs the tests in a separate process, which should avoid problems with GPU resources being consumed by earlier tests. The drawback is that it is relatively slow, since the GPU context must be re-initialised for every test.

[3]:
res = bench_pyvkfft_opencl((n,n,n),ndim=2,gpu_name=gpu_name)
print(f"Speed with default parameters: {res[1]:6.1f} Gbytes/s")
Speed with default parameters:  107.4 Gbytes/s
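To give a feel for what a figure in Gbytes/s means for an FFT, here is a minimal sketch of one plausible accounting. It assumes that each transformed axis costs one read and one write of the whole array, and that 1 Gbyte is 1024**3 bytes. Both are assumptions made for illustration, not taken from the benchmark's source, and the function name and the 5 ms timing are hypothetical.

```python
import math


def effective_gbytes_per_s(shape, itemsize, ndim, seconds_per_transform):
    """Estimate the effective memory throughput of one FFT.

    Assumption: every transformed axis reads and writes the full array once.
    """
    nbytes = math.prod(shape) * itemsize   # size of the array in bytes
    moved = nbytes * 2 * ndim              # (read + write) per transformed axis
    return moved / seconds_per_transform / 1024**3


# (250,250,250) complex64 array (8 bytes per element), 2D transform,
# with a placeholder time of 5 ms per transform (not a measurement).
rate = effective_gbytes_per_s((250, 250, 250), 8, 2, 0.005)
print(f"{rate:6.1f} Gbytes/s")
```

With this accounting the figure is independent of the number of repeats, which makes results comparable between runs of different lengths.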

Now try changing the coalescedMemory parameter (the default is 32 for NVIDIA/AMD, 64 for others), testing 4 values:

[4]:
args = {'tune_config':{'backend':'pyopencl',
                       'coalescedMemory':[16,32,64,128]}}
res = bench_pyvkfft_opencl((n,n,n),ndim=2,gpu_name=gpu_name, args=args)
print(f"Speed: {res[1]:6.1f} Gbytes/s")

Speed:  109.3 Gbytes/s

This did not help on the M1 Pro: no real improvement (109.3 vs 107.4 Gbytes/s).

Let’s instead try tuning the aimThreads parameter (the default is 128).

[5]:
args = {'tune_config':{'backend':'pyopencl',
                       'aimThreads':[32, 64, 128]}}
res = bench_pyvkfft_opencl((n,n,n),ndim=2,gpu_name=gpu_name, args=args)
print(f"Speed: {res[1]:6.1f} Gbytes/s")

Speed:  156.9 Gbytes/s

Much better: about 46% faster!
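The gain can be checked directly from the two throughput figures printed above (107.4 Gbytes/s with default parameters, 156.9 Gbytes/s after tuning aimThreads). This is plain arithmetic on those reported numbers, and the variable names are only illustrative.

```python
# Throughputs reported by the benchmark cells above, in Gbytes/s
default_gbytes = 107.4
tuned_gbytes = 156.9

speedup = tuned_gbytes / default_gbytes
print(f"{speedup:.2f}x, i.e. {100 * (speedup - 1):.0f}% higher throughput")
```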

Using the simple FFT interface

Some default tuning options can be used just by passing tune=True to the simple fft API functions.

This will automatically test a few parameters (depending on the GPU) and choose the one yielding the best speed. This was tested on a few types of GPUs.
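The idea of "test a few parameters and keep the fastest" can be sketched in a few lines. This is a generic illustration, not pyvkfft's implementation: `pick_fastest` and `measure` are hypothetical names, and the timings in the example are made-up placeholders, not measurements.

```python
def pick_fastest(candidates, measure):
    """Return (best_value, timings), where best_value minimises measure(value).

    `measure` is any callable returning the time in seconds for one candidate.
    """
    timings = {value: measure(value) for value in candidates}
    best = min(timings, key=timings.get)
    return best, timings


# Placeholder timings (seconds) standing in for real GPU measurements
fake_seconds = {32: 0.31, 64: 0.27, 128: 0.41}

best, timings = pick_fastest([32, 64, 128], fake_seconds.__getitem__)
print("Fastest candidate:", best)
```

In a real setting, `measure` would build the transform with the given value and time a few executions, which is why tuning has an up-front cost that only pays off for repeated transforms.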

Let’s try first without tuning:

[6]:
a = cla.empty(cq, (n, n, n), dtype=np.complex64)

cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = fftn(a,a, ndim=2)
cq.finish()
dt = timeit.default_timer()-t0
print(f"Without tuning: dt={dt:8.5f}s")
Without tuning: dt= 0.42657s
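The timing pattern used in this cell (synchronise, start the timer, loop, synchronise, stop the timer) is repeated in every cell that follows, so it can be wrapped in a small helper. This is only a sketch: `time_loop` is a hypothetical name, not part of pyvkfft, and it takes the work and the synchronisation as plain callables so that it does not depend on any GPU library.

```python
import timeit


def time_loop(step, sync, repeats=100):
    """Return the wall-clock seconds spent on `repeats` calls to `step`.

    `sync` must block until all queued work is finished (e.g. cq.finish).
    """
    sync()                       # drain pending work so it is not counted
    t0 = timeit.default_timer()
    for _ in range(repeats):
        step()
    sync()                       # wait for the queued transforms to complete
    return timeit.default_timer() - t0


# Intended use with the objects defined in this notebook:
#   dt = time_loop(lambda: fftn(a, a, ndim=2), cq.finish)
#   print(f"Without tuning: dt={dt:8.5f}s")
```

The two synchronisation calls matter because GPU work is queued asynchronously: without them the timer would mostly measure how long it takes to enqueue the transforms.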

Now with tuning (we call it twice: the first call performs the tuning and caches the result, so only the tuned transform is timed):

[7]:
a = fftn(a,a, ndim=2, tune=True)
cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = fftn(a,a, ndim=2, tune=True)
cq.finish()
dt = timeit.default_timer()-t0
print(f"With tuning: dt={dt:8.5f}s")

With tuning: dt= 0.27237s

Using the VkFFTApp API

This allows you to either:

  • choose a set of parameters to tune (similarly to tune=True in the simple fft API)

  • or pass some parameters directly

Let’s try first without tuning:

[8]:
a = cla.zeros(cq, (n, n, n), dtype=np.complex64)
app = VkFFTApp(a.shape, a.dtype, cq, ndim=2, inplace=True)

cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = app.fft(a,a)
cq.finish()
dt = timeit.default_timer()-t0
print(f"Without tuning: dt={dt:8.5f}s")

Without tuning: dt= 0.40874s

Now with automatic tuning. The tuning is done immediately when the VkFFTApp is created, using temporary arrays.

[9]:
a = cla.zeros(cq, (n, n, n), dtype=np.complex64)
app = VkFFTApp(a.shape, a.dtype, cq, ndim=2, inplace=True,
               tune_config={'backend':'pyopencl',
                            'aimThreads':[32, 64, 128]})

cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = app.fft(a,a)
cq.finish()
dt = timeit.default_timer()-t0
print(f"With auto-tuning: dt={dt:8.5f}s")

With auto-tuning: dt= 0.27309s

The other approach is to pass the known optimised parameter directly:

[10]:
a = cla.empty(cq, (n, n, n), dtype=np.complex64)
app = VkFFTApp(a.shape, a.dtype, cq, ndim=2, inplace=True, aimThreads=64)

cq.finish()
t0 = timeit.default_timer()
for i in range(100):
    a = app.fft(a,a)
cq.finish()
dt = timeit.default_timer()-t0
print(f"With tuned parameter: dt={dt:8.5f}s")

With tuned parameter: dt= 0.27962s
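Once a good value is known for a given GPU, it can be kept in a small lookup table and reused without re-running the tuning. The sketch below is an assumption about how one might organise this: `KNOWN_TUNING` and `tuned_kwargs` are hypothetical names, and the only entry is the aimThreads=64 value used above for this transform on the Apple M1 Pro.

```python
# Hypothetical table of tuned parameters, keyed by the GPU name reported by
# OpenCL. The single entry is the value used in this notebook; other GPUs
# and other transform shapes may need different values.
KNOWN_TUNING = {
    "Apple M1 Pro": {"aimThreads": 64},
}


def tuned_kwargs(gpu_name):
    """Return extra keyword arguments for a known GPU, or {} to use defaults."""
    return dict(KNOWN_TUNING.get(gpu_name, {}))


print(tuned_kwargs("Apple M1 Pro"))
print(tuned_kwargs("Some other GPU"))

# Intended use with the objects defined in this notebook:
#   app = VkFFTApp(a.shape, a.dtype, cq, ndim=2, inplace=True,
#                  **tuned_kwargs(gpu_name))
```

Returning an empty dictionary for unknown devices leaves VkFFT's defaults in place, so the same code runs unchanged on GPUs that have not been tuned.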