doc/threads.texi

@node Multi-threaded FFTW, Distributed-memory FFTW with MPI, FFTW Reference, Top
@chapter Multi-threaded FFTW

@cindex parallel transform
In this chapter we document the parallel FFTW routines for
shared-memory parallel hardware.  These routines, which support
parallel one- and multi-dimensional transforms of both real and
complex data, are the easiest way to take advantage of multiple
processors with FFTW.  They work just like the corresponding
uniprocessor transform routines, except that you have an extra
initialization routine to call, and there is a routine to set the
number of threads to employ.  Any program that uses the uniprocessor
FFTW can therefore be trivially modified to use the multi-threaded
FFTW.

A shared-memory machine is one in which all CPUs can directly access
the same main memory, and such machines are now common due to the
ubiquity of multi-core CPUs.  FFTW's multi-threading support allows
you to utilize these additional CPUs transparently from a single
program.  However, this does not necessarily translate into
performance gains---when multiple threads/CPUs are employed, there is
an overhead required for synchronization that may outweigh the
computatational parallelism.  Therefore, you can only benefit from
threads if your problem is sufficiently large.
@cindex shared-memory
@cindex threads

@menu
* Installation and Supported Hardware/Software::
* Usage of Multi-threaded FFTW::
* How Many Threads to Use?::
* Thread safety::
@end menu

@c ------------------------------------------------------------
@node Installation and Supported Hardware/Software, Usage of Multi-threaded FFTW, Multi-threaded FFTW, Multi-threaded FFTW
@section Installation and Supported Hardware/Software

All of the FFTW threads code is located in the @code{threads}
subdirectory of the FFTW package.  On Unix systems, the FFTW threads
libraries and header files can be automatically configured, compiled,
and installed along with the uniprocessor FFTW libraries simply by
including @code{--enable-threads} in the flags to the @code{configure}
script (@pxref{Installation on Unix}), or @code{--enable-openmp} to use
@uref{http://www.openmp.org,OpenMP} threads.
@fpindex configure


@cindex portability
@cindex OpenMP
The threads routines require your operating system to have some sort
of shared-memory threads support.  Specifically, the FFTW threads
package works with POSIX threads (available on most Unix variants,
from GNU/Linux to MacOS X) and Win32 threads.  OpenMP threads, which
are supported in many common compilers (e.g. gcc) are also supported,
and may give better performance on some systems.  (OpenMP threads are
also useful if you are employing OpenMP in your own code, in order to
minimize conflicts between threading models.)  If you have a
shared-memory machine that uses a different threads API, it should be
a simple matter of programming to include support for it; see the file
@code{threads/threads.c} for more detail.

You can compile FFTW with @emph{both} @code{--enable-threads} and
@code{--enable-openmp} at the same time, since they install libraries
with different names (@samp{fftw3_threads} and @samp{fftw3_omp}, as
described below).  However, your programs may only link to @emph{one}
of these two libraries at a time.

Ideally, of course, you should also have multiple processors in order to
get any benefit from the threaded transforms.

@c ------------------------------------------------------------
@node Usage of Multi-threaded FFTW, How Many Threads to Use?, Installation and Supported Hardware/Software, Multi-threaded FFTW
@section Usage of Multi-threaded FFTW

Here, it is assumed that the reader is already familiar with the usage
of the uniprocessor FFTW routines, described elsewhere in this manual.
We only describe what one has to change in order to use the
multi-threaded routines.

@cindex OpenMP
First, programs using the parallel complex transforms should be linked
with @code{-lfftw3_threads -lfftw3 -lm} on Unix, or @code{-lfftw3_omp
-lfftw3 -lm} if you compiled with OpenMP. You will also need to link
with whatever library is responsible for threads on your system
(e.g. @code{-lpthread} on GNU/Linux) or include whatever compiler flag
enables OpenMP (e.g. @code{-fopenmp} with gcc).
@cindex linking on Unix


Second, before calling @emph{any} FFTW routines, you should call the
function:

@example
int fftw_init_threads(void);
@end example
@findex fftw_init_threads

This function, which need only be called once, performs any one-time
initialization required to use threads on your system.  It returns zero
if there was some error (which should not happen under normal
circumstances) and a non-zero value otherwise.

Third, before creating a plan that you want to parallelize, you should
call:

@example
void fftw_plan_with_nthreads(int nthreads);
@end example
@findex fftw_plan_with_nthreads

The @code{nthreads} argument indicates the number of threads you want
FFTW to use (or actually, the maximum number).  All plans subsequently
created with any planner routine will use that many threads.  You can
call @code{fftw_plan_with_nthreads}, create some plans, call
@code{fftw_plan_with_nthreads} again with a different argument, and
create some more plans for a new number of threads.  Plans already created
before a call to @code{fftw_plan_with_nthreads} are unaffected.  If you
pass an @code{nthreads} argument of @code{1} (the default), threads are
disabled for subsequent plans.

@cindex OpenMP
With OpenMP, to configure FFTW to use all of the currently running
OpenMP threads (set by @code{omp_set_num_threads(nthreads)} or by the
@code{OMP_NUM_THREADS} environment variable), you can do:
@code{fftw_plan_with_nthreads(omp_get_max_threads())}. (The @samp{omp_}
OpenMP functions are declared via @code{#include <omp.h>}.)

@cindex thread safety
Given a plan, you then execute it as usual with
@code{fftw_execute(plan)}, and the execution will use the number of
threads specified when the plan was created.  When done, you destroy
it as usual with @code{fftw_destroy_plan}.  As described in
@ref{Thread safety}, plan @emph{execution} is thread-safe, but plan
creation and destruction are @emph{not}: you should create/destroy
plans only from a single thread, but can safely execute multiple plans
in parallel.

There is one additional routine: if you want to get rid of all memory
and other resources allocated internally by FFTW, you can call:

@example
void fftw_cleanup_threads(void);
@end example
@findex fftw_cleanup_threads

which is much like the @code{fftw_cleanup()} function except that it
also gets rid of threads-related data.  You must @emph{not} execute any
previously created plans after calling this function.

We should also mention one other restriction: if you save wisdom from a
program using the multi-threaded FFTW, that wisdom @emph{cannot be used}
by a program using only the single-threaded FFTW (i.e. not calling
@code{fftw_init_threads}).  @xref{Words of Wisdom-Saving Plans}.

@c ------------------------------------------------------------
@node How Many Threads to Use?, Thread safety, Usage of Multi-threaded FFTW, Multi-threaded FFTW
@section How Many Threads to Use?

@cindex number of threads
There is a fair amount of overhead involved in synchronizing threads,
so the optimal number of threads to use depends upon the size of the
transform as well as on the number of processors you have.

As a general rule, you don't want to use more threads than you have
processors.  (Using more threads will work, but there will be extra
overhead with no benefit.)  In fact, if the problem size is too small,
you may want to use fewer threads than you have processors.

You will have to experiment with your system to see what level of
parallelization is best for your problem size.  Typically, the problem
will have to involve at least a few thousand data points before threads
become beneficial.  If you plan with @code{FFTW_PATIENT}, it will
automatically disable threads for sizes that don't benefit from
parallelization.
@ctindex FFTW_PATIENT

@c ------------------------------------------------------------
@node Thread safety,  , How Many Threads to Use?, Multi-threaded FFTW
@section Thread safety

@cindex threads
@cindex OpenMP
@cindex thread safety
Users writing multi-threaded programs (including OpenMP) must concern
themselves with the @dfn{thread safety} of the libraries they
use---that is, whether it is safe to call routines in parallel from
multiple threads.  FFTW can be used in such an environment, but some
care must be taken because the planner routines share data
(e.g. wisdom and trigonometric tables) between calls and plans.

The upshot is that the only thread-safe (re-entrant) routine in FFTW is
@code{fftw_execute} (and the new-array variants thereof).  All other routines
(e.g. the planner) should only be called from one thread at a time.  So,
for example, you can wrap a semaphore lock around any calls to the
planner; even more simply, you can just create all of your plans from
one thread.  We do not think this should be an important restriction
(FFTW is designed for the situation where the only performance-sensitive
code is the actual execution of the transform), and the benefits of
shared data between plans are great.

Note also that, since the plan is not modified by @code{fftw_execute},
it is safe to execute the @emph{same plan} in parallel by multiple
threads.  However, since a given plan operates by default on a fixed
array, you need to use one of the new-array execute functions (@pxref{New-array Execute Functions}) so that different threads compute the transform of different data.

(Users should note that these comments only apply to programs using
shared-memory threads or OpenMP.  Parallelism using MPI or forked processes
involves a separate address-space and global variables for each process,
and is not susceptible to problems of this sort.)

If you are configured FFTW with the @code{--enable-debug} or
@code{--enable-debug-malloc} flags (@pxref{Installation on Unix}),
then @code{fftw_execute} is not thread-safe.  These flags are not
documented because they are intended only for developing
and debugging FFTW, but if you must use @code{--enable-debug} then you
should also specifically pass @code{--disable-debug-malloc} for
@code{fftw_execute} to be thread-safe.