The Parallel Architectures Library (PAL) is a compact C library with optimized routines for vector math, synchronization, and multi-processor communication.
- Why?
- Design goals
- License
- Contribution Wanted!
- A Simple Example
- Library API reference
6.0 Syntax
6.1 Program Flow
6.2 Data Movement
6.3 Synchronization
6.4 Basic Math
6.5 Basic DSP
6.6 Image Processing
6.7 FFT (FFTW)
6.8 Linear Algebra (BLAS)
6.9 System Calls
##Why?
As hard as we tried, we could not find libraries that were a perfect fit for our design criteria. There are a number of projects and commercial products that offer most of the functionality of PAL, but all of the existing offerings were either far too bulky or had the wrong license. In essence, the goal of the PAL effort is to provide extensions to the standard set of C libraries to address the trend towards massive multi-processor parallelism and SIMD computing.
##Design Goals
- Fast (super fast, but not always safe)
- Compact (as small as possible, to fit on processors that have less than 32KB of RAM)
- Scalable (thread and data scalable, limited only by the amount of local memory)
- Portable across platforms (deployable across different ISAs and system architectures)
- Permissive license (Apache 2.0 license to maximize overall use)
##License
The PAL source code is licensed under the Apache License, Version 2.0. See LICENSE for the full license text, unless otherwise specified.
##Contribution
Our goal is to make PAL a broad community project from day one. Some of these functions are tricky, but the biggest challenge with the PAL library is really the volume of simple functions. The good news is that if just 100 people contribute one function each, we'll be done in a couple of days! If you know C, you are ready to contribute!
Instructions for contributing can be found HERE.
##A Simple Example
Manager Code

```c
#include "pal_base.h"
#include <stdio.h>
#define N 16
int main (int argc, char *argv[]){
    //Stack variables
    int status, i, all=0, nargs=1;
    char *file="./hello_task.elf";
    char *func="main";
    char *args[nargs];
    char argbuf[20];
    // Handles to opaque structures
    p_dev_t dev0;
    p_prog_t prog0;
    p_team_t team0;
    p_mem_t mem[4];
    //Execution setup
    dev0 = p_init(P_DEMO, 0);            // initialize device and team
    prog0 = p_load(dev0, file, func, 0); // load a program from file system
    all = p_query(dev0, P_NODES, all);   // find number of nodes in system
    p_open(dev0, &team0, 0, all);        // create a team
    //Running program
    for(i=0;i<all;i++){
        sprintf(argbuf, "%d", i);        // string args needed to run main as-is
        args[0]=argbuf;
        p_run(prog0, team0, i, 1, nargs, args, 0);
    }
    p_wait(team0);    // wait for team to finish (not needed: p_run() is blocking by default)
    p_close(team0);   // close team
    p_finalize(dev0); // finalize memory
    return 0;
}
```
Worker Code (hello_task.elf)

```c
#include <stdio.h>
#include <stdlib.h>   // atoi()
int main(int argc, char* argv[]){
    int pid=0;
    if (argc > 2)
        pid=atoi(argv[2]);   // processor id, passed as a string by the manager
    printf("--Processor %d says hello!--\n", pid);
    return 0;
}
```
##PROGRAM FLOW
These program flow functions are used to manage the system and to execute programs. All opaque objects are referenced with simple integers.
FUNCTION | NOTES |
---|---|
p_init() | initialize the run time |
p_query() | query a device object |
p_load() | load binary elf file into memory |
p_run() | run a program on a team of processors |
p_open() | open a team of processors |
p_append() | add members to team |
p_remove() | remove members from team |
p_close() | close a team of processors |
p_barrier() | team barrier |
p_wait() | wait for team to finish |
p_fence() | memory fence |
p_finalize() | clean up the run time |
p_get_err() | get error code (if any) |
##MEMORY ALLOCATION
These functions create memory objects. Each allocation function returns a unique integer for the new memory object. This integer can then be passed to functions like p_read() and p_write() to access data within the memory object.
FUNCTION | NOTES |
---|---|
p_malloc() | allocate memory on local processor |
p_rmalloc() | allocate memory on remote processor |
p_free() | free memory |
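
The integer-handle idea described above can be sketched as a small table lookup. This is an illustration only, not PAL's implementation: the names `demo_malloc`, `demo_write`, `demo_read`, and `demo_free`, the fixed table size, and the error convention (-1 on failure) are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_OBJS 8   /* hypothetical table size */

/* One slot per memory object; the slot index is the integer handle. */
static struct { void *base; size_t size; } demo_objs[DEMO_MAX_OBJS];

/* Allocate a memory object. Returns a handle >= 0, or -1 on failure. */
int demo_malloc(size_t size)
{
    if (size == 0)
        return -1;
    for (int h = 0; h < DEMO_MAX_OBJS; h++) {
        if (demo_objs[h].base == NULL) {          /* free slot found */
            demo_objs[h].base = malloc(size);
            if (demo_objs[h].base == NULL)
                return -1;
            demo_objs[h].size = size;
            return h;
        }
    }
    return -1;                                    /* table is full */
}

/* Return 1 if [off, off + nbytes) lies inside the object behind handle h. */
static int demo_in_range(int h, size_t off, size_t nbytes)
{
    if (h < 0 || h >= DEMO_MAX_OBJS || demo_objs[h].base == NULL)
        return 0;
    return off <= demo_objs[h].size && nbytes <= demo_objs[h].size - off;
}

/* Copy nbytes from src into the object at byte offset off. 0 on success. */
int demo_write(int h, size_t off, const void *src, size_t nbytes)
{
    if (!demo_in_range(h, off, nbytes))
        return -1;
    memcpy((char *)demo_objs[h].base + off, src, nbytes);
    return 0;
}

/* Copy nbytes from the object at byte offset off into dst. 0 on success. */
int demo_read(int h, size_t off, void *dst, size_t nbytes)
{
    if (!demo_in_range(h, off, nbytes))
        return -1;
    memcpy(dst, (const char *)demo_objs[h].base + off, nbytes);
    return 0;
}

/* Release the object and mark its slot as free. */
void demo_free(int h)
{
    if (h < 0 || h >= DEMO_MAX_OBJS)
        return;
    free(demo_objs[h].base);
    demo_objs[h].base = NULL;
    demo_objs[h].size = 0;
}
```

Callers only ever hold the integer, so the same handle could refer to local or remote memory without changing the calling code. The bounds checks here favor clarity; a library that prioritizes speed over safety may well omit them.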
##DATA MOVEMENT
The data movement functions move blocks of data between opaque memory objects and locations specified by pointers. The memory object is specified by a simple integer. The exception is the p_memcpy() function, which copies blocks of bytes within a shared memory architecture only.
FUNCTION | NOTES |
---|---|
p_gather() | gather operation |
p_memcpy() | fast memcpy() |
p_read() | read from a memory object |
p_scatter() | scatter operation |
p_write() | write to a memory object |
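
As a rough illustration of scatter and gather, the sketch below assumes MPI-style semantics: a root buffer is split into equal-sized blocks, one per destination, and gather is the inverse. Whether PAL's p_scatter() and p_gather() follow exactly this model is not stated here, so treat the names `demo_scatter` and `demo_gather` and their parameters as hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Scatter: block i of src (block bytes each) is copied to dst[i]. */
void demo_scatter(const void *src, void *const dst[],
                  size_t nparts, size_t block)
{
    const char *s = src;
    for (size_t i = 0; i < nparts; i++)
        memcpy(dst[i], s + i * block, block);
}

/* Gather: the buffer behind src[i] is copied into block i of dst. */
void demo_gather(void *dst, void *const src[],
                 size_t nparts, size_t block)
{
    char *d = dst;
    for (size_t i = 0; i < nparts; i++)
        memcpy(d + i * block, src[i], block);
}
```

In a shared memory demo the per-destination buffers are plain pointers; on a real multi-processor target they would be replaced by whatever names a remote memory object.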
##SYNCHRONIZATION
The synchronization functions are useful for program sequencing and resource locking in shared memory systems.
FUNCTION | NOTES |
---|---|
p_mutex_lock() | lock a mutex |
p_mutex_trylock() | try locking a mutex once |
p_mutex_unlock() | unlock (clear) a mutex |
p_mutex_init() | initialize a mutex |
p_atomic_add() | atomic fetch and add |
p_atomic_sub() | atomic fetch and sub |
p_atomic_and() | atomic fetch and 'and' |
p_atomic_xor() | atomic fetch and 'xor' |
p_atomic_or() | atomic fetch and 'or' |
p_atomic_swap() | atomic exchange |
p_atomic_compswap() | atomic compare and exchange |
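
The semantics listed in the table (fetch-and-add, compare-and-exchange, try-lock) can be sketched with standard C11 atomics. This is not PAL's implementation or its signatures; the `demo_` names and the spin-lock design are assumptions chosen to keep the example self-contained.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Fetch-and-add: returns the value held before the addition. */
int demo_atomic_add(atomic_int *p, int v)
{
    return atomic_fetch_add(p, v);
}

/* Compare-and-exchange: stores desired only if *p equals expected.
 * Returns the value observed in *p before the operation. */
int demo_atomic_compswap(atomic_int *p, int expected, int desired)
{
    /* On failure, expected is overwritten with the current value. */
    atomic_compare_exchange_strong(p, &expected, desired);
    return expected;
}

/* A minimal spin mutex built on atomic_flag. */
typedef struct { atomic_flag held; } demo_mutex_t;

void demo_mutex_init(demo_mutex_t *m)
{
    atomic_flag_clear(&m->held);
}

/* One locking attempt: true if the lock was acquired. */
bool demo_mutex_trylock(demo_mutex_t *m)
{
    return !atomic_flag_test_and_set(&m->held);
}

/* Spin until the lock is acquired. */
void demo_mutex_lock(demo_mutex_t *m)
{
    while (atomic_flag_test_and_set(&m->held)) {
        /* busy-wait */
    }
}

void demo_mutex_unlock(demo_mutex_t *m)
{
    atomic_flag_clear(&m->held);
}
```

A spin lock suits small cores with no operating system scheduler, which is one plausible reason to prefer it on very memory-constrained targets; on a hosted system a blocking mutex is usually the better choice.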
##MATH
The math functions are single-threaded, vectorized functions intended to run on a single processor. They take pointers for input/output arguments and a separate variable that gives the size of the vectors. Speed and size are the priorities, and some liberties have been taken with respect to accuracy and safety.
FUNCTION | NOTES |
---|---|
p_abs() | absolute value |
p_absdiff() | absolute difference |
p_add() | add |
p_acos() | arc cosine |
p_acosh() | arc hyperbolic cosine |
p_asin() | arc sine |
p_asinh() | arc hyperbolic sine |
p_cbrt() | cube root |
p_cos() | cosine |
p_cosh() | hyperbolic cosine |
p_div() | division |
p_dot() | dot product |
p_exp() | exponential |
p_ftoi() | float to integer conversion |
p_itof() | integer to float conversion |
p_inv() | inverse |
p_invcbrt() | inverse cube root |
p_invsqrt() | inverse square root |
p_ln() | natural log |
p_log10() | base-10 log |
p_max() | finds max val |
p_min() | finds min val |
p_mean() | mean operation |
p_median() | finds middle value |
p_mode() | finds most common value |
p_mul() | multiplication |
p_popcount() | count the number of bits set |
p_pow() | element raised to a power |
p_rand() | random number generator |
p_randinit() | initialize random number generator |
p_sort() | heap sort |
p_sin() | sine |
p_sinh() | hyperbolic sine |
p_sqrt() | square root |
p_sub() | subtract |
p_sum() | sum of all vector elements |
p_sumsq() | sum of all vector squared elements |
p_tan() | tangent |
p_tanh() | hyperbolic tangent |
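
The calling convention described above (pointers in, pointer out, explicit length) might look like the following sketch. The `demo_` functions are hypothetical stand-ins, not PAL's actual signatures. The last function uses the well-known bit-trick approximation of the inverse square root purely to illustrate what trading accuracy for speed can mean; it is not taken from PAL and assumes 32-bit IEEE-754 floats.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Element-wise add: c[i] = a[i] + b[i] for i in [0, n). */
void demo_add(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Dot product: *c = sum of a[i] * b[i] for i in [0, n). */
void demo_dot(const float *a, const float *b, float *c, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    *c = acc;
}

/* Approximate 1/sqrt(x) for x > 0 (assumes 32-bit IEEE-754 float).
 * Initial guess from integer arithmetic on the bit pattern, then
 * one Newton-Raphson step. Relative error is a fraction of a percent. */
float demo_invsqrt_approx(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);        /* safe type punning */
    bits = 0x5f3759dfu - (bits >> 1);
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);  /* one refinement step */
}
```

None of these functions check for NULL pointers, overlapping buffers, or non-positive inputs, which mirrors the stated trade-off: the caller is responsible for passing valid arguments.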
##DSP
The digital signal processing (DSP) functions are similar to the math functions
in that they are single-threaded, vectorized functions intended to run on a
single processor. Just like the math functions, they take pointers for
input/output arguments and a separate variable that gives the size of the
vectors. Speed and size are the priorities, and some liberties have been taken
with respect to accuracy and safety.
FUNCTION | NOTES |
---|---|
p_acorr() | autocorrelation: r[j] = sum ( x[j+k] * x[k] ), k=0..(n-j-1) |
p_conv() | convolution: r[j] = sum ( h[k] * x[j-k] ), k=0..(nh-1) |
p_xcorr() | correlation: r[j] = sum ( x[j+k] * y[k] ), k=0..(nx+ny-1) |
p_fir() | FIR filter direct form: r[j] = sum ( h[k] * x[j-k] ), k=0..(nh-1) |
p_firdec() | FIR filter with decimation: r[j] = sum ( h[k] * x[j*D-k] ), k=0..(nh-1) |
p_firint() | FIR filter with interpolation: r[j] = sum ( h[k] * x[j*D-k] ), k=0..(nh-1) |
p_firsym() | FIR symmetric form |
p_iir() | IIR filter |
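
The autocorrelation and convolution formulas in the table translate almost directly into loops. The sketch below is a plain reference version, not PAL's optimized code; the `demo_` names, the argument order, and the choice to treat samples outside the input as zero are assumptions for illustration.

```c
#include <stddef.h>

/* Autocorrelation: r[j] = sum over k of x[j+k] * x[k], k = 0..(nx-j-1).
 * Writes nr outputs; lags at or beyond nx come out as 0. */
void demo_acorr(const float *x, float *r, size_t nx, size_t nr)
{
    for (size_t j = 0; j < nr; j++) {
        float acc = 0.0f;
        for (size_t k = 0; k + j < nx; k++)
            acc += x[j + k] * x[k];
        r[j] = acc;
    }
}

/* Convolution: r[j] = sum over k of h[k] * x[j-k], k = 0..(nh-1).
 * Samples outside x are taken as zero; writes nx + nh - 1 outputs. */
void demo_conv(const float *x, const float *h, float *r,
               size_t nx, size_t nh)
{
    for (size_t j = 0; j + 1 < nx + nh; j++) {
        float acc = 0.0f;
        for (size_t k = 0; k < nh && k <= j; k++) {
            if (j - k < nx)                /* skip taps past the end of x */
                acc += h[k] * x[j - k];
        }
        r[j] = acc;
    }
}
```

For example, convolving x = {1, 2, 3} with h = {1, 1} under these assumptions yields {1, 3, 5, 3}. The caller must provide an output buffer of the right length, since nothing here checks it.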
##IMAGE PROCESSING
The image processing functions work on 2D arrays of data and use the same
argument passing conventions as the dsp and math functions.
FUNCTION | NOTES |
---|---|
p_box3x3() | box filter (3x3) |
p_conv2d() | 2d convolution |
p_gauss3x3() | gaussian blur filter (3x3) |
p_median3x3() | median filter (3x3) |
p_laplace3x3() | laplace filter (3x3) |
p_prewitt3x3() | prewitt filter (3x3) |
p_sad8x8() | sum of absolute differences (8x8) |
p_sad16x16() | sum of absolute differences (16x16) |
p_sobel3x3() | sobel filter (3x3) |
p_scharr3x3() | scharr filter (3x3) |
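
A 3x3 box filter under the same pointer-plus-size convention could be sketched as follows. The layout (row-major floats) and the border handling (only interior pixels are produced, so the output is (rows-2) x (cols-2)) are assumptions for this example, and `demo_box3x3` is a hypothetical name rather than PAL's signature.

```c
#include <stddef.h>

/* 3x3 box filter on a row-major rows x cols image.
 * Each output pixel is the mean of the 3x3 window whose top-left corner
 * is at (i, j). Output is (rows-2) x (cols-2), row-major. */
void demo_box3x3(const float *x, float *r, size_t rows, size_t cols)
{
    if (rows < 3 || cols < 3)
        return;                            /* no full window fits */
    for (size_t i = 0; i + 2 < rows; i++) {
        for (size_t j = 0; j + 2 < cols; j++) {
            float acc = 0.0f;
            for (size_t di = 0; di < 3; di++)
                for (size_t dj = 0; dj < 3; dj++)
                    acc += x[(i + di) * cols + (j + dj)];
            r[i * (cols - 2) + j] = acc / 9.0f;
        }
    }
}
```

Producing only the interior avoids any per-pixel border test in the inner loop, which fits a design that values speed and small code size; callers that need a same-sized output would have to pad the input first.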
##FFT
- An FFTW-like interface
##BLAS
- A port of the BLIS library
##SYSTEM CALLS
- Bionic libc implementation as a starting point