An optimized C library for math, parallel processing and data movement

parallella/pal

PAL: The Parallel Architectures Library

The Parallel Architectures Library (PAL) is a compact C library with optimized routines for vector math, synchronization, and multi-processor communication.

Content

  1. Why?
  2. Design goals
  3. License
  4. Contribution Wanted!
  5. A Simple Example
  6. Library API reference
    6.0 Syntax
    6.1 Program Flow
    6.2 Data Movement
    6.3 Synchronization
    6.4 Basic Math
    6.5 Basic DSP
    6.6 Image Processing
    6.7 FFT (FFTW)
    6.8 Linear Algebra (BLAS)
    6.9 System Calls

##Why?
As hard as we tried, we could not find libraries that were a perfect fit for our design criteria. A number of projects and commercial products offer most of the functionality of PAL, but all of the existing offerings were either far too bulky or had the wrong license. In essence, the goal of the PAL effort is to provide extensions to the standard set of C libraries to address the trend towards massive multi-processor parallelism and SIMD computing.

##Design Goals

  • Fast (super fast, but not always safe)
  • Compact (small enough to fit processors that have less than 32KB of RAM)
  • Scalable (thread and data scalable, limited only by the amount of local memory)
  • Portable across platforms (deployable across different ISAs and system architectures)
  • Permissive license (Apache 2.0 license to maximize overall use)

##License
The PAL source code is licensed under the Apache License, Version 2.0. See LICENSE for full license text unless otherwise specified.

##Contribution
Our goal is to make PAL a broad community project from day one. Some of these functions are tricky, but the biggest challenge with the PAL library is really the volume of simple functions. The good news is that if just 100 people contribute one function each, we'll be done in a couple of days! If you know C, you are ready to contribute!

Instructions for contributing can be found HERE.

##A Simple Example

Manager Code

#include "pal_base.h"
#include <stdio.h>
#define N 16
int main (int argc, char *argv[]){

    //Stack variables
    int status, i, all=0, nargs=1; // all starts at 0 so it is never read uninitialized
    char *file="./hello_task.elf";
    char *func="main";
    char *args[nargs];
    char argbuf[20];

    // Handles to opaque structures
    p_dev_t dev0;
    p_prog_t prog0;
    p_team_t team0;
    p_mem_t mem[4];

    //Execution setup
    dev0 = p_init(P_DEMO, 0);            // initialize device and team
    prog0 = p_load(dev0, file, func, 0); // load a program from file system
    all = p_query(dev0, P_NODES, all);   // find number of nodes in system
    p_open(dev0, &team0, 0, all);        // create a team

    //Running program
    for(i=0;i<all;i++){
        snprintf(argbuf, sizeof(argbuf), "%d", i); //string args needed to run main as is
        args[0]=argbuf;
        p_run(prog0, team0, i, 1, nargs, args, 0);
    }
    p_wait(team0);    //wait for team to finish (not needed, p_run()
                      //is blocking by default)
    p_close(team0);   //close team
    p_finalize(dev0); //finalize memory
    return 0;
}

Worker Code (hello_task.elf)

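#include <stdio.h>
#include <stdlib.h> // for atoi()
int main(int argc, char* argv[]){
    int pid=0;
    // argv index 2 is kept from the original example; skip it if argv is shorter
    if(argc>2)
        pid=atoi(argv[2]);
    printf("--Processor %d says hello!--\n", pid);
    return 0;
}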

PAL LIBRARY API REFERENCE

##PROGRAM FLOW
These program flow functions are used to manage the system and to execute programs. All opaque objects are referenced with simple integers.

FUNCTION | NOTES
---|---
p_init() | initialize the run time
p_query() | query a device object
p_load() | load binary elf file into memory
p_run() | run a program on a team of processors
p_open() | open a team of processors
p_append() | add members to team
p_remove() | remove members from team
p_close() | close a team of processors
p_barrier() | team barrier
p_wait() | wait for team to finish
p_fence() | memory fence
p_finalize() | clean up the run time
p_get_err() | get error code (if any)

##MEMORY ALLOCATION
These functions are used for creating memory objects. The function returns a unique integer for each new memory object. This integer can then be used by functions like p_read() and p_write() to access data within the memory object.

FUNCTION | NOTES
---|---
p_malloc() | allocate memory on local processor
p_rmalloc() | allocate memory on remote processor
p_free() | free memory

##DATA MOVEMENT
The data movement functions move blocks of data between opaque memory objects and locations specified by pointers. The memory object is specified by a simple integer. The exception is p_memcpy(), which only copies blocks of bytes within a shared memory architecture.

FUNCTION | NOTES
---|---
p_gather() | gather operation
p_memcpy() | fast memcpy()
p_read() | read from a memory object
p_scatter() | scatter operation
p_write() | write to a memory object
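
As a rough illustration of what "scatter" and "gather" commonly mean, here is a minimal plain C sketch. It assumes equal-sized chunks and ordinary pointers rather than PAL memory objects, and the names `sketch_scatter` and `sketch_gather` are hypothetical. The actual p_scatter() and p_gather() signatures and semantics may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; not the PAL p_scatter()/p_gather() API. */

/* Scatter: split src into ndst equal chunks and copy chunk i into dst[i]. */
void sketch_scatter(const void *src, void *const dst[], size_t ndst,
                    size_t chunk_bytes)
{
    const unsigned char *s = src;
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < ndst; i++)
        memcpy(dst[i], s + i * chunk_bytes, chunk_bytes);
}

/* Gather: copy the chunk held in src[i] into slot i of one contiguous buffer. */
void sketch_gather(void *dst, void *const src[], size_t nsrc,
                   size_t chunk_bytes)
{
    unsigned char *d = dst;
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < nsrc; i++)
        memcpy(d + i * chunk_bytes, src[i], chunk_bytes);
}
```

Scattering a buffer and then gathering the pieces in the same order reproduces the original buffer, which is a handy sanity check for any implementation.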

##SYNCHRONIZATION
The synchronization functions are useful for program sequencing and resource locking in shared memory systems.

FUNCTION | NOTES
---|---
p_mutex_lock() | lock a mutex
p_mutex_trylock() | try locking a mutex once
p_mutex_unlock() | unlock (clear) a mutex
p_mutex_init() | initialize a mutex
p_atomic_add() | atomic fetch and add
p_atomic_sub() | atomic fetch and sub
p_atomic_and() | atomic fetch and 'and'
p_atomic_xor() | atomic fetch and 'xor'
p_atomic_or() | atomic fetch and 'or'
p_atomic_swap() | atomic exchange
p_atomic_compswap() | atomic compare and exchange
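
The "fetch and op", "exchange", "compare and exchange" and "try lock once" semantics named in the table can be sketched with standard C11 atomics. This is only an illustration of the semantics, not PAL's implementation: the `sketch_*` names and the use of `uint32_t` are assumptions, and the PAL signatures may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not the PAL p_atomic_* / p_mutex_* API. */

/* Fetch and add: add n to *atom and return the value held before the add. */
static inline uint32_t sketch_fetch_add(_Atomic uint32_t *atom, uint32_t n)
{
    assert(atom != NULL);
    return atomic_fetch_add(atom, n);
}

/* Exchange: store desired and return the previous value. */
static inline uint32_t sketch_swap(_Atomic uint32_t *atom, uint32_t desired)
{
    assert(atom != NULL);
    return atomic_exchange(atom, desired);
}

/* Compare and exchange: store desired only if *atom equals *expected.
 * On failure the current value is written back into *expected. */
static inline bool sketch_compswap(_Atomic uint32_t *atom, uint32_t *expected,
                                   uint32_t desired)
{
    assert(atom != NULL && expected != NULL);
    return atomic_compare_exchange_strong(atom, expected, desired);
}

/* A test-and-set mutex built on an atomic flag. */
typedef struct { atomic_bool held; } sketch_mutex_t;

static inline void sketch_mutex_init(sketch_mutex_t *m)
{
    atomic_init(&m->held, false);
}

/* Try once: returns true if the caller now owns the mutex. */
static inline bool sketch_mutex_trylock(sketch_mutex_t *m)
{
    return !atomic_exchange(&m->held, true);
}

/* Spin until the mutex is acquired. */
static inline void sketch_mutex_lock(sketch_mutex_t *m)
{
    while (atomic_exchange(&m->held, true)) { /* busy wait */ }
}

static inline void sketch_mutex_unlock(sketch_mutex_t *m)
{
    atomic_store(&m->held, false);
}
```

The "fetch" in "fetch and add" matters: the call returns the old value, so several processors can each claim a unique slot index from a shared counter without taking a lock.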

##MATH
The math functions are single threaded vectorized functions intended to run on a single processor. Math functions use pointers for input/output arguments and take a separate argument that gives the size of the vectors. Speed and size are the priorities, and some liberties have been taken with respect to accuracy and safety.

FUNCTION | NOTES
---|---
p_abs() | absolute value
p_absdiff() | absolute difference
p_add() | add
p_acos() | arc cosine
p_acosh() | arc hyperbolic cosine
p_asin() | arc sine
p_asinh() | arc hyperbolic sine
p_cbrt() | cube root
p_cos() | cosine
p_cosh() | hyperbolic cosine
p_div() | division
p_dot() | dot product
p_exp() | exponential
p_ftoi() | float to integer conversion
p_itof() | integer to float conversion
p_inv() | inverse
p_invcbrt() | inverse cube root
p_invsqrt() | inverse square root
p_ln() | natural log
p_log10() | base-10 log
p_max() | finds max val
p_min() | finds min val
p_mean() | mean operation
p_median() | finds middle value
p_mode() | finds most common value
p_mul() | multiplication
p_popcount() | count the number of bits set
p_pow() | element raised to a power
p_rand() | random number generator
p_randinit() | initialize random number generator
p_sort() | heap sort
p_sin() | sine
p_sinh() | hyperbolic sine
p_sqrt() | square root
p_sub() | subtract
p_sum() | sum of all vector elements
p_sumsq() | sum of all vector squared elements
p_tan() | tangent
p_tanh() | hyperbolic tangent
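
The calling convention described above (pointers for inputs and outputs, plus a separate element count) might look like the following plain C reference loops. The names `sketch_add_f32` and `sketch_dot_f32`, the `float` element type and the argument order are assumptions made for illustration; check the PAL headers for the real prototypes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the PAL p_add()/p_dot() API. */

/* Element-wise add: c[i] = a[i] + b[i] for i in [0, n). */
void sketch_add_f32(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Dot product: the scalar result is also returned through a pointer,
 * so reductions follow the same convention as the vector functions. */
void sketch_dot_f32(const float *a, const float *b, float *c, size_t n)
{
    float acc = 0.0f;
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    *c = acc;
}
```

Unlike a library tuned for speed and size, this sketch checks its pointers with assert(); an optimized routine would typically drop such checks and vectorize the loop.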

##DSP
The digital signal processing (DSP) functions are similar to the math functions in that they are single threaded vectorized functions intended to run on a single processor. Just like the math functions, they take pointers for input/output arguments and a separate argument that gives the size of the vectors. Speed and size are the priorities, and some liberties have been taken with respect to accuracy and safety.

FUNCTION | NOTES
---|---
p_acorr() | autocorrelation: r[j] = sum ( x[j+k] * x[k] ), k=0..(n-j-1)
p_conv() | convolution: r[j] = sum ( h[k] * x[j-k] ), k=0..(nh-1)
p_xcorr() | correlation: r[j] = sum ( x[j+k] * y[k] ), k=0..(nx+ny-1)
p_fir() | FIR filter direct form: r[j] = sum ( h[k] * x[j-k] ), k=0..(nh-1)
p_firdec() | FIR filter with decimation: r[j] = sum ( h[k] * x[j*D-k] ), k=0..(nh-1)
p_firint() | FIR filter with interpolation: r[j] = sum ( h[k] * x[j*D-k] ), k=0..(nh-1)
p_firsym() | FIR symmetric form
p_iir() | IIR filter
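
The direct-form FIR and autocorrelation formulas in the table translate into short reference loops. The sketch below assumes that samples before x[0] are zero, that the FIR produces one output per input sample, and that the function names and argument order (all hypothetical) are free to differ from PAL's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the PAL p_fir()/p_acorr() API. */

/* Direct-form FIR: r[j] = sum( h[k] * x[j-k] ), k = 0..(nh-1).
 * Assumption: x[j-k] is treated as zero when j-k < 0. */
void sketch_fir_f32(const float *x, const float *h, float *r,
                    size_t nx, size_t nh)
{
    assert(x != NULL && h != NULL && r != NULL);
    for (size_t j = 0; j < nx; j++) {
        float acc = 0.0f;
        for (size_t k = 0; k < nh && k <= j; k++)
            acc += h[k] * x[j - k];
        r[j] = acc;
    }
}

/* Autocorrelation: r[j] = sum( x[j+k] * x[k] ), k = 0..(nx-j-1),
 * computed for the first nr lags (nr <= nx). */
void sketch_acorr_f32(const float *x, float *r, size_t nx, size_t nr)
{
    assert(x != NULL && r != NULL && nr <= nx);
    for (size_t j = 0; j < nr; j++) {
        float acc = 0.0f;
        for (size_t k = 0; k < nx - j; k++)
            acc += x[j + k] * x[k];
        r[j] = acc;
    }
}
```

For example, filtering x = {1, 2, 3, 4} with h = {1, 1} under these assumptions gives {1, 3, 5, 7}: each output is the current sample plus the previous one.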

##IMAGE PROCESSING
The image processing functions work on 2D arrays of data and use the same argument passing conventions as the dsp and math functions.

FUNCTION | NOTES
---|---
p_box3x3() | box filter (3x3)
p_conv2d() | 2d convolution
p_gauss3x3() | gaussian blur filter (3x3)
p_median3x3() | median filter (3x3)
p_laplace3x3() | laplace filter (3x3)
p_prewitt3x3() | prewitt filter (3x3)
p_sad8x8() | sum of absolute differences (8x8)
p_sad16x16() | sum of absolute differences (16x16)
p_sobel3x3() | sobel filter (3x3)
p_scharr3x3() | scharr filter (3x3)
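
To make the 2D conventions concrete, here is a minimal sketch of a 3x3 box filter and an 8x8 sum of absolute differences on row-major `float` images. It assumes that the box filter writes only the interior, giving a (rows-2) x (cols-2) output, and that the SAD takes a row stride. The names are hypothetical, and the PAL functions may handle borders and arguments differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the PAL p_box3x3()/p_sad8x8() API. */

/* 3x3 box filter: each output pixel is the mean of a 3x3 input window.
 * Assumption: no border handling, so r holds (rows-2) x (cols-2) pixels. */
void sketch_box3x3_f32(const float *x, float *r, size_t rows, size_t cols)
{
    assert(x != NULL && r != NULL && rows >= 3 && cols >= 3);
    size_t out_cols = cols - 2;
    for (size_t i = 0; i < rows - 2; i++) {
        for (size_t j = 0; j < out_cols; j++) {
            float acc = 0.0f;
            for (size_t di = 0; di < 3; di++)
                for (size_t dj = 0; dj < 3; dj++)
                    acc += x[(i + di) * cols + (j + dj)];
            r[i * out_cols + j] = acc / 9.0f;
        }
    }
}

/* 8x8 sum of absolute differences between two blocks.
 * stride is the number of floats per image row in both a and b. */
float sketch_sad8x8_f32(const float *a, const float *b, size_t stride)
{
    float acc = 0.0f;
    assert(a != NULL && b != NULL && stride >= 8);
    for (size_t i = 0; i < 8; i++) {
        for (size_t j = 0; j < 8; j++) {
            float d = a[i * stride + j] - b[i * stride + j];
            acc += (d < 0.0f) ? -d : d;
        }
    }
    return acc;
}
```

Dropping the border keeps the inner loop free of bounds checks, which fits the stated priority on speed and size, at the cost of a smaller output image.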

##FFT

  • An FFTW like interface

##BLAS

  • A port of the BLIS library

##SYSTEM CALLS

  • Bionic libc implementation as a starting point