Saturday, April 17, 2010

Complex matrix Inverse I: CUDA

Here is a short program in CUDA to invert complex matrices through the LU decomposition (without permutations) with back substitution in __device__ mode. This means that the whole inversion is designed to run in each thread.

This code was adapted from a C code in real double precision found here

The code is standalone in the sense that it only requires the Standard tools provided by CUDA (without CuBlas) and tested in the NVIDIA GeForce 240M
on a Linux Fedora 11.

The program can be organized in three files
  • cumatrixtools.h: which contains the method itself.
  • which contains the kernel function, which maps the method on the threreads.
  • which administers the memory and calls kernel function.
The purpose of this program is to invert several variations of a matrix A. The matrix A is passed to the kernel and the inverse is evaluated after modifying the first element of A. Each branch executes a different matrix. The kernel is called with <<<2,32>>>, which means that the threads are organized in two blocks, each bloch with 32 threads. The more physical processors one has, the more threads per block one could assign in order to gain performance

The whole program is distributed in three posts (I,II and III) looks like

#include <stdio.h>
#include <complex.h>



void cuPrintMatrix( cuFloatComplex  *C ,int N, int M )

int i,j;

for(j=0;j<M;j++) printf(" (%f,%f)\t ", cuCrealf(C[i*N + j]) , cuCimagf(C[i*N + j]) );

printf(" \n ");

const int N=3;     //Matrix dimension

const int NBranches = 64;
// Main program
int main(int argc, char **argv){

int i;
float  complex A[9] =  //  Base matrix

0.f+I/sqrtf(2.f), 0.0f+ I/sqrt(2.0f),  0.+ 0.*I,

0. -I/2.     , 0.+I/2       ,   0.+ I/sqrt(2.),
-0.5+0.*I ,  0.5+0.*I  ,  -1./sqrt(2.)+.0*I


cuFloatComplex  *h_A, *h_invA;
cuFloatComplex  *d_A, *d_invA,   *d_WorkSpace;

printf("...allocating CPU memory.\n");
h_A =          (cuFloatComplex *)   malloc(                              N*N*sizeof(cuFloatComplex ));
h_invA =       (cuFloatComplex *)   malloc(                    NBranches*N*N*sizeof(cuFloatComplex ));

printf("...allocating GPU memory.\n");

cudaMalloc((void **)&d_A,                                      NBranches*N*N*sizeof(cuFloatComplex ));

cudaMalloc((void **)&d_invA,                                   NBranches*N*N*sizeof(cuFloatComplex ));

cudaMalloc((void **)&d_WorkSpace, NBranches*cgeMatrixInverse_WorkSpace()*N*N*sizeof(cuFloatComplex ));

printf("...Copying memory.\n ");
for(i=0;i<N*N;i++ ) h_A[i] = make_cuFloatComplex( crealf(A[i]) , cimagf(A[i])  );

cudaMemcpy(d_A, h_A, N*N*sizeof(cuFloatComplex) , cudaMemcpyHostToDevice);

printf("...The base matrix is:\n");
cuPrintMatrix( h_A , N, N );

printf("\n...Calling the kernel.\n");

cgeMatrixInverse_kernel<<<2,32>>>(d_invA, d_A , N ,d_WorkSpace);   // Divinding the 64 branches in 2 blocks of 32 threads


cudaMemcpy(h_invA, d_invA, NBranches*N*N*sizeof(float), cudaMemcpyDeviceToHost);

printf("\n The inverse of the first branch is \n");
cuPrintMatrix( h_invA , N, N );

printf("\n The inverse of the second branch is \n");
cuPrintMatrix( h_invA + N*N , N, N );

printf("\n and so on ..\n");





  1. I generated the html code to display the C code using
    the tool provided by

  2. Hi,
    The link to the C code that you point to doesn't go. Can you please share the working link? I am in need of a C code for complex matrix inversion!