After some time off, here I am again.
Here is an interesting way to compute the inverse of a matrix that I found useful while developing a CUDA program. In terms of memory usage, it is the most efficient method I know of for inverting a matrix, which is handy when we want to fit everything into the fast but limited shared memory.
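As a rough illustration of what a memory-lean inverse can look like, here is a minimal sketch of in-place Gauss-Jordan elimination, which overwrites the matrix with its inverse and needs no second n x n buffer. This is an assumed technique chosen for illustration, not necessarily the method used in the code from this post. The name `invert_in_place`, the row-major layout, the `eps` threshold, and the `HD` macro are all hypothetical. The sketch does no pivoting, so it only suits matrices whose pivots stay away from zero (for example, diagonally dominant ones).

```cpp
#include <math.h>

// Lets the same function build as plain C++ or, under nvcc, as a
// host/device function (assumption: HD is just a local convenience macro).
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: inverts the n x n row-major matrix `a` in place.
// Only a few scalars of extra storage are used.
// No pivoting is performed; returns false when a pivot is smaller than
// `eps` in magnitude, in which case `a` is left partially overwritten.
HD inline bool invert_in_place(float* a, int n, float eps = 1e-12f)
{
    for (int p = 0; p < n; ++p) {
        float piv = a[p * n + p];
        if (fabsf(piv) < eps) {
            return false;
        }

        // Scale the pivot row. The pivot slot is set to 1 first, so after
        // the division it holds 1/piv, the matching entry of the inverse.
        a[p * n + p] = 1.0f;
        for (int j = 0; j < n; ++j) {
            a[p * n + j] /= piv;
        }

        // Eliminate column p from every other row. The column-p slot is
        // zeroed first so it can accumulate the inverse entry instead.
        for (int i = 0; i < n; ++i) {
            if (i == p) {
                continue;
            }
            float f = a[i * n + p];
            a[i * n + p] = 0.0f;
            for (int j = 0; j < n; ++j) {
                a[i * n + j] -= f * a[p * n + j];
            }
        }
    }
    return true;
}
```

In a kernel, one plausible use is for each thread to call this on its own small matrix held in shared memory. Whether that layout pays off depends on the matrix size and on how the threads access the array.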
The Mathematica code is here
The CUDA code is next