transfer dynamic 2D array memory to CUDA
There are many ways to get your 2D array shipped from CPU to GPU in CUDA, but I think this particular method is a great lesson in arrays and memory in general.
The basic concept is easy: flatten out the 2D array into a single dimensional array.
We have to flatten the array so that we have contiguous memory when using cudaMemcpy. There isn’t really a good way to copy over the array in 2 dimensions. We’ll have to send it to the GPU as a single-dimensional array, and then figure out the 2D structure of the data on the other side.
creating a dynamic array the “wrong” way
The way I typically define a 2D dynamic array is as an array of pointers:
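The original snippet isn’t preserved in this copy of the post, but the allocation being described looks something like this sketch (the function name is mine; rows and cols are assumed):

```c
#include <stdlib.h>

/* Allocate a rows x cols matrix as an array of row pointers,
 * with each row malloc'd separately. */
int **alloc_matrix_rowptrs(int rows, int cols) {
    int **A = malloc(rows * sizeof(int *));
    for (int i = 0; i < rows; i++)
        A[i] = malloc(cols * sizeof(int));
    return A;
}
```

Filling it with A[i][j] = i * cols + j produces the 2 x 3 matrix shown next.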
This creates the following matrix:
0 1 2
3 4 5
Of course we know that this actually creates 2 pointers to arrays (which are themselves just pointers…). So what we really have is:
A[0] -----> * * *
            | | |
            v v v
            0 1 2

A[1] -----> * * *
            | | |
            v v v
            3 4 5
why is this wrong?
It gets tricky when we try using cudaMalloc and cudaMemcpy, which are very similar to malloc and memcpy, respectively. To make this easier to read, we’ll just use the latter functions for now.
Let’s say we try to malloc and memcpy our way into a flattened 1D clone of our 2D array A.
bad copy #1
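The attempt presumably looked something like this sketch (B and the size expression are assumed names):

```c
/* BROKEN: this copies A's row *pointers* into B, not the matrix
 * elements (and it reads past A's rows-pointer block, too). */
int *B = (int *)malloc(rows * cols * sizeof(int));
memcpy(B, A, rows * cols * sizeof(int));
```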
This doesn’t work. That’s because we’re copying the 2 contiguous row pointers of A into B:
B: *    *    *    *  *  *
   |    |    |
   v    v    v
  A[0] A[1]  ???
   |    |
   |    +----> * * *
   |           | | |
   |           v v v
   |           3 4 5
   |
   +---------> * * *
               | | |
               v v v
               0 1 2
But the columns are somewhere else in non-contiguous space.
So what if we try copying A[0] instead of A?
bad copy #2
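As a sketch, with the same assumed names:

```c
/* STILL BROKEN for an array of per-row mallocs: copies A[0]'s row,
 * then reads past its end, because A[1]'s row is a separate allocation. */
memcpy(B, A[0], rows * cols * sizeof(int));
```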
This is closer…
B:    *       *       *      *  *  *
      |       |       |
      v       v       v
   A[0][0] A[0][1] A[0][2]   ?  ?  ?
But the problem now is that A[0][2] and A[1][0] are not contiguous in memory in a dynamic array of pointers, so this also fails. Note the word dynamic there. If this were a static array, int A[2][3], then we’d be fine, but we have a dynamic array… We can’t be certain that A[0][2] is contiguous with A[1][0].
This isn’t a problem on the host, but it is definitely a problem when we’re trying to copy memory out to the GPU. We need the memory to be contiguous in order to use the CUDA API.
using contiguous memory
We could just copy the original array into a new single-dimensional array on the host, and then ship it from there. That would get us contiguous memory on the host, but it costs rows * cols extra copying work and rows * cols extra space. That’s pretty bad since it’s not even necessary!
In order to get this to work, we’ll have to refine how we’ve created the original matrix.
static array? (still wrong)
As mentioned above, we could simply change this to a static array:
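Roughly like this (a sketch; the wrapper function is mine, added so the copy has somewhere to live):

```c
#include <string.h>

/* With a static 2D array, all six ints occupy one contiguous block,
 * so a single flat copy works. */
void copy_static(int B[6]) {
    int A[2][3] = {
        {0, 1, 2},
        {3, 4, 5}
    };
    memcpy(B, A, sizeof A);
}
```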
Now our array is neatly packed into contiguous slots in memory, so the bad copy #2 solution above would work nicely.
why is this still wrong?
It’s not necessarily wrong… It works, but we started the problem by saying we wanted to use a dynamic array, not a static array. So this solution doesn’t fit the constraints of our problem.
the right way
Obviously this isn’t the only “right” way to solve this problem. Of course there are many solutions, but for now just consider this the “right” one.
why is this “right”?
I consider this “right” because it accomplishes two things:
- The matrix is a 2D dynamic matrix on the host
- The memory is contiguous on the host
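The allocation code itself didn’t survive in this copy of the post, but the pattern being described is the following (the function name is mine; rows and cols are assumed):

```c
#include <stdlib.h>

/* Allocate a rows x cols matrix that is contiguous in memory
 * but still supports A[i][j] indexing. */
int **alloc_matrix_contig(int rows, int cols) {
    int **A = malloc(rows * sizeof(int *));    /* rows pointers, uninitialized */
    A[0] = malloc(rows * cols * sizeof(int));  /* one block for the whole matrix */
    for (int i = 1; i < rows; i++)
        A[i] = A[0] + i * cols;                /* point each row into the block */
    return A;
}
```

Freeing is then just two calls: free(A[0]); free(A);.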
The magic here happens in the first 3 lines of this solution.
Here we point A to rows pointers to arrays, which are currently uninitialized.
A: * *
Next, we allocate the entire array at A[0].
A: * *
   |
   +-> * * * * * *
This means that all of the memory to be used for this array is contiguous after A[0]. At this point we could have just flattened the array to begin with, since we’ve essentially created a 1D version of the array in memory. But don’t forget about that first line, where we created rows pointers to int*. Now we can tie that back into the picture:
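In code, the row-fixup is just a short loop (a sketch; A, rows, and cols as in the rest of the example):

```c
for (int i = 1; i < rows; i++)
    A[i] = A[0] + i * cols;   /* pointer arithmetic: i rows past the start */
```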
A[0] is already pointing to the beginning of our full array, so we start at A[1]. A[1] is set to A[0] (which is a pointer) plus an offset of cols slots in memory (which means pointer arithmetic).
The result after the first iteration is as follows:
A: *  *
   |  |
   |  +------+
   |         v
   +-> * * * * * *
This would repeat for however many rows we have. In this example we’ve only had 2, but a 3 x 2 array would end up looking more like this:
A: *  *  *
   |  |  |
   |  |  +-----+
   |  +----+   |
   |       v   v
   +-> * * * * * *
Now we have a dynamic array that is contiguous in memory! We can use memcpy and/or cudaMemcpy by referencing A[0], and we can also use 2D array syntax to find elements in the array on the host.
The last step is to just copy this with malloc-like and memcpy-like syntax:
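With the CUDA calls, that looks roughly like this sketch (dA is an assumed name for the device pointer):

```c
int *dA;
/* allocate a flat rows*cols block on the device */
cudaMalloc((void **)&dA, rows * cols * sizeof(int));
/* copy from A[0], the start of the contiguous host block */
cudaMemcpy(dA, A[0], rows * cols * sizeof(int), cudaMemcpyHostToDevice);
```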
Don’t mind the weird void ** syntax here… That’s another discussion. For consistency with the rest of the example, here’s what it looks like with plain memcpy and malloc:
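Something along these lines (a sketch; the wrapper function is mine, and B stands in for the device array):

```c
#include <stdlib.h>
#include <string.h>

/* Host-only equivalent of the CUDA copy: B plays the role
 * of the flat device array. */
int *flat_copy(int **A, int rows, int cols) {
    int *B = malloc(rows * cols * sizeof(int));
    memcpy(B, A[0], rows * cols * sizeof(int));
    return B;
}
```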
flat index to [row][col]
The last thing you’ll need is the translation of the flat index to [row][col]:
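For example (the helper names are mine):

```c
/* map a flat index back to 2D coordinates (cols = row width) */
int flat_to_row(int idx, int cols) { return idx / cols; }
int flat_to_col(int idx, int cols) { return idx % cols; }

/* and the inverse, [row][col] back to a flat index */
int to_flat(int row, int col, int cols) { return row * cols + col; }
```

Inside a kernel, idx would typically come from blockIdx.x * blockDim.x + threadIdx.x, and then row and col fall out the same way.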