c++ - CUDA Device To Device transfer expensive

Question

Welcome To Ask or Share your Answers For Others

c++ - CUDA Device To Device transfer expensive

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

c++ - CUDA Device To Device transfer expensive

I have written some code to try to swap quadrants of a 2D matrix for FFT purposes, that is stored in a flat array.

    int leftover = W-dcW;

    T *temp;
    T *topHalf;
cudaMalloc((void **)&temp, dcW * sizeof(T));

    //swap every row, left and right
    for(int i = 0; i < H; i++)
    {
        cudaMemcpy(temp, &data[i*W], dcW*sizeof(T),cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W],&data[i*W+dcW], leftover*sizeof(T), cudaMemcpyDeviceToDevice);
        cudaMemcpy(&data[i*W+leftover], temp, dcW*sizeof(T), cudaMemcpyDeviceToDevice); 
    }

cudaMalloc((void **)&topHalf, dcH*W* sizeof(T));
    leftover = H-dcH;
    cudaMemcpy(topHalf, data, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(data, &data[dcH*W], leftover*W*sizeof(T), cudaMemcpyDeviceToDevice);
    cudaMemcpy(&data[leftover*W], topHalf, dcH*W*sizeof(T), cudaMemcpyDeviceToDevice);

Notice that this code takes device pointers, and does DeviceToDevice transfers.

Why does this seem to run so slow? Can this be optimized somehow? I timed this compared to the same operation on host using regular memcpy and it was about 2x slower.

Any ideas?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:39:46+0000

I ended up writing a kernel to do the swaps. This was indeed faster than the Device to Device memcpy operations

Categories

c++ - CUDA Device To Device transfer expensive

c++ - CUDA Device To Device transfer expensive

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags