Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
279 views
in Technique[技术] by (71.8m points)

c++ - Why Opencv GPU code is slower than CPU?

I'm using opencv242 + VS2010 by a notebook.
I tried to do some simple test of the GPU block in OpenCV, but it showed the GPU is 100 times slower than CPU codes. In this code, I just turn the color image to grayscale image, use the function of cvtColor

Here is my code, PART1 is CPU code(test cpu RGB2GRAY), PART2 is upload image to GPU, PART3 is GPU RGB2GRAY, PART4 is CPU RGB2GRAY again. There are 3 things makes me so wondering:

1 In my code, part1 is 0.3ms, while part4 (which is exactly same with part1) is 40ms!!!
2 The part2 which upload image to GPU is 6000ms!!!
3 Part3( GPU codes) is 11ms, it is so slow for this simple image!

    #include "StdAfx.h"
    #include <iostream>
    #include "opencv2/opencv.hpp"
    #include "opencv2/gpu/gpu.hpp"
    #include "opencv2/gpu/gpumat.hpp"
    #include "opencv2/core/core.hpp"
    #include "opencv2/highgui/highgui.hpp"
    #include <cuda.h>
    #include <cuda_runtime_api.h>
    #include <ctime>
    #include <windows.h>

    using namespace std;
    using namespace cv;
    using namespace cv::gpu;

    int main()
    {
        LARGE_INTEGER freq;
        LONGLONG QPart1,QPart6;
        double dfMinus, dfFreq, dfTim;
        QueryPerformanceFrequency(&freq);
        dfFreq = (double)freq.QuadPart;

        cout<<getCudaEnabledDeviceCount()<<endl;
        Mat img_src = imread("d:\CUDA\train.png", 1);

        // PART1 CPU code~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        // From color image to grayscale image.
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;
        Mat img_gray;
        cvtColor(img_src,img_gray,CV_BGR2GRAY);
        QueryPerformanceCounter(&freq);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("CPU RGB2GRAY running time is %.2f ms

",dfTim);

        // PART2 GPU upload image~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        GpuMat gimg_src;
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;
        gimg_src.upload(img_src);
        QueryPerformanceCounter(&freq);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("Read image running time is %.2f ms

",dfTim);

        GpuMat dst1;
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;

        /*dst.upload(src_host);*/
        dst1.upload(imread("d:\CUDA\train.png", 1));

        QueryPerformanceCounter(&freq);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("Read image running time 2 is %.2f ms

",dfTim);

        // PART3~ GPU code~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        // gpuimage From color image to grayscale image.
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;

        GpuMat gimg_gray;
        gpu::cvtColor(gimg_src,gimg_gray,CV_BGR2GRAY);

        QueryPerformanceCounter(&freq);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("GPU RGB2GRAY running time is %.2f ms

",dfTim);

        // PART4~CPU code(again)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

        // gpuimage From color image to grayscale image.
        QueryPerformanceCounter(&freq);
        QPart1 = freq.QuadPart;
        Mat img_gray2;
        cvtColor(img_src,img_gray2,CV_BGR2GRAY);
        BOOL i_test=QueryPerformanceCounter(&freq);
        printf("%d 
",i_test);
        QPart6 = freq.QuadPart;
        dfMinus = (double)(QPart6 - QPart1);
        dfTim = 1000 * dfMinus / dfFreq;
        printf("CPU RGB2GRAY running time is %.2f ms

",dfTim);

        cvWaitKey();
        getchar();
        return 0;
    }
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Most answers above are actually wrong. The reason why it is slow by a factor 20.000 is of course not because of 'CPU clockspeed is faster' and 'it has to copy it to the GPU' (accepted answers). These are factors, but by saying that you omit the fact that you have vastly more computing power for a problem that is disgustingly parallel. Saying 20.000x performance difference is because of the latter is just so plain ridiculous. The author here knew something was wrong that's not straight forward. Solution:

Your problem is that CUDA needs to initialize! It will always initialize for the first image and generally takes between 1-10 seconds, depending on the alignment of Jupiter and Mars. Now try this. Do the computation twice and then time them both. You will probably see in this case that the speeds are within the same order of magnutide, not 20.000x, that's ridiculous. Can you do something about this initialization? Nope, not that I know of. It's a snag.

edit: I just re-read the post. You say you're running on a notebook. Those often have shabby GPU's, and CPU's with a fair turbo.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...