Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

c++ - getline while reading a file vs reading whole file and then splitting based on newline character

I want to process each line of a file on a hard disk. Is it better to load the whole file and then split it on the newline character (using boost), or is it better to use getline()? My question is: does getline() read a single line when called (resulting in multiple hard disk accesses), or does it read the whole file and hand it out line by line?



1 Reply


getline will call read() as a system call somewhere deep in the guts of the standard library. Exactly how many times it is called, and how, depends on the library design. But most likely there is no distinct difference between reading a line at a time and reading the whole file, because the stream buffers its input and the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.
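To make the buffering point concrete, here is a minimal sketch that counts how often a stream has to refill its buffer while getline walks through many lines. `BlockSource` and `count_lines_and_refills` are made-up names for this illustration; the class is an in-memory stand-in for a file, and each refill plays the role of one read() call. A real `std::filebuf` chooses its own buffer size, so the 4096 used in the usage note is only an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <streambuf>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a file: hands out `data` in fixed-size blocks
// and counts the refills (the analogue of read() system calls).
class BlockSource : public std::streambuf {
public:
    BlockSource(std::string data, std::size_t block_size)
        : data_(std::move(data)), block_(block_size) {
        assert(block_size > 0);
    }
    std::size_t refills() const { return refills_; }

protected:
    int_type underflow() override {
        // Characters still buffered: no refill needed.
        if (gptr() < egptr()) return traits_type::to_int_type(*gptr());
        if (pos_ >= data_.size()) return traits_type::eof();
        const std::size_t n = std::min(block_.size(), data_.size() - pos_);
        std::memcpy(block_.data(), data_.data() + pos_, n);
        pos_ += n;
        ++refills_;
        setg(block_.data(), block_.data(), block_.data() + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    std::string data_;
    std::vector<char> block_;
    std::size_t pos_ = 0;
    std::size_t refills_ = 0;
};

struct Counts {
    std::size_t lines;
    std::size_t refills;
};

// Builds `line_count` short lines, reads them back with getline, and reports
// how many lines were seen and how many buffer refills that took.
Counts count_lines_and_refills(std::size_t line_count, std::size_t block_size) {
    std::string data;
    for (std::size_t i = 0; i < line_count; ++i) {
        data += "line " + std::to_string(i) + "\n";
    }
    BlockSource source(std::move(data), block_size);
    std::istream in(&source);

    std::string line;
    std::size_t lines = 0;
    while (std::getline(in, line)) {
        ++lines;
    }
    return {lines, source.refills()};
}
```

Calling `count_lines_and_refills(10000, 4096)` reads ten thousand lines but refills the buffer only a few dozen times, which is the reason a getline loop does not translate into one disk access per line.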

Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just reading the file to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.

But the "load the whole file in one go" approach has several distinct problems:

  1. You don't start processing until you have read the whole file.
  2. You need enough memory to read the entire file into memory - what if the file is a few hundred GB in size? Does your program fail then?
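By contrast, a line-at-a-time loop can start working on the first line immediately and only ever holds one line in memory. A minimal sketch of that pattern follows; `for_each_line` and `longest_line_length` are hypothetical helper names chosen for this example, not part of any library.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <istream>
#include <sstream>
#include <string>

// Calls `handle` on each line as soon as it has been read, so processing
// starts before the rest of the input is touched. Returns the line count.
std::size_t for_each_line(std::istream& in,
                          const std::function<void(const std::string&)>& handle) {
    std::string line;  // reused for every line; memory use is bounded by the longest line
    std::size_t count = 0;
    while (std::getline(in, line)) {
        handle(line);
        ++count;
    }
    return count;
}

// Example of per-line work: track the longest line without storing the input.
std::size_t longest_line_length(std::istream& in) {
    std::size_t longest = 0;
    for_each_line(in, [&longest](const std::string& line) {
        longest = std::max(longest, line.size());
    });
    return longest;
}
```

The same functions accept a `std::ifstream`, so a file of a few hundred GB is handled with the same small memory footprint as a short one.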

Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slow. You are just causing more problems for yourself.

Edit: So, I wrote a program to measure this, since I think it's quite interesting.

And the results are definitely interesting. To make the comparison fair, I created three large files of 1297984192 bytes each. I built them by concatenating all the source files in a directory with about a dozen different source files, then copying the result several times over to "multiply" it, until the test took over 1.5 seconds to run. That is how long I think a run needs to be so that the timing isn't too susceptible to a random "network packet came in" or some other outside influence taking time out of the process.

I also decided to measure the system and user time used by the process.
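The timing code itself is not part of this post. One possible way to collect wall-clock, user and system time on a POSIX system is sketched here, using `getrusage()` and `std::chrono::steady_clock`. The names `Timing` and `measure` are invented for this sketch, and it is not necessarily how the numbers in this answer were produced.

```cpp
#include <sys/resource.h>
#include <sys/time.h>

#include <chrono>
#include <functional>

// Seconds spent in one measured call (hypothetical helper type).
struct Timing {
    double wall;  // elapsed real time
    double user;  // CPU time spent in user mode
    double sys;   // CPU time spent in the kernel on behalf of the process
};

static double to_seconds(const timeval& tv) {
    return static_cast<double>(tv.tv_sec) + static_cast<double>(tv.tv_usec) / 1e6;
}

// Runs `work` once and reports how long it took.
Timing measure(const std::function<void()>& work) {
    rusage before{};
    rusage after{};

    getrusage(RUSAGE_SELF, &before);
    const auto start = std::chrono::steady_clock::now();

    work();

    const auto stop = std::chrono::steady_clock::now();
    getrusage(RUSAGE_SELF, &after);

    Timing t{};
    t.wall = std::chrono::duration<double>(stop - start).count();
    t.user = to_seconds(after.ru_utime) - to_seconds(before.ru_utime);
    t.sys = to_seconds(after.ru_stime) - to_seconds(before.ru_stime);
    return t;
}
```

A call such as `measure([] { func_getline("getline"); })` would then give the three figures for one of the functions in this answer; the argument string is only a placeholder for whatever file suffix is in use.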

$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608
Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608
Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)

Here are the three functions that read the file. There is some code to measure time and so on too, of course, but to keep this post short I chose not to post all of that. I also played around with the ordering to see if that made any difference, so the results above are not in the same order as the functions here.

void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    f.seekg(0, ios::end);
    streampos size = f.tellg();

    f.seekg(0, ios::beg);

    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }

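    // Note: what pubsetbuf() does on a stringstream is implementation-defined;
    // libstdc++ reads from the supplied buffer, other libraries may ignore it.
    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);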

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    f.close();


    cout << "Lines=" << lines << endl;

    delete [] buffer;
}

void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    string str;
    int lines = 0;

    while(getline(f, str))
    {
        lines++;
    }

    cout << "Lines=" << lines << endl;

    f.close();
}

void func_mmap(const char *name)
{
    char *buffer;

    string fullname = string("bigfile_") + name;
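    int f = open(fullname.c_str(), O_RDONLY);
    if (f < 0)
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    off_t size = lseek(f, 0, SEEK_END);

    lseek(f, 0, SEEK_SET);

    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);
    if (buffer == MAP_FAILED)
    {
        cerr << "mmap failed for " << fullname << endl;
        exit(1);
    }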


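    // Same caveat as in func_readwhole: pubsetbuf() on a stringstream is
    // implementation-defined and may be ignored by some libraries.
    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);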

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    munmap(buffer, size);
    close(f);
    cout << "Lines=" << lines << endl;
}
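Both func_readwhole and func_mmap rely on pubsetbuf() to make a stringstream read from an existing buffer, and the effect of that call is implementation-defined. A portable way to walk the lines of an in-memory buffer, sketched here under the assumption that C++17 is available, is `std::string_view`. `count_lines` is a made-up name for this example, and it was not part of the measurements above.

```cpp
#include <cstddef>
#include <string_view>

// Counts lines the way a getline loop would: every '\n' ends a line, and a
// non-empty tail without a trailing '\n' counts as one more line.
std::size_t count_lines(std::string_view buffer) {
    std::size_t lines = 0;
    while (!buffer.empty()) {
        const std::size_t newline = buffer.find('\n');
        ++lines;
        if (newline == std::string_view::npos) {
            break;  // last line had no terminating newline
        }
        // buffer.substr(0, newline) is the current line, should it be needed.
        buffer.remove_prefix(newline + 1);
    }
    return lines;
}
```

With the buffers from the functions above this would be called as `count_lines(std::string_view(buffer, size))`, and it does not copy each line into a `std::string` the way getline does.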
