Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
284 views
in Technique[技术] by (71.8m points)

c++ - Fast string splitting with multiple delimiters

I investigated some time here on StackOverflow to find good algorithms to split strings with multiple delimiters into a vector< string >. I also found some methods:

The Boost way:

boost::split(vector, string, boost::is_any_of(" "));

the getline method:

std::stringstream ss(string);
std::string item;
while(std::getline(ss, item, ' ')) {
    vector.push_back(item);
}

the tokenize way of Boost:

char_separator<char> sep(" ");
tokenizer<char_separator<char>> tokens(string, sep);
BOOST_FOREACH(string t, tokens)
{
   vector.push_back(t);
}

and the cool STL way:

     istringstream iss(string);
     copy(istream_iterator<string>(iss),
     istream_iterator<string>(),
     back_inserter<vector<string> >(vector));

and the method of Shadow2531 (see the linked topic).

Most of them came from this topic. But they unfortunately don't solve my problem:

  • Boost's split is easy to use but with the big data (about 1.5*10^6 single elements in best cases) and about 10 delimiters I am using it's horrific slow.

  • The getline, STL and Shadow2531's method have the problem that I can only use one single char as delimiter. I need a few more.

  • Boost's tokenize is even more horrific in the aspect of speed. It took 11 seconds with 10 delimiters to split a string into 1.5*10^6 elements.

So I don't know what to do: I want to have a really fast string splitting algorithm with multiple delimiters.

Is Boost's split the maximum or is there a way to do it faster?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Two things come to mind:

  1. Use string views instead of strings as the split result, saves a lot of allocations.
  2. If you know you're only going to be working with chars (in the [0,255] range), try using a bitset to test membership instead of finding into the delimiter characters.

Here's a quick attempt at applying these ideas:

#include <vector>
#include <bitset>
#include <iostream>
#include <boost/algorithm/string/split.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <boost/timer.hpp>

using namespace std;
size_t const N = 10000000;

template<typename C>
void test_custom(string const& s, char const* d, C& ret)
{
  C output;

  bitset<255> delims;
  while( *d )
  {
    unsigned char code = *d++;
    delims[code] = true;
  }
  typedef string::const_iterator iter;
  iter beg;
  bool in_token = false;
  for( string::const_iterator it = s.begin(), end = s.end();
    it != end; ++it )
  {
    if( delims[*it] )
    {
      if( in_token )
      {
        output.push_back(typename C::value_type(beg, it));
        in_token = false;
      }
    }
    else if( !in_token )
    {
      beg = it;
      in_token = true;
    }
  }
  if( in_token )
    output.push_back(typename C::value_type(beg, s.end()));
  output.swap(ret);
}

template<typename C>
void test_strpbrk(string const& s, char const* delims, C& ret)
{
  C output;

  char const* p = s.c_str();
  char const* q = strpbrk(p+1, delims);
  for( ; q != NULL; q = strpbrk(p, delims) )
  {
    output.push_back(typename C::value_type(p, q));
    p = q + 1;
  }

  output.swap(ret);
}

template<typename C>
void test_boost(string const& s, char const* delims)
{
  C output;
  boost::split(output, s, boost::is_any_of(delims));
}

int main()
{
  // Generate random text
  string text(N, ' ');
  for( size_t i = 0; i != N; ++i )
    text[i] = (i % 2 == 0)?('a'+(i/2)%26):((i/2)%2?' ':'');

  char const* delims = " [],-'/\!"§$%&=()<>?";

  // Output strings
  boost::timer timer;
  test_boost<vector<string> >(text, delims);
  cout << "Time: " << timer.elapsed() << endl;

  // Output string views
  typedef string::const_iterator iter;
  typedef boost::iterator_range<iter> string_view;
  timer.restart();
  test_boost<vector<string_view> >(text, delims);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string> vs;
  test_custom(text, delims, vs);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string_view> vsv;
  test_custom(text, delims, vsv);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string> vsp;
  test_strpbrk(text, delims, vsp);
  cout << "Time: " << timer.elapsed() << endl;

  // Custom split
  timer.restart();
  vector<string_view> vsvp;
  test_strpbrk(text, delims, vsvp);
  cout << "Time: " << timer.elapsed() << endl;

  return 0;
}

Compiling this with Boost 1.46.1 using GCC 4.5.1 with the -O4 flag enabled I get:

  • Time: 5.951 (Boost.Split + vector)
  • Time: 3.728 (Boost.Split + vector
  • Time: 1.662 (Custom split + vector)
  • Time: 0.144 (Custom split + vector)
  • Time: 2.13 (Strpbrk + vector)
  • Time: 0.527 (Strpbrk + vector)

NOTE: There's a slight difference in the output as empty tokens are dropped by my custom function. But you can adapt this code to your needs if you decide to use it.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...