Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


string - Counting Syllables In A Word

I'm looking for a fully accurate statement of an algorithm to count syllables in words. What I find when I research is either inconsistent or something I know generates incorrect results. Does anyone have suggestions for how to accomplish this? Thanks.

The algorithm I'm using now:

  1. Count the number of vowels in the word.
  2. Do not count double vowels separately ("rain" has 2 vowels but only 1 syllable).
  3. If the last letter of the word is a vowel, do not count it ("side" is 1 syllable).
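
For reference, the three rules above can be sketched in Python as follows. This is only one literal reading of them, not a recommended algorithm: the function name `count_syllables_naive` is made up for illustration, and it assumes the vowels are a, e, i, o, u (no "y") and that rule 3 means "drop a final vowel before counting".

```python
VOWELS = frozenset("aeiou")  # assumption: "y" is not treated as a vowel


def count_syllables_naive(word: str) -> int:
    """Apply the three rules from the question literally."""
    letters = word.lower()

    # Rule 3: ignore a vowel in final position ("side" -> "sid").
    if letters and letters[-1] in VOWELS:
        letters = letters[:-1]

    count = 0
    previous_was_vowel = False
    for ch in letters:
        is_vowel = ch in VOWELS
        # Rules 1 and 2: each run of adjacent vowels counts once.
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    return count


if __name__ == "__main__":
    for word in ("rain", "side", "banana", "the"):
        print(word, count_syllables_naive(word))
```

Under this reading, "rain" and "side" both give 1, but "banana" gives 2 and "the" gives 0, so at least some wrong results come from the rules themselves rather than from the implementation.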

Are there any more rules I'm missing? I'm trying to determine, by testing my incorrect results, whether the algorithm I'm using is wrong or my implementation of it is.


1 Reply


Ambiguity is a huge issue in natural language processing, but some tasks can actually cope with the ambiguity with good accuracy. It turns out syllabification is one of them, so don't listen to the other answers. :)

Syllabification

Heuristic-based

You could come up with heuristics that syllabify virtually the entire English vocabulary correctly, but they seem complicated to program correctly.

Corpus-based

As always, when hand-made algorithms don't help much, natural language processing researchers use hand-tagged corpora containing the correct answers for given words. Learning algorithms are then trained on them and often provide great accuracy. You can use LingPipe's syllabification (see "English syllabification"), which follows this approach.

Exhaustive list

English only has so many words, which is how we came up with dictionaries. Such dictionaries often contain the correct syllabification. You could scrape reference.com. For example, the "undulate" entry contains "un·du·late", which is enough to know there are three syllables.

Other such dictionaries include Answers.com, The Free Dictionary, Merriam-Webster, and so on. Do read the Terms and Conditions: automated retrieval may not be allowed. Also note that different dictionaries don't always agree with each other.

It won't help with new words or proper nouns, but I'd say it's going to be the most accurate method.
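
Once a syllabified headword such as "un·du·late" is in hand, the count is just the number of pieces between the dots. Here is a minimal Python sketch, assuming the entries use the middle dot (U+00B7) as separator; the `ENTRIES` table and both function names are made up for illustration, and how the entries are obtained is left out.

```python
from typing import Optional

MIDDLE_DOT = "\u00b7"  # the "·" separator used in entries like "un·du·late"

# Hypothetical, hand-filled table: plain word -> syllabified headword.
ENTRIES = {
    "undulate": "un·du·late",
}


def syllables_from_entry(headword: str, separator: str = MIDDLE_DOT) -> int:
    """Count the non-empty pieces of a syllabified headword."""
    return len([part for part in headword.strip().split(separator) if part])


def lookup_syllables(word: str, entries: dict) -> Optional[int]:
    """Return the syllable count for a known word, or None if it is not listed."""
    headword = entries.get(word.lower())
    if headword is None:
        return None  # new word or proper noun: the list cannot help
    return syllables_from_entry(headword)


if __name__ == "__main__":
    print(lookup_syllables("undulate", ENTRIES))
    print(lookup_syllables("Zorblax", ENTRIES))
```

Returning `None` for unlisted words keeps "not in the dictionary" distinct from a real count, so a caller can decide what to fall back on.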

About hyphenation

Another related problem has received a lot more exposure: hyphenation. But don't use that! It is used in typesetting programs such as LaTeX, but it only aims to provide some of the correct hyphens, without ever providing an incorrect one (high precision, low recall). It's interesting to note that there are only 14 exceptions, e.g. "project", which is hyphenated differently depending on its part of speech (verb or noun).

Hyphenation programs

If you decide that it's enough for your needs, note that a few implementations of the TeX hyphenation algorithm exist in other languages, such as Python, Perl, or Ruby.

