Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
598 views
in Technique[技术] by (71.8m points)

string - 普通英语的Ukkonen后缀树算法(Ukkonen's suffix tree algorithm in plain English)

I feel a bit thick at this point.

(在这一点上我感觉有点浓。)

I've spent days trying to fully wrap my head around suffix tree construction, but because I don't have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology.

(我花了几天的时间试图完全围绕后缀树构造,但是由于我没有数学背景,因此许多解释都使我难以理解,因为它们开始过度使用数学符号系统。)

The closest to a good explanation that I've found is Fast String Searching With Suffix Trees , but he glosses over various points and some aspects of the algorithm remain unclear.

(我发现的最接近很好的解释是带有后缀树的快速字符串搜索 ,但是他掩盖了各个要点,并且算法的某些方面仍然不清楚。)

A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I'm sure.

(我敢肯定,在堆栈溢出上对此算法的分步说明对我以外的其他许多人来说都是无价的。)

For reference, here's Ukkonen's paper on the algorithm: http://www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf

(作为参考,这里是有关算法的Ukkonen论文: http : //www.cs.helsinki.fi/u/ukkonen/SuffixT1withFigs.pdf)

My basic understanding, so far:

(到目前为止,我的基本了解:)

  • I need to iterate through each prefix P of a given string T

    (我需要遍历给定字符串T的每个前缀P)

  • I need to iterate through each suffix S in prefix P and add that to tree

    (我需要遍历前缀P中的每个后缀S并将其添加到树中)

  • To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down.

    (要将后缀S添加到树中,我需要遍历S中的每个字符,其中的迭代包括沿着以S中相同的字符集C开头的现有分支以及当我将边缘拆分成后代节点时进行在后缀中找到一个不同的字符,或者如果没有匹配的边要走。)

    When no matching edge is found to walk down for C, a new leaf edge is created for C.

    (当找不到匹配的边沿C向下走时,将为C创建新的叶边。)

The basic algorithm appears to be O(n 2 ), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix.

(正如大多数解释中所指出的那样,基本算法似乎是O(n 2 ),因为我们需要逐步处理所有前缀,然后才需要逐步处理每个前缀的每个后缀。)

Ukkonen's algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I'm having trouble understanding.

(Ukkonen的算法显然是独特的,因为他使用了后缀指针技术,尽管我认为是我难以理解的。)

I'm also having trouble understanding:

(我也很难理解:)

  • exactly when and how the "active point" is assigned, used and changed

    (准确地分配,使用和更改“活动点”的时间和方式)

  • what is going on with the canonization aspect of the algorithm

    (该算法的规范化方面发生了什么)

  • Why the implementations I've seen need to "fix" bounding variables that they are using

    (为什么我看到的实现需要“修复”他们使用的边界变量)


Here is the completed C# source code.

(这是完整的C#源代码。)

It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output.

(它不仅可以正常工作,而且支持自动规范化,并呈现输出的外观更好的文本图。)

Source code and sample output is at:

(源代码和示例输出位于:)

https://gist.github.com/2373868

(https://gist.github.com/2373868)


Update 2017-11-04

(更新2017-11-04)

After many years I've found a new use for suffix trees, and have implemented the algorithm in JavaScript .

(多年后,我发现了后缀树的新用法,并在JavaScript中实现了该算法。)

Gist is below.

(要点在下面。)

It should be bug-free.

(它应该没有错误。)

Dump it into a js file, npm install chalk from the same location, and then run with node.js to see some colourful output.

(将其转储到js文件中, npm install chalk从同一位置npm install chalk ,然后与node.js一起运行以查看一些彩色输出。)

There's a stripped down version in the same Gist, without any of the debugging code.

(在同一个Gist中有一个精简版,没有任何调试代码。)

https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6

(https://gist.github.com/axefrog/c347bf0f5e0723cbd09b1aaed6ec6fc6)

  ask by Nathan Ridley translate from so

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (ie does not contain any repeated characters), and then extending it to the full algorithm.

(以下是描述Ukkonen算法的尝试,首先显示字符串简单时(即不包含任何重复字符),然后将其扩展到完整算法。)

First, a few preliminary statements.

(首先,一些初步陈述。)

  1. What we are building, is basically like a search trie.

    (我们正在构建的基本上就像一个搜索树。)

    So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth

    (因此,存在一个根节点,边缘向外延伸到新节点,进一步的边缘向外延伸,依此类推)

  2. But : Unlike in a search trie, the edge labels are not single characters.

    (但是 :与搜索Trie不同,边缘标签不是单个字符。)

    Instead, each edge is labeled using a pair of integers [from,to] .

    (而是使用一对整数[from,to]标记每个边缘。)

    These are pointers into the text.

    (这些是文本的指针。)

    In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

    (从这个意义上讲,每个边都带有任意长度的字符串标签,但仅占用O(1)空间(两个指针)。)

Basic principle (基本原则)

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:

(我想首先演示如何创建一个特别简单的字符串(没有重复字符的字符串)的后缀树:)

abc

The algorithm works in steps, from left to right .

(该算法从左到右逐步执行 。)

There is one step for every character of the string .

(字符串的每个字符都有一个步骤 。)

Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).

(每个步骤可能涉及多个操作,但是我们将看到(请参阅最后的最终观察结果)操作总数为O(n)。)

So, we start from the left , and first insert only the single character a by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#] , which means the edge represents the substring starting at position 0 and ending at the current end .

(因此,我们从左侧开始,首先通过创建从根节点(在左侧)到叶的边,然后将其标记为[0,#] ,仅插入单个字符a ,这意味着该边代表子字符串从位置0开始, 到当前结束 。)

I use the symbol # to mean the current end , which is at position 1 (right after a ).

(我使用符号#表示当前端点 ,该端点位于位置1(在a之后)。)

So we have an initial tree, which looks like this:

(因此,我们有一个初始树,如下所示:)

And what it means is this:

(这意味着什么:)

Now we progress to position 2 (right after b ).

(现在我们前进到位置2(紧接b之后)。)

Our goal at each step is to insert all suffixes up to the current position .

(我们每个步骤的目标是将所有后缀插入到当前位置 。)

We do this by

(我们这样做)

  • expanding the existing a -edge to ab

    (将现有的a edge扩展到ab)

  • inserting one new edge for b

    (为b插入一条新边)

In our representation this looks like

(在我们的表示中,这看起来像)

在此处输入图片说明

And what it means is:

(它的意思是:)

We observe two things:

(我们观察到两件事:)

  • The edge representation for ab is the same as it used to be in the initial tree: [0,#] .

    (对于边缘表示ab是因为它使用的是在初始树中的相同[0,#] 。)

    Its meaning has automatically changed because we updated the current position # from 1 to 2.

    (由于我们将当前位置#从1更新为2,其含义已自动更改。)

  • Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.

    (每个边占用O(1)空间,因为它仅包含两个指向文本的指针,无论它代表多少个字符。)

Next we increment the position again and update the tree by appending a c to every existing edge and inserting one new edge for the new suffix c .

(接下来,我们再次增加位置并通过将c附加到每个现有边并为新后缀c插入一个新边来更新树。)

In our representation this looks like

(在我们的表示中,这看起来像)

And what it means is:

(它的意思是:)

We observe:

(我们观察到:)

  • The tree is the correct suffix tree up to the current position after each step

    (该树是每个步骤后直到当前位置的正确后缀树)

  • There are as many steps as there are characters in the text

    (文字中有多少个步骤)

  • The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing # , and inserting the one new edge for the final character can be done in O(1) time.

    (每个步骤的工作量为O(1),因为所有现有的边都会通过增加#来自动更新,并且可以在O(1)时间内为最终字符插入一个新边。)

    Hence for a string of length n, only O(n) time is required.

    (因此,对于长度为n的字符串,仅需要O(n)时间。)

First extension: Simple repetitions (第一次扩展:简单重复)

Of course this works so nicely only because our string does not contain any repetitions.

(当然,这很好用只是因为我们的字符串不包含任何重复。)

We now look at a more realistic string:

(现在我们来看一个更现实的字符串:)

abcabxabcd

It starts with abc as in the previous example, then ab is repeated and followed by x , and then abc is repeated followed by d .

(如前面的示例中一样,它以abc开头,然后重复ab ,后跟x ,然后重复abc ,后跟d 。)

Steps 1 through 3: After the first 3 steps we have the tree from the previous example:

(第1步到第3步在前3个步骤之后,我们有了上一个示例中的树:)

Step 4: We move # to position 4. This implicitly updates all existing edges to this:

(步骤4:#移至位置4。这会将所有现有边隐式更新为:)

and we need to insert the final suffix of the current step, a , at the root.

(我们需要在根目录中插入当前步骤的最后一个后缀a 。)

Before we do this, we introduce two more variables (in addition to # ), which of course have been there all the time but we haven't used them so far:

(在执行此操作之前,我们引入两个变量 (除了# ),这些变量当然一直存在,但是到目前为止我们还没有使用它们:)

  • The active point , which is a triple (active_node,active_edge,active_length)

    (活动点 ,为三重(active_node,active_edge,active_length))

  • The remainder , which is an integer indicating how many new suffixes we need to insert

    (remainder ,它是一个整数,指示我们需要插入多少个新后缀)

The exact meaning of these two will become clear soon, but for now let's just say:

(这两个的确切含义将很快变得清楚,但是现在让我们说:)

  • In the simple abc example, the active point was always (root,'\0x',0) , ie active_node was the root node, active_edge was specified as the null character '\0x' , and active_length was zero.

    (在简单的abc示例中,活动点始终为(root,'\0x',0) ,即active_node是根节点, active_edge被指定为空字符'\0x' ,而active_length为零。)

    The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge.

    (这样做的效果是,我们在每个步骤中插入的一条新边作为新创建的边插入了根节点。)

    We will see soon why a triple is necessary to represent this information.

    (我们很快就会看到为什么需要三元组来表示此信息。)

  • The remainder was always set to 1 at the beginning of each step.

    (在每个步骤的开始, remainder始终设置为1。)

    The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

    (意思是,在每个步骤的最后我们必须主动插入的后缀数是1(总是最后一个字符)。)

Now this is going to change.

(现在这将改变。)

When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a , specifically: abca .

(当我们在根中插入当前的最后一个字符a时,我们注意到已经有一个以a开头的传出边,特别是: abca 。)

Here is what we do in such a case:

(在这种情况下,我们要做的是:)

  • We do not insert a fresh edge [4,#] at the root node.

    (我们不在根节点插入新边[4,#] 。)

    Instead we simply notice that the suffix a is already in our tree.

    (相反,我们只是注意到后缀a已经在我们的树中。)

    It ends in the middle of a longer edge, but we are not bothered by that.

    (它结束于较长边缘的中间,但是我们对此并不感到困扰。)

    We just leave things the way they are.

    (我们只是把事情保持原样。)

  • We set the active point to (root,'a',1) .

    (我们将活动点设置(root,'a',1) 。)

    That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a , specifically, after position 1 on that edge.

    (这意味着活动点现在位于根节点的传出边缘的中间某个位置,该位置以a开头,特别是在该边缘的位置1之后。)

    We notice that the edge is specified simply by its first character a .

    (我们注意到,边缘仅由其第一个字符a指定。)

    That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description).

    (这样就足够了,因为只能有一个以任何特定字符开头的边(在通读整个说明后,请确保这是对的)。)

  • We also increment remainder , so at the beginning of the next step it will be 2.

    (我们还增加了remainder ,因此在下一步开始时为2。)

Observation: When the final suffix we need to insert is found to exist in the tree already , the tree itself is not chan


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...