The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.
First, a few preliminary statements.
What we are building is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth.
But: unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).
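To make the [from,to] idea concrete, here is a minimal Python sketch. The `Edge` class and its `label` method are hypothetical names chosen for illustration, and the sketch assumes that `to` is exclusive (one past the last character), which matches the later use of [0,#] with # sitting right after the last character read.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    start: int  # index of the first character of the label ("from")
    end: int    # index one past the last character ("to", assumed exclusive)

    def label(self, text: str) -> str:
        # The label is never stored; it is recovered from the text on demand.
        return text[self.start:self.end]


text = "abcabxabcd"
edge = Edge(0, 3)
print(edge.label(text))  # abc
```

However long the label is, the edge itself stores only two integers.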
Basic principle
I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:
abc
The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (in the final observations at the end) that the total number of operations is O(n).
So, we start from the left, and first insert only the single character a by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#], which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol # to mean the current end, which is at position 1 (right after a).
So we have an initial tree, which looks like this:
And what it means is this:
Now we progress to position 2 (right after b). Our goal at each step is to insert all suffixes up to the current position. We do this by
- expanding the existing a-edge to ab
- inserting one new edge for b
In our representation this looks like
And what it means is:
We observe two things:
- The edge representation for ab is the same as it used to be in the initial tree: [0,#]. Its meaning has automatically changed because we updated the current position # from 1 to 2.
- Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.
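The first observation can be illustrated with a small Python sketch in which every leaf edge holds a reference to one shared, mutable end marker. The names `End` and `LeafEdge` are assumptions made for this illustration, not part of the description above.

```python
class End:
    """Mutable box holding the current end position '#'."""

    def __init__(self, value: int = 0):
        self.value = value


class LeafEdge:
    def __init__(self, start: int, end: End):
        self.start = start
        self.end = end  # the same End object is shared by all leaf edges

    def label(self, text: str) -> str:
        return text[self.start:self.end.value]


text = "abc"
current_end = End(1)                # '#' is at position 1, right after a
a_edge = LeafEdge(0, current_end)   # the edge [0,#]
label_step1 = a_edge.label(text)
current_end.value = 2               # move '#' to position 2, right after b
label_step2 = a_edge.label(text)
print(label_step1, label_step2)     # a ab
```

The edge object is never touched between the two steps; only the shared marker moves, which is why extending all leaf edges costs nothing extra.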
Next we increment the position again and update the tree by appending a c to every existing edge and inserting one new edge for the new suffix c.
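The procedure so far, for a text without repeated characters only, can be sketched as follows. The function `build_simple_tree` is a hypothetical helper; it represents the root's edges as a dictionary from first character to start index, with a single shared end value standing in for #. It is not the full algorithm.

```python
def build_simple_tree(text: str):
    """Build the root's edges for a text with no repeated characters.

    Returns (edges, end): edges maps each edge's first character to its
    start index, and end is the final value of the shared end marker '#'.
    """
    if len(set(text)) != len(text):
        raise ValueError("this sketch only handles texts without repeated characters")
    edges = {}
    end = 0
    for i, ch in enumerate(text):
        end = i + 1     # advancing '#' implicitly appends ch to every existing edge
        edges[ch] = i   # the one explicit operation: a new leaf edge [i,#]
    return edges, end


text = "abc"
edges, end = build_simple_tree(text)
labels = sorted(text[start:end] for start in edges.values())
print(labels)  # ['abc', 'bc', 'c']
```

Each step performs a constant amount of explicit work, which is what makes the simple case easy.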
In our representation this looks like
And what it means is:
We observe:
First extension: Simple repetitions
Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:
abcabxabcd
It starts with abc as in the previous example, then ab is repeated and followed by x, and then abc is repeated followed by d.
Steps 1 through 3: After the first 3 steps we have the tree from the previous example:
Step 4: We move # to position 4. This implicitly updates all existing edges to this:
and we need to insert the final suffix of the current step, a, at the root.
Before we do this, we introduce two more variables (in addition to #), which of course have been there all the time but we haven't used them so far:
- The active point, which is a triple (active_node,active_edge,active_length)
- The remainder, which is an integer indicating how many new suffixes we need to insert
The exact meaning of these two will become clear soon, but for now let's just say:
- In the simple abc example, the active point was always (root,'\0x',0), i.e. active_node was the root node, active_edge was specified as the null character '\0x', and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.
- The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).
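As a rough illustration, the two variables could be held in small Python records like the ones below. The class names `ActivePoint` and `State` are hypothetical, the root is represented by the plain string "root" for simplicity, and Python's "\0" stands in for the null character written '\0x' above.

```python
from dataclasses import dataclass, field


@dataclass
class ActivePoint:
    node: str = "root"   # active_node (a plain name here, for illustration)
    edge: str = "\0"     # active_edge: first character of the edge, null if none
    length: int = 0      # active_length: how far along that edge we are


@dataclass
class State:
    active: ActivePoint = field(default_factory=ActivePoint)
    remainder: int = 1   # number of suffixes still to be inserted in this step


state = State()
print(state.active.node, state.active.length, state.remainder)  # root 0 1
```

These defaults correspond to the situation in the simple abc example: the active point sits at the root and exactly one suffix is inserted per step.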
Now this is going to change. When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a, specifically: abca. Here is what we do in such a case:
- We do not insert a fresh edge [3,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.
- We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of the outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description).
- We also increment remainder, so at the beginning of the next step it will be 2.
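The three actions above can be sketched in Python for the narrow situation described here, where the active point is at the root with length zero. The function `step_at_root` is a hypothetical helper, not the full algorithm: it represents the root's edges as a dictionary from first character to start index, and it returns the active point together with the value remainder will have at the beginning of the next step.

```python
def step_at_root(text, root_edges, pos, remainder):
    """Handle one step while the active point is (root, null, 0).

    root_edges maps the first character of each outgoing edge of the root
    to that edge's start index; pos is the 0-based index of the new character.
    """
    ch = text[pos]
    if ch in root_edges:
        # The suffix is already in the tree: leave the tree unchanged,
        # move the active point one character down the matching edge,
        # and carry the pending insertion over to the next step.
        return ("root", ch, 1), remainder + 1
    # Otherwise insert a fresh leaf edge [pos,#] at the root.
    root_edges[ch] = pos
    return ("root", "\0", 0), 1


text = "abcabxabcd"
root_edges = {"a": 0, "b": 1, "c": 2}   # tree after steps 1 through 3
# Step 4 reads the fourth character, which sits at 0-based index 3.
active_point, remainder = step_at_root(text, root_edges, 3, 1)
print(active_point, remainder)  # ('root', 'a', 1) 2
```

Note that `root_edges` is unchanged after the call, which mirrors the fact that no edge is added in this step.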
Observation: When the final suffix we need to insert is found to exist in the tree already , the tree itself is not chan