r - Trouble accessing quanteda corpus quantities in version >= 2


posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)


I am having a problem running a script I wrote some time ago. Back then, applying quanteda::corpus to a readtext object returned an object of class "corpus" and "list". When I run the same script now, it returns an object of class "corpus" and "character", and this breaks the code that follows. What could be the reason for this, and how can I solve it?

Here is the script:

txt <- readtext("C:/Users/aerol/Desktop/txt_sample")
corpus_txt <- corpus(txt) %>%
  corpus_reshape(to = "sentences")

docvars(corpus_txt, "Treaty") <- corpus_txt$documents$`_document`
docvars(corpus_txt, "Year") <- as.integer(stri_sub(corpus_txt$documents$`_document`, -9, -6))

The files are international treaties. All the filenames follow the same format: they contain the name of the treaty and the year it was signed, and I was extracting both.

Back then, the class of corpus_txt was "corpus" "list":

> class(corpus_txt)
[1] "corpus" "list"

But now:

> class(corpus_txt)
[1] "corpus"    "character"
> packageVersion("quanteda")
[1] ‘2.1.2’

I can no longer extract information from the corpus the way I did before. Since I have been working on this since last October, I should have been using the same version all along.

Many thanks in advance.

Question from: https://stackoverflow.com/questions/65672195/trouble-accessing-quanteda-corpus-quantities-in-version-2


1 Reply

深蓝 · Answer 1 · 2021-10-06T18:51:16+0000

We changed the corpus internal structure in v2, after two years of warning in the documentation that users should not access the corpus internals directly, or their code would not likely work under future major versions.

From https://github.com/quanteda/quanteda/blob/master/NEWS.md#quanteda-20:

quanteda 2.0 introduces some major changes, detailed here.

New corpus object structure.

The internals of the corpus object have been redesigned, and now are based around a character vector with meta- and system-data in attributes. These are all updated to work with the existing extractor and replacement functions. If you were using these before, then you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.

From ?corpus:

For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
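As a rough illustration of what "use the extractor functions" means in practice, here is a minimal sketch assuming quanteda >= 2.0 is installed. The two-document corpus and its `year` docvar are hypothetical stand-ins, not the asker's data.

```r
library(quanteda)

# Hypothetical two-document corpus (not the treaty files from the question)
corp <- corpus(
  c(doc_a = "Some text.", doc_b = "More text."),
  docvars = data.frame(year = c(1990L, 2005L))
)

is.character(corp)      # the corpus is built on a character vector in quanteda >= 2
docnames(corp)          # document names
docvars(corp, "year")   # a single document variable
as.character(corp)      # the texts as a plain named character vector
ndoc(corp)              # number of documents
```

None of these calls depends on how the corpus is stored internally, which is why they kept working across the v2 redesign while `corpus_txt$documents` did not.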

Solution? Use docnames(corpus_txt) in place of corpus_txt$documents$`_document`.
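One way the script from the question might be adapted is sketched below. It rests on some assumptions: the named character vector stands in for the readtext() result, the filenames are invented, and the year is pulled out with a regular expression instead of the original stri_sub(..., -9, -6) offsets, because the real filename format is not shown. The docvars are set before reshaping so that they are derived from the original document names; corpus_reshape() carries them over to each sentence.

```r
library(quanteda)
library(stringi)

# Hypothetical stand-in for readtext("..."): names play the role of filenames
txt <- c(
  "TreatyA_1990.txt" = "First sentence. Second sentence.",
  "TreatyB_2005.txt" = "Only one sentence here."
)

corpus_txt <- corpus(txt)

# Use the docnames() accessor instead of corpus_txt$documents$`_document`
docvars(corpus_txt, "Treaty") <- docnames(corpus_txt)

# Assumption: the year is the last run of four digits in the filename
docvars(corpus_txt, "Year") <- as.integer(
  stri_extract_last_regex(docnames(corpus_txt), "[0-9]{4}")
)

# Reshape afterwards; each sentence inherits the docvars of its source document
corpus_txt <- corpus_reshape(corpus_txt, to = "sentences")

docvars(corpus_txt)
```

If the fixed offsets from the original script suit the real filenames better, the regular expression can be swapped back for stri_sub() on docnames(corpus_txt), as long as this is done before reshaping.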
