Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

r - Trouble accessing quanteda corpus quantities in version >= 2

I am having a problem running a script I wrote earlier. Back then, applying quanteda::corpus() to a readtext object returned an object of class "corpus" and "list". Now the same script returns an object of class "corpus" and "character", which breaks the subsequent code. What could be the reason for this, and how can I fix it?

Here is the script:

library(readtext)
library(quanteda)
library(stringi)

txt <- readtext("C:/Users/aerol/Desktop/txt_sample")
corpus_txt <- corpus(txt) %>%
  corpus_reshape(to = "sentences")

# The two lines below read the corpus internals directly (quanteda < 2 layout)
docvars(corpus_txt, "Treaty") <- corpus_txt$documents$`_document`
docvars(corpus_txt, "Year") <- as.integer(stri_sub(corpus_txt$documents$`_document`, -9, -6))

The files are international treaties. All filenames follow the same format: they contain the name of the treaty and the year it was signed, and I was extracting both.

Back then, the class of corpus_txt was "corpus" "list":

> class(corpus_txt)
[1] "corpus" "list"  

But now:

> class(corpus_txt)
[1] "corpus"    "character"
> packageVersion("quanteda")
[1] ‘2.1.2’

And I cannot extract information from the corpus the way I did before. I have been working on this since last October, so I assumed I had been using the same version all along.

Many thanks in advance.

Question from: https://stackoverflow.com/questions/65672195/trouble-accessing-quanteda-corpus-quantities-in-version-2


1 Reply


We changed the internal structure of the corpus in v2, after two years of warnings in the documentation that users should not access the corpus internals directly, because such code would likely break under future major versions.

From https://github.com/quanteda/quanteda/blob/master/NEWS.md#quanteda-20:

quanteda 2.0 introduces some major changes, detailed here.

  1. New corpus object structure.

    The internals of the corpus object have been redesigned, and now are based around a character vector with meta- and system-data in attributes. These are all updated to work with the existing extractor and replacement functions. If you were using these before, then you should not even notice the change. Docvars are now handled separately from the texts, in the same way that docvars are handled for tokens objects.

From ?corpus:

For quanteda >= 2.0, this is a specially classed character vector. It has many additional attributes but you should not access these attributes directly, especially if you are another package author. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change. Using the accessor and replacement functions ensures that future code to manipulate corpus objects will continue to work.
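To illustrate what "extractor and replacement functions" means in practice, here is a small sketch on a toy corpus. The texts, the document names (doc_a, doc_b) and the "source" docvar are made up for the example; only the quanteda functions themselves are real.

```r
library(quanteda)

# Toy corpus built from a named character vector (names become docnames)
corp <- corpus(c(doc_a = "One short text.", doc_b = "Another short text."))

# In quanteda >= 2.0 the corpus is a classed character vector
class(corp)
is.character(corp)

# Extractors: read document names, counts and texts without touching internals
docnames(corp)
ndoc(corp)
as.character(corp)

# Replacement function: attach a document variable (hypothetical values)
docvars(corp, "source") <- c("x", "y")
docvars(corp)
```

None of these calls depend on how the object is stored, so they should keep working if the internal layout changes again.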

Solution? Use docnames(corpus_txt).
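A minimal sketch of how the original script could be adapted, under some assumptions: the real file names are not shown in the question, so the example uses hypothetical names ending in a four-digit year followed by ".txt", and the stri_sub() offsets are chosen for those names rather than copied from the question. It also sets the docvars before reshaping, because after corpus_reshape() the sentence-level document names typically carry a serial suffix, while docvars are copied to each sentence.

```r
library(quanteda)
library(stringi)

# Hypothetical stand-in for the readtext() output: names mimic file names
txt <- c(
  "Example_Peace_Treaty_1919.txt" = "The parties agree to peace. The treaty enters into force at once.",
  "Example_Charter_1945.txt" = "The members shall meet annually."
)
corpus_doc <- corpus(txt)

# Use the accessor instead of corpus_txt$documents$`_document`
docvars(corpus_doc, "Treaty") <- docnames(corpus_doc)
# Offsets assume the name ends in "YYYY.txt"; adjust them for the real file names
docvars(corpus_doc, "Year") <- as.integer(stri_sub(docnames(corpus_doc), -8, -5))

# Docvars are carried over to every sentence when the corpus is reshaped
corpus_txt <- corpus_reshape(corpus_doc, to = "sentences")
docvars(corpus_txt)
```

With the real data, corpus(txt) on the readtext object would replace the toy vector, and the rest should carry over unchanged apart from the offsets.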

