Project: jennybc/analyze-github-stuff-with-r
Repository: https://github.com/jennybc/analyze-github-stuff-with-r
Language: R (100.0%)

How to obtain a bunch of GitHub issues or pull requests with R

Three motivating examples, where I marshal data from the GitHub API using the excellent gh package.
This is a glorified note-to-self. It might be interesting to a few other people, but I presume a lot of experience with R and a full-on embrace of the pipe-and-purrr style used below.

Oliver's open issues

Let's start with the easiest task: does Oliver have open issues? If so, can we be more specific? First, load packages. Install gh and purrr from GitHub if needed.

# install_github("gaborcsardi/gh")
# install_github("hadley/purrr")
library(gh)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(purrr))

Use gh() to retrieve all of Oliver's public repositories.

repos <- gh("/users/ironholds/repos", .limit = Inf)
length(repos)
#> [1] 48

Create a data frame with one row per repo and two variables.
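Before building it, here is how purrr plucks fields out of such nested lists. A self-contained sketch with fabricated records (repos_demo is made up; real elements come from gh() and carry many more fields):

```r
library(purrr)

# Two fabricated repo records, shaped like a tiny subset of the API payload.
repos_demo <- list(
  list(name = "arin", owner = list(login = "ironholds")),
  list(name = "rope", owner = list(login = "ironholds"))
)

map_chr(repos_demo, "name")              # pluck a top-level field
#> [1] "arin" "rope"
map_chr(repos_demo, c("owner", "login")) # pluck a nested field by path
#> [1] "ironholds" "ironholds"
```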
iss_df <-
data_frame(
repo = repos %>% map_chr("name"),
issue = repo %>%
map(~ gh(repo = .x, endpoint = "/repos/ironholds/:repo/issues",
.limit = Inf))
)
str(iss_df, max.level = 1)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 48 obs. of 2 variables:
#> $ repo : chr "arin" "averageimage" "batman" "billund" ...
#> $ issue:List of 48

Create a decent display of how many open issues there are on each repo. I use map_int() to count the issues, then keep only repos with at least one open issue.
mutate(n_open = issue %>% map_int(length)) %>%
select(-issue) %>%
filter(n_open > 0) %>%
arrange(desc(n_open)) %>%
print(n = nrow(.))
#> Source: local data frame [13 x 2]
#>
#> repo n_open
#> (chr) (int)
#> 1 passbypromised 7
#> 2 distributions 5
#> 3 driver 5
#> 4 practice 3
#> 5 rgeolocate 2
#> 6 urltools 2
#> 7 primes 1
#> 8 protein 1
#> 9 rope 1
#> 10 webreadr 1
#> 11 WikidataR 1
#> 12 WikipediR 1
#> 13 wmf 1

A clean script for this is available in open-issue-count-by-repo.R.

Pull requests on a repo

Even though it was Advanced R that got me thinking about this, I first started playing around with R Packages, which happens to have 50% fewer PRs than Advanced R. But I've done this for both books and present a script and figure for each at the end of this example.

Load packages. Even more this time.

library(gh)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
suppressPackageStartupMessages(library(purrr))
library(curl)
suppressPackageStartupMessages(library(readr))

Use gh() to retrieve all of the pull requests on R Packages.

owner <- "hadley"
repo <- "r-pkgs"
pr_list <-
gh("/repos/:owner/:repo/pulls", owner = owner, repo = repo,
state = "all", .limit = Inf)
length(pr_list)
#> [1] 295

Define a little helper function that won't be necessary forever, but is useful below when we dig info out of pr_list.

map_chr_hack <- function(.x, .f, ...) {
map(.x, .f, ...) %>%
map_if(is.null, ~ NA_character_) %>%
flatten_chr()
}

Use purrr functions to reach into pr_list and build a data frame with one row per pull request.

pr_df <- pr_list %>%
{
data_frame(number = map_int(., "number"),
id = map_int(., "id"),
title = map_chr(., "title"),
state = map_chr(., "state"),
user = map_chr(., c("user", "login")),
commits_url = map_chr(., "commits_url"),
diff_url = map_chr(., "diff_url"),
patch_url = map_chr(., "patch_url"),
merge_commit_sha = map_chr_hack(., "merge_commit_sha"),
pr_HEAD_label = map_chr(., c("head", "label")),
pr_HEAD_sha = map_chr(., c("head", "sha")),
pr_base_label = map_chr(., c("base", "label")),
pr_base_sha = map_chr(., c("base", "sha")),
created_at = map_chr(., "created_at") %>% as.Date(),
closed_at = map_chr_hack(., "closed_at") %>% as.Date(),
merged_at = map_chr_hack(., "merged_at") %>% as.Date())
}
pr_df
#> Source: local data frame [295 x 16]
#>
#> number id title
#> (int) (int) (chr)
#> 1 327 51398771 `.rda` extension is case sensitive
#> 2 326 48678175 modified git command for deleting branch
#> 3 324 47463575 Update tests.rmd
#> 4 323 47457827 Update man.rmd
#> 5 322 47344525 slightly modified in the "Binary builds" section.
#> 6 320 43412833 removed extraneous word "are"
#> 7 319 43271518 Fixing typos
#> 8 318 42719902 Homogenize LaTeX spelling
#> 9 317 42078372 Merge pull request #1 from scw/git-typos
#> 10 316 42037653 Fix small typos in src.Rmd
#> .. ... ... ...
#> Variables not shown: state (chr), user (chr), commits_url (chr), diff_url
#> (chr), patch_url (chr), merge_commit_sha (chr), pr_HEAD_label (chr),
#> pr_HEAD_sha (chr), pr_base_label (chr), pr_base_sha (chr), created_at
#> (date), closed_at (date), merged_at (date).

I want to know which files are affected by each PR. If I had all this stuff locally, I would do something like this:

git diff --name-only SHA1 SHA2

I have to emulate that with the GitHub API. It seems the "compare two commits" feature only works for two branches or two tags, but not two arbitrary SHAs. Please enlighten me and answer this question on StackOverflow if you know how to do this.

My current workaround is to get info on the diff associated with a pull request from its associated patch file. We've already stored these URLs in the patch_url variable. Source a helper that parses a patch.

source("get-pr-affected-files-from-patch.R")

Add a list-column to the data frame of pull requests. It holds one data frame per PR, with info on the file changes.

pr_df <- pr_df %>%
mutate(pr_files = patch_url %>% map(get_pr_affected_files_from_patch))

Sanity check the new pr_files list-column by inspecting one element.

pr_df$pr_files[[69]]
#> Source: local data frame [2 x 2]
#>
#> file diffstuff
#> (chr) (chr)
#> 1 man.rmd 10 +++++-----
#> 2 package.rmd 4 ++--
pr_df$pr_files %>% map(dim) %>% do.call(rbind, .) %>% apply(2, table)
#> [[1]]
#>
#> 0 1 2 6
#> 1 285 8 1
#>
#> [[2]]
#>
#> 2
#> 295

Simplify the list-column elements from data frame to character vector. Then use tidyr::unnest() to expand, so each affected file gets its own row.

nrow(pr_df)
#> [1] 295
pr_df <- pr_df %>%
mutate(pr_files = pr_files %>% map("file")) %>%
unnest(pr_files)
nrow(pr_df)
#> [1] 307

Write the relevant variables to a CSV file.

pr_df %>%
select(number, id, title, state, user, pr_files) %>%
write_csv("r-pkgs-pr-affected-files.csv")

Here's a figure depicting how often each chapter has been the target of a pull request. I'm not adjusting for chapter length or anything, so take it with a huge grain of salt. But there's no obvious evidence that people read and edit the earlier chapters more. Apparently we like to make suggestions about Git!

Recap of files related to PRs on R Packages
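The helper sourced above, get-pr-affected-files-from-patch.R, isn't reproduced here, but the core idea of recovering affected files from a patch can be sketched: scan the unified-diff text for "diff --git" headers and strip the "a/" prefix. This is a hypothetical minimal version, not the actual helper; get_files_from_patch_text and patch_demo are made-up names:

```r
# Hypothetical sketch, not the helper actually sourced above:
# pull affected file names out of unified-diff text.
get_files_from_patch_text <- function(patch_lines) {
  headers <- grep("^diff --git ", patch_lines, value = TRUE)
  # "diff --git a/man.rmd b/man.rmd" -> "man.rmd"
  sub("^diff --git a/(.*) b/.*$", "\\1", headers)
}

patch_demo <- c(
  "diff --git a/man.rmd b/man.rmd",
  "--- a/man.rmd",
  "+++ b/man.rmd",
  "@@ -1,3 +1,3 @@",
  "diff --git a/package.rmd b/package.rmd"
)
get_files_from_patch_text(patch_demo)
#> [1] "man.rmd"     "package.rmd"
```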
I went through the same steps with all pull requests on Advanced R. Here's the same figure as above, but for Advanced R. There's a stronger case for earlier chapters being targeted with PRs more often.

Recap of files related to PRs on Advanced R:
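As an aside, here's why map_chr_hack() was needed earlier: map_chr() errors when the plucked field is NULL, which happens for merge_commit_sha, closed_at, and merged_at on some PRs. A small demonstration with fabricated records (prs_demo is made up):

```r
library(purrr)

# Same helper as defined earlier in this walkthrough.
map_chr_hack <- function(.x, .f, ...) {
  map(.x, .f, ...) %>%
    map_if(is.null, ~ NA_character_) %>%
    flatten_chr()
}

# Fabricated PR records: the second was never merged, so its sha is NULL.
prs_demo <- list(
  list(number = 1L, merge_commit_sha = "ab12cd"),
  list(number = 2L, merge_commit_sha = NULL)
)

map_chr_hack(prs_demo, "merge_commit_sha")
#> [1] "ab12cd" NA
```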
Issue threads

STAT 545 has a public Discussion repo, where we use the issues as a discussion board. I want to look at the posts there, as something related to student engagement that I can actually quantify. This starts out fairly similar to the previous example: I retrieve all issues that have been modified since September 1, 2015.

owner <- "STAT545-UBC"
repo <- "Discussion"
issue_list <-
gh("/repos/:owner/:repo/issues", owner = owner, repo = repo,
state = "all", since = "2015-09-01T00:00:00Z", .limit = Inf)
(n_iss <- length(issue_list))
#> [1] 212

This retrieves 212 issues. I use this list to create a conventional data frame with one row per issue.

issue_df <- issue_list %>%
{
data_frame(number = map_int(., "number"),
id = map_int(., "id"),
title = map_chr(., "title"),
state = map_chr(., "state"),
n_comments = map_int(., "comments"),
opener = map_chr(., c("user", "login")),
created_at = map_chr(., "created_at") %>% as.Date())
}
issue_df
#> Source: local data frame [212 x 7]
#>
#> number id title state
#> (int) (int) (chr) (chr)
#> 1 276 119601582 Creating PDFs via latex in command line/make open
#> 2 275 119272439 general makefile confusion closed
#> 3 274 119262468 do is dropping countries closed
#> 4 273 119259601 linear regression within each country closed
#> 5 272 119257920 adding another column? closed
#> 6 271 119252407 how to add the residual error? closed
#> 7 270 119236992 Pandoc error solution open
#> 8 269 119230359 how many scripts closed
#> 9 268 119133218 Can't download packages from github open
#> 10 267 119112488 using gapminder.tsv closed
#> .. ... ... ... ...
#> Variables not shown: n_comments (int), opener (chr), created_at (date).

It turns out some of these issues were created during the 2014 run but show up here because I closed them in early September. Get rid of them.

issue_df <- issue_df %>%
filter(created_at >= "2015-09-01T00:00:00Z")
(n_iss <- nrow(issue_df))
#> [1] 192

Down to 192 issues. My ultimate goal is a data frame with one row per issue comment, but it's harder than you'd expect to get there. Each issue should be represented by at least one row, and many will have several rows, as there are typically follow-up comments. I need to loop over the issues and retrieve the follow-up comments. I mean that literally: the Issue Comments endpoint does not return a comment for the opening of the issue. This makes for a little extra data manipulation ... and more practice with purrr and tidyr.

Make a data frame of issue "opens" with a set of variables chosen for maximum bliss in future binds and joins. The variable i numbers the posts within an issue; an open counts as post 0.

opens <- issue_df %>%
select(number, who = opener) %>%
mutate(i = 0L)
opens
#> Source: local data frame [192 x 3]
#>
#> number who i
#> (int) (chr) (int)
#> 1 276 samhinshaw 0
#> 2 275 molliejmcdowell 0
#> 3 274 molliejmcdowell 0
#> 4 273 molliejmcdowell 0
#> 5 272 bdacunha 0
#> 6 271 bdacunha 0
#> 7 270 zhamel 0
#> 8 269 molliejmcdowell 0
#> 9 268 wang114 0
#> 10 267 bdacunha 0
#> .. ... ... ...
nrow(opens)
#> [1] 192

Make a data frame of issue follow-up comments. At first, this has to hold an unfriendly list-column, res.

comments <- issue_df %>%
select(number) %>%
mutate(res = number %>% map(
~ gh(number = .x,
endpoint = "/repos/:owner/:repo/issues/:number/comments",
owner = owner, repo = repo, .limit = Inf)))
str(comments, max.level = 1)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 192 obs. of 2 variables:
#> $ number: int 276 275 274 273 272 271 270 269 268 267 ...
#> $ res :List of 192
#> .. [list output truncated]

What does the res list-column hold? Inspect a few elements.

comments %>%
filter(number %in% c(275, 273, 272)) %>%
select(res) %>%
walk(str, max.level = 2, give.attr = FALSE)
#> List of 3
#> $ :List of 2
#> ..$ :List of 8
#> ..$ :List of 8
#> $ : list()
#> $ :List of 6
#> ..$ :List of 8
#> ..$ :List of 8
#> ..$ :List of 8
#> ..$ :List of 8
#> ..$ :List of 8
#> ..$ :List of 8

All I really want to know is who made each comment, so I reduce res to just the commenters' logins.

comments <- comments %>%
mutate(who = res %>% map(. %>% map_chr(c("user", "login")))) %>%
select(-res)
comments %>%
filter(number %in% c(275, 273, 272))
#> Source: local data frame [3 x 2]
#>
#> number who
#> (int) (list)
#> 1 275 <chr[2]>
#> 2 273 <chr[0]>
#> 3 272 <chr[6]>

Use tidyr::unnest() to get one row per comment, then number the comments within each issue.

comments <- comments %>%
unnest(who) %>%
group_by(number) %>%
mutate(i = row_number(number)) %>%
ungroup()
comments
#> Source: local data frame [863 x 3]
#>
#> number who i
#> (int) (chr) (int)
#> 1 275 jennybc 1
#> 2 275 molliejmcdowell 2
#> 3 274 ksamuk 1
#> 4 274 molliejmcdowell 2
#> 5 272 jennybc 1
#> 6 272 jennybc 2
#> 7 272 bdacunha 3
#> 8 272 jennybc 4
#> 9 272 jennybc 5
#> 10 272 bdacunha 6
#> .. ... ... ...

No more list-columns! It's time for a sanity check. Do the empirical counts of follow-up comments match the number of comments initially reported by the API?

count_empirical <- comments %>%
count(number)
count_stated <- issue_df %>%
select(number, stated = n_comments)
checker <- left_join(count_empirical, count_stated)
#> Joining by: "number"
with(checker, n == stated) %>% all() # hopefully TRUE
#> [1] TRUE

I row-bind the issue "opens" and follow-up comments, feeling very smug that they have exactly the same variables, though it is no accident.

atoms <- bind_rows(opens, comments)

Join back to the original data frame of issues, since that still holds issue title, state, and creation date. It is intentional that the result keeps both opener and who, so each row shows who opened the issue alongside who wrote the post.

finally <- atoms %>%
left_join(issue_df) %>%
select(number, id, opener, who, i, everything()) %>%
arrange(desc(number), i)
#> Joining by: "number"

A quick look at this and ... we're ready for analysis. Our work here is done.

finally
#> Source: local data frame [1,055 x 9]
#>
#> number id opener who i
#> (int) (int) (chr) (chr) (int)
#> 1 276 119601582 samhinshaw samhinshaw 0
#> 2 275 119272439 molliejmcdowell molliejmcdowell 0
#> 3 275 119272439 molliejmcdowell jennybc 1
#> 4 275 119272439 molliejmcdowell molliejmcdowell 2
#> 5 274 119262468 molliejmcdowell molliejmcdowell 0
#> .. ... ... ... ... ...
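With a flat table like finally, quantifying engagement reduces to counting rows per person. A sketch on fabricated rows of the same shape (finally_demo is made up; the real table has over a thousand rows):

```r
suppressPackageStartupMessages(library(dplyr))

# Fabricated rows with the same shape as finally: one row per open or comment.
finally_demo <- tibble(
  number = c(276L, 275L, 275L, 275L, 272L),
  who    = c("samhinshaw", "molliejmcdowell", "jennybc", "molliejmcdowell", "jennybc"),
  i      = c(0L, 0L, 1L, 2L, 1L)
)

# Posts per person, most active first: jennybc and molliejmcdowell
# have 2 posts each, samhinshaw has 1.
finally_demo %>%
  count(who, sort = TRUE)
```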