Project: jennybc/analyze-github-stuff-with-r
Repository: https://github.com/jennybc/analyze-github-stuff-with-r
Language: R (100.0%)

How to obtain a bunch of GitHub issues or pull requests with R

Three motivating examples, where I marshal data from the GitHub API using the excellent gh package.
This is a glorified note-to-self. It might be interesting to a few other people, but I presume a lot of experience with R and a full-on embrace of the pipe-and-purrr style used below.

Oliver's open issues

Let's start with the easiest task: does Oliver have open issues? If so, can we be more specific? First, load packages. Install gh and purrr from GitHub if needed.

# install_github("gaborcsardi/gh")
# install_github("hadley/purrr")
library(gh)
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(purrr))

Use gh() to retrieve all of Oliver's public repositories.

repos <- gh("/users/ironholds/repos", .limit = Inf)
length(repos)
#> [1] 48

Create a data frame with one row per repo and two variables.
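Before building it, here is how purrr plucks fields out of such nested lists. A self-contained sketch with fabricated records (repos_demo is made up; real elements come from gh() and carry many more fields):

```r
library(purrr)

# Two fabricated repo records, shaped like a tiny subset of the API payload.
repos_demo <- list(
  list(name = "arin", owner = list(login = "ironholds")),
  list(name = "rope", owner = list(login = "ironholds"))
)

map_chr(repos_demo, "name")              # pluck a top-level field
#> [1] "arin" "rope"
map_chr(repos_demo, c("owner", "login")) # pluck a nested field by path
#> [1] "ironholds" "ironholds"
```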
iss_df <-
data_frame(
repo = repos %>% map_chr("name"),
issue = repo %>%
map(~ gh(repo = .x, endpoint = "/repos/ironholds/:repo/issues",
.limit = Inf))
)
str(iss_df, max.level = 1)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 48 obs. of 2 variables:
#> $ repo : chr "arin" "averageimage" "batman" "billund" ...
#> $ issue:List of 48

Create a decent display of how many open issues there are on each repo. I use map_int() to count the issues, then keep only repos with at least one open issue.
mutate(n_open = issue %>% map_int(length)) %>%
select(-issue) %>%
filter(n_open > 0) %>%
arrange(desc(n_open)) %>%
print(n = nrow(.))
#> Source: local data frame [13 x 2]
#>
#> repo n_open
#> (chr) (int)
#> 1 passbypromised 7
#> 2 distributions 5
#> 3 driver 5
#> 4 practice 3
#> 5 rgeolocate 2
#> 6 urltools 2
#> 7 primes 1
#> 8 protein 1
#> 9 rope 1
#> 10 webreadr 1
#> 11 WikidataR 1
#> 12 WikipediR 1
#> 13 wmf 1

A clean script for this is available in open-issue-count-by-repo.R.

Pull requests on a repo

Even though it was Advanced R that got me thinking about this, I first started playing around with R Packages, which happens to have 50% fewer PRs than Advanced R. But I've done this for both books and present a script and figure for each at the end of this example.

Load packages. Even more this time.

library(gh)
suppressPackageStartupMessages(library(dplyr))
library(tidyr)
suppressPackageStartupMessages(library(purrr))
library(curl)
suppressPackageStartupMessages(library(readr))

Use gh() to retrieve all of the pull requests on R Packages.

owner <- "hadley"
repo <- "r-pkgs"
pr_list <-
gh("/repos/:owner/:repo/pulls", owner = owner, repo = repo,
state = "all", .limit = Inf)
length(pr_list)
#> [1] 295

Define a little helper function that won't be necessary forever, but is useful below when we dig info out of pr_list.

map_chr_hack <- function(.x, .f, ...) {
map(.x, .f, ...) %>%
map_if(is.null, ~ NA_character_) %>%
flatten_chr()
}

Use purrr functions to reach into pr_list and build a data frame with one row per pull request.

pr_df <- pr_list %>%
{
data_frame(number = map_int(., "number"),
id = map_int(., "id"),
title = map_chr(., "title"),
state = map_chr(., "state"),
user = map_chr(., c("user", "login")),
commits_url = map_chr(., "commits_url"),
diff_url = map_chr(., "diff_url"),
patch_url = map_chr(., "patch_url"),
merge_commit_sha = map_chr_hack(., "merge_commit_sha"),
pr_HEAD_label = map_chr(., c("head", "label")),
pr_HEAD_sha = map_chr(., c("head", "sha")),
pr_base_label = map_chr(., c("base", "label")),
pr_base_sha = map_chr(., c("base", "sha")),
created_at = map_chr(., "created_at") %>% as.Date(),
closed_at = map_chr_hack(., "closed_at") %>% as.Date(),
merged_at = map_chr_hack(., "merged_at") %>% as.Date())
}
pr_df
#> Source: local data frame [295 x 16]
#>
#> number id title
#> (int) (int) (chr)
#> 1 327 51398771 `.rda` extension is case sensitive
#> 2 326 48678175 modified git command for deleting branch
#> 3 324 47463575 Update tests.rmd
#> 4 323 47457827 Update man.rmd
#> 5 322 47344525 slightly modified in the "Binary builds" section.
#> 6 320 43412833 removed extraneous word "are"
#> 7 319 43271518 Fixing typos
#> 8 318 42719902 Homogenize LaTeX spelling
#> 9 317 42078372 Merge pull request #1 from scw/git-typos
#> 10 316 42037653 Fix small typos in src.Rmd
#> .. ... ... ...
#> Variables not shown: state (chr), user (chr), commits_url (chr), diff_url
#> (chr), patch_url (chr), merge_commit_sha (chr), pr_HEAD_label (chr),
#> pr_HEAD_sha (chr), pr_base_label (chr), pr_base_sha (chr), created_at
#> (date), closed_at (date), merged_at (date).

I want to know which files are affected by each PR. If I had all this stuff locally, I would do something like this:

git diff --name-only SHA1 SHA2

I have to emulate that with the GitHub API. It seems the "compare two commits" feature only works for two branches or two tags, but not two arbitrary SHAs. Please enlighten me and answer this question on StackOverflow if you know how to do this.

My current workaround is to get info on the diff associated with a pull request from its associated patch file. We've already stored these URLs in the patch_url variable. Source a helper that parses a patch.

source("get-pr-affected-files-from-patch.R")

Add a list-column to the data frame of pull requests. It holds one data frame per PR, with info on the file changes.

pr_df <- pr_df %>%
mutate(pr_files = patch_url %>% map(get_pr_affected_files_from_patch))

Sanity check the new pr_files list-column by inspecting one element.

pr_df$pr_files[[69]]
#> Source: local data frame [2 x 2]
#>
#> file diffstuff
#> (chr) (chr)
#> 1 man.rmd 10 +++++-----
#> 2 package.rmd 4 ++--
pr_df$pr_files %>% map(dim) %>% do.call(rbind, .) %>% apply(2, table)
#> [[1]]
#>
#> 0 1 2 6
#> 1 285 8 1
#>
#> [[2]]
#>
#> 2
#> 295

Simplify the list-column elements from data frame to character vector. Then use tidyr::unnest() to expand, so each affected file gets its own row.

nrow(pr_df)
#> [1] 295
pr_df <- pr_df %>%
mutate(pr_files = pr_files %>% map("file")) %>%
unnest(pr_files)
nrow(pr_df)
#> [1] 307

Write the relevant variables to a CSV file.

pr_df %>%
select(number, id, title, state, user, pr_files) %>%
write_csv("r-pkgs-pr-affected-files.csv")

Here's a figure depicting how often each chapter has been the target of a pull request. I'm not adjusting for chapter length or anything, so take it with a huge grain of salt. But there's no obvious evidence that people read and edit the earlier chapters more. Apparently we like to make suggestions about Git!

Recap of files related to PRs on R Packages
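The helper sourced above, get-pr-affected-files-from-patch.R, isn't reproduced here, but the core idea of recovering affected files from a patch can be sketched: scan the unified-diff text for "diff --git" headers and strip the "a/" prefix. This is a hypothetical minimal version, not the actual helper; get_files_from_patch_text and patch_demo are made-up names:

```r
# Hypothetical sketch, not the helper actually sourced above:
# pull affected file names out of unified-diff text.
get_files_from_patch_text <- function(patch_lines) {
  headers <- grep("^diff --git ", patch_lines, value = TRUE)
  # "diff --git a/man.rmd b/man.rmd" -> "man.rmd"
  sub("^diff --git a/(.*) b/.*$", "\\1", headers)
}

patch_demo <- c(
  "diff --git a/man.rmd b/man.rmd",
  "--- a/man.rmd",
  "+++ b/man.rmd",
  "@@ -1,3 +1,3 @@",
  "diff --git a/package.rmd b/package.rmd"
)
get_files_from_patch_text(patch_demo)
#> [1] "man.rmd"     "package.rmd"
```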
I went through the same steps with all pull requests on Advanced R. Here's the same figure as above, but for Advanced R. There's a stronger case for earlier chapters being targeted with PRs more often.

Recap of files related to PRs on Advanced R:
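As an aside, here's why map_chr_hack() was needed earlier: map_chr() errors when the plucked field is NULL, which happens for merge_commit_sha, closed_at, and merged_at on some PRs. A small demonstration with fabricated records (prs_demo is made up):

```r
library(purrr)

# Same helper as defined earlier in this walkthrough.
map_chr_hack <- function(.x, .f, ...) {
  map(.x, .f, ...) %>%
    map_if(is.null, ~ NA_character_) %>%
    flatten_chr()
}

# Fabricated PR records: the second was never merged, so its sha is NULL.
prs_demo <- list(
  list(number = 1L, merge_commit_sha = "ab12cd"),
  list(number = 2L, merge_commit_sha = NULL)
)

map_chr_hack(prs_demo, "merge_commit_sha")
#> [1] "ab12cd" NA
```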
Issue threads

STAT 545 has a public Discussion repo, where we use the issues as a discussion board. I want to look at the posts there, as something related to student engagement that I can actually quantify. This starts out fairly similar to the previous example: I retrieve all issues that have been modified since September 1, 2015.

owner <- "STAT545-UBC"
repo <- "Discussion"
issue_list <-
gh("/repos/:owner/:repo/issues", owner = owner, repo = repo,
state = "all", since = "2015-09-01T00:00:00Z", .limit = Inf)
(n_iss <- length(issue_list))
#> [1] 212

This retrieves 212 issues. I use this list to create a conventional data frame with one row per issue.

issue_df <- issue_list %>%
{
data_frame(number = map_int(., "number"),
id = map_int(., "id"),
title = map_chr(., "title"),
state = map_chr(., "state"),
n_comments = map_int(., "comments"),
opener = map_chr(., c("user", "login")),
created_at = map_chr(., "created_at") %>% as.Date())
}
issue_df
#> Source: local data frame [212 x 7]
#>
#> number id title state
#> (int) (int) (chr) (chr)
#> 1 276 119601582 Creating PDFs via latex in command line/make open
#> 2 275 119272439 general makefile confusion closed
#> 3 274 119262468 do is dropping countries closed
#> 4 273 119259601 linear regression within each country closed
#> 5 272 119257920 adding another column? closed
#> 6 271 119252407 how to add the residual error? closed
#> 7 270 119236992 Pandoc error solution open
#> 8 269 119230359 how many scripts closed
#> 9 268 119133218 Can't download packages from github open
#> 10 267 119112488 using gapminder.tsv closed
#> .. ... ... ... ...
#> Variables not shown: n_comments (int), opener (chr), created_at (date).

It turns out some of these issues were created during the 2014 run but show up here because I closed them in early September. Get rid of them.

issue_df <- issue_df %>%
filter(created_at >= "2015-09-01T00:00:00Z")
(n_iss <- nrow(issue_df))
#> [1] 192

Down to 192 issues. My ultimate goal is a data frame with one row per issue comment, but it's harder than you'd expect to get there. Each issue should be represented by at least one row, and many will have several rows, as there are typically follow-up comments. I need to loop over the issues and retrieve the follow-up comments. I mean that literally: the Issue Comments endpoint does not return a comment for the opening of the issue. This makes for a little extra data manipulation ... and more practice with purrr and tidyr.

Make a data frame of issue "opens" with a set of variables chosen for maximum bliss in future binds and joins. The variable i numbers the posts within an issue; an open counts as post 0.

opens <- issue_df %>%
select(number, who = opener) %>%
mutate(i = 0L)
opens
#> Source: local data frame [192 x 3]
#>
#> number who i
#> (int) (chr) (int)
#> 1 276 samhinshaw 0
#> 2 275 molliejmcdowell 0
#> 3 274 molliejmcdowell 0
#> 4 273 molliejmcdowell 0
#> 5 272 bdacunha 0
#> 6 271 bdacunha 0
#> 7 270 zhamel 0
#> 8 269 molliejmcdowell 0
#> 9 268 wang114 0
#> 10 267 bdacunha 0
#> .. ... ... ...
nrow(opens)
#> [1] 192

Make a data frame of issue follow-up comments. At first, this has to hold an unfriendly list-column, res.

comments <- issue_df %>%
select(number) %>%
mutate(res = number %>% map(
~ gh(number = .x,
endpoint = "/repos/:owner/:repo/issues/:number/comments",
owner = owner, repo = repo, .limit = Inf)))
str(comments, max.level = 1)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 192 obs. of 2 variables:
#> $ number: int 276 275 274 273 272 271 270 269 268 267 ...
#> $ res :List of 192
#> .. [list output truncated]

What does the res list-column hold? Inspect a few elements.

comments %>%
filter(number %in% c(275, 273, 272)) %>%
select(res) %>%
walk(str, max.level = 2, give.attr = FALSE)
#> List of 3
#> $ :List of 2
#> ..$ :List of 8
#> ..$ :List of 8
#> $ : list()
#> $ :List of 6
#> ..$ :List of 8
#> ..$ :List of 8
#> ..$ :List of 8
#> ..$ :List of 8
#> ..$ :List of 8
#> ..$ :List of 8

All I really want to know is who made each comment, so I reduce res to just the commenters' logins.

comments <- comments %>%
mutate(who = res %>% map(. %>% map_chr(c("user", "login")))) %>%
select(-res)
comments %>%
filter(number %in% c(275, 273, 272))
#> Source: local data frame [3 x 2]
#>
#> number who
#> (int) (list)
#> 1 275 <chr[2]>
#> 2 273 <chr[0]>
#> 3 272 <chr[6]>

Use tidyr::unnest() to get one row per comment, then number the comments within each issue.

comments <- comments %>%
unnest(who) %>%
group_by(number) %>%
mutate(i = row_number(number)) %>%
ungroup()
comments
#> Source: local data frame [863 x 3]
#>
#> number who i
#> (int) (chr) (int)
#> 1 275 jennybc 1
#> 2 275 molliejmcdowell 2
#> 3 274 ksamuk 1
#> 4 274 molliejmcdowell 2
#> 5 272 jennybc 1
#> 6 272 jennybc 2
#> 7 272 bdacunha 3
#> 8 272 jennybc 4
#> 9 272 jennybc 5
#> 10 272 bdacunha 6
#> .. ... ... ...

No more list-columns! It's time for a sanity check. Do the empirical counts of follow-up comments match the number of comments initially reported by the API?

count_empirical <- comments %>%
count(number)
count_stated <- issue_df %>%
select(number, stated = n_comments)
checker <- left_join(count_empirical, count_stated)
#> Joining by: "number"
with(checker, n == stated) %>% all() # hopefully TRUE
#> [1] TRUE

I row-bind the issue "opens" and follow-up comments, feeling very smug that they have exactly the same variables, though it is no accident.

atoms <- bind_rows(opens, comments)

Join back to the original data frame of issues, since that still holds issue title, state, and creation date. It is intentional that the result keeps both opener and who, so each row shows who opened the issue alongside who wrote the post.

finally <- atoms %>%
left_join(issue_df) %>%
select(number, id, opener, who, i, everything()) %>%
arrange(desc(number), i)
#> Joining by: "number"

A quick look at this and ... we're ready for analysis. Our work here is done.

finally
#> Source: local data frame [1,055 x 9]
#>
#> number id opener who i
#> (int) (int) (chr) (chr) (int)
#> 1 276 119601582 samhinshaw samhinshaw 0
#> 2 275 119272439 molliejmcdowell molliejmcdowell 0
#> 3 275 119272439 molliejmcdowell jennybc 1
#> 4 275 119272439 molliejmcdowell molliejmcdowell 2
#> 5 274 119262468 molliejmcdowell molliejmcdowell 0
#> .. ... ... ... ... ...
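With a flat table like finally, quantifying engagement reduces to counting rows per person. A sketch on fabricated rows of the same shape (finally_demo is made up; the real table has over a thousand rows):

```r
suppressPackageStartupMessages(library(dplyr))

# Fabricated rows with the same shape as finally: one row per open or comment.
finally_demo <- tibble(
  number = c(276L, 275L, 275L, 275L, 272L),
  who    = c("samhinshaw", "molliejmcdowell", "jennybc", "molliejmcdowell", "jennybc"),
  i      = c(0L, 0L, 1L, 2L, 1L)
)

# Posts per person, most active first: jennybc and molliejmcdowell
# have 2 posts each, samhinshaw has 1.
finally_demo %>%
  count(who, sort = TRUE)
```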