Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
220 views
in Technique[技术] by (71.8m points)

r - How to group rows based on conditions set by another dataset?

I've tried to code an answer to this based on similar questions, however I'm struggling to get any answers working exactly for my use case.

I have 2 genetic datasets (looking at positions of mutations in the genome).

These look like:

#df1:
Chromosome  Min     Max   Group
1           500     1000    1
1           400     1900    2
2           300     400     3
3           2000    2100    4
#df2
Gene    Chromosome  Position
Gene1          1    600
Gene1          1    650 
Gene2          1    1700
Gene3          2    350
Gene4          2    355
Gene5          2    450
Gene6          3    2050

I am looking to find which rows in df2 belong to which Group in df1 - so I am coding to ask if a row in df2 has a matching Chromosome number AND the Position column value is inside the range between the Max and Min columns of df1, then if it is assign that row the same Group number as is in df1. I want this to also duplicate rows in df2 if they appear in multiple Groups.

Expected output of the example is:

Gene    Chromosome  Position   Group
Gene1          1    600          1
Gene1          1    650          1 
Gene1          1    600          2
Gene1          1    650          2 
Gene2          1    1700         2
Gene3          2    350          3
Gene4          2    355          3 
Gene5          2    450          NA
Gene6          3    2050         4

#Gene1 enters both groups 1 and 2 as their Chromosome and Position fits in both those groups of df1

They key thing I'm trying to get that I am missing is that some groups in df1 have overlap in their min-max range, but I still want each individual group to be kept and just Genes/rows of df2 to be duplicated as their position might match mulitple groups with overlap.

Currently I am trying to code this with:

df1$ID <- seq.int(nrow(df1))
df2$ID<- seq.int(nrow(df2))

df2[df1, Group := i.ID, on = .(Chromosome, Position > Min, Position < Max ) ]

For my example this outputs:

Gene  Chromosome Position  ID Group
Gene1   1           600     1   2
Gene1   1           650     2   2
Gene2   1           1700    3   2
Gene3   2           350     4   3
Gene4   2           355     5   3
Gene5   2           450     6   NA
Gene6   3           2050    7   4

So in this case because Groups 1 and 2 have an overlap in their min-max range, Group 1 has been lost from the output result. Is there another way I can code to avoid this and match rows but maintain all Groups despite any range overlaps?

I have tried other ways to code this with a similar question (How to perform join over date ranges using data.table?) using foverlaps() but this also does not work as expected.

Input data:

df1 <-
structure(list(Chromosome = c(1L, 1L, 2L, 3L), Min = c(500L, 
400L, 300L, 2000L), Max = c(1000L, 1900L, 400L, 2100L), Group = 1:4, 
    ID = 1:4), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"))

df2 <-
structure(list(Gene = c("Gene1", "Gene1", "Gene2", "Gene3", "Gene4", 
"Gene5", "Gene6"), Chromosome = c(1L, 1L, 1L, 2L, 2L, 2L, 3L), 
    Position = c(600L, 650L, 1700L, 350L, 355L, 450L, 2050L)), row.names = c(NA, 
-7L), class = c("data.table", "data.frame"))

question from:https://stackoverflow.com/questions/65921081/how-to-group-rows-based-on-conditions-set-by-another-dataset

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Are you looking for something like this?

setDT(df1)
setDT(df2)

df2[, Group := df1[.SD, 
                   on = .(Chromosome = Chromosome, Max > Position, Min < Position), 
                   toString(Group), 
                   by = .EACHI]$V1]

#     Gene Chromosome Position Group
# 1: Gene1          1      600  1, 2
# 2: Gene1          1      650  1, 2
# 3: Gene2          1     1700     2
# 4: Gene3          2      350     3
# 5: Gene4          2      355     3
# 6: Gene5          2      450    NA
# 7: Gene6          3     2050     4

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...