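Asked by: 小点点

Replace strings in a column based on two columns of another data frame


I have a column in a multi-column data frame, call it df1, that holds text. Here are some sample rows: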

"896 European ancestry cases, 2,455 European ancestry controls"
"591 European individuals, 687 European males"
"1,968 African American cases, 3,928 African American controls"

I also have two columns in another data frame, call it df2, that look like this:

"European ancestry cases", "European"
"European ancestry controls", "European"
"European individuals", "European"
"European males", "European"
"African American cases", "African"
"African American controls", "African"

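The quotes are not actually in the data; I only use them to show where each column begins and ends.

I want to replace every occurrence in df1 of a value from df2's first column with the corresponding value from df2's second column. In other words, I only want to change df1, using the information in df2, which should give the following result: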

"896 European, 2,455 European"
"591 European, 687 European"
"1,968 African, 3,928 African"

Any ideas on how to do this in R?

(PS. The real df1 and df2 each have thousands of rows and much more variation; these sample rows simplify the problem.)


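2 answers

Anonymous user

You could try either of these regular expressions, which keep the first word of each label (European or African) and drop the rest up to the next comma: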

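# 1: capture the first word of the label and discard what follows it, up to the next comma
gsub("(European|African)([^,]+)", "\\1", df1$txt)

# 2: same result with a lookbehind, deleting only the text after the first word
gsub("(?<=European|African)[^,]+", "", df1$txt, perl = TRUE)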

# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"  
# [3] "1,968 African, 3,928 African"
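
Both patterns above hard-code European and African. For real data, where df2 has thousands of rows, a base R loop over df2 may generalise better. The following is only a minimal sketch under assumptions: it uses the column names txt, V1 and V2 from this answer's data block, and the names ord and result are introduced here purely for illustration.

```r
# Sample data from the question (column names are assumptions, see above)
df1 <- data.frame(txt = c(
  "896 European ancestry cases, 2,455 European ancestry controls",
  "591 European individuals, 687 European males",
  "1,968 African American cases, 3,928 African American controls"
), stringsAsFactors = FALSE)

df2 <- data.frame(
  V1 = c("European ancestry cases", "European ancestry controls",
         "European individuals", "European males",
         "African American cases", "African American controls"),
  V2 = c("European", "European", "European", "European", "African", "African"),
  stringsAsFactors = FALSE
)

# Process longer lookup strings first, so that a short string cannot
# replace part of a longer one before the longer one is tried.
ord <- order(nchar(df2$V1), decreasing = TRUE)

result <- df1$txt
for (i in ord) {
  # fixed = TRUE matches the lookup text literally instead of as a regex
  result <- gsub(df2$V1[i], df2$V2[i], result, fixed = TRUE)
}

result
# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"
# [3] "1,968 African, 3,928 African"
```

Looping once per row of df2 is simple but may be slow when both tables are large, so it is worth timing on a subset first.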

You can also use str_replace_all() from stringr to perform several replacements at once by passing it a named vector (c(pattern1 = replacement1)).

library(tidyverse)

df1 %>%
  mutate(txt = str_replace_all(txt, deframe(df2)))

#                            txt
# 1 896 European, 2,455 European
# 2   591 European, 687 European
# 3 1,968 African, 3,928 African
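
To make the named vector explicit, here is a small sketch that builds it with base R's setNames() and loads only stringr. The name lookup is introduced here for illustration; the column names txt, V1 and V2 are those of this answer's sample data.

```r
library(stringr)

df1 <- data.frame(txt = c(
  "896 European ancestry cases, 2,455 European ancestry controls",
  "591 European individuals, 687 European males",
  "1,968 African American cases, 3,928 African American controls"
), stringsAsFactors = FALSE)

df2 <- data.frame(
  V1 = c("European ancestry cases", "European ancestry controls",
         "European individuals", "European males",
         "African American cases", "African American controls"),
  V2 = c("European", "European", "European", "European", "African", "African"),
  stringsAsFactors = FALSE
)

# Named character vector: names are the patterns, values are the replacements.
# For a two-column data frame this is the same kind of vector deframe(df2) gives.
lookup <- setNames(df2$V2, df2$V1)

# The replacements are applied one after another, in the order of the vector.
df1$txt <- str_replace_all(df1$txt, lookup)

df1$txt
# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"
# [3] "1,968 African, 3,928 African"
```

The names are interpreted as regular expressions, so lookup text that contains characters such as parentheses or "+" would need escaping before it is used this way.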

Data:

df1 <- data.frame(txt = c("896 European ancestry cases, 2,455 European ancestry controls",
                          "591 European individuals, 687 European males",
                          "1,968 African American cases, 3,928 African American controls"))

df2 <- structure(list(
V1 = c("European ancestry cases", "European ancestry controls", "European individuals", "European males", "African American cases", "African American controls"),
V2 = c("European", "European", "European", "European", "African", "African")),
class = "data.frame", row.names = c(NA, -6L))

Anonymous user

Here is another approach:

#library(dplyr)
#library(stringr)
#library(tidyr)
#library(readr)
library(tidyverse)

df1 %>% 
  mutate(id = row_number()) %>% 
  separate_rows(col1, sep = "\\, ") %>% 
  mutate(col1 = str_c(parse_number(col1), 
                      str_extract(col1, str_c(unique(df2$col2), collapse = "|")), sep = " ")) %>% 
  group_by(id) %>% 
  mutate(col1 = toString(col1)) %>% 
  slice(1) %>% 
  ungroup() %>% 
  select(-id)

  col1                       
  <chr>                      
1 896 European, 2455 European
2 591 European, 687 European 
3 1968 African, 3928 African 
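
In the output above the counts lose their thousands separators (2,455 becomes 2455) because parse_number() converts them to numbers, whereas the question asks for "2,455 European". One possible variant, offered as a sketch rather than a tested solution, extracts the count as text instead. It assumes every comma-separated segment starts with a count made of digits and commas; the names labels and out are introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

df1 <- data.frame(col1 = c(
  "896 European ancestry cases, 2,455 European ancestry controls",
  "591 European individuals, 687 European males",
  "1,968 African American cases, 3,928 African American controls"
), stringsAsFactors = FALSE)

df2 <- data.frame(
  col1 = c("European ancestry cases", "European ancestry controls",
           "European individuals", "European males",
           "African American cases", "African American controls"),
  col2 = c("European", "European", "European", "European", "African", "African"),
  stringsAsFactors = FALSE
)

# One alternation of the target labels: "European|African"
labels <- str_c(unique(df2$col2), collapse = "|")

out <- df1 %>%
  mutate(id = row_number()) %>%
  # Split on comma + space; the comma inside "2,455" is followed by a digit
  separate_rows(col1, sep = ", ") %>%
  mutate(col1 = str_c(str_extract(col1, "^[0-9,]+"),  # count kept as written
                      str_extract(col1, labels),      # first matching label
                      sep = " ")) %>%
  group_by(id) %>%
  summarise(col1 = toString(col1), .groups = "drop") %>%
  select(-id)

out$col1
# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"
# [3] "1,968 African, 3,928 African"
```

Like the original pipeline, this keeps only the first label found in each segment, so a segment that mentions more than one label would need extra handling.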

Data:

> dput(df1)
structure(list(col1 = c("896 European ancestry cases, 2,455 European ancestry controls", 
"591 European individuals, 687 European males", "1,968 African American cases, 3,928 African American controls"
)), class = "data.frame", row.names = c(NA, -3L))
> dput(df2)
structure(list(col1 = c("European ancestry cases", "European ancestry controls", 
"European individuals", "European males", "African American cases", 
"African American controls"), col2 = c("European", "European", 
"European", "European", "African", "African")), class = "data.frame", row.names = c(NA, 
-6L))