I have a text column in a multi-column data frame, which I'll call df1. Here are some sample rows:
"896 European ancestry cases, 2,455 European ancestry controls"
"591 European individuals, 687 European males"
"1,968 African American cases, 3,928 African American controls"
I also have a second data frame with two columns, which I'll call df2. It looks like this:
"European ancestry cases", "European"
"European ancestry controls", "European"
"European individuals", "European"
"European males", "European"
"African American cases", "African"
"African American controls", "African"
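The quotation marks are not actually in the data; I only added them to make the number of columns clear.
I want to replace every occurrence of a value from df2's first column found in df1 with the matching value from df2's second column. In other words, only df1 should change, using df2 as the lookup, which would give the following result: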
"896 European, 2,455 European"
"591 European, 687 European"
"1,968 African, 3,928 African"
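Any ideas on how to do this in R?
(PS. The real df1 and df2 each have thousands of rows with many more variations; these sample rows are a simplified version of the problem.)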
You can try
# 1
gsub("(European|African)([^,]+)", "\\1", df1$txt)
# 2
gsub("(?<=European|African)[^,]+", "", df1$txt, perl = TRUE)
# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"
# [3] "1,968 African, 3,928 African"
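Both patterns above hard-code the labels European and African. Since the real df2 has thousands of rows, a lookup-driven variant may be easier to maintain. The following is only a sketch in base R, assuming the column names txt, V1 and V2 from the sample data in this answer; the helper names ord and out are made up for illustration. It replaces each phrase literally (fixed = TRUE), longest phrase first, so a short phrase cannot overwrite part of a longer one.

```r
# Sample data, same shape as in the question (column names are assumptions).
df1 <- data.frame(txt = c(
  "896 European ancestry cases, 2,455 European ancestry controls",
  "591 European individuals, 687 European males",
  "1,968 African American cases, 3,928 African American controls"
))
df2 <- data.frame(
  V1 = c("European ancestry cases", "European ancestry controls",
         "European individuals", "European males",
         "African American cases", "African American controls"),
  V2 = c("European", "European", "European", "European", "African", "African")
)

# Process the longest phrases first.
ord <- order(nchar(df2$V1), decreasing = TRUE)

out <- df1$txt
for (i in ord) {
  # fixed = TRUE treats the phrase as plain text, not as a regular expression.
  out <- gsub(df2$V1[i], df2$V2[i], out, fixed = TRUE)
}
out
# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"
# [3] "1,968 African, 3,928 African"
```

A plain loop like this makes one pass over df1 per row of df2, so it may be slow for very large lookups.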
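You can also use str_replace_all() from stringr to perform multiple replacements at once by passing it a named vector (c(pattern1 = replacement1)). Here deframe(df2) turns the two columns of df2 into such a vector, with the first column as names and the second as values.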
library(tidyverse)
df1 %>%
  mutate(txt = str_replace_all(txt, deframe(df2)))
# txt
# 1 896 European, 2,455 European
# 2 591 European, 687 European
# 3 1,968 African, 3,928 African
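One thing to keep in mind with the named-vector form: the names are treated as regular expressions and the replacements are applied one after another, so with messier real data the order of the rows in df2 can matter. A hedged sketch of one way to make the order explicit, assuming the columns txt, V1 and V2 from the sample data (ord, lookup and res are illustrative names):

```r
library(stringr)

df1 <- data.frame(txt = c(
  "896 European ancestry cases, 2,455 European ancestry controls",
  "591 European individuals, 687 European males",
  "1,968 African American cases, 3,928 African American controls"
))
df2 <- data.frame(
  V1 = c("European ancestry cases", "European ancestry controls",
         "European individuals", "European males",
         "African American cases", "African American controls"),
  V2 = c("European", "European", "European", "European", "African", "African")
)

# Put the longest patterns first so they are tried before shorter ones.
ord <- order(nchar(df2$V1), decreasing = TRUE)

# Named vector: names are the patterns, values are the replacements.
# Note: the names are still interpreted as regular expressions.
lookup <- setNames(df2$V2[ord], df2$V1[ord])

res <- str_replace_all(df1$txt, lookup)
res
# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"
# [3] "1,968 African, 3,928 African"
```

If the real phrases contain characters such as parentheses or dots, they would need escaping before being used as patterns.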
df1 <- data.frame(txt = c("896 European ancestry cases, 2,455 European ancestry controls",
"591 European individuals, 687 European males",
"1,968 African American cases, 3,928 African American controls"))
df2 <- structure(list(
V1 = c("European ancestry cases", "European ancestry controls", "European individuals", "European males", "African American cases", "African American controls"),
V2 = c("European", "European", "European", "European", "African", "African")),
class = "data.frame", row.names = c(NA, -6L))
Here is another option:
#library(dplyr)
#library(stringr)
#library(tidyr)
#library(readr)
library(tidyverse)
df1 %>%
  mutate(id = row_number()) %>%
  separate_rows(col1, sep = "\\, ") %>%
  mutate(col1 = str_c(parse_number(col1),
                      str_extract(col1, str_c(unique(df2$col2), collapse = "|")),
                      sep = " ")) %>%
  group_by(id) %>%
  mutate(col1 = toString(col1)) %>%
  slice(1) %>%
  ungroup() %>%
  select(-id)
#   col1
#   <chr>
# 1 896 European, 2455 European
# 2 591 European, 687 European
# 3 1968 African, 3928 African
Data:
> dput(df1)
structure(list(col1 = c("896 European ancestry cases, 2,455 European ancestry controls",
"591 European individuals, 687 European males", "1,968 African American cases, 3,928 African American controls"
)), class = "data.frame", row.names = c(NA, -3L))
> dput(df2)
structure(list(col1 = c("European ancestry cases", "European ancestry controls",
"European individuals", "European males", "African American cases",
"African American controls"), col2 = c("European", "European",
"European", "European", "African", "African")), class = "data.frame", row.names = c(NA,
-6L))
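Note that parse_number() in the pipeline above drops the thousands separators (2455 rather than 2,455 as in the desired output). If the commas need to survive, one possible variation is to pull out the leading number as text instead. This is only a sketch under the same col1/col2 naming as the data above; labels and res are illustrative names, and summarise() replaces the mutate() + slice(1) step.

```r
library(dplyr)
library(tidyr)
library(stringr)

df1 <- data.frame(col1 = c(
  "896 European ancestry cases, 2,455 European ancestry controls",
  "591 European individuals, 687 European males",
  "1,968 African American cases, 3,928 African American controls"
))
df2 <- data.frame(
  col1 = c("European ancestry cases", "European ancestry controls",
           "European individuals", "European males",
           "African American cases", "African American controls"),
  col2 = c("European", "European", "European", "European", "African", "African")
)

# Alternation of the target labels, e.g. "European|African".
labels <- str_c(unique(df2$col2), collapse = "|")

res <- df1 %>%
  mutate(id = row_number()) %>%
  # Split on comma + space only, so "2,455" stays in one piece.
  separate_rows(col1, sep = ", ") %>%
  # Keep the leading number as text (commas included) plus the label.
  mutate(col1 = str_c(str_extract(col1, "^[0-9,]+"),
                      str_extract(col1, labels),
                      sep = " ")) %>%
  group_by(id) %>%
  summarise(col1 = toString(col1), .groups = "drop") %>%
  select(-id)

res$col1
# [1] "896 European, 2,455 European"
# [2] "591 European, 687 European"
# [3] "1,968 African, 3,928 African"
```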