Discussion:
[R] Partial LookUP
gary chimuzinga
2018-11-20 16:06:17 UTC
I am working in R, using RStudio.
I have a data frame with 4 columns. Column A contains the passenger ID, B contains the passenger name, and C contains the husband's name.
I am attempting to create a new column which looks to see whether the husband's name in column C is listed in any of the records in column B. If so, it should return the passenger ID of the husband from column A.
To make things more complicated, in some cases (as in the first example) the husband's name given in column C might not include his second name, which would be included in column B.

Reproducible Example
library(stringr)
rm(list=ls())
passengerid <- c(0908,9883,7767,3302)

Name<- c("Backstrom, Mrs. Karl Alfred (Maria Mathilda Gustafsson)",
"Backstrom, Mr. Karl Alfred John",
"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Cumings, Mr. John Bradley")

HusbandName <- c("Backstrom, Mr. Karl Alfred","","Cumings, Mr. John
Bradley","")



df1<- data.frame(cbind(passengerid,Name,HusbandName))
df1$Name <- as.character(df1$Name)
df1$HusbandName <- as.character(df1$HusbandName)

I have tried using stringr, but I am facing problems because 1) I need the code to look at only one element of the vector HusbandName and search for it in the whole vector Name, and 2) I found it difficult to use regular expressions given that the pattern I am looking for is vectorised (as HusbandName).
This is what I have tried so far:

Attempt 1 - only finds exact matches & doesn't return the passengerID & doesn't add column to df
df1$Husbandid < - for (i in 1:NROW(df1$HusbandName)) {
print(HusbandName[i] %in% Name)}


Attempt 2 - finds partial matches, but does not ignore blanks & does not tell me passenger id & doesn't add column to df
df1$Husbandid <- for (i in 1:NROW(df1$HusbandName)) {
print(which(str_detect(df1$Name,df1$HusbandName[i])))}


Attempt 3 - almost works, but the printed results are different from those added to the data frame as a new column. How can I correct this? Ultimately I need the ones in the df to be correct. The error is that those without husbands are showing a Husbandid when this should be blank or NA. Can this be corrected, or is there a way to convert the output of the for loop into a vector we can add to the df?
for (i in 1:NROW(df1$HusbandName)) {
if (df1$HusbandName[i] =="") {
print("Man") & next()
}
FoundHusbandNames<- c(which(str_detect(df1$Name,df1$HusbandName[i])))
print(df1$passengerid[FoundHusbandNames]) -> df1$Husbandid[i] }


[[alternative HTML version deleted]]

______________________________________________
R-***@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
PIKAL Petr
2018-11-22 07:27:15 UTC
Hi

I did not see any answer, so I will try to give one.
It seems to me that your second attempt was quite close.

If passengerid were numeric, the following code would probably give you the required result.

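res <- rep(NA, nrow(df1))
for (i in seq_len(nrow(df1))) {
  # skip passengers with no husband listed; an empty pattern cannot be searched for
  if (df1$HusbandName[i] == "") next
  # coll() matches the name as literal text rather than as a regular expression
  sel <- which(str_detect(df1$Name, coll(df1$HusbandName[i])))
  # keep only the first match so that a single id is stored
  if (length(sel) > 0) res[i] <- df1$passengerid[sel[1]]
}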

res should contain the passengerid for each relevant line and NA if there is no match. You could then simply add it to your data frame as a new column.
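
Starting from a vector (or column) that is already filled with NA seems to matter here. As far as I can tell, assigning to a single element of a column that does not exist yet recycles that value over all rows, which may be why the third attempt showed a Husbandid for passengers without a husband. A small sketch on made-up data (the names df, partner and partner2 are only for illustration):

```r
df <- data.frame(id = c(11, 22, 33, 44))

# The column does not exist yet, so this single-element assignment creates it
# and recycles the value 22 over all four rows
df$partner[1] <- 22
df$partner    # 22 22 22 22

# Starting from a column of NA keeps the untouched rows missing
df$partner2 <- NA
df$partner2[1] <- 22
df$partner2   # 22 NA NA NA
```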

The problem is that although you provided "a kind of" example, the HTML format probably scrambled it somehow. It is better to use dput for sending test data and not to use HTML formatting.

This is the data frame I got from your mail.
dput(df1)
structure(list(passengerid = structure(c(3L, 4L, 2L, 1L), .Label = c("3302",
"7767", "908", "9883"), class = "factor"), Name = c("Backstrom, Mrs. Karl Alfred (Maria Mathilda Gustafsson)",
"Backstrom, Mr. Karl Alfred John", "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
"Cumings, Mr. John Bradley"), HusbandName = c("Backstrom, Mr. Karl Alfred",
"", "Cumings, Mr. John\nBradley", "")), row.names = c(NA, -4L
), class = "data.frame")
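
The dput output shows two side effects: passengerid is no longer numeric (cbind() turns every column into character before data.frame() sees it), and one husband name contains a line break. Below is a minimal sketch of how the data could be rebuilt and the lookup run in base R, assuming the line break is only an artifact of the mail. find_husband_id is a hypothetical helper name, not something from stringr, and grepl(..., fixed = TRUE) is swapped in for str_detect() with coll().

```r
# Build the data frame directly; cbind() would coerce every column to character
df1 <- data.frame(
  passengerid = c(908, 9883, 7767, 3302),  # a numeric 0908 is stored as 908
  Name = c("Backstrom, Mrs. Karl Alfred (Maria Mathilda Gustafsson)",
           "Backstrom, Mr. Karl Alfred John",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Cumings, Mr. John Bradley"),
  HusbandName = c("Backstrom, Mr. Karl Alfred", "",
                  "Cumings, Mr. John\nBradley", ""),
  stringsAsFactors = FALSE
)

# Replace any run of whitespace (including the stray line break) with one space
df1$HusbandName <- gsub("[[:space:]]+", " ", df1$HusbandName)

# Hypothetical helper: id of the first passenger whose Name contains the
# husband's name as literal text, or NA when the name is blank or not found
find_husband_id <- function(husband, all_names, all_ids) {
  if (husband == "") return(NA_real_)
  sel <- which(grepl(husband, all_names, fixed = TRUE))
  if (length(sel) > 0) all_ids[sel[1]] else NA_real_
}

df1$Husbandid <- vapply(df1$HusbandName, find_husband_id, numeric(1),
                        all_names = df1$Name, all_ids = df1$passengerid,
                        USE.NAMES = FALSE)
df1[, c("passengerid", "Husbandid")]
```

With this data the two wives get 9883 and 3302 and the two husbands get NA. If a husband's name could match several passengers, taking only the first match is an arbitrary choice that you may want to change.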

Cheers
Petr
