R/dplyr: Extracting Data Frame Column Value for Filtering With %in%
I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.
I had a data frame:
library(dplyr) df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))
And wanted to extract the userIds of those people who have a score greater than 3. I started with:
highScoringPeople = df %>% filter(score > 3) %>% select(userId) > highScoringPeople userId 1 3 2 4 3 5
And then filtered the data frame expecting to get back those 3 people:
> df %>% filter(userId %in% highScoringPeople) [1] userId score <0 rows> (or 0-length row.names)
No rows! I created vector with the numbers 3-5 to make sure that worked:
> df %>% filter(userId %in% c(3,4,5)) userId score 1 3 4 2 4 5 3 5 5
That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an ‘in lookup’. Let’s explore:
> str(c(3,4,5)) num [1:3] 3 4 5 > str(highScoringPeople) 'data.frame': 3 obs. of 1 variable: $ userId: num 3 4 5
Now it’s even more obvious why it doesn’t work – highScoringPeople is still a data frame when we need it to be a vector/list.
One way to fix this is to extract the userIds using the $ syntax instead of the select function:
highScoringPeople = (df %>% filter(score > 3))$userId > str(highScoringPeople) num [1:3] 3 4 5 > df %>% filter(userId %in% highScoringPeople) userId score 1 3 4 2 4 5 3 5 5
Or if we want to do the column selection using dplyr we can extract the values for the column like this:
highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]] > str(highScoringPeople) num [1:3] 3 4 5
Not so difficult after all.