//
you're reading...
R

Identifying Records in Data Frame A That Are Not Contained In Data Frame B – A Comparison

Yesterday I launched my first question at Stackoverflow and apparently did a lot of things wrong as I managed to get my question closed wihtin hours ;-)

http://stackoverflow.com/questions/7728462/identify-records-in-data-frame-a-not-contained-in-data-frame-b

I had collected 9 different solutions to the problem and made the mistake to put it all within the original question space. So people complained and told me that such a collection belongs into a blog and not on Stackoverflow. Hence, I decided to go ahead and finally make my first blogging steps! I’ll probably still miss out on a lot of cool technical stuff on the blog, so please bear with me.

So here’s the thing:

Problem

How to identify records/rows in data frame x.1 that are not contained in data frame x.2 based on all attributes available (i.e. all columns) in the most efficient way?

Example Data

> x.1 <- data.frame(a=c(1,2,3,4,5), b=c(1,2,3,4,5))
> x.1
a b
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5

> x.2 <- data.frame(a=c(1,1,2,3,4), b=c(1,1,99,3,4))
> x.2
a  b
1 1  1
2 1  1
3 2 99
4 3  3
5 4  4

# BENCHMARK SETTINGS
require(microbenchmark)
to.sec <- 1000000000

Desired Result


  a b
2 2 2
5 5 5

Best Solution So Far

by Prof. Brian Ripley and Gabor Grothendieck

> fun.12 <- function(x.1,x.2,...){
+     x.1p <- do.call("paste", x.1)
+     x.2p <- do.call("paste", x.2)
+     x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
  a b
2 2 2
5 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/to.sec
> sol.12
> [1] 0.0002059665

Efficiency Comparison

> sol.scope <- 1:13
> comp <- data.frame(lapply(sol.scope, function(x){
+     eval(substitute(get(SOL), list(SOL=paste("sol.", x, sep=""))))
+ }))
> names(comp) <- paste("solution", sol.scope)
> comp <- as.data.frame(t(comp[order(comp[1,])]))
> colnames(comp) <- "time"
> comp
time   time.rel
solution 12 0.0002080150   1.000000
solution 3  0.0002548310   1.225061
solution 1  0.0004656785   2.238677
solution 10 0.0006398950   3.076196
solution 6  0.0007878430   3.787434
solution 8  0.0010459795   5.028385
solution 11 0.0021617355  10.392210
solution 9  0.0025755710  12.381660
solution 7  0.0103444610  49.729399
solution 13 0.0211265200 101.562483
solution 5  0.0225685395 108.494770
solution 2            NA         NA
solution 4            NA         NA

Here are all the solutions I collected so far:

Solution 1

by Chase

> fun.1 <- function(x.1,x.2,...){
+     expr <- paste("subset(x.1,", paste(sapply(names(x.1), function(x){
+         paste("!(", x, " %in% x.2$", x, ")", sep="")
+     }), collapse=" | "), ")")
+     eval(parse(text=expr))
+ }
> fun.1(x.1,x.2)
a b
2 2 2
5 5 5
> sol.1 <- microbenchmark(fun.1(x.1,x.2))
> sol.1 <- median(sol.1$time)/to.sec
> sol.1
[1] 0.0004656785

Solution 2 (rather just an approach)

by Ramnath

> setdiff(x.1$a, x.2$a)   # elements in x.1$a NOT in x.2$a
[1] 5
> setdiff(x.2$a, x.1$a)   # elements in x.2$a NOT in x.1$a
numeric(0)
> sol.2 <- microbenchmark(setdiff(x.1$a, x.2$a))
> sol.2 <- median(sol.2$time)/to.sec
> sol.2
[1] 4.03865e-05
# This helps, but as it does not directly provide the solution, we will set this to 'NA'
sol.2 <- NA

Solution 3

by Prof. Brian Ripley

> fun.3 <- function(x.1, x.2, ...){
+     x.1.id <- do.call("paste", c(x.1, sep = "\r"))
+     x.2.id <- do.call("paste", c(x.2, sep = "\r"))
+     x.1[match(setdiff(x.1.id, x.2.id),x.1.id), ]
+ }
> fun.3(x.1,x.2)
a b
1 2 2
2 5 5
> sol.3 <- microbenchmark(fun.3(x.1,x.2))
> sol.3 <- median(sol.3$time)/to.sec
> sol.3
[1] 0.000254831

Solution 4 (rather just an approach)

by me

> fun.4 <- function(x.1, x.2, ...){
+     # Combine
+     df <- rbind(x.1,x.2)
+     df <- df[order(df[,1]),]
+     # Find duplicates
+     idx.1 <- duplicated(df, all=TRUE)
+     idx.2 <- duplicated(df, fromLast=TRUE)
+     idx <- cbind(idx.1, idx.2)
+     idx <- apply(idx, MARGIN=1, any)
+     # Index records
+     df[-which(idx),]
+ }
> fun.4(x.1,x.2)
a b
1 2 2
8 2 99
2 5 5
> sol.4 <- microbenchmark(fun.4(x.1,x.2))
> sol.4 <- median(sol.4$time)/to.sec
> sol.4
[1] 0.001062621
# As it does not match the desired records, we set this to 'NA'
sol.4 <- NA

Solution 5

base on solution by Gabor Grothendieck

> library(sqldf)
> fun.5 <- function(x1,x2,...){
+     out <- sqldf(
+         "SELECT * FROM x1
+             WHERE
+                 x1.a NOT IN (SELECT x2.a FROM x2) OR
+                 x1.b NOT IN (SELECT x2.b FROM x2)"
+     )
+     out
+ }
> fun.5(x1=x.1,x2=x.2)
a b
1 2 2
2 5 5
> sol.5 <- microbenchmark(fun.5(x1=x.1,x2=x.2))
> sol.5 <- median(sol.5$time)/to.sec
> sol.5
[1] 0.02256854

Solution 6

by Tal Galili

> fun.6 <- function(x.1,x.2){
+     x.1.vec <- apply(x.1, 1, paste, collapse = "")
+     x.2.vec <- apply(x.2, 1, paste, collapse = "")
+     x.1.without.x.2.rows <- x.1[!x.1.vec %in% x.2.vec,]
+     return(x.1.without.x.2.rows)
+ }
> fun.6(x.1,x.2)
a b
1 2 2
2 5 5
> sol.6 <- microbenchmark(fun.6(x.1,x.2))
> sol.6 <- median(sol.6$time)/to.sec
> sol.6
[1] 0.000787843

Solution 7

by nullglob

# COULD NOT REPRODUCE RESULTS WITH MY DATA
> fun.7 <- function(x.1,x.2,...){
+     a1 <- data.frame(a = 1:5, b = letters[1:5])
+     a2 <- data.frame(a = 1:3, b = letters[1:3])
+     comparison <- compare(a1,a2,allowAll=TRUE)
+     comparison$tM
+     difference <-
+         data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
+     colnames(difference) <- colnames(a1)
+     difference
+ }
> fun.7(x.1,x.2)
  a b
1 4 d
2 5 e
> ### DISCUSSION ###
> # 1) Effectiveness
> # Could not reproduce results with my data frames.
> # 2) Efficiency
> sol.7 <- microbenchmark(fun.7(x.1,x.2))
> sol.7 <- median(sol.7$time)/to.sec
> sol.7
[1] 0.01034446

Solution 8

by Henrico

#  Derived from src/library/base/R/merge.R
#  Part of the R package, http://www.R-project.org
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/
> XinY <-
+     function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
+              notin = FALSE, incomparables = NULL,
+              ...)
+ {
+     fix.by <- function(by, df)
+     {
+         ## fix up 'by' to be a valid set of cols by number: 0 is row.names
+         if(is.null(by)) by <- numeric(0L)
+         by <- as.vector(by)
+         nc <- ncol(df)
+         if(is.character(by))
+             by <- match(by, c("row.names", names(df))) - 1L
+         else if(is.numeric(by)) {
+             if(any(by < 0L) || any(by > nc))
+                 stop("'by' must match numbers of columns")
+         } else if(is.logical(by)) {
+             if(length(by) != nc) stop("'by' must match number of columns")
+             by <- seq_along(by)[by]
+         } else stop("'by' must specify column(s) as numbers, names or logical")
+         if(any(is.na(by))) stop("'by' must specify valid column(s)")
+         unique(by)
+     }
+     nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
+     by.x <- fix.by(by.x, x)
+     by.y <- fix.by(by.y, y)
+     if((l.b <- length(by.x)) != length(by.y))
+         stop("'by.x' and 'by.y' specify different numbers of columns")
+     if(l.b == 0L) {
+         ## was: stop("no columns to match on")
+         ## returns x
+         x
+     }
+     else {
+         if(any(by.x == 0L)) {
+             x <- cbind(Row.names = I(row.names(x)), x)
+             by.x <- by.x + 1L
+         }
+         if(any(by.y == 0L)) {
+             y <- cbind(Row.names = I(row.names(y)), y)
+             by.y <- by.y + 1L
+         }
+         ## create keys from 'by' columns:
+         if(l.b == 1L) {                  # (be faster)
+             bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
+             by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
+         } else {
+             ## Do these together for consistency in as.character.
+             ## Use same set of names.
+             bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
+             names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
+             bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
+             bx <- bz[seq_len(nx)]
+             by <- bz[nx + seq_len(ny)]
+         }
+         comm <- match(bx, by, 0L)
+         if (notin) {
+             res <- x[comm == 0,]
+         } else {
+             res <- x[comm > 0,]
+         }
+     }
+     ## avoid a copy
+     ## row.names(res) <- NULL
+     attr(res, "row.names") <- .set_row_names(nrow(res))
+     res
+ }
> XnotinY <-
+     function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
+              notin = TRUE, incomparables = NULL,
+              ...)
+ {
+     XinY(x,y,by,by.x,by.y,notin,incomparables)
+ }
> fun.8 <- XnotinY
> fun.8(x.1,x.2)
a b
1 2 2
2 5 5
> sol.8 <- microbenchmark(fun.8(x.1,x.2))
> sol.8 <- median(sol.8$time)/to.sec
> sol.8
[1] 0.00104598

Solution 9

by Tomas T.

> fun.9 <- function(x.1,x.2,...){
+     tmp = merge(x.1, cbind(x.2, q=1:nrow(x.2)), all.x = TRUE)
+         # provided that there's no column q in both dataframes
+     tmp[is.na(tmp$q), 1:ncol(x.1)] # the result
+ }
> fun.9(x.1,x.2)
a b
1 2 2
2 5 5
> sol.9 <- microbenchmark(fun.9(x.1,x.2))
> sol.9 <- median(sol.9$time)/to.sec
> sol.9
[1] 0.002575571

Solution 10

by Gabor Grothendieck

> fun.10 <- function(x.1,x.2,...){
+     x.1[!duplicated(rbind(x.2, x.1))[-(1:nrow(x.2))],]
+ }
> fun.10(x.1,x.2)
a b
1 2 2
2 5 5
> sol.10 <- microbenchmark(fun.10(x.1,x.2))
> sol.10 <- median(sol.10$time)/to.sec
> sol.10
[1] 0.000639895

Solution 11

by Gabor Grothendieck

> fun.11 <- function(x.1,x.2,...){
+     do.call("rbind", setdiff(split(x.1, rownames(x.1)), split(x.2, rownames(x.2))))
+ }
> fun.11(x.1,x.2)
a b
1 2 2
2 5 5
> sol.11 <- microbenchmark(fun.11(x.1,x.2))
> sol.11 <- median(sol.11$time)/to.sec
> sol.11
[1] 0.002161735

Solution 12

by Gabor Grothendieck

> fun.12 <- function(x.1,x.2,...){
+     x.1p <- do.call("paste", x.1)
+     x.2p <- do.call("paste", x.2)
+     x.1[! x.1p %in% x.2p, ]
+ }
> fun.12(x.1,x.2)
a b
1 2 2
2 5 5
> sol.12 <- microbenchmark(fun.12(x.1,x.2))
> sol.12 <- median(sol.12$time)/to.sec
> sol.12
[1] 0.000208015

Solution 13

by Gabor Grothendieck

> library(sqldf)
> fun.13 <- function(x.1,x.2,...){
+     sqldf("select * from `x.1` except select * from `x.2`")
+ }
> fun.13(x.1,x.2)
a b
1 2 2
2 5 5
> sol.13 <- microbenchmark(fun.13(x.1,x.2))
> sol.13 <- median(sol.13$time)/to.sec
> sol.13
[1] 0.02112652
About these ads

Discussion

4 thoughts on “Identifying Records in Data Frame A That Are Not Contained In Data Frame B – A Comparison

  1. Sorry, first blog and of course things go wrong (on the post featured on http://www.r-bloggers.com/) ;-) The layout of the code snipets got messed up somehow, but this should be fixed by now (at least the major part of it; some formatting always seems to be messed up anew). Well, still learning ;-)

    Posted by songpants | October 12, 2011, 11:25 am
  2. The “best” way to do this is not in R, but in RDBMS (where most data is stored, anyway). Each data frame is a table (or more, if you’ve normalized; becoming a view). Then find the difference with the EXCEPT clause (Oracle syntax is MINUS). While SQL databases aren’t fully relational, they do support most of the set operations Dr. Codd defined, and these set operations are what you’re looking for.

    Posted by Robert Young | October 12, 2011, 3:04 pm
  3. Oops. Just saw #13 ;-)

    Posted by Robert Young | October 12, 2011, 3:06 pm

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: