Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging cells scales badly #503

Open
vpetzel opened this issue Sep 23, 2024 · 1 comment
Open

Merging cells scales badly #503

vpetzel opened this issue Sep 23, 2024 · 1 comment

Comments

@vpetzel
Copy link

vpetzel commented Sep 23, 2024

Using mergeCells can be very slow when doing many merges. This is caused by a check for overlapping merges. This is done by extracting all affected fields from all previous merges and by checking if any of these happen to fall in the fields affected by the previous merge. All of these things are done in a very expensive manner, as merges are stored as xml strings, and thus require expensive parsing back to integers. The check for intersection is done in a very expensive way too: openxlsx calculates a vector of all positions in the rectangles and subsequently compares these members for intersection.

But even then it is easy to see that this method will have a complexity of Θ(n) for adding a new merge to n existing merges (assuming limited size of the merges), and thus a complexity of Θ(n²) for adding n merges.

This code showcases this issue:

library(stringi)

M1 <- character()
M2 <- data.frame(
    r1 = integer(), r2 = integer(), c1 = integer(), c2 = integer()
)
M3 <- character()

add_merge1 <- function(r1, r2, c1, c2) {
    sqref <- openxlsx::getCellRefs(data.frame(x = c(r1, r2), 
        y = c(c1, c2)))
    exMerges <- regmatches(M1, regexpr("[A-Z0-9]+:[A-Z0-9]+", M1))
    if (!is.null(exMerges)) {
        comps <- lapply(exMerges, function(rectCoords) {
            unlist(strsplit(rectCoords, split = ":"))
        })
        exMergedCells <- openxlsx:::build_cell_merges(comps = comps)
        newMerge <- unlist(openxlsx:::build_cell_merges(comps = list(sqref)))
        mergeIntersections <- sapply(exMergedCells, function(x) {
            any(x %in% newMerge)
        })
        if (any(mergeIntersections)) {
            stop(sprintf("Merge intersects with existing merged cells: \n\t\t%s.\nRemove existing merge first.", 
                stri_join(exMerges[mergeIntersections], collapse = "\n\t\t")))
        }
    }
    M1 <<- c(
        M1, 
        sprintf("<mergeCell ref=\"%s\"/>",
                stri_join(sqref, collapse = ":", sep = " "))
    )
}

add_merge2 <- function(r1, r2, c1, c2) {
    if (r2 < r1) {
        .h <- r2
        r2 <- r1
        r1 <- .h
    }
    if (c2 < c1) {
        .h <- c2
        c2 <- c1
        c1 <- .h
    }

    intersections <- M2[
        ((M2$r1 <= r2) &
         (M2$r2 >= r1) &
         (M2$c1 <= c2) &
         (M2$c2 >= c1)),
    ]

    if (nrow(intersections) > 0) {
        stop(sprintf("Merge intersects with existing merged cells: \n\t\t%s.\nRemove existing merge first.", 
            paste(
                openxlsx::getCellRefs(intersections[c("r1", "c1")]),
                openxlsx::getCellRefs(intersections[c("r2", "c2")]),
                sep = ":"
            ),
            collapse = "\n\t\t"))
    }

    M2 <<- rbind(
        M2, 
        data.frame(r1 = r1, r2 = r2, c1 = c1, c2 = c2)
    )
}

add_merge3 <- function(r1, r2, c1, c2) {
    sqref <- openxlsx::getCellRefs(data.frame(x = c(r1, r2), 
        y = c(c1, c2)))
    M3 <<- c(
        M3, 
        sprintf("<mergeCell ref=\"%s\"/>",
                stri_join(sqref, collapse = ":", sep = " "))
    )
}

take_time <- function(expr) {
    a <- Sys.time()
    force(expr)
    b <- Sys.time()
    as.numeric(b) - as.numeric(a)
}

tapply <- function(x, fun, limit = 10, ...) {
    res <- rep(NA_real_, length(x))
    ttime <- 0
    for (i in seq_along(x)) {
        res[i] <- take_time(fun(x[i]))
        ttime <- ttime + res[i]
        if (ttime > limit) break()
    }
    res
}

M <- c(2, 10, 1000)
N <- 3000

pdat <- dplyr::bind_rows(lapply(
    M,
    function(M) {
        M1 <<- character()
        M2 <<- data.frame(
            r1 = integer(), r2 = integer(), c1 = integer(), c2 = integer()
        )
        M3 <<- character()

        timing1 <- tapply(
            seq_len(N),
            function(i) {
                add_merge1(i, i, 1, M)
            }
        )

        timing2 <- tapply(
            seq_len(N),
            function(i) {
                add_merge2(i, i, 1, M)
            }
        )

        timing3 <- tapply(
            seq_len(N),
            function(i) {
                add_merge3(i, i, 1, M)
            }
        )

        data.frame(
            n = seq_along(timing1), t1 = cumsum(timing1), t2 = cumsum(timing2), t3 = cumsum(timing3), M = M
        )
    }
))


ggplot2::ggplot(pdat, ggplot2::aes(n)) +
    ggplot2::geom_point(ggplot2::aes(y = t1, color = "current")) +
    ggplot2::geom_point(ggplot2::aes(y = t2, color = "alternative")) +
    ggplot2::geom_point(ggplot2::aes(y = t3, color = "no_check")) +
    ggplot2::xlab("# Merges") +
    ggplot2::ylab("Total Time in Seconds") +
    ggplot2::theme_light() +
    ggplot2::facet_wrap(~ M)

pt

Here we see the current behaviour, one that avoids all of the expensive string conversion overhead (formatting would only need to happen when writing the xlsx file) and instead stores merges in a data frame of corner coordinates. Intersection are determined using corner points only. The third version does not check for intersections.

We see that the current implementation is really expensive and gets worse as the area of merges increases, while the alternative structure is still quadratic, but much faster and does not depend on area. But not doing the check at all is of course very fast and nice (and could be even faster if done vectorized). Extending the behaviour we see that even with an area of 2 with 10000 merges it should take about 13 minutes, for 50000 6 hours (for an area of 1000 fields this would be over 5 days), while the alternative would take 2 Minutes, and the one without checks 40 seconds (meanwhile doing it vectorized would take 0.3 seconds).

So I do suggest to maybe include a flag check_overlap = TRUE or whatever to allow skipping that check. If set to FALSE would be the responsibilty of the user to make sure that the inputs are sane with a large gain of performance.

In the long term switching from the current flow to one that does not require as many string transformations would also very significantly speed up the regular use.

@JanMarvin
Copy link
Collaborator

Hi @vpetzel , if you want to, please add a PR with a flag that allows disabling the check. I’m not sure why anyone would want to add 10.000+ merges into a spreadsheet, but if we can step out of the way, for this rare case we can allow the user to disable the check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants