Difference in performance between pyvo and astroquery #633

Closed · pavyamsiri opened this issue Dec 17, 2024 · 4 comments

@pavyamsiri
Background

I am not sure this issue is significant, so feel free to close it.

I used to use astroquery and its gaia module to perform Gaia queries programmatically, so that I could parallelise my queries across, say, bins in Galactic longitude.

Something like this

SELECT ... FROM gaia WHERE longitude BETWEEN {left_edge} AND {right_edge}

dispatched with either Python's multiprocessing or a ThreadPoolExecutor (a minimal sketch is shown below).
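A minimal sketch of that workflow, assuming astroquery's Gaia module; the exact query, bin count, and worker count are illustrative placeholders:

from concurrent.futures import ThreadPoolExecutor
from itertools import pairwise

import numpy as np
from astroquery.gaia import Gaia


def query_bin(edges: tuple[float, float]):
    """Run one Gaia ADQL query covering a single Galactic longitude bin."""
    left_edge, right_edge = edges
    query = (
        "SELECT source_id FROM gaiadr3.gaia_source_lite "
        f"WHERE l BETWEEN {left_edge} AND {right_edge}"
    )
    job = Gaia.launch_job_async(query)
    return job.get_results()


# Split 0-360 degrees of longitude into bins and query them in parallel.
edges = np.linspace(0, 360, 9)
with ThreadPoolExecutor(max_workers=4) as pool:
    tables = list(pool.map(query_bin, pairwise(edges)))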

I then changed to using pyvo, and it felt slower than before. This is mostly impression-based, as I didn't benchmark the two approaches directly; I had also changed how I dispatched the queries (a ThreadPoolExecutor with astroquery versus async Python with pyvo).

Benchmark results

I was curious to see whether there is a measurable difference, so I wrote a benchmark script (see the bottom of the post) and got these results.

[Figure: gaia_query_benchmark — mean read/write/total times versus number of sources queried, for astroquery and pyvo]

You can see that for smaller queries, pyvo tends to be slower than astroquery, mostly due to consistently longer "read" times ("read" being the time taken to run the query and download the results into memory). This flips at higher numbers of objects queried, but then the "write" times (writing the in-memory table to disk as a VOTable) become higher for pyvo, although the error bars suggest the performance is comparable.

The point of this issue is that although the performance isn't very different for large queries (as I expected), it does seem different enough for smaller queries, with pyvo being slower than astroquery.

Is this performance difference known? I would like to use pyvo over astroquery because it is more flexible and generalises across different TAP services (see the snippet below), but it seems to be slightly worse in performance.
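To illustrate that generality: the same pyvo client code works against any IVOA TAP endpoint, with only the base URL changing (the query here is a placeholder):

from pyvo.dal import TAPService

# Any TAP base URL can be swapped in here; the client code is unchanged.
service = TAPService("https://gea.esac.esa.int/tap-server/tap")
results = service.search("SELECT TOP 5 source_id FROM gaiadr3.gaia_source_lite")
print(results.to_table())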

I know the benchmark is not solid evidence, since other factors like network and disk complicate things, so I also wanted to know whether others get similar results.

Benchmarking script

import concurrent.futures
import tempfile
import multiprocessing as mp
import time
from collections.abc import Callable
from itertools import pairwise, repeat
from pyvo.dal import TAPService
import astropy.table
from typing import cast, TypeAlias
from astroquery.gaia import Gaia

import numpy as np
from matplotlib import pyplot as plt

# Placeholders: "?" = row limit (TOP), "@" = left bin edge, "#" = right bin edge.
SQL_TEMPLATE: str = "SELECT TOP ? G.source_id FROM gaiadr3.gaia_source_lite AS G WHERE G.l BETWEEN @ AND #;"

QueryFunction: TypeAlias = Callable[[str, tuple[float, float]], tuple[float, float]]


def query_via_astroquery(
    template: str, edges: tuple[float, float]
) -> tuple[float, float]:
    left_edge, right_edge = edges
    text = template.replace("@", str(left_edge)).replace("#", str(right_edge))
    start_read_time = time.perf_counter()
    job = Gaia.launch_job_async(text)
    results: astropy.table.Table = cast(astropy.table.Table, job.get_results())
    read_time = time.perf_counter() - start_read_time

    with tempfile.TemporaryFile(mode="w") as out_file:
        start_write_time = time.perf_counter()
        results.write(out_file, format="votable")
        write_time = time.perf_counter() - start_write_time

    return (read_time, write_time)


def query_via_pyvo(template: str, edges: tuple[float, float]) -> tuple[float, float]:
    left_edge, right_edge = edges
    text = template.replace("@", str(left_edge)).replace("#", str(right_edge))
    service = TAPService("https://gea.esac.esa.int/tap-server/tap")
    start_read_time = time.perf_counter()
    job = service.submit_job(text, maxrec=service.hardlimit)  # create the async TAP job
    job.run()  # start execution on the server
    job.wait()  # block until the job finishes
    results = job.fetch_result()  # download the result table into memory
    read_time = time.perf_counter() - start_read_time

    with tempfile.TemporaryFile(mode="w") as out_file:
        start_write_time = time.perf_counter()
        results.votable.to_xml(out_file)
        write_time = time.perf_counter() - start_write_time

    return (read_time, write_time)


TEST_FUNCTIONS: set[QueryFunction] = {query_via_astroquery, query_via_pyvo}
LABEL_COLORS: dict[QueryFunction, tuple[str, str, str]] = {
    query_via_astroquery: ("#0072B2", "#E69F00", "#009E73"),
    query_via_pyvo: ("#CC79A7", "#F0E442", "#D55E00"),
}


def main() -> None:
    num_workers: int = mp.cpu_count()
    num_bins: int = 3 * num_workers
    test_num_sources = [1, 10, 100, 1000, 10_000, 100_000, 1_000_000, 3_000_000]
    edges = np.linspace(0, 360, num_bins + 1)

    run_times: dict[
        QueryFunction, dict[int, tuple[tuple[float, float], tuple[float, float]]]
    ] = {func: {} for func in TEST_FUNCTIONS}

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as pool:
        for query_func in TEST_FUNCTIONS:
            print(f"Testing {query_func.__qualname__}...")
            for num_objects in test_num_sources:
                print(f"Querying {num_objects} objects...")
                current_template: str = SQL_TEMPLATE.replace("?", str(num_objects))
                result = tuple(
                    pool.map(query_func, repeat(current_template), pairwise(edges))
                )
                current_read_times = np.array(tuple(map(lambda x: x[0], result)))
                current_write_times = np.array(tuple(map(lambda x: x[1], result)))
                mean_read_time = float(np.mean(current_read_times))
                std_read_time = float(np.std(current_read_times))
                mean_write_time = float(np.mean(current_write_times))
                std_write_time = float(np.std(current_write_times))
                run_times[query_func][num_objects] = (
                    (mean_read_time, std_read_time),
                    (mean_write_time, std_write_time),
                )

    fig = plt.figure(figsize=(12, 9))
    axis = fig.add_subplot(111)
    axis.set_title("Gaia query python library benchmarks")
    axis.set_xscale("log")
    axis.set_xlabel("Number of sources queried")
    axis.set_ylabel("Time taken (seconds)")
    for func, times in run_times.items():
        num_objects = sorted(times.keys())

        # Extract sorted keys and separate float values into two lists
        mean_read_times = np.array(
            [times[current_num_sources][0][0] for current_num_sources in num_objects]
        )
        std_read_times = np.array(
            [times[current_num_sources][0][1] for current_num_sources in num_objects]
        )
        mean_write_times = np.array(
            [times[current_num_sources][1][0] for current_num_sources in num_objects]
        )
        std_write_times = np.array(
            [times[current_num_sources][1][1] for current_num_sources in num_objects]
        )
        mean_total_times = mean_read_times + mean_write_times
        std_total_times = std_read_times + std_write_times
        read_color, write_color, total_color = LABEL_COLORS[func]

        axis.errorbar(
            num_objects,
            mean_read_times,
            yerr=std_read_times,
            label=f"{func.__qualname__} (read)",
            ls="-",
            color=read_color,
        )
        axis.errorbar(
            num_objects,
            mean_write_times,
            yerr=std_write_times,
            label=f"{func.__qualname__} (write)",
            ls="--",
            color=write_color,
        )
        axis.errorbar(
            num_objects,
            mean_total_times,
            yerr=std_total_times,
            label=f"{func.__qualname__} (total)",
            ls="-.",
            color=total_color,
        )
    axis.legend(loc="center left")
    fig.savefig("./benchmark.png")
    plt.close(fig)
    print("DONE")


if __name__ == "__main__":
    main()
@jwfraustro

Out of curiosity, I tried this myself. I limited TOP to 1000 results, since that should have been illustrative enough.

I did get a similar disparity to yours when running your script directly:

[Figure: benchmark1_infunc — a read-time disparity similar to the original benchmark]

However, I noticed that you are re-instantiating the TAPService for pyvo on every query, here:

def query_via_pyvo(template: str, edges: tuple[float, float]) -> tuple[float, float]:
    left_edge, right_edge = edges
    text = template.replace("@", str(left_edge)).replace("#", str(right_edge))
    service = TAPService("https://gea.esac.esa.int/tap-server/tap")

This seemed odd to me, because if you move it to the top level of the module, every query can reuse the same requests session that pyvo has instantiated. Moving that one line to the top of the module, say, here:

QueryFunction: TypeAlias = Callable[[str, tuple[float, float]], tuple[float, float]]
service = TAPService("https://gea.esac.esa.int/tap-server/tap")

def query_via_astroquery(

got me a much different result, now that we were able to make use of the session:

[Figure: benchmark_toplevel — results with the module-level TAPService]

As you can see, pyvo was even quicker than astroquery with this change. Perhaps it would be useful in your scripts as well?
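For completeness, a sketch of the corrected pyvo query function with the shared, module-level service (timing logic as in the original script):

import time

from pyvo.dal import TAPService

# Created once at module level: pyvo keeps a requests session on the service
# object, so repeated queries reuse pooled HTTP connections.
service = TAPService("https://gea.esac.esa.int/tap-server/tap")


def query_via_pyvo(template: str, edges: tuple[float, float]) -> float:
    left_edge, right_edge = edges
    text = template.replace("@", str(left_edge)).replace("#", str(right_edge))
    start = time.perf_counter()
    job = service.submit_job(text, maxrec=service.hardlimit)
    job.run()
    job.wait()
    job.fetch_result()
    return time.perf_counter() - start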

@jwfraustro

Out of curiosity, and in an attempt to remove as many confounders as possible, I made an even further stripped-down test script to evaluate whether that was the source of the inefficiency in the reads.

The script executes the queries only serially, against a service whose internals I'm familiar with: MAST's Gaia TAP service. I've also opted for a straight TOP query against the table. Since GaiaClass can't be used with MAST, I just used TapPlus, its parent class. Both client services were initialized at the top of the module as well.

My plot is not as pretty as yours, but it seems to show the same result again: pyvo is ~2x as fast with a top-level instantiation. The result lengths only go up to 100,000, since that's MAST's MAXREC.

[Figure: query_times_comparison — mean query times for astroquery vs. pyvo against MAST's Gaia TAP service]

The test script:

import time

import matplotlib.pyplot as plt
import numpy as np
from astroquery.utils.tap.core import TapPlus
from pyvo.dal import TAPService

MAST_URL = "https://mast.stsci.edu/vo-tap/api/v0.1/gaiadr3/"

QUERY = "SELECT TOP {} source_id FROM dbo.gaia_source"

# Both clients are initialized once at module level so their sessions are reused.
astroquery_tap = TapPlus(url=MAST_URL)
pyvo_tap = TAPService(MAST_URL)


def query_via_astroquery(n: int) -> float:
    start_time = time.perf_counter()
    job = astroquery_tap.launch_job_async(QUERY.format(n))
    job.get_results()
    return time.perf_counter() - start_time


def query_via_pyvo(n: int) -> float:
    start_time = time.perf_counter()
    job = pyvo_tap.submit_job(QUERY.format(n))
    job.run()
    job.wait()
    job.fetch_result()
    return time.perf_counter() - start_time


def main():
    n_values = [1, 100, 1000, 10000, 100000]
    astroquery_times = []
    pyvo_times = []

    for n in n_values:
        astroquery_run_times = [query_via_astroquery(n) for _ in range(10)]
        pyvo_run_times = [query_via_pyvo(n) for _ in range(10)]

        astroquery_mean = np.mean(astroquery_run_times)
        astroquery_std = np.std(astroquery_run_times)
        pyvo_mean = np.mean(pyvo_run_times)
        pyvo_std = np.std(pyvo_run_times)

        astroquery_times.append((n, astroquery_mean, astroquery_std))
        pyvo_times.append((n, pyvo_mean, pyvo_std))

    n_values = [x[0] for x in astroquery_times]

    astroquery_means = [x[1] for x in astroquery_times]
    astroquery_stds = [x[2] for x in astroquery_times]

    pyvo_means = [x[1] for x in pyvo_times]
    pyvo_stds = [x[2] for x in pyvo_times]

    plt.errorbar(n_values, astroquery_means, yerr=astroquery_stds, label="Astroquery", fmt="-o")
    plt.errorbar(n_values, pyvo_means, yerr=pyvo_stds, label="PyVO", fmt="-o")

    plt.xlabel("Number of Records")
    plt.ylabel("Time (seconds)")
    plt.title("Query Time Comparison")
    plt.legend()
    plt.xscale("log")
    plt.grid(True, which="both", ls="--")

    plt.savefig("query_times_comparison.png")
    plt.show()


if __name__ == "__main__":
    main()

@bsipocz (Member) commented Dec 18, 2024

Thank you @jwfraustro for the experiment. For context: we do have TapPlus around to support the ESA modules, but we do not support its usage for other TAP services; for all other services, the only supported client within astroquery and pyvo is the pyvo implementation.
There are also plans to move the missing functionality over from TapPlus and retire that module, though I don't have an ETA for that. So, independently of the conclusions of this test (which favours pyvo anyway), I am closing this issue now, as it doesn't contain any actionable items for pyvo.

@bsipocz bsipocz closed this as completed Dec 18, 2024
@pavyamsiri (Author)

This comment is not a request to re-open the issue; I just wanted to add to the findings.

@jwfraustro Thanks for investigating further! I also suspected I was doing something un-idiomatic, so I wanted to make sure by asking here.

I think the primary reason I recreated the service every time was that I used to use a process pool to avoid potential issues with Python's GIL, and I ran into issues passing the TAPService across process boundaries because it was not pickleable. This seems to work now, though? Testing again with a process pool and passing the service in as a function parameter, I get this:

[Figure: windows_wsl_benchmark_process_pool — benchmark rerun with a process pool on Windows/WSL]

I don't think this is actionable or a fault of pyvo; rather, passing services across processes doesn't really make sense to do. (A per-process alternative is sketched below.)
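If process-based parallelism is still wanted, one pattern that avoids pickling entirely is to construct the service once per worker process via the pool initializer; a minimal sketch, assuming the same Gaia TAP URL (the query and bin width are placeholders):

import concurrent.futures

from pyvo.dal import TAPService

_service: TAPService | None = None


def _init_worker() -> None:
    # Runs once in each worker process; the service (and its HTTP session)
    # never crosses a process boundary, so pickling never comes up.
    global _service
    _service = TAPService("https://gea.esac.esa.int/tap-server/tap")


def run_query(query: str):
    assert _service is not None
    return _service.search(query).to_table()


if __name__ == "__main__":
    queries = [
        f"SELECT TOP 10 source_id FROM gaiadr3.gaia_source_lite "
        f"WHERE l BETWEEN {a} AND {a + 45}"
        for a in range(0, 360, 45)
    ]
    with concurrent.futures.ProcessPoolExecutor(
        max_workers=4, initializer=_init_worker
    ) as pool:
        tables = list(pool.map(run_query, queries))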
