db_comparisons

Compare sqlite and DuckDB for aggregation queries.

Copy from sqlite to duckdb. Follow create_duckdb_from_mag.md. Need access to current database.
Create sqlite copy from the duckdb file created in step 1 (for exact comparison). Create indexes. TODO
Compare queries: directly in Duck, directly in sqlite, and querying sqlite with duck. Compare timing; make sure results are the same. TODO

Queries to compare

Citation count

create temp table temp_citations as 
SELECT  a.PaperReferenceId, 
            b.Year, 
            b.DocType AS ReferencingDocType, 
            COUNT(DISTINCT a.PaperId) AS CitationCount 
FROM PaperReferences a 
INNER JOIN (
    SELECT PaperId, Year, DocType 
    FROM Papers 
    INNER JOIN (
        SELECT PaperId 
        FROM PaperMainFieldsOfStudy
    ) USING (PaperId)
    WHERE 
    DocType IN ('Journal', 'Conference', 'Book', 'BookChapter', 'Thesis') 
) b on a.PaperId = b.PaperId 
INNER JOIN (
    SELECT PaperId, Year, DocType 
    FROM Papers 
    INNER JOIN ( 
        SELECT PaperId 
        FROM PaperMainFieldsOfStudy
    ) USING (PaperId) 
    WHERE 
    DocType IN ('Journal', 'Conference', 'Book', 'BookChapter', 'Thesis') 
    AND 
    Year >= 1950
) c on a.PaperReferenceId = c.PaperId 
GROUP BY a.PaperReferenceId, b.Year, ReferencingDocType

Highest N papers by author-affiliation-year

create temp table author_affil_year as 
SELECT AuthorId, AffiliationId, Year
FROM (
    SELECT *,
            MAX(PaperCount) OVER(PARTITION BY AuthorId, Year) AS MaxPaperCount
    FROM (
        SELECT a.AuthorId, 
                a.AffiliationId, 
                c.Year, 
                count(PaperId) AS PaperCount
        FROM PaperAuthorAffiliations a
        INNER JOIN (
            SELECT AuthorId 
            FROM author_sample 
        ) b USING (AuthorId)
        INNER JOIN (
            SELECT PaperId, Year 
            FROM Papers
            WHERE DocType IN ('Journal', 'Conference')
        ) c USING (PaperId)
        GROUP BY a.AuthorId, a.AffiliationId, c.Year 
    ) 
)   
WHERE PaperCount = MaxPaperCount ;

Thoughts and lessons learned

if sqlite does not define types, duckdb cannot convert them.
- raises Error: Conversion Error: Unimplemented type for cast (BLOB -> INTEGER)
- So, it's important to have a type for each column on each table when combining sqlite and duckdb.
"static input data" != static database.
- what seems to matter is whether input data is static; I did not understand that correctly until today
sqlite still has advantages that may be overlooked by these comparisons
- for instance, the unique constraint on indexes is quick check that data integrity is not violated; a similar mechanism should always be used even when using duckdb
I think Duck parallelizes across CPUs. So the gains may differ by by laptop vs cluster/server
- are the gains proportional to the CPUs available?
- compare both on a cluster and on a laptop

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
create_duckdb_from_mag.md		create_duckdb_from_mag.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

db_comparisons

Steps

Queries to compare

Thoughts and lessons learned

About

Releases

Packages

License

f-hafner/db_comparisons

Folders and files

Latest commit

History

Repository files navigation

db_comparisons

Steps

Queries to compare

Thoughts and lessons learned

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Packages