Adding weights and improving search ranking #41

Draft
wants to merge 1 commit into
base: main

dylanpivo
Contributor

@dylanpivo dylanpivo commented Feb 13, 2025

  • Add weights to title, keyword and abstract.
    • Curation requested that the weighting order be keyword, then title, then abstract.
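
A minimal sketch of how such a weighting might be expressed, assuming PostgreSQL full-text search, where `setweight` labels lexemes from A (highest) to D (lowest). The column names `keywords`, `title` and `abstract`, the `english` configuration and the helper name are hypothetical and are not taken from this PR.

```python
# Hypothetical mapping from metadata field to PostgreSQL weight label.
# The order follows the curation request: keyword, then title, then abstract.
FIELD_WEIGHTS = {
    "keywords": "A",
    "title": "B",
    "abstract": "C",
}


def weighted_tsvector_sql(field_weights: dict[str, str], config: str = "english") -> str:
    """Build a SQL expression that concatenates one weighted tsvector per field."""
    parts = [
        f"setweight(to_tsvector('{config}', coalesce({column}, '')), '{label}')"
        for column, label in field_weights.items()
    ]
    return " || ".join(parts)


# The expression is only printed here; nothing is sent to a database.
expression = weighted_tsvector_sql(FIELD_WEIGHTS)
print(expression)
```

The resulting expression could populate a stored search column, so that a match in the keywords contributes more to the rank than the same match in the abstract.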

Fix and improve ranking at search:
During search, the query was treating 'full_text' and 'query' as literal values rather than as columns. Both are now referenced as columns.
Normalisation: the current setting is 4 | 1, but this will be investigated further.
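
The column-versus-value distinction can be illustrated with a small sketch. It assumes PostgreSQL's `ts_rank_cd` and `plainto_tsquery`. The table name `catalog_record`, the `id` column and the exact shape of the original bug are assumptions, since the description above names only `full_text` and `query`.

```python
NORMALIZATION = 4 | 1  # bit flags combined with bitwise OR, which gives 5

# Buggy shape (assumed): the names are quoted, so the database would see
# string values rather than the column and the query alias.
buggy_rank = f"ts_rank_cd('full_text', 'query', {NORMALIZATION})"

# Fixed shape: unquoted identifiers refer to the stored tsvector column
# and to the tsquery alias defined in the FROM clause.
fixed_rank = f"ts_rank_cd(full_text, query, {NORMALIZATION})"

# %(search_term)s is a psycopg-style named placeholder. The statement is
# only printed here, not executed.
search_sql = f"""
SELECT id, {fixed_rank} AS rank
FROM catalog_record, plainto_tsquery('english', %(search_term)s) AS query
WHERE full_text @@ query
ORDER BY rank DESC
"""

print(search_sql)
```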

Fixes #9

@dylanpivo dylanpivo marked this pull request as draft February 13, 2025 09:17
@dylanpivo
Contributor Author

dylanpivo commented Feb 13, 2025

Normalization:

The list below outlines the normalization options. They match the options documented for PostgreSQL's ts_rank and ts_rank_cd functions.

  • 0 (the default) ignores the document length
  • 1 divides the rank by 1 + the logarithm of the document length
  • 2 divides the rank by the document length
  • 4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
  • 8 divides the rank by the number of unique words in document
  • 16 divides the rank by 1 + the logarithm of the number of unique words in document
  • 32 divides the rank by itself + 1

4 | 1 is currently in use.

4: ranks a record higher when the words in the search term occur closer together in the record, for instance when "climate" and "change" occur right after each other rather than at opposite ends of the document.

1: ranks a record lower the longer the document is. Taking the logarithm softens the penalty.
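
To make the `4 | 1` setting concrete, here is a small sketch that decodes the bit mask and compares the length penalty of option 1 with that of option 2 for a 50-word and a 200-word document. The natural logarithm and the helper names are assumptions made for illustration. The database's exact formula may differ.

```python
import math

# Option values as listed above. The names are illustrative, not a library API.
NORMALIZATION_OPTIONS = {
    1: "1 + log(document length)",
    2: "document length",
    4: "mean harmonic distance between extents",
    8: "number of unique words",
    16: "1 + log(number of unique words)",
    32: "rank + 1",
}


def active_options(mask: int) -> list[int]:
    """Return the option values whose bit is set in the mask."""
    return [flag for flag in NORMALIZATION_OPTIONS if mask & flag]


def log_length_divisor(word_count: int) -> float:
    """Divisor for option 1, assuming a natural logarithm."""
    return 1 + math.log(word_count)


mask = 4 | 1
print(mask, active_options(mask))  # 5 [1, 4]

# Compare how much more a long document is penalized than a short one.
short_words, long_words = 50, 200
linear_ratio = long_words / short_words
log_ratio = log_length_divisor(long_words) / log_length_divisor(short_words)
print(f"option 2 penalty ratio: {linear_ratio:.2f}")
print(f"option 1 penalty ratio: {log_ratio:.2f}")
```

Under these assumptions, a document four times longer is penalized far less than fourfold by option 1, which is the softening effect described above.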

@dylanpivo
Contributor Author

dylanpivo commented Feb 17, 2025

Testing:
The testing involves mocking up a list of different metadata records, publishing them and then searching. The metadata will be put together using only Lorem Ipsum mock data, arranged so as to cover the different ranking circumstances.

The search term will be fixed. The options from which combinations will be generated are as follows:

Search terms in title, with no harmonic distance.
Search terms in title, with harmonic distance.
No search term in title.

Short length abstract. (50 words)
Long abstract. (200 words)

No/Low harmonic distance of search terms in abstract.
High harmonic distance between search terms in abstract.

Many instances of search terms in abstract. (6 instances)
Few instances of search terms in abstract. (3 instances)
Note: the number of instances does not increase when the abstract length increases. This is so that lengthening the abstract affects the ranking in isolation.

Keywords with all search terms.
Keywords with no search terms.
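
The combinations described above can be enumerated with a Cartesian product. This is a sketch only: the option labels are paraphrased from the list, and the variable names are hypothetical.

```python
from itertools import product

# Each dimension mirrors one group of options in the test plan.
TITLE = ["terms adjacent", "terms far apart", "no terms"]
ABSTRACT_LENGTH = [50, 200]  # words
ABSTRACT_DISTANCE = ["low", "high"]  # harmonic distance between terms
ABSTRACT_INSTANCES = [6, 3]  # occurrences of the search terms
KEYWORDS = ["all terms", "no terms"]

# One dictionary per mock record, covering every combination of options.
combinations = [
    {
        "title": title,
        "abstract_length": length,
        "abstract_distance": distance,
        "abstract_instances": instances,
        "keywords": keywords,
    }
    for title, length, distance, instances, keywords in product(
        TITLE, ABSTRACT_LENGTH, ABSTRACT_DISTANCE, ABSTRACT_INSTANCES, KEYWORDS
    )
]

print(len(combinations))  # 48
```

Taking every option from every group gives 48 mock records. Whether the full product or only a subset will be published is not stated in the plan.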

Development

Successfully merging this pull request may close these issues.

Use weightings to improve sorting by relevance