This repository contains all my programming assignments for CS276 - Information Retrieval and Web Search
Blocked Sort-Based Indexing, Index compression using a) Variable Byte Encoding b) Gamma Encoding
Task1
>python index.py arg1 arg2
Indexes all files from arg1 into arg2
arg1 -- directory containing all files to be indexed
arg2 -- output directory where indexed files will be stored by the program, i.e. the folder where the postings list will be stored
>python query.py arg1
Enter query terms after issuing this command. Searches the built index for relevant files (unranked retrieval)
arg1 -- directory where the indexed files were written to in the previous step, i.e. arg2 of step 1 above
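Unranked retrieval over a postings index usually comes down to intersecting the sorted postings lists of the query terms. The following is a minimal sketch of that idea, assuming conjunctive (AND) queries and an in-memory dict from term to sorted document IDs. The names intersect and boolean_and and the toy index are illustrative only and are not taken from query.py.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted lists of document IDs."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [index.get(term, []) for term in terms]
    if not postings:
        return []
    # Start with the shortest list so intermediate results stay small.
    postings.sort(key=len)
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
        if not result:
            break
    return result


# Hypothetical toy index: term -> sorted doc IDs.
toy_index = {"retrieval": [1, 2, 4, 8], "index": [2, 4, 5], "gamma": [4, 8]}
print(boolean_and(toy_index, ["retrieval", "index"]))
```

Processing the shortest postings list first is a common heuristic, since the result can never be longer than the shortest input.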
Task2
>python index.py arg1 arg2
Indexes all files from arg1 into arg2 with variable byte compression. Notice the reduction in the size of the index
arg1 -- directory containing all files to be indexed
arg2 -- output directory where indexed files will be stored by the program, i.e. the folder where the postings list will be stored
>python query.py arg1
Enter query terms after issuing this command. Searches the built index for relevant files (unranked retrieval)
arg1 -- directory where the indexed files were written to in the previous step, i.e. arg2 of step 1 above
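For reference, variable byte compression stores each postings list as gaps between consecutive document IDs, with each gap written in 7-bit chunks and the high bit set on the last byte of a number. The sketch below assumes Python 3 and sorted, increasing document IDs. The function names are illustrative, and the on-disk format written by index.py may differ.

```python
def vb_encode_number(n):
    """Encode one non-negative integer as variable-byte bytes."""
    if n < 0:
        raise ValueError("variable byte encoding needs a non-negative integer")
    chunks = []
    while True:
        chunks.insert(0, n % 128)  # prepend the next 7-bit chunk
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128  # high bit on the final byte marks the end of the number
    return bytes(chunks)


def vb_encode_postings(doc_ids):
    """Encode sorted doc IDs as variable-byte encoded gaps."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        out += vb_encode_number(doc_id - previous)
        previous = doc_id
    return bytes(out)


def vb_decode_postings(data):
    """Decode variable-byte encoded gaps back into doc IDs."""
    doc_ids, n, previous = [], 0, 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte  # continuation byte
        else:
            previous += 128 * n + (byte - 128)  # final byte of this gap
            doc_ids.append(previous)
            n = 0
    return doc_ids


encoded = vb_encode_postings([824, 829, 215406])
print(list(encoded))
print(vb_decode_postings(encoded))
```

Small gaps fit in a single byte, which is where the reduction in index size comes from.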
Task3
>python index.py arg1 arg2
Indexes all files from arg1 into arg2 with gamma compression. Notice the further reduction in index size
arg1 -- directory containing all files to be indexed
arg2 -- output directory where indexed files will be stored by the program, i.e. the folder where the postings list will be stored
>python query.py arg1
Enter query terms after issuing this command. Searches the built index for relevant files (unranked retrieval)
arg1 -- directory where the indexed files were written to in the previous step, i.e. arg2 of step 1 above
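For reference, a gamma code writes a positive integer as the length of its offset in unary, followed by the offset itself (the binary representation without its leading 1). The sketch below uses strings of "0" and "1" for readability and assumes well-formed input. A real index would pack the bits into bytes, and the names here are illustrative rather than taken from index.py.

```python
def gamma_encode(n):
    """Return the gamma code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    offset = bin(n)[3:]  # binary digits without the '0b' prefix and leading 1
    return "1" * len(offset) + "0" + offset  # unary length, then offset


def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # read the unary length
            length += 1
            i += 1
        i += 1  # skip the terminating 0
        offset = bits[i:i + length]
        i += length
        numbers.append(int("1" + offset, 2))  # restore the leading 1
    return numbers


gaps = [1, 13, 2, 9]
stream = "".join(gamma_encode(g) for g in gaps)
print(stream)
print(gamma_decode(stream))
```

Because gamma codes cannot represent 0, gaps must be at least 1, which holds for strictly increasing document IDs as long as the first ID is not 0.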
Spelling Corrector
Step1: Train the noisy channel model
>./buildmodels.sh
training corpus -- corpus on which the noisy channel model will be trained for computing N-gram probabilities
training edits file -- A file containing a word and its misspellings, used for training the edit model probabilities
Step2: Test the noisy channel model on sentences/queries with spelling errors
>./runcorrector.sh <dev | test> <uniform | empirical>
dev | test -- really, pass in anything
uniform | empirical -- Pass in uniform for using a uniform edit model, empirical for a trained edit model
queries file -- A file containing one query string per line, with possible spelling mistakes
Reference :
Uniform edit model -- An edit model that assumes all spelling errors occur with the same probability
Empirical edit model -- An edit model that is trained on a dataset provided in the edits file with buildmodels.sh
This will be used in computing the probabilities for each kind of spelling error
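As a rough illustration of how a noisy channel corrector combines the two models, each candidate query can be scored by the log probability of the edits needed to reach the observed query plus the log probability of the candidate under the language model. The sketch below is a simplification under stated assumptions: it uses a unigram language model without smoothing instead of the full N-gram model, the values of p_edit and p_no_edit are made up, and the candidate list with edit distances is supplied by hand. None of the names come from the scripts in this repository.

```python
import math


def uniform_edit_logprob(edit_distance, p_edit=0.05, p_no_edit=0.95):
    """Uniform edit model: every single edit has the same probability.

    p_edit and p_no_edit are assumed values chosen for illustration.
    """
    if edit_distance == 0:
        return math.log(p_no_edit)
    return edit_distance * math.log(p_edit)


def unigram_logprob(query, counts, total):
    """Log probability of a query under an unsmoothed unigram model.

    Assumes every word of the query appears in counts.
    """
    return sum(math.log(counts[word] / total) for word in query.split())


def best_candidate(candidates, counts, total):
    """Pick the candidate that maximises log P(observed | cand) + log P(cand).

    candidates is a list of (candidate_query, edit_distance) pairs.
    """
    def score(item):
        candidate, distance = item
        return uniform_edit_logprob(distance) + unigram_logprob(candidate, counts, total)

    return max(candidates, key=score)[0]


# Hypothetical counts and candidates for the observed query "teh cat".
counts = {"the": 50, "teh": 1, "cat": 10}
total = sum(counts.values())
candidates = [("teh cat", 0), ("the cat", 1)]
print(best_candidate(candidates, counts, total))
```

An empirical edit model would replace uniform_edit_logprob with probabilities estimated per edit type from the edits file.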
Ranked Information Retrieval based on
- Cosine Similarity
- BM25 ranking measure
- Smallest Window of query terms
- Number of in-links to a webpage
>python rank.py task_number queryDocTrainData queryDocTrainRel queryDocTestData
task_number -- 1 | 2 | 3 | 4
1 -- Ranking based on cosine similarity
2 -- Ranking based on BM25 similarity measure
3 -- Ranking based on Smallest Window signal
4 -- Ranking based on # in-links for a page
queryDocTrainData -- Use the file with the same name. This contains formatted data for training the model on a set of queries and a set of relevant results for each query.
queryDocTrainRel -- Use the file with the same name. This contains formatted data for training the model on a set of queries and a set of relevant results for each query.
queryDocTestData -- A file containing a query followed by a set of documents. The program will rank the documents according to the ranking measure specified in task_number
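To make the Smallest Window signal concrete: it is the length of the shortest span of a document field that contains every query term, and shorter spans suggest a better match. The sketch below computes that length with a sliding window over a token list. The function name and the example tokens are illustrative, and how rank.py turns the window size into a score boost is not shown here.

```python
from collections import Counter


def smallest_window(query_terms, doc_tokens):
    """Length of the shortest span of doc_tokens containing every distinct
    query term, or None if some term never occurs."""
    needed = set(query_terms)
    if not needed:
        return None
    have = Counter()  # occurrences of needed terms inside the current window
    covered = 0       # number of distinct needed terms currently in the window
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            have[token] += 1
            if have[token] == 1:
                covered += 1
        # Shrink from the left while the window still covers every term.
        while covered == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_token = doc_tokens[left]
            if left_token in needed:
                have[left_token] -= 1
                if have[left_token] == 0:
                    covered -= 1
            left += 1
    return best


tokens = "cs courses at stanford include cs 276".split()
print(smallest_window(["stanford", "cs"], tokens))
```

Each token enters and leaves the window at most once, so the scan is linear in the length of the field.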
Text Classification based on
- Binomial Naive Bayes
- Binomial Naive Bayes with chi-squared feature selection
- Multinomial Naive Bayes
- Multinomial Naive Bayes with chi-squared feature selection
- Transformed Weight-normalized Complement Naive Bayes
- Support Vector Machines
- K-fold cross-validation performed for tuning parameters for each of the above learning algorithms
Detailed instructions on how to run this assignment are given in PA4.pdf in the PA4 directory
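As background for the chi-squared feature selection listed above, each term is scored against a class using a 2x2 table of document counts, and only the highest-scoring terms are kept as features. The sketch below shows the standard statistic for such a table. The function names, the argument layout and the toy counts are assumptions for illustration and do not describe the PA4 code.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic for a term/class 2x2 contingency table.

    n11: docs in the class that contain the term
    n10: docs outside the class that contain the term
    n01: docs in the class without the term
    n00: docs outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the term carries no signal
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator


def select_features(tables, k):
    """Return the k terms with the highest chi-squared score.

    tables maps each term to its (n11, n10, n01, n00) counts.
    """
    ranked = sorted(tables, key=lambda term: chi_squared(*tables[term]), reverse=True)
    return ranked[:k]


# Hypothetical counts for three terms against one class.
toy_tables = {
    "goal": (40, 5, 10, 45),
    "the": (25, 25, 25, 25),
    "match": (30, 10, 20, 40),
}
print(select_features(toy_tables, 2))
```

A term that is spread evenly across classes scores 0 and is dropped first.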