-
Notifications
You must be signed in to change notification settings - Fork 3
Open source Winnow is a very high performance content recommendation engine that efficiently trains and operates any number of unique classifiers on large sets of content. Winnow was developed over the course of a number of years, and has been heavily used to power winnowtag.org during that time, where it has proven to be accurate and stable.
License
WinnowTag/winnow
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Copyright (c) 2007-2010 The Kaphan Foundation This file contains instructions for running and configuring the Winnow content recommendation engine. For information on building and installing the engine see the INSTALL file. Please see the pages of the wiki our github repository http://github.com/WinnowTag/winnow/wiki for detailed explanations of Winnow's architecture and operation. Creating an item cache ============================================ Before running Winnow, you'll need to create the persistent item cache. This is a directory "item_cache" containing the three sqlite3 databases atom.db, tokens.db and catalog.db. See the "Item Cache" document page in the github repository wiki for more information. To initialize the database schemas, use sqlite3 to execute the SQL shown below in each corresponding newly created sqlite3 database file. After you've used the classifier on a sufficient number of items, you'll need to set up the random background. The classifier accuracy will be poor until you do this. See the section "Setting up random background" at the end of this file. TODO: Create a script that does this. TODO: Provide a reasonably small item_cache already populated and with random background initialized that can be used for testing and initial development. atom.db ------- CREATE TABLE entry_atom (id integer not null primary key, atom BLOB); tokens.db --------- CREATE TABLE entry_tokens (id integer not null primary key, tokens BLOB); catalog.db ---------- CREATE TABLE entries (id integer NOT NULL PRIMARY KEY, full_id text, updated real, created_at real, last_used_at real); CREATE TABLE "random_backgrounds" ( "entry_id" integer NOT NULL PRIMARY KEY, constraint "random_backgrounds_entry_id" foreign key ("entry_id") references "entries" ("id") ); CREATE TABLE "tokens" ( "id" integer NOT NULL PRIMARY KEY, "token" text NOT NULL UNIQUE ); CREATE UNIQUE INDEX entries_full_id on entries(full_id); CREATE INDEX entries_last_used_at on entries(last_used_at); CREATE INDEX entries_updated on entries(updated); CREATE TRIGGER random_backgrounds_entry_id BEFORE DELETE ON entries FOR EACH ROW BEGIN SELECT RAISE(ROLLBACK, 'delete on table "entries" violates foreign key constraint "random_backgrounds_entry_id"') WHERE (SELECT entry_id FROM random_backgrounds WHERE entry_id = OLD.id) IS NOT NULL; END; Running the Classifier ============================================ The classifier can be run using the "classifier" executable. If you performed the "make install" of the build instructions this is likely in your PATH. Run "classifier -h" to see a list of supported arguments. A typical usage will be to start the classifier on a specified port and item cache directory and a positive cutoff threshold of 0.9. You probably also want a log file to be generated. This will be enough for most development cases. This can be done using the following command: classifier --db <item_cache_dir> --port 8008 -t 0.9 --log-file classifier.log You can then tail -f classifier.log to see the log messages the classifier produces. Once a message appears in the log saying the classifier has completed initialization you can then access it's http server on port 8008. Initialization can sometimes take a few minutes depending on the size of the item cache and whether the file system has cached item cache file access. Once things are up and running Winnow will be able to talk to the classifier using the standard classifier UI elements. You can also do a simple test by entering http://localhost:8009/classifier.xml into a web browser to make sure everything is working. Shutting it down ============================================= The classifier is shutdown by sending it a SIGTERM signal. This can be done using CTRL-C or kill <pid>. Setting up the random background ============================================= After you have classified more than 1000-5000 items, set up the random background. This is done by executing the following SQL in the item.db sqlite3 database: INSERT INTO RANDOM_BACKGROUNDS SELECT ID FROM ENTRIES ORDER BY random() LIMIT N ...where N is the size that you want. We recommend between 1000 and 5000 items depending upon the size of the corpus that you classify. For winnowtag.org, which works with over 3/4 million items, we use a random background of 5000 items. Until you set up the random background, Winnow's classification accuracy will be poor.
About
Open source Winnow is a very high performance content recommendation engine that efficiently trains and operates any number of unique classifiers on large sets of content. Winnow was developed over the course of a number of years, and has been heavily used to power winnowtag.org during that time, where it has proven to be accurate and stable.
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published