GitHub - vagvaz/Leads-QueryProcessorM

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
conf		conf
repo		repo
scripts		scripts
src		src
LICENSE		LICENSE
README		README
pom.xml		pom.xml

Repository files navigation

Prototype implementation of Query Processor

The source code can be found online at https://github.com/vagvaz/Leads-QueryProcessor.git

Furthermore, the prototype implementation is deployed at https://dashboard-dresden4.aocloud.eu/project/images_and_snapshots/
as a snapshot of a virtual machine with name Leads-QueryProcessor Demo


The virtual machine image includes the following software:
* An open-source operating system (Ubuntu linux), including the Java Virtual Machine
(OpenJDK 1.7.0).
* The Query Processor Engine of WP3
* The distributed PageRank algorithm of WP3
* The terminal-based user interface for executing SQL queries of WP3
* A deployment of the Key-Value Store of WP2
* The distributed web crawler of WP1



1.1 Compiling and  Running the prototype
To compile the leads query procesor  the following command must be executed in the root directory of the project.
mvn clean compile assembly:single
Then, in the target directory you can find the jar file with all the runtime dependencies.
To execute the leads query processor you can use the processor-with-crawler.sh script found in scripts directory. 
Note that the script expects a jar in the working directory named LeadsQueryProcessor.jar and the crawler jar named as crawler.jar.


processor-with-crawler.sh: This is a stand-alone script, which starts an instance of a web
crawler together with the query processor. This script is provided only for demonstration
purposes and will not be useful in the final installation, since the crawlers and the query
processors will be running independently, possibly also in different micro-clouds.

processor.sh: This script starts an instance of the query processor. Use this script if there
already exists a running instance of the web crawler in the same subnet (see D1.2 for moredetails on the web crawler).
Any of the two scripts will start a terminal-based user interface for executing SQL queries.
In order to quit the user interface, use the command ‘quit;’ (note that terminating with Ctrl-C
may lead to information loss, due to abnormal termination of the KVS store) .

The query processor implementation currently supports the following SQL syntax:
SELECT
select_expr [, select_expr ...]
[FROM table_reference
[ JOIN table_reference ON col_name = col_name ]

[WHERE where_condition]
[GROUP BY {col_name} ]
[HAVING where_condition]
[ORDER BY {col_name}
[ASC | DESC]]
[LIMIT { row_count }]

The select expression can be a column name, or one of the following functions: count, avg,sum.
The following two tables are created and maintained automatically, and can be used for writing SQL queries:

Table webpages: This table contains all crawled web pages, and has the following structure:
{
string url: the url of the webpage as a string,
string domainName: the domainName for the webpage as a string,
double pagerank: the pagerank computed by WP3 pagerank algorithm as double,
string body: the content of the webpage as a string and
double sentiment: an overall estimation of the sentiment of webpage’s content.
}

Table entities: This table contains information for entities that are extracted from the webpages
(e.g., adidas) and has the following structure:
{
string webpageURL: the url of the webpage that contains the entity,
string name: the name of the entity and
double sentimentScore: The sentiment for the entity in the webpage
}

The following sample queries can be used to demonstrate the capabilities of the prototype query
processor:
* SELECT domainName, avg(pagerank) FROM webpages GROUP BY
domainName ORDER BY avg(pagerank) DESC LIMIT 10;
* SELECT domainName, avg(pagerank), avg(sentimentScore) FROM
webpages JOIN entities on url=webpageURL WHERE entities.name
like 'adidas' GROUP BY domainName HAVING avg(sentimentScore) >
0.5 ORDER BY avg(pagerank) DESC;
* SELECT count(*) from webpages;
* SELECT url,sentiment FROM entities WHERE sentiment>0.4 LIMIT 5;
* SELECT domainName, sum(pagerank) FROM webpages group by domainName ORDER BY sum(pagerank) DESC LIMIT 10;

The prototype implementation has the following limitations:
1. The query processor does not support aliases.
2. In queries with a join, the column of the table found in the FROM clause must be the left
operand in the equality relation. In the following example, the URL column of the webpages
table is the left operand in the relation of the join :
SELECT domainName,name FROM webpages JOIN entities on
url=webpageURL;

3. The * wildcard can be used only inside a function. For example, the following query is
supported
SELECT count(*) from webpages;
whereas the following query is not supported
SELECT * FROM webpages;
As a workaround, user must set the column names explicitly.

4. The free version of the third-party, web-service AlchemyAPI used for extracting the entities
and sentiment scores imposes a quota on the number of requests per day and user.
Furthermore, the service offers no up-time guarantees. If the quota is surpassed or if the
service is down, some sentiments may be set to -2.

5. Robustness: If the query processor terminates unexpectedly (e.g., through a Control-C signal,
or due to a network or hardware fault) the hosted segment from the KVS may be lost.
Clearly, this can lead to information loss. This limitation will be addressed by replication at
the KVS level.

1.2 Configuration
The configuration for the query processor is read by the file processor.properties, which is saved at
the root directory. The file includes three parameters:
1. processorInfinispanConfigFile: This parameter defines the file used to
configure the KVS instance that will be started by the query processor.
processorSentimentAnalysisKeyFile: This parameter is used to specify the file
containing the AlchemyAPI key to use.
2. verbose: This parameter configures the verbosity of the logging information presented
in the querying terminal. verbose can be set to true or false.
A sample configuration file follows:
processorInfinispanConfigFile=infinispan-clustered-tcp-processor.xml
processorSentimentAnalysisKeyFile=key-processor.txt
verbose=true;