Project

1. Preprocessing

Process: NewsArticle → ArticleWordsDic

Get the unique query term, store them in a HashSet — method getQueryWordsSet().
Create a custom HashMap Accumulator — TotalQueryWordsAccumulator, and 2 Long Accumulator:
- wordCountAccumulator: The overall all document length after preprocessing.
- docCountAccumulator: The total doc number after preprocessing.
- TotalQueryWordsAccumulator: Count the total number of Query term appears in the whole corpus.
And using above variables, we can get averageDocumentLengthInCorpus.
Use PreprocessFlatMap() to convert NewsArticle DataSet to ArticleWordsDic DataSet and collect, in order to lower the data size, only convert the article has query term to ArticleWordsDic.
1. Check if the title or ID is null, and the get 5 paragraph and title into a List.
2. wordCountAccumulator += list size, and docCountAccumulator + 1.
3. Use the query term hashset, count each term and frequency in current Article, store in a HashMap.
4. If the map size == 0, means that no term showed up in this Article, return an empty list.
5. TotalQueryWordsAccumulator add the HashMap.
6. Return the created ArticleWordsDic
Retrieve above DPH parameter and the HashMap.

Structures & Functions

Class `ArticleWordsDic`

public class ArticleWordsDic implements Serializable {

    String id; // unique article identifier

    String title; // article title

    int length;

		Map<String, Integer> map;
}

Class `TotalQueryWordsAccumulator`

public class TotalQueryWordsAccumulator extends AccumulatorV2<HashMap<String, Integer>, HashMap<String, Integer>> {

    private HashMap<String, Integer> hashMap = new HashMap<>();
		....
}

FlatMapFunction `PreprocessFlatMap`

FlatMapFunction `QueryWordsFlatMap`

2. Query

Process: ArticleWordsDic → QueryResultWithArticleId

Wrap the DPH parameter and the List into Broadcast variables.
Based on the queries dataset, call QueryToQueryResultMap method to get the QueryResultWithArticleId.
1. Get the score of each article, store the articleID and score into List<DPHResult> dphResultList
2. Sort the dphResultList.
3. Calculate the distance between titles, and create 10 article list, storing at queryResultWithArticleId.
4. Return the queryResultWithArticleId.

Structures & Functions

Class `QueryResultWithArticleId`

public class QueryResultWithArticleId implements Serializable {

    Query query;

    List<DPHResult> articleIdList;
}

Class `DPHResult`

public class DPHResult implements Serializable {

    String id;

    String title;

    double score;
}

MapFunction `QueryToQueryResultMap`

public class QueryToQueryResultMap implements MapFunction<Query, QueryResultWithArticleId> {

    private static final long serialVersionUID = -484410270146328326L;

    Broadcast<List<ArticleWordsDic>> listBroadcast;
    Broadcast<Long> totalDocsInCorpus;
    Broadcast<Double> averageDocumentLengthInCorpus;
    Broadcast<Map<String, Integer>> totalTermFrequencyInCorpusDic;

    public QueryToQueryResultMap(Broadcast<List<ArticleWordsDic>> listBroadcast, Broadcast<Long> totalDocsInCorpus,
                                 Broadcast<Double> averageDocumentLengthInCorpus,
                                 Broadcast<Map<String, Integer>> totalTermFrequencyInCorpusDic) {
        this.listBroadcast = listBroadcast;
        this.totalDocsInCorpus = totalDocsInCorpus;
        this.averageDocumentLengthInCorpus = averageDocumentLengthInCorpus;
        this.totalTermFrequencyInCorpusDic = totalTermFrequencyInCorpusDic;
    }
}

3. Generate DocumentRanking

Process: QueryResultWithArticleId → DocumentRanking

In Dataset<QueryResultWithArticleId> queryResultWithArticleIdDataset, there are all the ArticleID needed for the result, then using GetReusltArticleIdFlatMap() to generate the unique ArticleID as HashSet, use it on NewsArticleResultFlatMap of news dataset to get the Dataset needed for create DocumentRanking.
After getting Dataset, map it to JavaPairRDD<String, NewsArticle>, the key is the ArticleId, collect as Map.
Then broadcast it to the previous queryResultWithArticleIdDataset, using QueryWithArticleIdToDR Map to get the final List<DocumentRanking> documentRankingList.

Functions

FlatMapFunction `GetReusltArticleIdFlatMap`

FlatMapFunction `NewsArticleResultFlatMap`

MapFunction `QueryWithArticleIdToDR`

Edits

Renamed flatmap function GetReusltArticleIdFlatMap to GetResultArticleIdFlatMap, and updated the related classes.
rephrase comments in AssessedExercise and QueryToQueryResultMap files.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
resources		resources
src/uk/ac/gla/dcs/bigdata		src/uk/ac/gla/dcs/bigdata
.gitignore		.gitignore
pom.xml		pom.xml
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project

1. Preprocessing

Process: NewsArticle → ArticleWordsDic

Structures & Functions

Class `ArticleWordsDic`

Class `TotalQueryWordsAccumulator`

FlatMapFunction `PreprocessFlatMap`

FlatMapFunction `QueryWordsFlatMap`

2. Query

Process: ArticleWordsDic → QueryResultWithArticleId

Structures & Functions

Class `QueryResultWithArticleId`

Class `DPHResult`

MapFunction `QueryToQueryResultMap`

3. Generate DocumentRanking

Process: QueryResultWithArticleId → DocumentRanking

Functions

FlatMapFunction `GetReusltArticleIdFlatMap`

FlatMapFunction `NewsArticleResultFlatMap`

MapFunction `QueryWithArticleIdToDR`

Edits

About

Releases

Packages

Languages

hjtan75/BigData-AE

Folders and files

Latest commit

History

Repository files navigation

Project

1. Preprocessing

Process: NewsArticle → ArticleWordsDic

Structures & Functions

Class ArticleWordsDic

Class TotalQueryWordsAccumulator

FlatMapFunction PreprocessFlatMap

FlatMapFunction QueryWordsFlatMap

2. Query

Process: ArticleWordsDic → QueryResultWithArticleId

Structures & Functions

Class QueryResultWithArticleId

Class DPHResult

MapFunction QueryToQueryResultMap

3. Generate DocumentRanking

Process: QueryResultWithArticleId → DocumentRanking

Functions

FlatMapFunction GetReusltArticleIdFlatMap

FlatMapFunction NewsArticleResultFlatMap

MapFunction QueryWithArticleIdToDR

Edits

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Class `ArticleWordsDic`

Class `TotalQueryWordsAccumulator`

FlatMapFunction `PreprocessFlatMap`

FlatMapFunction `QueryWordsFlatMap`

Class `QueryResultWithArticleId`

Class `DPHResult`

MapFunction `QueryToQueryResultMap`

FlatMapFunction `GetReusltArticleIdFlatMap`

FlatMapFunction `NewsArticleResultFlatMap`

MapFunction `QueryWithArticleIdToDR`

Packages