The problem presented was to take a dataset of 50 users (and their interests) and 50 historic posts, and find a methodology to rank posts specifically for each user. A sample this small cannot support any real statistical significance, so I have instead chosen a method that would work on a larger dataset and applied it here regardless.
I have gone with an unsupervised learning approach because the dataset is unknown to me, so I cannot apply any kind of expert knowledge to group the users up front for a supervised method.
I used a simple K-Means algorithm in clustering_model.py to cluster the users into groups based on their age and their interest categories.
Running some analysis, it appeared that with the dataset I have been given, 4 groups could be formed without making the data too sparse. As can be seen below, the elbow point for the SSE (Sum of Squared Errors) sits at 4.
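As a rough illustration, the elbow analysis might look like the sketch below. The feature construction (age plus one-hot encoded interest categories), the file path, and the column names are assumptions on my part; clustering_model.py holds the actual implementation.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Assumed path and columns; build features from age plus one-hot interests
users = pd.read_csv("data/users.csv")
features = pd.get_dummies(users[["age", "interest_category"]],
                          columns=["interest_category"])
X = StandardScaler().fit_transform(features)

# Compute the SSE (inertia) for a range of k and look for the elbow
sse = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    sse[k] = km.inertia_
print(sse)  # on this dataset the elbow sits at k=4
```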
Of these 4 groups, I wanted to know what sort of posts they typically create. The thinking behind this is that most social media feeds I know of tend to show you more of what you either look at or post. Because I have no data on what these users look at, I will show them what similar users typically post.
For this, I have decided to use LDA in LDA.py to break the posts into topics. Analysis shows that 5 topics gives an acceptable level of comprehension, and the 5 topics that emerge are reasonably distinct, as can be seen below:
It must be noted, however, that with such minimal data these topics have some cross-over of keywords. For example, 'save' and 'saving' appear in multiple topics, with varying levels of importance.
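For reference, a condensed sketch of this step is below. I have assumed gensim for the LDA model, along with an illustrative file path and column name; LDA.py contains the actual implementation.

```python
import pandas as pd
import spacy
from gensim import corpora
from gensim.models import LdaModel

# spaCy handles tokenisation and lemmatisation of the post text
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenise(text):
    # Keep lowercase lemmas of alphabetic, non-stop-word tokens
    return [t.lemma_.lower() for t in nlp(text) if t.is_alpha and not t.is_stop]

posts = pd.read_csv("data/posts.csv")                 # assumed path
docs = [tokenise(text) for text in posts["content"]]  # assumed column name

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# 5 topics gave acceptably distinct keyword groups on this dataset
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
               random_state=42, passes=10)
for topic_id, keywords in lda.print_topics(num_words=8):
    print(topic_id, keywords)
```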
By applying these topics to each post, we get a score for each post by topic. Taking the highest value yields the topic assigned to that post. I have then aggregated the posts by the cluster group assigned to the user who wrote them. The results are shown below:
Using this information, I have then created ratio scores in topic_mapping.py to show the propensity for a group to post about a certain topic, as sketched below.
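The sketch below shows the shape of this mapping step on toy data; the column names and values are purely illustrative (trimmed to 3 topics for brevity), and topic_mapping.py holds the real calculation.

```python
import pandas as pd

# One row per post: the author's K-Means cluster plus the post's LDA
# score under each topic (toy values, trimmed to 3 topics)
posts = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2],
    "topic_0": [0.10, 0.62, 0.25, 0.05, 0.30],
    "topic_1": [0.75, 0.20, 0.15, 0.80, 0.30],
    "topic_2": [0.15, 0.18, 0.60, 0.15, 0.40],
})
topic_cols = ["topic_0", "topic_1", "topic_2"]

# Assign each post the topic with the highest LDA score
posts["topic"] = posts[topic_cols].idxmax(axis=1)

# Count posts per (cluster, topic), then normalise each row to get the
# propensity of a group to post a given topic
counts = pd.crosstab(posts["cluster"], posts["topic"])
ratios = counts.div(counts.sum(axis=1), axis=0)
print(ratios)
```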
With these two models, I can now decide which group a user belongs to and which topics they typically like to post about.
The last piece I have looked at is a recency measure for each post. The methodology is to take a scalar (1,000,000 in this case) and divide it by the number of seconds between the passed date value and the post datetime. The result is then cubed to stretch the values out. This penalises old posts in the ranking system and heavily rewards very recent posts. With a larger dataset, these parameters could be tweaked to get the desired balance of topic relevance and recency, though that requires a source of behavioural data that can provide testable answers to hypotheses.
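In code, the recency measure amounts to the following; the function name is mine, while the scalar and the cubing follow the description above.

```python
from datetime import datetime

def recency_score(post_time, ref_time, scalar=1_000_000):
    # Divide the scalar by the post's age in seconds, then cube the
    # result to stretch the values out (no guard for zero-age posts here)
    age_seconds = (ref_time - post_time).total_seconds()
    return (scalar / age_seconds) ** 3

# A day-old post scores ~1550; a ten-day-old post scores ~1.55
print(recency_score(datetime(2022, 2, 9), datetime(2022, 2, 10)))
print(recency_score(datetime(2022, 1, 31), datetime(2022, 2, 10)))
```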
The final algorithm groups the posts into their conversations (by parent_id) and calculates metrics to rank them.
The first metric is the mean of the LDA score ratios explained above: for a conversation with three posts, the mean of their ratios is taken; for a single-post conversation, the mean is simply that post's ratio. This score provides a relevance measure for the user. The second is the sum of the LDA ratios, which aims to capture the relevance of the conversation overall: if a conversation contains highly relevant posts, summing them ranks the conversation higher. To get around the problem of many irrelevant posts summing to a high score, the mean and the sum are multiplied together. This product is then multiplied by the recency score to give a final value, which is ranked to sort the posts.
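Putting those pieces together, the ranking reduces to something like the sketch below. It assumes one row per post with the user-specific LDA ratio and a recency score already attached; taking the conversation's recency from its most recent post is my assumption rather than a stated detail.

```python
import pandas as pd

def rank_conversations(posts: pd.DataFrame) -> pd.DataFrame:
    # Expects columns: parent_id, ratio (LDA score ratio for the
    # requesting user's cluster) and recency (per-post recency score)
    grouped = posts.groupby("parent_id").agg(
        mean_ratio=("ratio", "mean"),
        sum_ratio=("ratio", "sum"),
        recency=("recency", "max"),  # assumed: newest post's recency
    )
    # mean * sum balances per-post relevance against conversation volume;
    # multiplying by recency then favours fresh conversations
    grouped["score"] = grouped["mean_ratio"] * grouped["sum_ratio"] * grouped["recency"]
    return grouped.sort_values("score", ascending=False)
```

Multiplying the mean by the sum means a conversation padded with irrelevant posts gains little from the sum alone, since those same posts drag the mean down.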
This algorithm can be changed as needed. If data becomes available on how users interact with these posts, the algorithm can be optimised against it.
I have then put all of this together in a final function, which is called from an API by providing a UID and a date value. The sorted posts data is returned.
All of the application dependencies are in the requirements.txt file. After pulling this repository down, you can get started with the following:
pip install -r requirements.txt
Once the required packages are installed, you must then download the spaCy model used for the LDA preprocessing:
python -m spacy download en_core_web_sm
I have created this in Python 3.9 and recommend this version for consistency.
There are several scripts that can be run to either explore the creation of the models I have mentioned above or to run the finished API locally.
To run the API locally, simply run the script personalisation_api.py, for example:
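python personalisation_api.py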
Open a browser and make the API GET call in the format below.
Please note that the date must be in the format 'YYYY-MM-DD', or a datetime in the format 'YYYY-MM-DD HH:MM:SS'. Please also make sure to encode the space character in the datetime as '%20'. I have not set the current date as a default, so a date must be entered for each call.
http://127.0.0.1:5000/uid/date
EXAMPLE:
http://127.0.0.1:5000/eb2ec654-dbdd-4437-9eda-3b03b1ea13e3/2022-02-10
Or alternatively, make an API call with parameters.
http://127.0.0.1:5000/?uid=xxx&date=xxx
EXAMPLE:
http://127.0.0.1:5000/?uid=eb2ec654-dbdd-4437-9eda-3b03b1ea13e3&date=2022-02-10
This will return the fully sorted and ranked list of posts for any given user at a specific date/time.
To run the unit tests, run personalisation_api.py
Within these tests, the functions that make up the full calculation of the sorted list are tested in their order of usage.
Most tests compare their outputs against known outputs that have been saved as CSV files in /data/test_cases/