This project is a web crawler for the website ikman.lk. It is built using AWS Lambda functions and Python.
To deploy this to AWS, the following should be done.
- Create 2 GitHub Actions secrets named `PROD_AWS_ACCESS_KEY_ID` & `PROD_AWS_SECRET_ACCESS_KEY`.
- Create a Google Cloud Service Account and store the credentials as an AWS Secret named `gsheet_client_secret`. The same secret ARN is mentioned as `arn:aws:secretsmanager:us-east-1:310340543340:secret:gsheet_client_secret-eq2CuG` in the serverless.yml file.
- Add and verify either the same email or 2 emails in Amazon SES. In my case, I use one Gmail for sending and my personal email for receiving.
- Create the following AWS Systems Manager Parameters.
Parameter Name | Type | Explanation |
---|---|---|
ikman_crawler_from_email | String | Here I used a test Gmail I created. This has to be verified in AWS SES. |
ikman_crawler_to_emails | StringList | Here I use my personal Gmail. It is not required to verify this. |
ikman_crawler_google_sheet | String | Create a copy of this Google sheet, and use the copy. Make sure the Google Cloud Service Account mentioned in #2 is an editor of the sheet. |
These Lambdas will be updating the copy of the Google Sheet created.
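
As a rough illustration of how a Lambda might read the three parameters above, here is a minimal sketch. The function name `load_settings` and the returned dictionary keys are hypothetical and not taken from the project code. The client is passed in as an argument, so the sketch assumes an object with the same `get_parameter(Name=...)` call shape as the boto3 SSM client.

```python
def load_settings(ssm_client):
    """Read the three crawler parameters into a plain dict (hypothetical helper)."""

    def value(name):
        # boto3's SSM get_parameter returns {"Parameter": {"Value": ...}}.
        response = ssm_client.get_parameter(Name=name)
        return response["Parameter"]["Value"]

    # A StringList parameter comes back as one comma-separated string.
    raw_recipients = value("ikman_crawler_to_emails")
    recipients = [e.strip() for e in raw_recipients.split(",") if e.strip()]

    return {
        "from_email": value("ikman_crawler_from_email"),
        "to_emails": recipients,
        "google_sheet": value("ikman_crawler_google_sheet"),
    }
```

Inside AWS this would presumably be called as `load_settings(boto3.client("ssm"))`. Passing the client in also makes it easy to substitute a fake one when running locally.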
The project consists of the following AWS Lambda functions:
- `house_crawler`: This is scheduled to run once a day. It goes through multiple Ikman.lk ad list URLs listed with the `Source` key in this tab. Here is an example URL which has house-for-sale ads in the Nawala area with 2+ bathrooms, 3+ bedrooms and a price between 15 and 80 million. This lambda saves all those ads in the `Sale` tab of the Google Sheet.
- `description_processor`: This is scheduled to run 15 mins after the 1st one. It goes through individual ads and copies the ad description text to the `Description` tab.
- `duplicate_processor`: This is scheduled to run 15 mins after the 2nd one. In the `Sale` tab, there is a column named `Status`, a dropdown from which you can manually select `Ignore` or `Consider`, depending on whether you want to ignore that ad or revisit it later. Depending on whether you previously selected `Ignore`, `Consider` or nothing (EMPTY, meaning no action taken yet) for a given ad, the following logic is executed.
The following prompt was used to get this code written by ChatGPT. The same prompt explains the logic as well.
I want you to iterate through all the data rows of New sheet.
Assume A is the current record of the iteration.
For the A row, do a 1:1 mapping between New and Description sheet using the URL column and find the Description column value for the A row.
Then do a text diff comparison between A.Description and all the values in the Description column in the Description sheet and find all matching rows from the Description sheet.
When doing this matching makesure line matching is 90% or more.
When doing this matching, ignore the row where Description sheet.URL = A.URL.
Now we have a list of rows from Description sheet which has similar Description value compared to A row.
From this filtered list of Description sheet rows extract their URL column values. I assume all those values = C.
Do a 1:1 mapping between Description and New sheet again and find Status, Total, Notes column values for C, from the New sheet. Here I assume all those matching rows from New sheet = B.
If A's Status column is empty Then execute below.
If all B rows, has empty Status
Step X - find the minimum Total amount B rows. If there are multiple rows with minimum Total amount, just consider the 1st one.
If the row found in X's Total is less than to A row's Total value Then
1. Update A's status = "Ignore", A's Notes = the URL value of the row found in Step X
If all B rows' Total value is greater than or equal to the A row's Total value Then
2. Update B rows Status="Ignore" and Notes=A.URL
Else If one or more of the B rows has Status="Consider", Then
Step Y - find the minimum Total amount from B with "Consider" status. If there are multiple rows with minimum Total amount, just consider the 1st one.
If A's Total is less than the minimum Total amount found in Y
3. Update A's status = "Consider"
Else
4. Update A's status = "Ignore", A's Notes = the URL value of the row found in Step Y
Else If one or more of the B rows has Status="Ignore", then
Step Z - find the minimum Total amount from B with "Ignore" status. If there are multiple rows with minimum Total amount, just consider the 1st one.
If A's Total greater than or equals to the minimum Total amount found from Z then
5. Update A's status = "Ignore", A's Notes = the URL value of the row found in Step Z
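
The matching step in the prompt (compare A's description against every other description and keep those with 90% or more line matching) could be sketched as below. This is only one possible reading: it assumes "line matching" means the similarity ratio of the two descriptions compared line by line with Python's `difflib.SequenceMatcher`. The names `line_similarity` and `find_similar_urls`, and the `descriptions` dictionary mapping URL to description text, are hypothetical.

```python
from difflib import SequenceMatcher


def line_similarity(text_a, text_b):
    """Return a 0.0-1.0 similarity ratio computed over non-empty lines."""
    lines_a = [line.strip() for line in text_a.splitlines() if line.strip()]
    lines_b = [line.strip() for line in text_b.splitlines() if line.strip()]
    if not lines_a or not lines_b:
        # Treat an empty description as matching nothing.
        return 0.0
    return SequenceMatcher(None, lines_a, lines_b).ratio()


def find_similar_urls(url_a, descriptions, threshold=0.9):
    """Return URLs (the list C in the prompt) whose description resembles A's.

    descriptions: dict mapping ad URL -> description text (assumed shape).
    The row where URL == A.URL is skipped, as the prompt requires.
    """
    text_a = descriptions[url_a]
    return [
        url
        for url, text in descriptions.items()
        if url != url_a and line_similarity(text_a, text) >= threshold
    ]
```

With this reading, two descriptions that share 9 of their 10 lines sit right at the threshold, while ads that reuse only a few lines of boilerplate fall well below it.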
As per the above logic there are 5 outcomes. Here are the explanations.
1. `Update A's status = "Ignore", A's Notes = the URL value of the row found in Step X` - All these matching ads were posted today. I haven't reviewed them before, which is why all of them have an empty status. In this case A's Total is higher than the cheapest of the others, so I am going to `Ignore` A. I might still consider the same house from a different ad, but for now A is expensive.
2. `Update B rows Status="Ignore" and Notes=A.URL` - All these matching ads were posted today. I haven't reviewed them before, which is why all of them have an empty status. In this case A's Total is lower than or equal to the others, so I am going to `Ignore` the others and keep A empty, because I am going to check ads with an empty status. A is not in `Consider` status yet, because I make that change manually if I like the house.
3. `Update A's status = "Consider"` - This means the same house has been marked as `Consider` before. This time, the new ad's Total is lower than the previous one, so this is definitely an ad to consider.
4. `Update A's status = "Ignore", A's Notes = the URL value of the row found in Step Y` - This means the same house has been marked as `Consider` before. This time the ad is the same amount or more expensive, so I am not considering this ad.
5. `Update A's status = "Ignore", A's Notes = the URL value of the row found in Step Z` - This means the same house has been marked as `Ignore` before. This time the ad is the same amount or more expensive, so I am not considering this ad.
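
The five outcomes can be summarised in a small decision function. This is a minimal sketch of the logic as described in the prompt, not the code ChatGPT actually produced. It assumes each row is a dictionary with `URL`, `Status`, `Total` and `Notes` keys, that an empty status is the empty string, and that `Total` is already numeric. The function name `decide` is hypothetical.

```python
def decide(a, b_rows):
    """Apply the duplicate logic to row A and its matching rows B.

    Mutates the rows in place and returns the outcome number (1-5),
    or None when nothing is changed.
    """
    if a["Status"] or not b_rows:
        return None  # A was already reviewed, or it has no duplicates.

    statuses = [b["Status"] for b in b_rows]

    if all(status == "" for status in statuses):
        # Step X: min() returns the first row when several share the minimum.
        cheapest = min(b_rows, key=lambda b: b["Total"])
        if cheapest["Total"] < a["Total"]:
            a["Status"], a["Notes"] = "Ignore", cheapest["URL"]
            return 1
        for b in b_rows:
            b["Status"], b["Notes"] = "Ignore", a["URL"]
        return 2

    if "Consider" in statuses:
        # Step Y: cheapest duplicate that was already marked Consider.
        considered = [b for b in b_rows if b["Status"] == "Consider"]
        cheapest = min(considered, key=lambda b: b["Total"])
        if a["Total"] < cheapest["Total"]:
            a["Status"] = "Consider"
            return 3
        a["Status"], a["Notes"] = "Ignore", cheapest["URL"]
        return 4

    if "Ignore" in statuses:
        # Step Z: cheapest duplicate that was already marked Ignore.
        ignored = [b for b in b_rows if b["Status"] == "Ignore"]
        cheapest = min(ignored, key=lambda b: b["Total"])
        if a["Total"] >= cheapest["Total"]:
            a["Status"], a["Notes"] = "Ignore", cheapest["URL"]
            return 5

    return None
```

Note that when the duplicates were previously ignored and A is cheaper than all of them, nothing changes: A stays empty so that it still shows up for manual review.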
To run the project, you need to deploy the AWS Lambda functions using the Serverless Framework. The `serverless.yml` file contains the configuration for the deployment. To run a lambda locally instead, make sure you have the AWS `config` and `credentials` files in this location: `C:\Users\<USERNAME>\.aws`. Then you can simply uncomment the following line and run each lambda locally.
handler({}, {})
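
As an alternative to commenting and uncommenting that call, the local run could be guarded so that it only fires when the file is executed directly. The sketch below uses a hypothetical stand-in `handler`; the real handlers in this project do the crawling and sheet updates.

```python
def handler(event, context):
    """Hypothetical stand-in for one of the Lambda entry points."""
    print("Local run started with event:", event)
    return {"status": "ok"}


if __name__ == "__main__":
    # Runs only with `python <file>.py`. AWS Lambda imports the module
    # instead of executing it as a script, so this call is skipped there.
    handler({}, {})
```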
If you wish to contribute to this project, please fork the repository and submit a pull request.