The aim is to detect phishing URLs and to identify the best-performing machine learning algorithm by comparing accuracy, false positive rate, and false negative rate.
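The three comparison metrics can be computed from a model's predictions as in the minimal sketch below. The label string `"phishing"` and the function names `confusion_counts` and `rates` are assumptions made for illustration, not names taken from the project.

```python
def confusion_counts(y_true, y_pred, positive="phishing"):
    """Count TP, FP, TN, FN, treating `positive` as the phishing class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1  # phishing URL correctly flagged
            else:
                fp += 1  # legitimate URL wrongly flagged
        else:
            if truth == positive:
                fn += 1  # phishing URL missed
            else:
                tn += 1  # legitimate URL correctly passed
    return tp, fp, tn, fn


def rates(y_true, y_pred, positive="phishing"):
    """Return accuracy, false positive rate and false negative rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

A lower false negative rate means fewer phishing URLs slip through, while a lower false positive rate means fewer legitimate sites are blocked, so the two are reported separately rather than folded into accuracy alone.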
Phishing has become a major concern for security researchers because it is easy to create a fake website that closely resembles a legitimate one. The attacker's main aim is to steal bank account credentials. Phishing attacks succeed largely because of a lack of user awareness. The common method of detecting phishing websites is to add blacklisted URLs and Internet Protocol (IP) addresses to the antivirus database, which is known as the "blacklist" method. To overcome the drawbacks of blacklist and heuristics-based methods, many security researchers now focus on machine learning techniques. With this approach, an algorithm analyzes various blacklisted and legitimate URLs and their features to accurately detect phishing websites, including zero-hour phishing websites.
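To give a sense of what "features" of a URL might mean, here is a minimal sketch that derives a few lexical features using only the standard library. The specific features (length, dot count, IP-address host, `@` symbol, hyphens, HTTPS) are illustrative assumptions, not the feature set used in this project, and `url_features` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse

# Matches a dotted-quad host such as 192.168.0.1 (no range checking).
IPV4_RE = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def url_features(url):
    """Return a dict of simple lexical features for one URL string."""
    # urlparse only fills in the host when a scheme is present.
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "num_dots": host.count("."),
        "has_ip_host": bool(IPV4_RE.match(host)),
        "has_at_symbol": "@" in url,
        "num_hyphens": host.count("-"),
        "uses_https": parsed.scheme == "https",
    }


print(url_features("http://192.168.0.1/login"))
```

Because these features come from the URL string itself rather than from a list of known bad sites, a model trained on them can in principle score a URL it has never seen before, which is what makes zero-hour detection possible.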
For our study, a large number of phishing pages were necessary. We concatenated three datasets from Kaggle into a single dataset. The data collected this way is freely available, and the number of reported phishing sites is very large.
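The concatenation step might look roughly like the sketch below. It assumes each source is a CSV with `url` and `label` columns and that duplicate URLs should be dropped. The column names, the deduplication, and the `merge_url_datasets` helper are all assumptions for illustration, and the in-memory strings stand in for the real Kaggle files.

```python
import csv
import io


def merge_url_datasets(files):
    """Concatenate rows from several CSV files, keeping the first copy of each URL."""
    merged = []
    seen = set()
    for f in files:
        for row in csv.DictReader(f):
            url = row["url"].strip()
            if url in seen:
                continue  # the same URL can appear in more than one source
            seen.add(url)
            merged.append({"url": url, "label": row["label"].strip()})
    return merged


# In-memory stand-ins for three downloaded CSV files (made-up rows).
a = io.StringIO("url,label\nexample.com,good\nbad.example/login,bad\n")
b = io.StringIO("url,label\nbad.example/login,bad\nshop.example,good\n")
c = io.StringIO("url,label\nphish.example/verify,bad\n")

merged = merge_url_datasets([a, b, c])
print(len(merged))  # 4 unique URLs out of 5 input rows
```

With real files, the `io.StringIO` objects would be replaced by handles from `open(path, newline="")`.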
The type of model depends on the data type (qualitative or quantitative) of the target variable (commonly referred to as the Y variable). Since our target is qualitative (a URL is either phishing or legitimate), we build a classification model. We use logistic regression, MultinomialNB, and Random Forest.
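A minimal sketch of training the three classifiers with scikit-learn follows. It assumes URLs are turned into token counts with `CountVectorizer`. The tokenization, the hyperparameters, the `"good"`/`"bad"` labels, and the toy URLs are assumptions for illustration and may differ from the project's actual pipeline.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data standing in for the merged dataset (made-up URLs and labels).
urls = [
    "www.example.com/about",
    "docs.example.org/guide/install",
    "shop.example.net/cart",
    "secure-login.example-bank.verify.example/update",
    "account-verify.example/login/confirm",
    "free-prize.example/claim/login",
]
labels = ["good", "good", "good", "bad", "bad", "bad"]

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "multinomial_nb": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}


def fit_all(urls, labels):
    """Fit one vectorizer + classifier pipeline per model."""
    fitted = {}
    for name, clf in models.items():
        # Split each URL into alphanumeric tokens and count them.
        pipe = make_pipeline(CountVectorizer(token_pattern=r"[A-Za-z0-9]+"), clf)
        pipe.fit(urls, labels)
        fitted[name] = pipe
    return fitted


fitted = fit_all(urls, labels)
new_url = ["account-update.example/login"]
predictions = {name: pipe.predict(new_url)[0] for name, pipe in fitted.items()}
print(predictions)
```

On six toy rows the predictions carry no real meaning. For a fair comparison, the merged dataset would be split into training and test sets (for example with `sklearn.model_selection.train_test_split`) and each model scored on the held-out part.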
The front end output looks like this:
For a genuine site:
For a phishing site: