Classifying Phishing URLs Using Recurrent Neural Networks

In a recent research paper, we showed that we can detect, with a high level of accuracy, whether a website is a phish just by looking at its URL. This post lays out in greater detail how, using a deep recurrent neural network, we accurately classify more than 98 percent of URLs.

Phishing attacks are a growing problem worldwide. According to the Anti-Phishing Working Group (APWG), the number of phishing websites increased by 250 percent between the last quarter of 2015 and the first quarter of 2016, targeting more than 400 brands each month, the most APWG has seen since it began tracking and reporting on phishing in 2004. Phishing relies on social engineering schemes to steal consumers' personal and financial information: the attacks trick consumers into handing over sensitive data, such as usernames and passwords, to fake websites purporting to be legitimate businesses.

Phishing attacks can be launched from anywhere in the world by people with little to no technical skills at insignificant costs. Organizations trying to protect their users from these attacks have a hard time dealing with the massive amount of emerging sites, which must be identified and labeled as either malicious or harmless before users access them.

There are many methods used by phishers to distribute their attacks, and various strategies are used to make the site resemble a legitimate site while evading detection techniques. However, all of them rely on URLs to direct their victims to the phishing website. Resemblance to a legitimate site’s URL is the most common social engineering method used by phishers to lure victims to their websites. Therefore, the use of URLs to filter potential phishing websites from legitimate websites is a good first step when it comes to blocking phishing.

Being able to determine the maliciousness of a website just by looking at its URL provides a major strategic advantage: the number of victims can be reduced to nearly zero while simultaneously minimizing operational efforts by avoiding the massive use of more complex methods such as content analysis. In this work, we focus on classifying phishing sites using only the URL with machine-learning techniques. In particular, we compare the performance of a combination of lexical and statistical analysis of URLs as input to a random forest (RF) classifier against a novel approach using recurrent neural networks, specifically a Long Short Term Memory network (LSTM).

Phishing Classification Using Lexical and Statistical Frequencies of URLs

Machine-learning classifiers with manually created features, such as RF, have been widely used for classification problems. Moreover, this method has been successfully applied to identifying phishing URLs. Last year we published a blog post on the use of this strategy to detect phishing sites (http://blog.easysol.net/the-technical-side-of-phishing/).

An attacker’s objective when crafting a URL for a phishing site is to confuse users into thinking it is a legitimate website. In attempting this, the cybercriminal hopes to trick users into giving their personal data and financial information. In order to do that, the attackers tend to follow some patterns which can be uncovered by an experienced analyst. In collaboration with security analysts at Easy Solutions, Inc., we identified a set of features that can be used to create a lexical and statistical analysis of the URLs.
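As an illustration, a handful of such lexical and statistical features can be computed directly from the URL string. The features below are hypothetical examples of the kind an analyst might propose, not the exact set identified in our work:

```python
import re
from urllib.parse import urlparse

def lexical_features(url):
    """Compute a few illustrative lexical/statistical URL features.

    These are hypothetical examples of analyst-proposed features,
    not the proprietary feature set used in the paper.
    """
    parsed = urlparse(url if "://" in url else "http://" + url)
    host = parsed.netloc
    return {
        "url_length": len(url),                              # phishing URLs tend to be long
        "host_length": len(host),
        "num_dots": url.count("."),                          # many subdomains is suspicious
        "num_digits": sum(c.isdigit() for c in url),
        "num_special": len(re.findall(r"[@\-_%?=&]", url)),  # obfuscation characters
        "has_ip_host": bool(re.fullmatch(r"[\d.]+(:\d+)?", host)),
        "num_subdirs": parsed.path.count("/"),
    }

# A typical deceptive URL: a trusted brand buried in the subdomain.
feats = lexical_features("http://paypal.com.secure-login.example.net/verify?id=1")
```

Each URL is thus reduced to a fixed-length numeric vector that a standard classifier can consume.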

Once the features are extracted, a binary classifier is trained on them. To achieve this, we use an RF, a classification algorithm that builds a stronger model by averaging the responses of many weaker models. The weaker models are classification trees: each tree recursively splits the data set based on feature values and stops splitting when all the input instances in a node belong to the same class. We chose RFs because they are widely used and can be trained and run in parallel.
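As a sketch of this step, the snippet below trains scikit-learn's RandomForestClassifier on a toy feature matrix standing in for the extracted URL features; the data and labeling rule are synthetic, for illustration only. The `n_jobs=-1` argument fits the trees in parallel:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy stand-in for extracted URL features: 200 rows of 3 numeric
# features (e.g. url_length, num_dots, num_digits). Labels 1 = phishing.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic rule, illustration only

# 100 classification trees, fitted in parallel across CPU cores.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

In the real pipeline, `X` would be the matrix of lexical and statistical features computed for each URL in the training set.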

Modeling Phishing URLs with Deep Recurrent Neural Networks

In the previous section, we designed a set of features extracted from a URL and fed them to a classification model to predict whether a URL is a case of phishing. We now approach the problem in a different way: instead of manually extracting the features, we directly learn a representation from the URL’s sequence of characters.

Each sequence of characters exhibits correlations, meaning nearby characters in a URL are likely to be related to each other. These sequential patterns are important because they can be exploited to improve the performance of the predictors.

A neural network is a bio-inspired, machine-learning model that consists of a set of artificial neurons with connections between them. Recurrent Neural Networks (RNN) are a type of neural network able to model sequential patterns. The distinctive characteristic of RNNs is that they introduce the notion of time to the model. This allows them to process sequential data one element at a time and to learn their sequential dependencies.

One limitation of standard RNNs is that they cannot learn correlations between elements separated by more than five or ten time steps. A model that overcomes this problem is the LSTM: it can bridge elements separated by more than 1,000 time steps without losing its ability to model short time lags.

An LSTM is an adaptation of an RNN in which each neuron is replaced by a memory cell that, in addition to a conventional neuron representing an internal state, uses multiplicative units as gates to control the flow of information. In particular, a typical LSTM cell has an input gate that controls the input of information from outside, a forget gate that controls whether to keep or discard the information in the internal state, and an output gate that allows or prevents the internal state from being seen from the outside.
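The gate mechanics described above can be sketched as a single LSTM time step in NumPy. The weight shapes and gate ordering below follow one common convention and are an illustration, not the exact parameterization of our model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step for a single example.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,).
    Gate order in the stacked matrices: input, forget, output, candidate.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate: how much new information enters
    f = sigmoid(z[H:2 * H])      # forget gate: how much of the old state to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much state is exposed outside
    g = np.tanh(z[3 * H:4 * H])  # candidate values for the internal state
    c = f * c_prev + i * g       # new internal (cell) state
    h = o * np.tanh(c)           # new hidden state seen by the next layer
    return h, c

rng = np.random.default_rng(0)
D, H = 128, 64  # D matches the 128-dimensional embedding; H is illustrative
W = 0.1 * rng.normal(size=(4 * H, D))
U = 0.1 * rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, U, b)
```

Unrolling this step over the characters of a URL is what lets the cell carry information across long stretches of the sequence.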

In this work, we use LSTM units to build a model that receives as input a URL as a character sequence and predicts whether the URL corresponds to a case of phishing. The architecture is illustrated in the following figure:

Each input character is translated into a 128-dimensional embedding. The translated URL is fed to an LSTM layer as a 150-step sequence. Finally, the classification is performed by a single sigmoid output neuron. The network is trained by backpropagation using a cross-entropy loss function, with dropout applied to the last layer.
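Before the embedding layer can translate characters, each URL must be encoded as a fixed-length sequence of character indices. A minimal sketch, assuming a printable-ASCII vocabulary (the actual vocabulary is an implementation detail not specified above); the 150-step length comes from the architecture description:

```python
import string

MAX_LEN = 150  # sequence length expected by the LSTM layer
chars = string.printable
char_to_idx = {ch: i + 1 for i, ch in enumerate(chars)}  # index 0 reserved for padding

def encode_url(url, max_len=MAX_LEN):
    """Map a URL to a fixed-length list of character indices,
    truncating long URLs and right-padding short ones with 0."""
    ids = [char_to_idx.get(ch, 0) for ch in url[:max_len]]
    return ids + [0] * (max_len - len(ids))

seq = encode_url("http://example.com/login")
```

The embedding layer then maps each of these integer indices to a learned 128-dimensional vector.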

Results

To train both models, a dataset of real and phishing URLs was constructed. In total, 2 million URLs were used in the training–half of them legitimate, and half of them phishing. The legitimate URLs came from the Common Crawl, a corpus of web crawl data. The phishing URLs came from Phishtank, a website used to submit phishing URLs.

The following table shows a sample of 10 legitimate URLs and 10 phishing URLs.

Note how similar the legitimate and malicious URLs can be. This is expected, as the attacker's objective is to confuse the web user by making the phishing site look as legitimate as possible.

In order to evaluate the performance of the models, we used a 3-fold cross-validation strategy. This consists of splitting the data into three folds, training the model on two of them, and validating it on the remaining one. The process is repeated three times, using each fold exactly once for validation; in the end, the performance metrics on the validation folds are averaged. This reduces variance and gives a better estimate of the model's performance.
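The fold bookkeeping can be sketched in a few lines of plain Python; any classifier with fit/evaluate methods could be plugged in where noted:

```python
def three_fold_indices(n, k=3):
    """Yield (train, validation) index lists for k-fold cross-validation
    over n examples, using each fold exactly once for validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for held_out in range(k):
        val = folds[held_out]
        train = [i for j in range(k) if j != held_out for i in folds[j]]
        yield train, val

accuracies = []
for train_idx, val_idx in three_fold_indices(9):
    # Train the model on train_idx and evaluate it on val_idx here.
    accuracies.append(len(val_idx))  # placeholder for a real accuracy metric
```

In practice one would shuffle the indices first and average the per-fold metrics at the end, as described above.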

We compared the accuracy of the two methods and found that the difference between them is consistent across folds. On average, the LSTM network's accuracy is about 5 percentage points higher than that of the feature-engineered RF model.

We also compared the accuracy of the models using different numbers of URLs for training. For this experiment we randomly selected 200,000 URLs to test the different models. We then drew sample sizes of 1,000, 5,000, 10,000, 50,000, 100,000, 500,000 and 1,000,000 URLs and trained the two methods on each sample. The results are summarized in the following figure.

The LSTM model consistently outperforms the RF model. Further, the LSTM network improves its performance faster than the RF as the number of URLs used for training increases.

Lastly, we compared the training and evaluation times of both models. As the table below shows, the LSTM model requires significantly more time to train: on average, using 2 million URLs, the RF is trained in less than 3 minutes, while the LSTM requires 238 minutes. It is also interesting that, once the models have been trained, the RF evaluates 942 URLs per second, compared to 281 URLs per second for the LSTM. However, the memory requirements of the RF model are almost 500 times those of the LSTM, due to the cost of storing the model's parameters.

Conclusion

We have explored how well we can discern phishing URLs from legitimate URLs using two methodologies: feature engineering using a lexical and statistical analysis of URLs with a random forest (RF) classifier, and a novel approach using a Long Short Term Memory neural network (LSTM). Both models showed great statistical results–the RF had an accuracy of 93.5 percent, and the LSTM had an accuracy of 98.7 percent.

With these results, one can conclude that URL patterns are a good predictor of phishing websites. The results indicate that, instead of doing a full content analysis, building a proactive phishing detection system on URLs alone is feasible. Such a system also responds faster, since no full content analysis has to be performed: the RF and the LSTM are able to evaluate 942 and 281 URLs per second, respectively. However, there is a significant difference in the memory requirements of the models: the RF uses 288.7 MB of memory versus the LSTM's 581 KB. This matters for memory-restricted applications, such as mobile apps, where the RF model is impractical and the LSTM should be used.

Article reposted with permission from Easy Solutions. Check the original piece.

