A²I : AI in Cybersecurity

  • August 06, 2018

Phishing emails

The rapid growth in the number of Internet users creates opportunities for attackers to intrude on our privacy and security. Phishing is one such threat and has become a major issue in recent times: it targets specific groups of people directly, asking for their credentials and other personal or sensitive information.


“Email is the cockroach of the Internet.” - Stewart Butterfield, co-founder, Slack.


Of all forms of data, email seems to attract the most interest from phishers. Although other communication media now produce more data, email has remained popular ever since Ray Tomlinson first developed it in 1971. This year, the number of email users worldwide rose to 3.8 billion according to The Radicati Group, Inc., and is expected to reach 4.1 billion by 2021. The number of business and consumer emails sent and received per day reached 279 billion in 2018 and is set to grow by 4.4% annually, reaching 319.6 billion by the end of 2021. In other words, almost half of the world's population uses email as a mode of communication these days.

Because email is so popular and easy to use, many people also use it for inappropriate activities, sending illegitimate or spam emails. Spam is used to deliver all kinds of malicious attacks. The most frequently used type of malware attack delivered through spam is the blended attack, which uses more than one method to deliver malware to an internal network. Blended attacks often start with illegitimate emails that may not contain malware themselves but provide links to compromised websites. Attackers usually make such an email look legitimate to an ordinary user by mixing authentic links with false links whose URLs point to a fake website. According to a survey by IBM's X-Force research team, more than half of the emails produced worldwide are spam. Spam accounted for 55.9% of email in the first quarter of 2017, and this share is expected to increase gradually in the coming years.

Spam may also include phishing emails, which often lead to the leakage of sensitive information. As reported by APWG, the number of phishing emails increased from 68,270 in 2014 to 106,421 in 2015. According to a Gartner report, 109 million users have received phishing emails. Phishing can be delivered in several ways, for example by attaching files with malicious content or by sending a link to a compromised website.

At the end of 2017, the anti-phishing system prevented nearly 60 million attempts to visit phishing pages, which shows the potential value of such systems. Every year, the number of unique Internet users hit by phishing attacks increases worldwide. This underlines the need for effective anti-phishing methods and motivates the research community to integrate Artificial Intelligence (AI) methods with cybersecurity modules.

AI for anti-phishing

Now that the need for anti-phishing is clear, let's look at how AI could help solve the problem. Consider phishing in emails (the most frequently attacked medium) as a text classification problem in which the emails are the documents and the target classes are phishing and legitimate. Any text classification application has three base components: representation (expressing text as numeric values), feature extraction (finding the words that are informative with respect to the target classes) and a classifier (mapping features to target classes). Of these, representation is the most complex and is the core of the module, because it expresses the context of the text in numbers. The quality of the representation determines how effectively the classification model can make its final predictions.
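The three components described above can be sketched in a few lines of Python. This is only an illustrative toy, not the system discussed in this article: the six-email corpus, the function names, the count-difference feature score and the naive Bayes classifier are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter

# Hypothetical toy corpus; a real system trains on thousands of labelled emails.
TRAIN = [
    ("verify your account password now", "phishing"),
    ("urgent update your bank password", "phishing"),
    ("click here to confirm your account", "phishing"),
    ("meeting agenda for monday attached", "legitimate"),
    ("lunch on friday with the team", "legitimate"),
    ("project report attached for review", "legitimate"),
]
LABELS = ("phishing", "legitimate")


def represent(text):
    """Representation: turn text into term counts (a bag of words)."""
    return Counter(text.lower().split())


def select_features(docs, top_k=8):
    """Feature extraction: keep the terms whose counts differ most between classes."""
    per_class = {label: Counter() for label in LABELS}
    for text, label in docs:
        per_class[label].update(represent(text))
    vocab = set(per_class["phishing"]) | set(per_class["legitimate"])

    def score(term):
        return abs(per_class["phishing"][term] - per_class["legitimate"][term])

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-score(t), t))[:top_k]


def train(docs, features):
    """Classifier (training): count selected terms per class for naive Bayes."""
    priors = Counter()
    counts = {label: Counter() for label in LABELS}
    for text, label in docs:
        priors[label] += 1
        bow = represent(text)
        for term in features:
            counts[label][term] += bow[term]
    return priors, counts


def classify(text, features, priors, counts):
    """Classifier (prediction): multinomial naive Bayes with add-one smoothing."""
    bow = represent(text)
    total_docs = sum(priors.values())
    scores = {}
    for label in LABELS:
        total_terms = sum(counts[label].values())
        log_prob = math.log(priors[label] / total_docs)
        for term in features:
            if bow[term]:
                likelihood = (counts[label][term] + 1) / (total_terms + len(features))
                log_prob += bow[term] * math.log(likelihood)
        scores[label] = log_prob
    return max(scores, key=scores.get)


features = select_features(TRAIN)
priors, counts = train(TRAIN, features)
print(classify("confirm your password", features, priors, counts))
print(classify("agenda attached for the meeting", features, priors, counts))
```

Note that this toy representation is purely symbolic: a word the classifier has never seen contributes nothing, which is exactly the weakness that context-aware representations try to address.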

The content of phishing sites is semantically very similar to the content of the original sites. It is therefore essential to represent the context of the text rather than treating the text as a set of symbols.

Classical representation methods based on the Vector Space Model (VSM) were not up to the mark. Unlike images and speech, text is represented numerically in these classical methods by treating each term (a word or phrase) as a symbol. The Vector Space Models of Semantics (VSMs), or distributional representation methods, are able to include context only to some extent. Both kinds of model require heavy computation because of the well-known “curse of dimensionality”. As a result, they cannot be run on the huge corpora that are needed for an effective representation.
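To make the dimensionality problem concrete, here is a minimal sketch of a symbolic bag-of-words representation. The three documents and the helper names are hypothetical; the point is only that every distinct term adds one dimension and that most entries of each vector are zero.

```python
def build_vocab(docs):
    """One dimension per distinct term in the corpus."""
    return sorted({word for doc in docs for word in doc.lower().split()})


def bow_vector(doc, vocab):
    """Count how often each vocabulary term occurs in the document."""
    words = doc.lower().split()
    return [words.count(term) for term in vocab]


# Hypothetical mini corpus; real email corpora have vocabularies of 100,000+ terms.
docs = ["verify your account", "confirm your login", "team lunch on friday"]
vocab = build_vocab(docs)
vectors = [bow_vector(doc, vocab) for doc in docs]

dimension = len(vocab)
zero_entries = sum(vector.count(0) for vector in vectors)
sparsity = zero_entries / (dimension * len(docs))

# Dot product between the first two documents: only the shared symbol "your" counts,
# even though "verify your account" and "confirm your login" mean similar things.
overlap = sum(a * b for a, b in zip(vectors[0], vectors[1]))

print(dimension, round(sparsity, 2), overlap)
```

Even with three short documents the vectors already have nine dimensions, and the dimension keeps growing with every new term in the corpus.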

Finally, distributed representation methods were introduced. They express the context of the text as a low-dimensional dense vector and let you choose the dimension of that vector. For these reasons, distributed representation methods are the ones considered in the rest of this article.

One of the best-known distributed representation methods is word2vec (word to vector). Later methods such as doc2vec (document to vector), GloVe (global vectors) and fastText are flavors of word2vec with some notable changes that enhance the representation. Given a word, word2vec produces a vector of the desired dimension that reflects the context of that word. To represent a text with multiple words, either the word vectors are averaged, or the matrix formed by stacking them is decomposed into a single vector. GloVe improves the learning of word2vec by combining it with a co-occurrence matrix, and it can be trained on a small corpus with promising performance. Neither method captures word order well, since averaging word vectors ignores the order of the words. The doc2vec method introduces a way to represent a sequence of words as a vector. Its architecture is similar to that of word2vec, except that one more weight matrix is learned alongside the word2vec weight matrix to represent the sequence of words. fastText, in turn, learns the vector for a given word from the class it belongs to, whereas the earlier methods learn by predicting the next word after the given word. Since the number of classes is always smaller than the number of words, this method is faster than the others.
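The loss of word order under averaging is easy to check. In the sketch below the three-dimensional vectors are made-up values standing in for learned word2vec vectors, and `average_vector` is a hypothetical helper, not part of any library.

```python
import math

# Hypothetical 3-dimensional word vectors; real word2vec vectors are learned
# from a corpus and usually have 100 to 300 dimensions.
WORD_VECTORS = {
    "bank": [0.9, 0.1, 0.3],
    "transfer": [0.7, 0.2, 0.5],
    "before": [0.1, 0.8, 0.2],
    "login": [0.6, 0.3, 0.9],
}


def average_vector(text):
    """Represent a text as the element-wise mean of its known word vectors."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in text")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


first = average_vector("login before transfer")
second = average_vector("transfer before login")

# The two texts describe opposite orders of events, yet their averaged
# vectors are the same (up to floating-point rounding).
same = all(math.isclose(a, b) for a, b in zip(first, second))
print(same)
```

Any permutation of the same words yields the same averaged vector, which is why methods that model the sequence explicitly, such as doc2vec, were introduced.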

Of the methods above, fastText seems to be the most appealing and efficient representation for a text classification problem. We tested it on the data from The International Workshop on Security and Privacy Analytics - Anti Phishing (IWSPA-AP 2), discriminating legitimate emails from phishing ones. More details on these experiments can be found in the papers we have published on the topic, which use class embeddings and data-driven algorithms.
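The idea of learning word vectors from class labels can be illustrated with a small pure-Python sketch in the spirit of fastText's supervised mode. It is not the fastText library and not the setup used in our experiments: the toy corpus, the hyperparameters and the function names are assumptions, and real fastText adds character n-grams, hashing and a hierarchical softmax that are omitted here.

```python
import math
import random

# Hypothetical toy corpus, used only to make the sketch runnable.
TRAIN = [
    ("verify your account password now", "phishing"),
    ("urgent update your bank password", "phishing"),
    ("click here to confirm your account", "phishing"),
    ("meeting agenda for monday attached", "legitimate"),
    ("lunch on friday with the team", "legitimate"),
    ("project report attached for review", "legitimate"),
]
LABELS = ["legitimate", "phishing"]


def forward(words, emb, out):
    """Average the word vectors, then apply a linear layer and a softmax."""
    dim = len(out[0])
    hidden = [sum(emb[w][i] for w in words) / len(words) for i in range(dim)]
    scores = [sum(o[i] * hidden[i] for i in range(dim)) for o in out]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]


def train(docs, dim=8, epochs=200, lr=0.5, seed=0):
    """Learn word vectors and class weights jointly from the class labels."""
    rng = random.Random(seed)
    vocab = sorted({w for text, _ in docs for w in text.lower().split()})
    emb = {w: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}
    out = [[0.0] * dim for _ in LABELS]
    for _ in range(epochs):
        for text, label in docs:
            words = text.lower().split()
            hidden, probs = forward(words, emb, out)
            # Cross-entropy gradient with respect to the class scores.
            errors = [p - (1.0 if LABELS[k] == label else 0.0)
                      for k, p in enumerate(probs)]
            # Gradient for the averaged vector, computed before updating `out`.
            grad_hidden = [sum(errors[k] * out[k][i] for k in range(len(LABELS)))
                           for i in range(dim)]
            for k in range(len(LABELS)):
                for i in range(dim):
                    out[k][i] -= lr * errors[k] * hidden[i]
            for w in words:
                for i in range(dim):
                    emb[w][i] -= lr * grad_hidden[i] / len(words)
    return emb, out


def predict(text, emb, out):
    """Return the most probable label, or None if no word is known."""
    words = [w for w in text.lower().split() if w in emb]
    if not words:
        return None
    _, probs = forward(words, emb, out)
    return LABELS[probs.index(max(probs))]


emb, out = train(TRAIN)
print(predict("confirm your password", emb, out))
print(predict("agenda for the meeting", emb, out))
```

The output layer has one row per class rather than one row per vocabulary word, which is the source of the speed advantage described above: the softmax is computed over two classes instead of over the whole vocabulary.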