Official Journal of AlNoor University

Phishing URL detection based on contextualized word representations

Document Type : Research paper

Authors

1 University of Mosul

2 Cybersecurity Department, Computer and Math College, University of Mosul, Mosul, Nineveh, Iraq.

Abstract
Phishing remains a prevalent cybercrime, and attackers continually refine their URL obfuscation schemes, undermining conventional detection systems that rely on fragile, manually constructed lexical features. In response, this paper presents a phishing URL detection model that uses ELMo (Embeddings from Language Models) to produce deep contextual representations of the words in raw URLs, capturing both syntactic and semantic relationships even in the presence of homoglyph substitutions and randomly generated strings. The data processing pipeline transforms the tokenized URLs of the PhiUSIIL dataset into 1024-dimensional contextual embeddings, which are then used to train a sequential Dense Neural Network (DNN) classifier. Evaluated on the PhiUSIIL benchmark, the proposed ELMo-based system achieved an Accuracy of 0.95, a Precision of 0.94, a Recall of 0.96, and an F1-score of 0.95, indicating greater robustness and generalization than baseline approaches. The findings support the usefulness of contextualized embeddings for reducing critical false negatives and underscore the practical applicability of the model.
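
As a rough illustration of two steps mentioned in the abstract, the following Python sketch tokenizes a raw URL and checks that the reported F1-score agrees with the reported Precision and Recall. The tokenization rule (splitting on non-alphanumeric characters), the function names `tokenize_url` and `f1_score`, and the example URL are assumptions made for illustration; the abstract does not specify the tokenizer the authors used, and the ELMo embedding and DNN training stages are not reproduced here.

```python
import re


def tokenize_url(url: str) -> list[str]:
    """Split a raw URL into lowercase alphanumeric tokens.

    Assumption: delimiters are any run of non-alphanumeric characters.
    The paper's actual tokenization scheme may differ.
    """
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def f1_score(precision: float, recall: float) -> float:
    """Return the F1-score, the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical URL, used only to show the token sequence an embedder would receive.
tokens = tokenize_url("http://secure-login.example.com/account/verify?id=123")

# Precision and Recall as reported in the abstract.
reported_f1 = f1_score(0.94, 0.96)

print(tokens)
print(round(reported_f1, 2))
```

With Precision 0.94 and Recall 0.96, the harmonic mean rounds to 0.95, which matches the F1-score stated in the abstract.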

Keywords

Subjects