AUTOMATING CYBER THREAT INTELLIGENCE EXTRACTION USING NATURAL LANGUAGE PROCESSING TECHNIQUES

Amjad Jumani; Amber Baig; Engr. Dr. Shamim Akhtar; Muhammad Shahmir Shamim; Hira Zaheer; Areej Changaiz

doi:10.71146/kjmr498

Authors

Amjad Jumani, Lecturer, Faculty of Science and Technology, Ilma University, Karachi
Amber Baig, Department of Computer Science, Faculty of Engineering, Science & Technology, Isra University, Hyderabad
Engr. Dr. Shamim Akhtar, Adjunct Professor, Department of Information Systems and Cybersecurity, University of the Commonwealth Caribbean
Muhammad Shahmir Shamim, Student, University of California, Irvine
Hira Zaheer, UET Lahore
Areej Changaiz, MSCS Computer Science, MYU University

Keywords:

Cyber Threat Intelligence, Natural Language Processing, BERT, Entity Recognition, Information Extraction, Transformer Models, Cybersecurity, Threat Detection, Text Mining, Dependency Parsing

Abstract

The increasing complexity of cyberattacks has made it clear that organizations must prioritize real-time, actionable, and scalable Cyber Threat Intelligence (CTI) capabilities. The classical approach to CTI collection and analysis relies heavily on manual processing of raw, unstructured text such as threat reports, blogs, and advisories, and cannot keep pace with current cybersecurity threats. This study introduces a hybrid Natural Language Processing (NLP) framework that automates CTI extraction by combining state-of-the-art transformer models, namely fine-tuned BERT architectures, with syntactic dependency parsing and domain-specific rule-based post-processing. A custom-annotated dataset of more than 5,000 cybersecurity documents was created so that the system can extract key threat entities such as malware names, CVEs, IP addresses, threat actors, and TTPs. Experimental comparisons show that the proposed system substantially outperforms BiLSTM-CRF and traditional CRF baselines, achieving an F1-score of 0.90 in entity recognition. Error analysis also showed that the syntactic and rule-based enhancements substantially reduced entity fragmentation and false positives. The paper also investigates how preprocessing, data source quality, and entity linking to external knowledge bases affect the quality of CTI extraction. The findings demonstrate the promise of advanced NLP methods for making threat intelligence processing more accurate, faster, and more scalable in support of proactive cybersecurity defense.
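
As a rough illustration of what a rule-based pass over strictly formatted entities might look like, the following minimal sketch extracts CVE identifiers and IPv4 addresses with regular expressions. It is not the authors' implementation: the abstract does not publish its rules, so the patterns, the function name `extract_indicators`, and the sample text are all assumptions made for illustration. In the framework described above, such rules would complement the BERT-based recognizer rather than replace it.

```python
import ipaddress
import re

# Hypothetical patterns for illustration; the paper's actual rules are not given.
CVE_RE = re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)
# Four dot-separated digit groups that are not part of a longer dotted number.
IPV4_CANDIDATE_RE = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?!\d|\.\d)")


def extract_indicators(text):
    """Return CVE IDs (sorted, upper-cased) and valid IPv4 addresses (in order seen)."""
    cves = sorted({m.group(0).upper() for m in CVE_RE.finditer(text)})
    ips = []
    for match in IPV4_CANDIDATE_RE.finditer(text):
        candidate = match.group(0)
        try:
            # Rejects out-of-range octets such as 999.1.1.1.
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue
        if candidate not in ips:
            ips.append(candidate)
    return {"cve": cves, "ipv4": ips}


sample = (
    "The actor exploited cve-2021-44228 and CVE-2017-0144 from 192.168.1.10; "
    "version 1.2.3.4.5 and 999.1.1.1 are not addresses."
)
result = extract_indicators(sample)
print(result)
```

Validating candidates with the standard-library `ipaddress` module, instead of relying on the regular expression alone, is one simple way to cut the kind of false positives (version strings, out-of-range octets) that rule-based post-processing is meant to filter out.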