# Intrusion detection on NSL-KDD
This is my attempt at the [NSL-KDD](http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html) dataset, an improved version of the well-known [KDD'99](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset. I used Python, scikit-learn, and PySpark via [ready-to-run Jupyter applications in Docker](https://github.com/jupyter/docker-stacks).
I tried a variety of approaches to this dataset; some of them are presented here.
To run this notebook, use the `make nsl-kdd-pyspark` command. It downloads the latest jupyter/pyspark-notebook Docker image and starts a container with Jupyter available on port `8889`.
## Contents
1. [Task description summary](#1-task-description-summary)
2. [Data loading](#2-data-loading)
3. [Exploratory Data Analysis](#3-exploratory-Data-Analysis)
4. [One Hot Encoding for categorical variables](#4-one-Hot-Encoding-for-categorical-variables)
5. [Feature Selection using Attribute Ratio](#5-feature-Selection-using-Attribute-Ratio)
6. [Data preparation](#6-data-preparation)
7. [Visualization via PCA](#7-visualization-via-PCA)
8. [KMeans clustering with Random Forest Classifiers](#8-kMeans-clustering-with-Random-Forest-Classifiers)
9. [Gaussian Mixture clustering with Random Forest Classifiers](#9-gaussian-Mixture-clustering-with-Random-Forest-Classifiers)
10. [Supervised approach for detecting each type of attack separately](#10-supervised-approach-for-dettecting-each-type-of-attacks-separately)
11. [Ensembling experiments](#11-ensembling-experiments)
12. [Results summary](#12-results-summary)
## 1. Task description summary
Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections.
A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:
- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing password;
- U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
- probing: surveillance and other probing, e.g., port scanning.
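As a rough illustration of the four categories, the sketch below maps a few raw attack labels to their category. The label names follow the KDD'99 naming, but the mapping is deliberately incomplete, and the `ATTACK_CATEGORY` dictionary and `categorize` helper are hypothetical names introduced here, not part of the notebook.

```python
# Illustrative, incomplete subset of KDD'99 attack labels and their categories.
ATTACK_CATEGORY = {
    "neptune": "DoS",          # SYN flood
    "guess_passwd": "R2L",     # password guessing from a remote machine
    "buffer_overflow": "U2R",  # local privilege escalation
    "portsweep": "Probe",      # port scanning
}


def categorize(label):
    """Map a raw label to its category; labels missing from the subset map to 'unknown'."""
    if label == "normal":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")


for raw in ["normal", "neptune", "portsweep", "some_new_attack"]:
    print(raw, "->", categorize(raw))
```

Returning `"unknown"` for unmapped labels is a design choice for this sketch only; a full mapping would cover every label in the train and test files.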
It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.
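One simple way to see which attack types appear only in the test data is a set difference over the label columns. The snippet below is a minimal sketch on toy label lists; the `novel_labels` helper is a hypothetical name, and the lists stand in for the real label columns.

```python
def novel_labels(train_labels, test_labels):
    """Return the labels that occur in the test data but never in the training data."""
    return sorted(set(test_labels) - set(train_labels))


# Toy label lists for illustration only, not the real label columns.
train_labels = ["normal", "neptune", "portsweep", "normal"]
test_labels = ["normal", "neptune", "mscan", "apache2"]

unseen = novel_labels(train_labels, test_labels)
print(unseen)
```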
The complete task description can be found [here](http://kdd.ics.uci.edu/databases/kddcup99/task.html).
### NSL-KDD dataset description
[NSL-KDD](http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html) is a data set suggested to solve some of the inherent problems of the [KDD'99](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) data set.
The NSL-KDD data set has the following advantages over the original KDD data set:
- It does not include redundant records in the train set, so the classifiers will not be biased towards more frequent records.
- There are no duplicate records in the proposed test sets; therefore, the performance of the learners is not biased by the methods which have better detection rates on the frequent records.
- The number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original KDD data set. As a result, the classification rates of distinct machine learning methods vary in a wider range, which makes it more efficient to have an accurate evaluation of different learning techniques.
- The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.
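The first two points are about removing repeated records. The sketch below shows the general idea on made-up records: a record that repeats many times dominates the counts until duplicates are dropped. It is only an illustration of deduplication and is not claimed to be the procedure used to build NSL-KDD.

```python
from collections import Counter

# Toy connection records as (protocol_type, service, flag) tuples; values are made up.
records = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("udp", "domain_u", "SF"),
]

# Before deduplication the frequent record dominates the counts.
counts_before = Counter(records)

# dict.fromkeys keeps the first occurrence of each record and preserves order.
unique_records = list(dict.fromkeys(records))

print(counts_before.most_common(1))
print(unique_records)
```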
## 2. Data loading
```python
# Imports used throughout this notebook
import math
import itertools
import pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from collections import OrderedDict
%matplotlib inline
gt0 = time()
```
```python
import pyspark
from pyspark.sql import SQLContext, Row
# Creating local SparkContext with 8 threads and SQLContext based on it
sc = pyspark.SparkContext(master='local[8]')
sc.setLogLevel('INFO')
sqlContext = SQLContext(sc)
```
```python
from pyspark.sql.types import *
from pyspark.sql.functions import udf, split, col
import pyspark.sql.functions as sql
train20_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTrain+_20Percent.txt"
train_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTrain+.txt"
test_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTest+.txt"
col_names = np.array(["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","labels"])
nominal_inx = [1, 2, 3]
binary_inx = [6, 11, 13, 14, 20, 21]
numeric_inx = list(set(range(41)).difference(nominal_inx).difference(binary_inx))
nominal_cols = col_names[nominal_inx].tolist()
binary_cols = col_names[binary_inx].tolist()
numeric_cols = col_names[numeric_inx].tolist()
```
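A quick sanity check on the index groups above can catch off-by-one mistakes early. The sketch below redefines the same index lists so that it runs on its own, and verifies that the three groups are disjoint and together cover all 41 feature columns (the 42nd column is `labels`). It uses `sorted` rather than `list(set(...))` because set iteration order is not guaranteed by the language.

```python
# Same index groups as in the notebook, redefined so the check is self-contained.
nominal_inx = [1, 2, 3]
binary_inx = [6, 11, 13, 14, 20, 21]
numeric_inx = sorted(set(range(41)) - set(nominal_inx) - set(binary_inx))

all_inx = nominal_inx + binary_inx + numeric_inx

# No index may appear in more than one group.
is_disjoint = len(all_inx) == len(set(all_inx))

# Together the groups must cover feature columns 0..40.
covers_all = sorted(all_inx) == list(range(41))

print(len(nominal_inx), len(binary_inx), len(numeric_inx), is_disjoint, covers_all)
```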
```python
# Function to load dataset and divide it into 8 partitions
def load_dataset(path):
    dataset_rdd = sc.textFile(path, 8).map(lambda line: line.split(','))
    dataset_df = (dataset_rdd.toDF(col_names.tolist()).select(
        col('duration').cast(DoubleType()),
        col('protocol_type').cast(StringType()),
        col('service').cast(StringType()),
        col('flag').cast(StringType()),
        col('src_bytes').cast(DoubleType()),
        col('dst_bytes').cast(DoubleType()),
        col('land').cast(DoubleType()),
        col('wrong_fragment').cast(DoubleType()),
        col('urgent').cast(DoubleType()),
        col('hot').cast(DoubleType()),
        col('num_failed_logins').cast(DoubleType()),
        col('logged_in').cast(DoubleType()),
        col('num_compromised').cast(DoubleType()),
        col('root_shell').cast(DoubleType()),
        col('su_attempted').cast(DoubleType()),
        col('num_root').cast(DoubleType()),
        col('num_file_creations').cast(DoubleType()),
        col('num_shells').cast(DoubleType()),
        col('num_access_files').cast(DoubleType()),
        col('num_outbound_cmds').cast(DoubleType()),
        col('is_host_login').cast(DoubleType()),
        col('is_guest_login').cast(DoubleType()),