# Intrusion detection on NSL-KDD
This is my attempt at the [NSL-KDD](http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html) dataset, an improved version of the well-known [KDD'99](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset. I used Python, scikit-learn and PySpark via [ready-to-run Jupyter applications in Docker](https://github.com/jupyter/docker-stacks).
I tried a variety of approaches to this dataset; some of them are presented here.
To run this notebook, use the `make nsl-kdd-pyspark` command. It downloads the latest jupyter/pyspark-notebook Docker image and starts a container with Jupyter available on port `8889`.
## Contents
1. [Task description summary](#1-task-description-summary)
2. [Data loading](#2-data-loading)
3. [Exploratory Data Analysis](#3-exploratory-Data-Analysis)
4. [One Hot Encoding for categorical variables](#4-one-Hot-Encoding-for-categorical-variables)
5. [Feature Selection using Attribute Ratio](#5-feature-Selection-using-Attribute-Ratio)
6. [Data preparation](#6-data-preparation)
7. [Visualization via PCA](#7-visualization-via-PCA)
8. [KMeans clustering with Random Forest Classifiers](#8-kMeans-clustering-with-Random-Forest-Classifiers)
9. [Gaussian Mixture clustering with Random Forest Classifiers](#9-gaussian-Mixture-clustering-with-Random-Forest-Classifiers)
10. [Supervised approach for detecting each type of attack separately](#10-supervised-approach-for-dettecting-each-type-of-attacks-separately)
11. [Ensembling experiments](#11-ensembling-experiments)
12. [Results summary](#12-results-summary)
## 1. Task description summary
Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections.
A connection is a sequence of TCP packets starting and ending at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol. Each connection is labeled either as normal or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:
- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing password;
- U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
- probing: surveillance and other probing, e.g., port scanning.
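The four categories above can be expressed as a lookup from individual attack labels to a category. The sketch below is illustrative only: the label names are the attack types commonly listed for the KDD'99 training data, and both the list and the helper name `categorize` are assumptions that should be checked against the actual label column. The list may not be exhaustive.

```python
# Attack labels grouped by category. The label spellings are an assumption
# based on the commonly cited KDD'99 training attack types.
ATTACK_CATEGORIES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop", "phf", "spy",
            "warezclient", "warezmaster"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "probing": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the grouping into a flat label -> category dictionary.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}


def categorize(label):
    """Return 'normal', one of the four attack categories, or 'unknown'."""
    if label == "normal":
        return "normal"
    # Labels that are not in the mapping fall through to 'unknown'.
    return LABEL_TO_CATEGORY.get(label, "unknown")


print(categorize("neptune"))
print(categorize("normal"))
```

Labels missing from the mapping return `"unknown"` instead of raising an error, so attack types that were never listed can be spotted and assigned to a category by hand.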
It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.
The complete task description can be found [here](http://kdd.ics.uci.edu/databases/kddcup99/task.html).
### NSL-KDD dataset description
[NSL-KDD](http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html) is a data set suggested to solve some of the inherent problems of the [KDD'99](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) data set.
The NSL-KDD data set has the following advantages over the original KDD data set:
- It does not include redundant records in the train set, so the classifiers will not be biased towards more frequent records.
- There are no duplicate records in the proposed test sets; therefore, the performance of the learners is not biased by the methods which have better detection rates on the frequent records.
- The number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original KDD data set. As a result, the classification rates of distinct machine learning methods vary in a wider range, which makes it more efficient to have an accurate evaluation of different learning techniques.
- The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.
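To illustrate the first two points, here is a minimal sketch of how repeated records inflate the share of one class and how removing exact duplicates changes it. The records are hypothetical toy data, not taken from either dataset, and `label_shares` is an illustrative helper.

```python
from collections import Counter

# Hypothetical toy records: (protocol_type, service, flag, label).
records = [
    ("tcp", "http", "SF", "normal"),
    ("tcp", "private", "S0", "neptune"),
    ("tcp", "private", "S0", "neptune"),
    ("tcp", "private", "S0", "neptune"),
    ("icmp", "ecr_i", "SF", "smurf"),
]


def label_shares(rows):
    """Return the fraction of rows carrying each label (last tuple element)."""
    counts = Counter(row[-1] for row in rows)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# dict.fromkeys keeps the first occurrence of each record and preserves order.
unique_records = list(dict.fromkeys(records))

print(label_shares(records))
print(label_shares(unique_records))
```

In the toy data the repeated record accounts for three of five rows before deduplication and one of three afterwards, which is the kind of frequency bias the deduplicated sets are meant to avoid.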
## 2. Data loading
```python
# Imports used throughout this notebook
import math
import itertools
import pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from collections import OrderedDict
%matplotlib inline
gt0 = time()
```
```python
import pyspark
from pyspark.sql import SQLContext, Row
# Creating local SparkContext with 8 threads and SQLContext based on it
sc = pyspark.SparkContext(master='local[8]')
sc.setLogLevel('INFO')
sqlContext = SQLContext(sc)
```
```python
from pyspark.sql.types import *
from pyspark.sql.functions import udf, split, col
import pyspark.sql.functions as sql
train20_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTrain+_20Percent.txt"
train_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTrain+.txt"
test_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTest+.txt"
col_names = np.array(["duration","protocol_type","service","flag","src_bytes",
"dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
"logged_in","num_compromised","root_shell","su_attempted","num_root",
"num_file_creations","num_shells","num_access_files","num_outbound_cmds",
"is_host_login","is_guest_login","count","srv_count","serror_rate",
"srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
"diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
"dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
"dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
"dst_host_rerror_rate","dst_host_srv_rerror_rate","labels"])
nominal_inx = [1, 2, 3]
binary_inx = [6, 11, 13, 14, 20, 21]
# sorted() makes the column order explicit instead of relying on set iteration order
numeric_inx = sorted(set(range(41)).difference(nominal_inx).difference(binary_inx))
nominal_cols = col_names[nominal_inx].tolist()
binary_cols = col_names[binary_inx].tolist()
numeric_cols = col_names[numeric_inx].tolist()
```
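As a quick sanity check on the index lists above, the sketch below verifies that the nominal, binary and numeric groups are disjoint and together cover all 41 feature columns (the `labels` column at index 41 is left out). The constant `N_FEATURES` is introduced here only for illustration.

```python
N_FEATURES = 41  # feature columns only; the "labels" column is excluded

nominal_inx = [1, 2, 3]
binary_inx = [6, 11, 13, 14, 20, 21]
numeric_inx = sorted(set(range(N_FEATURES)) - set(nominal_inx) - set(binary_inx))

groups = [nominal_inx, binary_inx, numeric_inx]
all_indices = [i for group in groups for i in group]

# No index appears in more than one group...
assert len(all_indices) == len(set(all_indices))
# ...and every feature index belongs to some group.
assert set(all_indices) == set(range(N_FEATURES))

print(len(nominal_inx), len(binary_inx), len(numeric_inx))
```

A check like this catches an index accidentally listed in two groups, which would otherwise lead to a column being encoded twice later on.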
```python
# Function to load dataset and divide it into 8 partitions
def load_dataset(path):
    dataset_rdd = sc.textFile(path, 8).map(lambda line: line.split(','))
    dataset_df = (dataset_rdd.toDF(col_names.tolist()).select(
        col('duration').cast(DoubleType()),
        col('protocol_type').cast(StringType()),
        col('service').cast(StringType()),
        col('flag').cast(StringType()),
        col('src_bytes').cast(DoubleType()),
        col('dst_bytes').cast(DoubleType()),
        col('land').cast(DoubleType()),
        col('wrong_fragment').cast(DoubleType()),
        col('urgent').cast(DoubleType()),
        col('hot').cast(DoubleType()),
        col('num_failed_logins').cast(DoubleType()),
        col('logged_in').cast(DoubleType()),
        col('num_compromised').cast(DoubleType()),
        col('root_shell').cast(DoubleType()),
        col('su_attempted').cast(DoubleType()),
        col('num_root').cast(DoubleType()),
        col('num_file_creations').cast(DoubleType()),
        col('num_shells').cast(DoubleType()),
        col('num_access_files').cast(DoubleType()),
        col('num_outbound_cmds').cast(DoubleType()),
        col('is_host_login').cast(DoubleType()),
        col('is_guest_login').cast(DoubleType()),