[Free] Big Data Analysis and Security course lab reports and final paper resources - CSDN Library

660 files in total

py: 558

exe: 24

txt: 15

Data analysis

Course resources

Graduation project

Points required: 0 · 177 views · Updated 2023-04-14 · 1 favorite · 10.8 MB RAR

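Assignment: Propose an interesting research hypothesis or insight and, following the data analysis workflow and exploratory data analysis (EDA) methods, show whether the hypothesis or insight holds, presenting the results with visualizations. The analysis steps are to be given in Markdown. Then use an unsupervised learning algorithm to design a general-purpose network attack classifier that assigns each sample to one of five classes: benign, DoS, r2l, u2r, or probe. Following the usual workflow for applying machine learning to cyberspace security research, perform model selection and parameter tuning so that the model is as accurate as possible. The process and results should preferably be presented with visualizations. The analysis steps are to be given in Markdown.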
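To evaluate a five-class classifier of this kind, the raw attack names in the NSL-KDD label column first have to be collapsed into the five classes. The following is a minimal sketch of that step, not the method used in the archive. The names `ATTACK_CATEGORY` and `to_category` are hypothetical, and the table lists only attack names from the KDD'99 training data as commonly documented, so it should be checked against the dataset before use. Attack types that appear only in the test data fall through to a default value.

```python
from collections import Counter

# Hypothetical lookup table: KDD'99 training attack names -> five classes.
# Assumption: the label column holds lowercase names such as "neptune".
# Test-only attack types are deliberately not listed here.
ATTACK_CATEGORY = {
    "normal": "benign",
    # Denial of service
    "back": "dos", "land": "dos", "neptune": "dos",
    "pod": "dos", "smurf": "dos", "teardrop": "dos",
    # Unauthorized access from a remote machine
    "ftp_write": "r2l", "guess_passwd": "r2l", "imap": "r2l",
    "multihop": "r2l", "phf": "r2l", "spy": "r2l",
    "warezclient": "r2l", "warezmaster": "r2l",
    # Unauthorized access to local superuser privileges
    "buffer_overflow": "u2r", "loadmodule": "u2r",
    "perl": "u2r", "rootkit": "u2r",
    # Surveillance and probing
    "ipsweep": "probe", "nmap": "probe",
    "portsweep": "probe", "satan": "probe",
}


def to_category(label, default="unknown"):
    """Map a raw label to one of the five classes, or to `default`."""
    return ATTACK_CATEGORY.get(label.strip().lower(), default)


# Usage: count classes for a handful of example labels.
labels = ["normal", "neptune", "guess_passwd", "rootkit", "nmap", "neptune"]
class_counts = Counter(to_category(label) for label in labels)
```

Keeping the mapping in a plain dictionary makes it easy to extend once the test-only attack names have been looked up in the dataset itself.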


Big Data Analysis and Security course lab reports and final paper (660 files)

activate 2KB

activate.bat 1013B

deactivate.bat 510B

pydoc.bat 24B

sysconfig.cfg 3KB

pyvenv.cfg 410B

AB_NYC_2019.csv 6.75MB

学术报告论文.doc 549KB

实验报告.docx 573KB

实验报告1.docx 502KB

python.exe 260KB

pythonw.exe 249KB

t64-arm.exe 177KB

w64-arm.exe 163KB

gui-arm64.exe 135KB

cli-arm64.exe 134KB

pip.exe 104KB

pip3.exe 104KB

pip-3.10.exe 104KB

pip3.10.exe 104KB

wheel3.exe 104KB

wheel-3.10.exe 104KB

wheel.exe 104KB

wheel3.10.exe 104KB

t64.exe 104KB

w64.exe 98KB

t32.exe 95KB

w32.exe 88KB

gui-64.exe 74KB

cli-64.exe 73KB

cli.exe 64KB

cli-32.exe 64KB

gui-32.exe 64KB

gui.exe 64KB

activate.fish 3KB

.gitignore 184B

.gitignore 44B

.gitignore 42B

Airbnb-2019-NYV-master.iml 411B

nsl-kdd-master.iml 291B

INSTALLER 5B

NSL-KDD-checkpoint.ipynb 306KB

Test1.ipynb 231KB

Test1-checkpoint.ipynb 231KB

NSL-KDD.ipynb 124KB

NSL-KDD.ipynb 57KB

nsl-kdd-classification-experiment.ipynb 2KB

nsl-kdd-classification-experiment-checkpoint.ipynb 2KB

LICENSE 11KB

LICENSE 1KB

Makefile 278B

README.md 116KB

METADATA 5KB

METADATA 4KB

METADATA 2KB

activate.nu 1KB

deactivate.nu 333B

cacert.pem 253KB

output_62_0.png 67KB

output_62_1.png 66KB

activate.ps1 2KB

distutils-precedence.pth 151B

_virtualenv.pth 18B

pyparsing.py 267KB

pyparsing.py 227KB

uts46data.py 197KB

langrussianmodel.py 128KB

more.py 115KB

html5parser.py 114KB

__init__.py 106KB

langbulgarianmodel.py 103KB

langthaimodel.py 101KB

langhungarianmodel.py 100KB

langgreekmodel.py 97KB

langhebrewmodel.py 96KB

langturkishmodel.py 94KB

tarfile.py 90KB

easy_install.py 84KB

constants.py 82KB

_tokenizer.py 75KB

util.py 66KB

locators.py 51KB

database.py 50KB

msvc.py 49KB

dist.py 49KB

distro.py 47KB

ccompiler.py 47KB

dist.py 42KB

wheel.py 42KB

idnadata.py 41KB

compat.py 41KB

package_index.py 39KB

metadata.py 38KB

fallback.py 37KB

connectionpool.py 37KB



# Intrusion detection on NSL-KDD

This is my attempt at the [NSL-KDD](http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html) dataset, which is an improved version of the well-known [KDD'99](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) dataset. I've used Python, Scikit-learn and PySpark via [ready-to-run Jupyter applications in Docker](https://github.com/jupyter/docker-stacks).

I've tried a variety of approaches to this dataset. Some of them are presented here.

To run this notebook, use the `make nsl-kdd-pyspark` command. It downloads the latest jupyter/pyspark-notebook Docker image and starts a container with Jupyter available on port `8889`.

## Contents

1. [Task description summary](#1-task-description-summary)
2. [Data loading](#2-data-loading)
3. [Exploratory Data Analysis](#3-exploratory-Data-Analysis)
4. [One Hot Encoding for categorical variables](#4-one-Hot-Encoding-for-categorical-variables)
5. [Feature Selection using Attribute Ratio](#5-feature-Selection-using-Attribute-Ratio)
6. [Data preparation](#6-data-preparation)
7. [Visualization via PCA](#7-visualization-via-PCA)
8. [KMeans clustering with Random Forest Classifiers](#8-kMeans-clustering-with-Random-Forest-Classifiers)
9. [Gaussian Mixture clustering with Random Forest Classifiers](#9-gaussian-Mixture-clustering-with-Random-Forest-Classifiers)
10. [Supervised approach for detecting each type of attacks separately](#10-supervised-approach-for-dettecting-each-type-of-attacks-separately)
11. [Ensembling experiments](#11-ensembling-experiments)
12. [Results summary](#12-results-summary)

## 1. Task description summary

Software to detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between bad connections, called intrusions or attacks, and good normal connections.

A connection is a sequence of TCP packets starting and ending at some well-defined times, between which data flows to and from a source IP address to a target IP address under some well-defined protocol. Each connection is labeled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes.

Attacks fall into four main categories:

- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing password;
- U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
- probing: surveillance and other probing, e.g., port scanning.

It is important to note that the test data is not from the same probability distribution as the training data, and it includes specific attack types not in the training data. This makes the task more realistic. Some intrusion experts believe that most novel attacks are variants of known attacks and the "signature" of known attacks can be sufficient to catch novel variants. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.

The complete task description can be found [here](http://kdd.ics.uci.edu/databases/kddcup99/task.html).

### NSL-KDD dataset description

[NSL-KDD](http://www.unb.ca/research/iscx/dataset/iscx-NSL-KDD-dataset.html) is a data set suggested to solve some of the inherent problems of the [KDD'99](http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) data set.

The NSL-KDD data set has the following advantages over the original KDD data set:

- It does not include redundant records in the train set, so the classifiers will not be biased towards more frequent records.
- There are no duplicate records in the proposed test sets; therefore, the performance of the learners is not biased by the methods which have better detection rates on the frequent records.
- The number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original KDD data set. As a result, the classification rates of distinct machine learning methods vary in a wider range, which makes it more efficient to have an accurate evaluation of different learning techniques.
- The number of records in the train and test sets is reasonable, which makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.

## 2. Data loading

```python
# Imports used throughout this notebook
import math
import itertools
import pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
from collections import OrderedDict

%matplotlib inline
gt0 = time()
```

```python
import pyspark
from pyspark.sql import SQLContext, Row

# Creating local SparkContext with 8 threads and SQLContext based on it
sc = pyspark.SparkContext(master='local[8]')
sc.setLogLevel('INFO')
sqlContext = SQLContext(sc)
```

```python
from pyspark.sql.types import *
from pyspark.sql.functions import udf, split, col
import pyspark.sql.functions as sql

train20_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTrain+_20Percent.txt"
train_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTrain+.txt"
test_nsl_kdd_dataset_path = "NSL_KDD_Dataset/KDDTest+.txt"

col_names = np.array(["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","labels"])

# Positions of the nominal and binary features among the 41 feature columns;
# every remaining feature is treated as numeric
nominal_inx = [1, 2, 3]
binary_inx = [6, 11, 13, 14, 20, 21]
numeric_inx = list(set(range(41)).difference(nominal_inx).difference(binary_inx))

nominal_cols = col_names[nominal_inx].tolist()
binary_cols = col_names[binary_inx].tolist()
numeric_cols = col_names[numeric_inx].tolist()
```

```python
# Function to load dataset and divide it into 8 partitions
def load_dataset(path):
    dataset_rdd = sc.textFile(path, 8).map(lambda line: line.split(','))
    dataset_df = (dataset_rdd.toDF(col_names.tolist()).select(
                    col('duration').cast(DoubleType()),
                    col('protocol_type').cast(StringType()),
                    col('service').cast(StringType()),
                    col('flag').cast(StringType()),
                    col('src_bytes').cast(DoubleType()),
                    col('dst_bytes').cast(DoubleType()),
                    col('land').cast(DoubleType()),
                    col('wrong_fragment').cast(DoubleType()),
                    col('urgent').cast(DoubleType()),
                    col('hot').cast(DoubleType()),
                    col('num_failed_logins').cast(DoubleType()),
                    col('logged_in').cast(DoubleType()),
                    col('num_compromised').cast(DoubleType()),
                    col('root_shell').cast(DoubleType()),
                    col('su_attempted').cast(DoubleType()),
                    col('num_root').cast(DoubleType()),
                    col('num_file_creations').cast(DoubleType()),
                    col('num_shells').cast(DoubleType()),
                    col('num_access_files').cast(DoubleType()),
                    col('num_outbound_cmds').cast(DoubleType()),
                    col('is_host_login').cast(DoubleType()),
                    col('is_guest_login').cast(DoubleType()),