一、实验名称

处理 KDD99 数据集

二、实验目的

KDD99 数据集进行处理,获得统计信息。

三、实验内容

  1. 对“kddcup.data_10_percent”数据集(with label)进行处理,获得统计信息

  2. 对“kddcup.testdata.unlabeled”数据集进行处理,获得统计信息

  3. 根据“kddcup.data_10_percent”数据集(with label) 分出不同种类的攻击和正常数据,获得基本的统计信息

  4. 利⽤labeled数据集的统计特征,分析unlabeled数据集。假设KDD99的正常数据集服从正态分布,则在置信度为0.9的情况下,请检验unlabeled数据集是否属于正常数据?(提示:首先需要获得一个normal的数据样本集,然后需要排除掉数据集中无关features的影响,之后以处理后的数据集估计KDD99正常数据集的均值和方差,然后利用假设检验,对应的去检验unlabeled数据集的相关性质。)

  5. 扩展:尝试使用机器学习分类算法利用labeled的数据集进行训练,在使用对应模型对unlabeled数据集进行分类。

四、实验数据及结果分析

KDD99一共有42项特征,前41项可分为四大类,第42项为入侵检测试验类型标识

1.单个 TCP 连接的基本特征(1~9)

特征名称 描述 类型
duration 连接持续的秒数 连续
protocol_type 协议类型,TCP,UDP,ICMP 离散
service 目标主机的网络服务类型,共70种 离散
flag 连接正常或错误的状态,共11种 离散
src_bytes 从源主机到目标主机的数据字节数 连续
dst_bytes 从目标主机到源主机的数据字节数 连续
land 若连接来自/送达同一个主机/端口则为1 离散
wrong_fragment 错误分段的数量 连续
urgent 加急包的个数 连续

补充

service:

1
'aol', 'auth', 'bgp', 'courier', 'csnet_ns','ctf', 'daytime', 'discard', 'domain', 'domain_u', 'echo', 'eco_i', 'ecr_i', 'efs', 'exec', 'finger', 'ftp', 'ftp_data','gopher', 'harvest', 'hostnames', 'http', 'http_2784', 'http_443', 'http_8001', 'imap4', 'IRC', 'iso_tsap','klogin', 'kshell', 'ldap', 'link', 'login', 'mtp', 'name', 'netbios_dgm', 'netbios_ns', 'netbios_ssn', 'netstat','nnsp', 'nntp', 'ntp_u', 'other', 'pm_dump', 'pop_2', 'pop_3', 'printer', 'private', 'red_i', 'remote_job', 'rje','shell', 'smtp', 'sql_net', 'ssh', 'sunrpc', 'supdup', 'systat', 'telnet', 'tftp_u', 'tim_i', 'time', 'urh_i', 'urp_i','uucp', 'uucp_path', 'vmnet', 'whois', 'X11', 'Z39_50'

flag

1
2
>'OTH', 'REJ', 'RSTO', 'RSTOS0', 'RSTR', 'S0', 'S1',
>'S2', 'S3', 'SF', 'SH'

2.TCP连接的内容特征(10~22)

特征名称 描述 类型
hot 访问敏感文件或目录的次数 连续
num_failed_logins 登陆尝试失败的次数 连续
logged_in 成功登陆则为1,否则为0 离散
num_compromised compromised条件出现的次数 连续
root_shell 获得 root shell 则为1,否则为0 离散
su_attempted 若出现”su root“命令则为1,否则为0 离散
num_root **root **用户访问次数 连续
num_file_creations 文件创建操作的次数 连续
num_shells 使用 shell 命令的次数 连续
num_access_files 访问控制文件的次数 连续
num_outbound_cmds 一个 FTP 会话中出战连接的次数 连续
is_hot_login 登陆是否属于”hot“列表,是为1,否则为0 离散
is_guest_login 若是 guest 登陆则为1 离散

3.两秒内网络流量统计特征(23~31)

特征名称 描述 类型
count 与当前连接具有相同的目标主机的连接数 连续
srv_count 与当前连接具有相同服务的连接数 连续
serror_rate 在与当前连接具有相同目标主机的连接中,出现”SYN“错误的连接的百分比 连续
srv_serror_rate 在与当前连接具有相同服务的连接中,出现”SYN“错误的连接的百分比 连续
rerror_rate 在与当前连接具有相同目标主机的连接中,出现””REJ“错误的连接的百分比 连续
srv_rerror_rate 在与当前连接具有相同服务的连接中,出现”REJ“错误的连接的百分比 连续
same_srv_rate 在与当前连接具有相同目标主机的连接中,与当前连接具有相同服务的连接的百分比 连续
diff_srv_rate 在与当前连接具有相同目标主机的连接中,与当前连接具有不同服务的连接的百分比 连续
srv_diff_host_rate 在与当前连接具有相同服务的连接中,与当前连接具有不同目标主机的连接的百分比 连续

4.一百个连接中的网络流量统计特征(32~41)

特征名称 描述 类型
dst_host_count 与当前连接具有相同目标主机的连接数 连续
dst_host_srv_count 与当前连接具有相同目标主机相同服务的连接数 连续
dst_host_same_srv_rate 与当前连接具有相同目标主机相同服务的连接所占的百分比 连续
dst_host_diff_srv_rate 与当前连接具有相同目标主机不同服务的连接所占的百分比 连续
dst_host_same_src_port_rate 与当前连接具有相同目标主机相同源端口的连接所占的百分比 连续
dst_host_srv_diff_host_rate 与当前连接具有相同目标主机相同服务的连接中,与当前连接具有不同源主机的连接所占的百分比 连续
dst_host_serror_rate 与当前连接具有相同目标主机的连接中,出现”SYN“错误的连接所占的百分比 连续
dst_host_srv_serror_rate 与当前连接具有相同目标主机相同服务的连接中,出现”SYN“错误的连接所占的百分比 连续
dst_host_rerror_rate 与当前连接具有相同目标主机的连接中,出现”REJ“错误的连接所占的百分比 连续
dst_host_srv_rerror_rate 与当前连接具有相同目标主机相同服务的连接中,出现”REJ“错误的连接所占的百分比 连续

第42项

标识类型 描述 具体分类标识
Normal 正常记录 Normal
DOS 拒绝服务攻击 back、land、neptune、pod、smurf、teardrop、apache2、mailbomb、processtable、udpstorm
Probing 监视和其他探 ipsweep、nmap、portsweep、satan、mscan、saint
R2L 来自远程机器的非法访问 tp_write、guess_passwd、imap、multihop、phf、spy、warezclient、warezmaster、multihop、named、sendmail、snmpgetattack、worm、xlock、xsnoop
U2R 普通用户对本地超级用户特权的非法访问 buffer_overflow、loadmodule、perl、rootkit、ps、sqlattack、xterm、httptunnel

实验结果

1.对“kddcup.data_10_percent”数据集(with label)进行处理,获得统计信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("./data/corrected-Test data with corrected labels")
df.columns = labels

def plot_pie_optimize(title, dict_value: dict, font_size):
keys = list(dict_value.keys())
values = np.array(list(dict_value.values()))
patches, texts = plt.pie(values)
labels = ['{} {:1.2f}% ({})'.format(i, j, v) for i, j, v in
zip(keys, 100. * values / values.sum(), values)] # 构造图例数据
patches, labels, dummy = zip(*sorted(
zip(patches, labels, values), key=lambda x: x[2], reverse=True))
plt.legend(patches, labels, loc="center", bbox_to_anchor=(-0.1, 0.5), fontsize=font_size)
plt.title(title)
plt.tight_layout()
plt.savefig("./dataimg/{}.png".format(title), dpi = 600, bbox_inches = 'tight') # 使得到的图片更完整清晰

def plot_label(title, font_size):
df_label = df.loc[:, title].value_counts()
dict_value = {}
for i in df_label.keys():
dict_value[i] = df_label[i]
plot_pie_optimize(title, dict_value, font_size)


plot_label("label", 5)
plot_label("flag", 6)
plot_label("protocol_type", 6)

展示部分数据

2.对“kddcup.testdata.unlabeled”数据集进行处理,获得统计信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("./data/kddcup.data.corrected-The full data set")
df.columns = labels

def plot_pie_optimize(title, dict_value: dict, font_size):
keys = list(dict_value.keys())
values = np.array(list(dict_value.values()))
patches, texts = plt.pie(values)
labels = ['{} {:1.2f}% ({})'.format(i, j, v) for i, j, v in
zip(keys, 100. * values / values.sum(), values)] # 构造图例数据
patches, labels, dummy = zip(*sorted(
zip(patches, labels, values), key=lambda x: x[2], reverse=True))
plt.legend(patches, labels, loc="center", bbox_to_anchor=(-0.1, 0.5), fontsize=font_size)
plt.title(title)
plt.tight_layout()
plt.savefig("./dataimg/{}.png".format(title), dpi = 600, bbox_inches = 'tight') # 使得到的图片更完整清晰

def plot_label(title, font_size):
df_label = df.loc[:, title].value_counts()
dict_value = {}
for i in df_label.keys():
dict_value[i] = df_label[i]
plot_pie_optimize(title, dict_value, font_size)


plot_label("label", 5)
plot_label("flag", 6)
plot_label("protocol_type", 6)

展示部分数据

3.根据“kddcup.data_10_percent”数据集(with label) 分出不同种类的攻击和正常数据,获得基本的统计信息

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("./data/corrected-Test data with corrected labels")
df.columns = labels

def plot_pie_optimize(title, dict_value: dict, font_size):
keys = list(dict_value.keys())
values = np.array(list(dict_value.values()))
patches, texts = plt.pie(values)
labels = ['{} {:1.2f}% ({})'.format(i, j, v) for i, j, v in
zip(keys, 100. * values / values.sum(), values)] # 构造图例数据
patches, labels, dummy = zip(*sorted(
zip(patches, labels, values), key=lambda x: x[2], reverse=True))
plt.legend(patches, labels, loc="center", bbox_to_anchor=(-0.1, 0.5), fontsize=font_size)
plt.title(title)
plt.tight_layout()
plt.savefig("./dataimg/{}.png".format(title), dpi = 600, bbox_inches = 'tight') # 使得到的图片更完整清晰


normal_label = ['normal']
dos_label = ['back','land','neptune','pod','smurf','teardrop','apache2','mailbomb','processtable','udpstorm']
probing_label = ['ipsweep','nmap','portsweep','satan','mscan','saint']
r2l_label = ['tp_write','guess_passwd','imap','multihop','phf','spy','warezclient','warezmaster','multihop','named','sendmail','snmpgetattack','worm','xlock','xsnoop']
u2r_label = ['buffer_overflow','loadmodule','perl','rootkit','ps','sqlattack','xterm','httptunnel']

final_label = ['Normal','Dos','Probing','R2l','U2r']
final_dict = {
"Normal": 0,
"Dos": 0,
"Probing": 0,
"R2l": 0,
"U2r": 0
}

for i in normal_label:
final_dict["Normal"] += df[df.values == i + '.'].__len__()

for i in dos_label:
final_dict["Dos"] += df[df.values == i + '.'].__len__()

for i in probing_label:
final_dict["Probing"] += df[df.values == i + '.'].__len__()

for i in r2l_label:
final_dict["R2l"] += df[df.values == i + '.'].__len__()

for i in u2r_label:
final_dict["U2r"] += df[df.values == i + '.'].__len__()

plot_pie_optimize("Attack data", final_dict, 8)

4.利⽤labeled数据集的统计特征,分析unlabeled数据集。假设KDD99的正常数据集服从正态分布,则在置信度为0.9的情况下,请检验unlabeled数据集是否属于正常数据?

提示:首先需要获得一个normal的数据样本集,然后需要排除掉数据集中无关features的影响,之后以处理后的数据集估计KDD99正常数据集的均值和方差,然后利用假设检验,对应的去检验unlabeled数据集的相关性质。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
import pandas as pd
import numpy as np

labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("./data/corrected-Test data with corrected labels")
df1 = pd.read_csv("./data/kddcup.data.corrected-The full data set")
df.columns = labels
df1.columns = labels

normal_label = ['normal']
dos_label = ['back','land','neptune','pod','smurf','teardrop','apache2','mailbomb','processtable','udpstorm']
probing_label = ['ipsweep','nmap','portsweep','satan','mscan','saint']
r2l_label = ['tp_write','guess_passwd','imap','multihop','phf','spy','warezclient','warezmaster','multihop','named','sendmail','snmpgetattack','worm','xlock','xsnoop']
u2r_label = ['buffer_overflow','loadmodule','perl','rootkit','ps','sqlattack','xterm','httptunnel']

final_label = ['Normal','Dos','Probing','R2l','U2r']
final_dict = {
"Normal": 0,
"Dos": 0,
"Probing": 0,
"R2l": 0,
"U2r": 0
}

for i in normal_label:
df[df.values == i +'.'].to_csv('./label/normal/'+ i +'.csv', header=0, index=0)
df1[df1.values == i +'.'].to_csv('./label1/normal/'+ i +'.csv', header=0, index=0)

for i in dos_label:
df[df.values == i +'.'].to_csv('./label/dos/'+ i +'.csv', header=0, index=0)
df1[df1.values == i +'.'].to_csv('./label1/dos/'+ i +'.csv', header=0, index=0)

for i in probing_label:
df[df.values == i +'.'].to_csv('./label/probing/'+ i +'.csv', header=0, index=0)
df1[df1.values == i +'.'].to_csv('./label1/probing/'+ i +'.csv', header=0, index=0)

for i in r2l_label:
df[df.values == i +'.'].to_csv('./label/r2l/'+ i +'.csv', header=0, index=0)
df1[df1.values == i +'.'].to_csv('./label1/r2l/'+ i +'.csv', header=0, index=0)

for i in u2r_label:
df[df.values == i +'.'].to_csv('./label/u2r/'+ i +'.csv', header=0, index=0)
df1[df1.values == i +'.'].to_csv('./label1/r2l/'+ i +'.csv', header=0, index=0)

先获得各种不同种类的数据

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import pandas as pd
import numpy as np
from scipy.stats import norm

labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("./label/normal/normal.csv")
df1 = pd.read_csv("./label1/normal/normal.csv")
df.columns = labels
df1.columns = labels

# 选择要分析的特征
feature = "duration"

# 计算 labeled 数据集中特征的均值和标准差
mean = df[feature].mean()
std = df[feature].std()

# 初始化概率和计数器
probability_sum = 0
count = 0

# 遍历 unlabeled 数据集中的每个数据点
for value in df1[feature]:
# 计算数据点在假设正常数据服从正态分布的情况下的概率
probability = norm.pdf(value, mean, std)
# 累加概率和计数器
probability_sum += probability
count += 1

# 计算平均概率
mean_probability = probability_sum / count
print(mean_probability)
# 将平均概率与所需的置信度进行比较
if mean_probability > 0.9:
print("Unlabeled dataset is likely to belong to normal data.")
else:
print("Unlabeled dataset is not likely to belong to normal data.")

0.0007079938317307032
Unlabeled dataset is not likely to belong to normal data.

五、实验过程

处理数据

参考菜鸟教程的示例,可发现pandas库对csv,也就是本实验给出的数据是很好处理的

1
2
3
4
5
import pandas as pd
labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("test.txt")
df.columns = labels
df_dict = df.loc[:,"duration"].value_counts()

从而根据案例编写出如下代码,可以得到duration行的统计数据

1
2
3
4
0    6
1 4
2 2
Name: duration, dtype: int64

对其进行简单处理

1
2
3
4
5
dict = {}
for i in df_dict.keys():
dict[i] = df_dict[i]
keys = np.array(list(dict.keys()))
values = np.array(list(dict.values()))

可以得到

1
2
keys: [0, 1, 2]
values: [6 4 2]

正好满足 Matplotlib 绘制饼图需要的数据

Matplotlib 饼图

根据菜鸟教程以下代码,可得到

1
2
3
4
5
6
7
8
9
10
11
import matplotlib.pyplot as plt
import numpy as np

y = np.array([35, 25, 25, 15])

plt.pie(y,
labels=['A','B','C','D'], # 设置饼图标签
colors=["#d5695d", "#5d8ca8", "#65a479", "#a564c9"], # 设置饼图颜色
)
plt.title("RUNOOB Pie Test") # 设置标题
plt.show()

在上面对测试数据的处理下对该代码进行改变可得到

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("test.txt")
df.columns = labels
df_dict = df.loc[:,"duration"].value_counts()
dict = {}
for i in df_dict.keys():
dict[i] = df_dict[i]
keys = np.array(list(dict.keys()))
values = np.array(list(dict.values()))
y = values

plt.pie(y,
labels=np.array(keys), # 设置饼图标签
colors=["#d5695d", "#5d8ca8", "#65a479"], # 设置饼图颜色
)
plt.title("Duration Pie Test") # 设置标题
plt.show()

而由于本题中的数据过多过杂,将数据名称显示在每个扇形的旁边是不现实的,需要有一个图例实现如下效果

参考 https://blog.csdn.net/weixin_35757704/article/details/126965333

根据其代码,可得到

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("test.txt")
df.columns = labels
df_dict = df.loc[:,"duration"].value_counts()
dict_value = {}
for i in df_dict.keys():
dict_value[i] = df_dict[i]
keys = list(dict_value.keys())
values = np.array(list(dict_value.values()))
patches, texts = plt.pie(values)
labels = ['{} {:1.2f}% ({})'.format(i, j, v) for i, j, v in
zip(keys, 100. * values / values.sum(), values)] # 构造图例数据
patches, labels, dummy = zip(*sorted(
zip(patches, labels, values), key=lambda x: x[2], reverse=True))
plt.legend(patches, labels, loc="center", bbox_to_anchor=(-0.1, 0.5), fontsize=8)
plt.tight_layout()
plt.show()

将其整合为两个函数并稍微改进

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

labels =["duration","protocol_type","service","flag","src_bytes","dot_bytes","land","wrong_fragment","urgent","hot","num_failed_logins","logged_in","num_compromised","su_attempted","numroot","num_file_creations","num_shells","num_access_files","num_outbound_cmds","is_hot_login","is_guest_login","count","srv_count","serror_rate","srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate","same_srv_rate","diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count","dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate","dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate","dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
df = pd.read_csv("./data/corrected-Test data with corrected labels")
df.columns = labels

def plot_pie_optimize(title, dict_value: dict, font_size):
keys = list(dict_value.keys())
values = np.array(list(dict_value.values()))
patches, texts = plt.pie(values)
labels = ['{} {:1.2f}% ({})'.format(i, j, v) for i, j, v in
zip(keys, 100. * values / values.sum(), values)] # 构造图例数据
patches, labels, dummy = zip(*sorted(
zip(patches, labels, values), key=lambda x: x[2], reverse=True))
plt.legend(patches, labels, loc="center", bbox_to_anchor=(-0.1, 0.5), fontsize=font_size)
plt.title(title)
plt.tight_layout()
plt.savefig("./dataimg/{}.png".format(title), dpi = 600, bbox_inches = 'tight') # 使得到的图片更完整清晰

def plot_label(title, font_size):
df_label = df.loc[:, title].value_counts()
dict_value = {}
for i in df_label.keys():
dict_value[i] = df_label[i]
plot_pie_optimize(title, dict_value, font_size)


plot_label("flag", 6)

参考链接

https://www.runoob.com/pandas/pandas-csv-file.html

https://www.runoob.com/w3cnote/matplotlib-tutorial.html

https://www.runoob.com/numpy/numpy-tutorial.html

https://www.runoob.com/matplotlib/matplotlib-pie.html

https://juejin.cn/post/7031475009688191007

https://blog.csdn.net/weixin_35757704/article/details/126965333

还有z神的实验报告🤗