We were first alerted to the problems inherent in the KDDCup99 dataset by this feature on KDnuggets from Terry Brugger. Although he makes some mistakes[1] in conflating it with aspects of the original DARPA/Lincoln Labs dataset, he points to legitimate research on it. The most thorough analysis can be found in Tavallaee et al.
The problems with KDDCup99 identified in Tavallaee et al. can be summarized as follows:
- Severe and unrealistic class imbalance
- A very large share of duplicate samples (75-78%)
- Most samples are so easily classified that benchmarks based on accuracy or false positive rate lose discriminative power
The first two problems make it harder to learn useful models from the dataset, while the third prevents us from measuring the false positive rate precisely. The first two are of less concern to us, since we managed to develop a strong model despite them.
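To make the first two problems concrete, the sketch below measures the duplicate rate and the class distribution directly. It is only an illustration: the file name, the use of pandas, and the assumption that the label sits in the last column are ours, not part of the original pipeline.

```python
# Sketch: quantify the duplicate-record and class-imbalance problems in KDDCup99.
# Assumes the standard 10% training file ("kddcup.data_10_percent") is in the
# working directory with no header row and the connection label in the last column.
import pandas as pd

df = pd.read_csv("kddcup.data_10_percent", header=None)
label_col = df.columns[-1]  # last column holds the connection label

# Duplicate rate: Tavallaee et al. report roughly 75-78% duplicate records.
dup_rate = df.duplicated().mean()
print(f"duplicate records: {dup_rate:.1%}")

# Class imbalance: a handful of attack types dominate the label distribution.
print(df[label_col].value_counts(normalize=True).head(10))

# A deduplicated copy, in the spirit of what NSL-KDD does.
df_unique = df.drop_duplicates()
print(f"records after deduplication: {len(df_unique)} of {len(df)}")
```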
The third, however, is cause for concern, and it is why we plan to reevaluate our model on the improved NSL-KDD dataset.
Since the false positive rate directly determines recall on the benign class, our recall results are harder to interpret. However, our superior performance on precision suggests that the ensemble of deep nets developed here is viable as a powerful IDS.
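To spell out how these metrics relate, the helper below collapses the labels to a binary benign-vs-malicious split and computes precision, recall, and false positive rate from raw confusion-matrix counts. The counts shown are placeholders for illustration, not results from this work; note that recall on the benign class is exactly one minus the false positive rate.

```python
# Sketch of the metric relationships discussed above (binary benign vs. malicious).
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall on each class, and false positive rate from raw counts."""
    return {
        "precision": tp / (tp + fp),          # lowered by false positives
        "recall_attack": tp / (tp + fn),      # driven by missed attacks
        "recall_benign": tn / (tn + fp),      # equals 1 - false positive rate
        "false_positive_rate": fp / (fp + tn),
    }

# Placeholder counts, purely illustrative.
print(binary_metrics(tp=9500, fp=40, tn=5800, fn=120))
```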
[1] For example, TTL, which was known to nearly linearly separate benign from malicious connections on its own, was included as a feature in the DARPA data. Brugger claims this is true of KDDCup99 as well, which is false according to Tavallaee et al.