ISP-DDoS: ISP-LEVEL DDoS DETECTION AND FILTERING

This document describes the datasets included in the SIGCOMM 2021 hackathon ISP-DDoS.

Basics about data collection
How data was labeled
Hackathon goals
Hackathon instructions

Basics

The datasets are derived from sampled NetFlow records, which are collected on all switches at the FrontRange GigaPOP exchange (FRGP). The records have been upsampled, anonymized and processed to label the attack flows. The figure below shows the simplified network architecture.

field	meaning
start_time	flow start in epoch time
end_time	flow end in epoch time
s_IP	anonymized source IP
s_port	original source port
d_IP	anonymized destination IP
d_port	original destination port
proto	protocol number from IANA
flags	TCP flags, cumulative (for ICMP flows, this field can be interpreted as ICMP type and code), converted to decimal format
bytes	total bytes on the flow, upsampled
pkts	total packets on the flow, upsampled
label	A - attack, B - benign, N - not labeled
ex_src	o - old source, n - new source, N - not labeled
Flow fields

Example:

      #start_time	end_time	s_IP		s_port	d_IP		d_port	proto	flags	bytes		pkts	label   ex_src
      1589324697	1589324697	14.181.72.246	64330	137.221.131.228	443	6	17	163840		4096	 B       o

This means there was a flow from 1589324697 to 1589324697 (in reality, only one packet was sampled, at time 1589324697) from source 14.181.72.246 port 64330 to destination 137.221.131.228 port 443. The protocol was TCP (protocol number 6) and the flags value was 17 (16 for ACK plus 1 for FIN). The sampling rate for this flow was 1:4096 packets and the captured packet was 40 bytes long. Upsampled, this maps into 163,840 bytes and 4,096 packets on this flow. The flow was marked as benign. The ex_src value o indicates that the source of the flow has sent benign traffic to the same destination in the past.
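The walk-through above can be reproduced with a short script. This is a minimal sketch, assuming records are whitespace-separated in the order given in the flow fields table; the helper names parse_flow and decode_tcp_flags are hypothetical and not part of the dataset tooling.

```python
# Field order follows the flow fields table; whitespace-separated layout is assumed.
FIELDS = ["start_time", "end_time", "s_IP", "s_port", "d_IP", "d_port",
          "proto", "flags", "bytes", "pkts", "label", "ex_src"]
INT_FIELDS = {"start_time", "end_time", "s_port", "d_port",
              "proto", "flags", "bytes", "pkts"}

# Standard TCP flag bits as they appear in the cumulative decimal flags value.
TCP_FLAGS = {1: "FIN", 2: "SYN", 4: "RST", 8: "PSH", 16: "ACK", 32: "URG"}


def parse_flow(line):
    """Split one flow record into a dict keyed by field name (hypothetical helper)."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    flow = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        flow[name] = int(flow[name])
    return flow


def decode_tcp_flags(flags):
    """Return the names of the TCP flag bits set in a decimal flags value."""
    return [name for bit, name in sorted(TCP_FLAGS.items()) if flags & bit]


line = ("1589324697 1589324697 14.181.72.246 64330 "
        "137.221.131.228 443 6 17 163840 4096 B o")
flow = parse_flow(line)
print(decode_tcp_flags(flow["flags"]))  # ['FIN', 'ACK']
print(flow["bytes"] // flow["pkts"])    # 40, the average bytes per packet
```

Dividing the upsampled bytes by the upsampled packets recovers the 40-byte size of the single sampled packet, because both counters were scaled by the same factor of 4096.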

Attack alerts

Commercial attack alerts have the simplified format shown in the table below.

field	meaning
record type	C for commercial defense, I for our inferred record
ID	unique ID for this record
start_time	attack start, as seen by C in epoch time (can be -1 if we don't have mitigation end report)
end_time	attack end, as seen by C in epoch time
target_p	anonymized target /24 prefix
type	attack type, cumulative, shown as decimal (see attack type table for further details)
severity	low/medium/high
Alert fields

Attack types and their values as decimal numbers are shown below:

type	signature	decimal value
DNS Amplification	src port 53 and proto udp	1
ICMP Flood	proto icmp	2
Total Traffic		4
IP Fragmentation	src port 0	8
CLDAP Amplification	src port 389	16
TCP SYNACK Amplification	proto tcp and flags & 18 != 0	32
TCP RST Flood	proto tcp and flags & 1 != 0	64
UDP Flood	proto udp	128
NTP Amplification	proto udp and src port 123	256
mDNS Amplification	src port 5353 and proto udp	2048
TCP SYN Flood	proto tcp and flags & 2 != 0	8192
Chargen Amplification	src port 19 and proto udp	16384
L2TP Amplification	src port 1701 and proto udp	32768
Memcached Amplification	src port 11211 and proto udp	65536
DNS Flood	dst port 53 and proto udp	131072
RPCbind Amplification	src port 111 and proto udp	262144
TCP ACK Flood	proto tcp and flags & 16 != 0	524288
Attack type table
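The signatures in the attack type table can be read as predicates over flow fields. The sketch below transcribes them literally, assuming flows are dicts with the field names from the flow fields table. The names SIGNATURES, udp_src, tcp_flag and flow_type_mask are hypothetical. Total Traffic (4) has no signature in the table, so it is left out rather than guessed.

```python
TCP, UDP, ICMP = 6, 17, 1  # IANA protocol numbers


def udp_src(port):
    """Predicate: UDP flow with the given source port."""
    return lambda f: f["proto"] == UDP and f["s_port"] == port


def tcp_flag(mask):
    """Predicate: TCP flow with at least one bit of mask set in flags."""
    return lambda f: f["proto"] == TCP and (f["flags"] & mask) != 0


# Decimal type value -> signature, transcribed as written in the table.
SIGNATURES = {
    1: udp_src(53),                     # DNS Amplification
    2: lambda f: f["proto"] == ICMP,    # ICMP Flood
    8: lambda f: f["s_port"] == 0,      # IP Fragmentation
    16: lambda f: f["s_port"] == 389,   # CLDAP Amplification
    32: tcp_flag(18),                   # TCP SYNACK Amplification
    64: tcp_flag(1),                    # TCP RST Flood
    128: lambda f: f["proto"] == UDP,   # UDP Flood
    256: udp_src(123),                  # NTP Amplification
    2048: udp_src(5353),                # mDNS Amplification
    8192: tcp_flag(2),                  # TCP SYN Flood
    16384: udp_src(19),                 # Chargen Amplification
    32768: udp_src(1701),               # L2TP Amplification
    65536: udp_src(11211),              # Memcached Amplification
    131072: lambda f: f["proto"] == UDP and f["d_port"] == 53,  # DNS Flood
    262144: udp_src(111),               # RPCbind Amplification
    524288: tcp_flag(16),               # TCP ACK Flood
}


def flow_type_mask(flow):
    """OR together the values of all signatures that the flow matches."""
    mask = 0
    for value, matches in SIGNATURES.items():
        if matches(flow):
            mask |= value
    return mask


# A made-up UDP flow from source port 53 matches DNS Amplification and UDP Flood.
example = {"proto": UDP, "s_port": 53, "d_port": 40000, "flags": 0}
print(flow_type_mask(example))  # 129
```

A single flow can match several categories at once, which is why the result is a bitmask rather than one type.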

Attack types are encoded as cumulative values, e.g., if an attack has fragmented flows and CLDAP amplification flows it will be encoded as type 8+16=24.
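Because each attack type is a distinct power of two, a cumulative type value can be split back into its components with bitwise tests. A minimal sketch follows; the names ATTACK_TYPES and decode_attack_type are hypothetical, and the values are copied from the attack type table.

```python
# Decimal values and names copied from the attack type table.
ATTACK_TYPES = {
    1: "DNS Amplification",
    2: "ICMP Flood",
    4: "Total Traffic",
    8: "IP Fragmentation",
    16: "CLDAP Amplification",
    32: "TCP SYNACK Amplification",
    64: "TCP RST Flood",
    128: "UDP Flood",
    256: "NTP Amplification",
    2048: "mDNS Amplification",
    8192: "TCP SYN Flood",
    16384: "Chargen Amplification",
    32768: "L2TP Amplification",
    65536: "Memcached Amplification",
    131072: "DNS Flood",
    262144: "RPCbind Amplification",
    524288: "TCP ACK Flood",
}


def decode_attack_type(value):
    """Return the attack type names whose bits are set in a cumulative value."""
    known = 0
    for bit in ATTACK_TYPES:
        known |= bit
    if value & ~known:
        raise ValueError(f"value {value} contains bits not in the table")
    return [name for bit, name in sorted(ATTACK_TYPES.items()) if value & bit]


print(decode_attack_type(24))  # ['IP Fragmentation', 'CLDAP Amplification']
```

Values such as 512 or 1024 do not appear in the table, so the sketch rejects them instead of silently ignoring them.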

Labeling

There are four confounders that prevent us from labeling traffic directly based on attack alerts. First, alert start and stop times may lag behind the actual attack, since commercial defenses delay alerts in some cases to reduce false positives. Second, the link observed by C differs from the set of links generating NetFlow records, so some attacks may be observable in our records but not by C, and vice versa. Third, our traffic records are generated from sampled traffic (and in some cases may be double-sampled) and therefore may be skewed. Fourth, our attack alerts are anonymized at the prefix level, but our flows are anonymized at the IP level.

We label traffic by performing the following steps:

We extract the number of unique sources, bytes and packets going to each unique destination within anonymized FRGP prefixes, per traffic category matching those in the attack type table.
We calculate CUSUM for the number of unique sources, bytes and packets per traffic category, and sum the three CUSUM values into an aggregate value per category.
We tag a traffic category as anomalous when the aggregate value crosses a conservative threshold (30).
We flag events where there is a sustained anomaly as potential attack events.
We match potential attack events to attack alerts from C if their target prefix matches and the intersection of their durations is non-empty. We then report the target, start, stop and type of matched attack events.
If any attack alerts from C are not matched, we look for them manually and match them if possible.
We use the final list of matched attack events to flag all matching flow records (those whose target and signature match the event and whose duration overlaps the attack event's duration) as attack flows. The rest of the flows are tagged as benign.
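The CUSUM and threshold steps above can be illustrated as follows. This is only a sketch under stated assumptions, not the organizers' implementation: the text does not specify the CUSUM variant, the baseline, the normalization or what counts as "sustained", so the one-sided CUSUM, the warm-up baseline (WARMUP), the slack value and the minimum run length (min_len) are all hypothetical choices. Only the threshold of 30 comes from the text.

```python
from statistics import mean, pstdev

THRESHOLD = 30  # threshold stated in the labeling steps
WARMUP = 5      # hypothetical: number of initial bins used as the baseline


def standardize(series, warmup=WARMUP):
    """Scale a series by the mean and deviation of its warm-up bins (assumption)."""
    base = series[:warmup]
    mu = mean(base)
    sigma = pstdev(base) or 1.0  # avoid dividing by zero on a flat baseline
    return [(x - mu) / sigma for x in series]


def cusum(series, slack=0.5):
    """One-sided CUSUM: accumulate deviations above slack, never below zero."""
    total, out = 0.0, []
    for x in series:
        total = max(0.0, total + x - slack)
        out.append(total)
    return out


def anomalous_bins(sources, byte_counts, pkt_counts, threshold=THRESHOLD):
    """Sum the three per-measure CUSUM values and compare with the threshold."""
    parts = [cusum(standardize(s)) for s in (sources, byte_counts, pkt_counts)]
    return [sum(vals) > threshold for vals in zip(*parts)]


def sustained_runs(flags, min_len=3):
    """Return (first, last) bin indices of runs of at least min_len anomalous bins."""
    runs, start = [], None
    for i, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs


# Made-up per-bin counts for one destination and one traffic category.
sources = [10, 10, 10, 10, 10, 10, 200, 220, 210]
byte_counts = [1000, 1000, 1000, 1000, 1000, 1000, 50000, 60000, 55000]
pkt_counts = [20, 20, 20, 20, 20, 20, 900, 950, 920]

flags = anomalous_bins(sources, byte_counts, pkt_counts)
print(sustained_runs(flags))
```

Standardizing each measure before summing keeps the byte counts from dominating the aggregate; a different normalization would shift which bins cross the threshold.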

We also release our matched alerts. The table below details their fields. We show the matched alerts along with C's alerts.

field	meaning
record type	C for commercial defense, I for our inferred record
ID	unique ID for this record
start_time	attack start, inferred in epoch time
end_time	attack end, inferred in epoch time
target	target IP, inferred
type	attack type, inferred, cumulative, shown as decimal (see attack type table for further details)
cID	unique ID of commercial attack record that matches this inferred record
Alert match fields
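The matching rule from the labeling steps (same target prefix and a non-empty intersection of durations) can be sketched as below. The record layout is an assumption: alerts are dicts with the alert fields, inferred events are dicts with the alert match fields, and target_p is assumed to be written in CIDR notation. The addresses, IDs and the function names overlaps and match_events are made up for illustration. Records with a timestamp of -1 are not handled here and would need separate treatment.

```python
import ipaddress


def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] intersect."""
    return max(a_start, b_start) <= min(a_end, b_end)


def match_events(events, alerts):
    """Pair inferred events with commercial alerts by prefix and time overlap."""
    matches = []
    for ev in events:
        target = ipaddress.ip_address(ev["target"])
        for al in alerts:
            prefix = ipaddress.ip_network(al["target_p"])
            if target in prefix and overlaps(ev["start_time"], ev["end_time"],
                                             al["start_time"], al["end_time"]):
                matches.append((ev["ID"], al["ID"]))
    return matches


# Made-up records using documentation address ranges.
alerts = [
    {"ID": "c1", "target_p": "192.0.2.0/24", "start_time": 1000, "end_time": 2000},
]
events = [
    {"ID": "i1", "target": "192.0.2.17", "start_time": 900, "end_time": 1100},
    {"ID": "i2", "target": "192.0.2.17", "start_time": 2100, "end_time": 2200},
    {"ID": "i3", "target": "198.51.100.5", "start_time": 1000, "end_time": 1500},
]
print(match_events(events, alerts))  # [('i1', 'c1')]
```

Event i2 targets the right prefix but does not overlap the alert in time, and i3 overlaps in time but targets a different prefix, so only i1 is matched.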

Hackathon Goals

The main goal of the hackathon is to fully label the training, validation and test sets with all attacks. We will then work to release the labeled data to DDoS researchers. Accurately labeling DDoS data is challenging, because there are no community-established criteria for what constitutes a DDoS attack. We will start the hackathon by discussing what would make good criteria. Some options are listed below and will be updated during the hackathon:

Large spike in traffic volume or packet count to a given destination
- how large?
- relative or absolute?
- spike in all traffic or subset of traffic?
- which subsets are of interest? which attack types to focus on?
Many new sources sending data to a given destination
- how many?
- new in what preceding period?
Is spike in one measure enough (e.g., volume) or do we need spikes in multiple measures (e.g., volume, packets, sources)?
Does duration of the spike matter?
Does bi-directionality of traffic matter?

Additionally, we would like to assemble a collection of algorithms that may be helpful in accurately labeling DDoS attack data. We expect the participants to develop their own attack detection approaches using the data we will release. If we (as a community) label more data during the hackathon, participants are welcome to revise their algorithms as needed. All code should be uploaded to our shared Google Drive prior to evaluation. The code should ingest all flow fields except the label and produce flows with B or A labels.
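The input and output contract described above might look like the following skeleton. It is a sketch only: the function name label_flows, the classifier interface and the always_benign placeholder are hypothetical, and records are assumed to be whitespace-separated with the label in the eleventh position, as in the flow example.

```python
LABEL_INDEX = 10  # position of the label field in a flow record (0-based)


def label_flows(lines, classify):
    """Hide the label from the classifier and emit each flow with an A or B label."""
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and the header
        fields = line.split()
        features = fields[:LABEL_INDEX] + fields[LABEL_INDEX + 1:]
        label = classify(features)
        if label not in ("A", "B"):
            raise ValueError(f"classifier returned {label!r}, expected 'A' or 'B'")
        yield fields[:LABEL_INDEX] + [label] + fields[LABEL_INDEX + 1:]


def always_benign(features):
    """Placeholder classifier; a real approach would inspect the features."""
    return "B"


sample = [
    "#start_time end_time s_IP s_port d_IP d_port proto flags bytes pkts label ex_src",
    "1589324697 1589324697 14.181.72.246 64330 137.221.131.228 443 6 17 163840 4096 N o",
]
for row in label_flows(sample, always_benign):
    print("\t".join(row))
```

Keeping the classifier as a separate function makes it easy to swap in a revised approach as more data is labeled during the hackathon.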

Hackathon Instructions

Please follow these steps to participate in the hackathon.

Sign our Memorandum of Understanding outlining acceptable use of the data. E-mail the completed form to sunshine@isi.edu.
You will receive a download link for the training/validation dataset. Download the data onto a secure machine and carefully study the dataset description above.
Work on training/validation of your approach. You can use any hardware and software you have access to. Bear in mind that real data is noisy. There may very well be more attacks in the data than we could label.
We will add all participants to the isp-ddos at mailman.isi.edu mailing list. Please use this mailing list to report new findings. You can:
- Report any new attacks you find. Please tailor your attack alert format to match the current format outlined in the alert fields table. Also include some reasoning: why do you believe this is an attack? Feel free to include graphs, statistics, etc.
- Vote for new attacks reported by others.
- Dispute either C's attack alerts or new attacks reported by others. Include your reasoning.
- Ask for and offer clarification about the dataset.
Remember: you cannot share this data with anyone else.
Remember: you must delete the dataset upon completion of the hackathon. If you want to use this dataset for your research or teaching, please contact us. We are developing mechanisms to share the dataset with a broad research community.
Comment your code and release it on the shared Google Drive prior to the hackathon. You can keep updating it until the end of the hackathon.
Release your validation accuracy by filling out this document prior to the hackathon. You can keep updating it until the end of the hackathon.
