ISP-DDoS: ISP-LEVEL DDoS DETECTION AND FILTERING

This document describes the datasets included in the SIGCOMM 2021 hackathon ISP-DDoS.

Basics

The datasets are derived from sampled Netflow records, which are collected on all switches at the FrontRange GigaPOP exchange (FRGP). The records have been upsampled, anonymized and processed to label the attack flows. The figure below shows the simplified network architecture.

Sampling

Netflow records are collected on all switches (shown in grey in the figure). Collection points are marked with yellow circles in the figure. Only internal links between switches are monitored to collect flows. Traffic is sampled by Netflow capture software at each collection point at a rate of either 1:100 or 1:4096, and the software converts the sampled packets into flow records. You can tell which rate was used for a flow from its packet count: if the count is evenly divisible by 100, the 1:100 rate was used, and if it is evenly divisible by 4096, the 1:4096 rate was used. (A count divisible by both, i.e., a multiple of 102,400, is consistent with either rate.) One link, leading to an upstream ISP (anonymized as uISP), hosts a commercial DDoS defense (anonymized as C). We have obtained alerts from C that indicate attack events targeting customers of FRGP.
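
The divisibility rule can be sketched in a few lines of Python. This is only an illustration: the function name infer_sampling_rate is hypothetical, and returning None for ambiguous or non-matching counts is an assumption, not something the dataset tooling is known to do.

```python
from typing import Optional

# The two sampling rates named in the text.
SAMPLING_RATES = (100, 4096)

def infer_sampling_rate(pkts: int) -> Optional[int]:
    """Guess the sampling rate from an upsampled packet count.

    Returns 100 or 4096 when exactly one rate divides the count evenly,
    and None when neither or both do (e.g. multiples of 102,400).
    """
    if pkts <= 0:
        return None
    matches = [rate for rate in SAMPLING_RATES if pkts % rate == 0]
    return matches[0] if len(matches) == 1 else None

print(infer_sampling_rate(4096))    # one packet sampled at 1:4096
print(infer_sampling_rate(300))     # three packets sampled at 1:100
print(infer_sampling_rate(102400))  # divisible by both rates
```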

Anonymization

Source and destination IPs are anonymized using CryptoPAN with a single key. This means that addresses are anonymized consistently across datasets, and that addresses within the same prefix remain within the same anonymized prefix.

Flow format

We release flow records as separate files, each covering 5 minutes of bi-directional traffic. Bi-directional flows will appear as two separate uni-directional flows. Each record's format is shown in the table below.

      field        meaning
      start_time   flow start in epoch time
      end_time     flow end in epoch time
      s_IP         anonymized source IP
      s_port       original source port
      d_IP         anonymized destination IP
      d_port       original destination port
      proto        protocol number from IANA
      flags        TCP flags, cumulative, converted into decimal format
                   (for ICMP flows, this field can be interpreted as
                   ICMP type and code)
      bytes        total bytes on the flow, upsampled
      pkts         total packets on the flow, upsampled
      label        A - attack, B - benign, N - not labeled
      ex_src       o - old source, n - new source, N - not labeled

Flow fields

Example:

      #start_time	end_time	s_IP		s_port	d_IP		d_port	proto	flags	bytes		pkts	label   ex_src
      1589324697	1589324697	14.181.72.246	64330	137.221.131.228	443	6	17	163840		4096	 B       o

This record means that there was a flow from 1589324697 to 1589324697 (in fact, only one packet was sampled, at time 1589324697) from source 14.181.72.246, port 64330, to destination 137.221.131.228, port 443. The protocol was TCP (6) and the flags value was 17 (16 for ACK plus 1 for FIN). The sampling rate for this flow was 1:4096, so the captured packet was 40 bytes long. Upsampled, this maps to 163,840 bytes and 4,096 packets on this flow. The flow was labeled as benign (B). The ex_src value o means that the source of the flow has sent benign traffic to the same destination in the past.
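
The same reading of the example record can be reproduced in code. The sketch below assumes that fields are separated by whitespace, as in the example, and that lines starting with # are headers. The names Flow, parse_flow and tcp_flag_names are hypothetical helpers; the bit values are the standard TCP flag bits.

```python
from dataclasses import dataclass
from typing import List

# Standard TCP flag bit values, lowest bit first.
TCP_FLAG_BITS = {1: "FIN", 2: "SYN", 4: "RST", 8: "PSH", 16: "ACK", 32: "URG"}

@dataclass
class Flow:
    start_time: int
    end_time: int
    s_ip: str
    s_port: int
    d_ip: str
    d_port: int
    proto: int
    flags: int
    n_bytes: int
    n_pkts: int
    label: str
    ex_src: str

def parse_flow(line: str) -> Flow:
    """Parse one whitespace-separated flow record (assumed layout)."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return Flow(int(f[0]), int(f[1]), f[2], int(f[3]), f[4], int(f[5]),
                int(f[6]), int(f[7]), int(f[8]), int(f[9]), f[10], f[11])

def tcp_flag_names(flags: int) -> List[str]:
    """Decode the cumulative decimal flags field into flag names."""
    return [name for bit, name in TCP_FLAG_BITS.items() if flags & bit]

line = "1589324697\t1589324697\t14.181.72.246\t64330\t137.221.131.228\t443\t6\t17\t163840\t4096\t B       o"
flow = parse_flow(line)
print(tcp_flag_names(flow.flags))      # flag names for value 17
print(flow.n_bytes // flow.n_pkts)     # average bytes per packet
```

Dividing the upsampled byte count by the upsampled packet count gives the average packet size, which for a single sampled packet is that packet's length.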

Attack alerts

Commercial attack alerts have the simplified format shown in the table below.

      field         meaning
      record type   C for commercial defense, I for our inferred record
      ID            unique ID for this record
      start_time    attack start, as seen by C, in epoch time (can be -1 if
                    we don't have mitigation end report)
      end_time      attack end, as seen by C, in epoch time
      target_p      anonymized target /24 prefix
      type          attack type, cumulative, shown as decimal (see attack
                    type table for further details)
      severity      low/medium/high

Alert fields

Attack types and their values as decimal numbers are shown below:

      type                       signature                        decimal value
      DNS Amplification          src port 53 and proto udp        1
      ICMP Flood                 proto icmp                       2
      Total Traffic                                               4
      IP Fragmentation           src port 0                       8
      CLDAP Amplification        src port 389                     16
      TCP SYNACK Amplification   proto tcp and flags & 18 != 0    32
      TCP RST Flood              proto tcp and flags & 1 != 0     64
      UDP Flood                  proto udp                        128
      NTP Amplification          proto udp and src port 123       256
      mDNS Amplification         src port 5353 and proto udp      2048
      TCP SYN Flood              proto tcp and flags & 2 != 0     8192
      Chargen Amplification      src port 19 and proto udp        16384
      L2TP Amplification         src port 1701 and proto udp      32768
      Memcached Amplification    src port 11211 and proto udp     65536
      DNS Flood                  dst port 53 and proto udp        131072
      RPCbind Amplification      src port 111 and proto udp       262144
      TCP ACK Flood              proto tcp and flags & 16 != 0    524288

Attack type table
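
As an illustration, the signatures in the attack type table can be written as predicates over a flow record. The sketch below transcribes each signature as printed in the table and uses the IANA protocol numbers for TCP (6), UDP (17) and ICMP (1). Total Traffic has no signature in the table and is left out. The dictionary keys and the function name signature_bits are hypothetical.

```python
TCP, UDP, ICMP = 6, 17, 1  # IANA protocol numbers

# (decimal value, predicate) pairs, copied from the attack type table.
SIGNATURES = [
    (1,      lambda f: f["s_port"] == 53 and f["proto"] == UDP),
    (2,      lambda f: f["proto"] == ICMP),
    (8,      lambda f: f["s_port"] == 0),
    (16,     lambda f: f["s_port"] == 389),
    (32,     lambda f: f["proto"] == TCP and f["flags"] & 18 != 0),
    (64,     lambda f: f["proto"] == TCP and f["flags"] & 1 != 0),
    (128,    lambda f: f["proto"] == UDP),
    (256,    lambda f: f["proto"] == UDP and f["s_port"] == 123),
    (2048,   lambda f: f["s_port"] == 5353 and f["proto"] == UDP),
    (8192,   lambda f: f["proto"] == TCP and f["flags"] & 2 != 0),
    (16384,  lambda f: f["s_port"] == 19 and f["proto"] == UDP),
    (32768,  lambda f: f["s_port"] == 1701 and f["proto"] == UDP),
    (65536,  lambda f: f["s_port"] == 11211 and f["proto"] == UDP),
    (131072, lambda f: f["d_port"] == 53 and f["proto"] == UDP),
    (262144, lambda f: f["s_port"] == 111 and f["proto"] == UDP),
    (524288, lambda f: f["proto"] == TCP and f["flags"] & 16 != 0),
]

def signature_bits(flow: dict) -> int:
    """Return the sum of the values of all signatures the flow matches."""
    bits = 0
    for value, matches in SIGNATURES:
        if matches(flow):
            bits |= value
    return bits

# A UDP flow from source port 53 matches DNS Amplification and UDP Flood.
dns_reply = {"s_port": 53, "d_port": 40000, "proto": UDP, "flags": 0}
print(signature_bits(dns_reply))
```

A flow can match several signatures at once, which is why the result is a combined value rather than a single type.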

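Attack types are encoded cumulatively, as the sum of the decimal values of all types present in the attack. For example, an attack with fragmented flows and CLDAP amplification flows is encoded as type 8+16=24. Since each value is a distinct power of two, the sum can be decoded unambiguously.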
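
A cumulative type value can be decoded by testing each bit. The following is a minimal sketch; the function name decode_attack_type is hypothetical, and reporting bits that are absent from the table as "unknown bit N" is an assumption made for illustration.

```python
from typing import List

# Decimal values and names from the attack type table.
ATTACK_TYPES = {
    1: "DNS Amplification",
    2: "ICMP Flood",
    4: "Total Traffic",
    8: "IP Fragmentation",
    16: "CLDAP Amplification",
    32: "TCP SYNACK Amplification",
    64: "TCP RST Flood",
    128: "UDP Flood",
    256: "NTP Amplification",
    2048: "mDNS Amplification",
    8192: "TCP SYN Flood",
    16384: "Chargen Amplification",
    32768: "L2TP Amplification",
    65536: "Memcached Amplification",
    131072: "DNS Flood",
    262144: "RPCbind Amplification",
    524288: "TCP ACK Flood",
}

def decode_attack_type(value: int) -> List[str]:
    """Split a cumulative type value into the names of its set bits."""
    names = []
    bit = 1
    while bit <= value:
        if value & bit:
            names.append(ATTACK_TYPES.get(bit, f"unknown bit {bit}"))
        bit <<= 1
    return names

print(decode_attack_type(24))
```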

Labeling

Four confounders prevent us from labeling traffic directly from attack alerts. First, alert start and stop times may lag behind the actual onset of attacks, since commercial defenses delay alerts in some cases to reduce false positives. Second, the link observed by C differs from the set of links that generate Netflow records, so some attacks may be observable in our records but not by C, and vice versa. Third, our traffic records are generated from sampled traffic (in some cases double-sampled) and may therefore be skewed. Fourth, our attack alerts are anonymized at the prefix level, but our flows are anonymized at the IP level.

We label traffic by performing the following steps:

  1. We extract the number of unique sources, bytes and packets going to each unique destination within anonymized FRGP prefixes, per traffic category, where the categories match those in the attack type table.
  2. We calculate CUSUM for the number of unique sources, bytes and packets per traffic category, and sum the three CUSUM values into an aggregate value per category.
  3. We tag a traffic category as anomalous when the aggregate value crosses a conservative threshold (30).
  4. We flag events with a sustained anomaly as potential attack events.
  5. We match potential attack events to attack alerts from C if their target prefixes match and the intersection of their durations is non-empty. We then report the target, start, stop and type of the matched attack events.
  6. If any attack alerts from C remain unmatched, we look for them manually and match them if possible.
  7. We use the final list of matched attack events to flag all matching flow records (those whose target and signature match the event and whose duration overlaps the attack event's duration) as attack flows. The rest of the flows are tagged as benign.
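
Steps 2 to 4 can be sketched as follows. The text does not specify the CUSUM variant, the normalization or the minimum duration of a sustained anomaly, so everything beyond the threshold of 30 is an assumption: this sketch uses a one-sided CUSUM of deviations from a baseline mean, scaled by the baseline standard deviation, and treats two or more consecutive bins as sustained. The names cusum, aggregate_cusum and sustained_runs, and the parameters baseline_len and min_bins, are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple

THRESHOLD = 30  # the conservative threshold given in step 3

def cusum(series: Sequence[float], baseline_len: int = 5) -> List[float]:
    """One-sided CUSUM of standardized deviations from a baseline.

    The baseline mean and standard deviation come from the first
    baseline_len samples (an assumption made for this sketch).
    """
    base = series[:baseline_len]
    mu = statistics.mean(base)
    sigma = statistics.pstdev(base) or 1.0  # avoid dividing by zero
    s, out = 0.0, []
    for x in series:
        s = max(0.0, s + (x - mu) / sigma)
        out.append(s)
    return out

def aggregate_cusum(sources, n_bytes, n_pkts) -> List[float]:
    """Sum the three per-measure CUSUM values for each time bin (step 2)."""
    return [a + b + c
            for a, b, c in zip(cusum(sources), cusum(n_bytes), cusum(n_pkts))]

def sustained_runs(agg, threshold=THRESHOLD, min_bins=2) -> List[Tuple[int, int]]:
    """Return (first_bin, last_bin) for runs above the threshold (steps 3-4)."""
    runs, start = [], None
    for i, v in enumerate(list(agg) + [0.0]):  # sentinel closes a trailing run
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            if i - start >= min_bins:
                runs.append((start, i - 1))
            start = None
    return runs

flat_then_spike = [10] * 6 + [200, 200]
agg = aggregate_cusum(flat_then_spike, flat_then_spike, flat_then_spike)
print(sustained_runs(agg))
```

Standardizing each measure before summing keeps bytes from dominating the aggregate simply because byte counts are numerically larger than source counts.
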

We also release our matched alerts. The table below details their fields. We show the matched alerts along with C's alerts.

      field         meaning
      record type   C for commercial defense, I for our inferred record
      ID            unique ID for this record
      start_time    attack start, inferred, in epoch time
      end_time      attack end, inferred, in epoch time
      target        target IP, inferred
      type          attack type, inferred, cumulative, shown as decimal
                    (see attack type table for further details)
      cID           unique ID of commercial attack record that matches
                    this inferred record

Alert match fields
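
The matching rule in step 5 (same target prefix, overlapping durations) can be sketched as below. The textual form of target_p is not shown in this description, so the "a.b.c.0/24" notation is an assumption, as is treating a time of -1 as an open-ended bound. The dictionary keys follow the field names in the tables; the function names are hypothetical.

```python
import ipaddress

def _bounds(start: int, end: int):
    """Treat -1 as an unknown, open-ended bound (an assumption)."""
    lo = float("-inf") if start == -1 else start
    hi = float("inf") if end == -1 else end
    return lo, hi

def durations_overlap(s1: int, e1: int, s2: int, e2: int) -> bool:
    """True when the two closed time intervals share at least one instant."""
    lo1, hi1 = _bounds(s1, e1)
    lo2, hi2 = _bounds(s2, e2)
    return max(lo1, lo2) <= min(hi1, hi2)

def event_matches_alert(event: dict, alert: dict) -> bool:
    """Check an inferred event (target IP) against a C alert (/24 prefix)."""
    target = ipaddress.ip_address(event["target"])
    prefix = ipaddress.ip_network(alert["target_p"], strict=False)
    return target in prefix and durations_overlap(
        event["start_time"], event["end_time"],
        alert["start_time"], alert["end_time"])

event = {"target": "137.221.131.228", "start_time": 100, "end_time": 200}
alert = {"target_p": "137.221.131.0/24", "start_time": 150, "end_time": 400}
print(event_matches_alert(event, alert))
```

Because anonymization preserves prefixes, an anonymized target IP can be compared against an anonymized /24 prefix directly.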

Hackathon Goals

The main goal of the hackathon is to fully label the training, validation and test sets with all attacks. We will then work to release the labeled data to DDoS researchers. Accurately labeling DDoS data is challenging, because there are no community-established criteria for what constitutes a DDoS attack. We will start the hackathon by discussing what would make good criteria. Some options are listed below and will be updated during the hackathon:
  1. Large spike in traffic volume or packet count to a given destination
    • how large?
    • relative or absolute?
    • spike in all traffic or subset of traffic?
    • which subsets are of interest? which attack types to focus on?
  2. Many new sources sending data to a given destination
    • how many?
    • new in what preceding period?
  3. Is spike in one measure enough (e.g., volume) or do we need spikes in multiple measures (e.g., volume, packets, sources)?
  4. Does duration of the spike matter?
  5. Does bi-directionality of traffic matter?

Additionally, we would like to assemble a collection of algorithms that may be helpful in accurately labeling DDoS attack data. We expect the participants to develop their own attack detection approaches using the data we will release. If we (as a community) label more data during the hackathon, participants are welcome to revise their algorithms as needed. All code should be uploaded to our shared Google Drive prior to evaluation. The code should ingest all flow fields except the label and produce flows with B or A labels.
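
The input and output contract described above could look like the following sketch. It assumes whitespace-separated records with the label in the eleventh column, as in the flow format example. The names label_flows and accuracy are hypothetical, and the always-benign classifier is only a placeholder for a real detection approach.

```python
from typing import Callable, List, Optional

LABEL_IDX = 10  # position of the label column in a flow record

def _records(lines: List[str]) -> List[List[str]]:
    """Split data lines into fields, skipping blank and # header lines."""
    return [l.split() for l in lines
            if l.strip() and not l.lstrip().startswith("#")]

def label_flows(lines: List[str],
                classify: Callable[[List[str]], str]) -> List[List[str]]:
    """Hide the label from the classifier and write its A/B verdict back."""
    labeled = []
    for fields in _records(lines):
        features = fields[:LABEL_IDX] + fields[LABEL_IDX + 1:]
        verdict = classify(features)
        if verdict not in ("A", "B"):
            raise ValueError(f"classifier must return A or B, got {verdict!r}")
        labeled.append(fields[:LABEL_IDX] + [verdict] + fields[LABEL_IDX + 1:])
    return labeled

def accuracy(lines: List[str], labeled: List[List[str]]) -> Optional[float]:
    """Fraction of A/B-labeled flows predicted correctly (N is ignored)."""
    pairs = [(t[LABEL_IDX], p[LABEL_IDX])
             for t, p in zip(_records(lines), labeled)
             if t[LABEL_IDX] in ("A", "B")]
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else None

lines = [
    "#start_time end_time s_IP s_port d_IP d_port proto flags bytes pkts label ex_src",
    "1589324697 1589324697 14.181.72.246 64330 137.221.131.228 443 6 17 163840 4096 B o",
]
predictions = label_flows(lines, lambda features: "B")  # placeholder classifier
print(accuracy(lines, predictions))
```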

Hackathon Instructions

Please follow these steps to participate in the hackathon.
  1. Sign our Memorandum of Understanding outlining acceptable use of the data. E-mail the completed form to sunshine@isi.edu.
  2. You will receive a download link for the training/validation dataset. Download the data onto a secure machine and carefully study the dataset description above.
  3. Work on training/validation of your approach. You can use any hardware and software you have access to. Bear in mind that real data is noisy. There may very well be more attacks in the data than we could label.
  4. We will add all participants to the isp-ddos at mailman.isi.edu mailing list. Please use this mailing list to report new findings. You can:
  5. Remember: you cannot share this data with anyone else.
  6. Remember: you must delete the dataset upon completion of the hackathon. If you want to use this dataset for your research or teaching, please contact us. We are developing mechanisms to share the dataset with a broad research community.
  7. Comment your code and upload it to the shared Google Drive prior to the hackathon. You can keep updating it until the end of the hackathon.
  8. Release your validation accuracy by filling out this document prior to the hackathon. You can keep updating it until the end of the hackathon.