Long-lived Internet Flows

This web page documents the format of the long-lived Internet flows data. Our dataset is available upon request.

Dataset Contents

We collect IP flow records with durations ranging from seconds to days and weeks. Here a flow is identified by the standard five-tuple: <source IP, source port, protocol number, destination IP, destination port>.

We organize the flow records hierarchically, into different directories as shown below:

level   flow duration (up to)
r0      10 minutes
r1      20 minutes
r2      40 minutes
...     ...
rn      10*2^n minutes

The duration covered by a flow file doubles with each level, starting from a base duration of 10 minutes. Two level i flow files (numbered 2n and 2n+1) are merged into one level i+1 flow file (numbered n).
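The level and numbering scheme can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function names are our own and are not part of the dataset tooling.

```python
from typing import Tuple

BASE_MINUTES = 10  # duration covered by a level 0 (r0) flow file


def level_duration_minutes(level: int) -> int:
    """Maximum flow duration, in minutes, at the given level: 10 * 2^level."""
    return BASE_MINUTES * 2 ** level


def parent_file_number(file_number: int) -> int:
    """Number of the level i+1 file that a level i file is merged into."""
    return file_number // 2


def child_file_numbers(file_number: int) -> Tuple[int, int]:
    """Numbers of the two level i files merged into this level i+1 file."""
    return 2 * file_number, 2 * file_number + 1
```

For example, under this scheme level 2 files cover up to 40 minutes, and level i files 6 and 7 are both merged into level i+1 file 3.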

All IP addresses are fully anonymized, with all bits consistently scrambled. Respecting users' privacy is important, and we draw only statistical conclusions from our dataset. All flow files are compressed with bzip2.
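Because the files are bzip2-compressed, they can be read without decompressing them to disk first. The following is a minimal sketch using Python's standard bz2 module. It assumes one flow record per text line, and the function name iter_flow_lines and any path passed to it are hypothetical.

```python
import bz2
from typing import Iterator


def iter_flow_lines(path: str) -> Iterator[str]:
    """Yield non-empty text lines from a bzip2-compressed flow file.

    Assumes one flow record per line; blank lines are skipped.
    """
    with bz2.open(path, "rt", encoding="ascii", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line
```

Streaming line by line keeps memory use small, which helps for the higher-level files that cover long time spans.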

Flow file format

Our flow record uses an extended Argus format:

start_timestamp end_timestamp sourceIP.sourcePort protocol destinationIP.destinationPort num_packets num_bytes state sigma_bytes_square bytes_avg N_timebins

(The last three fields are used to compute the burstiness of a flow, which is defined as the variance of bytes over time bins of 10 minutes: burstiness = sigma_bytes_square/N_timebins - bytes_avg*bytes_avg.)

Here is a sample flow record:

20090606:02:15:48.049447 20090606:03:37:13.873638 194.177.210.209.41157 udp 224.2.127.195.sapv1 3822 1048400 INT 12133392238 10920.8333333 96
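As an illustration, the sample record can be parsed and its burstiness computed as sketched below. This is not the tooling used to produce the dataset. It assumes fields are separated by whitespace, that the port is whatever follows the last dot of an endpoint (it may be a service name such as sapv1, so it is kept as a string), and that timestamps follow the YYYYMMDD:HH:MM:SS.microseconds layout seen in the sample. The names FlowRecord and parse_flow_record are our own.

```python
from dataclasses import dataclass
from datetime import datetime

# Timestamp layout inferred from the sample record (an assumption).
TIME_FORMAT = "%Y%m%d:%H:%M:%S.%f"


@dataclass
class FlowRecord:
    start: datetime
    end: datetime
    src_ip: str
    src_port: str  # may be a number or a service name, so kept as text
    protocol: str
    dst_ip: str
    dst_port: str
    num_packets: int
    num_bytes: int
    state: str
    sigma_bytes_square: float
    bytes_avg: float
    n_timebins: int

    def burstiness(self) -> float:
        """Variance of bytes per time bin, using the formula given above."""
        return self.sigma_bytes_square / self.n_timebins - self.bytes_avg ** 2


def parse_flow_record(line: str) -> FlowRecord:
    """Parse one whitespace-separated flow record line."""
    fields = line.split()
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    # The port is the part after the last dot of "IP.port".
    src_ip, src_port = fields[2].rsplit(".", 1)
    dst_ip, dst_port = fields[4].rsplit(".", 1)
    return FlowRecord(
        start=datetime.strptime(fields[0], TIME_FORMAT),
        end=datetime.strptime(fields[1], TIME_FORMAT),
        src_ip=src_ip,
        src_port=src_port,
        protocol=fields[3],
        dst_ip=dst_ip,
        dst_port=dst_port,
        num_packets=int(fields[5]),
        num_bytes=int(fields[6]),
        state=fields[7],
        sigma_bytes_square=float(fields[8]),
        bytes_avg=float(fields[9]),
        n_timebins=int(fields[10]),
    )


SAMPLE = (
    "20090606:02:15:48.049447 20090606:03:37:13.873638 "
    "194.177.210.209.41157 udp 224.2.127.195.sapv1 "
    "3822 1048400 INT 12133392238 10920.8333333 96"
)

record = parse_flow_record(SAMPLE)
print(record.end - record.start, record.burstiness())
```

In the sample, bytes_avg equals num_bytes divided by N_timebins (1048400 / 96), which is consistent with the variance formula above.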

Guaranteed property

Suppose a level i flow file starts at time t. We guarantee that it contains all flows that start in the interval [t, t+10*2^(i-1)] (in minutes) and last from 10*2^(i-2) minutes to 10*2^(i-1) minutes.
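The bounds in this property can be computed as in the sketch below. It assumes the exponents in the property are i-1 and i-2, that is, a start window of 10*2^(i-1) minutes and a duration range of 10*2^(i-2) to 10*2^(i-1) minutes; please check this reading against the paper before relying on it. The function name guaranteed_bounds is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Tuple

BASE_MINUTES = 10  # base duration of a level 0 flow file


def guaranteed_bounds(
    level: int, file_start: datetime
) -> Tuple[datetime, timedelta, timedelta]:
    """Return (latest flow start, minimum duration, maximum duration).

    Assumes exponents of i-1 and i-2 in the guaranteed property.
    """
    window = timedelta(minutes=BASE_MINUTES * 2 ** (level - 1))
    min_duration = timedelta(minutes=BASE_MINUTES * 2 ** (level - 2))
    max_duration = window
    return file_start + window, min_duration, max_duration
```

Under this reading, a level 3 file starting at midnight is guaranteed to contain all flows that start within the first 40 minutes and last between 20 and 40 minutes.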

For details of this property, please see the paper "On the Characteristics and Reasons of Long-lived Internet Flows" that appeared at ACM IMC 2010.