Testbed Usage Data Repository
Anonymization
- Textual identifiers (project pid, user uid, experiment eid, OS) are anonymized using positional, two-pass, case-insensitive anonymization.
This means that we make one pass through the data to collect all the identifiers that we want to anonymize and place each of them in one of the four categories (pid, uid, eid, OS). Identifiers are used as keys for a dictionary, and are made case-insensitive in the process.
The entries in the dictionary are assigned numbers 1, 2, 3, etc. This dictionary is used in the second pass to anonymize all the identifiers.
- Anonymization is consistent across files, e.g. pid X is mapped to the same number Y in every file we released.
- Anonymization is done within a category. Thus pid X may be mapped to number Y, but eid X (same name) will be mapped to number Z. Eid X in two different projects will be mapped to the same number Z.
- The way that OS images were identified in .top and .ptop files changed in early 2008 from pid-textName to numerical identifiers. Thus pre-2008 OS projName-RedHat and post-2008 OS 1001 could refer to the same OS image. We anonymize the pid part of an OS name the same way that we anonymize pid data in any other file. We apply separate positional anonymization to the second, textual part of the OS name. Once OS names changed to numerical identifiers in our data, we stopped anonymizing them - this data carries the original numbers.
- In user topology files (.top), node names are all mapped into the 'nodeX' namespace, where X is a number.
Similarly, LAN names are mapped into the 'lanX' namespace.
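The two-pass scheme described above can be sketched roughly as follows. This is a minimal illustration, not the script used to produce the released data: the function names, the record layout (one dictionary per row) and the choice to number identifiers in order of first appearance are all assumptions made for the example.

```python
# Hypothetical sketch of two-pass, per-category, case-insensitive anonymization.
# Record layout and numbering order (first appearance) are assumptions.

CATEGORIES = ("pid", "uid", "eid", "os")


def build_dictionary(records):
    """First pass: collect identifiers per category and number them 1, 2, 3, ..."""
    mapping = {cat: {} for cat in CATEGORIES}
    for record in records:
        for cat in CATEGORIES:
            value = record.get(cat)
            if value is None:
                continue
            key = value.lower()  # case-insensitive dictionary key
            if key not in mapping[cat]:
                mapping[cat][key] = len(mapping[cat]) + 1
    return mapping


def anonymize(records, mapping):
    """Second pass: replace each identifier with its number within its category."""
    result = []
    for record in records:
        anon = dict(record)
        for cat in CATEGORIES:
            value = record.get(cat)
            if value is not None:
                anon[cat] = mapping[cat][value.lower()]
        result.append(anon)
    return result


if __name__ == "__main__":
    rows = [
        {"pid": "ProjA", "uid": "alice", "eid": "exp1"},
        {"pid": "proja", "uid": "Bob", "eid": "ProjA"},
    ]
    table = build_dictionary(rows)
    for row in anonymize(rows, table):
        print(row)
```

Because each category keeps its own dictionary, a pid and an eid with the same name receive independent numbers. To get mappings that are consistent across files, the first pass would need to run over all files before the second pass starts.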
Table of Contents
- DB folder contains anonymized results of SQL queries run on the database of DeterLab's control server.
- categories file maps each anonymized pid into a category. Internal projects are those that focus on monitoring and innovating for DeterLab. Class projects are those that use DeterLab to assign homework or projects to students. The rest are research projects and are further categorized into several security fields.
- eventsall file lists the details of each testbed event that was recorded in the database. This is a join of multiple tables in the database, hence some fields will be NULL. The start_time is the time of the event in Epoch time. The exptidx is the unique numerical identifier of an experiment and is not anonymized - if an experiment is destroyed and its name is reused within the same project, this produces two experiments with the same eid but different exptidx. The anonymized eid, pid and uid identify the experiment, project and user that caused the event. The state field shows whether the experiment is swapped (currently doesn't hold resources), active (currently holds resources), new (never swapped in) or destroyed (NULL value) - this helps us identify inconsistencies where the database failed to record the swapout event for some experiments. The pnodes field shows the number of physical nodes the event affected - this number may not be entirely correct in the presence of delay nodes in the topology. The action field shows the type of the event - new (creating the experiment), preload (same as new), swapin (acquire resources), swapout (return resources), swapmod (modify experiment), destroy (delete experiment). The swapout_time is the last swapout time recorded in the database for the experiment and it helps further identify and fix DB inconsistencies. The exitcode is 0 on success and 1 on failure. Note that some experiment manipulation errors will show in the error log but will not generate an event in the database with exitcode=1, and vice versa.
- members file maps each anonymized uid to the anonymized pid of a project that the user belongs to. A user may belong to multiple projects. Column trust shows if a user is the project's creator (project_root), a trusted member (group_root - can do everything a project_root can, local_root - has sudo privileges on experiments) or an ordinary user with no sudo privileges (user). The created field shows when the user became a member of a given project in a given role.
- users file lists the anonymized uid, the state, country and registration time for each DeterLab user.
- FS folder contains anonymized and summarized .top and .ptop files. The .top files describe the user-desired topology for an experiment and the .ptop files describe the state of the testbed at a given time.
- topdetails.txt file contains summaries for each .top file stored on DeterLab's control node. The summary is a human-friendly version of the original .top format. Each summary starts with LABEL followed by the anonymized .top name and the Epoch time when it was created. The .top name is in the format '/usr/testbed/expinfo/pid-eid.exptidx/pid-eid-unique_number.[top|vtop]'. Keyword FIX then gives a fixed node mapping from vnode to pnode. We anonymize vnode names but leave pnode names intact. There could be multiple FIX directives. Keyword CLASS defines a new node type, vclass, followed by the hardware types that make up this vclass. In effect, the user wants nodes of any of the listed types and calls this new type vclass. Keyword TYPE is used to list the node type requested, any anonymized OS requests and any non-anonymized feature requests. This is followed by the number of nodes requested and their anonymized names. There could be multiple TYPE lines in one .top file. Finally, links are listed: first the anonymized node name, the keyword links and the number of links the node has. The next line then gives, for each link, the names of the two nodes that are connected and the bandwidth in bps. Some links may be 'emulated' and this is noted after their bandwidth spec - for more information about emulated links see Multiplexed Links. Each .top summary ends with keyword DONE.
- ptopdetails.txt file contains summaries for each .ptop file stored on DeterLab's control node. The summary is a human-friendly version of the original .ptop format. Each summary starts with LABEL followed by the anonymized .ptop name and the Epoch time when it was created. The .ptop name is in the format '/usr/testbed/expinfo/pid-eid.exptidx/pid-eid-unique_number.ptop'. Top and ptop files are created together and their names match. Ptop files that end in the suffix 'empty' are created by assuming that there are no experiments allocated on the testbed, and they are used to check whether a mapping failure is temporary (someone is holding the desired resources) or permanent (the user asked for something the testbed doesn't have). In the HOSTS BY TYPE section we list a hardware type and all available nodes of that type on one line and print their count as the last element after the = sign. This allows for a quick human check of what's available. In the HOSTS BY SUPERTYPE section we do the same thing but by supertype - pc, appl, netfpga, router, etc. In the TYPE CAPABILITIES section each line contains a list of capabilities, including feature:weight and OS support (anonymized OS names), followed by all nodes that have these capabilities. Note that some capabilities are per physical node (e.g. connected-to-...) while others are per hardware type (e.g. all nodes of the same type will support the same OS images). In the SWITCHES section we list the hardware type and the switch hosting this type, followed by the list of nodes hosted. Note that one switch may host multiple hardware types, and one hardware type can be hosted by multiple switches. In the INTERSWITCH LINKS section we list each pair of switches that have a link between them and the link's bandwidth in bps.
- Synthetic folder contains files we used for our '2011 synthetic setup' in the paper.
- start.ptop file shows the start state of the testbed in .ptop format.
- SIMworkload.txt file shows the instance ID, start time, stop time, and anonymized .top name (the file name from FS/topdetails.txt). In the course of anonymization we discovered that 13 allocation requests could not have been allocated on an empty testbed (using start.ptop), and we removed them from SIMworkload.txt and removed their topologies from the tops folder.
- tops folder contains all the .top files with anonymized content and names.
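As a starting point for reading FS/topdetails.txt, the sketch below splits the file into individual summaries and decodes the .top name from each LABEL line. It rests on several assumptions that the description above does not confirm: that LABEL and DONE each start their own line, that the LABEL fields are separated by whitespace, and that the anonymized pid and eid in the name contain no hyphens or dots. The function names are hypothetical, and lines between LABEL and DONE are returned unparsed.

```python
import re

# Assumed layout, taken from the documented name format:
# /usr/testbed/expinfo/pid-eid.exptidx/pid-eid-unique_number.[top|vtop]
NAME_RE = re.compile(
    r"^/usr/testbed/expinfo/"
    r"(?P<pid>[^/-]+)-(?P<eid>[^/.-]+)\.(?P<exptidx>\d+)/"
    r"[^/]+-(?P<unique>\d+)\.(?P<ext>top|vtop)$"
)


def parse_top_name(name):
    """Return the name components as a dict, or None if the name does not match."""
    match = NAME_RE.match(name)
    return match.groupdict() if match else None


def iter_summaries(lines):
    """Yield (name, created_epoch, body_lines) for each LABEL ... DONE block."""
    name, created, body = None, None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("LABEL"):
            fields = line.split()
            name, created, body = fields[1], int(fields[2]), []
        elif line == "DONE" and name is not None:
            yield name, created, body
            name, created, body = None, None, []
        elif name is not None:
            body.append(line)  # FIX, CLASS, TYPE and link lines, left unparsed


if __name__ == "__main__":
    with open("topdetails.txt") as handle:
        for top_name, created, body in iter_summaries(handle):
            print(top_name, created, len(body), parse_top_name(top_name))
```

Keeping the body lines opaque avoids guessing at the exact layout of FIX, CLASS, TYPE and link lines. A fuller parser would need to be checked against the actual file.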