Reconfigurable compute and network testbeds form the basis of much scientific cyberinfrastructure (CI). In such CI, users are given unrestricted access to experimental nodes, which can reach into the Internet and host publicly reachable services. A unifying theme across this type of CI is a desire to remain open and flexible to meet the ever-evolving needs of the science and engineering communities. Yet this openness and flexibility, coupled with a lack of system administration knowledge among CI users, opens the CI to attacks and misuse by external actors. Compromised nodes can be used to attack other public targets, to encrypt or exfiltrate users' scientific data, or to perform illicit computational activities such as cryptocurrency mining.
The DISCERN project collects datasets that capture legitimate and illegitimate use of scientific CI, which cybersecurity researchers can use to develop detection and defense approaches for CI attacks and misuse. We have instrumented the SPHERE testbed (https://sphere-testbed.net), used for cybersecurity and networking experimentation, to collect data about user activities via various user interfaces, experimental node processes, system events and file changes, experimental node resource usage, and the internal and external network traffic that interacts with user experiments. Data is collected in a manner that preserves both privacy and intellectual property, to protect users and their research. We have further launched carefully designed, ethical attacks that misuse testbed nodes in a variety of realistic CI misuse scenarios, and have collected data from these events. The legitimate-use and illegitimate-use datasets are released in their entirety. They are also curated for diversity and interleaved to create mixed, balanced datasets, which are released in that form to aid cybersecurity researchers in data mining and classification tasks. Our data collection tools and methodology are portable to other scientific CIs. We will work closely with owners of these CIs to promote their adoption of our tools and help them produce their own CI-usage datasets.