Clouds have gained popularity over the years as they provide high storage capacity and computing power, reduced hardware costs and on-demand availability. Cloud users often gain superuser access to cloud machines, which is necessary to fully customize the cloud resources to user needs. But superuser access to a vast amount of resources, without the support or oversight of experienced system administrators, can create fertile ground for accidental or intentional misuse. Attackers can rent cloud machines or hijack them from cloud users, and leverage them to generate unwanted traffic, such as spam and phishing, denial of service, vulnerability scans, drive-by downloads, etc. Some clouds, which engage in bulletproof hosting, knowingly permit malicious traffic generation.
In this project, we analyze 13 datasets, containing various types of unwanted traffic, to quantify cloud misbehavior and identify clouds that most often and most aggressively generate unwanted traffic. We find that although clouds own only 5.4% of the routable IPv4 address space (with 94.6% going to non-clouds), they often generate similar amounts of scans as non-clouds, and contribute 22-96% of entries on blocklists. Among /24 prefixes that send vulnerability scans, a cloud's /24 prefix is 20-100 times more aggressive than a non-cloud's. Among /24 prefixes whose addresses appear on blocklists, a cloud's /24 prefix is almost twice as likely to have its address listed, compared to a non-cloud's /24 prefix. Misbehavior is heavy-tailed among both clouds and non-clouds. OVH and DigitalOcean are two of the most misused clouds across all our datasets. We find that maliciousness of a cloud is heavy-tailed, with the top 25 clouds contributing 90% of all the scans from clouds, and 10 clouds contributing more than 20% of blocklist entries. This project analyzes network traces of a US ISP to identify shifts in network traffic due to Covid-19 and stay-at-home orders. We are grateful to IPInfo.io for letting us use their data to map IP prefixes from network traces into AS numbers and organizations.
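The heavy-tail observation above (a small number of clouds contributing most of the scans) can be checked with a simple top-k share computation. The following is a minimal sketch, not the project's actual analysis code; the function name `top_k_share` and the cloud names and counts are hypothetical and chosen only for illustration.

```python
from collections import Counter


def top_k_share(counts, k):
    """Return the fraction of all events contributed by the k largest sources."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(k) returns the k (source, count) pairs with the largest counts.
    top = sum(n for _, n in Counter(counts).most_common(k))
    return top / total


# Hypothetical scan counts per cloud (illustrative names and numbers only).
scans_per_cloud = {
    "cloud-a": 700,
    "cloud-b": 200,
    "cloud-c": 60,
    "cloud-d": 30,
    "cloud-e": 10,
}
share_top2 = top_k_share(scans_per_cloud, 2)
print(f"top 2 clouds contribute {share_top2:.0%} of scans")
```

Sweeping `k` from 1 upward and plotting the resulting share gives a cumulative curve; a curve that rises steeply for small `k` indicates a heavy tail.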
(1) CAIDA real-time network telescope data: It captures all traffic to an unused /8 prefix owned by CAIDA. The traffic to these unassigned addresses (darknet) represents unsolicited traffic, which results from a wide range of events, including scanning (unwanted traffic looking for vulnerabilities) and backscatter (replies by the victims of DDoS attacks to traffic with randomly spoofed source addresses, some of which fall in the darknet space).
(2) Merit network real-time network telescope data: It captures full packet traces, using a /13 dark prefix. We analyze their data from March 11, 2020 to March 19, 2020. Each compressed hourly file is close to 2 GB, and usually contains fewer than 0.1 billion packets, so we can process it fully. The traffic is not anonymized and can be analyzed only on Merit's machines.
(3) Regional optical network RONX: This dataset contains sampled Netflow records from a mid-size US regional network, connecting educational, research, government and business institutions to the Internet.
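To compare how aggressively cloud and non-cloud /24 prefixes send traffic in datasets like the ones above, one can aggregate packets by source /24 and label each /24 by whether it falls inside a known cloud prefix. The sketch below illustrates the idea under stated assumptions: records are IPv4 `(source address, packet count)` pairs, cloud prefixes are /24 or shorter, and all names and addresses (taken from documentation ranges) are hypothetical rather than drawn from the datasets.

```python
import ipaddress
from collections import defaultdict


def packets_per_slash24(records, cloud_prefixes):
    """Sum packets per source /24 and label each /24 as cloud or non-cloud.

    Assumes IPv4 records and cloud prefixes that are /24 or shorter.
    """
    cloud_nets = [ipaddress.ip_network(p) for p in cloud_prefixes]
    totals = defaultdict(int)
    for src_ip, packets in records:
        # Collapse the source address to its enclosing /24 prefix.
        net = ipaddress.ip_network(f"{src_ip}/24", strict=False)
        totals[net] += packets
    labeled = {"cloud": [], "non-cloud": []}
    for net, packets in totals.items():
        is_cloud = any(net.subnet_of(c) for c in cloud_nets)
        labeled["cloud" if is_cloud else "non-cloud"].append(packets)
    return labeled


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Hypothetical telescope records: (source IP, packets seen).
records = [
    ("192.0.2.10", 500),
    ("192.0.2.77", 300),
    ("198.51.100.5", 20),
    ("203.0.113.9", 20),
]
cloud_prefixes = ["192.0.2.0/24"]  # illustrative cloud prefix list

labeled = packets_per_slash24(records, cloud_prefixes)
ratio = mean(labeled["cloud"]) / mean(labeled["non-cloud"])
print(f"cloud /24s send {ratio:.0f}x more packets per prefix")
```

A linear scan over the cloud prefixes is fine for a small example; for Internet-scale prefix lists, a longest-prefix-match structure such as a radix tree would likely be needed.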
(1) Scamalytics - IP fraud risk lookup tool: It is known for maintaining the largest shared anti-fraud database, mainly dedicated to the online dating industry. It computes the fraud score for all known IPs, and publishes the top 100 IPs each month, along with their fraud scores. Our dataset includes the top 102 IP addresses for March 2020 and April 2020.
(2) F5 Labs: It maintains the list of the top 50 malicious autonomous systems and attacker IPs, which are associated with 14 million different attacks across the globe. The majority of these are Web application attacks. Organizations and attacker IPs are ranked by the number of attacks generated. It also lists the top 50 malicious IP addresses that generated the maximum number of attacks. This dataset spans a 90-day period from Aug 2019 to Oct 2019.
(3) BLAG: It produces a publicly available master blocklist, obtained by aggregating content from 157 publicly available, popular blocklists. A new master list is published whenever any of the blocklists updates its contents. We use the data for all of 2018 and 2019, separated into two datasets: 0.5 billion IP addresses in 2018, and 5 billion IP addresses for 2019 and the first month of 2020.
(4) Google Safe Browsing: It examines billions of URLs per day looking for unsafe websites, and publishes a list of them. We collected this list from May 8, 2020 to May 16, 2020 and used the DNS to map it into 7,886 IP addresses.
(5) COVID-19 phishing URL list from maltiverse.com: It contains 239 phishing URLs related to COVID-19 content, from March 13, 2020 to May 16, 2020.
(6) COVID-19 malicious hostname/URL list from maltiverse.com: It contains 9,874 malicious hostnames/URLs that contain the word "COVID-19"/"corona" and are known to generate different types of unwanted traffic, from January 2020 to May 2020.
(7) Openphish: It maintains the list of autonomous systems associated with phishing. We collect 54 snapshots, from May 15, 2020 to May 22, 2020, each containing the top 10 ASNs associated with phishing.
(8) Cybercrime Tracker: It maintains a public list of IP addresses that are known to spread malware. We use data from January 1st, 2019 to May 12th, 2020, which comprises 3,471 IP addresses that spread malware. This dataset overlaps with the BLAG dataset.
(9) udger.com: It maintains a list of known user agent strings, and also maintains a list of source IPs of known attacks, which is updated every half an hour. We collected 101 snapshots of this list, containing a total of 3,030 IP addresses, on April 13, 2020, and also from May 4, 2020 to May 20, 2020. Attacks include installations of vulnerable versions of popular web applications, brute-force login attempts, and floods.
(10) BGP Ranking: It provides a ranking model to rank the ASNs from the most malicious ASN to the least malicious ASN using data from compromised systems and other publicly available blocklists. We collected a snapshot of the top 100 most malicious ASNs on August 21st, 2020.
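Several of the sources above are collected as repeated snapshots of a ranked list (for example, top-10 ASN lists). One simple way to summarize such data is to count in how many snapshots each ASN appears. The sketch below is an illustration under assumptions, not the project's actual method: the function name `asn_frequency` is hypothetical, and the AS numbers come from the range reserved for documentation.

```python
from collections import Counter


def asn_frequency(snapshots):
    """Count in how many snapshots each ASN appears (at most once per snapshot)."""
    freq = Counter()
    for snapshot in snapshots:
        # A set removes duplicate ASNs within a single snapshot.
        freq.update(set(snapshot))
    return freq


# Hypothetical snapshots of a ranked ASN list (documentation AS numbers).
snapshots = [
    ["AS64500", "AS64501"],
    ["AS64500", "AS64502"],
    ["AS64500", "AS64501", "AS64501"],
]
freq = asn_frequency(snapshots)

# ASNs present in every snapshot are the most persistent offenders.
persistent = [asn for asn, n in freq.items() if n == len(snapshots)]
print(freq.most_common())
print(persistent)
```

Counting presence per snapshot, rather than raw occurrences, keeps a source that repeats an entry within one snapshot from inflating that ASN's frequency.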