Problem;Solution
Lack of a provider offering a variety of data sets: domains, passive DNS, passive SSL, DNS records, open ports, applications running on those ports, and files that communicate with domain names and IP addresses. Clarification: Providers normally offer only specific types of data, so obtaining a complete picture requires purchasing subscriptions from several of them. Even numerous subscriptions do not guarantee that all the necessary data has been obtained: some passive SSL providers supply data only on certificates issued by trusted CAs, and their coverage of self-signed certificates is extremely poor. Others do provide data on self-signed certificates, but collect it from standard ports only.;We gathered all of the data sets mentioned ourselves. For example, to collect data on SSL certificates, we built our own service that gathers information both from trusted CAs and by scanning the entire IPv4 address space. The certificates were collected not only from IP addresses but also from all the domains and subdomains in our database. For example, if the domain “example.com” and its subdomain “www.example.com” both resolve to IP address 1.1.1.1, an attempt to obtain an SSL certificate from port 443 may return three different results: one for the IP address, one for the domain, and one for the subdomain. To collect data on open ports and running services, we built a separate distributed scanning system, because the IP addresses of the scanning servers used by other services are frequently blacklisted. Our own scanning servers also get blacklisted at times, but we detect the servers we need more reliably than companies that simply scan as many servers as possible and then sell access to that data.
Lack of access to the entire database of historical records. Clarification: Each provider normally has an extensive historical database, but for obvious reasons we were unable to obtain access to all of the records. It is therefore possible to retrieve the records for a specific element (a domain or an IP address), but not to see the whole range of data.;To collect as many historical records on domains as possible, we purchased various databases, parsed data from open sources, and reached agreements with domain name registrars. Every update to our own databases is stored together with the entire history of changes.
All existing solutions allow graphs to be created in manual mode only. Clarification: Suppose subscriptions have been bought from all available data providers (usually called “enrichers”). When a graph is required, the analyst manually issues a command to build links from a given element and then chooses which of the resulting elements to build further links from. In this scenario, responsibility for the graph’s quality rests entirely with that individual.;We designed automated graph creation: the links to the searched element are built automatically, and the researcher only specifies the number of steps. The automated process itself is quite simple, but other vendors refrain from using it because a graph built automatically includes a large number of irrelevant results. We had to take this into account when creating our own graph (you can read about it below).
A large number of irrelevant results is a problem with nearly all graphs. Clarification: For example, a malicious domain (involved in an attack) is linked to a server that has been linked to 500 other domains over the past ten years. When a graph is created in manual or automated mode, all 500 of these domains end up on the graph, even though they have no relation to the attack. Or, for example, an IP indicator from a vendor's security report is checked. Such reports are normally issued with a great deal of detail and contain information covering a year or more. By the time the report is read, it is highly likely that the server with this IP address has been leased to other people with other links, so a graph built from this information will contain irrelevant results.;We taught our system to identify irrelevant results using the same logic our experts apply when doing this manually. For example, the malicious domain “example.com” is being checked. It currently resolves to IP address 11.11.11.11, while a month earlier it resolved to 22.22.22.22. Apart from “example.com”, the IP address 11.11.11.11 is also linked to the domain “example.ru”, while 22.22.22.22 is linked to 25,000 other domains. Both the system and a human analyst understand that 11.11.11.11 is most likely a dedicated server, and since the spelling of “example.ru” resembles “example.com”, the two domains are probably related and should both appear on the graph. The IP address 22.22.22.22, by contrast, belongs to a shared hosting provider, so there is no need to include all of its domains in the graph unless some other link shows that one of these 25,000 domains (e.g. “example.net”) should be included as well.
Before the system decides to break the links it has found and exclude some elements from the graph, it analyses several features of the elements and of the clusters they form, as well as the strength of the links. For example, if a small cluster (about 50 elements) that includes a malicious domain is connected to a large cluster (5,000 elements) by a weak link, the link is broken and the elements of the large cluster are removed from the graph. Conversely, if the small and large clusters are connected by many links of increasing strength, the links are not broken and the graph contains the necessary elements from both clusters.
The period of server or domain possession is not taken into account. Clarification: The registration of a malicious domain eventually expires, and the domain is later purchased again for malicious or legitimate purposes. Even bulletproof hosting providers lease their servers to different hackers over time, which is why it is extremely important to know and take into account the period during which a given domain or server was controlled by a specific owner. We often come across situations where a server with the IP address 11.11.11.11 is used as a C&C server for a banking bot, while a couple of months earlier it was used to control ransomware. If a graph is created without taking this so-called possession interval into account, it will look as though the owners of the banking botnet and the ransomware are linked, when in reality no such link exists. In our field of work, such a mistake is critical.;We taught our system to determine possession intervals. For domains this is relatively easy: services such as “whois” often state the registration and expiration dates, so the intervals can be determined when the entire history of “whois” changes is available. When a domain's registration has not expired but the domain has been handed over to another owner, the change can also be tracked. For example, SSL certificates are issued only once and cannot be extended or handed over to anyone else, so the appearance of a new certificate can signal a change of owner. With self-signed certificates, however, the validity dates cannot be trusted, because an SSL certificate can be generated with an arbitrary validity start date. Possession intervals are hardest to establish for servers, because only hosting providers know the lease terms. To establish them, we started using the results of port scanning and fingerprints of the applications running on those ports. This information helps us determine, with a high degree of accuracy, when a server changed its owner.
Few links. Clarification: Today, it is easy to obtain a list of domains linked to a specific email or IP address using free services such as “whois”. However, hackers do everything they can to make themselves harder to track and identify, so additional techniques are needed to uncover new attributes and build new links.;We spent a considerable amount of time working out how to extract data that is not available through ordinary means. We cannot reveal the details for obvious reasons, but in some cases mistakes made by hackers during domain registration or server configuration help us discover their email addresses, pseudonyms, or backend addresses. The more links that can be established, the more accurate the graph will be.
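
The filtering of irrelevant results described in the table can be illustrated with a minimal Python sketch. It is not our production logic: the names `registrable_label`, `domains_to_keep`, `ClusterLink` and `should_break`, and every threshold value, are hypothetical and chosen only to mirror the examples above (a shared host with 25,000 domains, a 50-element cluster weakly linked to a 5,000-element one).

```python
from __future__ import annotations

from dataclasses import dataclass


def registrable_label(domain: str) -> str:
    """Naive name comparison key: the label left of the TLD.

    Assumption: a real system would consult the Public Suffix List
    and use a proper string-similarity measure.
    """
    parts = domain.lower().rstrip(".").split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]


def domains_to_keep(seed: str, neighbours: list[str],
                    shared_threshold: int = 1000) -> list[str]:
    """Decide which domains hosted on the same IP go onto the graph.

    Few neighbours: treat the IP as a dedicated server and keep them all.
    Many neighbours: treat it as shared hosting and keep only the domains
    whose name resembles the seed. The threshold is illustrative.
    """
    if len(neighbours) <= shared_threshold:
        return list(neighbours)
    label = registrable_label(seed)
    return [d for d in neighbours if registrable_label(d) == label]


@dataclass(frozen=True)
class ClusterLink:
    """Connection between the seed's cluster and a neighbouring cluster."""
    small_size: int   # elements in the cluster containing the malicious domain
    large_size: int   # elements in the neighbouring cluster
    link_count: int   # number of distinct links between the two clusters
    strength: float   # aggregate strength in [0, 1]; the scoring is assumed


# Illustrative cut-offs only; no real values are published in this article.
SIZE_RATIO_LIMIT = 20
MIN_LINKS = 3
MIN_STRENGTH = 0.5


def should_break(link: ClusterLink) -> bool:
    """Break the link when a much larger cluster hangs on a weak connection."""
    much_larger = link.large_size >= SIZE_RATIO_LIMIT * link.small_size
    weak = link.link_count < MIN_LINKS or link.strength < MIN_STRENGTH
    return much_larger and weak
```

With these assumed thresholds, a 50-element cluster joined to a 5,000-element cluster by a single weak link is cut off, while the same pair joined by many strong links stays connected.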
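
The possession intervals discussed in the table can be sketched in a similar way. The example below assumes that each scan of a server yields a date and a single fingerprint string summarising its open ports and applications, and that a change of fingerprint marks a change of owner. Both are simplifications made for illustration, and the names `Possession`, `intervals_from_scans` and `same_possession` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from itertools import groupby


@dataclass(frozen=True)
class Possession:
    """A period during which a server showed one stable fingerprint."""
    fingerprint: str
    first_seen: date
    last_seen: date

    def contains(self, day: date) -> bool:
        return self.first_seen <= day <= self.last_seen


def intervals_from_scans(scans: list[tuple[date, str]]) -> list[Possession]:
    """Group consecutive scans with the same fingerprint into intervals.

    Assumption: a fingerprint change means the server changed hands.
    A real system would weigh several signals before deciding that.
    """
    ordered = sorted(scans)  # chronological order
    result = []
    for fingerprint, group in groupby(ordered, key=lambda scan: scan[1]):
        days = [day for day, _ in group]
        result.append(Possession(fingerprint, days[0], days[-1]))
    return result


def same_possession(intervals: list[Possession],
                    day_a: date, day_b: date) -> bool:
    """Two activities on one IP are linked only within a single interval."""
    return any(p.contains(day_a) and p.contains(day_b) for p in intervals)


# Hypothetical scan history for 11.11.11.11
scans = [
    (date(2019, 1, 5), "fp-A"),
    (date(2019, 2, 10), "fp-A"),
    (date(2019, 4, 2), "fp-B"),
    (date(2019, 6, 5), "fp-B"),
]
intervals = intervals_from_scans(scans)
```

Here ransomware activity seen on 10 February and banking-bot activity seen on 5 June fall into different intervals, so no link between the two owners is drawn on the graph.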