Threat Hunting: Data Collection and Analysis
Threat hunting means proactively searching the network for anomalies that might indicate a breach. The vast amount of data that must be collected and analyzed makes this a painstaking and time-consuming process, and a slow hunt is a less effective one. However, proper data collection and analysis methods can greatly improve both speed and results. In this article, we’ll discuss the data collection and analysis methods that threat hunters and analysts can use during a hunt.
What Kind of Data Are We Collecting?
As a threat hunter, you require adequate data in order to perform your hunt. Without the right data, you cannot hunt. Let’s take a look at what qualifies as the right data used for hunting.
It’s also important to note that the right data depends on what you will be looking for during your hunt. Generally, data can be classified into three categories:
1. Endpoint Data
Endpoint data comes from endpoint devices within the network. These can be end-user devices such as mobile phones, laptops and desktop PCs, but may also include hardware such as servers (for example, in a data center). Definitions of what an endpoint actually is vary significantly, but for the most part they match the description above.
You will be interested in collecting the following data from within endpoints:
- Process execution metadata: This data will contain information on the different processes running on hosts (endpoints). The most sought-after metadata will include command-line commands and arguments, and process file names and IDs.
- Registry access data: This data will be related to registry objects, including key and value metadata, on Windows-based endpoints.
- File data: This data will include, for example, the dates when files on the host were created or modified, as well as their size, type and location on disk.
- Network data: This data will identify the parent process responsible for each network connection made from the host.
- File prevalence: This data will shed light on how common a file is across the environment, for example on how many hosts it appears.
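As an illustration of the file prevalence item above, here is a minimal Python sketch. The record layout, host names, hashes and the helper names `file_prevalence` and `rare_files` are all hypothetical, chosen only for the example; a real endpoint tool will expose this data in its own format.

```python
from collections import defaultdict

# Hypothetical endpoint records: (host, shortened file hash, file path).
file_records = [
    ("ws-01", "aa11", r"C:\Windows\System32\svchost.exe"),
    ("ws-02", "aa11", r"C:\Windows\System32\svchost.exe"),
    ("ws-03", "aa11", r"C:\Windows\System32\svchost.exe"),
    ("ws-02", "ff99", r"C:\Users\Public\svchost.exe"),
]


def file_prevalence(records):
    """Map each file hash to the set of hosts on which it was seen."""
    hosts_by_hash = defaultdict(set)
    for host, file_hash, _path in records:
        hosts_by_hash[file_hash].add(host)
    return hosts_by_hash


def rare_files(records, max_hosts=1):
    """Return the hashes seen on at most `max_hosts` hosts, sorted."""
    prevalence = file_prevalence(records)
    return sorted(h for h, hosts in prevalence.items() if len(hosts) <= max_hosts)


print(rare_files(file_records))
```

A file that appears on only one host is not necessarily malicious, but low prevalence is a reasonable starting point for closer inspection.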
2. Network Data
This data comes from network devices such as firewalls, switches, routers, DNS servers and proxy servers. You will mostly be interested in collecting the following data from network devices:
- Network session data: Of interest here will be connection information between hosts on the network, such as source and destination IP addresses and connection start times, end times and durations. Typical sources are NetFlow, IPFIX and similar flow records.
- Monitoring tool logs: Network monitoring tools will collect connection-based flow data and application metadata. This logged data is what you want to be collecting here. Application metadata on HTTP, DNS and SMTP will also be of interest.
- Proxy logs: Here you will be collecting HTTP data containing information on outgoing Web requests, such as the internet resources that are being accessed from within the internal network.
- DNS logs: The logs you will get here will contain data related to domain name resolution. These will include domain-to-IP address mappings and identification of the internal clients that are making resolution requests.
- Firewall logs: These are among the most important data that you will be collecting. They will contain information on network traffic at the border of a network.
- Switch and router logs: This data will show what is going on inside your network, behind the border.
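To make the DNS log item above more concrete, the following Python sketch builds the two views it mentions: domain-to-IP mappings and the internal clients making each request. The log tuples, addresses (taken from documentation ranges) and the function name `summarize_dns` are assumptions for illustration, not the format of any particular DNS server.

```python
from collections import defaultdict

# Hypothetical DNS log entries: (internal client IP, queried domain, resolved IP).
dns_log = [
    ("10.0.0.5", "example.com", "192.0.2.10"),
    ("10.0.0.7", "example.com", "192.0.2.10"),
    ("10.0.0.5", "updates.example.net", "198.51.100.20"),
    ("10.0.0.9", "updates.example.net", "203.0.113.77"),
]


def summarize_dns(entries):
    """Return (domain -> resolved IPs, domain -> requesting clients)."""
    resolved = defaultdict(set)
    clients = defaultdict(set)
    for client, domain, ip in entries:
        resolved[domain].add(ip)
        clients[domain].add(client)
    return resolved, clients


resolved, clients = summarize_dns(dns_log)
for domain in sorted(resolved):
    print(domain, sorted(resolved[domain]), sorted(clients[domain]))
```

A domain that resolves to several addresses, or that only a single client ever requests, may be worth a closer look depending on the hypothesis being tested.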
3. Security Data
This data comes from security devices and solutions such as SIEM, IPS and IDS. You want to be collecting the following data from security solutions:
- Threat intelligence: This data will include indicators and the tactics, techniques and procedures (TTPs) of malicious entities, as well as the operations that they are executing on the network.
- Alerts: Data here will include notifications from solutions such as IDS and SIEMs, indicating that a ruleset was violated or that some other incident has occurred.
- Friendly intelligence: This data will for instance include critical assets, accepted organization assets, employee information and business processes. The importance of this data is to help the hunter and analyst to understand the environment in which they operate.
What Are the Four Threat-Hunting Techniques for Data Collection?
One of the most important parts of a threat-hunting process is having experienced personnel employ effective data collection and analysis methods. There are four main methods/techniques that hunters use for data collection, and these are:
Clustering
This technique is used when you have a large data set and want to separate groups (called clusters) of similar data points out of it, based on specific characteristics. It is advisable to use this method when the data points you are working on do not share explicit behavioral characteristics. Using this method, you will be able to find precise cumulative behaviors. One application is outlier detection, where you can, for example, find an unusual number of instances of an otherwise common occurrence.
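The outlier detection application mentioned above can be sketched as follows. This is not a full clustering algorithm: it swaps in a simple modified z-score check based on the median absolute deviation (MAD), which treats the bulk of the hosts as one cluster and flags values far from it. The data, the function name `mad_outliers` and the 3.5 threshold are assumptions for illustration.

```python
from statistics import median

# Hypothetical data: failed-logon counts per host over one day.
failed_logons = {
    "ws-01": 3, "ws-02": 5, "ws-03": 4,
    "ws-04": 2, "ws-05": 4, "srv-db": 61,
}


def mad_outliers(counts, threshold=3.5):
    """Return entries whose modified z-score exceeds `threshold`."""
    values = list(counts.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values are (nearly) identical, so nothing stands out.
        return {}
    return {
        key: value
        for key, value in counts.items()
        if 0.6745 * abs(value - med) / mad > threshold
    }


print(mad_outliers(failed_logons))
```

Failed logons are a common occurrence on every host; what the check surfaces is the one host where the count is unusual compared with the rest.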
Grouping
This technique is best used when you are hunting for multiple unique artifacts that are related to each other. It takes a set of these unique artifacts, supplied as specific items of interest, and identifies when several of them appear together according to specific criteria. The criteria used to group the data can be, for instance, events occurring within a certain time window.
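A minimal Python sketch of grouping by time window, assuming events are available as (timestamp, host, artifact) tuples. The event data, the list of items of interest and the function name `group_by_window` are hypothetical, and the fixed 10-minute buckets are just one possible grouping criterion.

```python
from collections import defaultdict
from datetime import datetime, timezone

UTC = timezone.utc

# Hypothetical events: (timestamp, host, executed artifact).
events = [
    (datetime(2024, 1, 1, 9, 1, tzinfo=UTC), "ws-02", "whoami.exe"),
    (datetime(2024, 1, 1, 9, 3, tzinfo=UTC), "ws-02", "net.exe"),
    (datetime(2024, 1, 1, 9, 7, tzinfo=UTC), "ws-02", "nltest.exe"),
    (datetime(2024, 1, 1, 9, 2, tzinfo=UTC), "ws-01", "whoami.exe"),
    (datetime(2024, 1, 1, 11, 30, tzinfo=UTC), "ws-01", "net.exe"),
    (datetime(2024, 1, 1, 9, 4, tzinfo=UTC), "ws-02", "notepad.exe"),
]


def group_by_window(events, items_of_interest, window_seconds=600):
    """Group items of interest per host and fixed time bucket.

    Only groups containing more than one distinct item are returned.
    """
    groups = defaultdict(set)
    for timestamp, host, artifact in events:
        if artifact not in items_of_interest:
            continue
        bucket = int(timestamp.timestamp()) // window_seconds
        groups[(host, bucket)].add(artifact)
    return {key: found for key, found in groups.items() if len(found) > 1}


interesting = {"whoami.exe", "net.exe", "nltest.exe"}
for (host, _bucket), found in group_by_window(events, interesting).items():
    print(host, sorted(found))
```

Each tool in the list is unremarkable on its own; the grouping highlights the host on which several of them ran within the same short window.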
Searching
This is a technique in which hunters query data for specific artifacts, and it can be used in most tools. However, its usefulness is limited because hunters only get the results that they searched for, which makes it quite difficult to spot outliers in the search results. The hunter is forced to make specific searches, since general searches would result in an overload of results. Care should still be taken, since a search that is too narrow might miss relevant results.
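The trade-off between broad and narrow searches can be sketched in Python as below. The process log, its field names and the `search` helper are hypothetical; most hunting tools provide their own query language for the same purpose.

```python
import re

# Hypothetical process execution log.
process_log = [
    {"host": "ws-01", "image": "powershell.exe",
     "cmdline": "powershell.exe -NoProfile -File backup.ps1"},
    {"host": "ws-02", "image": "powershell.exe",
     "cmdline": "powershell.exe -EncodedCommand SQBFAFgA"},
    {"host": "ws-03", "image": "cmd.exe",
     "cmdline": "cmd.exe /c dir"},
]


def search(records, field, pattern):
    """Return the records whose `field` matches the regular expression."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [r for r in records if regex.search(r.get(field, ""))]


# Broad search: every PowerShell execution, which can be overwhelming at scale.
broad = search(process_log, "image", r"powershell")

# Narrower search: only executions that pass an encoded command.
narrow = search(process_log, "cmdline", r"-enc(odedcommand)?\b")

print(len(broad), [r["host"] for r in narrow])
```

The narrow query returns only what it was written to find, so an attacker who uses a different flag or tool would not appear in its results.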
Stack Counting (Stacking)
This technique is used when investigating a hypothesis. The hunter counts the number of occurrences for specific value types while examining the outliers of the results. This technique is most effective as long as the hunter has thoughtfully filtered the input. Hunters can predict the volume of output if they properly understand the input.
As an example, when stacking process execution data, you would count the number of executions of each command artifact and then examine the values that occur least (or most) often.
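A minimal stacking sketch in Python, assuming the input has already been filtered down to the command lines of a single process name. The sample data and the `stack` helper are hypothetical.

```python
from collections import Counter

# Hypothetical, pre-filtered input: command lines observed across hosts.
executions = [
    "svchost.exe -k netsvcs",
    "svchost.exe -k netsvcs",
    "svchost.exe -k LocalService",
    "svchost.exe -k netsvcs",
    "svchost.exe -k LocalService",
    "svchost.exe -k evil",
]


def stack(values):
    """Count occurrences of each value, least frequent first."""
    counts = Counter(values)
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


for value, count in stack(executions):
    print(count, value)
```

Sorting in ascending order puts the rarest values at the top, which is where the outliers the hunter wants to examine are most likely to be found.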
The standard methods described above are manual, but threat hunters are also able to employ machine learning (or other data science-powered techniques), which involve giving feedback to automated classification systems. Simply put, the hunter needs to ensure that training data is properly used to tune algorithms so that these algorithms can accurately label unclassified data. Machine learning is not a strict requirement for hunting, but knowing that the technique exists might help you when you need it.
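The idea of tuning a classifier with labeled training data and hunter feedback can be sketched with a deliberately simple nearest-centroid classifier. The two features (command-line length and number of non-alphanumeric characters), the sample values and the function names `train` and `classify` are assumptions for illustration; a production system would use richer features and an established machine learning library.

```python
from math import dist
from statistics import fmean

# Hypothetical training data: ((cmdline length, special-char count), label).
labeled = [
    ((20, 2), "benign"),
    ((25, 3), "benign"),
    ((30, 4), "benign"),
    ((180, 40), "suspicious"),
    ((220, 55), "suspicious"),
]


def train(samples):
    """Compute one centroid (mean feature vector) per label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(fmean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(centroids, features):
    """Label a sample with the name of its nearest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = train(labeled)
verdict = classify(centroids, (200, 45))
print(verdict)

# Feedback loop: once the hunter confirms the label, add it and retrain.
labeled.append(((200, 45), verdict))
centroids = train(labeled)
```

The final two lines are the feedback step: each confirmed (or corrected) label becomes new training data, so the classifier's centroids shift toward what the hunter has actually observed.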
What Are the Most Common Data Analysis and Illustration Methods?
Once the data has been collected, the hunter can then interpret, illustrate and analyze the data to determine patterns within the data. Several approaches are at this point available to the hunter. These approaches include:
Box Plots (Box-Whisker Plots)
This technique is used when the hunter is interested in identifying outliers and determining the distribution of a dataset. Using box plots, a hunter can show differences between distributions and make extreme values stand out. The hunter can also group entities by function or type and compare the groups to identify possible inconsistencies.
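The numbers behind a box plot can be computed with Python's standard library, as in the sketch below. The sample data and the function name `box_plot_stats` are hypothetical, and the 1.5 x IQR whisker rule used here is the common convention rather than the only option.

```python
from statistics import quantiles

# Hypothetical data: outbound megabytes per day for one group of workstations.
sizes = [12, 15, 14, 13, 16, 15, 14, 90]


def box_plot_stats(values, whisker=1.5):
    """Return quartiles, whisker fences and the points outside the fences."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {
        "q1": q1, "median": q2, "q3": q3,
        "low_fence": low_fence, "high_fence": high_fence,
        "outliers": outliers,
    }


summary = box_plot_stats(sizes)
print(summary["outliers"])
```

Computing the same summary per group (for example, workstations versus servers) gives the side-by-side comparison of distributions that a box plot shows graphically.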
Sparklines
Sparklines are small line charts that represent trends in data and are drawn without the axes found on ordinary graphs. Hunters can use them to display trends in an array of values that fluctuate continuously. Each fluctuation changes the shape of the sparkline, enabling the hunter to see the changes at a glance. This makes it easier to interpret the data.
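A text-based approximation of a sparkline can be produced with Unicode block characters, as in this Python sketch. The sample series and the `sparkline` helper are hypothetical; charting tools draw true line-based sparklines, so this is only a stand-in for terminal output.

```python
# Eight block characters of increasing height.
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Render a sequence of numbers as one line of block characters."""
    if not values:
        return ""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A flat series is drawn at the lowest level.
        return BARS[0] * len(values)
    scale = (len(BARS) - 1) / (highest - lowest)
    return "".join(BARS[round((v - lowest) * scale)] for v in values)


# Hypothetical data: DNS queries per hour from a single host.
queries_per_hour = [40, 42, 38, 41, 39, 300, 280, 43]
print(sparkline(queries_per_hour))
```

Because there are no axes, the sparkline says nothing about absolute values; it only shows where the series rises and falls relative to its own range.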
Heat Maps
Heat maps allow hunters to represent data values as colors. Different values or data groups are shown in different colors, which is especially efficient for displaying data groups and their relationships. This method of representation makes data of interest, such as outliers, stand out visually.
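The idea can be sketched in plain text by mapping values to shading characters instead of colors. In this Python example the connection counts, host names and the `heat_map` helper are hypothetical; a real heat map would normally be drawn with a plotting library.

```python
# Shading characters from "cold" (blank) to "hot" (solid block).
SHADES = " ░▒▓█"

# Hypothetical data: outbound connections per host (rows) per hour (columns).
connections = {
    "ws-01": [2, 3, 2, 1],
    "ws-02": [1, 2, 40, 38],
    "ws-03": [0, 1, 2, 1],
}


def heat_map(matrix):
    """Render each row as a label followed by one shaded cell per value."""
    peak = max(max(row) for row in matrix.values())
    top = len(SHADES) - 1
    lines = []
    for name, row in matrix.items():
        cells = "".join(
            SHADES[round(v / peak * top)] if peak else SHADES[0] for v in row
        )
        lines.append(f"{name:<6} {cells}")
    return lines


for line in heat_map(connections):
    print(line)
```

All cells are scaled against the single highest value in the matrix, so the two busy hours on ws-02 appear as solid blocks while the low background traffic stays blank.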
Conclusion
In this article, we have covered the techniques at the disposal of a hunter during an active hunt, both for collecting data and for visualizing and analyzing the results. We also discussed what qualifies as the right data to collect. The methods discussed are not a comprehensive list, but they are the ones most commonly used during a hunt today.