Correlation, Linear Regression, Analysis and Threat Hunting

October 26, 2017

Yuval’s Cybersecurity Monitoring Maturity Model (MMM)


Correlation & Linear Regression

“Correlation quantifies the degree to which two variables are related. Correlation does not fit a line through the data points. You simply are computing a correlation coefficient (r) that tells you how much one variable tends to change when the other one does. When r is 0.0, there is no relationship. When r is positive, there is a trend that one variable goes up as the other one goes up. When r is negative, there is a trend that one variable goes up as the other one goes down.

Linear regression finds the best line that predicts Y from X.  Correlation does not fit a line.”

Source: What is the difference between correlation and linear regression?
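To make the distinction in the quote concrete, here is a minimal Python sketch that computes both quantities on the same data: the correlation coefficient r (no line is fitted) and the least-squares line that predicts Y from X. The function names and the sample values (hours of the day versus failed logins) are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares (slope, intercept) of the line that predicts y from x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sample: hour of observation vs. number of failed logins.
hours = [1, 2, 3, 4, 5]
failed_logins = [2, 4, 5, 4, 5]

r = pearson_r(hours, failed_logins)            # strength and direction only
slope, intercept = fit_line(hours, failed_logins)  # the predictive line
print(f"r = {r:.3f}, y = {slope:.2f} * x + {intercept:.2f}")
```

Correlation answers "how strongly do the two variables move together", while regression answers "what Y should I expect for a given X"; the two functions share most of their arithmetic but return different things.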

“The figure below shows four hypothetical scenarios in which one continuous variable is plotted along the X-axis and the other along the Y-axis.


  • Scenario 1 depicts a strong positive association (r=0.9), similar to what we might see for the correlation between infant birth weight and birth length.
  • Scenario 2 depicts a weaker association (r=0.2) that we might expect to see between age and body mass index (which tends to increase with age).
  • Scenario 3 might depict the lack of association (r approximately 0) between the extent of media exposure in adolescence and age at which adolescents initiate sexual activity.
  • Scenario 4 might depict the strong negative association (r= -0.9) generally observed between the number of hours of aerobic exercise per week and percent body fat.”

Source: Introduction to Correlation and Regression Analysis

In practice, correlation is a basic and common security technique. It is found in common security products that rely on “Rules” and “Signatures”, such as firewalls and antivirus software. In other words, correlation usually relies on finding IOCs (Indicators of Compromise) and IOAs (Indicators of Attack). Moreover, this technique usually relies on common Cyber Threat Intelligence, such as blacklists and whitelists.
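As a rough illustration of this rule-based approach, the Python sketch below matches events against a blacklist of IP addresses. The event fields (`src_ip`, `dst_ip`, `action`), the function name, and the addresses (taken from the reserved documentation ranges) are all assumptions made for this example and do not reflect any specific product.

```python
# Hypothetical blacklist, as might be supplied by a threat intelligence feed.
BLACKLISTED_IPS = {"203.0.113.7", "198.51.100.23"}

# Hypothetical, simplified firewall events.
events = [
    {"src_ip": "192.0.2.10", "dst_ip": "192.0.2.20", "action": "allow"},
    {"src_ip": "203.0.113.7", "dst_ip": "192.0.2.20", "action": "allow"},
    {"src_ip": "192.0.2.11", "dst_ip": "198.51.100.23", "action": "deny"},
]


def match_iocs(events, blacklist):
    """Return the events whose source or destination IP is blacklisted."""
    return [
        e for e in events
        if e["src_ip"] in blacklist or e["dst_ip"] in blacklist
    ]


alerts = match_iocs(events, BLACKLISTED_IPS)
for alert in alerts:
    print("IOC match:", alert)
```

A rule like this only fires on indicators that are already known, which is why the techniques in the following sections look for behavior rather than fixed values.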


The idea behind the Analysis technique is to find ‘Abnormal Behavior’ before real damage to the organization’s assets occurs. In other words, instead of waiting for the Threat Actor (Adversary) to ‘appear’, or for a real cyber alert to fire after a specific security event (e.g. a rule violation), this technique tries to detect exceptions proactively. The side effect is that, because each organization, system, and network has its own specific, unique, and usually temporary baseline, it is common to see a higher rate of false positives (FP, a type I error).
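One simple way to express "deviation from a baseline" is a z-score test, sketched below in Python. This is an assumption for illustration only, not a method prescribed by any particular product; the function name, the threshold values, and the login counts are hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observations, threshold=3.0):
    """Flag observations more than `threshold` standard deviations
    away from the baseline mean. Returns (index, value) pairs."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation; z-score is undefined")
    return [
        (i, x) for i, x in enumerate(observations)
        if abs(x - mu) / sigma > threshold
    ]


# Hypothetical baseline: logins per hour during a normal week.
baseline_logins = [20, 22, 19, 21, 23, 20, 22]
# Hypothetical new observations to test against that baseline.
today = [21, 24, 58, 20]

print(find_anomalies(baseline_logins, today))                 # strict threshold
print(find_anomalies(baseline_logins, today, threshold=2.0))  # looser threshold
```

With the strict threshold only the value 58 is flagged; lowering the threshold to 2.0 also flags 24, which is probably just a busy hour. This is the false-positive trade-off described above: the tighter the fit to a baseline that itself drifts over time, the more benign deviations get reported.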

“This example shows how one KeyLines customer, an online currency exchange provider, uses network visualization to analyze user login behaviors.

Irregularities in login patterns can be a useful indicator of compromise, often indicating an impending breach. Patterns to look for include:

  • Accounts accessing a system from many geographic locations
  • Accounts using many different devices
  • Logins from locations in which the company does not operate
  • Accounts accessing a system from two devices simultaneously
  • Multiple failed logins

Humans are uniquely equipped with the analysis skills required to see patterns and find outliers. By presenting a visual overview of our data in a single chart, the brain automatically spots unusual patterns:

[Figure: cyber anomaly detection 1]

In this screenshot, the central node of each structure indicates an online account; each connected node is an IP address that has been used to access that account. We can see that most accounts have been accessed by 1-4 different IP addresses.

There are specific star structures throughout the chart that stand out:

[Figure: cyber anomaly detection 2]

This indicates that individual login accounts have been accessed from multiple locations. Let’s zoom into one:

[Figure: cyber anomaly detection 3]

Here we have zoomed in on two ‘star’ structures. At this level, we can see more detail:

  • Green nodes are the user
  • Yellow nodes are devices used
  • Purple nodes are the account.

Looking closer still, we can see that the user node uses a glyph to indicate the country of registration for the account. The node connected by a thick yellow link is the account’s ‘original’ IP address.

[Figure: cyber anomaly detection 4]

In this example, the analyst should look at this account and ask why this user has logged into the system from more than 20 locations. If we integrate our chart with a case management system, CRM or the login database, the investigation could be reached through a context menu.”

Source: Visualizing Cyber Security Data: Detecting Anomalies
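The ‘star’ structures in the quoted example (one account connected to many IP addresses) can also be found without a chart. The Python sketch below counts distinct IP addresses per account and flags accounts above a limit. The limit of 4 is an assumption taken from the quote's remark that most accounts were accessed by 1-4 IP addresses; the account names, addresses, and function names are hypothetical.

```python
from collections import defaultdict


def ips_per_account(logins):
    """Map each account to the set of distinct IPs it logged in from."""
    seen = defaultdict(set)
    for account, ip in logins:
        seen[account].add(ip)
    return seen


def flag_star_accounts(logins, max_ips=4):
    """Return {account: distinct_ip_count} for accounts above max_ips."""
    return {
        account: len(ips)
        for account, ips in ips_per_account(logins).items()
        if len(ips) > max_ips
    }


# Hypothetical login records as (account, source IP) pairs.
logins = [
    ("alice", "192.0.2.1"),
    ("alice", "192.0.2.2"),
    ("alice", "192.0.2.1"),  # repeat login from a known IP
] + [("bob", f"198.51.100.{n}") for n in range(1, 7)]  # six distinct IPs

print(flag_star_accounts(logins))
```

The same grouping idea extends to the other patterns in the list, for example counting distinct devices or failed attempts per account instead of IP addresses.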

Threat Hunting

“Cyber threat hunting is a proactive security approach for organizations to detect advanced threats in their networks. Until recently, most security teams have relied on traditional rule- and signature-based solutions that produce floods of alerts and notifications, and typically only analyze data sets after an indicator of a breach had been discovered as a part of forensic investigations.

The Threat Hunting process is meant to be iterative. You will never be able to fully secure your network after just a single hunt. To avoid one-off, potentially ineffective hunting trips, it’s important for your team to implement a formal cyber hunting process. The following four stages make up a model process for successful hunting.

The hunting loop illustrates that hunting is most effective when it’s habitual and adaptable. Let’s break it down step by step, beginning with hunting starting points, or what we call “trailheads”:

A hunt starts with creating a hypothesis, or an educated guess, about some type of activity that might be going on in your IT environment. An example of a hypothesis could be that users who have recently traveled abroad are at elevated risk of being targeted by state-sponsored threat actors, so you might begin your hunt by planning to look for signs of new malware on their laptops or assuming that their accounts are being misused around your network. Hypotheses are typically formulated by analysts based on any number of factors, including friendly and threat intelligence. There are various ways that a hunter might form a hypothesis. Often this involves laying out attack models and the possible tactics a threat might use, determining what would already be covered by automated alerting systems, and then formulating a hunting investigation of what else might be happening.

A hunter follows up on hypotheses by investigating via various tools and techniques, including Linked Data Search and visualization. Effective tools will leverage both raw and linked data analysis techniques such as visualization, statistical analysis or machine learning to fuse disparate cybersecurity datasets. Linked Data Analysis is particularly effective at laying out the data necessary to address the hypotheses in an understandable way, and so is a critical component for a hunting platform. Linked data can even add weights and directionality to visualizations, making it easier to search large data sets and use more powerful analytics. Many other complementary techniques exist, including row-oriented techniques such as stack counting and datapoint clustering. Analysts can use these techniques to discover new malicious patterns in their data and reconstruct complex attack paths to reveal an attacker’s Tactics, Techniques, and Procedures (TTPs).

Various tools and techniques are used in uncovering new malicious patterns of behavior and adversary TTPs. This step is the definitive success criteria for a hunt. An example of this process could be that a previous investigation revealed that a user account has been behaving anomalously, with the account sending an unusually high amount of outbound traffic. After conducting a Linked Data investigation, it is discovered that the user’s account was initially compromised via an exploit targeting a third party service provider of the organization. New hypotheses and analytics are developed to specifically discover other user accounts affiliated with similar third party service providers.

Finally, successful hunts form the basis for informing and enriching automated analytics. Don’t waste your team’s time doing the same hunts over and over. Once you find a technique that works to bring threats to light, automate it so that your team can continue to focus on the next new hunt. Information from hunts can be used to improve existing detection mechanisms, which might include updating SIEM rules or detection signatures. For example, you may uncover information that leads to new threat intelligence or indicators of compromise. You might even create some friendly intelligence, that is, information about your own environment and how it is meant to operate, such as network maps, software inventories, lists of authorized web servers, etc. The more you know about your own network, the better you can defend it, so it makes sense to try to record and leverage new findings as you encounter them on your hunts.”

Source: The Threat Hunting Reference Model Part 2: The Hunting Loop, Sqrrl
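Of the techniques named in the quote, stack counting is the easiest to show in a few lines: count how often each value of a field occurs and review the rarest values first, on the assumption that rare is more likely to be unusual. The Python sketch below is one possible reading of the technique, not Sqrrl's implementation; the record fields, host names, and process names are hypothetical.

```python
from collections import Counter


def stack_count(records, field):
    """Count occurrences of each value of `field`, rarest first.
    Ties are broken alphabetically so the output is stable."""
    counts = Counter(r[field] for r in records)
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))


# Hypothetical process-start records collected from several hosts.
records = [
    {"host": "ws01", "process": "chrome.exe"},
    {"host": "ws02", "process": "chrome.exe"},
    {"host": "ws03", "process": "chrome.exe"},
    {"host": "ws01", "process": "outlook.exe"},
    {"host": "ws02", "process": "outlook.exe"},
    {"host": "ws03", "process": "svch0st.exe"},  # seen once: worth a look
]

for value, count in stack_count(records, "process"):
    print(f"{count:3d}  {value}")
```

A rare value is only a lead for the hunter to investigate, not a verdict; once a pattern proves useful, it can be automated as the final step of the hunting loop suggests.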

Please note that various Threat Hunting solutions and methodologies exist, such as the LIFARS Threat Hunting methodology.

For further information please review:

Security Information and Event Management (SIEM)

Why Your SIEM Doesn’t Work, Exabeam

Machine learning at Elasticsearch: In quest of data anomalies
