
We generally use the term “measures of effectiveness” (MOE) for metrics used to evaluate the performance of models that detect or classify threats. As I noted in an earlier blog, various operational MOEs may be applied to assess the C-InT team’s efficiency or “throughput,” such as the number of cases processed per week or the average time required to resolve a case. These are productivity metrics.

In this blog, I’d like to focus on measures that reflect the accuracy of C-InT tools or methods. Measuring the performance of C-InT tools is challenging for several reasons: (a) insider incidents are rare, so relatively few case studies are available to include in a rigorous (i.e., statistical) examination; and (b) real data and ground truth are not easily obtained, since most organizations are reluctant to share such data, and the public accounts of cases that we can find do not typically include all the relevant data that served as evidence. The lack of data and ground truth is one of the reasons that insider threat was judged to rank second on the INFOSEC “hard problems list.”

Challenges in Obtaining Test Data

One approach to addressing this challenge is the collection and management of test data sets. There have been sets of real, anonymized data available to support research (such as the Enron email data set available from the Carnegie Mellon University School of Computer Science) and collections of synthesized data (such as the insider threat test dataset maintained by Carnegie Mellon’s Software Engineering Institute). Unfortunately, in most cases these publicly available corpora lack sufficient breadth in the data types they represent, especially the kinds of behavioral data required for a whole-person C-InT approach. In my research, I have used such corpora on some occasions, but most often I have used “home-grown” test data sets constructed to reflect the broad range of data that the models address.

In these studies, the aim is to evaluate the performance of C-InT tools in a triage environment where their ability to identify “at-risk” individuals is compared with expert judgments; the expert judgments are taken as proxies for ground truth. The argument is that if a tool agrees with expert judgments, then it has sufficient validity to be installed and used on a provisional basis in an operational environment. Of course, feedback gained during this provisional deployment can then be fed back into the threat assessment and/or case management software to continually improve performance.

Measuring Model Performance

The output of a threat assessment model is a score — for example, the Cogynt C-InT model produces a risk score between 0 and 100. There are several options for testing the agreement of a model with expert judgments. In several of my studies, I asked subject-matter experts (SMEs) to rank-order a set of synthesized cases from lowest to highest risk, and then compared the outputs of various models to the experts’ rankings using a correlation statistic. The square of the correlation is a useful metric that reflects the total percentage of variance accounted for by the model. For example, a simple Counting model (which merely counts the number of observed potential risk indicators, or PRIs) performs relatively poorly, accounting for only about one-fourth or less of the variance in the expert judgments. A Sum-of-Risk model that sums the risk values of observed PRIs does somewhat better, accounting for about 50-60% of the variance (this corresponds to a correlation of roughly 0.75). Several other models (including Bayesian Belief Networks) that I studied 5-10 years ago performed roughly the same — it seemed that the best performance was in this range, meaning that there was a lot of room for improvement!
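To make this concrete, here is a minimal sketch in Python of how the rank correlation and the variance accounted for can be computed. The rankings and scores are made up for illustration (they are not data from my studies), and the sketch assumes SciPy is available:

```python
# Sketch: comparing model risk scores to expert rank-orderings (hypothetical data).
from scipy.stats import spearmanr

# Expert rank-ordering of 10 synthesized cases (1 = lowest risk, 10 = highest risk)
expert_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Risk scores produced by a model for the same 10 cases (0-100 scale)
model_scores = [12, 20, 15, 35, 30, 55, 48, 70, 66, 90]

rho, _ = spearmanr(expert_ranks, model_scores)
print(f"rank correlation = {rho:.2f}")
print(f"variance accounted for = {rho**2:.0%}")
```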

We can also adopt other, more rigorous metrics from fields related to signal detection and data mining. Instead of asking our SMEs to rank or rate the test cases, we can simply ask them to identify each as a “YES” (worthy of an alert to examine more fully) or a “NO” (not of interest). Our model(s), in turn, produce a risk score, say a value between 0 and 100, where 100 represents the highest risk. To gauge the performance of a model, we must choose a “decision criterion”: if a risk score exceeds this threshold, the model output is “Alert,” and if the risk score is less than the threshold, the model output is “no-Alert.” Given this triage routine, a formidable set of performance measures is readily calculated from the frequencies observed in a contingency table comparing the model output with SME judgments of “ground truth”:

  • TP (True Positive) = correct detection: proportion of cases for which the model alerts the analyst given that the expert categorized the case as “YES” (worthy of further analysis) = P(Alert | “YES”)
  • TN (True Negative) = correct rejection: proportion of cases for which the model outputs “no-Alert” given that the expert said “NO” = P(no-Alert | “NO”)
  • FP (False Positive) = “false alarm”: proportion of cases for which the model alerts when the SME says “NO” = P(Alert | “NO”)
  • FN (False Negative) = “miss”: proportion of cases for which the model outputs “no-Alert” when the SME says “YES” = P(no-Alert | “YES”)
  • Precision = TP / (TP + FP): out of all the cases the model flags as threats, what percentage are true threats?
  • Recall = TP / (TP + FN): out of all the true threats, what percentage does the model flag?
  • F1 = harmonic mean of Precision and Recall = 2 × Precision × Recall / (Precision + Recall). The F1 score is a general measure that gives equal weight to precision and recall; F1 scores above 0.90 reflect excellent performance.
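As a minimal sketch of how these quantities can be tallied, consider the following Python snippet; the labels, scores, and the arbitrary threshold of 50 are invented for illustration, not data from any real evaluation:

```python
# Sketch: contingency-table metrics for one choice of decision threshold (illustrative data).
sme_labels  = ["YES", "NO", "YES", "NO", "NO", "YES", "NO", "YES"]   # expert ground-truth proxy
risk_scores = [85,    20,   60,   45,   10,   90,    70,   55]       # model output, 0-100
threshold   = 50                                                      # "Alert" if score exceeds this

alerts = [s > threshold for s in risk_scores]

tp = sum(a and y == "YES"       for a, y in zip(alerts, sme_labels))  # correct detections
fp = sum(a and y == "NO"        for a, y in zip(alerts, sme_labels))  # false alarms
fn = sum((not a) and y == "YES" for a, y in zip(alerts, sme_labels))  # misses
tn = sum((not a) and y == "NO"  for a, y in zip(alerts, sme_labels))  # correct rejections

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```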

Now, you might ask: What do we pick as the “decision criterion”? Clearly, the performance of the model will change radically based on the value we pick for our threshold, and the performance measures such as TP and FP will change accordingly.

Signal Detection Theory instructs us to plot the TP rate against the FP rate as we change the decision criterion; this traces out the points of a “Receiver Operating Characteristic” (ROC) curve. For example, if the threshold is zero, then all cases are classified as Alerts and both the TP (true positive) rate and the FP (false positive) rate are 1.0, yielding the point (1.0, 1.0) on the ROC curve; if the threshold is 100, then no cases are called Alerts and both the TP and FP rates are 0, yielding the point (0.0, 0.0). If the model ignores the data and outputs “Alert” randomly 80 percent of the time no matter what, the TP and FP rates would both be 0.80, yielding the point (0.80, 0.80). Thus, random performance generates an ROC curve along the diagonal.

The ROC curve reflects the tradeoff between the false positive and false negative rates across every possible decision threshold. Good performance is characterized by low FP rates and low FN rates (i.e., high TP rates) across a reasonable range of threshold values, so desirable performance is reflected in ROC curves that bow away from the (lower-left to upper-right) diagonal toward the upper-left (0.0, 1.0) corner. Accordingly, another measure of a model’s performance is the area under the curve (AUC): random performance produces AUC = 0.50, while ideal/perfect performance yields AUC = 1.00.
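A hedged sketch of the same idea follows: sweeping the threshold over made-up scores and labels to trace the ROC points and estimate the AUC with the trapezoidal rule.

```python
# Sketch: ROC curve and AUC by sweeping the decision threshold over illustrative data.
sme_labels  = ["YES", "NO", "YES", "NO", "NO", "YES", "NO", "YES"]
risk_scores = [85,    20,   60,   45,   10,   90,    70,   55]

pos = sum(y == "YES" for y in sme_labels)
neg = sum(y == "NO"  for y in sme_labels)

# Sweep thresholds from high to low so the curve runs from (0, 0) toward (1, 1).
points = []
for t in sorted(set(risk_scores), reverse=True):
    alerts = [s > t for s in risk_scores]   # "Alert" when the score exceeds the threshold
    tp_rate = sum(a and y == "YES" for a, y in zip(alerts, sme_labels)) / pos
    fp_rate = sum(a and y == "NO"  for a, y in zip(alerts, sme_labels)) / neg
    points.append((fp_rate, tp_rate))
points = [(0.0, 0.0)] + points + [(1.0, 1.0)]

# Area under the curve via the trapezoidal rule.
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
print(f"AUC = {auc:.2f}")
```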

Illustration

Let’s examine the results of a demonstration study that I created for illustrative purposes: a set of 100 sample cases (some modeled after real cases) used to test the predictions of alternative models. The models I tested included the Counting model, the Sum-of-Risk model, and the Cogynt model. Each model was configured using the SOFIT knowledge base of potential risk indicators (PRIs), which are calibrated with weights or degrees of association with insider threat behaviors of concern. To test the models, I used my own risk assessments of the 100 test cases (distinguishing “Alert”-worthy cases from “no-Alert” cases) as ground truth. Each model was then applied, based on the PRIs observed in each sample case, to compute risk scores, and points along the ROC curve were computed by tallying the TP and FP rates at different risk thresholds.
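To make the two baseline models concrete, here is a minimal sketch of how each turns a case’s observed PRIs into a risk score. The PRI names and weights are invented for illustration and are not the calibrated SOFIT values, and the Cogynt model itself is not reproduced here:

```python
# Sketch: two baseline scoring models over observed potential risk indicators (PRIs).
# The PRI names and weights below are hypothetical placeholders, not SOFIT values.
pri_weights = {
    "unusual_file_access": 30,
    "policy_violation":    20,
    "disgruntlement":      40,
    "foreign_contact":     25,
}

def counting_model(observed_pris):
    """Counting model: risk is simply the number of observed PRIs."""
    return len(observed_pris)

def sum_of_risk_model(observed_pris):
    """Sum-of-Risk model: risk is the sum of the weights of the observed PRIs."""
    return sum(pri_weights.get(p, 0) for p in observed_pris)

case = ["unusual_file_access", "disgruntlement"]
print(counting_model(case), sum_of_risk_model(case))   # 2, 70
```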

A comparison of the performance of these three models is summarized in the table below, which shows the precision, recall, FP, and FN metrics obtained when each model’s threshold is set to the value that maximizes its F1 score (a brief sketch of this threshold selection follows the table). The results show that the Cogynt model yields superior performance on all the metrics of interest.

Metric                 | Cogynt Model | Sum-of-Risk Model | Counting Model
Precision              | 0.96         | 0.90              | 0.63
Recall (TP rate)       | 1.00         | 0.75              | 0.83
False Positive rate    | 0.01         | 0.03              | 0.16
False Negative rate    | 0.00         | 0.25              | 0.17
F1                     | 0.94         | 0.82              | 0.71
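As a brief sketch of the threshold-selection step (again with made-up labels and scores rather than the 100-case study data), one can simply scan candidate thresholds and keep the one that maximizes F1:

```python
# Sketch: picking the decision threshold that maximizes F1 (illustrative data only).
sme_labels  = ["YES", "NO", "YES", "NO", "NO", "YES", "NO", "YES"]
risk_scores = [85,    20,   60,   45,   10,   90,    70,   55]

def f1_at(threshold):
    """F1 score when cases with scores above the threshold are flagged as Alerts."""
    alerts = [s > threshold for s in risk_scores]
    tp = sum(a and y == "YES"       for a, y in zip(alerts, sme_labels))
    fp = sum(a and y == "NO"        for a, y in zip(alerts, sme_labels))
    fn = sum((not a) and y == "YES" for a, y in zip(alerts, sme_labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

best = max(range(0, 101), key=f1_at)
print(f"best threshold = {best}, F1 = {f1_at(best):.2f}")
```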

Summary

This blog described a collection of measures that reflect the accuracy of C-InT tools or methods. We discussed the challenges in obtaining detailed data and ground truth, which are needed to evaluate the performance of models and tools. We showed how one can use expert judgments as proxies for ground truth and then apply various metrics to gauge the accuracy of alternative models. Signal Detection Theory and related fields provide robust methods and measures, including the ROC curve and statistics such as precision, recall, and F1, that facilitate the interpretation of MOEs.

We demonstrated these concepts using an illustrative evaluation study comparing the predictions of three C-InT models, revealing the superior performance of the Cogynt model. For more information about the Cogynt continuous intelligence platform, see the Cogility website.
