Abstract
In modern enterprise security, endpoint detection products fire an alert when process activity matches known attack behavior patterns. Human analysts then perform Root Cause Analysis (RCA) over event logs to determine whether the alert indicates an actual attack. Data provenance can help to automate RCA by representing event logs as a causal dependency graph; in fact, researchers are now examining whether provenance-based anomaly detection should replace pattern-based detection altogether. Unfortunately, we observe that current approaches leverage off-the-shelf graph embedding techniques that are unable to associate events with their root causes. This shortcoming not only fails to capitalize on the RCA capabilities of provenance, but also leaves provenance-based IDS vulnerable to mimicry and evasion attacks.
This work presents the design and implementation of R-CAID, a novel approach to incorporate RCA into provenance-based IDS. R-CAID precomputes each node's root causes during graph construction, then directly links those nodes to their root causes during embedding. Further, R-CAID's classification model is node/process-level, rather than graph/system-level, bringing it more in line with the precision of commercial systems. Under a passive adversary model, we find that R-CAID consistently outperforms baseline graph neural networks, sequence-based log IDS, and even a commercial endpoint detection system. Under a white-box active adversary model, R-CAID maintains a high level of performance (e.g., for DARPA Theia, 0.94 AUC adversarial down from 0.99 passive). R-CAID achieves this by associating each system entity with its immutable and unforgeable root causes, preventing adversaries from being able to masquerade as legitimate processes. This work is thus the first to demonstrate the promise of provenance-based IDS in a manner that avoids the pitfalls of mimicry and evasion.
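The following Python sketch is an added illustration of the idea the abstract describes, not code from the paper or from the R-CAID implementation. It assumes a simplified definition of a node's root causes as its ancestors with no predecessors (the entry points, such as a login daemon or a listening socket, from which the node is causally derived), precomputes them for every node in one pass over a constructed provenance graph, and contrasts a purely local behavior signature with a root-cause-aware feature. The event sample, entity naming scheme (`type:name`), and function names are all constructed for this example.

```python
from collections import defaultdict

# Constructed sample provenance events (not taken from the paper or its datasets).
# Each event is (cause, relation, effect): parent -> child on fork, process -> file
# on write, listening socket -> process on accept. Two bash instances show the same
# local behavior (spawn curl, which writes a file) but descend from different roots.
EVENTS = [
    ("login:sshd", "fork", "proc:bash#1"),
    ("proc:bash#1", "fork", "proc:curl#1"),
    ("proc:curl#1", "write", "file:/home/u/report.txt"),
    ("net:0.0.0.0:8080", "accept", "proc:webapp"),
    ("proc:webapp", "fork", "proc:bash#2"),
    ("proc:bash#2", "fork", "proc:curl#2"),
    ("proc:curl#2", "write", "file:/tmp/x"),
]


def entity_type(node):
    """Entity type is the prefix before ':' in this constructed naming scheme."""
    return node.split(":", 1)[0]


def build_graph(events):
    """Forward adjacency (with relation labels), reverse adjacency, and the node set."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for cause, rel, effect in events:
        succ[cause].append((rel, effect))
        pred[effect].append(cause)
        nodes.update((cause, effect))
    return nodes, succ, pred


def precompute_root_causes(nodes, pred):
    """Root causes of every node in one pass over the graph: a node with no
    predecessors is its own root; otherwise its roots are the union of its
    predecessors' roots. Assumes an acyclic provenance graph; a cycle raises."""
    roots, in_progress = {}, set()

    def visit(n):
        if n in roots:
            return roots[n]
        if n in in_progress:
            raise ValueError(f"cycle through {n}")
        in_progress.add(n)
        parents = pred.get(n, [])
        if parents:
            r = frozenset().union(*(visit(p) for p in parents))
        else:
            r = frozenset({n})
        in_progress.discard(n)
        roots[n] = r
        return r

    for n in nodes:
        visit(n)
    return roots


def behavior_signature(node, succ, depth=2):
    """What a purely local, pattern-style view sees: relations and entity types
    reachable downstream within `depth` hops, order-independent."""
    sig, frontier = [], [node]
    for d in range(1, depth + 1):
        nxt = []
        for n in frontier:
            for rel, child in succ.get(n, []):
                sig.append((d, rel, entity_type(child)))
                nxt.append(child)
        frontier = nxt
    return tuple(sorted(sig))


def node_features(node, succ, roots):
    """Root-cause-aware feature for one node: its own type, its local behavior,
    and the precomputed root causes it is linked to."""
    return (entity_type(node), behavior_signature(node, succ), roots[node])


def separated_by_root_cause(a, b, succ, roots):
    """True when two nodes look identical locally but differ in root causes,
    i.e. where a local embedding would conflate them and a root-aware one would not."""
    return (behavior_signature(a, succ) == behavior_signature(b, succ)
            and roots[a] != roots[b])


if __name__ == "__main__":
    nodes, succ, pred = build_graph(EVENTS)
    roots = precompute_root_causes(nodes, pred)
    for n in ("proc:bash#1", "proc:bash#2"):
        print(n, node_features(n, succ, roots))
    print("separated by root cause:",
          separated_by_root_cause("proc:bash#1", "proc:bash#2", succ, roots))
```

The two bash instances produce identical `behavior_signature` values, which is the situation a mimicry attack aims for; only the precomputed root-cause sets (`login:sshd` versus `net:0.0.0.0:8080`) tell them apart, and those sets are derived from ancestry the later process cannot rewrite.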