Link to code and test data: https://drive.google.com/drive/folders/1u1e7SFrW8shPVdttCeKViH4O08ro9-Gu?usp=sharing
Link to future code repo: http://tfs.tsl.telus.com/tfs/telus/BT-GIT/_git/ID-Cust-ci-batch?path=%2Fsplunk-machine-learning
Splunk MLTK algorithms added:
- Rule_based_detection.py
  - Working, no dependencies needed. (By far) the most important
- CBRW_based_detection.py
  - Hard-disabled until dependencies are installed; low priority
- Ensemble_based_detection.py
  - Hard-disabled until dependencies are installed; low priority
This is a modified fork of github.com/splunk/mltk-algo-contrib.
3 detectors are added.
Unneeded detectors are commented out in algos.conf; note that since you will copy and paste my folder, you do not have to deal with that.
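For orientation, custom detectors in an mltk-algo-contrib fork follow the MLTK custom-algorithm interface: a Python class registered in algos.conf with a fit() method that receives the search results as a pandas DataFrame. The skeleton below is only an illustrative sketch that runs inside the MLTK app environment; the class name, parameter, and added field are hypothetical and do not reflect the actual contents of Rule_based_detection.py.
from base import BaseAlgo
from util.param_util import convert_params


class Example_rule_detection(BaseAlgo):
    """Illustrative skeleton only - the real detectors contain their own logic."""

    def __init__(self, options):
        # Parse optional SPL parameters, e.g. "... | fit Example_rule_detection * ratio_threshold=0.4"
        params = convert_params(options.get('params', {}), floats=['ratio_threshold'])
        self.ratio_threshold = params.get('ratio_threshold', 0.4)

    def fit(self, df, options):
        # df is a pandas DataFrame built from the search results;
        # returning a DataFrame streams the (annotated) events back to the search pipeline
        df['is_suspicious'] = 0
        return df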
- Install the Python for Scientific Computing add-on
You must install the Python for Scientific Computing Add-on before installing the Machine Learning Toolkit. Please download and install the appropriate version here:
Linux 64-bit: https://splunkbase.splunk.com/app/2882/
Windows 64-bit: https://splunkbase.splunk.com/app/2883/
Installation
To install an app within Splunk Enterprise:
Log into Splunk Enterprise.
Next to the Apps menu, click the Manage Apps icon.
Click Install app from file.
In the Upload app dialog box, click Choose File.
Locate the .tar.gz or .tar file you just downloaded, then click Open or Choose.
Click Upload.
Note: due to the size of this app, installing it via the web installer/deployer may fail with a timeout error.
An alternative method is to copy it to your $SPLUNK_HOME/etc/apps folder (do not forget to restart Splunk).
- Install MLTK: https://splunkbase.splunk.com/app/2890/#/details
- Copy and paste my MLTK add-on to the equivalent of C:\Program Files\Splunk\etc\apps\; the folder will have a name similar to "epic_mltk_addon_by_Ali".
- [Optional - needed for full scale] You will need to change the default limits. We will only be running the algorithms at off hours (3-6 am EST).
URL endpoints:
/en-US/app/Splunk_ML_Toolkit/algorithm
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=Rule_based_detection
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=Ensemble_based_detection
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=CBRW_based_detection
Changes:
max_inputs to 10000000 // large int; 10 million
max_memory_usage_mb to 10000 // one algorithm is multithreaded; I manage CPU and memory cost via batch processing. 5000 should also be fine
max_fit_time to 10000 // ~2.78 hours
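If you prefer configuration over the UI endpoints above, the same limits can be set in the MLTK's mlspl.conf. A sketch, assuming the global [default] stanza (per-algorithm stanzas can be used instead):
[default]
max_inputs = 10000000
max_memory_usage_mb = 10000
max_fit_time = 10000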
- [Optional - only for local deployment of Splunk] Get test data
  - See the Drive link to get all_actions_all_notMissing.csv
  - Upload to Splunk, changing the URL as needed: http://127.0.0.1:8000/en-US/manager/search/adddata
  - Note that the data is old, so you will need to widen the time range ("scope") Splunk searches in order to get any data, as the default is 24 hours
- [Optional] Machine learning, which is currently not scheduled to be used in Prod
There are two ways to resolve the dependency issues faced when deploying this Splunk ML app:
- (Temporarily) install Python on the Splunk machine (OS level)
  - Run pip install suod pyod coupled-biased-random-walks -t "C:\Program Files\Splunk\Python-3.7\Lib\site-packages" to get the dependencies (3 packages, which have their own dependencies)
- Copy, paste, and replace with my \Python-3.7\Lib\site-packages
  - ~100 MB stored on Drive (lots of defaults updated)
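Either way, it is worth sanity-checking that Splunk's bundled interpreter can actually see the three packages. A minimal sketch (the script name is an assumption; run it with Splunk's bundled Python, e.g. via the splunk cmd wrapper):
# check_deps.py - sanity check for the three ML dependencies
import importlib

for pkg in ("pyod", "suod", "coupled_biased_random_walks"):
    try:
        mod = importlib.import_module(pkg)
        print(pkg, "OK,", getattr(mod, "__version__", "version unknown"))
    except ImportError as err:
        print(pkg, "MISSING:", err)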
Packages used:
- Added coupled_biased_random_walks from https://github.com/dkaslovsky/Coupled-Biased-Random-Walks
- Added PYOD from https://github.com/yzhao062/pyod
- Added SUOD from https://github.com/yzhao062/SUOD
Code - with the dependencies installed, the following now work:
- CBRW_based_detection.py
- Ensemble_based_detection.py
fit LocalOutlierFactor <fields> // columns/features/variables to pass in
[n_neighbors=<int>] // a parameter
[p=<int>] // a parameter, the Minkowski distance; 1 is Manhattan distance (MAE), 2 is Euclidean distance (MSE)
[contamination=<float>] // a parameter in [0 .. 0.5], the fraction of anomalies expected in our data
// Select index and get logs with an `action`, then filter
index=cii_pingfederate action=* requester!= "nascent" requester != "-1" action != "paneView" action != "move" action != "change" action != "preview" action != "view" action != "rowView" action != "Next+time" NOT "@ci-qa.com" NOT "@telusinternal.com"
// Get only the variables explicitly asked for; this removes many internal (parsed) time/date fields
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time
// Derive [Country, City, Region, lat, lon] from requesterIp
| iplocation requesterIp
// Fit on all Fields (note we used `Table`), 2 parameters
| fit Ensemble_based_detection * contamination=0.25, n_estimators=30
// (Ab)use `Table` to order
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Every IP to login to a compromised email", "Compromised emails", "Country", "tid", "Region", "lat", "lon"
// Sort by time; note that this is debatable, the natural order might be better for analysis
| sort _time
Bring up a Map
source="all_actions_all_notMissing.csv"
| table requesterIp
| iplocation requesterIp locallimit=20 | geostats count by Country
Use (hardcoded) rules to detect anomalous traffic
- Best performing detector, and its dependencies are included in MLTK and its prerequisite
Two approaches are used (see the Python sketch after the query below):
- One email being accessed from many IPs
  - Get the most common emails
  - Flag an email as suspicious if, for that email, the number of "unique locations" divided by the "total number of requests" is >= 0.4
- One IP attempting to log into a variety of emails
  - The IP has generated at least 4 events, with at least 3 requesters
  - Get the most common IPs; mark an IP as malicious if all of its status values are NOTFOUND, OR no logins are successful
  - If an IP successfully logs into many emails, mark those emails as compromised
  - Compute every IP that interacted with a compromised email
source="all_actions_all_notMissing.csv"
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit Rule_based_detection *
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Every IP to login to a compromised email", "Compromised emails", "Country", "tid", "Region", "lat", "lon" | sort _time
Machine learning based detector. Second best, but Rule_based_detection is by far better
- Coupled Biased Random Walks (CBRW) is for identifying outliers in categorical data with diversified frequency distributions and many noisy features.
- Features (Fields) used are: action, status, adapter, serviceType, requester, requesterIp, City, Country
- contamination = 0.20, or 0.15 if over 1,000,000 datapoints
source="all_actions_all_notMissing.csv"
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit CBRW_based_detection * contamination=0.20
| table "index", "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "tid", "City", "Country", "Region", "lat", "lon" | sort "index"
Machine learning based detector. Third best, more expensive than CBRW_based_detection
- A multithreaded Ensemble (group) of ~17 detectors
- contamination = 0.25
- n_estimators = 30
- [hardcoded] 3 workers (threads)
source="all_actions_all_notMissing.csv"
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit Ensemble_based_detection * contamination=0.25, n_estimators=30
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "tid", "City", "Country", "Region", "lat", "lon" | sort _time
The output of Rule_based_detection will be (pseudo)sorted by offending IP/email, until final fine tuning is done and trust is ultimately built.
To verify correctness we need to search every IP and email. We can manually search every offending IP or email:
source="all_actions_all_notMissing.csv" [email protected] | iplocation requesterIp
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Country", "tid", "Region", "lat", "lon"
Or we can use Ali's Splunk SDK script to programmatically query n emails or IPs and output an Excel sheet with n tabs. This would remove the need to manually create browser tabs and to actively wait on Splunk to search.
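Not Ali's actual script, but a minimal sketch of that idea using the Splunk Python SDK (splunklib) plus pandas/openpyxl; the host, credentials, and the offender list are placeholders:
import pandas as pd
import splunklib.client as client
import splunklib.results as results

# Placeholder connection details
service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")

offenders = ["203.0.113.7", "user@example.com"]  # hypothetical IPs/emails to verify

with pd.ExcelWriter("offenders.xlsx") as writer:
    for offender in offenders:
        query = ('search source="all_actions_all_notMissing.csv" "{}" '
                 '| iplocation requesterIp '
                 '| table _time, action, status, requester, requesterIp, City, Country'
                 ).format(offender)
        reader = results.ResultsReader(service.jobs.oneshot(query))
        rows = [dict(r) for r in reader if isinstance(r, dict)]
        # One Excel tab per offending IP/email (sheet names are capped at 31 chars)
        pd.DataFrame(rows).to_excel(writer, sheet_name=offender[:31], index=False)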