Link to code and test data: https://drive.google.com/drive/folders/1u1e7SFrW8shPVdttCeKViH4O08ro9-Gu?usp=sharing
Link to future code repo: http://tfs.tsl.telus.com/tfs/telus/BT-GIT/_git/ID-Cust-ci-batch?path=%2Fsplunk-machine-learning
Splunk MLTK algorithms added:
- Rule_based_detection.py
  - Working, no dependencies needed. (By far) the most important
- CBRW_based_detection.py
  - Hard-disabled until dependencies are installed; low priority
- Ensemble_based_detection.py
  - Hard-disabled until dependencies are installed; low priority
This is a modified fork of github.com/splunk/mltk-algo-contrib.
3 detectors are added.
Unneeded detectors are commented out in algos.conf; note that since you will copy and paste my folder, you do not have to deal with that.
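For orientation, custom detectors in an mltk-algo-contrib fork follow the MLTK custom-algorithm interface: a Python class registered in algos.conf with a fit() method that receives the search results as a pandas DataFrame. The skeleton below is only an illustrative sketch that runs inside the MLTK app environment; the class name, parameter, and added field are hypothetical and do not reflect the actual contents of Rule_based_detection.py.
from base import BaseAlgo
from util.param_util import convert_params


class Example_rule_detection(BaseAlgo):
    """Illustrative skeleton only - the real detectors contain their own logic."""

    def __init__(self, options):
        # Parse optional SPL parameters, e.g. "... | fit Example_rule_detection * ratio_threshold=0.4"
        params = convert_params(options.get('params', {}), floats=['ratio_threshold'])
        self.ratio_threshold = params.get('ratio_threshold', 0.4)

    def fit(self, df, options):
        # df is a pandas DataFrame built from the search results;
        # returning a DataFrame streams the (annotated) events back to the search pipeline
        df['is_suspicious'] = 0
        return df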
- Install the Python for Scientific Computing add-on
You must install the Python for Scientific Computing Add-on before installing the Machine Learning Toolkit. Please download and install the appropriate version here:
Linux 64-bit: https://splunkbase.splunk.com/app/2882/
Windows 64-bit: https://splunkbase.splunk.com/app/2883/
Installation
To install an app within Splunk Enterprise:
Log into Splunk Enterprise.
Next to the Apps menu, click the Manage Apps icon.
Click Install app from file.
In the Upload app dialog box, click Choose File.
Locate the .tar.gz or .tar file you just downloaded, then click Open or Choose.
Click Upload.
Note: due to the size of this app, installing it via the web installer/deployer may fail with a timeout error.
An alternative method is to copy it to your $SPLUNK_HOME/etc/apps folder (do not forget to restart Splunk).
- Install MLTK: https://splunkbase.splunk.com/app/2890/#/details
- Copy and paste my MLTK add-on to the equivalent of C:\Program Files\Splunk\etc\apps\; the folder will have a name similar to "epic_mltk_addon_by_Ali".
- [Optional - needed for full scale] You will need to change the default limits. We will only be running the algorithms at off hours (3-6 am EST).
URL endpoints:
/en-US/app/Splunk_ML_Toolkit/algorithm
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=Rule_based_detection
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=Ensemble_based_detection
/en-US/app/Splunk_ML_Toolkit/algorithm?stanza=CBRW_based_detection
Changes:
max_inputs to 10000000 // large int; 10 million
max_memory_usage_mb to 10000 // one algorithm is multithreaded; I manage CPU and memory cost via batch processing. 5000 should also be fine
max_fit_time to 10000 // ~2.78 hours
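If you prefer configuration over the UI endpoints above, the same limits can be set in the MLTK's mlspl.conf. A sketch, assuming the global [default] stanza (per-algorithm stanzas can be used instead):
[default]
max_inputs = 10000000
max_memory_usage_mb = 10000
max_fit_time = 10000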
- [Optional - only for local deployment of Splunk] Get test data
  - See the Drive link to get all_actions_all_notMissing.csv
  - Upload to Splunk, changing the URL as needed: http://127.0.0.1:8000/en-US/manager/search/adddata
  - Note that the data is old, so you will need to widen the time range ("scope") Splunk searches in order to get any data, as the default is 24 hours
- [Optional] Machine learning, which is currently not scheduled to be used in Prod
There are two ways to resolve the dependency issues faced when deploying this Splunk ML app:
- (Temporarily) install Python on the Splunk machine (OS level)
  - Run pip install suod pyod coupled-biased-random-walks -t "C:\Program Files\Splunk\Python-3.7\Lib\site-packages" to get the dependencies (3 packages, which have their own dependencies)
- Copy, paste, and replace with my \Python-3.7\Lib\site-packages
  - ~100 MB stored on Drive (lots of defaults updated)
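Either way, it is worth sanity-checking that Splunk's bundled interpreter can actually see the three packages. A minimal sketch (the script name is an assumption; run it with Splunk's bundled Python, e.g. via the splunk cmd wrapper):
# check_deps.py - sanity check for the three ML dependencies
import importlib

for pkg in ("pyod", "suod", "coupled_biased_random_walks"):
    try:
        mod = importlib.import_module(pkg)
        print(pkg, "OK,", getattr(mod, "__version__", "version unknown"))
    except ImportError as err:
        print(pkg, "MISSING:", err)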
Packages used:
- Added coupled_biased_random_walks from https://github.com/dkaslovsky/Coupled-Biased-Random-Walks
- Added PYOD from https://github.com/yzhao062/pyod
- Added SUOD from https://github.com/yzhao062/SUOD
Code - with the dependencies installed, the following now work:
- CBRW_based_detection.py
- Ensemble_based_detection.py
fit LocalOutlierFactor <fields> // columns/features/variables to pass in
[n_neighbors=<int>] // a parameter
[p=<int>] // a parameter, the Minkowski distance; 1 is Manhattan distance (MAE), 2 is Euclidean distance (MSE)
[contamination=<float>] // a parameter in [0 .. 0.5], the fraction of anomalies expected in our data
// Select index and get logs with an `action`, then filter
index=cii_pingfederate action=* requester!= "nascent" requester != "-1" action != "paneView" action != "move" action != "change" action != "preview" action != "view" action != "rowView" action != "Next+time" NOT "@ci-qa.com" NOT "@telusinternal.com"
// Get only the variables explicitly asked for; this removes many internal (parsed) time/date fields
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time
// Derive [Country, City, Region, lat, lon] from requesterIp
| iplocation requesterIp
// Fit on all Fields (note we used `Table`), 2 parameters
| fit Ensemble_based_detection * contamination=0.25, n_estimators=30
// (Ab)use `Table` to order
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Every IP to login to a compromised email", "Compromised emails", "Country", "tid", "Region", "lat", "lon"
// Sort by time; note that this is debatable, the natural order might be better for analysis
| sort _time
Bring up a Map
source="all_actions_all_notMissing.csv"
| table requesterIp
| iplocation requesterIp locallimit=20 | geostats count by Country
Use (hardcoded) rules to detect anomalous traffic
- Best performing detector, and its dependencies are included in MLTK and its prerequisite
Two approaches are used (see the Python sketch after the query below):
- One email being accessed from many IPs
  - Get the most common emails
  - Flag an email as suspicious if, for that email, the number of "unique locations" divided by the "total number of requests" is >= 0.4
- One IP attempting to log into a variety of emails
  - The IP has generated at least 4 events, with at least 3 requesters
  - Get the most common IPs; mark an IP as malicious if all of its status values are NOTFOUND, OR no logins are successful
  - If an IP successfully logs into many emails, mark those emails as compromised
  - Compute every IP that interacted with a compromised email
source="all_actions_all_notMissing.csv"
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit Rule_based_detection *
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Every IP to login to a compromised email", "Compromised emails", "Country", "tid", "Region", "lat", "lon" | sort _time
Machine learning based detector. Second best, but Rule_based_detection is by far better
- Coupled Biased Random Walks (CBRW) is for identifying outliers in categorical data with diversified frequency distributions and many noisy features.
- Features (Fields) used are: action, status, adapter, serviceType, requester, requesterIp, City, Country
- contamination = 0.20, or 0.15 if over 1,000,000 datapoints
source="all_actions_all_notMissing.csv"
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit CBRW_based_detection * contamination=0.20
| table "index", "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "tid", "City", "Country", "Region", "lat", "lon" | sort "index"
Machine learning based detector. Third best, more expensive than CBRW_based_detection
- A multithreaded Ensemble (group) of ~17 detectors
- contamination = 0.25
- n_estimators = 30
- [hardcoded] 3 workers (threads)
source="all_actions_all_notMissing.csv"
| table action, status, adapter, serviceType, requester, requesterIp, tid, _time | iplocation requesterIp
| fit Ensemble_based_detection * contamination=0.25, n_estimators=30
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "tid", "City", "Country", "Region", "lat", "lon" | sort _time
The output of Rule_based_detection will be (pseudo)sorted by offending IP/email, until final fine tuning is done and trust is ultimately built.
To verify correctness we need to search every IP and email. We can manually search every offending IP or email:
source="all_actions_all_notMissing.csv" [email protected] | iplocation requesterIp
| table "_time", "action", "status", "adapter", "serviceType", "requester", "requesterIp", "City", "Country", "tid", "Region", "lat", "lon"
Or we can use Ali's Splunk SDK script to programmatically query n emails or IPs and output an Excel sheet with n tabs. This would remove the need to manually create browser tabs and to actively wait on Splunk to search.
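Not Ali's actual script, but a minimal sketch of that idea using the Splunk Python SDK (splunklib) plus pandas/openpyxl; the host, credentials, and the offender list are placeholders:
import pandas as pd
import splunklib.client as client
import splunklib.results as results

# Placeholder connection details
service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")

offenders = ["203.0.113.7", "user@example.com"]  # hypothetical IPs/emails to verify

with pd.ExcelWriter("offenders.xlsx") as writer:
    for offender in offenders:
        query = ('search source="all_actions_all_notMissing.csv" "{}" '
                 '| iplocation requesterIp '
                 '| table _time, action, status, requester, requesterIp, City, Country'
                 ).format(offender)
        reader = results.ResultsReader(service.jobs.oneshot(query))
        rows = [dict(r) for r in reader if isinstance(r, dict)]
        # One Excel tab per offending IP/email (sheet names are capped at 31 chars)
        pd.DataFrame(rows).to_excel(writer, sheet_name=offender[:31], index=False)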