# bn2vec
Boolean Network embedding techniques & ML-based Boolean Network classification.
## 0. Introduction:
bn2vec is the result of research conducted from March to September 2021 as part of a larger project named [BNediction](https://bnediction.github.io/). The research focused on developing new embedding techniques built specifically for Boolean Networks, with the aim of using these techniques to classify Boolean Networks and to develop a solid set of features able to explain the performance of a given BN. <br/>
The full master's thesis wrapping the work done in this package can be found here: [Master's Thesis](https://drive.google.com/file/d/1I8tlNt7-CV9RZhmOJ5rv5Hxi_padirUl/view?usp=sharing); all details of how the embedding and the classification work are discussed in the report. <br/>
For a walkthrough example, please check [test.ipynb](./tests/test.ipynb).
## 1. Setting up:
Step 1: create and activate a new virtual environment.
```bash
python -m venv env
source env/bin/activate  # on Windows: env\Scripts\activate
```
Step 2: for a manual setup, install the packages listed in the requirements.txt file and then install bn2vec itself with pip.
```bash
pip install -r requirements.txt
```
```bash
pip install -e .
```
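To verify that the install worked (an optional sanity check):
```python
# a quick sanity check: the package should import without errors
import bn2vec
```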
## 2. Config:
When creating a ConfigParser object, you will be asked to provide the path to your configuration file. The file should be a YAML file and must conform to the validation rules in order to be used; in the absence of a config file, a default (allow-all) file is used, see [Default Config File](./bn2vec/config.yaml).<br/>
Under the **Memory** section, six options are allowed:
- memorize_dnf_graphs (resp. memorize_bn_graphs): if set to true, allows remembering the graph data generated from DNFs (resp. BNs).
- memorize_dnf_sequences (resp. memorize_bn_sequences): if set to true, allows remembering the sequence data generated from DNFs (resp. BNs).
- hard_memory: if set to true, allows storing the data generated from an ensemble of BNs on disk.
- hard_memory_loc: the folder path used for hard_memory.
Under the **Embeddings** section, we can specify any of the following options:
- rsf: stands for **Relaxed Structural Features**; if specified, the system generates RSF features of the given ensemble of BNs.
- lsf: stands for **Lossy Structural Features**; if specified, the system generates LSF features of the given ensemble of BNs.
- ptrns: short for **Patterns**; if specified, the system generates PTRNS features of the given ensemble of BNs.
- igf: stands for **Influence Graph Features**; if specified, the system generates IGF features of the given ensemble of BNs.
For more details about the rest of the file, please have a look at the [Default File](./bn2vec/config.yaml) and the [Full Report](https://drive.google.com/file/d/1I8tlNt7-CV9RZhmOJ5rv5Hxi_padirUl/view?usp=sharing).
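As an illustration, a config file using the options above might look like this (a hypothetical sketch: the key names come from this section, but the exact schema is defined by the validation rules of the [Default Config File](./bn2vec/config.yaml), which should be taken as the reference):
```yaml
# hypothetical layout, for illustration only
Memory:
  memorize_dnf_graphs: true
  memorize_bn_graphs: true
  memorize_dnf_sequences: false
  memorize_bn_sequences: false
  hard_memory: false
  hard_memory_loc: path/to/hard_memory_folder
Embeddings:
  rsf:    # generate Relaxed Structural Features
  ptrns:  # generate Patterns features
```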
## 3. Embeddings:
Let us have a look at the different ways of using the feature engineering module.<br/>
Necessary imports:
```python
from colomoto import minibn
from bn2vec.feature_engineering import Dnf2Vec, Bn2Vec, Ens2Mat
from bn2vec.utils import ConfigParser
```
When using **Dnf2Vec** (embedding a single DNF) or **Bn2Vec** (embedding a single BN, i.e. an ensemble of DNFs), we have to tell the system to parse the config file ourselves.
```python
ConfigParser.parse("path/to/configfile")
```
We use minibn.BooleanNetwork to parse Boolean Network files.
```python
bn = minibn.BooleanNetwork("path/to/boolean_network")
BN = list(bn.items())
```
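For quick experiments, minibn can also build a network directly from a dict of update rules, as in colomoto's documentation (a minimal sketch; the toy rules below are made up):
```python
# hypothetical toy network, for illustration only
bn = minibn.BooleanNetwork({
    "a": "b & !c",
    "b": "a",
    "c": "a | b",
})
BN = list(bn.items())  # list of (component, DNF) pairs
```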
Then, when using **Dnf2Vec**, we can embed one of the BN's DNFs this way.
```python
gen = Dnf2Vec(dnf=BN[0][1], comp_name=BN[0][0])
graphs, seqs, features = gen.generate_features()
```
The generate_features method returns three objects:
- graphs (resp. seqs): a dictionary containing the graph (resp. sequence) data of the given DNF (if asked for in the config).
- features: a pandas Series object containing the final features extracted from the given DNF.
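Since features is a pandas Series and graphs is a plain dictionary, both can be inspected directly (a minimal illustration; the actual contents depend on your config):
```python
print(features.head())      # the first few extracted features
print(list(graphs.keys()))  # which graph representations were memorized, if any
```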
Likewise, we can embed the whole BN.
```python
gen = Bn2Vec(BN)
bn_graphs, bn_seqs, dnfs_data, bn_features = gen.generate_features()
```
This time we have more complicated semi-structured data to look at:
- bn_graphs (resp. bn_seqs): a dictionary containing the graph (resp. sequence) data of the given BN (if asked for in the config).
- dnfs_data: contains the DNF graphs, sequences and features generated by Dnf2Vec for all DNFs in the given BN.
- bn_features: a pandas Series object containing the final features extracted from the given BN.
If we want to embed an ensemble of BNs, we simply use **Ens2Mat** (ensemble to matrix).
```python
gen = Ens2Mat(
    config_path='path/to/config_file',
    master_model_src='path/to/master_model'
)
X, Y = gen.vectorize_BNs(
    'path/to/base_directory',
    '',  # bundle file name (under base_directory)
    size='all'  # or an integer (the number of BNs to embed)
)
```
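With the feature matrix X and the labels Y in hand, any off-the-shelf classifier can be trained on top of the embedding. A minimal sketch with scikit-learn, assuming X is a feature matrix and Y a label vector (bn2vec itself does not prescribe a classifier):
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# hold out part of the ensemble for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, Y_train)
print(clf.score(X_test, Y_test))  # mean accuracy on the held-out BNs
```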
## 4. Features Selector:
In order to use **BnFeaturesSelector**, we have to import one extra module:
```python
from bn2vec.feature_selection import BnFeaturesSelector
```
This module has three main methods:
- drop_zero_variance_features: removes features that show no variation at all across samples.
- cluster_collinear_features_leiden: uses the Leiden algorithm to cluster features based on their collinearities, then selects the best representative feature from each cluster. This method is only useful in the case of LSF and RSF (mostly LSF, where eliminating collinearities is important, but deciding which features to remove is even more important).
- correct_collinearity: takes a set of features and returns another set of features (highly collinear with the input features) which are easier to interpret than the originals.
```python
selector = BnFeaturesSelector(X, mode='lsf')
X = selector.drop_zero_variance_features()
X, clusters = selector.cluster_collinear_features_leiden(thresh=0.8)
```
The thresh argument is the minimal absolute correlation value between two features above which they are considered correlated.
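The third method, correct_collinearity, is described above but not shown; a hypothetical call might look as follows (the argument is an assumption based on the description above; check the method's signature in the source):
```python
# hypothetical: pass the features kept after clustering
X_corrected = selector.correct_collinearity(X)
```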
## 5. Rules Extractor:
Necessary imports for using the rules extraction module:
```python
import os

from bn2vec.utils import BnDataset
from bn2vec.rules_extraction import DTC, RulesExtractor
```
Creating a BnDataset object is necessary:
```python
base_dir = 'path/to/base_directory'
BN = BnDataset(
    dataset_X=os.path.join(base_dir, 'path/to/X_file'),
    dataset_Y=os.path.join(base_dir, 'path/to/Y_file'),
    score_threshold=1
)
```
Then we can create our **DTC** (short for Decision Tree Classifier) object:
```python
dtc = DTC(
    dataset=BN,
    save_dir="path/to/saving_directory",
    ensemble="ens1",
    embedding="ptrns"
)
```
The 'ensemble' and 'embedding' arguments are there just for naming conventions. To train deep decision tree classifiers, we use the train_deep_dtcs method:
```python
dtc.train_deep_dtcs(test_size=0.3)
```
This will train a balanced and an unbalanced version of the tree, save the trees and the metrics in the save_dir folder, and print the metrics for visual inspection.<br/>
In order to extract useful rules from these trees, we should use the **RulesExtractor** class:
```python
rule_extractor = RulesExtractor(
    dataset=BN,
    dtc="path/to/dtc",
)
rules = rule_extractor.extract_rules(
    thresh=0,
    tpr_weight=0.5,  # importance of the true positive rate
    tnr_weight=0.5   # importance of the true negative rate
)
```
For training singleton decision trees (trees with a single split), we use train_singleton_dtcs:
```python
rules = dtc.train_singleton_dtcs(
    test_size=0.3,
    balanced=False,
    thresh=0.5,
    tpr_weight=0.5,
    tnr_weight=0.5
)
```