evalAIRR-0.0.9


Description

Comparison of real and simulated AIRR datasets
| Attribute | Value |
|---|---|
| Operating system | - |
| File name | evalAIRR-0.0.9 |
| Name | evalAIRR |
| Library version | 0.0.9 |
| Maintainer | [] |
| Maintainer email | [] |
| Author | Lukas Sparnauskas |
| Author email | <lukas.11sp@gmail.com> |
| Home page | - |
| URL | https://pypi.org/project/evalAIRR/ |
| License | MIT |

# evalAIRR

A tool that allows comparison of real and simulated AIRR datasets by providing different statistical indicators and dataset visualizations in one report.

## Installation

It is recommended to run evalAIRR in a virtual Python environment, especially if another Python environment is already in use. Here is a quick guide on how to set up a virtual environment: `https://docs.python.org/3/tutorial/venv.html#creating-virtual-environments`

### Install using pip

Run this command to install the evalAIRR package:

`pip install evalairr`

## Quickstart

evalAIRR uses a YAML file for configuration. If you are unfamiliar with how YAML files are structured, read this guide to the syntax: `https://docs.fileformat.com/programming/yaml/#syntax`

This is the structure of a sample report configuration file you can start with (it is included in the repository at `./yaml_files/quickstart.yaml`):

```
datasets:
  real:
    path: ./data/encoded_real_1000_200.csv
  sim:
    path: ./data/encoded_sim_1000_200.csv
reports:
  feature_based:
    report1:
      features:
        - TGT
        - ANV
      report_types:
        - ks
        - distr_densityplot
output:
  path: ./output/report.html
```

This configuration processes the two provided datasets (real and simulated) with encoded kmer data and creates an HTML report with two feature-based report types for the features `TGT` and `ANV`: a Kolmogorov–Smirnov test (indicated by `ks`) and a feature distribution density plot (indicated by `distr_densityplot`). It then exports the report to the path `./output/report.html`. More details on which reports can be created are given in the _YAML Configuration Guidelines_ section.

The repository contains sample data files and a quickstart YAML configuration file. You can clone the repository and run evalAIRR within it to use the sample data.
Within the cloned repository, run the command:

`evalairr -i ./yaml_files/quickstart.yaml`

The report is generated at the output path specified in the configuration file or, if no path is provided, at `<CURRENT_DIRECTORY>/output/report.html`. The report is exported in HTML format.

## YAML Configuration Guidelines

The configuration YAML file consists of 3 main sections: `datasets`, `reports` and `output`.

### Datasets

In the `datasets` section, you have to provide the paths to the real and the simulated dataset that you are comparing. CSV files with encoded kmer data are supported. Specify the file path of each file in the `path` variable under the sections `real` and `sim` respectively. Here is an example of what a configured `datasets` section looks like:

```
datasets:
  real:
    path: ./data/encoded_real_1000_200.csv
  sim:
    path: ./data/encoded_sim_1000_200.csv
```

### Reports

In the `reports` section, you can provide the list of report types you want to create and their parameters. There are three report groups, which differ in their parameters: `feature_based`, `observation_based` and `general`. Here is the list of reports you can create that compare the features of the real dataset with the simulated dataset:

#### Feature-based reports

- <b>`ks`</b> - Kolmogorov–Smirnov statistic. Parameters: list of features you are creating the report for.
- <b>`distr_histogram`</b> - feature distribution histogram. Parameters: list of features you are creating the report for.
- <b>`distr_boxplot`</b> - feature distribution boxplot. Parameters: list of features you are creating the report for.
- <b>`distr_violinplot`</b> - feature distribution violin plot. Parameters: list of features you are creating the report for.
- <b>`distr_densityplot`</b> - feature distribution density plot. Parameters: list of features you are creating the report for.
- <b>`distance`</b> - Euclidean distance between the real and simulated feature.
  Parameters: list of features you are creating the report for.
- <b>`statistics`</b> - statistical indicators (mean, median, standard deviation and variance) of a feature in both real and simulated datasets. Parameters: list of features you are creating the report for.

#### Observation-based reports

- <b>`ks`</b> - Kolmogorov–Smirnov statistic. Parameters: list of observations you are creating the report for.
- <b>`observation_distr_histogram`</b> - observation distribution histogram. Parameters: list of observations you are creating the report for.
- <b>`observation_distr_boxplot`</b> - observation distribution boxplot. Parameters: list of observations you are creating the report for.
- <b>`observation_distr_violinplot`</b> - observation distribution violin plot. Parameters: list of observations you are creating the report for.
- <b>`observation_distr_densityplot`</b> - observation distribution density plot. Parameters: list of observations you are creating the report for. The observation index `all` can be used to report on all observations in one plot.
- <b>`observation_distance`</b> - Euclidean distance between the real and simulated observation. Parameters: list of observations you are creating the report for.
- <b>`observation_statistics`</b> - statistical indicators (mean, median, standard deviation and variance) of an observation in both real and simulated datasets. Parameters: list of observations you are creating the report for.

Additional parameters:

- `with_ml_sim` - optional parameter which, if True, instructs the report to include a comparison with a dataset generated by a GaussianProcessRegressor machine learning model trained on the real dataset (currently, this only applies to the report type `observation_distr_densityplot` in reports with `all` observations).
- `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets the seed of the machine learning model's random number generation.
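
To make the `ks` report type more concrete, here is a minimal pure-Python sketch of the two-sample Kolmogorov–Smirnov statistic, i.e. the largest gap between the empirical distribution functions of two samples. It illustrates the statistic itself and is not evalAIRR's implementation. The function name `ks_statistic` and the sample values are hypothetical, and the example assumes that one feature corresponds to one column of values in each dataset.

```python
from bisect import bisect_right


def ks_statistic(real, sim):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not real or not sim:
        raise ValueError("both samples must be non-empty")
    real_sorted = sorted(real)
    sim_sorted = sorted(sim)
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for value in set(real_sorted) | set(sim_sorted):
        cdf_real = bisect_right(real_sorted, value) / len(real_sorted)
        cdf_sim = bisect_right(sim_sorted, value) / len(sim_sorted)
        largest_gap = max(largest_gap, abs(cdf_real - cdf_sim))
    return largest_gap


# Hypothetical values of one feature (assumed: one column) in each dataset.
real_feature = [0.1, 0.2, 0.3, 0.4]
sim_feature = [0.3, 0.4, 0.5, 0.6]
print(ks_statistic(real_feature, sim_feature))  # 0.5
```

A value of 0 means the two empirical distributions coincide, and a value of 1 means the samples do not overlap at all. The same function could be applied to one observation instead of one feature by passing the values of that observation from each dataset.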
#### General reports

- <b>`ks_feat`</b> - Kolmogorov–Smirnov statistic for all features. Parameters:
  - `output` - optional parameter that specifies the path of the text/csv file the results will be exported to (default value is `./output/ks_feat.csv`). The csv file contains two rows: the first row contains the ks-statistic and the second one the p-values.
- <b>`ks_obs`</b> - Kolmogorov–Smirnov statistic for all observations. Parameters:
  - `output` - optional parameter that specifies the path of the text/csv file the results will be exported to (default value is `./output/ks_obs.csv`). The csv file contains two rows: the first row contains the ks-statistic and the second one the p-values.
- <b>`copula_2d`</b> - a 2D scatter plot that displays two features in a Gaussian Multivariate copula space. Parameters: a report section of any name, under which the compared features are specified.
- <b>`copula_3d`</b> - a 3D scatter plot that displays three features in a Gaussian Multivariate copula space. Parameters: a report section of any name, under which the compared features are specified.
- <b>`feature_mean_vs_variance`</b> - a scatter plot that displays the mean value of every feature on one axis and the variance of every feature on the other axis. Parameters:
  - `with_ml_sim` - optional parameter which, if True, instructs the report to include a comparison with a dataset generated by a GaussianProcessRegressor machine learning model trained on the real dataset.
  - `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets the seed of the machine learning model's random number generation.
- <b>`observation_mean_vs_variance`</b> - a scatter plot that displays the mean value of every observation on one axis and the variance of every observation on the other axis.
  Parameters:
  - `with_ml_sim` - optional parameter which, if True, instructs the report to include a comparison with a dataset generated by a GaussianProcessRegressor machine learning model trained on the real dataset.
  - `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets the seed of the machine learning model's random number generation.
- <b>`corr`</b> - correlation matrix heatmaps of the real and simulated datasets. Parameters:
  - `reduce_to_n_features` - optional parameter for dimensionality reduction using PCA: the number of features to reduce the dataset to (requires `reduce_to_n_features < min(n_observations, n_features)`).
  - `with_ml_sim` - optional parameter which, if True, instructs the report to include a comparison with a dataset generated by a GaussianProcessRegressor machine learning model trained on the real dataset.
  - `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets the seed of the machine learning model's random number generation.
- <b>`corr_hist`</b> - correlation matrix distribution histogram for the real and simulated datasets. Parameters:
  - `n_bins` - optional parameter that sets the number of bins in the histogram (default value is 30).
  - `reduce_to_n_features` - optional parameter for dimensionality reduction using PCA: the number of features to reduce the dataset to (requires `reduce_to_n_features < min(n_observations, n_features)`).
  - `with_ml_sim` - optional parameter which, if True, instructs the report to include a comparison with a dataset generated by a GaussianProcessRegressor machine learning model trained on the real dataset.
  - `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets the seed of the machine learning model's random number generation.
- <b>`corr_csv`</b> - export of the difference between the correlation matrices of the real and simulated datasets to a CSV file.
  Parameters:
  - `output` - optional parameter that specifies the path of the text/csv file the results will be exported to (default value is `./output/corr.csv`).
- <b>`pca_2d_feat`</b> - two feature-level scatter plots with both datasets reduced to two dimensions using PCA. Parameters:
  - `with_ml_sim` - optional parameter which, if True, instructs the report to include a comparison with a dataset generated by a GaussianProcessRegressor machine learning model trained on the real dataset.
  - `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets the seed of the machine learning model's random number generation.
- <b>`pca_2d_obs`</b> - two observation-level scatter plots with both datasets reduced to two dimensions using PCA. Parameters:
  - `with_ml_sim` - optional parameter which, if True, instructs the report to include a comparison with a dataset generated by a GaussianProcessRegressor machine learning model trained on the real dataset.
  - `ml_random_state` - optional integer parameter, relevant only if `with_ml_sim` is set to True, which sets the seed of the machine learning model's random number generation.
- <b>`distance`</b> - Euclidean distance between the real and simulated features. Parameters:
  - `output` - optional parameter that specifies the path of the text/csv file the results will be exported to (default value is `./output/dist.csv`).
- <b>`statistics`</b> - statistical indicators (mean, median, standard deviation and variance) of all features in both real and simulated datasets. Parameters:
  - `output_dir` - optional parameter that specifies the directory to which the csv result files `real_stat.csv` and `sim_stat.csv` will be exported (default value is `./output/`). Each csv file contains four rows, each with a different statistic: 1 - mean, 2 - median, 3 - standard deviation, 4 - variance.
- <b>`observation_distance`</b> - Euclidean distance between the real and simulated observations. Parameters:
  - `output` - optional parameter that specifies the path of the text/csv file the results will be exported to (default value is `./output/obs_dist.csv`).
- <b>`observation_statistics`</b> - statistical indicators (mean, median, standard deviation and variance) of all observations in both real and simulated datasets. Parameters:
  - `output_dir` - optional parameter that specifies the directory to which the csv result files `real_obs_stat.csv` and `sim_obs_stat.csv` will be exported (default value is `./output/`). Each csv file contains four rows, each with a different statistic: 1 - mean, 2 - median, 3 - standard deviation, 4 - variance.
- <b>`jensen_shannon`</b> - Jensen-Shannon divergence metric between the real and simulated features. Parameters:
  - `output` - optional parameter that specifies the path of the text/csv file the results will be exported to (default value is `./output/jenshan.csv`).
- <b>`observation_jensen_shannon`</b> - Jensen-Shannon divergence metric between the real and simulated observations. Parameters:
  - `output` - optional parameter that specifies the path of the text/csv file the results will be exported to (default value is `./output/obs_jenshan.csv`).

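
As background for the `jensen_shannon` report types, the following pure-Python sketch computes the Jensen-Shannon divergence between two samples after binning them into histograms. It is only an illustration of the metric under stated assumptions: the source does not say how evalAIRR bins the values or whether it reports the divergence or its square root, so the helper names, the equal-width binning and the sample values are all hypothetical.

```python
import math


def histogram(values, n_bins, low, high):
    """Normalised histogram over n_bins equal-width bins on [low, high] (assumed binning)."""
    if high <= low or n_bins < 1 or not values:
        raise ValueError("need non-empty values, n_bins >= 1 and low < high")
    counts = [0] * n_bins
    width = (high - low) / n_bins
    for value in values:
        if not low <= value <= high:
            raise ValueError("value outside [low, high]")
        index = min(int((value - low) / width), n_bins - 1)  # value == high goes to the last bin
        counts[index] += 1
    return [count / len(values) for count in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two discrete distributions, between 0 and 1 in base 2."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical values of one feature in the real and the simulated dataset.
real_hist = histogram([0.0, 0.1, 0.2, 0.3], n_bins=2, low=0.0, high=1.0)
sim_hist = histogram([0.6, 0.7, 0.8, 1.0], n_bins=2, low=0.0, high=1.0)
print(jensen_shannon_divergence(real_hist, sim_hist))  # 1.0, the histograms do not overlap
```

With base-2 logarithms the result is 0 for identical histograms and 1 for histograms with no shared bins, which makes values comparable across features.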

Here is a sample `reports` section of a configuration file containing all of the reports:

```
reports:
  feature_based:
    report1:
      features:
        - TGT
        - ANV
      report_types:
        - ks
        - distr_histogram
        - distr_boxplot
        - distr_violinplot
        - distr_densityplot
        - distance
        - statistics
  observation_based:
    report1:
      observations:
        - 0
      report_types:
        - ks
        - observation_distr_histogram
        - observation_distr_boxplot
        - observation_distr_violinplot
        - observation_distr_densityplot
        - observation_distance
        - observation_statistics
    report2:
      observations:
        - all
      report_types:
        - observation_distr_densityplot
      with_ml_sim: True
      ml_random_state: 0
  general:
    copula_2d:
      report1:
        - TGT
        - ANV
    copula_3d:
      report1:
        - TGT
        - ANV
        - CAS
    feature_mean_vs_variance:
      with_ml_sim: True
      ml_random_state: 0
    observation_mean_vs_variance:
      with_ml_sim: True
      ml_random_state: 0
    corr_hist:
      n_bins: 30
      with_ml_sim: True
      ml_random_state: 0
      reduce_to_n_features: 150
    corr:
      reduce_to_n_features: 150
      with_ml_sim: True
      ml_random_state: 0
    pca_2d_feat:
      with_ml_sim: True
      ml_random_state: 0
    pca_2d_obs:
      with_ml_sim: True
      ml_random_state: 0
    corr_csv:
      output: ./output/corr.csv
    ks_feat:
      output: ./output/ks_feat.csv
    ks_obs:
      output: ./output/ks_obs.csv
    statistics:
      output_dir: ./output/
    observation_statistics:
      output_dir: ./output/
    distance:
      output: ./output/dist.csv
    observation_distance:
      output: ./output/obs_dist.csv
    jensen_shannon:
      output: ./output/jenshan.csv
    observation_jensen_shannon:
      output: ./output/obs_jenshan.csv
```

### Output

An optional section where you can specify the file path of the generated report. The default path of the generated report is `<CURRENT_DIRECTORY>/output/report.html`. The report is exported in HTML format. If you declare the path as `NONE`, the report will not be created.

An example output section:

```
output:
  path: ./output/report.html
```

For example, this output section would result in a report file not being created:

```
output:
  path: NONE
```
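
For scripted runs, the three configuration sections can be assembled from Python and passed to the command line tool. The sketch below is a hedged illustration rather than part of evalAIRR: `build_config` and `run_evalairr` are hypothetical helper names, the generated file covers only a single feature-based report, and the only evalAIRR interface it relies on is the `evalairr -i <file>` command shown in the Quickstart.

```python
import subprocess


def build_config(real_csv, sim_csv, features, report_types, report_path=None):
    """Return the text of a minimal configuration with one feature-based report."""
    lines = [
        "datasets:",
        "  real:",
        f"    path: {real_csv}",
        "  sim:",
        f"    path: {sim_csv}",
        "reports:",
        "  feature_based:",
        "    report1:",
        "      features:",
    ]
    lines += [f"        - {name}" for name in features]
    lines.append("      report_types:")
    lines += [f"        - {name}" for name in report_types]
    # The output section is optional, so it is only written when a path is given.
    if report_path is not None:
        lines += ["output:", f"  path: {report_path}"]
    return "\n".join(lines) + "\n"


def run_evalairr(config_path):
    """Run the documented CLI on a configuration file (requires evalAIRR to be installed)."""
    return subprocess.run(["evalairr", "-i", str(config_path)], check=True)


config_text = build_config(
    "./data/encoded_real_1000_200.csv",
    "./data/encoded_sim_1000_200.csv",
    features=["TGT", "ANV"],
    report_types=["ks", "distr_densityplot"],
    report_path="./output/report.html",
)
print(config_text)
```

Writing `config_text` to a file and calling `run_evalairr` on that file would then be equivalent to running the quickstart command by hand. Passing `report_path="NONE"` produces the output section that suppresses the HTML report.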


How to install


Install the evalAIRR-0.0.9 whl package:

    pip install evalAIRR-0.0.9.whl


Install the evalAIRR-0.0.9 tar.gz package:

    pip install evalAIRR-0.0.9.tar.gz