معرفی شرکت ها

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

تبلیغات ما

مشتریان به طور فزاینده ای آنلاین هستند. تبلیغات می تواند به آنها کمک کند تا کسب و کار شما را پیدا کنند.

مشاهده بیشتر

توضیحات

Datategy Cohort Targeting : One-click discovery of diverse cohorts in your dataset with statistical guarantees

ویژگی	مقدار
سیستم عامل	-
نام فایل	dacota-0.1.2
نام	dacota
نسخه کتابخانه	0.1.2
نگهدارنده	[]
ایمیل نگهدارنده	[]
نویسنده	Dr. Eren Unlu
ایمیل نویسنده	eren.unlu@datategy.net
آدرس صفحه اصلی	https://github.com/datascientistunlu/dacota
آدرس اینترنتی	https://pypi.org/project/dacota/
مجوز	MIT

![--](docs/assets/dacoto_docs_01.png) ##### ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ```python pip install dacota ``` ##### ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ # **DACOTA** ##### ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ❏ ### **Overview** “Dacota”, Datategy Cohort Targeting, (pronounced /- dəˈkoʊtə/ ) is a python library developed by Datategy SAS’s data science team making automatic cohort discovery a breeze. We have identified the crucial need and urgency among the community to provide a tool which identifies cohorts related to a target variable (subgroups of the dataset of various sizes with similar characteristics) with solid statistical guarantees. Using Dacota, by choosing your ultimate task, whether regression or classification and your desired parameters related to cohortization and optimization process, with a single click you can identify the subgroups with statistical confidence. ### **Cohort Centric Data Science and ML** Ever since Andrew Ng coined the term “Data Centric AI”, the community has increased its efforts to focus more on data in terms of analysis, feature engineering and quality augmentation rather than ML models. Though this observation is totally correct especially in terms of tabular tasks, there exists an important gap, especially for certain types of data and context : The need for a cohort centric approach. When considering “imbalance” in data science and ML, we directly refer to the imbalance of target variable’s distribution, either for regression or classification. From a perspective, this stems from the fact that we generally want to optimize an “overall metric”, averaged over whole test dataset such as “mean squared error” or “mean classification accuracy”. However, most of the time, in real life use cases, overall metric is far from reflecting the actual objectives. This is due to the “imbalance in predictors” rather than on target variable. In other words, due to the cohortized nature of the dataset. These cohort specific concerns are prevalent especially in sensitive ML use cases, such as healthcare, security, finance etc. For instance, a dataset for developing a medical treatment may contain mostly young individuals, where the ML model optimized to maximize mean accuracy would discard the performance among elders, a cohort of minority. Of course, in this specific case the real life objective is to retain a certain degree of reliability for any important subgroup, rather than the overall metric. Though there exists many post hoc set of ML model performance assessment techniques, the cohort awareness should exists since the beginning of the development cycle, in exploratory data analysis. Therefore, as Datategy data science team, we wanted to declare a manifest of “Cohort Centric AI”. The library is at its extreme infancy, probably requiring tons of improvements in computation and also many functional extensions. We just wanted to spark an initiative around our cohort centric manifest and encourage community to improve the library. In our seed version, considering practically infinite set of options when it comes to automatic dataset cohortization, we aimed an optimal trade-off of usability, complexity, customizability, speed, statistical reliability and accuracy. ### **How it works ?** Dacota uses genetic algorithm to identify diverse set of cohorts inside a dataset for a particular task related to a target variable. In our devised framework, we assume each predictor is categorical. For continuous predictors, the algorithm automatically discretizes them in quantiles with number of bins set by the user. Given this constraint, a cohort can be encoded as a binary string like “10110…”, each bit representing a category of an input variable. Obviously, a cohort for a single feature cannot be all ‘0s’, whilst it can be all ‘1s’ (indicating this cohort for this particular feature has no statistical constraint) and any other combination. The idea is to identify cohorts which are divergent from the whole dataset with predetermined minimum statistical confidence. For this purpose, in case of regression, we perform permuted t-test with unequal variance. Fitness of a cohort in this case represents the absolute distance of cohort’s target variable mean from the whole dataset. For classification, we perform a permuted version of Jensen-Shannon distance measurement of categorical distribution of target variable. To incorporate the effect of cohort size, for each permutation, we sample a random control group of the same cohort size, record the JS distances and perform a t-test among sampled distances. Under this context, genetic algorithm is the natural choice of optimization as it provides maximum diversity in produced results and the notions of mutation and crossover fits well to the concept of cohortization. The user can choose the minimum and maximum size of his/her cohort definition and minimum statistical confidence among other variables related to genetic algorithm optimization. ### **Usage** ```python ## regression import warnings warnings.filterwarnings('ignore') from dacota import process_cohortization, get_cohort import pandas as pd import numpy as np from sklearn.datasets import load_boston dict_data = load_boston() df = pd.DataFrame(data = dict_data["data"], columns=dict_data["feature_names"]) df["target"] = dict_data["target"] del(dict_data) categorical_cols = [ "CHAS", 'target'] continous_cols = ['ZN', 'INDUS', 'TAX', 'CRIM', 'AGE', 'NOX', 'DIS', 'PTRATIO', 'LSTAT', ] target_col = "RAD" task = "classification" all_columns = list(set.union(set(categorical_cols),set(continous_cols),set([target_col]))) df = df[all_columns] df_cohorts, df_generations = process_cohortization(df, target_col, categorical_cols, continous_cols, task, minimum_confidence_for_a_cohort = 0.98, minimum_population_for_a_cohort = 5, maximum_population_for_a_cohort = 50, no_permutations = 100, no_initial_population_size = 2000, no_generations= 20, no_random_additions_per_generation = 1000, no_mutations_per_generation = 1000, no_crossovers_per_generation = 1000, top_quantile_to_survive_per_generation = 0.5, no_bins_discretization_continous = 5, treat_nans_as_category = True, target_nan_strategy = None, ) solution_form = df_cohorts.iloc[0].solution_form get_cohort(df, solution_form, continous_cols) ``` ```python ## classification import warnings warnings.filterwarnings('ignore') from dacota import process_cohortization, get_cohort import pandas as pd import numpy as np from sklearn.datasets import load_boston dict_data = load_boston() df = pd.DataFrame(data = dict_data["data"], columns=dict_data["feature_names"]) df["target"] = dict_data["target"] del(dict_data) categorical_cols = [ "CHAS", 'target'] continous_cols = ['ZN', 'INDUS', 'TAX', 'CRIM', 'AGE', 'NOX', 'DIS', 'PTRATIO', 'LSTAT', ] target_col = "RAD" task = "classification" all_columns = list(set.union(set(categorical_cols),set(continous_cols),set([target_col]))) df = df[all_columns] df_cohorts, df_generations = process_cohortization(df, target_col, categorical_cols, continous_cols, task, minimum_confidence_for_a_cohort = 0.98, minimum_population_for_a_cohort = 5, maximum_population_for_a_cohort = 50, no_permutations = 100, no_initial_population_size = 2000, no_generations= 20, no_random_additions_per_generation = 1000, no_mutations_per_generation = 1000, no_crossovers_per_generation = 1000, top_quantile_to_survive_per_generation = 0.5, no_bins_discretization_continous = 5, treat_nans_as_category = True, target_nan_strategy = None, ) solution_form = df_cohorts.iloc[0].solution_form get_cohort(df, solution_form, continous_cols) ```

زبان مورد نیاز

مقدار	نام
>=3	Python

نحوه نصب

نصب پکیج whl dacota-0.1.2:

pip install dacota-0.1.2.whl

نصب پکیج tar.gz dacota-0.1.2:

pip install dacota-0.1.2.tar.gz