MultiProcessMStepRegression-0.3.2



Description

Python step-wise regression with multi-processing.
| Attribute | Value |
| --- | --- |
| Operating system | OS Independent |
| File name | MultiProcessMStepRegression-0.3.2 |
| Package name | MultiProcessMStepRegression |
| Version | 0.3.2 |
| Maintainer | [] |
| Maintainer email | [] |
| Author | 王文皓 (wangwenhao) |
| Author email | DATA-OG@139.com |
| Homepage | https://github.com/wangwenhao-DATA-OG/MultiProcessMStepRegression |
| PyPI URL | https://pypi.org/project/MultiProcessMStepRegression/ |
| License | - |
# Install

    pip install MultiProcessMStepRegression

# Function Description

A step-wise regression library for Python. It provides step-wise logistic regression and step-wise linear regression, and uses multi-processing when deciding whether to add or remove features. Multi-processing is supported on Windows as well.

# All characteristics

1. Supports forward-backward step-wise selection.
2. Supports multi-processing: when adding or removing features, multiple processes traverse all candidate features concurrently.
3. The user can specify a performance index other than AIC/BIC for measuring model performance when adding or removing features, which is beneficial when the data is unbalanced.
4. The user can set a p-value threshold. If the maximum p-value exceeds this threshold, the current feature is not added, even if it improves model performance.
5. The user can set a VIF threshold. If the maximum VIF exceeds this threshold, the current feature is not added, even if it improves model performance.
6. The user can set a correlation-coefficient threshold. If the maximum correlation coefficient exceeds this threshold, the current feature is not added, even if it improves model performance.
7. The user can constrain the signs of the regression coefficients. Some features carry business meaning (for example after a WOE transform) that requires all regression coefficients to be positive or negative. If the sign requirement is not met, the current feature is not added, even if it improves model performance.
8. Points 4-7 are enforced inside the step-wise procedure: picking features and verifying the thresholds and signs happen simultaneously.
9. Users get the reasons why a feature was not picked up: after adding it the performance fell, the p-value exceeded the threshold, the signs did not match the user's expectation, and so on.
10. Chinese and English logs are supported, in which the user gets a record of every iteration.

#### News in 0.2.1

11. When deciding whether the current feature should enter the model, the probabilities output by the current model can be sorted descending and only the top (or last) n% used to evaluate the model. This raises the ability to catch one class of samples; for example in credit risk, it lifts the proportion of overdue users in the high-risk sublevels (LIFT 1 or LIFT 5). Available for logistic regression.
12. When deciding whether the current feature should enter the model, a data set different from the training set can be used to evaluate the model. (Try not to confuse the validation set with the test set: using the test set during feature selection leaks data into the final model evaluation and may overestimate model performance. A clean test set should be reserved for the final evaluation.)

#### News in 0.3.1

13. The user can set features that must be included in the model.
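Several of these options can be combined in one call. Below is a minimal sketch (not from the package's own docs) exercising points 3-7, 11 and 13 on synthetic data; the parameter names follow the API reference later on this page, and the column names are made up for illustration.

```
# Hedged sketch: combining the step-wise options listed above on synthetic data.
import pandas as pd
from sklearn.datasets import make_classification
import MultiProcessMStepRegression as mpmr

if __name__ == '__main__':  # guard required for multi-processing, also on Windows
    X, y = make_classification(n_samples=500, n_features=6, random_state=0)
    X = pd.DataFrame(X, columns=['x%d' % i for i in range(6)])
    y = pd.Series(y)

    lr = mpmr.LogisticReg(
        X, y,
        given_cols=['x0'],       # point 13: 'x0' must enter the model
        coef_sign={'x1': '+'},   # point 7: require a positive coefficient for 'x1'
        measure='roc_auc',       # point 3: ROC-AUC instead of AIC/BIC
        measure_frac=0.2,        # point 11: evaluate on the top 20% by probability
        max_pvalue_limit=0.05,   # point 4
        max_vif_limit=3,         # point 5
        max_corr_limit=0.6,      # point 6
    )
    in_vars, clf_final, dr = lr.fit()
    print(in_vars)
```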
# Q&A

WeChat: DATA_OG (remark: github)

# Usage

```
import MultiProcessMStepRegression as mpmr
from sklearn.datasets import make_classification, make_regression
import pandas as pd


def get_X_y(data_type, n_samples=200, random_state=0):
    if data_type == 'logistic':
        # number of informative features = 4
        # number of redundant features = 2; a redundant feature is a linear
        # combination of the informative features
        # number of useless features = 10 - 4 - 2 = 4
        X, y = make_classification(n_samples=n_samples, n_features=10,
                                   n_informative=4, n_redundant=2, shuffle=False,
                                   random_state=random_state, class_sep=2)
        X = pd.DataFrame(X, columns=['informative_1', 'informative_2', 'informative_3',
                                     'informative_4', 'redundant_1', 'redundant_2',
                                     'useless_1', 'useless_2', 'useless_3',
                                     'useless_4']).sample(frac=1)
        y = pd.Series(y).loc[X.index]
    if data_type == 'linear':
        # number of informative features = 6
        # matrix rank = 4 (implying collinearity between the six informative features)
        X, y = make_regression(n_samples=n_samples, n_features=10, n_informative=6,
                               effective_rank=4, noise=5, shuffle=False,
                               random_state=random_state)
        X = pd.DataFrame(X, columns=['informative_1', 'informative_2', 'informative_3',
                                     'informative_4', 'informative_5', 'informative_6',
                                     'useless_1', 'useless_2', 'useless_3',
                                     'useless_4']).sample(frac=1)
        y = pd.Series(y).loc[X.index]
    return X, y


def test_logit(X, y):
    # As can be seen:
    # 1. All informative features are picked up by the algorithm.
    # 2. All linear-combination features are excluded; the reasons are exceeding
    #    max_vif_limit, max_corr_limit and max_pvalue_limit, and no lift in model
    #    performance.
    # 3. All useless features are excluded; the reasons are no lift in model
    #    performance or exceeding max_pvalue_limit.
    #
    # Expected return:
    # in_vars = ['informative_3', 'informative_4', 'informative_2', 'informative_1']
    #
    # dr = {'redundant_1': (['模型性能=0.956100,小于等于最终模型的性能=0.956100',
    #                        '最大VIF=inf,大于设置的阈值=3.000000',
    #                        '最大相关系数=0.925277,大于设置的阈值=0.600000',
    #                        '有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                       ['the performance index of model=0.956100,less or equals than the performance index of final model=0.956100',
    #                        'the max VIF=inf,more than the setting of max_vif_limit=3.000000',
    #                        'the max correlation coefficient=0.925277,more than the setting of max_corr_limit=0.600000',
    #                        'some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000']),
    #       'redundant_2': (['模型性能=0.956100,小于等于最终模型的性能=0.956100',
    #                        '最大VIF=inf,大于设置的阈值=3.000000',
    #                        '最大相关系数=0.676772,大于设置的阈值=0.600000',
    #                        '有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                       ['the performance index of model=0.956100,less or equals than the performance index of final model=0.956100',
    #                        'the max VIF=inf,more than the setting of max_vif_limit=3.000000',
    #                        'the max correlation coefficient=0.676772,more than the setting of max_corr_limit=0.600000',
    #                        'some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000']),
    #       'useless_1': (['模型性能=0.955200,小于等于最终模型的性能=0.956100',
    #                      '有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                     ['the performance index of model=0.955200,less or equals than the performance index of final model=0.956100',
    #                      'some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000']),
    #       'useless_2': (['有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                     ['some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000']),
    #       'useless_3': (['有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                     ['some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000']),
    #       'useless_4': (['模型性能=0.955800,小于等于最终模型的性能=0.956100',
    #                      '有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                     ['the performance index of model=0.955800,less or equals than the performance index of final model=0.956100',
    #                      'some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000'])}
    lr = mpmr.LogisticReg(X, y, measure='roc_auc', iter_num=20,
                          logger_file_EN='c:/temp/mstep_en.log',
                          logger_file_CH='c:/temp/mstep_ch.log')
    in_vars, clf_final, dr = lr.fit()
    return in_vars, clf_final, dr


def test_linear(X, y):
    # As can be seen:
    # 1. The picked features are all informative features.
    # 2. The number of picked features equals the matrix rank.
    #
    # Expected return:
    # in_vars = ['informative_2', 'informative_5', 'informative_3', 'informative_4']
    #
    # dr: every excluded feature ('informative_1', 'informative_6', 'useless_1',
    # 'useless_2', 'useless_3', 'useless_4') has the same deletion reason:
    # (['有些系数不显著,P_VALUE大于设置的阈值=0.010000'],
    #  ['some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.010000'])
    lr = mpmr.LinearReg(X, y, max_pvalue_limit=0.01,
                        logger_file_EN='c:/temp/mstep_en.log',
                        logger_file_CH='c:/temp/mstep_ch.log')
    in_vars, rg_final, dr = lr.fit()
    return in_vars, rg_final, dr


def test_measureXy(X, y):
    # As can be seen:
    # 1. The features picked up by the algorithm are all informative.
    # 2. Only one informative feature is excluded, because its correlation
    #    coefficient is 0.65, greater than the 0.6 that was set.
    # 3. All linear-combination features are excluded; the reasons are exceeding
    #    max_vif_limit and max_corr_limit.
    # 4. All useless features are excluded; the reasons are no lift in model
    #    performance and exceeding max_pvalue_limit.
    #
    # Expected return:
    # in_vars = ['informative_4', 'informative_1', 'informative_3']
    #
    # dr = {'informative_2': (['最大相关系数=0.656879,大于设置的阈值=0.600000'],
    #                         ['the max correlation coefficient=0.656879,more than the setting of max_corr_limit=0.600000']),
    #       'redundant_1': (['最大VIF=7.957338,大于设置的阈值=3.000000',
    #                        '最大相关系数=0.886775,大于设置的阈值=0.600000'],
    #                       ['the max VIF=7.957338,more than the setting of max_vif_limit=3.000000',
    #                        'the max correlation coefficient=0.886775,more than the setting of max_corr_limit=0.600000']),
    #       'redundant_2': (['最大VIF=532.953471,大于设置的阈值=3.000000',
    #                        '最大相关系数=0.883488,大于设置的阈值=0.600000'],
    #                       ['the max VIF=532.953471,more than the setting of max_vif_limit=3.000000',
    #                        'the max correlation coefficient=0.883488,more than the setting of max_corr_limit=0.600000']),
    #       'useless_1': (['模型性能=0.640169,小于等于最终模型的性能=0.648233',
    #                      '有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                     ['the performance index of model=0.640169,less or equals than the performance index of final model=0.648233',
    #                      'some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000']),
    #       'useless_2', 'useless_3', 'useless_4': same form as 'useless_1', with
    #       model performance 0.640297, 0.641577 and 0.640553 respectively.}
    from sklearn.model_selection import train_test_split
    X_train, X_measure, y_train, y_measure = train_test_split(X, y, test_size=0.5,
                                                              random_state=10)
    lr = mpmr.LogisticReg(X_train, y_train, measure_X=X_measure,
                          measure_y=y_measure, iter_num=20)
    in_vars, clf_final, dr = lr.fit()
    return in_vars, clf_final, dr


def test_givenX_logistic(X, y):
    # As can be seen:
    # 1. Apart from the features imported by force, the picked features are all
    #    informative.
    # 2. Two informative features are excluded because their VIF and correlation
    #    exceed the values the user set; this is caused by forcing in a redundant
    #    combination feature.
    # 3. All linear-combination features except the forced ones are excluded; the
    #    reasons are exceeding max_vif_limit and max_corr_limit.
    # 4. All useless features except the forced ones are excluded; the reasons are
    #    no lift in model performance and exceeding max_pvalue_limit.
    # 5. The p-value of the forcibly imported useless feature is 0.32.
    #
    # Expected return:
    # in_vars = ['redundant_1', 'useless_4', 'informative_2', 'informative_1']
    #
    # p-values:
    # const          2.041083e-07
    # redundant_1    2.010675e-07
    # useless_4      3.264707e-01
    # informative_2  4.459015e-11
    # informative_1  1.700265e-07
    #
    # dr = {'informative_3': (['最大VIF=3.265828,大于设置的阈值=3.000000'],
    #                         ['the max VIF=3.265828,more than the setting of max_vif_limit=3.000000']),
    #       'informative_4': (['最大VIF=313226.371738,大于设置的阈值=3.000000',
    #                          '最大相关系数=0.925277,大于设置的阈值=0.600000'],
    #                         ['the max VIF=313226.371738,more than the setting of max_vif_limit=3.000000',
    #                          'the max correlation coefficient=0.925277,more than the setting of max_corr_limit=0.600000']),
    #       'redundant_2': (['最大VIF=7.226070,大于设置的阈值=3.000000',
    #                        '最大相关系数=0.814790,大于设置的阈值=0.600000'],
    #                       ['the max VIF=7.226070,more than the setting of max_vif_limit=3.000000',
    #                        'the max correlation coefficient=0.814790,more than the setting of max_corr_limit=0.600000']),
    #       'useless_1': (['有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                     ['some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000']),
    #       'useless_2': (['模型性能=0.892900,小于等于最终模型的性能=0.893000',
    #                      '有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                     ['the performance index of model=0.892900,less or equals than the performance index of final model=0.893000',
    #                      'some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000']),
    #       'useless_3': (['有些系数不显著,P_VALUE大于设置的阈值=0.050000'],
    #                     ['some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.050000'])}
    lr = mpmr.LogisticReg(X, y, given_cols=['redundant_1', 'useless_4'],
                          measure='roc_auc')
    in_vars, clf_final, dr = lr.fit()
    return in_vars, clf_final, dr


def test_givenX_linear(X, y):
    # As can be seen:
    # 1. Compared with test_linear, 'informative_1' and 'useless_1' are also
    #    chosen; they were imported by the user by force.
    # 2. The p-values of 'informative_1' and 'useless_1' are 0.31 and 0.095.
    #
    # Expected return:
    # in_vars = ['informative_1', 'useless_1', 'informative_2', 'informative_5',
    #            'informative_3', 'informative_4']
    #
    # p-values:
    # const          4.096327e-01
    # informative_1  3.087456e-01
    # useless_1      9.528599e-02
    # informative_2  1.897154e-12
    # informative_5  8.468702e-08
    # informative_3  3.985832e-04
    # informative_4  4.351842e-03
    #
    # dr: every excluded feature ('informative_6', 'useless_2', 'useless_3',
    # 'useless_4') has the same deletion reason:
    # (['有些系数不显著,P_VALUE大于设置的阈值=0.010000'],
    #  ['some coefficients are not significant,P_VALUE is more than the setting of max_pvalue_limit=0.010000'])
    lr = mpmr.LinearReg(X, y, given_cols=['informative_1', 'useless_1'],
                        iter_num=10, max_pvalue_limit=0.01)
    in_vars, clf_final, dr = lr.fit()
    return in_vars, clf_final, dr


if __name__ == '__main__':
    X_logit, y_logit = get_X_y('logistic')
    in_vars_logit, clf_final_logit, dr_logit = test_logit(X_logit, y_logit)

    X_linear, y_linear = get_X_y('linear', 500)
    in_vars_linear, rg_final_linear, dr_linear = test_linear(X_linear, y_linear)

    X_measure, y_measure = get_X_y('logistic', 500)
    in_vars_measure, clf_final_measure, dr_measure = test_measureXy(X_measure, y_measure)

    X_given, y_given = get_X_y('logistic')
    in_vars_given, clf_final_given, dr_given = test_givenX_logistic(X_given, y_given)

    X_given, y_given = get_X_y('linear', 500)
    in_vars_given, rg_final_given, dr_given = test_givenX_linear(X_given, y_given)
```

# Document and API

## class MultiProcessMStepRegression.LinearReg(MultiProcessMStepRegression.Reg_Sup_Step_Wise_MP.Regression)

A step-wise linear regression with multi-processing. It is based on statsmodels.api.OLS or statsmodels.api.WLS as the underlying linear regression algorithm; which one is used depends on whether a training sample weight is set.
In the adding step, multi-processing is used to traverse several candidate features concurrently. The feature that meets the conditions set by the user and gives the largest lift on the measure index is added to the model. If no feature can improve the model's performance under those conditions, no feature is added in the current iteration. The removing step uses the same policy to decide which feature should be removed.

During the adding step, if a feature improves performance but misses some of the user's conditions, an additional removing step is run. If the feature to remove is the same as the feature to add, the feature is not added and the adding step ends. If they differ, the feature to add is added and the feature to remove is dropped from the current list of picked features. The additional removing step follows the same procedure as the regular removing step.

When modeling is complete, each feature that was not picked is added back to the picked-feature list in turn; by rebuilding the model with that feature, exact deletion reasons are obtained and returned.

Note: X and y are properties of a MultiProcessMStepRegression.LinearReg instance, so the instance can be very large. Saving that instance is not recommended; save the returned model and deletion reasons instead.

### __init__

LinearReg(self, X, y, given_cols=[], fit_weight=None, measure='r2', measure_weight=None, measure_X=None, measure_y=None, kw_measure_args=None, max_pvalue_limit=0.05, max_vif_limit=3, max_corr_limit=0.6, coef_sign=None, iter_num=20, kw_algorithm_class_args=None, n_core=None, logger_file_CH=None, logger_file_EN=None)

#### Parameters

- X : DataFrame. Features.
- y : Series. Target.
- given_cols : list, default []. New in 0.3.1. Features the user requires the model to include.
- fit_weight : Series. Same length as y; training weights. If None (default), every sample has the same training weight and statsmodels.api.OLS is used as the base linear algorithm; otherwise statsmodels.api.WLS is used. In linear regression, the goal of setting weights is to obtain a stable model under heteroscedasticity.
- measure : str. 'r2' (default) | 'explained_variance_score' | 'max_error'. Performance evaluation function. y_true, y_hat and measure_weight are passed to the measure function automatically; any other parameters are passed via kw_measure_args.
- measure_weight : Series. Same length as y; weights for the measure function. If None (default), every sample has the same measuring weight. See also fit_weight.
- measure_X : DataFrame | None (default). New in 0.2.1. When selecting features, use a data set different from the training X to evaluate the model. None means the training X is used.
- measure_y : Series | None (default). New in 0.2.1. When selecting features, use a data set different from the training y to evaluate the model. None means the training y is used. Note: try not to confuse the validation set with the test set. Using the test set during feature selection leaks data and may overestimate model performance; a clean test set should be reserved for the final evaluation.
- kw_measure_args : dict | None (default). Parameters other than y_true, y_hat and measure_weight to pass to the measure function. None means no extra parameters.
- max_pvalue_limit : float. Maximum p-value limit. 0.05 (default).
- max_vif_limit : float. Maximum VIF limit. 3 (default).
- max_corr_limit : float. Maximum correlation-coefficient limit. 0.6 (default).
- coef_sign : '+', '-', dict, or None (default). If the user has prior knowledge of the relation between X and y, such as positive or negative correlation, the signs of the regression coefficients can be constrained with this parameter. '+': all coefficients (not in given_cols) must be positive. '-': all coefficients (not in given_cols) must be negative. dict: format {'x_name1': '+', 'x_name2': '-'}; only the coefficients in the dict are constrained. None: no coefficient is constrained.
- iter_num : int. Number of iterations for picking features; default 20. With np.inf there is no limit, and with many features the run time can be long. If all features are already in the model, or adding/removing any feature yields no improvement, the actual number of iterations will be smaller than iter_num. Each iteration consists of two steps: 1. try each feature not yet in the model and add the one that most improves performance while satisfying the user's settings; 2. try removing features and remove the one whose removal most improves performance while satisfying the user's settings. One iteration is finished when steps 1 and 2 are both done. If neither step changes the feature set, iteration stops early, whether or not iter_num has been reached.
- kw_algorithm_class_args : dict. Parameters other than X, y and fit_weight to pass to the linear regression algorithm. Note: y and X are called endog and exog in statsmodels.
- n_core : int | float | None. Number of CPU processes. If int, the exact count; if float, a fraction of all CPUs, rounded up to int. If None (default), the CPU count minus 1.
- logger_file_CH : str. Log file recording the step-wise procedure in Chinese. If None (default), no Chinese log is written.
- logger_file_EN : str. Log file recording the step-wise procedure in English. If None (default), no English log is written.

### method LinearReg.fit(self)

Fit the model.

#### Returns

- in_vars : list. All variables picked by the model, in the order they were added.
- clf_final : statsmodels.regression.linear_model.RegressionResultsWrapper. The final step-wise model.
- dr : dict. Deletion reasons, in the format {'var_name': ([...], [...])}. Every value is a tuple of two lists: the first holds the deletion reasons in Chinese, the second in English. Each list records all deletion reasons for the variable matching the key. If a feature's lists contain no elements, that feature should be added to the model manually.
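As a minimal sketch of consuming the three return values documented above (on synthetic data, with illustrative column names; rg_final is a standard statsmodels results object, so summary() is available):

```
# Hedged sketch: inspecting the return values of LinearReg.fit().
import pandas as pd
from sklearn.datasets import make_regression
import MultiProcessMStepRegression as mpmr

if __name__ == '__main__':  # guard required for multi-processing
    X, y = make_regression(n_samples=300, n_features=5, n_informative=3,
                           noise=5, random_state=0)
    X = pd.DataFrame(X, columns=['x%d' % i for i in range(5)])
    y = pd.Series(y)

    lr = mpmr.LinearReg(X, y, max_pvalue_limit=0.01)
    in_vars, rg_final, dr = lr.fit()

    print(in_vars)             # picked features, in order of addition
    print(rg_final.summary())  # standard statsmodels results summary
    for var, (reasons_ch, reasons_en) in dr.items():
        print(var, '->', '; '.join(reasons_en))  # English deletion reasons
```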
## class MultiProcessMStepRegression.LogisticReg(MultiProcessMStepRegression.Reg_Sup_Step_Wise_MP.Regression)

A step-wise logistic regression with multi-processing. It is based on statsmodels.genmod.generalized_linear_model.GLM as the underlying logistic regression algorithm.

In the adding step, multi-processing is used to traverse several candidate features concurrently. The feature that meets the conditions set by the user and gives the largest lift on the measure index is added to the model. If no feature can improve the model's performance under those conditions, no feature is added in the current iteration. The removing step uses the same policy to decide which feature should be removed.

During the adding step, if a feature improves performance but misses some of the user's conditions, an additional removing step is run. If the feature to remove is the same as the feature to add, the feature is not added and the adding step ends. If they differ, the feature to add is added and the feature to remove is dropped from the current list of picked features. The additional removing step follows the same procedure as the regular removing step.

When modeling is complete, each feature that was not picked is added back to the picked-feature list in turn; by rebuilding the model with that feature, exact deletion reasons are obtained and returned.

Note: X and y are properties of a MultiProcessMStepRegression.LogisticReg instance, so the instance can be very large. Saving that instance is not recommended; save the returned model and deletion reasons instead.

### __init__

LogisticReg(self, X, y, given_cols=[], fit_weight=None, measure='ks', measure_weight=None, measure_frac=None, measure_X=None, measure_y=None, kw_measure_args=None, max_pvalue_limit=0.05, max_vif_limit=3, max_corr_limit=0.6, coef_sign=None, iter_num=20, kw_algorithm_class_args=None, n_core=None, logger_file_CH=None, logger_file_EN=None)

#### Parameters

- X : DataFrame. Features.
- y : Series. Target.
- given_cols : list, default []. New in 0.3.1. Features the user requires the model to include.
- fit_weight : Series. Same length as y; training weights. If None (default), every sample has the same training weight. Do not confuse fit_weight with measure_weight (below), which is for measuring the model; whether they should be the same depends on the user's sample design. For example, to reduce the effect of the majority class it is useful to raise the weights of minority-class samples when training, while restoring the original weights when measuring with an index such as KS or ROC_AUC. The reason is that the loss function of the regression is majority-class sensitive, so the user needs to adjust the sample weights; indices such as KS or ROC_AUC are insensitive to class imbalance by construction, so the sample weights need not be adjusted there unless the user considers the loss penalty to differ between samples. Note: even if the user sets measure='ks' or 'roc_auc' to measure performance and pick features, MultiProcessMStepRegression.LogisticReg is still majority-class sensitive, because the base algorithm is still a standard logistic regression.
- measure : str. 'ks' (default) | 'accuracy' | 'roc_auc' | 'balanced_accuracy' | 'average_precision'. Performance evaluation function. y_true, y_hat and measure_weight are passed to the measure function automatically; any other parameters are passed via kw_measure_args.
- measure_weight : Series. Same length as y; weights for the measure function. If None (default), every sample has the same measuring weight. See also fit_weight.
- measure_frac : float | None (default). New in 0.2.1. The fraction of samples used to evaluate the model: the probabilities output by the model are sorted descending and the top (or last) measure_frac of samples is taken. If measure_frac > 0, the top measure_frac is used; otherwise the last abs(measure_frac). This can be used to promote the ability to catch one class of samples; for example in credit risk, it raises the proportion of overdue users in the high-risk sublevels (LIFT 1 or LIFT 5). None means all samples are used.
- measure_X : DataFrame | None (default). New in 0.2.1. When selecting features, use a data set different from the training X to evaluate the model. None means the training X is used.
- measure_y : Series | None (default). New in 0.2.1. When selecting features, use a data set different from the training y to evaluate the model. None means the training y is used. Note: try not to confuse the validation set with the test set. Using the test set during feature selection leaks data and may overestimate model performance; a clean test set should be reserved for the final evaluation.
- kw_measure_args : dict | None (default). Parameters other than y_true, y_hat and measure_weight to pass to the measure function. None means no extra parameters.
- max_pvalue_limit : float. Maximum p-value limit. 0.05 (default).
- max_vif_limit : float. Maximum VIF limit. 3 (default).
- max_corr_limit : float. Maximum correlation-coefficient limit. 0.6 (default).
- coef_sign : '+', '-', dict, or None (default). If the user has prior knowledge of the relation between X and y, such as positive or negative correlation, the signs of the regression coefficients can be constrained with this parameter. '+': all coefficients (not in given_cols) must be positive. '-': all coefficients (not in given_cols) must be negative. dict: format {'x_name1': '+', 'x_name2': '-'}; only the coefficients in the dict are constrained. None: no coefficient is constrained.
- iter_num : int. Number of iterations for picking features; default 20. With np.inf there is no limit, and with many features the run time can be long. If all features are already in the model, or adding/removing any feature yields no improvement, the actual number of iterations will be smaller than iter_num. Each iteration consists of two steps: 1. try each feature not yet in the model and add the one that most improves performance while satisfying the user's settings; 2. try removing features and remove the one whose removal most improves performance while satisfying the user's settings. One iteration is finished when steps 1 and 2 are both done. If neither step changes the feature set, iteration stops early, whether or not iter_num has been reached.
- kw_algorithm_class_args : dict. Parameters other than X, y and fit_weight to pass to the logistic regression algorithm. Note: y and X are called endog and exog in statsmodels.genmod.generalized_linear_model.GLM.
- n_core : int | float | None. Number of CPU processes. If int, the exact count; if float, a fraction of all CPUs, rounded up to int. If None (default), the CPU count minus 1.
- logger_file_CH : str. Log file recording the step-wise procedure in Chinese. If None (default), no Chinese log is written.
- logger_file_EN : str. Log file recording the step-wise procedure in English. If None (default), no English log is written.

### method LogisticReg.fit(self)

Fit the model.

#### Returns

- in_vars : list. All variables picked by the model, in the order they were added.
- clf_final : statsmodels.genmod.generalized_linear_model.GLMResultsWrapper. The final step-wise model.
- dr : dict. Deletion reasons, in the format {'var_name': ([...], [...])}. Every value is a tuple of two lists: the first holds the deletion reasons in Chinese, the second in English. Each list records all deletion reasons for the variable matching the key. If a feature's lists contain no elements, that feature should be added to the model manually.
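A minimal sketch of the LogisticReg-specific options measure_frac and n_core described above (synthetic data; the log file name is illustrative):

```
# Hedged sketch: LogisticReg with top-slice evaluation and limited CPU use.
import pandas as pd
from sklearn.datasets import make_classification
import MultiProcessMStepRegression as mpmr

if __name__ == '__main__':  # guard required for multi-processing
    X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
    X = pd.DataFrame(X, columns=['x%d' % i for i in range(8)])
    y = pd.Series(y)

    lr = mpmr.LogisticReg(
        X, y,
        measure='ks',
        measure_frac=0.1,               # evaluate on the top 10% by predicted probability
        n_core=0.5,                     # float: use half of the CPUs (rounded up)
        logger_file_EN='mstep_en.log',  # record every iteration in English
    )
    in_vars, clf_final, dr = lr.fit()
```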


Requirements

| Name | Value |
| --- | --- |
| scikit-learn | >=0.20.4 |
| statsmodels | >=0.10.0 |


Required language

| Name | Value |
| --- | --- |
| Python | >=3.4 |


Installation

Install the MultiProcessMStepRegression-0.3.2 whl package:

    pip install MultiProcessMStepRegression-0.3.2.whl

Install the MultiProcessMStepRegression-0.3.2 tar.gz package:

    pip install MultiProcessMStepRegression-0.3.2.tar.gz
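
To check that the installed package is importable (a quick, hypothetical sanity check):

    python -c "import MultiProcessMStepRegression as mpmr; print(mpmr.LogisticReg, mpmr.LinearReg)"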