Project 1: Custom Machine Learning Framework
During machine learning model development, there are several steps we usually follow to find the best-fitting model while ensuring that it generalizes to new, unseen datasets. The common steps are data sampling, data partitioning, feature engineering, feature selection, correlation analysis, feature discretization, and model tuning. Each of these steps can demand a lot of work in every machine learning project.
In this project, I built a set of classes that automate several of these steps. They extend the sklearn base classes BaseEstimator and TransformerMixin, exposing the fit/transform interface so they can be used inside sklearn pipelines.
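As a reference for this pattern, here is a minimal, hypothetical sketch (the class name and parameter are illustrative, not taken from the framework) of how a transformer built on these base classes works:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ExampleTransformer(BaseEstimator, TransformerMixin):
    """Skeleton of a pipeline-compatible transformer (illustrative only)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # constructor args become tunable hyperparameters

    def fit(self, X, y=None):
        # learn any state needed from the training data
        self.columns_ = list(X.columns)
        return self  # returning self is what enables fit_transform and pipelines

    def transform(self, X):
        # apply the learned state to new data
        return X[self.columns_].copy()
```

Inheriting from BaseEstimator provides get_params/set_params, and TransformerMixin derives fit_transform from fit and transform, which is what makes these classes composable in a sklearn Pipeline.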
The following classes were implemented in this first version of the framework (a minimal, hypothetical sketch of each one follows the list):
- ImputeMissingValues: fills null values using predefined criteria such as mean, median, min, and max (for numerical features) and mode (for categorical features). This class also removes features whose missing fraction exceeds a threshold set in the class constructor.
- DummyTransformer: dummy-encodes each categorical feature; it learns (fit) a mapping for each category and replicates (transform) that mapping on new datasets. This class also discards one dummy per feature (the middle, first, or last, according to a constructor parameter) to avoid multicollinearity during model fitting.
- FeatureImportanceVariableSelection: a feature selection method based on a feature importance criterion. This class fits a Gradient Boosting (LightGBM) model to compute feature importances, then selects features from the highest to the lowest importance until the cumulative importance reaches a threshold.
- RegularizationFeatureSelection: a feature selection method based on Lasso regression coefficients. This class has a parameter that defines the regularization strength. All features with non-zero coefficients are selected. Categorical features are dummy-encoded first, and if at least one dummy of a feature has a non-zero coefficient, the feature is selected.
- VariableClusteringSelection: a feature selection method based on correlation analysis. It works in a few steps: (a) cluster the variables using 1 - abs(correlation) as the distance metric; (b) compute, via linear regression, the R2 of each feature against all other features in its own cluster; (c) compute, also via linear regression, the R2 of each feature against all features in the nearest cluster; (d) compute the per-feature metric 1 - R2 ratio = (1 - R2 own cluster) / (1 - R2 nearest cluster). The feature with the lowest 1 - R2 ratio in each cluster is kept.
- OptimalIntervalBinning: used for numerical feature discretization; it computes optimal ranges with the following steps: (a) percentiles are computed to create ordered groups of the numerical feature; (b) hierarchically, hypothesis tests for the difference of proportions are applied to neighboring ranges, and the pair of ranges with the highest p-value is merged; this is repeated until all differences are significant or a threshold is reached. This class also has two methods that allow interactively splitting or merging a sequence of ranges.
- OptimalNominalBinning: used to group categorical feature values with similar behavior. This is done with the following steps: (a) the categories are sorted by response rate; (b) hierarchically, hypothesis tests for the difference of proportions are applied to neighboring categories, and the pair with the highest p-value is merged; this is repeated until all differences are significant or a threshold is reached. This class also has two methods that allow interactively customizing the group assignments.
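A minimal sketch of the ImputeMissingValues idea, assuming a pandas DataFrame input (the class name, parameters, and code are hypothetical, not the framework's actual implementation):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ImputeMissingValuesSketch(BaseEstimator, TransformerMixin):
    def __init__(self, criterion="median", max_missing_fraction=0.95):
        self.criterion = criterion                      # "mean", "median", "min" or "max"
        self.max_missing_fraction = max_missing_fraction

    def fit(self, X, y=None):
        # drop features whose missing fraction exceeds the threshold
        frac = X.isna().mean()
        self.keep_cols_ = frac[frac <= self.max_missing_fraction].index.tolist()
        kept = X[self.keep_cols_]
        num = kept.select_dtypes("number")
        cat = kept.drop(columns=num.columns)
        self.fill_values_ = {
            **num.agg(self.criterion).to_dict(),        # numerical: chosen statistic
            **{c: cat[c].mode().iloc[0] for c in cat},  # categorical: mode
        }
        return self

    def transform(self, X):
        return X[self.keep_cols_].fillna(self.fill_values_)
```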
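A hedged sketch of the DummyTransformer idea (hypothetical and simplified; only the "first"/"last" drop options are shown here):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DummyTransformerSketch(BaseEstimator, TransformerMixin):
    def __init__(self, drop="first"):
        self.drop = drop  # which dummy to discard per feature

    def fit(self, X, y=None):
        # learn the category levels of each feature from the training data
        self.categories_ = {c: sorted(X[c].dropna().unique()) for c in X.columns}
        return self

    def transform(self, X):
        out = {}
        for col, cats in self.categories_.items():
            kept = cats[1:] if self.drop == "first" else cats[:-1]
            for cat in kept:  # unseen categories simply get all-zero dummies
                out[f"{col}_{cat}"] = (X[col] == cat).astype(int)
        return pd.DataFrame(out, index=X.index)
```

Learning the mapping in fit and replaying it in transform guarantees that new datasets produce exactly the same dummy columns as the training data.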
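A sketch of the cumulative-importance selection idea behind FeatureImportanceVariableSelection (hypothetical code, shown here as a function for brevity):

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

def select_by_cumulative_importance(X, y, threshold=0.95):
    model = LGBMClassifier().fit(X, y)
    imp = pd.Series(model.feature_importances_, index=X.columns)
    imp = imp.sort_values(ascending=False) / imp.sum()        # normalize to sum to 1
    cum = imp.cumsum()
    n_keep = int(np.searchsorted(cum.values, threshold)) + 1  # first prefix reaching the threshold
    return list(imp.index[:n_keep])
```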
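A sketch of the Lasso-based selection in RegularizationFeatureSelection (hypothetical, simplified; alpha plays the role of the regularization-strength parameter):

```python
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def select_by_lasso(X, y, alpha=0.01):
    dummies = pd.get_dummies(X)                       # dummy columns are named "feature_level"
    scaled = StandardScaler().fit_transform(dummies)  # put coefficients on a comparable scale
    coefs = pd.Series(Lasso(alpha=alpha).fit(scaled, y).coef_, index=dummies.columns)
    selected = []
    for col in X.columns:
        # a feature is kept if any of its (dummy) columns has a non-zero coefficient
        cols = [c for c in coefs.index if c == col or c.startswith(f"{col}_")]
        if (coefs[cols] != 0).any():
            selected.append(col)
    return selected
```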
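A sketch of the VariableClusteringSelection steps, assuming all-numerical features and hierarchical clustering as the clustering step (hypothetical and simplified):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.linear_model import LinearRegression

def r2(X_other, target):
    # R2 of a linear regression of `target` on the other features
    return LinearRegression().fit(X_other, target).score(X_other, target)

def select_by_variable_clustering(X, n_clusters=5):
    dist = 1 - X.corr().abs()  # (a) 1 - abs(correlation) as the distance metric
    labels = fcluster(linkage(squareform(dist, checks=False), "average"),
                      n_clusters, criterion="maxclust")
    clusters = {k: [c for c, l in zip(X.columns, labels) if l == k]
                for k in set(labels)}

    def nearest(k):  # nearest cluster by average inter-cluster distance
        others = [j for j in clusters if j != k]
        return min(others, key=lambda j: dist.loc[clusters[k], clusters[j]].values.mean())

    selected = []
    for k, cols in clusters.items():
        if len(cols) == 1:
            selected.append(cols[0])
            continue
        near = clusters[nearest(k)]
        ratios = {}
        for c in cols:
            own = [o for o in cols if o != c]
            r2_own = r2(X[own], X[c])          # (b) R2 within the own cluster
            r2_near = r2(X[near], X[c])        # (c) R2 with the nearest cluster
            ratios[c] = (1 - r2_own) / (1 - r2_near + 1e-12)  # (d) 1 - R2 ratio
        selected.append(min(ratios, key=ratios.get))  # keep the lowest ratio per cluster
    return selected
```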
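A sketch of the OptimalIntervalBinning merging loop, assuming a binary 0/1 target and using statsmodels' two-proportion z-test as the hypothesis test (hypothetical, without the interactive split/merge methods):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

def optimal_interval_binning(x, y, n_initial_bins=20, alpha=0.05):
    # (a) percentile-based initial edges create ordered groups
    edges = np.unique(np.percentile(x, np.linspace(0, 100, n_initial_bins + 1)))
    while len(edges) > 2:
        bins = pd.cut(x, edges, include_lowest=True)
        grouped = pd.DataFrame({"events": y, "bin": bins}).groupby("bin", observed=True)
        events = grouped["events"].sum().values
        counts = grouped["events"].count().values
        # (b) p-value of the difference-of-proportions test for each neighboring pair
        pvals = [proportions_ztest(events[i:i + 2], counts[i:i + 2])[1]
                 for i in range(len(counts) - 1)]
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:                 # all neighboring differences significant
            break
        edges = np.delete(edges, worst + 1)      # merge the least-distinguishable pair
    return edges
```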
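And a sketch of the OptimalNominalBinning grouping, under the same binary-target and z-test assumptions (hypothetical, without the interactive customization methods):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.proportion import proportions_ztest

def optimal_nominal_binning(x, y, alpha=0.05):
    stats = pd.DataFrame({"cat": x, "y": y}).groupby("cat")["y"].agg(events="sum", n="count")
    stats["rate"] = stats["events"] / stats["n"]
    # (a) start with one group per category, ordered by response rate
    groups = [[c] for c in stats.sort_values("rate").index]
    while len(groups) > 1:
        ev = [stats.loc[g, "events"].sum() for g in groups]
        n = [stats.loc[g, "n"].sum() for g in groups]
        # (b) test neighboring groups and merge the pair with the highest p-value
        pvals = [proportions_ztest([ev[i], ev[i + 1]], [n[i], n[i + 1]])[1]
                 for i in range(len(groups) - 1)]
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:                 # all differences significant: stop
            break
        groups[worst] = groups[worst] + groups.pop(worst + 1)
    return {cat: i for i, g in enumerate(groups) for cat in g}
```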
The Colab notebook of this framework is available here, some usage examples can be found here, and the data used in the examples is available here.