[SYSTEMDS-3184] Builtin for computing information gain using entropy and gini #1520

Merged: j143 merged 4 commits into apache:main from morf1us:impurity-measures-builtin (Feb 12, 2022)
Conversation

@morf1us (Contributor) commented Jan 21, 2022

This builtin computes a measure of impurity for the given dataset based on the passed method. The current version expects the target vector to contain only 0 or 1 values, and categorical features to be encoded as positive integers. Additionally, the builtin expects a row vector R denoting which features are continuous and which are categorical. For continuous features, the current implementation applies equal-width binning.
It returns a row vector with the gini gain or information gain for each feature. In both cases, the higher the gain, the better the split.
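For intuition, the gain computation described above can be sketched in plain Python. This is a hypothetical paraphrase, not the DML builtin itself; all function and variable names (`impurity`, `gain`, `equal_width_bins`) are illustrative:

```python
import math
from collections import Counter

def impurity(labels, method):
    # labels: 0/1 target values; method: "gini" or "entropy"
    n = len(labels)
    out = 0.0
    for count in Counter(labels).values():
        p = count / n
        if method == "gini":
            out += p * (1 - p)          # sum of p * (1 - p)
        else:
            out -= p * math.log2(p)     # Shannon entropy
    return out

def gain(feature, labels, method="gini"):
    # Impurity of the parent minus the weighted impurity of the
    # children obtained by grouping rows on the feature's value.
    n = len(labels)
    parent = impurity(labels, method)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    children = sum(len(g) / n * impurity(g, method) for g in groups.values())
    return parent - children

def equal_width_bins(values, num_bins):
    # Discretize a continuous feature into num_bins equal-width bins (1-based).
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # guard against a constant feature
    return [min(int((v - lo) / width) + 1, num_bins) for v in values]

# Categorical feature (positive integers) with a binary target:
feature = [1, 1, 2, 2, 2, 3]
labels  = [0, 0, 1, 1, 1, 0]
print(gain(feature, labels, "gini"))     # higher gain => better split
print(gain(feature, labels, "entropy"))
```

Here the split is pure (each feature value maps to a single class), so the gain equals the parent impurity for either method.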

@j143 (Member) commented Jan 22, 2022

Hi @morf1us - thanks a lot for the contribution. 😸

  1. How about adding usage instructions in builtins-reference.md? Also, feel free to add LaTeX formulas generously in the description.
  2. Testing seems fine.

Keep working on finalizing the bins-related changes.

@morf1us (Contributor, Author) commented Feb 8, 2022

Hi @j143, thanks for reviewing!

I added usage instructions and some more tests.

j143 self-requested a review on February 9, 2022.
@j143 (Member) commented Feb 10, 2022

Hi @morf1us - just curious: did you notice the impurity-measures logic in decisionTree.dml? Is the implementation in this PR the same as that one?

Docs are here: https://apache.github.io/systemds/site/algorithms-classification.html#decision-trees

```
calcGiniImpurity = function(Double num_true, Double num_false) return (Double impurity) {
  prop_true = num_true / (num_true + num_false)
  prop_false = num_false / (num_true + num_false)
  impurity = 1 - (prop_true ^ 2) - (prop_false ^ 2)
}

calcImpurity = function(Matrix[Double] X, Matrix[Double] Y, Matrix[Double] use_rows_vector,
    Double col, Double type, int bins)
  return (Double impurity, Matrix[Double] threshold)
{
  is_scalar_type = typeIsScalar(type)
  if (is_scalar_type) {
    possible_thresholds = calcPossibleThresholdsScalar(X, use_rows_vector, col, bins)
  } else {
    possible_thresholds = calcPossibleThresholdsCategory(type)
  }
  len_thresholds = ncol(possible_thresholds)
  impurity = 1
  threshold = matrix(0, rows=1, cols=1)
  for (index in 1:len_thresholds) {
    [false_rows, true_rows] = splitRowsVector(X, use_rows_vector, col, possible_thresholds[, index], type)
    num_true_positive = 0; num_false_positive = 0; num_true_negative = 0; num_false_negative = 0
    len = dataVectorLength(use_rows_vector)
    for (c_row in 1:len) {
      true_row_data = dataVectorGet(true_rows, c_row)
      false_row_data = dataVectorGet(false_rows, c_row)
      if (true_row_data != 0 & false_row_data == 0) { # positive branch
        if (as.scalar(Y[c_row, 1]) != 0) {
          num_true_positive = num_true_positive + 1
        } else {
          num_false_positive = num_false_positive + 1
        }
      } else if (true_row_data == 0 & false_row_data != 0) { # negative branch
        if (as.scalar(Y[c_row, 1]) != 0.0) {
          num_false_negative = num_false_negative + 1
        } else {
          num_true_negative = num_true_negative + 1
        }
      }
    }
    impurity_positive_branch = calcGiniImpurity(num_true_positive, num_false_positive)
    impurity_negative_branch = calcGiniImpurity(num_true_negative, num_false_negative)
    num_samples = num_true_positive + num_false_positive + num_true_negative + num_false_negative
    num_negative = num_true_negative + num_false_negative
    num_positive = num_true_positive + num_false_positive
    c_impurity = num_positive / num_samples * impurity_positive_branch + num_negative / num_samples * impurity_negative_branch
    if (c_impurity <= impurity) {
      impurity = c_impurity
      threshold = possible_thresholds[, index]
    }
  }
}

calcBestSplittingCriteria = function(Matrix[Double] X, Matrix[Double] Y, Matrix[Double] R,
    Matrix[Double] use_rows_vector, Matrix[Double] use_cols_vector, int bins)
  return (Double impurity, Double used_col, Matrix[Double] threshold, Double type)
{
  impurity = 1
  used_col = 1
  threshold = matrix(0, 1, 1)
  type = 1
  # user-defined function calls are not supported in iterable predicates
  len = dataVectorLength(use_cols_vector)
  for (c_col in 1:len) {
    use_feature = dataVectorGet(use_cols_vector, c_col)
    if (use_feature != 0) {
      c_type = getTypeOfCol(R, c_col)
      [c_impurity, c_threshold] = calcImpurity(X, Y, use_rows_vector, c_col, c_type, bins)
      if (c_impurity <= impurity) {
        impurity = c_impurity
        used_col = c_col
        threshold = c_threshold
        type = c_type
      }
    }
  }
}
```
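The core of the threshold loop in `calcImpurity` above (the weighted Gini of the two branches) can be paraphrased in Python. This is an illustrative sketch, not code from either implementation; it additionally guards against an empty branch, which the DML `calcGiniImpurity` would divide by zero on:

```python
def gini(num_true, num_false):
    # Mirrors calcGiniImpurity: 1 - p_true^2 - p_false^2,
    # with an extra guard for an empty branch (assumption, not in the DML).
    total = num_true + num_false
    if total == 0:
        return 0.0
    p_true = num_true / total
    p_false = num_false / total
    return 1.0 - p_true ** 2 - p_false ** 2

def weighted_split_impurity(tp, fp, tn, fn):
    # Weighted average of the positive- and negative-branch impurities,
    # as computed per candidate threshold in calcImpurity's loop.
    n = tp + fp + tn + fn
    pos, neg = tp + fp, tn + fn
    return pos / n * gini(tp, fp) + neg / n * gini(tn, fn)

# A perfect split (each branch pure) has impurity 0:
print(weighted_split_impurity(3, 0, 3, 0))  # 0.0
```

The loop keeps the threshold minimizing this quantity, which is equivalent to maximizing the gini gain returned by the new builtin.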

@j143 (Member) left a comment

Thank you.

The code looks good. But we need to check whether the implementation is efficient (with edge cases considered) compared to the one in scripts/builtin/decisionTree.dml.

I will have a look at the other code, shortly.

@morf1us (Contributor, Author) commented Feb 11, 2022

Hi @j143, thanks for taking the time. Yes, it is quite similar; I was aware of both the decisionTree and randomForest implementations before starting.

@j143 (Member) commented Feb 11, 2022

> Yes, it is quite similar; I was aware of both the decisionTree and randomForest implementations before starting.

Yes, eventually we need to use the impurity measures inside the scripts.

@j143 (Member) commented Feb 12, 2022

Thank you, @morf1us - LGTM. 👍
🎉 🚀

We can work on using these measure functions inside the decisionTree scripts later. Would you like to take that on?

@j143 j143 merged commit 7c3cc82 into apache:main Feb 12, 2022