Decision tree based classification uses a decision tree (as a predictive model) to go from observations about an item (represented in the branches) to conclusions about the item’s target value (represented in the leaves).
Classification problems well suited to decision trees include:
- Classifying a user as male/female based on their first name
- Classifying the weather as rainy or cloudy based on the time of day
- Classifying a received message as spam/ham
I will consider the first example: classifying a user as male/female based on their first name.
If we were to do this task manually, we would build a decision tree like the following, where patterns such as vowel endings are used to classify a name as male or female.
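Such a hand-built rule might be sketched as follows. The vowel-ending heuristic here is just an illustration, not an accurate classifier:

```python
# Hand-written rule: names ending in a vowel are guessed female,
# everything else male. A rough heuristic, not a learned model.
def classify_name(name: str) -> str:
    return "female" if name.lower()[-1] in "aeiou" else "male"

print(classify_name("Maria"))  # female
print(classify_name("John"))   # male
```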
Decision tree models work similarly. The model finds patterns in the training dataset and builds its own splitting rules, which can then be used for classification.
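As a sketch of this, we can let a tree learn its own splitting rules from simple character features of names. This assumes scikit-learn is available; the names, labels, and features below are made-up illustrative data:

```python
# Sketch: a decision tree learns splitting rules from simple
# character features. Data is a tiny made-up example.
from sklearn.tree import DecisionTreeClassifier

names = ["Maria", "Anna", "Julia", "John", "Peter", "Mark"]
labels = ["female", "female", "female", "male", "male", "male"]

def features(name):
    last = name.lower()[-1]
    # [ends in vowel?, name length]
    return [int(last in "aeiou"), len(name)]

clf = DecisionTreeClassifier(random_state=0)
clf.fit([features(n) for n in names], labels)
print(clf.predict([features("Laura")]))  # predicted label for "Laura"
```

On this toy data the tree discovers the vowel-ending split on its own, because that single feature separates the two classes perfectly.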
The Gini index is the cost function used when constructing a decision tree: it evaluates candidate splitting rules on the training dataset.
A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of 0, whereas the worst-case split yields a 50/50 mix of classes in each group (a score of 0.5 for two classes). We evaluate a candidate split for every row, pick the best one to split the data in our binary tree, and repeat this process recursively.
Example: using the Gini index to split a dataset.
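The split-selection step described above can be sketched like this: compute the weighted Gini score of the two groups produced by each candidate threshold, and keep the lowest. The tiny 1-D dataset is made up for illustration:

```python
def gini(groups, classes):
    """Weighted Gini score of a candidate split's groups."""
    n = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if not g:
            continue
        # Sum of squared class proportions within the group
        p_sum = sum(
            (sum(1 for row in g if row[-1] == c) / len(g)) ** 2
            for c in classes
        )
        score += (1 - p_sum) * len(g) / n
    return score

# Each row: [feature value, class label]
data = [[2.7, 0], [1.3, 0], [3.6, 0], [7.5, 1], [9.0, 1], [7.4, 1]]
classes = [0, 1]

# Try each row's value as a threshold; keep the lowest Gini score.
best = min(
    ((row[0], gini(([r for r in data if r[0] < row[0]],
                    [r for r in data if r[0] >= row[0]]), classes))
     for row in data),
    key=lambda t: t[1],
)
print(best)  # (7.4, 0.0) — this threshold separates the classes perfectly
```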
One problem that can occur with decision trees is overfitting: the tree can "memorize" the training set instead of learning rules that generalize.
Ensemble learning techniques can reduce this effect; these include random forests and gradient-boosted trees.
What a random forest does is construct multiple decision trees, each trained on a random sample of the data, and take a majority vote when classifying.
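The bootstrap-and-vote idea can be sketched with the standard library alone, using a one-level "decision stump" as a stand-in for a full tree. The toy dataset, the number of trees, and the stump learner are all illustrative assumptions:

```python
import random
from collections import Counter

X = [2.7, 1.3, 3.6, 7.5, 9.0, 7.4]
y = [0, 0, 0, 1, 1, 1]

def fit_stump(xs, ys):
    """One-split 'tree': pick the threshold with the fewest
    misclassifications; predict the majority class on each side."""
    def maj(g):
        return Counter(g).most_common(1)[0][0] if g else 0
    best = None
    for t in xs:
        left = [c for v, c in zip(xs, ys) if v < t]
        right = [c for v, c in zip(xs, ys) if v >= t]
        lm, rm = maj(left), maj(right)
        errs = sum(c != lm for c in left) + sum(c != rm for c in right)
        if best is None or errs < best[0]:
            best = (errs, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v < t else rm

rng = random.Random(0)
stumps = []
for _ in range(5):
    # Bootstrap sample: draw rows with replacement
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))

votes = [s(8.0) for s in stumps]
print(Counter(votes).most_common(1)[0][0])  # majority vote for input 8.0
```

In practice each tree in a random forest also considers only a random subset of features at each split, which further decorrelates the trees.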