Supervised machine learning offers a variety of regression and classification algorithms. In my previous articles, I focused on describing the most popular classification algorithms, such as logistic regression and support vector machines. I went over each algorithm’s advantages and common use cases, and walked through the steps of implementing the model. In this article I want to introduce another popular classification algorithm commonly used in supervised machine learning: the decision tree.
Basics of the decision tree classifier
The decision tree model can be used to predict both categorical and continuous variables. Like SVM, it can be used for regression or ranking as well; accordingly, there are two types of trees: classification decision trees and regression decision trees. Here, I’m focusing only on classification and will walk you through a binary classification problem.
The structure of a decision tree is simple: it hierarchically splits the data into subsets, which are then split again into smaller partitions, or branches, until they become “pure”, meaning all the records inside a branch belong to the same class. These terminal nodes are called “leaves”. You can read more about decision trees here. Basically, a classification tree has the following flow:
- The top node represents the root
- The internal node represents features
- The leaves represent the outcome
The tree’s branches contain the logic for a decision rule, meaning your data is continually split given the input features.
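To make that structure concrete, here is a minimal sketch (not taken from this article) that fits a small tree on scikit-learn’s built-in Iris data and prints the learned decision rules; each indented line is a branch split and each terminal line is a leaf:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Keep the tree shallow so the printed rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text renders the root, internal splits, and leaves as nested rules.
print(export_text(tree, feature_names=list(iris.feature_names)))
```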
The decision tree classifier is commonly used for image classification, decision analysis, strategy analysis, in medicine for diagnosis, in psychology for behavioral analysis, and more.
Advantages of decision trees
The biggest advantage of decision trees is that they make it very easy to interpret and visualize nonlinear data patterns. They are also very fast to train, which makes them handy for exploratory data analysis. Even on a small dataset, decision trees can deliver a high accuracy score. They also handle messy, non-normalized data (including outliers) well and help you identify and exclude non-essential features. Another advantage of classification decision trees is that you can improve their accuracy by tuning the logic that controls how the branches are split, as sketched below.
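For example, with scikit-learn you can adjust the split criterion and constrain how deep or bushy the tree grows. A minimal sketch, assuming scikit-learn; the parameter values here are illustrative, not tuned for the Titanic data:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative settings only; tune these for your own data.
tree = DecisionTreeClassifier(
    criterion="entropy",   # measure of split quality: "gini" (default) or "entropy"
    max_depth=4,           # cap the tree depth to reduce overfitting
    min_samples_leaf=5,    # require at least 5 samples in every leaf
    random_state=42,       # make the splits reproducible
)
```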
Predictive modeling with a decision tree
Let’s return to the Titanic challenge I’ve used previously, where I ran different classification models to predict the passenger survival rate for the Titanic dataset. In my analysis I introduce the problem statement, run exploratory data analysis, do some feature engineering, and walk through the steps of predictive modeling using Python’s Pandas, the Seaborn statistical visualization library, and the scikit-learn ML package for analysis and modeling. The prediction was done to showcase the steps for implementing common classification models.
After the data is cleaned and prepared for modeling (null values and outliers removed, the input converted to numerical and/or categorical values), I use a train/test split and hold out 20% of the sample as a test set.
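A minimal sketch of that split with scikit-learn, assuming the cleaned data sits in a Pandas DataFrame with the standard Kaggle Titanic `Survived` target column (the file name and random seed are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name for the cleaned Titanic data.
titanic = pd.read_csv("titanic_clean.csv")

X = titanic.drop("Survived", axis=1)   # input features (numerical/categorical)
y = titanic["Survived"]                # target: 1 = survived, 0 = did not survive

# Hold out 20% of the sample for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```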
Now let’s set up the decision tree algorithm for our prediction.
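A minimal sketch of this step, using the X_train/X_test/y_train/y_test variables from the split above; it relies on scikit-learn’s default settings, so the exact configuration (and the resulting score) may differ from the article’s run:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Fit a decision tree classifier on the training portion.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict survival for the held-out 20% and score the result.
predictions = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, predictions), 3))
```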
Not bad! We got a 73% accuracy score (compared to a 50% random guess). This is much higher than the score we received from the Support Vector Machines algorithm (58%), but still lower than the Logistic Regression prediction (78.7%).
Conclusion
To conclude, for the given Titanic challenge, Logistic Regression was the most accurate algorithm. That being said, you have to try different models and choose the best approach for your problem. Given the advantages of the decision tree classifier, it is one of the must-try approaches for this kind of data challenge.
Olga Berezovsky is a Senior Data Analyst. She has extensive experience in the Big Data industry—specifically in data acquisition, transformation, and analysis—and deep expertise in building quantitative and qualitative user profile analysis that reveals user insights and behavior.