An introduction to support vector machines
In my last post, I walked you through Logistic Regression as a binary classification model. In this blog, I’ll share an overview of Support Vector Machines (SVM). I’ll cover common use cases for applying this model, its advantages, and the steps of applying this classification algorithm for data forecasting. Let’s continue the overview of supervised machine learning models!
Support vector machines: The basics
SVM is one of the most popular models for classification. It can be used for regression or ranking as well, but its most common use case is classification. SVM is often used for image or text classification, face or speech recognition, and document categorization. You can read more about support vector machines here.
Imagine we have 100 pictures of dogs and cats, and our task is to train a machine learning model to tell whether each of the next 50 unseen pictures shows a cat or a dog. In this scenario, the first 100 pictures are our training set, where we “teach” the model what a cat or a dog looks like, and the next 50 are our testing set, for which the model will run a prediction (classification).
In implementation, SVM looks for a line (or, with more than two features, a hyperplane) that correctly separates the input data points (the dog and cat features from our training set). Among all the lines that separate the classes, it chooses the one with the largest distance to the closest points of each class. Those closest points are called support vectors, and that distance is called the margin:
Original Source: Rohith Gandhi, Medium
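To make the idea of support vectors concrete, here is a minimal sketch using Scikit-Learn. The six 2-D points are made-up stand-ins for "cat" and "dog" features, not real image data, and the large C value is an assumption chosen to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated clusters standing in for
# "cat" (label 0) and "dog" (label 1) feature vectors.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel looks for a separating line; a large C penalizes
# misclassified training points heavily (close to a hard margin).
clf = SVC(kernel="linear", C=1000.0)
clf.fit(X, y)

# The training points that define the margin.
support_vectors = clf.support_vectors_

# For a linear SVM the margin width is 2 / ||w||.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

# Classify two unseen points, one near each cluster.
predictions = clf.predict([[1.2, 1.5], [5.8, 5.2]])

print("Support vectors:\n", support_vectors)
print("Margin width:", margin_width)
print("Predictions:", predictions)
```

Only the support vectors influence where the line ends up: moving any of the other points (without crossing the margin) would leave the fitted model unchanged.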
Advantages of support vector machines
Choosing the right classification model depends on many factors: memory size, over-fitting tendency, parameterization, number of features, etc. SVM is popular for its high accuracy and relatively low computational cost. Because it maximizes the margin, the model is less prone to over-fitting and tends to generalize well. Finally, with the help of kernels, it works well for both linear and nonlinear data, structured or unstructured.
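As a rough illustration of what kernels buy us, the sketch below compares a linear kernel with an RBF kernel on data that no straight line can separate. The dataset comes from Scikit-Learn's make_circles generator; the sample size, noise level, and split ratio are arbitrary choices for this example.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms an inner circle, the other an outer ring.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A linear kernel can only draw a straight line through the plane.
linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)

# The RBF kernel implicitly maps the points to a space where the
# two classes become separable.
rbf_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print("Linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

On data shaped like this, the RBF kernel should clearly outperform the linear one. On data that is already linearly separable, the simpler linear kernel is usually enough.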
Predictive modeling with SVM
1. Problem definition and tools overview
I’ll use the Titanic challenge again (as I did in my previous article here) to walk through the steps of predictive modeling. As a quick refresher: there are two datasets that include real Titanic passenger information like name, age, gender, social class, fare, etc. One dataset, “Training,” includes a binary (yes or no) survival label for 891 passengers. The other one is “Testing,” for which we have to predict which passengers survived. We can use the training set to teach our SVM model the patterns in the data, and then use our testing set to evaluate its performance. An SVM doesn’t natively return a probability; it directly gives us the binary value of Survived or Didn’t Survive (1 or 0).
I’m using Python with Pandas for data handling, Seaborn for statistical graphs, and the Scikit-Learn ML package for analysis and modeling.
2. Exploratory data analysis
We can conduct an exploratory data analysis to get a feel for the dataset, its patterns, and its features.
To start with, below is a chart which illustrates the survival rate for a training dataset:
We can see that more passengers died than survived. Our goal is to predict survival for the passengers in our testing set using the data we have about them.
The graph below shows the survival rate broken down by social class:
As we see, the first class has the highest survival rate, and the third class - the lowest.
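The numbers behind charts like these can be computed with a Pandas group-by. The sketch below uses a handful of made-up rows rather than the real training file; the column names Survived and Pclass are assumed to follow the Kaggle Titanic dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only, not the real Titanic data.
# Column names are assumed to match the Kaggle training file.
train = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Survived is 0 or 1, so its mean is the share of survivors.
overall_rate = train["Survived"].mean()

# Survival rate within each social class.
rate_by_class = train.groupby("Pclass")["Survived"].mean()

print("Overall survival rate:", overall_rate)
print(rate_by_class)
```

With Seaborn installed, `sns.barplot(x="Pclass", y="Survived", data=train)` would draw the same per-class rates as a bar chart.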
We have to run more analysis for other input features like age, social class, family size. The full exploratory analysis is available here.
3. Feature engineering
There is one more step before our modeling: feature engineering. We have to clean and prepare our data for prediction. To understand which features have to be transformed, we can build a correlation plot to see the connections between the given features:
We don't have high correlations which might affect our prediction model. Besides age, we also might need to look into parents/children (parch) and fare values.
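A correlation plot starts from a correlation matrix, which Pandas can compute directly. This is a minimal sketch on a few made-up rows; the column names are assumed to follow the Kaggle Titanic dataset, and the values are for illustration only.

```python
import pandas as pd

# Hypothetical rows for illustration only, not the real Titanic data.
train = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0, 1, 0],
    "Pclass":   [3, 1, 3, 1, 3, 3, 2, 3],
    "Age":      [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 27.0, 2.0],
    "Parch":    [0, 0, 0, 0, 0, 0, 2, 1],
    "Fare":     [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 11.13, 21.08],
})

# Pairwise Pearson correlation between all numeric columns.
corr = train.corr()

# How strongly each feature moves with the target, weakest to strongest.
survived_corr = corr["Survived"].drop("Survived").sort_values()

print(corr.round(2))
print(survived_corr.round(2))
```

With Seaborn installed, `sns.heatmap(corr, annot=True)` turns the matrix into the kind of plot described above.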
Some of the things we can do with the missing age data values:
Or missing fare values:
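One common option for both columns is to fill the gaps with the column median. The sketch below shows that approach on made-up rows; it is an assumption for illustration and not necessarily the exact treatment used in the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps, for illustration only.
df = pd.DataFrame({
    "Age":  [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 53.10, 8.05],
})

# The median is less sensitive to outliers (e.g. very high fares) than the mean.
age_median = df["Age"].median()
fare_median = df["Fare"].median()

# Replace missing values with the column median.
df["Age"] = df["Age"].fillna(age_median)
df["Fare"] = df["Fare"].fillna(fare_median)

print(df)
```

Other reasonable choices include filling age by group (for example, the median age per social class) or dropping rows with missing values when there are only a few of them.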
4. Prediction and model evaluation
As last time, we use Scikit-Learn's train/test split feature and hold out 20% of our sample for prediction:
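A minimal sketch of an 80/20 split with Scikit-Learn's train_test_split is below. The feature matrix is random placeholder data with 891 rows to mirror the size of the training set; the random_state and the stratify option are choices made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 891 rows (same size as the Titanic training set),
# 4 random features, and a random binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(891, 4))
y = rng.integers(0, 2, size=891)

# Hold out 20% of the rows for evaluation. stratify=y keeps the
# class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training rows:", len(X_train))
print("Testing rows:", len(X_test))
```

Fixing random_state makes the split reproducible, so accuracy scores from different models can be compared on the same held-out rows.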
And, running our prediction using SVM:
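The fit-and-predict step might look like the sketch below. It is not the original code: the data is synthetic (generated with make_classification as a stand-in for the prepared Titanic features), and the feature scaling step is my own addition, since SVMs are sensitive to the scale of their inputs.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the prepared Titanic features and survival labels.
X, y = make_classification(
    n_samples=891, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Scale the features, then fit an SVM classifier with an RBF kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)

# Predict 0 or 1 for the held-out rows and compare with the true labels.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)
```

The accuracy here says nothing about the Titanic data itself. The point is the shape of the workflow: fit on the training part, predict on the held-out part, and score the predictions.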
We got a 58% accuracy score, which is much lower than the 76% we got last time using Logistic Regression. That might be due to some noise in our data. We can increase the accuracy score by doing more feature engineering to extract the most value from the input features.
There is no one perfect algorithm that works best for every problem. Given the size and structure of your dataset, you should try several different algorithms, such as Logistic Regression, Support Vector Machines, KNN, Naive Bayes Classifier, Decision Tree Classifier, and Random Forest. Tune them where you can, and then select "a winner".
Olga Berezovsky is a Senior Data Analyst. She has extensive experience in the Big Data industry—specifically in data acquisition, transformation, and analysis—and deep expertise in building quantitative and qualitative user profile analysis that reveals user insights and behavior.