Decision Trees with R Code

Decision trees can be used for both classification and regression problems.

Terminology Related to Decision Trees

•Root Node

•Splitting

•Decision Node

•Leaf/Terminal Node

•Pruning

•Branch/Sub-Tree

•Parent/Child Node
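To connect these terms to an actual fitted tree, here is a minimal sketch that inspects the node table of an rpart model. It assumes the `kyphosis` data set that ships with the rpart package; the names `fit_demo`, `nodes`, `root_id`, `leaf_ids` and `decision_ids` are only illustrative. It relies on rpart's node numbering, where the root is node 1 and the children of node k are nodes 2k and 2k + 1.

```r
library(rpart)

# Small classification tree on the kyphosis data bundled with rpart
fit_demo <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# fit_demo$frame has one row per node; row names are the node numbers
nodes <- fit_demo$frame
node_ids <- as.integer(rownames(nodes))

# Leaf (terminal) nodes are marked "<leaf>"; all other rows are decision nodes
is_leaf <- nodes$var == "<leaf>"
root_id <- node_ids[1]            # the first row is the root node
leaf_ids <- node_ids[is_leaf]
decision_ids <- node_ids[!is_leaf]

# Parent/child relation: the children of node k are 2k and 2k + 1
children_of_root <- c(2 * root_id, 2 * root_id + 1)

# The printed tree marks terminal nodes with "*"
print(fit_demo)
```

Every split in rpart is binary, so a tree built this way has exactly one more leaf than it has decision nodes.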

Algorithms used in decision trees:

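Gini Index: When Gini is computed as the sum of squared class proportions in a node, the higher the value, the higher the homogeneity. (The equivalent Gini impurity, 1 minus that sum, is lower for purer nodes.) CART (Classification and Regression Trees) uses the Gini method to create binary splits.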

Chi-Square: The higher the Chi-Square value, the higher the statistical significance of the differences between the sub-nodes and the parent node.

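Information Gain: Measures the reduction in entropy achieved by a split; the higher the gain, the more homogeneous the resulting sub-nodes.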

Reduction in Variance: A splitting criterion used for continuous target variables (regression problems). The split that gives the lowest weighted variance across the sub-nodes is preferred.
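The four criteria above can be sketched in a few lines of base R. This is an illustrative sketch on a made-up eight-row toy example, not the internals of any particular package. All function names (`gini_score`, `gini_impurity`, `entropy`, `information_gain`, `chi_square`, `variance_reduction`) are hypothetical helpers, and the Chi-Square version is the standard Pearson statistic of the split-by-class table.

```r
# Class proportions of a categorical vector
class_props <- function(y) as.numeric(table(y)) / length(y)

# Gini as the sum of squared proportions (higher = purer node)
gini_score <- function(y) sum(class_props(y)^2)

# Gini impurity, the complement (lower = purer node)
gini_impurity <- function(y) 1 - gini_score(y)

# Shannon entropy in bits; empty classes are dropped to avoid log2(0)
entropy <- function(y) {
  p <- class_props(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Size-weighted average of a node measure over the sub-nodes of a split
weighted_children <- function(y, side, measure) {
  groups <- split(y, side)
  sum(sapply(groups, function(g) length(g) / length(y) * measure(g)))
}

# Information gain: parent entropy minus weighted sub-node entropy
information_gain <- function(y, side) entropy(y) - weighted_children(y, side, entropy)

# Pearson Chi-Square statistic of the split-by-class contingency table
chi_square <- function(y, side) {
  observed <- table(side, y)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

# Reduction in variance for a numeric target (variance divides by n)
pop_var <- function(y) mean((y - mean(y))^2)
variance_reduction <- function(y, side) pop_var(y) - weighted_children(y, side, pop_var)

# Toy data (made up): a candidate split sends 4 rows left and 4 rows right
side  <- c("L", "L", "L", "L", "R", "R", "R", "R")
y_cls <- c("yes", "yes", "yes", "no", "no", "no", "no", "yes")
y_num <- c(1, 2, 3, 4, 11, 12, 13, 14)

parent_gini <- gini_impurity(y_cls)                          # 0.5
split_gini  <- weighted_children(y_cls, side, gini_impurity) # 0.375
gain        <- information_gain(y_cls, side)                 # about 0.189
chi2        <- chi_square(y_cls, side)                       # 2
var_red     <- variance_reduction(y_num, side)               # 25
```

A tree-growing algorithm would evaluate a criterion like these for every candidate split and keep the split with the best value.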

If we can use logistic regression for classification problems and linear regression for regression problems, why is there a need to use trees?

The algorithm to use depends on the type of problem we are solving.

•If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will generally outperform a tree-based model.

•If the relationship between the dependent and independent variables is highly non-linear and complex, a tree model will generally outperform a classical regression method.

•If you need a model that is easy to explain to people, a decision tree is usually a better choice than a linear model.
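The first two points can be illustrated with a small simulation. This is only a sketch on made-up data: the data frame `sim`, the helper `rmse` and the function `compare` are hypothetical names, and the exact numbers depend on the random seed. One target is linear in `x`, the other is a step function of `x`, and each is fitted with both `lm()` and a regression tree from rpart.

```r
library(rpart)

set.seed(1)
n <- 600
x <- runif(n, -3, 3)

# Two simulated targets: one linear in x, one a step function of x
sim <- data.frame(
  x = x,
  y_linear = 2 * x + rnorm(n, sd = 0.5),
  y_step = ifelse(abs(x) > 1.5, 4, 0) + rnorm(n, sd = 0.5)
)

# Hold out 200 rows for evaluation
train_idx <- sample(n, 400)
sim_train <- sim[train_idx, ]
sim_test <- sim[-train_idx, ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit a linear model and a regression tree to one target, return test RMSE
compare <- function(target) {
  f <- reformulate("x", response = target)
  lm_fit <- lm(f, data = sim_train)
  tree_fit <- rpart(f, data = sim_train, method = "anova")
  c(linear = rmse(sim_test[[target]], predict(lm_fit, sim_test)),
    tree = rmse(sim_test[[target]], predict(tree_fit, sim_test)))
}

results <- rbind(y_linear = compare("y_linear"), y_step = compare("y_step"))
print(round(results, 2))
```

In this setup the linear model should have the lower test error on `y_linear` and the tree the lower error on `y_step`, which matches the rule of thumb above.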

 

CART decision tree algorithm
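# Install the packages once if they are not already available
# install.packages(c("rattle", "rpart.plot", "RColorBrewer"))

library(rpart)        # CART implementation (ships with R)
library(rattle)       # provides fancyRpartPlot()
library(rpart.plot)   # plotting backend used by fancyRpartPlot()
library(RColorBrewer) # colour palettes used by fancyRpartPlot()

# Assumption: `train` and `test` are data frames that are already loaded
# and contain the columns used below

# Fit a classification tree; method = "class" requests classification
fit <- rpart(Survived ~ Pclass + Sex + Age + SibSp + Embarked, data = train, method = "class")

# fit is our decision tree model
# let us visualize the tree with base graphics
plot(fit)
text(fit)

# A more readable plot of the same tree
fancyRpartPlot(fit)

# Let us make the prediction using the decision tree model
Prediction <- predict(fit, test, type = "class")

# Write the predictions to a CSV file, one row per passenger
submit <- data.frame(PassengerId = test$PassengerId, Survived = Prediction)
write.csv(submit, file = "myfirstdtree.csv", row.names = FALSE)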
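Pruning, listed in the terminology above, is not shown in the code so far. Here is a minimal sketch of one common approach with rpart: grow a deliberately large tree, then cut it back at the complexity parameter with the lowest cross-validated error. It uses the `kyphosis` data bundled with rpart so that it runs on its own; the names `full_tree`, `best_cp`, `pruned_tree` and `n_leaves` are illustrative, and the settings `cp = 0` and `minsplit = 5` are arbitrary choices for the demonstration.

```r
library(rpart)

set.seed(123)  # the cross-validation inside rpart is random

# Grow a deliberately large tree by disabling the complexity penalty
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(cp = 0, minsplit = 5))

# cptable holds one row per candidate subtree, with cross-validated error
cp_table <- full_tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree with the lowest cross-validated error
pruned_tree <- prune(full_tree, cp = best_cp)

# Count terminal nodes before and after pruning
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
cat("Leaves before pruning:", n_leaves(full_tree), "\n")
cat("Leaves after pruning: ", n_leaves(pruned_tree), "\n")
```

The same pattern could be applied to `fit` from the Titanic example by inspecting `fit$cptable` and calling `prune()` with the chosen `cp`.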
