Classification With High Interpretability
Modeling categorical predictions that are easy to explain, using classification trees in R, Python, and Julia.
Classification and Regression Trees (CART) are decision tree algorithms used for both classification and regression tasks. They are applied in many domains, for tasks like customer segmentation, fraud detection, disease diagnosis, and sentiment analysis. Their intuitive representation and their ability to capture complex interactions between features make them valuable for understanding patterns in data and for making predictions that can be explained, which supports informed, data-driven decision-making.
Like Regression Trees, the significance of Classification Trees lies in their ability to perform non-linear and interpretable classification. They divide the data into distinct classes by recursively splitting based on the most informative features.
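To make "recursively splitting based on the most informative features" concrete, here is a minimal sketch of a single split step. It assumes Gini impurity as the splitting criterion (a common choice for CART, though this article does not say which criterion its models use) and a single numeric feature. The function names `gini_impurity` and `best_split` and the toy data are hypothetical and invented for illustration only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def best_split(values, labels):
    """Find the threshold t for the rule `value <= t` that minimizes the
    size-weighted Gini impurity of the two resulting groups."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (
            len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
        ) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data invented for illustration (not taken from the heart disease dataset).
toy_feature = [1, 2, 3, 10, 11, 12]
toy_labels = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(toy_feature, toy_labels)
print(threshold, impurity)
```

A full tree repeats this search over every feature, keeps the best split, and then applies the same procedure to each of the two resulting groups until a stopping rule is met.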
Let's examine this technique using the Cleveland Clinic heart disease dataset.
Getting Started
If you are interested in reproducing this work, here are the versions of R, Python, and Julia used (as well as the respective packages for each). Additionally, Leland Wilkinson's approach to data visualization (Grammar of Graphics) has been adopted for this work. Finally, my coding style here is verbose, in order to trace back where functions/methods and variables originate, and to make this a learning experience for everyone, including me.
cat(R.version$version.string, R.version$nickname)
R version 4.2.3 (2023-03-15) Shortstop Beagle
require(devtools)
devtools::install_version("dplyr", version="1.1.2", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
import pandas
import plotnine
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server
using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly
Importing and Examining Dataset
heart_disease_r <- read.csv("../../dataset/cleveland-clinic-heart-disease.csv", stringsAsFactors=TRUE)
utils::str(object=heart_disease_r)
'data.frame': 303 obs. of 14 variables:
$ age : int 63 67 67 37 41 56 62 57 63 53 ...
$ sex : int 1 1 1 1 0 1 0 0 1 1 ...
$ cp : int 1 4 4 3 2 2 4 4 4 4 ...
$ trestbps: int 145 160 120 130 130 120 140 120 130 140 ...
$ chol : int 233 286 229 250 204 236 268 354 254 203 ...
$ fbs : int 1 0 0 0 0 0 0 0 0 1 ...
$ restecg : int 2 2 2 0 2 0 2 0 2 2 ...
$ thalach : int 150 108 129 187 172 178 160 163 147 155 ...
$ exang : int 0 1 1 0 0 0 0 1 0 1 ...
$ oldpeak : num 2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
$ slope : int 3 2 2 3 1 1 3 1 2 3 ...
$ ca : int 0 3 2 0 0 0 2 0 1 0 ...
$ thal : int 6 3 7 3 3 3 3 3 7 7 ...
$ class : int 0 2 1 0 0 0 3 0 2 1 ...
utils::head(x=heart_disease_r, n=8)
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal class
1 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
2 67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
3 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
4 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
5 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
6 56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
7 62 0 4 140 268 0 2 160 0 3.6 3 2 3 3
8 57 0 4 120 354 0 0 163 1 0.6 1 0 3 0
which(is.na(heart_disease_r))
[1] 3500 3526 3621 3636 3724 3903
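The indices above are linear positions in the 303 × 14 logical matrix that `is.na()` returns, counted column by column. A minimal sketch of the arithmetic that maps them back to rows and columns follows; the helper name `linear_to_row_col` is hypothetical, and the column names are copied from the `str()` output above.

```python
def linear_to_row_col(index, n_rows):
    """Convert a 1-based, column-major linear index to a 1-based (row, column)."""
    column, row = divmod(index - 1, n_rows)
    return row + 1, column + 1


column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class"]

# Linear indices reported by which(is.na(heart_disease_r)) above.
na_indices = [3500, 3526, 3621, 3636, 3724, 3903]

locations = [linear_to_row_col(i, n_rows=303) for i in na_indices]
for row, col in locations:
    print(f"row {row}, column {column_names[col - 1]}")
```

Under these assumptions, four of the missing values fall in `ca` and two in `thal`, which agrees with the non-null counts that pandas reports for the same file.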
heart_disease_py = pandas.read_csv("../../dataset/cleveland-clinic-heart-disease.csv")
heart_disease_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null int64
1 sex 303 non-null int64
2 cp 303 non-null int64
3 trestbps 303 non-null int64
4 chol 303 non-null int64
5 fbs 303 non-null int64
6 restecg 303 non-null int64
7 thalach 303 non-null int64
8 exang 303 non-null int64
9 oldpeak 303 non-null float64
10 slope 303 non-null int64
11 ca 299 non-null float64
12 thal 301 non-null float64
13 class 303 non-null int64
dtypes: float64(3), int64(11)
memory usage: 33.3 KB
heart_disease_py.head(n=8)
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal class
0 63 1 1 145 233 1 2 150 0 2.3 3 0.0 6.0 0
1 67 1 4 160 286 0 2 108 1 1.5 2 3.0 3.0 2
2 67 1 4 120 229 0 2 129 1 2.6 2 2.0 7.0 1
3 37 1 3 130 250 0 0 187 0 3.5 3 0.0 3.0 0
4 41 0 2 130 204 0 2 172 0 1.4 1 0.0 3.0 0
5 56 1 2 120 236 0 0 178 0 0.8 1 0.0 3.0 0
6 62 0 4 140 268 0 2 160 0 3.6 3 2.0 3.0 3
7 57 0 4 120 354 0 0 163 1 0.6 1 0.0 3.0 0
heart_disease_py.tail(n=8)
age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal class
295 41 1 2 120 157 0 0 182 0 0.0 1 0.0 3.0 0
296 59 1 4 164 176 1 2 90 0 1.0 2 2.0 6.0 3
297 57 0 4 140 241 0 0 123 1 0.2 2 0.0 7.0 1
298 45 1 1 110 264 0 0 132 0 1.2 2 0.0 7.0 1
299 68 1 4 144 193 1 0 141 0 3.4 2 2.0 7.0 2
300 57 1 4 130 131 0 0 115 1 1.2 2 1.0 7.0 3
301 57 0 2 130 236 0 2 174 0 0.0 2 1.0 3.0 1
302 38 1 3 138 175 0 0 173 0 0.0 1 NaN 3.0 0
heart_disease_jl = CSV.File("../../dataset/cleveland-clinic-heart-disease.csv") |> DataFrames.DataFrame
303×14 DataFrame
Row │ age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal class
│ Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Int64 Float64 Int64 String3 String3 Int64
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 63 1 1 145 233 1 2 150 0 2.3 3 0 6 0
2 │ 67 1 4 160 286 0 2 108 1 1.5 2 3 3 2
3 │ 67 1 4 120 229 0 2 129 1 2.6 2 2 7 1
4 │ 37 1 3 130 250 0 0 187 0 3.5 3 0 3 0
5 │ 41 0 2 130 204 0 2 172 0 1.4 1 0 3 0
6 │ 56 1 2 120 236 0 0 178 0 0.8 1 0 3 0
7 │ 62 0 4 140 268 0 2 160 0 3.6 3 2 3 3
8 │ 57 0 4 120 354 0 0 163 1 0.6 1 0 3 0
9 │ 63 1 4 130 254 0 2 147 0 1.4 2 1 7 2
10 │ 53 1 4 140 203 1 2 155 1 3.1 3 0 7 1
11 │ 57 1 4 140 192 0 0 148 0 0.4 2 0 6 0
12 │ 56 0 2 140 294 0 2 153 0 1.3 2 0 3 0
13 │ 56 1 3 130 256 1 2 142 1 0.6 2 1 6 2
14 │ 44 1 2 120 263 0 0 173 0 0.0 1 0 7 0
15 │ 52 1 3 172 199 1 0 162 0 0.5 1 0 7 0
16 │ 57 1 3 150 168 0 0 174 0 1.6 1 0 3 0
17 │ 48 1 2 110 229 0 0 168 0 1.0 3 0 7 1
18 │ 54 1 4 140 239 0 0 160 0 1.2 1 0 3 0
19 │ 48 0 3 130 275 0 0 139 0 0.2 1 0 3 0
20 │ 49 1 2 130 266 0 0 171 0 0.6 1 0 3 0
21 │ 64 1 1 110 211 0 2 144 1 1.8 2 0 3 0
22 │ 58 0 1 150 283 1 2 162 0 1.0 1 0 3 0
23 │ 58 1 2 120 284 0 2 160 0 1.8 2 0 3 1
24 │ 58 1 3 132 224 0 2 173 0 3.2 1 2 7 3
25 │ 60 1 4 130 206 0 2 132 1 2.4 2 2 7 4
26 │ 50 0 3 120 219 0 0 158 0 1.6 2 0 3 0
27 │ 58 0 3 120 340 0 0 172 0 0.0 1 0 3 0
28 │ 66 0 1 150 226 0 0 114 0 2.6 3 0 3 0
29 │ 43 1 4 150 247 0 0 171 0 1.5 1 0 3 0
30 │ 40 1 4 110 167 0 2 114 1 2.0 2 0 7 3
31 │ 69 0 1 140 239 0 0 151 0 1.8 1 2 3 0
32 │ 60 1 4 117 230 1 0 160 1 1.4 1 2 7 2
33 │ 64 1 3 140 335 0 0 158 0 0.0 1 0 3 1
34 │ 59 1 4 135 234 0 0 161 0 0.5 2 0 7 0
35 │ 44 1 3 130 233 0 0 179 1 0.4 1 0 3 0
36 │ 42 1 4 140 226 0 0 178 0 0.0 1 0 3 0
37 │ 43 1 4 120 177 0 2 120 1 2.5 2 0 7 3
38 │ 57 1 4 150 276 0 2 112 1 0.6 2 1 6 1
39 │ 55 1 4 132 353 0 0 132 1 1.2 2 1 7 3
40 │ 61 1 3 150 243 1 0 137 1 1.0 2 0 3 0
41 │ 65 0 4 150 225 0 2 114 0 1.0 2 3 7 4
42 │ 40 1 1 140 199 0 0 178 1 1.4 1 0 7 0
43 │ 71 0 2 160 302 0 0 162 0 0.4 1 2 3 0
44 │ 59 1 3 150 212 1 0 157 0 1.6 1 0 3 0
45 │ 61 0 4 130 330 0 2 169 0 0.0 1 0 3 1
46 │ 58 1 3 112 230 0 2 165 0 2.5 2 1 7 4
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
259 │ 70 1 2 156 245 0 2 143 0 0.0 1 0 3 0
260 │ 57 1 2 124 261 0 0 141 0 0.3 1 0 7 1
261 │ 44 0 3 118 242 0 0 149 0 0.3 2 1 3 0
262 │ 58 0 2 136 319 1 2 152 0 0.0 1 2 3 3
263 │ 60 0 1 150 240 0 0 171 0 0.9 1 0 3 0
264 │ 44 1 3 120 226 0 0 169 0 0.0 1 0 3 0
265 │ 61 1 4 138 166 0 2 125 1 3.6 2 1 3 4
266 │ 42 1 4 136 315 0 0 125 1 1.8 2 0 6 2
267 │ 52 1 4 128 204 1 0 156 1 1.0 2 0 NA 2
268 │ 59 1 3 126 218 1 0 134 0 2.2 2 1 6 2
269 │ 40 1 4 152 223 0 0 181 0 0.0 1 0 7 1
270 │ 42 1 3 130 180 0 0 150 0 0.0 1 0 3 0
271 │ 61 1 4 140 207 0 2 138 1 1.9 1 1 7 1
272 │ 66 1 4 160 228 0 2 138 0 2.3 1 0 6 0
273 │ 46 1 4 140 311 0 0 120 1 1.8 2 2 7 2
274 │ 71 0 4 112 149 0 0 125 0 1.6 2 0 3 0
275 │ 59 1 1 134 204 0 0 162 0 0.8 1 2 3 1
276 │ 64 1 1 170 227 0 2 155 0 0.6 2 0 7 0
277 │ 66 0 3 146 278 0 2 152 0 0.0 2 1 3 0
278 │ 39 0 3 138 220 0 0 152 0 0.0 2 0 3 0
279 │ 57 1 2 154 232 0 2 164 0 0.0 1 1 3 1
280 │ 58 0 4 130 197 0 0 131 0 0.6 2 0 3 0
281 │ 57 1 4 110 335 0 0 143 1 3.0 2 1 7 2
282 │ 47 1 3 130 253 0 0 179 0 0.0 1 0 3 0
283 │ 55 0 4 128 205 0 1 130 1 2.0 2 1 7 3
284 │ 35 1 2 122 192 0 0 174 0 0.0 1 0 3 0
285 │ 61 1 4 148 203 0 0 161 0 0.0 1 1 7 2
286 │ 58 1 4 114 318 0 1 140 0 4.4 3 3 6 4
287 │ 58 0 4 170 225 1 2 146 1 2.8 2 2 6 2
288 │ 58 1 2 125 220 0 0 144 0 0.4 2 NA 7 0
289 │ 56 1 2 130 221 0 2 163 0 0.0 1 0 7 0
290 │ 56 1 2 120 240 0 0 169 0 0.0 3 0 3 0
291 │ 67 1 3 152 212 0 2 150 0 0.8 2 0 7 1
292 │ 55 0 2 132 342 0 0 166 0 1.2 1 0 3 0
293 │ 44 1 4 120 169 0 0 144 1 2.8 3 0 6 2
294 │ 63 1 4 140 187 0 2 144 1 4.0 1 2 7 2
295 │ 63 0 4 124 197 0 0 136 1 0.0 2 0 3 1
296 │ 41 1 2 120 157 0 0 182 0 0.0 1 0 3 0
297 │ 59 1 4 164 176 1 2 90 0 1.0 2 2 6 3
298 │ 57 0 4 140 241 0 0 123 1 0.2 2 0 7 1
299 │ 45 1 1 110 264 0 0 132 0 1.2 2 0 7 1
300 │ 68 1 4 144 193 1 0 141 0 3.4 2 2 7 2
301 │ 57 1 4 130 131 0 0 115 1 1.2 2 1 7 3
302 │ 57 0 2 130 236 0 2 174 0 0.0 2 1 3 1
303 │ 38 1 3 138 175 0 0 173 0 0.0 1 NA 3 0
212 rows omitted
Business Understanding
Data Understanding
Data Preparation
Data Modeling
Model Evaluation
Appendix A: Environment, Language & Package Versions, and Coding Style
If you are interested in reproducing this work, here are the versions of R, Python, and Julia that I used (as well as the respective packages for each). Additionally, my coding style here is verbose, in order to trace back where functions/methods and variables originate, and to make this a learning experience for everyone, including me.
cat(
R.version$version.string, "-", R.version$nickname,
"\nOS:", Sys.info()["sysname"], R.version$platform,
"\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle
OS: Darwin x86_64-apple-darwin17.0
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("Metrics", version="0.1.4", repos="http://cran.us.r-project.org")
library(package=dplyr)
library(package=ggplot2)
library(package=Metrics)
import sys
import platform
import os
import cpuinfo
print(
"Python", sys.version,
"\nOS:", platform.system(), platform.platform(),
"\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
OS: Darwin macOS-10.16-x86_64-i386-64bit
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
import numpy
import pandas
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")
using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase
Further Readings
- Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
- Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.