Data-Informed Thinking + Doing

Classification With High Interpretability

Modeling categorical predictions that are easy to explain, using classification trees in R, Python, and Julia.

Classification and Regression Trees (CART) are decision tree algorithms used for both classification and regression tasks. They are widely applied to problems such as customer segmentation, fraud detection, disease diagnosis, and sentiment analysis. Their intuitive representation and their ability to capture complex interactions between features make them valuable for understanding patterns in data and for producing predictions that can be explained to decision-makers.

Like Regression Trees, Classification Trees are significant because they perform non-linear yet interpretable classification. They divide the data into distinct classes by recursively splitting on the most informative feature at each step.
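To make "most informative" concrete, here is a minimal sketch of how a single split can be scored. It uses Gini impurity, a common criterion for classification trees, and searches one numeric feature for the threshold that gives the lowest weighted impurity. The function names are hypothetical, the eight values are the thalach and class entries from the first eight rows of the dataset preview shown later in this article, and treating class > 0 as "disease present" is an assumption made only for this illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def best_split(values, labels):
    """Find the threshold on one numeric feature that minimises the
    size-weighted Gini impurity of the two resulting groups."""
    n = len(values)
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# First eight rows of the dataset preview: thalach and class.
thalach = [150, 108, 129, 187, 172, 178, 160, 163]
heart_class = [0, 2, 1, 0, 0, 0, 3, 0]
# Assumption for illustration only: any class above 0 counts as disease.
has_disease = [1 if c > 0 else 0 for c in heart_class]

root_impurity = gini_impurity(has_disease)
threshold, split_impurity = best_split(thalach, has_disease)
print(root_impurity, threshold, split_impurity)  # 0.46875 161.5 0.1875
```

A full tree repeats this search over every feature, keeps the best split, and then recurses into each of the two groups until a stopping rule is met. Eight rows are far too few to draw any conclusion about the data; the point is only to show the mechanics of one split.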

Let’s examine this technique using the Cleveland Clinic heart disease dataset.

Getting Started

If you are interested in reproducing this work, here are the versions of R, Python, and Julia used, along with the packages for each. Additionally, Leland Wilkinson’s approach to data visualization (Grammar of Graphics) has been adopted for this work. Finally, my coding style here is deliberately verbose, so that it is easy to trace where functions, methods, and variables originate, and to make this a learning experience for everyone, including me.

cat(R.version$version.string, R.version$nickname)
R version 4.2.3 (2023-03-15) Shortstop Beagle
require(devtools)
devtools::install_version("dplyr", version="1.1.2", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
import pandas
import plotnine
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server
using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly

Importing and Examining Dataset

heart_disease_r <- read.csv("../../dataset/cleveland-clinic-heart-disease.csv", stringsAsFactors=TRUE)
utils::str(object=heart_disease_r)
'data.frame':	303 obs. of  14 variables:
 $ age     : int  63 67 67 37 41 56 62 57 63 53 ...
 $ sex     : int  1 1 1 1 0 1 0 0 1 1 ...
 $ cp      : int  1 4 4 3 2 2 4 4 4 4 ...
 $ trestbps: int  145 160 120 130 130 120 140 120 130 140 ...
 $ chol    : int  233 286 229 250 204 236 268 354 254 203 ...
 $ fbs     : int  1 0 0 0 0 0 0 0 0 1 ...
 $ restecg : int  2 2 2 0 2 0 2 0 2 2 ...
 $ thalach : int  150 108 129 187 172 178 160 163 147 155 ...
 $ exang   : int  0 1 1 0 0 0 0 1 0 1 ...
 $ oldpeak : num  2.3 1.5 2.6 3.5 1.4 0.8 3.6 0.6 1.4 3.1 ...
 $ slope   : int  3 2 2 3 1 1 3 1 2 3 ...
 $ ca      : int  0 3 2 0 0 0 2 0 1 0 ...
 $ thal    : int  6 3 7 3 3 3 3 3 7 7 ...
 $ class   : int  0 2 1 0 0 0 3 0 2 1 ...
utils::head(x=heart_disease_r, n=8)
  age sex cp trestbps chol fbs restecg thalach exang oldpeak slope ca thal class
1  63   1  1      145  233   1       2     150     0     2.3     3  0    6     0
2  67   1  4      160  286   0       2     108     1     1.5     2  3    3     2
3  67   1  4      120  229   0       2     129     1     2.6     2  2    7     1
4  37   1  3      130  250   0       0     187     0     3.5     3  0    3     0
5  41   0  2      130  204   0       2     172     0     1.4     1  0    3     0
6  56   1  2      120  236   0       0     178     0     0.8     1  0    3     0
7  62   0  4      140  268   0       2     160     0     3.6     3  2    3     3
8  57   0  4      120  354   0       0     163     1     0.6     1  0    3     0
which(is.na(heart_disease_r))
[1] 3500 3526 3621 3636 3724 3903
heart_disease_py = pandas.read_csv("../../dataset/cleveland-clinic-heart-disease.csv")
heart_disease_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        299 non-null    float64
 12  thal      301 non-null    float64
 13  class     303 non-null    int64  
dtypes: float64(3), int64(11)
memory usage: 33.3 KB
heart_disease_py.head(n=8)
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  class
0   63    1   1       145   233    1        2      150      0      2.3      3  0.0   6.0      0
1   67    1   4       160   286    0        2      108      1      1.5      2  3.0   3.0      2
2   67    1   4       120   229    0        2      129      1      2.6      2  2.0   7.0      1
3   37    1   3       130   250    0        0      187      0      3.5      3  0.0   3.0      0
4   41    0   2       130   204    0        2      172      0      1.4      1  0.0   3.0      0
5   56    1   2       120   236    0        0      178      0      0.8      1  0.0   3.0      0
6   62    0   4       140   268    0        2      160      0      3.6      3  2.0   3.0      3
7   57    0   4       120   354    0        0      163      1      0.6      1  0.0   3.0      0
heart_disease_py.tail(n=8)
     age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  class
295   41    1   2       120   157    0        0      182      0      0.0      1  0.0   3.0      0
296   59    1   4       164   176    1        2       90      0      1.0      2  2.0   6.0      3
297   57    0   4       140   241    0        0      123      1      0.2      2  0.0   7.0      1
298   45    1   1       110   264    0        0      132      0      1.2      2  0.0   7.0      1
299   68    1   4       144   193    1        0      141      0      3.4      2  2.0   7.0      2
300   57    1   4       130   131    0        0      115      1      1.2      2  1.0   7.0      3
301   57    0   2       130   236    0        2      174      0      0.0      2  1.0   3.0      1
302   38    1   3       138   175    0        0      173      0      0.0      1  NaN   3.0      0
heart_disease_jl = CSV.File("../../dataset/cleveland-clinic-heart-disease.csv") |> DataFrames.DataFrame
303×14 DataFrame
 Row │ age    sex    cp     trestbps  chol   fbs    restecg  thalach  exang  oldpeak  slope  ca       thal     class
     │ Int64  Int64  Int64  Int64     Int64  Int64  Int64    Int64    Int64  Float64  Int64  String3  String3  Int64
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │    63      1      1       145    233      1        2      150      0      2.3      3  0        6            0
   2 │    67      1      4       160    286      0        2      108      1      1.5      2  3        3            2
   3 │    67      1      4       120    229      0        2      129      1      2.6      2  2        7            1
   4 │    37      1      3       130    250      0        0      187      0      3.5      3  0        3            0
   5 │    41      0      2       130    204      0        2      172      0      1.4      1  0        3            0
   6 │    56      1      2       120    236      0        0      178      0      0.8      1  0        3            0
   7 │    62      0      4       140    268      0        2      160      0      3.6      3  2        3            3
   8 │    57      0      4       120    354      0        0      163      1      0.6      1  0        3            0
   9 │    63      1      4       130    254      0        2      147      0      1.4      2  1        7            2
  10 │    53      1      4       140    203      1        2      155      1      3.1      3  0        7            1
  11 │    57      1      4       140    192      0        0      148      0      0.4      2  0        6            0
  12 │    56      0      2       140    294      0        2      153      0      1.3      2  0        3            0
  13 │    56      1      3       130    256      1        2      142      1      0.6      2  1        6            2
  14 │    44      1      2       120    263      0        0      173      0      0.0      1  0        7            0
  15 │    52      1      3       172    199      1        0      162      0      0.5      1  0        7            0
  16 │    57      1      3       150    168      0        0      174      0      1.6      1  0        3            0
  17 │    48      1      2       110    229      0        0      168      0      1.0      3  0        7            1
  18 │    54      1      4       140    239      0        0      160      0      1.2      1  0        3            0
  19 │    48      0      3       130    275      0        0      139      0      0.2      1  0        3            0
  20 │    49      1      2       130    266      0        0      171      0      0.6      1  0        3            0
  21 │    64      1      1       110    211      0        2      144      1      1.8      2  0        3            0
  22 │    58      0      1       150    283      1        2      162      0      1.0      1  0        3            0
  23 │    58      1      2       120    284      0        2      160      0      1.8      2  0        3            1
  24 │    58      1      3       132    224      0        2      173      0      3.2      1  2        7            3
  25 │    60      1      4       130    206      0        2      132      1      2.4      2  2        7            4
  26 │    50      0      3       120    219      0        0      158      0      1.6      2  0        3            0
  27 │    58      0      3       120    340      0        0      172      0      0.0      1  0        3            0
  28 │    66      0      1       150    226      0        0      114      0      2.6      3  0        3            0
  29 │    43      1      4       150    247      0        0      171      0      1.5      1  0        3            0
  30 │    40      1      4       110    167      0        2      114      1      2.0      2  0        7            3
  31 │    69      0      1       140    239      0        0      151      0      1.8      1  2        3            0
  32 │    60      1      4       117    230      1        0      160      1      1.4      1  2        7            2
  33 │    64      1      3       140    335      0        0      158      0      0.0      1  0        3            1
  34 │    59      1      4       135    234      0        0      161      0      0.5      2  0        7            0
  35 │    44      1      3       130    233      0        0      179      1      0.4      1  0        3            0
  36 │    42      1      4       140    226      0        0      178      0      0.0      1  0        3            0
  37 │    43      1      4       120    177      0        2      120      1      2.5      2  0        7            3
  38 │    57      1      4       150    276      0        2      112      1      0.6      2  1        6            1
  39 │    55      1      4       132    353      0        0      132      1      1.2      2  1        7            3
  40 │    61      1      3       150    243      1        0      137      1      1.0      2  0        3            0
  41 │    65      0      4       150    225      0        2      114      0      1.0      2  3        7            4
  42 │    40      1      1       140    199      0        0      178      1      1.4      1  0        7            0
  43 │    71      0      2       160    302      0        0      162      0      0.4      1  2        3            0
  44 │    59      1      3       150    212      1        0      157      0      1.6      1  0        3            0
  45 │    61      0      4       130    330      0        2      169      0      0.0      1  0        3            1
  46 │    58      1      3       112    230      0        2      165      0      2.5      2  1        7            4
  ⋮  │   ⋮      ⋮      ⋮       ⋮        ⋮      ⋮       ⋮        ⋮       ⋮       ⋮       ⋮       ⋮        ⋮       ⋮
 259 │    70      1      2       156    245      0        2      143      0      0.0      1  0        3            0
 260 │    57      1      2       124    261      0        0      141      0      0.3      1  0        7            1
 261 │    44      0      3       118    242      0        0      149      0      0.3      2  1        3            0
 262 │    58      0      2       136    319      1        2      152      0      0.0      1  2        3            3
 263 │    60      0      1       150    240      0        0      171      0      0.9      1  0        3            0
 264 │    44      1      3       120    226      0        0      169      0      0.0      1  0        3            0
 265 │    61      1      4       138    166      0        2      125      1      3.6      2  1        3            4
 266 │    42      1      4       136    315      0        0      125      1      1.8      2  0        6            2
 267 │    52      1      4       128    204      1        0      156      1      1.0      2  0        NA           2
 268 │    59      1      3       126    218      1        0      134      0      2.2      2  1        6            2
 269 │    40      1      4       152    223      0        0      181      0      0.0      1  0        7            1
 270 │    42      1      3       130    180      0        0      150      0      0.0      1  0        3            0
 271 │    61      1      4       140    207      0        2      138      1      1.9      1  1        7            1
 272 │    66      1      4       160    228      0        2      138      0      2.3      1  0        6            0
 273 │    46      1      4       140    311      0        0      120      1      1.8      2  2        7            2
 274 │    71      0      4       112    149      0        0      125      0      1.6      2  0        3            0
 275 │    59      1      1       134    204      0        0      162      0      0.8      1  2        3            1
 276 │    64      1      1       170    227      0        2      155      0      0.6      2  0        7            0
 277 │    66      0      3       146    278      0        2      152      0      0.0      2  1        3            0
 278 │    39      0      3       138    220      0        0      152      0      0.0      2  0        3            0
 279 │    57      1      2       154    232      0        2      164      0      0.0      1  1        3            1
 280 │    58      0      4       130    197      0        0      131      0      0.6      2  0        3            0
 281 │    57      1      4       110    335      0        0      143      1      3.0      2  1        7            2
 282 │    47      1      3       130    253      0        0      179      0      0.0      1  0        3            0
 283 │    55      0      4       128    205      0        1      130      1      2.0      2  1        7            3
 284 │    35      1      2       122    192      0        0      174      0      0.0      1  0        3            0
 285 │    61      1      4       148    203      0        0      161      0      0.0      1  1        7            2
 286 │    58      1      4       114    318      0        1      140      0      4.4      3  3        6            4
 287 │    58      0      4       170    225      1        2      146      1      2.8      2  2        6            2
 288 │    58      1      2       125    220      0        0      144      0      0.4      2  NA       7            0
 289 │    56      1      2       130    221      0        2      163      0      0.0      1  0        7            0
 290 │    56      1      2       120    240      0        0      169      0      0.0      3  0        3            0
 291 │    67      1      3       152    212      0        2      150      0      0.8      2  0        7            1
 292 │    55      0      2       132    342      0        0      166      0      1.2      1  0        3            0
 293 │    44      1      4       120    169      0        0      144      1      2.8      3  0        6            2
 294 │    63      1      4       140    187      0        2      144      1      4.0      1  2        7            2
 295 │    63      0      4       124    197      0        0      136      1      0.0      2  0        3            1
 296 │    41      1      2       120    157      0        0      182      0      0.0      1  0        3            0
 297 │    59      1      4       164    176      1        2       90      0      1.0      2  2        6            3
 298 │    57      0      4       140    241      0        0      123      1      0.2      2  0        7            1
 299 │    45      1      1       110    264      0        0      132      0      1.2      2  0        7            1
 300 │    68      1      4       144    193      1        0      141      0      3.4      2  2        7            2
 301 │    57      1      4       130    131      0        0      115      1      1.2      2  1        7            3
 302 │    57      0      2       130    236      0        2      174      0      0.0      2  1        3            1
 303 │    38      1      3       138    175      0        0      173      0      0.0      1  NA       3            0
                                                                                                     212 rows omitted
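The R call which(is.na(heart_disease_r)) above reports the missing cells as single numbers rather than as row and column pairs. Because is.na() on a data frame yields a logical matrix, and R stores matrices column by column, each number is a 1-based column-major position. As a minimal sketch, those positions can be decoded in plain Python; the helper name locate is hypothetical, and the row count, column names, and indices are copied from the R output above.

```python
# Column names and row count copied from the str() output above.
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "class"]
n_rows = 303

# Linear indices copied from the which(is.na(...)) output above.
linear_indices = [3500, 3526, 3621, 3636, 3724, 3903]


def locate(linear_index, n_rows, columns):
    """Map a 1-based column-major linear index to (1-based row, column name)."""
    column_position, row_position = divmod(linear_index - 1, n_rows)
    return row_position + 1, columns[column_position]


missing_cells = [locate(i, n_rows, columns) for i in linear_indices]
for row, name in missing_cells:
    print(f"row {row}: {name}")
```

The decoded cells agree with the pandas summary (four missing values in ca, two in thal) and with the NA entries visible in rows 267, 288, and 303 of the Julia output. In the Julia DataFrame, ca and thal have type String3, which suggests the NA markers were read as text rather than as missing values; if so, passing missingstring="NA" to CSV.File would likely be needed before modeling.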

Business Understanding


Data Understanding


Data Preparation


Data Modeling


Model Evaluation


Appendix A: Environment, Language & Package Versions, and Coding Style

If you are interested in reproducing this work, here are the versions of R, Python, and Julia that I used, along with the packages for each. Additionally, my coding style here is deliberately verbose, so that it is easy to trace where functions, methods, and variables originate, and to make this a learning experience for everyone, including me.

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle 
OS: Darwin x86_64-apple-darwin17.0 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("Metrics", version="0.1.4", repos="http://cran.us.r-project.org")

library(package=dplyr)
library(package=ggplot2)
library(package=Metrics)
import sys
import platform
import os
import cpuinfo
print(
    "Python", sys.version,
    "\nOS:", platform.system(), platform.platform(),
    "\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)] 
OS: Darwin macOS-10.16-x86_64-i386-64bit 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1

import numpy
import pandas
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")

using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase

Further Readings

  • Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
  • Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.