Data-Informed Thinking + Doing

Binary Classification Based on Similarities

Predicting loan offer acceptance—using KNN in R, Python, and Julia.


Data Understanding

To understand the data, I’ve imported a CSV file containing 5,000 records and 15 columns. The columns capture customer demographics and banking behavior: age, years of professional experience, income, ZIP code, family size, average credit card spending (cc_avg), education level, and mortgage value, plus binary indicators for holding a securities account, holding a CD account, using online banking, and having a bank-issued credit card. The target, personal_loan, records whether the customer accepted the personal loan offer (0 for no, 1 for yes); id is a row identifier, and X is a mostly-missing artifact column. This preliminary analysis is crucial for recognizing the dataset’s structure and preparing for further data exploration and modeling.
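
For reference, a minimal sketch of the import step in R; the file name and path are placeholders, not the actual source location:

# Hypothetical location of the CSV file; substitute the actual path or URL
loan_r <- read.csv(file="data/universal_bank.csv", header=TRUE)
dim(loan_r)  # expect 5000 rows and 15 columns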

str(loan_r)
'data.frame':	5000 obs. of  15 variables:
 $ id                : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age               : int  25 45 39 35 35 37 53 50 35 34 ...
 $ experience        : int  1 19 15 9 8 13 27 24 10 9 ...
 $ income            : int  49 34 11 100 45 29 72 22 81 180 ...
 $ zip_code          : int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
 $ family            : int  4 3 1 1 4 4 2 1 3 1 ...
 $ cc_avg            : num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
 $ education         : int  1 1 1 2 2 2 2 3 2 3 ...
 $ mortgage          : int  0 0 0 0 0 155 0 0 104 0 ...
 $ personal_loan     : int  0 0 0 0 0 0 0 0 0 1 ...
 $ securities_account: int  1 1 0 0 0 0 0 0 0 0 ...
 $ cd_account        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ online            : int  0 0 0 0 0 1 1 0 1 0 ...
 $ credit_card       : int  0 0 0 0 1 0 0 1 0 0 ...
 $ X                 : num  30.8 NA NA NA NA ...

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is crucial for building an all-encompassing perspective of individual variables (univariate), pairs of variables (bivariate), and interactions among several variables (multivariate). It reveals trends, patterns, and irregularities, steering subsequent analysis and the development of hypotheses.

Univariate Analysis

Univariate analysis, which inspects single variables, reveals patterns and provides a summary of the data. It forms the basis for all further analyses by shedding light on distribution, central tendency, and dispersion. By focusing on one variable at a time, it facilitates the detection of outliers, comprehension of data distribution, and evaluation of data quality. This analysis is vital for preparing for more sophisticated bivariate and multivariate analyses.

Numerical Variables
# Income
# AnalyzeUnivariate(c(loan_r$income))
Categorical Variables
# Education
# AnalyzeUnivariate(loan_r$education, categorical=TRUE)
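
AnalyzeUnivariate is a custom helper; as a self-contained alternative, here is a minimal sketch of an equivalent univariate summary using base R and ggplot2 (assuming the loan_r data frame from above):

# Numerical: distribution, central tendency, and dispersion of income
summary(object=loan_r$income)
ggplot2::ggplot(data=loan_r, mapping=ggplot2::aes(x=income)) +
    ggplot2::geom_histogram(bins=30)

# Categorical: frequency table and bar chart for education level
table(loan_r$education)
ggplot2::ggplot(data=loan_r, mapping=ggplot2::aes(x=factor(education))) +
    ggplot2::geom_bar()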

Bivariate Analysis

Bivariate analysis is essential for discerning statistical associations between two variables, particularly between each predictor and the outcome variable. It provides insights into trends, informs feature selection, and improves the accuracy of predictive models by elucidating how variables interrelate.
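
As an illustrative sketch (assuming the loan_r data frame from above), a grouped boxplot relates income to the target:

# Compare income distributions for customers who declined vs. accepted the offer
ggplot2::ggplot(data=loan_r, mapping=ggplot2::aes(x=factor(personal_loan), y=income)) +
    ggplot2::geom_boxplot() +
    ggplot2::labs(x="personal_loan (0 = declined, 1 = accepted)", y="income")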

Multivariate Analysis

Multivariate analysis is essential for understanding intricate relationships in data. It facilitates the identification of patterns, outliers, and the inherent structure of data sets. A correlation matrix, a fundamental tool in multivariate analysis, uncovers the magnitude and direction of associations between variables. Positive values denote a direct relationship, whereas negative values imply an inverse one. Knowledge gained from a correlation matrix can steer feature selection, aid in formulating hypotheses, and improve model precision by pinpointing multicollinearity or potential predictors for additional analysis.
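
A minimal sketch of such a correlation matrix using ggcorrplot, restricted to the numeric predictors (the column selection assumes the loan_r data frame from above):

# Correlation matrix of the numeric predictors, visualized with ggcorrplot
numeric_columns <- c("age", "experience", "income", "family", "cc_avg", "mortgage")
correlation_matrix <- stats::cor(x=loan_r[, numeric_columns], use="complete.obs")
ggcorrplot::ggcorrplot(corr=correlation_matrix, lab=TRUE)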


Data Modeling

Data Preparation: Split Data for Training & Testing

Splitting the dataset into training and testing sets is crucial for evaluating the model’s ability to generalize to new data. For a dataset with 5,000 observations, a split ratio of 70:30 or 80:20 for training and testing, respectively, is common. This provides sufficient data for learning while retaining enough unseen instances to assess performance accurately without overfitting.
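
A minimal sketch of a stratified split with caret (assuming the loan_r data frame from above; the seed value is arbitrary):

# Stratified 80:20 split on the target to preserve class proportions
set.seed(seed=123)
training_indices <- caret::createDataPartition(y=factor(loan_r$personal_loan), p=0.8, list=FALSE)
loan_train <- loan_r[training_indices, ]
loan_test <- loan_r[-training_indices, ]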

Model Fitting
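
A minimal sketch of fitting KNN with the class package (assuming the loan_train/loan_test split above; the feature subset and k=5 are illustrative assumptions, not tuned values):

# Select predictors, dropping the identifier, ZIP code, and the target
predictor_columns <- c("age", "experience", "income", "family", "cc_avg",
                       "education", "mortgage", "securities_account",
                       "cd_account", "online", "credit_card")

# Scale features so that no single one dominates the distance computation
train_x <- scale(x=loan_train[, predictor_columns])
test_x <- scale(x=loan_test[, predictor_columns],
                center=attr(x=train_x, which="scaled:center"),
                scale=attr(x=train_x, which="scaled:scale"))

# Classify each test observation by majority vote among its 5 nearest neighbors
knn_predictions <- class::knn(train=train_x, test=test_x,
                              cl=factor(loan_train$personal_loan), k=5)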


Model Evaluation

Model 1: All Features
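
A minimal sketch of evaluating the predictions with caret (assuming knn_predictions from the fitting step above):

# Confusion matrix of predicted vs. actual classes on the held-out test set
caret::confusionMatrix(data=knn_predictions,
                       reference=factor(loan_test$personal_loan),
                       positive="1")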

Model 2

Model Selection


Business Understanding: Leveling-Up


Appendix A: Environment, Language & Package Versions, and Coding Style

If you are interested in reproducing this work, here are the versions of R, Python, and Julia that I used (as well as the respective packages for each). Additionally, my coding style here is verbose, in order to trace back where functions/methods and variables are originating from, and make this a learning experience for everyone—including me. Finally, the data visualizations are mostly (if not entirely) implemented using the Grammar of Graphics framework.

cat(
    R.version$version.string, "-", R.version$nickname,
    "\nOS:", Sys.info()["sysname"], R.version$platform,
    "\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle 
OS: Darwin x86_64-apple-darwin17.0 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4.1", repos="http://cran.us.r-project.org")
devtools::install_version("caret", version="6.0.94", repos="http://cran.us.r-project.org")
devtools::install_version("class", version="7.3.21", repos="http://cran.us.r-project.org")

library(package=dplyr)
library(package=ggplot2)
library(package=ggcorrplot)
library(package=caret)
library(package=class)
import sys
import platform
import os
import cpuinfo
print(
    "Python", sys.version,
    "\nOS:", platform.system(), platform.platform(),
    "\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)] 
OS: Darwin macOS-10.16-x86_64-i386-64bit 
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1

import numpy
import pandas
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")

using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase

Appendix B: A Case for KNN

Advantages

  • Simple and Easy to Understand: The KNN algorithm is straightforward, making it a popular choice for beginners in machine learning.
  • Non-parametric: KNN does not make any assumptions about the underlying distribution of the data, making it a flexible algorithm that can be used in a wide range of applications.
  • No Training Required: KNN does not require any training process, which means that it can be used in real-time applications where data is continuously being generated.
  • Naturally Handles Multi-Class Problems: because prediction is a majority vote among neighbors, KNN extends to any number of classes without modification.
  • Accurate and Effective: KNN is known for its accuracy and effectiveness, particularly when used with small to medium-sized datasets.

Disadvantages

  • Sensitive to Outliers: KNN can be sensitive to outliers in the data, which can significantly affect its performance.
  • Computationally Expensive: KNN can be computationally expensive, particularly for large datasets. This is because the algorithm needs to compute the distance between each test data point and every training data point, which can be time-consuming.
  • Requires Good Choice of K: the algorithm’s performance hinges on the K parameter, which determines the number of nearest neighbors used for classification. If K is too small, the algorithm may be too sensitive to noise in the data; if K is too large, it may smooth over important local patterns (see the tuning sketch after this list).
  • Sensitive to the Distance Metric and Dimensionality: KNN typically defaults to Euclidean distance, which may not always be the best measure of similarity between data points, and in high-dimensional feature spaces distances become less informative (the curse of dimensionality).
  • Imbalanced Data: KNN can struggle with imbalanced data, where one class has significantly more instances than the other.
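
A minimal sketch of choosing K via cross-validation with caret (assuming the loan_train split and predictor_columns from the modeling section; the grid of odd K values is an illustrative assumption):

# Tune K over odd values from 1 to 21 with 10-fold cross-validation
train_control <- caret::trainControl(method="cv", number=10)
knn_tuned <- caret::train(
    x=loan_train[, predictor_columns],
    y=factor(loan_train$personal_loan),
    method="knn",
    preProcess=c("center", "scale"),
    trControl=train_control,
    tuneGrid=data.frame(k=seq(from=1, to=21, by=2))
)
knn_tuned$bestTune  # the cross-validated choice of K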

Further Readings

  • Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
  • Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.