Centroid-Based and Hierarchical-Based Clustering
Identifying similar groups of U.S. colleges & universities—using K-Means clustering and HClust in R, Python, and Julia.
Data Understanding
To understand the colleges & universities data (after importing the CSV file), I first identified that it contains 1,302 observations across 23 variables. The variables include both categorical (e.g., name, state, private) and numerical types (e.g., sat_math, act, applications_received). Some variables have missing values (e.g., sat_math, act, pct_new_students_top10pct), which will need to be addressed in the Data Preparation phase. Because most of the variables are numerical measures of admissions, cost, and outcomes, the dataset lends itself well to cluster analysis. In the next stage, I will perform exploratory data analysis (EDA) to understand the distributions, relationships, and potential outliers in the data.
str(colleges_universities_r)
'data.frame': 1302 obs. of 23 variables:
$ name : chr "Alaska Pacific University" "University of Alaska at Fairbanks" "University of Alaska Southeast" "University of Alaska at Anchorage" ...
$ state : Factor w/ 51 levels "AK","AL","AR",..: 1 1 1 1 2 2 2 2 2 2 ...
$ private : logi TRUE FALSE FALSE FALSE FALSE TRUE ...
$ sat_math : int 490 499 NA 459 NA NA NA NA 575 575 ...
$ sat_verbal : int 482 462 NA 422 NA NA NA NA 501 525 ...
$ act : int 20 22 NA 20 17 20 21 NA 24 26 ...
$ applications_received : int 193 1852 146 2065 2817 345 1351 4639 7548 805 ...
$ applications_accepted : int 146 1427 117 1598 1920 320 892 3272 6791 588 ...
$ new_students_enrolled : int 55 928 89 1162 984 179 570 1278 3070 287 ...
$ pct_new_students_top10pct : num 0.16 NA 0.04 NA NA NA 0.18 NA 0.25 0.67 ...
$ pct_new_students_top25pct : num 0.44 NA 0.24 NA NA 0.27 0.78 NA 0.57 0.88 ...
$ undergraduate_ft : int 249 3885 492 6209 3958 1367 2385 4051 16262 1376 ...
$ undergraduate_pt : int 869 4519 1849 10537 305 578 331 405 1716 207 ...
$ tuition_instate : int 7560 1742 1742 1742 1700 5600 2220 1500 2100 11660 ...
$ tuition_outofstate : int 7560 5226 5226 5226 3400 5600 4440 3000 6300 11660 ...
$ room : int 1620 1800 2514 2600 1108 1550 NA 1960 NA 2050 ...
$ board : int 2500 1790 2250 2520 1442 1700 NA NA NA 2430 ...
$ additional_fees : int 130 155 34 114 155 300 124 84 NA 120 ...
$ estimated_expense_books : int 800 650 500 580 500 350 300 500 600 400 ...
$ estimated_expense_personal: int 1500 2304 1162 1260 850 NA 600 NA 1908 900 ...
$ pct_faculty_phd : num 0.76 0.67 0.39 0.48 0.53 0.52 0.72 0.48 0.85 0.74 ...
$ student_faculty_ratio : num 11.9 10 9.5 13.7 14.3 32.8 18.9 18.7 16.7 14 ...
$ graduation_rate : num 0.15 NA 0.39 NA 0.4 0.55 0.51 0.15 0.69 0.72 ...
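As a quick check on the missing values noted above, here is a minimal sketch (assuming colleges_universities_r is the imported data frame) that counts the NAs in each variable:
# Count missing values per variable, sorted from most to least affected
sort(colSums(is.na(colleges_universities_r)), decreasing=TRUE)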
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is essential for building a thorough understanding of single variables (univariate), pairs of variables (bivariate), and interactions among multiple variables (multivariate). It reveals trends, patterns, and anomalies, guiding further analysis and the formation of hypotheses.
Univariate Analysis
Univariate analysis focuses on one variable at a time, summarizing its distribution, central tendency, and dispersion. Examining variables individually helps detect outliers, understand how the data are distributed, and assess data quality, laying the groundwork for the more intricate bivariate and multivariate analyses that follow.
Numerical Variables
# SAT Math
# AnalyzeUnivariate(c(colleges_universities_r$sat_math))
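AnalyzeUnivariate appears to be a custom helper that is not defined in this post, so here is a minimal ggplot2 sketch of what a univariate view of a numerical variable such as sat_math might look like (the bin count is an illustrative choice):
# Histogram of SAT math scores; rows with missing sat_math are dropped with a warning
ggplot2::ggplot(data=colleges_universities_r, mapping=ggplot2::aes(x=sat_math)) +
  ggplot2::geom_histogram(bins=30) +
  ggplot2::labs(title="Distribution of SAT Math Scores", x="SAT Math", y="Count")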
Categorical Variables
# Private
# AnalyzeUnivariate(colleges_universities_r$private, categorical=TRUE)
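Along the same lines, a hedged ggplot2 sketch for a categorical variable such as private:
# Bar chart of private vs. public institutions
ggplot2::ggplot(data=colleges_universities_r, mapping=ggplot2::aes(x=private)) +
  ggplot2::geom_bar() +
  ggplot2::labs(title="Private vs. Public Institutions", x="Private", y="Count")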
Bivariate Analysis
Bivariate analysis is vital for uncovering statistical relationships between two variables, notably between predictors and the target variable. It sheds light on trends and feature selection, while augmenting the accuracy of predictive models by elucidating how variables interact with one another.
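As an illustration, a minimal ggplot2 sketch relating two of the numerical variables (this particular pairing is my choice, not part of the original analysis):
# Scatter plot of graduation rate against out-of-state tuition, colored by private/public
ggplot2::ggplot(
  data=colleges_universities_r,
  mapping=ggplot2::aes(x=tuition_outofstate, y=graduation_rate, color=private)
) +
  ggplot2::geom_point(alpha=0.6, na.rm=TRUE) +
  ggplot2::labs(title="Graduation Rate vs. Out-of-State Tuition", x="Out-of-State Tuition (USD)", y="Graduation Rate")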
Multivariate Analysis
Multivariate analysis is key for deciphering intricate relationships in data. It allows the discovery of patterns, outliers, and the inherent structure of data sets. A correlation matrix, a central tool in multivariate analysis, exposes the intensity and direction of links between variables. Positive values represent a direct relationship, while negative values indicate an inverse one. Insights derived from a correlation matrix can direct feature selection, assist in hypothesis testing, and boost model accuracy by detecting multicollinearity or potential predictors for further examination.
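A minimal sketch of such a correlation matrix using ggcorrplot, computed on the numerical columns with pairwise-complete observations to cope with the missing values:
# Correlation matrix of the numerical variables, visualized as a lower triangle
numeric_columns <- colleges_universities_r[, sapply(colleges_universities_r, is.numeric)]
correlation_matrix <- stats::cor(numeric_columns, use="pairwise.complete.obs")
ggcorrplot::ggcorrplot(corr=correlation_matrix, type="lower")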
Data Modeling
Data Preparation: Split Data for Training & Testing
Splitting the dataset into training and testing sets is crucial to evaluate the model’s ability to generalize to new data. For a dataset of this size (1,302 observations), a split ratio of 70:30 or 80:20 for training and testing, respectively, is common. This ensures sufficient data for learning while retaining enough unseen instances to assess performance accurately without overfitting.
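A minimal sketch of such a split using caret (the 80:20 ratio, the seed, and stratifying on private are illustrative choices of mine):
# Reproducible 80:20 split of the observations, stratified by private/public
set.seed(813)
training_indices <- caret::createDataPartition(
  y=factor(colleges_universities_r$private), p=0.80, list=FALSE
)
colleges_universities_training <- colleges_universities_r[training_indices, ]
colleges_universities_testing <- colleges_universities_r[-training_indices, ]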
Model Fitting
Model Evaluation
Model 1: All Features
Model 2:
Model Selection
Business Understanding: Leveling-Up
Appendix A: Environment, Language & Package Versions, and Coding Style
If you are interested in reproducing this work, here are the versions of R, Python, and Julia that I used (as well as the respective packages for each). Additionally, my coding style here is verbose, so that it is easy to trace where functions/methods and variables originate, and to make this a learning experience for everyone, including me. Finally, the data visualizations are mostly (if not entirely) implemented using the Grammar of Graphics framework.
cat(
R.version$version.string, "-", R.version$nickname,
"\nOS:", Sys.info()["sysname"], R.version$platform,
"\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle
OS: Darwin x86_64-apple-darwin17.0
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4.1", repos="http://cran.us.r-project.org")
devtools::install_version("caret", version="6.0.94", repos="http://cran.us.r-project.org")
devtools::install_version("class", version="7.3.21", repos="http://cran.us.r-project.org")
library(package=dplyr)
library(package=ggplot2)
library(package=ggcorrplot)
library(package=caret)
library(package=class)
import sys
import platform
import os
import cpuinfo
print(
"Python", sys.version,
"\nOS:", platform.system(), platform.platform(),
"\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
OS: Darwin macOS-10.16-x86_64-i386-64bit
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
import numpy
import pandas
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")
using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase
Appendix B: A Case for K-Means Clustering
Advantages
- Simplicity: K-Means is relatively simple to implement.
- Scalability: It scales well to large data sets.
- Convergence: The algorithm guarantees convergence.
- Adaptability: It can easily adapt to new examples.
- Generalization: K-Means can generalize to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages
- Choosing K: The number of clusters (K) needs to be set manually, which can be tedious; one common heuristic, the elbow method, is sketched after this list.
- Initial Values Dependency: K-Means is dependent on initial values. For a low K, this dependence can be mitigated by running K-Means several times with different initial values and picking the best result.
- Varying Sizes and Density: K-Means has trouble clustering data where clusters are of varying sizes and density.
- Outliers: Centroids can be dragged by outliers, or outliers might get their own cluster instead of being ignored. Consider removing or clipping outliers before clustering.
- Scaling with Number of Dimensions: As the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples.
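A hedged sketch of that elbow heuristic, assuming for simplicity that rows with missing values are dropped and the numerical columns are scaled before clustering:
# Total within-cluster sum of squares for K = 1..10 (elbow method)
numeric_scaled <- scale(na.omit(colleges_universities_r[, sapply(colleges_universities_r, is.numeric)]))
set.seed(813)
within_ss <- sapply(
  X=1:10,
  FUN=function(k) stats::kmeans(numeric_scaled, centers=k, nstart=25)$tot.withinss
)
# Look for the "elbow" where additional clusters stop reducing the within-cluster SS much
plot(x=1:10, y=within_ss, type="b", xlab="Number of Clusters (K)", ylab="Total Within-Cluster SS")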
Appendix C: A Case for HClust
Advantages
- Robustness: HClust is more robust than other methods since it does not require a predetermined number of clusters; the number can be chosen afterwards by cutting the dendrogram, as sketched after this list.
- Complex Shapes: With HClust, you can obtain more complex cluster shapes than centroid-based methods allow.
- No Assumptions: You need not make any assumptions about the shape of the resulting clusters.
- Identifying Patterns: It helps to identify obscure patterns and relationships within a data set.
- Exploratory Data Analysis: It helps to carry out exploratory data analysis.
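A minimal HClust sketch under the same simplifying assumptions as the K-Means sketch in Appendix B (rows with missing values dropped, numerical columns scaled); the distance metric, linkage, and number of clusters are illustrative choices:
# Hierarchical clustering on Euclidean distances with complete linkage
numeric_scaled <- scale(na.omit(colleges_universities_r[, sapply(colleges_universities_r, is.numeric)]))
distance_matrix <- stats::dist(numeric_scaled, method="euclidean")
hierarchical_model <- stats::hclust(d=distance_matrix, method="complete")
plot(hierarchical_model, labels=FALSE, main="Dendrogram of Colleges & Universities")
# After inspecting the dendrogram, cut it into, e.g., 4 clusters
cluster_assignments <- stats::cutree(hierarchical_model, k=4)
table(cluster_assignments)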
Disadvantages
- Interpretation Difficulty: The results can be difficult to interpret when clusters are ambiguous or ill-defined.
Further Readings
- Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
- Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.