Categorical Predictions Based on Probability
Inferring the consumer demographics associated with a preference for light beer, using logistic regression in R, Python, and Julia.
Suppose the Blue Moon Brewing Company has noticed a trend suggesting a shift in consumer preferences toward light beer, one that may be associated with specific demographic segments. To validate this, the business has initiated a customer segmentation study, collecting relevant data through customer surveys. The research has been designed to uncover patterns and correlations that can inform new product development and targeted marketing campaigns.
The objective of this data analysis is to infer whether demographic factors such as gender, age, marital status, and income significantly influence the preference for light beer compared to regular beer. This understanding will guide Blue Moon in optimizing its segmentation, target market, and positioning (STP) strategies to cater to the evolving tastes of their customer base.
Data Understanding
To understand the data, I’ve imported a CSV file containing 100 records and 5 columns. The columns represent gender (0 for female, 1 for male), marital status (0 for unmarried, 1 for married), income, age, and beer preference (0 for regular, 1 for light). This preliminary analysis is crucial for recognizing the dataset’s structure and preparing for further data exploration and modeling.
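The import step itself is not shown in this post. A minimal sketch in R, assuming a hypothetical local file named beer.csv (the actual path or URL will differ):
# Hypothetical import sketch; adjust the file path or URL to your own copy of the data
beer_r <- read.csv(file="beer.csv", header=TRUE)
dim(beer_r)  # expect 100 rows and 5 columns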
str(beer_r)
'data.frame': 100 obs. of 5 variables:
$ gender : int 0 1 1 1 1 1 1 1 0 1 ...
$ married : int 0 1 1 1 1 0 0 1 1 0 ...
$ income : int 31779 32739 24302 64709 41882 38990 22408 25440 30784 31916 ...
$ age : int 46 50 46 70 54 36 40 51 52 43 ...
$ prefer_light: int 0 0 0 0 0 0 0 0 0 0 ...
beer_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 gender 100 non-null int64
1 married 100 non-null int64
2 income 100 non-null int64
3 age 100 non-null int64
4 prefer_light 100 non-null int64
dtypes: int64(5)
memory usage: 4.0 KB
beer_py.head(n=8)
gender married income age prefer_light
0 0 0 31779 46 0
1 1 1 32739 50 0
2 1 1 24302 46 0
3 1 1 64709 70 0
4 1 1 41882 54 0
5 1 0 38990 36 0
6 1 0 22408 40 0
7 1 1 25440 51 0
beer_py.tail(n=8)
gender married income age prefer_light
92 0 1 46273 56 1
93 0 0 33088 22 1
94 0 0 41391 54 1
95 0 1 45133 51 1
96 0 1 39942 21 1
97 1 1 47495 35 1
98 1 1 70981 45 1
99 0 1 48259 32 1
beer_jl
100×5 DataFrame
Row │ gender married income age prefer_light
│ Int64 Int64 Int64 Int64 Int64
─────┼──────────────────────────────────────────────
1 │ 0 0 31779 46 0
2 │ 1 1 32739 50 0
3 │ 1 1 24302 46 0
4 │ 1 1 64709 70 0
5 │ 1 1 41882 54 0
6 │ 1 0 38990 36 0
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮
96 │ 0 1 45133 51 1
97 │ 0 1 39942 21 1
98 │ 1 1 47495 35 1
99 │ 1 1 70981 45 1
100 │ 0 1 48259 32 1
89 rows omitted
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is vital for providing a comprehensive view of single variables (univariate), pairs of variables (bivariate), and interactions among multiple variables (multivariate). It uncovers trends, patterns, and anomalies, guiding further analysis and hypothesis formulation.
Univariate Analysis
Univariate analysis summarizes each variable on its own and surfaces patterns in the data. It is the foundation for all subsequent analysis, providing insights into distribution, central tendency, and variability. By examining one variable at a time, it helps identify outliers, understand the data's distribution, and assess data quality. This analysis sets the stage for more complex bivariate and multivariate analyses.
Numerical Variables
The summary statistics of income (a discrete numerical variable) suggest a right-skewed distribution, as the mean is slightly higher than the median. Most incomes cluster in the lower to mid-range, with fewer high earners, as indicated by a maximum value well above the mean and median.
The summary statistics of age (a discrete numerical variable) indicate a moderately right-skewed distribution, with a mean slightly higher than the median. Ages are concentrated in the middle of the range and taper off toward the upper end, as reflected by the maximum age of 87.
# Income
AnalyzeUnivariate(vector=beer_r$income, categorical=FALSE)
[1] "Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
20945 33073 39646 40064 45776 70981
# Age
AnalyzeUnivariate(vector=beer_r$age, categorical=FALSE)
[1] "Summary Statistics"
Min. 1st Qu. Median Mean 3rd Qu. Max.
21 36 43 44 51 87
Categorical Variables
# Gender
AnalyzeUnivariate(vector=beer_r$gender, categorical=TRUE)
[1] "Distribution"
vector
0 1
43 57
# Married
AnalyzeUnivariate(vector=beer_r$married, categorical=TRUE)
[1] "Distribution"
vector
0 1
45 55
# Prefer Light
AnalyzeUnivariate(beer_r$prefer_light, categorical=TRUE)
[1] "Distribution"
vector
0 1
50 50
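AnalyzeUnivariate() is a custom helper that is not listed in this post. A minimal sketch of what it might look like, assuming it simply prints summary statistics for numerical vectors and a frequency table for categorical ones:
# Hypothetical reconstruction of the univariate helper used above
AnalyzeUnivariate <- function(vector, categorical=FALSE) {
  if (categorical) {
    print("Distribution")
    print(table(vector))       # frequency counts per level
  } else {
    print("Summary Statistics")
    print(summary(vector))     # min, quartiles, mean, max
  }
}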
Bivariate Analysis
Bivariate analysis is pivotal for identifying statistical relationships between two variables, especially between each predictor and the target variable. It reveals tendencies, informs feature selection, and enhances predictive-model accuracy by clarifying how variables interact with one another.
# Gender vs. Prefer Light
# Married vs. Prefer Light
# Income vs. Prefer Light
# Age vs. Prefer Light
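The bivariate plots themselves appear as figures, and their code is not reproduced here. As one illustrative sketch (not the original plotting code), a boxplot of income by beer preference and a two-way frequency table for gender can be produced as follows:
# Illustrative sketch only: income distribution by beer preference
ggplot2::ggplot(beer_r, ggplot2::aes(x=factor(prefer_light), y=income)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(x="Beer Preference (0 = Regular, 1 = Light)", y="Income (USD)")

# Gender vs. beer preference as a two-way frequency table
table(Gender=beer_r$gender, PreferLight=beer_r$prefer_light)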
Multivariate Analysis
Multivariate analysis is crucial for grasping complex relationships within data. It enables the detection of patterns, outliers, and the intrinsic structure of data sets. A correlation matrix, a key multivariate analysis tool, reveals the strength and direction of relationships between variables. Positive values indicate a direct relationship, while negative values suggest an inverse one. Insights from a correlation matrix can guide feature selection, inform hypothesis testing, and enhance model accuracy by identifying multicollinearity or potential predictors for further analysis.
correlation_matrix_r <- round(cor(beer_r), 2)
head(correlation_matrix_r[5:1, 1:5])
gender married income age prefer_light
prefer_light -0.14 0.10 0.51 -0.51 1.00
age 0.22 0.26 0.09 1.00 -0.51
income 0.05 0.40 1.00 0.09 0.51
married -0.01 1.00 0.40 0.26 0.10
gender 1.00 -0.01 0.05 0.22 -0.14
ggcorrplot::ggcorrplot(correlation_matrix_r, outline.col=palette_michaelmallari_r[1], colors = c(palette_michaelmallari_r[3], palette_michaelmallari_r[1], palette_michaelmallari_r[2]), lab=TRUE)
ggplot2::ggplot(beer_r, aes(x=age, y=income, color=factor(prefer_light))) +
ggplot2::geom_point() +
ggplot2::stat_ellipse(type="t") +
ggplot2::scale_y_continuous(breaks = c(15000, 30000, 45000, 60000, 75000), limits=c(15000, 75000), expand=c(0, 0), position="right") +
ggplot2::scale_color_manual(values=c(palette_michaelmallari_r[3], palette_michaelmallari_r[2]), labels=c("Regular Beer", "Light Beer")) +
ggplot2::labs(
title="Responsible Drinking With Light Beer",
subtitle="Age and income as proxies for level of responsibilities",
x="Age",
y="Income (USD)",
color=""
) +
theme_michaelmallari_r()
Data Modeling
Data Preparation: Split Data for Training & Testing
Splitting the dataset into training and testing sets is crucial for evaluating the model's ability to generalize to new data. For a small dataset of 100 observations, a common split ratio is 70:30 or 80:20 for training and testing, respectively. This leaves sufficient data for learning while retaining enough held-out instances to assess generalization and detect overfitting.
set.seed(1856) #FearTheTurtle 🐢
index_train <- createDataPartition(beer_r$prefer_light, p=0.7, list=FALSE)
beer_train_r <- beer_r[index_train, ]
beer_test_r <- beer_r[-index_train, ]
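As a quick sanity check (not part of the original walkthrough), the class balance of the target variable can be confirmed in both partitions before training:
# Confirm that the 50:50 class balance is roughly preserved in each partition
table(beer_train_r$prefer_light)
table(beer_test_r$prefer_light)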
Model Training
# All features
model_1 <- glm(formula=prefer_light ~ ., family="binomial", data=beer_train_r)
summary(model_1)
Call:
glm(formula = prefer_light ~ ., family = "binomial", data = beer_train_r)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4408 -0.4498 0.0009 0.3410 2.0903
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0555189 2.1395814 0.03 0.97930
gender -0.6796768 0.8336018 -0.82 0.41487
married 0.1062429 0.9507667 0.11 0.91103
income 0.0002499 0.0000677 3.69 0.00022 ***
age -0.2174801 0.0612084 -3.55 0.00038 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.041 on 69 degrees of freedom
Residual deviance: 41.988 on 65 degrees of freedom
AIC: 51.99
Number of Fisher Scoring iterations: 6
# Significance features only: income and age
model_2 <- glm(formula=prefer_light ~ income + age, family="binomial", data=beer_train_r)
summary(model_2)
Call:
glm(formula = prefer_light ~ income + age, family = "binomial",
data = beer_train_r)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2352 -0.4191 -0.0043 0.3391 2.2329
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.1370452 1.9659827 -0.07 0.94443
income 0.0002387 0.0000611 3.91 0.000093 ***
age -0.2097528 0.0556144 -3.77 0.00016 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.041 on 69 degrees of freedom
Residual deviance: 42.753 on 67 degrees of freedom
AIC: 48.75
Number of Fisher Scoring iterations: 6
# Marital status, income, and age
model_3 <- glm(formula=prefer_light ~ married + income + age, family="binomial", data=beer_train_r)
summary(model_3)
Call:
glm(formula = prefer_light ~ married + income + age, family = "binomial",
data = beer_train_r)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2747 -0.4176 -0.0044 0.3166 2.2517
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0954511 2.1404852 0.04 0.96443
married 0.2586641 0.9342008 0.28 0.78187
income 0.0002360 0.0000619 3.81 0.00014 ***
age -0.2152624 0.0597939 -3.60 0.00032 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 97.041 on 69 degrees of freedom
Residual deviance: 42.676 on 66 degrees of freedom
AIC: 50.68
Number of Fisher Scoring iterations: 6
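The coefficients above are on the log-odds scale. As a small interpretation aid (not part of the original output), exponentiating them yields odds ratios: for model_2, each additional $1,000 of income multiplies the odds of preferring light beer by roughly exp(1000 × 0.0002387) ≈ 1.27, while each additional year of age multiplies them by roughly exp(−0.2098) ≈ 0.81.
# Odds ratios for the parsimonious model (model_2)
exp(coef(model_2))

# Wald-type 95% confidence intervals, also on the odds-ratio scale
exp(confint.default(model_2))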
Model Evaluation
Accuracy is the preferred metric for evaluating the model's performance in predicting beer preference because it reflects the overall correctness of predictions on a balanced dataset. Precision and recall are less relevant here because they focus on the positive class, and with two equally important classes (regular and light beer) neither warrants special treatment. The F1 score, which balances precision and recall, is unnecessary for the same reason. Specificity, which measures the true negative rate, is also less pertinent because false positives are no more costly than false negatives in this context. Accuracy therefore provides a clear and balanced evaluation of the model's predictive power.
Model 1: All Features
GetAccuracyCutoff(model=model_1, df=beer_test_r, target="prefer_light") +
ggplot2::labs(
title="Model 1: Maximum Predictive Accuracy at 93.33%",
alt="Model 1: Maximum Predictive Accuracy at 93.33%",
subtitle="When 58%+ probability = light beer preference",
x="Predicted Probability Cut-off for Light Beer Preference",
y="Model Accuracy",
fill=NULL,
caption=NULL
)
GetEvalMetrics(model=model_1, df=beer_test_r, target="prefer_light", type="response", cutoff=0.58)
Confusion Matrix
Prediction
Actual Regular Light
Regular 14 1
Light 1 14
TN FP
FN TP
------------------
Evaluation Metrics
Accuracy = 93.33 %
Precision = 93.33 %
Recall/Sensitivity/True Positive Rate = 93.33 %
F1 Score = 93.33 %
Specificity/True Negative Rate = 93.33 %
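GetAccuracyCutoff() and GetEvalMetrics() are custom helpers that are not listed in this post. As a minimal sketch of the underlying logic (using model_1 and the 0.58 cutoff identified above), the confusion matrix and accuracy can be reproduced with base R:
# Predicted probabilities on the held-out test set
probs_1 <- predict(model_1, newdata=beer_test_r, type="response")

# Classify as light beer (1) when the predicted probability meets the cutoff
preds_1 <- ifelse(probs_1 >= 0.58, 1, 0)

# Confusion matrix: rows = actual, columns = predicted
cm_1 <- table(Actual=beer_test_r$prefer_light, Prediction=preds_1)
cm_1
sum(diag(cm_1)) / sum(cm_1)  # accuracy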
Model 2: Significant Features Only (Income and Age)
GetAccuracyCutoff(model=model_2, df=beer_test_r, target="prefer_light") +
ggplot2::labs(
title="Model 2: Maximum Predictive Accuracy at 90.00%",
alt="Model 2: Maximum Predictive Accuracy at 90.00%",
subtitle="When 57%+ probability = light beer preference",
x="Predicted Probability Cut-off for Light Beer Preference",
y="Model Accuracy",
fill=NULL,
caption=NULL
)
GetEvalMetrics(model=model_2, df=beer_test_r, target="prefer_light", type="response", cutoff=0.57)
Confusion Matrix
Prediction
Actual Regular Light
Regular 14 1
Light 2 13
TN FP
FN TP
------------------
Evaluation Metrics
Accuracy = 90.00 %
Precision = 92.86 %
Recall/Sensitivity/True Positive Rate = 86.67 %
F1 Score = 89.66 %
Specificity/True Negative Rate = 93.33 %
Model 3: Marital Status, Income, and Age
GetAccuracyCutoff(model=model_3, df=beer_test_r, target="prefer_light") +
ggplot2::labs(
title="Model 3: Maximum Predictive Accuracy at 90.00%",
alt="Model 3: Maximum Predictive Accuracy at 90.00%",
subtitle="When 50%+ probability = light beer preference",
x="Predicted Probability Cut-off for Light Beer Preference",
y="Model Accuracy",
fill=NULL,
caption=NULL
)
GetEvalMetrics(model=model_3, df=beer_test_r, target="prefer_light", type="response", cutoff=0.50)
Confusion Matrix
Prediction
Actual Regular Light
Regular 13 2
Light 1 14
TN FP
FN TP
------------------
Evaluation Metrics
Accuracy = 90.00 %
Precision = 87.50 %
Recall/Sensitivity/True Positive Rate = 93.33 %
F1 Score = 90.32 %
Specificity/True Negative Rate = 86.67 %
Model Selection
Although accuracy was the primary criterion for evaluating the predictive models, model_1 outperformed model_2 and model_3 across all evaluation metrics of interest, indicating more consistent and reliable predictions across every aspect of binary classification performance.
While model_2 outperformed model_3 at predicting a preference for regular beer (higher specificity), model_3 was better at predicting a preference for light beer (higher recall).
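To complement the held-out comparison, the in-sample fits can also be compared side by side (a small sketch; the individual AIC values already appear in the model summaries above):
# Compare degrees of freedom and AIC across the three candidate models
AIC(model_1, model_2, model_3)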
Business Understanding: Leveling-Up
In conclusion, this analysis has illuminated a clear preference for light beer among younger, higher-income consumers: income and age were the only statistically significant predictors, while gender and marital status were not. These are the target-market attributes for introducing a light beer under the Blue Moon brand. By integrating this insight with a cluster analysis (for data-driven customer segmentation), the business is empowered to strategically explore customer segments with these desired attributes, and the go-to-market (GTM) team will be equipped to position a light beer that resonates with the selected target market's preferences.
With a sound, data-informed STP marketing strategy, the Blue Moon Brewing Company is poised to successfully introduce and market a new light beer to the most receptive audiences—ensuring product-market fit.
Appendix A: Environment, Language & Package Versions, and Coding Style
If you are interested in reproducing this work, here are the versions of R, Python, and Julia that I used (as well as the respective packages for each). Additionally, my coding style here is deliberately verbose, so that it is easy to trace where functions/methods and variables originate, and to make this a learning experience for everyone (including me). Finally, the data visualizations are mostly (if not entirely) implemented using the Grammar of Graphics framework.
cat(
R.version$version.string, "-", R.version$nickname,
"\nOS:", Sys.info()["sysname"], R.version$platform,
"\nCPU:", benchmarkme::get_cpu()$no_of_cores, "x", benchmarkme::get_cpu()$model_name
)
R version 4.2.3 (2023-03-15) - Shortstop Beagle
OS: Darwin x86_64-apple-darwin17.0
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
require(devtools)
devtools::install_version("dplyr", version="1.1.4", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.5.0", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4.1", repos="http://cran.us.r-project.org")
devtools::install_version("caret", version="6.0.94", repos="http://cran.us.r-project.org")
library(package=dplyr)
library(package=ggplot2)
library(package=ggcorrplot)
library(package=caret)
import sys
import platform
import os
import cpuinfo
print(
"Python", sys.version,
"\nOS:", platform.system(), platform.platform(),
"\nCPU:", os.cpu_count(), "x", cpuinfo.get_cpu_info()["brand_raw"]
)
Python 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
OS: Darwin macOS-10.16-x86_64-i386-64bit
CPU: 8 x Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
!pip install numpy==1.25.1
!pip install pandas==2.0.3
!pip install scipy==1.11.1
import numpy
import pandas
from scipy import stats
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk-21.jdk/Contents/Home/lib/server
using Pkg
Pkg.add(name="HTTP", version="1.10.2")
Pkg.add(name="CSV", version="0.10.13")
Pkg.add(name="DataFrames", version="1.6.1")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="StatsBase", version="0.34.2")
using HTTP
using CSV
using DataFrames
using CategoricalArrays
using StatsBase
Appendix B: A Case for Logistic Regression
Advantages
- Simplicity: Logistic regression is a simple model that is easy to understand and interpret. The output can be interpreted as the probability of the positive class.
- Efficiency: It requires fewer computational resources because it is not a complex algorithm, and it is relatively quick and easy to train.
- Robust: It’s less prone to overfitting and can handle noise in the data.
- Outputs Probabilities: Unlike many classification algorithms, logistic regression provides actual probabilities, which is useful when we want to rank predictions.
- Feature Importance: Logistic regression allows for understanding the influence of several independent variables on a single outcome variable.
Disadvantages
- Linearity Assumption: Logistic regression assumes a linear relationship between the independent variables and the logit (log-odds) of the dependent variable; see the formulation after this list. If this assumption is violated, the model's predictive performance can suffer.
- Outliers: Logistic regression is sensitive to outliers in the data. Outliers can have a large effect on the decision boundary.
- Multicollinearity: Logistic regression can overfit on highly correlated features. It’s important to check for multicollinearity in the data before applying logistic regression.
- Limited to Binary Classification: Standard logistic regression handles binary outcomes. For multi-class problems, one would either train multiple binary models (using "one-vs-all" or "one-vs-one" schemes) or switch to multinomial (softmax) logistic regression.
- Performance: Logistic regression may not perform as well as other more complex models like Random Forests or Gradient Boosting on complex datasets that have many non-linear relationships.
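For reference, the linearity assumption above applies on the log-odds scale. With predictors $x_1, \dots, x_k$ and $p = \Pr(y = 1 \mid x_1, \dots, x_k)$, the model is:

$$
\log\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
$$

This is also why exponentiating a fitted coefficient yields an odds ratio, as used in the model interpretation earlier in this post.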
Further Readings
- Albright, S. C., Winston, W. L., & Zappe, C. (2003). Data Analysis for Managers with Microsoft Excel (2nd ed.). South-Western College Publishing.
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1
- Shmueli, G., Patel, N. R., & Bruce, P. C. (2007). Data Mining for Business Intelligence. Wiley.