Univariate, Bivariate, and Multivariate Analyses on Numerical Data
Exploratory data analysis of the PGA golf dataset, using R, Python, and Julia.
Univariate, bivariate, and multivariate analyses are essential in exploring and understanding numerical data comprehensively. Univariate analysis examines individual variables, revealing their distributions, central tendencies, and variability. Bivariate analysis assesses relationships between two variables, uncovering correlations, associations, or dependencies. Multivariate analysis considers interactions among multiple variables simultaneously, providing a deeper understanding of complex relationships, patterns, and dependencies.
These analyses (commonly known as Exploratory Data Analysis or EDA) enable data-driven decision-making, hypothesis testing, predictive modeling, and insights into various domains like finance, healthcare, and social sciences. By employing these techniques, researchers and analysts can derive valuable insights, uncover hidden patterns, and make informed decisions based on a thorough understanding of the data (not only numerical data, but also categorical data and other data types).
Let's have some fun and look at this golf dataset.
Getting Started
If you are interested in reproducing this work, here are the versions of R, Python, and Julia used (as well as the respective packages for each). Additionally, Leland Wilkinson's approach to data visualization (Grammar of Graphics) has been adopted for this work. Finally, my coding style here is verbose, in order to trace back where functions/methods and variables are originating from, and make this a learning experience for everyone, including me.
cat(R.version$version.string, R.version$nickname)
R version 4.2.3 (2023-03-15) Shortstop Beagle
require(devtools)
devtools::install_version("tibble", version="3.2.1", repos="http://cran.us.r-project.org")
devtools::install_version("dplyr", version="1.1.2", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
devtools::install_version("cowplot", version="1.1.1", repos="http://cran.us.r-project.org")
devtools::install_version("ggcorrplot", version="0.1.4", repos="http://cran.us.r-project.org")
library(tibble)
library(dplyr)
library(ggplot2)
library(cowplot)
library(ggcorrplot)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
import pandas
import plotnine
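When reproducing the Python setup, it can help to confirm that the installed packages match the pinned versions. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of the original post.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Versions pinned in the pip commands above.
pinned = {"pandas": "2.0.3", "plotnine": "0.12.1"}

for name, wanted in pinned.items():
    found = installed_version(name)
    status = "ok" if found == wanted else "differs"
    print(f"{name}: pinned {wanted}, installed {found} ({status})")
```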
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.4.0")
using DataFrames
using CSV
using Colors
using Cairo
using Gadfly
Importing and Examining Dataset
# https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018
pga_r <- read.csv("../../dataset/pga-tour.csv")
str(pga_r)
'data.frame': 2312 obs. of 18 variables:
$ Player.Name : chr "Henrik Stenson" "Ryan Armour" "Chez Reavie" "Ryan Moore" ...
$ Rounds : num 60 109 93 78 103 103 93 94 77 50 ...
$ Fairway.Percentage: num 75.2 73.6 72.2 71.9 71.4 ...
$ Year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
$ Avg.Distance : num 292 284 286 289 279 ...
$ gir : num 73.5 68.2 68.7 68.8 67.1 ...
$ Average.Putts : num 29.9 29.3 29.1 29.2 29.1 ...
$ Average.Scrambling: num 60.7 60.1 62.3 64.2 59.2 ...
$ Average.Score : num 69.6 70.8 70.4 70 71 ...
$ Points : chr "868" "1,006" "1,020" "795" ...
$ Wins : num NA 1 NA NA NA NA NA NA NA NA ...
$ Top.10 : num 5 3 3 5 3 6 5 5 3 2 ...
$ Average.SG.Putts : num -0.207 -0.058 0.192 -0.271 0.164 0.442 0.037 0.546 0.167 0.389 ...
$ Average.SG.Total : num 1.153 0.337 0.674 0.941 0.062 ...
$ SG.OTT : num 0.427 -0.012 0.183 0.406 -0.227 -0.166 0.378 0.364 0.093 -0.392 ...
$ SG.APR : num 0.96 0.213 0.437 0.532 0.099 0.036 0.298 0.345 0.467 0.179 ...
$ SG.ARG : num -0.027 0.194 -0.137 0.273 0.026 0.253 -0.027 -0.122 -0.186 0.235 ...
$ Money : chr "$2,680,487" "$2,485,203" "$2,700,018" "$1,986,608" ...
# https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018
pga_python = pandas.read_csv("../../dataset/pga-tour.csv")
# https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018
pga_julia = CSV.read("../../dataset/pga-tour.csv", DataFrame);
show(pga_julia, allcols = true)
2312×18 DataFrame
 Row │ Player Name  Rounds  Fairway Percentage  Year  Avg Distance  gir  Average Putts  Average Scrambling  Average Score  Points  Wins  Top 10  Average SG Putts  Average SG Total  SG:OTT  SG:APR  SG:ARG  Money
     │ String31  Float64?  Float64?  Int64  Float64?  Float64?  Float64?  Float64?  Float64?  String7?  Float64?  Float64?  Float64?  Float64?  Float64?  Float64?  Float64?  String15?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Henrik Stenson  60.0  75.19  2018  291.5  73.51  29.93  60.67  69.617  868  missing  5.0  -0.207  1.153  0.427  0.96  -0.027  $2,680,487
   2 │ Ryan Armour  109.0  73.58  2018  283.5  68.22  29.31  60.13  70.758  1,006  1.0  3.0  -0.058  0.337  -0.012  0.213  0.194  $2,485,203
   3 │ Chez Reavie  93.0  72.24  2018  286.5  68.67  29.12  62.27  70.432  1,020  missing  3.0  0.192  0.674  0.183  0.437  -0.137  $2,700,018
   4 │ Ryan Moore  78.0  71.94  2018  289.2  68.8  29.17  64.16  70.015  795  missing  5.0  -0.271  0.941  0.406  0.532  0.273  $1,986,608
   5 │ Brian Stuard  103.0  71.44  2018  278.9  67.12  29.11  59.23  71.038  421  missing  3.0  0.164  0.062  -0.227  0.099  0.026  $1,089,763
   6 │ Brian Gay  103.0  71.37  2018  282.9  64.52  28.25  63.26  70.28  880  missing  6.0  0.442  0.565  -0.166  0.036  0.253  $2,152,501
   7 │ Kyle Stanley  93.0  71.29  2018  295.7  71.09  29.89  54.8  70.404  1,198  missing  5.0  0.037  0.686  0.378  0.298  -0.027  $3,916,001
   8 │ Emiliano Grillo  94.0  70.16  2018  295.2  68.84  29.04  61.05  70.152  901  missing  5.0  0.546  1.133  0.364  0.345  -0.122  $2,493,163
   9 │ Russell Henley  77.0  70.03  2018  293.0  68.77  29.8  54.33  70.489  569  missing  3.0  0.167  0.541  0.093  0.467  -0.186  $1,516,438
  10 │ Jim Furyk  50.0  69.91  2018  280.5  63.19  28.73  62.58  70.342  291  missing  2.0  0.389  0.412  -0.392  0.179  0.235  $660,010
  11 │ Steve Wheatcroft  60.0  69.79  2018  288.9  66.57  29.29  61.03  71.631  138  missing  1.0  -0.128  -0.339  0.112  -0.065  -0.258  $309,656
  12 │ Kevin Streelman  94.0  69.11  2018  295.1  71.56  29.67  60.93  70.436  673  missing  5.0  -0.25  0.619  0.439  0.415  0.014  $1,523,642
  13 │ C.T. Pan  104.0  68.98  2018  292.7  71.2  29.66  56.89  70.457  693  missing  1.0  -0.067  0.478  0.215  0.267  0.063  $1,881,787
  14 │ David Lingmerth  82.0  68.93  2018  285.4  63.03  28.5  58.57  71.043  274  missing  missing  0.229  -0.007  0.006  -0.16  -0.081  $616,758
  15 │ Keegan Bradley  98.0  67.9  2018  299.6  69.18  29.68  56.78  70.303  872  missing  4.0  -0.358  0.793  0.237  0.888  0.026  $4,069,464
  16 │ Rafa Cabrera Bello  75.0  67.85  2018  295.1  70.16  29.47  57.98  69.887  784  missing  4.0  0.273  1.112  0.256  0.487  0.096  $2,449,869
  17 │ Billy Horschel  86.0  67.8  2018  295.4  71.75  29.46  58.03  70.154  960  1.0  3.0  0.392  1.112  0.538  0.352  -0.169  $4,315,200
  18 │ Russell Knox  94.0  67.7  2018  291.7  69.57  29.7  59.43  70.568  585  missing  3.0  -0.088  0.383  0.059  0.263  0.149  $1,424,030
  19 │ Ben Crane  65.0  67.52  2018  281.1  64.88  28.69  63.0  71.097  267  missing  1.0  0.332  0.176  -0.302  -0.038  0.184  $620,646
  20 │ Vaughn Taylor  83.0  67.51  2018  286.1  67.02  29.15  59.91  70.692  445  missing  3.0  -0.08  0.219  -0.005  0.305  -0.002  $965,691
  21 │ Brian Harman  94.0  67.14  2018  291.9  67.59  29.29  56.95  70.536  1,056  missing  8.0  0.273  0.29  0.137  -0.024  -0.096  $2,733,463
  22 │ Sam Ryder  82.0  66.91  2018  297.3  72.08  29.88  56.47  70.914  442  missing  3.0  -0.349  0.154  0.203  0.399  -0.099  $1,046,166
  23 │ Ted Potter, Jr.  87.0  66.83  2018  286.0  63.03  28.45  57.51  71.024  744  1.0  1.0  0.074  -0.094  -0.074  -0.2  0.105  $1,976,198
  24 │ Austin Cook  107.0  66.76  2018  292.3  66.51  28.72  62.02  70.469  1,060  1.0  3.0  0.315  0.569  0.12  -0.045  0.179  $2,448,920
  25 │ Tyler Duncan  97.0  66.74  2018  294.4  69.65  30.19  52.76  71.04  457  missing  2.0  -0.566  0.017  0.273  0.476  -0.166  $944,021
  26 │ David Hearn  66.0  66.63  2018  285.1  68.89  29.58  55.65  71.325  315  missing  2.0  -0.127  -0.031  -0.17  0.379  -0.113  $622,383
  27 │ Alex Cejka  77.0  66.49  2018  286.7  63.77  28.52  64.0  70.675  502  missing  2.0  0.009  0.312  -0.024  -0.169  0.495  $1,198,541
  28 │ Ian Poulter  73.0  66.41  2018  293.6  67.01  28.97  57.11  70.593  1,030  1.0  4.0  0.223  0.85  0.141  0.435  0.051  $2,714,450
  29 │ Joel Dahmen  93.0  66.36  2018  295.8  68.82  29.26  63.31  70.578  676  missing  3.0  -0.277  0.4  0.3  0.381  -0.004  $1,476,838
  30 │ Kevin Kisner  89.0  66.33  2018  290.8  65.38  28.91  60.29  70.729  971  missing  4.0  0.513  0.037  -0.043  -0.316  -0.118  $2,972,285
  31 │ J.J. Henry  81.0  66.22  2018  291.8  70.52  30.37  54.77  71.372  239  missing  1.0  -0.569  -0.193  0.174  0.34  -0.138  $482,052
  32 │ J.J. Spaun  85.0  66.18  2018  298.2  69.48  29.63  55.76  70.525  849  missing  4.0  -0.149  0.292  0.286  0.41  -0.255  $1,978,906
  33 │ Justin Rose  70.0  66.02  2018  303.5  69.95  28.67  63.03  68.993  1,991  2.0  8.0  0.424  1.952  0.551  0.526  0.45  $8,130,678
  34 │ Kelly Kraft  94.0  65.84  2018  288.6  63.83  28.92  58.93  71.333  627  missing  3.0  0.075  -0.32  -0.286  0.017  -0.126  $1,496,253
  35 │ Conrad Shindler  60.0  65.77  2018  291.3  66.27  29.14  60.0  71.465  92  missing  missing  -0.166  -0.313  0.114  -0.008  -0.252  $187,399
  36 │ Scott Piercy  84.0  65.68  2018  296.6  69.72  29.86  55.96  70.736  802  1.0  2.0  -0.569  0.29  0.209  0.553  0.097  $1,882,337
  37 │ Adam Hadwin  92.0  65.6  2018  289.4  68.04  29.26  58.59  70.75  638  missing  3.0  0.106  0.596  0.057  0.166  0.267  $1,932,488
  38 │ Ben Silverman  90.0  65.56  2018  290.2  64.95  28.74  57.36  71.281  323  missing  2.0  0.238  -0.206  0.069  -0.385  -0.128  $793,140
  39 │ Satoshi Kodaira  51.0  65.53  2018  293.6  60.3  29.88  51.02  72.182  600  1.0  1.0  -0.645  -1.832  0.027  -0.284  -0.93  $1,471,462
  40 │ Rickie Fowler  74.0  65.33  2018  299.8  69.52  28.99  63.05  69.435  1,302  missing  4.0  0.296  1.275  0.244  0.494  0.242  $4,235,237
  41 │ Xinjun Zhang  81.0  65.33  2018  298.7  67.04  29.67  58.2  71.486  195  missing  1.0  -0.437  -0.734  0.139  -0.254  -0.182  $420,377
  42 │ Kiradech Aphibarnrat  51.0  65.12  2018  294.7  62.2  28.93  55.59  70.629  missing  missing  missing  0.138  0.511  0.485  -0.247  0.135  missing
  43 │ Pat Perez  82.0  65.09  2018  290.9  67.78  29.29  55.63  70.594  1,116  1.0  4.0  0.115  -0.048  -0.063  -0.138  0.037  $2,962,641
  44 │ Lucas Glover  65.0  65.06  2018  297.7  67.12  29.64  59.83  71.066  324  missing  1.0  -0.186  0.276  0.514  -0.043  -0.008  $789,382
  45 │ Richy Werenski  98.0  65.06  2018  291.8  66.97  29.36  56.49  71.185  498  missing  2.0  -0.109  -0.124  0.077  -0.169  0.077  $1,081,283
  46 │ Hunter Mahan  67.0  65.04  2018  296.9  68.96  29.02  58.81  71.135  234  missing  1.0  0.346  0.209  0.465  -0.149  -0.453  $457,337
   ⋮ │ ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮  ⋮
2268 │ Carlos Franco  missing  missing  2010  missing  missing  missing  missing  missing  92  missing  missing  missing  missing  missing  missing  missing  123,232
2269 │ Tom Watson  missing  missing  2010  missing  missing  missing  missing  missing  92  missing  missing  missing  missing  missing  missing  missing  149,371
2270 │ Graeme McDowell  missing  missing  2010  missing  missing  missing  missing  missing  91  1.0  2.0  missing  missing  missing  missing  missing  1,589,337
2271 │ Matt Weibring  missing  missing  2010  missing  missing  missing  missing  missing  83  missing  missing  missing  missing  missing  missing  missing  128,328
2272 │ Marco Dawson  missing  missing  2010  missing  missing  missing  missing  missing  82  missing  missing  missing  missing  missing  missing  missing  108,160
2273 │ Jason Gore  missing  missing  2010  missing  missing  missing  missing  missing  81  missing  missing  missing  missing  missing  missing  missing  77,213
2274 │ Rich Beem  missing  missing  2010  missing  missing  missing  missing  missing  80  missing  missing  missing  missing  missing  missing  missing  128,877
2275 │ Frank Lickliter II  missing  missing  2010  missing  missing  missing  missing  missing  67  missing  missing  missing  missing  missing  missing  missing  74,721
2276 │ Todd Hamilton  missing  missing  2010  missing  missing  missing  missing  missing  64  missing  missing  missing  missing  missing  missing  missing  77,608
2277 │ Shane Bertsch  missing  missing  2010  missing  missing  missing  missing  missing  53  missing  missing  missing  missing  missing  missing  missing  57,108
2278 │ Guy Boros  missing  missing  2010  missing  missing  missing  missing  missing  48  missing  missing  missing  missing  missing  missing  missing  54,833
2279 │ Robert Gamez  missing  missing  2010  missing  missing  missing  missing  missing  47  missing  1.0  missing  missing  missing  missing  missing  101,700
2280 │ Fred Funk  missing  missing  2010  missing  missing  missing  missing  missing  44  missing  missing  missing  missing  missing  missing  missing  77,803
2281 │ John Huston  missing  missing  2010  missing  missing  missing  missing  missing  40  missing  missing  missing  missing  missing  missing  missing  69,249
2282 │ Dicky Pride  missing  missing  2010  missing  missing  missing  missing  missing  39  missing  missing  missing  missing  missing  missing  missing  40,120
2283 │ Jonathan Kaye  missing  missing  2010  missing  missing  missing  missing  missing  34  missing  missing  missing  missing  missing  missing  missing  38,989
2284 │ Parker McLachlin  missing  missing  2010  missing  missing  missing  missing  missing  32  missing  missing  missing  missing  missing  missing  missing  53,291
2285 │ Mark Brooks  missing  missing  2010  missing  missing  missing  missing  missing  29  missing  missing  missing  missing  missing  missing  missing  46,360
2286 │ Chris Smith  missing  missing  2010  missing  missing  missing  missing  missing  27  missing  missing  missing  missing  missing  missing  missing  23,400
2287 │ Fran Quinn  missing  missing  2010  missing  missing  missing  missing  missing  26  missing  missing  missing  missing  missing  missing  missing  45,096
2288 │ Robert Damron  missing  missing  2010  missing  missing  missing  missing  missing  24  missing  missing  missing  missing  missing  missing  missing  17,446
2289 │ Len Mattiace  missing  missing  2010  missing  missing  missing  missing  missing  22  missing  missing  missing  missing  missing  missing  missing  22,200
2290 │ Shigeki Maruyama  missing  missing  2010  missing  missing  missing  missing  missing  20  missing  missing  missing  missing  missing  missing  missing  22,440
2291 │ Michael Clark II  missing  missing  2010  missing  missing  missing  missing  missing  19  missing  missing  missing  missing  missing  missing  missing  21,045
2292 │ John Morse  missing  missing  2010  missing  missing  missing  missing  missing  19  missing  missing  missing  missing  missing  missing  missing  28,998
2293 │ Jim Carter  missing  missing  2010  missing  missing  missing  missing  missing  18  missing  missing  missing  missing  missing  missing  missing  29,285
2294 │ J.L. Lewis  missing  missing  2010  missing  missing  missing  missing  missing  14  missing  missing  missing  missing  missing  missing  missing  12,960
2295 │ Phil Tataurangi  missing  missing  2010  missing  missing  missing  missing  missing  12  missing  missing  missing  missing  missing  missing  missing  19,449
2296 │ Tom Byrum  missing  missing  2010  missing  missing  missing  missing  missing  11  missing  missing  missing  missing  missing  missing  missing  13,420
2297 │ Chris Baryla  missing  missing  2010  missing  missing  missing  missing  missing  9  missing  missing  missing  missing  missing  missing  missing  24,254
2298 │ Tommy Armour III  missing  missing  2010  missing  missing  missing  missing  missing  6  missing  missing  missing  missing  missing  missing  missing  11,130
2299 │ Eric Axley  missing  missing  2010  missing  missing  missing  missing  missing  6  missing  missing  missing  missing  missing  missing  missing  24,124
2300 │ Willie Wood  missing  missing  2010  missing  missing  missing  missing  missing  5  missing  missing  missing  missing  missing  missing  missing  6,540
2301 │ Robin Freeman  missing  missing  2010  missing  missing  missing  missing  missing  2  missing  missing  missing  missing  missing  missing  missing  13,062
2302 │ Brad Adamonis  missing  missing  2010  missing  missing  missing  missing  missing  2  missing  missing  missing  missing  missing  missing  missing  missing
2303 │ Paul Azinger  missing  missing  2010  missing  missing  missing  missing  missing  1  missing  missing  missing  missing  missing  missing  missing  9,486
2304 │ Spike McRoy  missing  missing  2010  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  6,840
2305 │ Jon Rahm  missing  missing  2016  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  $1,004,035
2306 │ Byeong Hun An  missing  missing  2016  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  $926,797
2307 │ Joey Snyder III  missing  missing  2012  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  $112,800
2308 │ Carl Paulson  missing  missing  2012  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  $16,943
2309 │ Peter Tomasulo  missing  missing  2012  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  $12,827
2310 │ Marc Turnesa  missing  missing  2010  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  10,159
2311 │ Jesper Parnevik  missing  missing  2010  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  9,165
2312 │ Jim Gallagher, Jr.  missing  missing  2010  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  missing  6,552
2221 rows omitted
Wrangling Data
str(pga_clean_r)
'data.frame': 2312 obs. of 18 variables:
$ player_name : Factor w/ 526 levels "Aaron Baddeley",..: 196 413 108 416 75 73 293 159 411 234 ...
$ rounds : int 60 109 93 78 103 103 93 94 77 50 ...
$ fairway_pct : num 75.2 73.6 72.2 71.9 71.4 ...
$ year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
$ avg_distance : int 291 283 286 289 278 282 295 295 293 280 ...
$ gir : num 73.5 68.2 68.7 68.8 67.1 ...
$ avg_putts : num 29.9 29.3 29.1 29.2 29.1 ...
$ avg_scrambling: num 60.7 60.1 62.3 64.2 59.2 ...
$ avg_score : num 69.6 70.8 70.4 70 71 ...
$ points : num 868 1006 1020 795 421 ...
$ wins : num 0 1 0 0 0 0 0 0 0 0 ...
$ top_10 : num 5 3 3 5 3 6 5 5 3 2 ...
$ avg_sg_putts : num -0.207 -0.058 0.192 -0.271 0.164 0.442 0.037 0.546 0.167 0.389 ...
$ avg_sg_total : num 1.153 0.337 0.674 0.941 0.062 ...
$ sg_ott : num 0.427 -0.012 0.183 0.406 -0.227 -0.166 0.378 0.364 0.093 -0.392 ...
$ sg_apr : num 0.96 0.213 0.437 0.532 0.099 0.036 0.298 0.345 0.467 0.179 ...
$ sg_arg : num -0.027 0.194 -0.137 0.273 0.026 0.253 -0.027 -0.122 -0.186 0.235 ...
$ money : num 2680487 2485203 2700018 1986608 1089763 ...
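The post shows the cleaned data frame but not the code that produced it. Judging from the before and after `str()` output, the wrangling renames columns to snake_case, turns `Points` and `Money` from formatted strings into numbers, and replaces missing `Wins` with 0. Here is a minimal pandas sketch of those steps on three rows copied from the raw output; the rename mapping and variable names are assumptions for illustration, not the author's actual code.

```python
import pandas

# Three rows (subset of columns) copied from the raw output above.
raw = pandas.DataFrame({
    "Player Name": ["Henrik Stenson", "Ryan Armour", "Chez Reavie"],
    "Points": ["868", "1,006", "1,020"],
    "Wins": [float("nan"), 1.0, float("nan")],
    "SG:OTT": [0.427, -0.012, 0.183],
    "Money": ["$2,680,487", "$2,485,203", "$2,700,018"],
})

# Assumed mapping, chosen to match the cleaned column names shown above.
new_names = {
    "Player Name": "player_name",
    "Points": "points",
    "Wins": "wins",
    "SG:OTT": "sg_ott",
    "Money": "money",
}
clean = raw.rename(columns=new_names)

# Strip "$" and thousands separators, then convert the strings to numbers.
for column in ["points", "money"]:
    stripped = clean[column].str.replace(r"[$,]", "", regex=True)
    clean[column] = pandas.to_numeric(stripped)

# A missing win count is treated as zero wins.
clean["wins"] = clean["wins"].fillna(0)

print(clean.dtypes)
print(clean)
```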
Univariate Analysis
# Box-and-whisker plot of rounds, with axes stripped so it can sit above the histogram
univariate_box_and_whisker_plot_rounds_r <- ggplot2::ggplot(pga_clean_r, aes(x=rounds)) +
  geom_boxplot() +
  xlim(40, 120) +
  theme_michaelmallari_r() +  # custom theme, not defined in this post
  theme(
    panel.grid.major.y=element_blank(),
    axis.line.x.bottom=element_blank(),
    axis.ticks.x=element_blank(),
    axis.text.x=element_blank(),
    axis.title.x=element_blank(),
    axis.line.y=element_blank(),
    axis.ticks.y=element_blank(),
    axis.text.y=element_blank(),
    axis.title.y=element_blank()
  )

# Histogram of rounds on the same x-axis range
univariate_histogram_rounds_r <- ggplot2::ggplot(pga_clean_r, aes(x=rounds)) +
  geom_histogram() +
  xlim(40, 120) +
  scale_y_continuous(expand=c(0, 0), position="right") +  # Scale
  labs(
    x="Rounds",
    y=NULL,
    caption="Data Source: https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018"
  ) +
  theme_michaelmallari_r()

# Stack the two plots vertically, aligned on their left and right edges
cowplot::plot_grid(
  univariate_box_and_whisker_plot_rounds_r,
  univariate_histogram_rounds_r,
  ncol=1,
  rel_heights=c(0.2, 1),
  align="v",
  axis="lr"
)
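The box plot summarizes `rounds` through its quartiles, and points beyond 1.5 × IQR from the box are drawn as outliers. As a rough illustration of the numbers behind such a plot, the sketch below computes them in plain Python for the first ten `rounds` values shown in the `str()` output (not the full dataset). Quantile conventions differ between tools, so values computed elsewhere may differ slightly.

```python
from statistics import mean, median, quantiles, stdev

# First ten `rounds` values from the str() output above (not the full dataset).
rounds = [60, 109, 93, 78, 103, 103, 93, 94, 77, 50]

# Quartiles, interquartile range, and the 1.5 * IQR fences used for whiskers.
q1, q2, q3 = quantiles(rounds, n=4, method="inclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [value for value in rounds if value < lower_fence or value > upper_fence]

print(f"mean={mean(rounds):.1f} median={median(rounds)} sd={stdev(rounds):.1f}")
print(f"Q1={q1} Q3={q3} IQR={iqr} fences=({lower_fence}, {upper_fence})")
print(f"outliers={outliers}")
```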
Bivariate Analysis
bivariate_scatterplot_points_money_r <- ggplot2::ggplot(pga_clean_r, aes(x=points, y=money)) +  # Data, aesthetics
  geom_point(color=palette_michaelmallari_r[19], alpha=0.3) +  # Geometric object
  geom_smooth(method="lm", colour=palette_michaelmallari_r[2]) +  # Geometric object (linear fit)
  scale_y_continuous(expand=c(0, 0), position="right") +  # Scale
  labs(
    title="More Points, More Prize Money",
    alt="More Points, More Prize Money",
    subtitle="Prize money ($) based on points scored, n = 2,312",
    x="Points Scored",
    y=NULL,
    caption="Data Source: https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018"
  ) +
  theme_michaelmallari_r()
bivariate_scatterplot_points_money_r
Multivariate Analysis
Correlation Matrix
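# cor() defaults to use="everything", so any pair that involves a column with missing values comes back as NA
correlation_pearson_r <- cor(
  subset(pga_clean_r, select=-c(player_name, year)),
  method="pearson"
)
correlation_pearson_r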
rounds fairway_pct avg_distance gir avg_putts avg_scrambling avg_score points wins top_10 avg_sg_putts avg_sg_total sg_ott sg_apr sg_arg money
rounds 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
fairway_pct NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA
avg_distance NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA NA
gir NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA NA
avg_putts NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA NA
avg_scrambling NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA NA
avg_score NA NA NA NA NA NA 1 NA NA NA NA NA NA NA NA NA
points NA NA NA NA NA NA NA 1.00 0.72 NA NA NA NA NA NA 0.96
wins NA NA NA NA NA NA NA 0.72 1.00 NA NA NA NA NA NA 0.72
top_10 NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA
avg_sg_putts NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
avg_sg_total NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA NA
sg_ott NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA NA
sg_apr NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA NA
sg_arg NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 NA
money NA NA NA NA NA NA NA 0.96 0.72 NA NA NA NA NA NA 1.00
multivariate_correlation_matrix_r <- ggcorrplot::ggcorrplot(
  corr=correlation_pearson_r,
  method="square",
  type="full",
  show.legend=TRUE,
  legend.title="Pearson Correlation (r)",
  colors=c(palette_michaelmallari_r[3], palette_michaelmallari_r[1], palette_michaelmallari_r[2]),
  lab=TRUE,
  lab_size=2,
  digits=2
)
multivariate_correlation_matrix_r
Bubble Plot
relationship_bubble_plot_win_money_points_r <- ggplot2::ggplot(pga_clean_r, aes(x=points, y=money)) +  # Data, aesthetics
  geom_point(aes(size=wins), color=palette_michaelmallari_r[19], alpha=0.3) +  # Geometric object, aesthetic
  geom_smooth(method=lm, colour=palette_michaelmallari_r[2]) +  # Geometric object
  scale_y_continuous(expand=c(0, 0), position="right") +  # Scale
  labs(
    title="Separation From the Pack With 1750+ Points",
    alt="Separation From the Pack With 1750+ Points",
    subtitle="PGA prize money ($) based on points scored and wins, n = 2,312",
    x="Points Scored",
    y=NULL,
    size="Wins",
    caption="Source: https://www.kaggle.com/jmpark746/pga-tour-data-2010-2018"
  ) +
  theme_michaelmallari_r()
relationship_bubble_plot_win_money_points_r
References
- Schwabish, J. (2021). Better Data Visualizations: A Guide for Scholars, Researchers, and Wonks. Columbia University Press.
- Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis (2nd ed.). Springer. https://doi.org/10.1007/978-3-319-24277-4
- Wickham, H. (2010). A Layered Grammar of Graphics. Journal of Computational and Graphical Statistics, 19(1), 3–28. https://www.jstor.org/stable/25651297
- Wilkinson, L. (2005). The Grammar of Graphics (2nd ed.). Springer.
- Michael Mallari (michaelmallari) - Profile | Pinterest. (n.d.). Pinterest. https://www.pinterest.com/michaelmallari/