Random Sampling
Exploring the simple, stratified, systematic, and cluster sampling methods on coffee rating data—using R, Python, and Julia.
The objective of random sampling is to make inferences about a large population from the information contained in a small sample of that population, while optimizing the sampling for cost, logistics, and statistical significance (Scheaffer et al., 2006). The sampling method should be chosen so that every member of the population has an equal chance of being included in the sample, biased samples are avoided, and errors are minimized.
Common sampling methods include simple random sampling, where each member has an equal probability of selection, and stratified random sampling, where the population is divided into distinct groups (strata) and samples are drawn from each stratum. Other methods include systematic sampling, where members are picked at a fixed interval from an ordered list, and cluster sampling, where whole groups are selected at random and then observed. Each suits particular study designs and population characteristics.
Let’s dive into these techniques using a coffee rating dataset.
Getting Started
If you are interested in reproducing this work, here are the versions of R, Python, and Julia used, along with the packages for each. The plots follow Leland Wilkinson’s Grammar of Graphics approach to data visualization. Finally, my coding style here is deliberately verbose: functions and methods are called with their package or module prefix, so it is clear where each one comes from. That makes this a learning experience for everyone, including me.
R.version.string
[1] "R version 4.2.3 (2023-03-15)"
require(devtools)
devtools::install_version("dplyr", version="1.1.2", repos="http://cran.us.r-project.org")
devtools::install_version("tibble", version="3.2.1", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
devtools::install_version("infer", version="1.0.4", repos="http://cran.us.r-project.org")
devtools::install_version("epiDisplay", version="3.5.0.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(tibble)
library(ggplot2)
library(infer)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
import pandas
import plotnine
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin22.4.0)
CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
Threads: 1 on 8 virtual cores
Environment:
DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.8")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.4.0")
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly
Importing and Examining Dataset
coffee_population_r <- read.csv("../../dataset/coffee-ratings.csv", stringsAsFactors=TRUE)
utils::str(coffee_population_r)
'data.frame': 1339 obs. of 8 variables:
$ total_cup_points : num 90.6 89.9 89.8 89 88.8 ...
$ variety : Factor w/ 29 levels "Arusha","Blue Mountain",..: NA 16 3 NA 16 NA 16 NA NA 16 ...
$ country_of_origin: Factor w/ 36 levels "Brazil","Burundi",..: 9 9 10 9 9 1 25 9 9 9 ...
$ aroma : num 8.67 8.75 8.42 8.17 8.25 8.58 8.42 8.25 8.67 8.08 ...
$ flavor : num 8.83 8.67 8.5 8.58 8.5 8.42 8.5 8.33 8.67 8.58 ...
$ aftertaste : num 8.67 8.5 8.42 8.42 8.25 8.42 8.33 8.5 8.58 8.5 ...
$ body : num 8.5 8.42 8.33 8.5 8.42 8.25 8.25 8.33 8.33 7.67 ...
$ balance : num 8.42 8.42 8.42 8.25 8.33 8.33 8.25 8.5 8.42 8.42 ...
utils::head(x=coffee_population_r, n=8)
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
1 91 <NA> Ethiopia 8.7 8.8 8.7 8.5 8.4
2 90 Other Ethiopia 8.8 8.7 8.5 8.4 8.4
3 90 Bourbon Guatemala 8.4 8.5 8.4 8.3 8.4
4 89 <NA> Ethiopia 8.2 8.6 8.4 8.5 8.2
5 89 Other Ethiopia 8.2 8.5 8.2 8.4 8.3
6 89 <NA> Brazil 8.6 8.4 8.4 8.2 8.3
7 89 Other Peru 8.4 8.5 8.3 8.2 8.2
8 89 <NA> Ethiopia 8.2 8.3 8.5 8.3 8.5
utils::tail(x=coffee_population_r, n=8)
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
1332 80 <NA> India 7.7 7.7 7.5 7.6 7.5
1333 80 <NA> India 7.6 7.4 7.4 7.4 7.5
1334 79 Arusha United States 7.9 7.5 7.4 7.4 7.4
1335 79 <NA> Ecuador 7.8 7.6 7.3 5.1 7.8
1336 78 <NA> Ecuador 7.5 7.7 7.8 5.2 5.2
1337 77 <NA> United States 7.3 7.3 7.2 7.5 7.2
1338 75 <NA> India 7.4 6.8 6.8 7.2 7.0
1339 74 <NA> Vietnam 6.8 6.7 6.5 6.9 6.8
coffee_population_py = pandas.read_csv("../../dataset/coffee-ratings.csv")
coffee_population_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1339 entries, 0 to 1338
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 total_cup_points 1339 non-null float64
1 variety 1113 non-null object
2 country_of_origin 1338 non-null object
3 aroma 1339 non-null float64
4 flavor 1339 non-null float64
5 aftertaste 1339 non-null float64
6 body 1339 non-null float64
7 balance 1339 non-null float64
dtypes: float64(6), object(2)
memory usage: 83.8+ KB
coffee_population_py.head(n=8)
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
0 90.58 NaN Ethiopia 8.67 8.83 8.67 8.50 8.42
1 89.92 Other Ethiopia 8.75 8.67 8.50 8.42 8.42
2 89.75 Bourbon Guatemala 8.42 8.50 8.42 8.33 8.42
3 89.00 NaN Ethiopia 8.17 8.58 8.42 8.50 8.25
4 88.83 Other Ethiopia 8.25 8.50 8.25 8.42 8.33
5 88.83 NaN Brazil 8.58 8.42 8.42 8.25 8.33
6 88.75 Other Peru 8.42 8.50 8.33 8.25 8.25
7 88.67 NaN Ethiopia 8.25 8.33 8.50 8.33 8.50
coffee_population_py.tail(n=8)
total_cup_points variety country_of_origin aroma flavor aftertaste body balance
1331 80.17 NaN India 7.67 7.67 7.50 7.58 7.50
1332 80.17 NaN India 7.58 7.42 7.42 7.42 7.50
1333 79.33 Arusha United States 7.92 7.50 7.42 7.42 7.42
1334 78.75 NaN Ecuador 7.75 7.58 7.33 5.08 7.83
1335 78.08 NaN Ecuador 7.50 7.67 7.75 5.17 5.25
1336 77.17 NaN United States 7.33 7.33 7.17 7.50 7.17
1337 75.08 NaN India 7.42 6.83 6.75 7.25 7.00
1338 73.75 NaN Vietnam 6.75 6.67 6.50 6.92 6.83
coffee_population_jl = CSV.File("../../dataset/coffee-ratings.csv") |> DataFrames.DataFrame
1339×8 DataFrame
Row │ total_cup_points variety country_of_origin aroma flavor aftertaste body balance
│ Float64 String31 String31 Float64 Float64 Float64 Float64 Float64
──────┼───────────────────────────────────────────────────────────────────────────────────────────────
1 │ 90.58 NA Ethiopia 8.67 8.83 8.67 8.5 8.42
2 │ 89.92 Other Ethiopia 8.75 8.67 8.5 8.42 8.42
3 │ 89.75 Bourbon Guatemala 8.42 8.5 8.42 8.33 8.42
4 │ 89.0 NA Ethiopia 8.17 8.58 8.42 8.5 8.25
5 │ 88.83 Other Ethiopia 8.25 8.5 8.25 8.42 8.33
6 │ 88.83 NA Brazil 8.58 8.42 8.42 8.25 8.33
7 │ 88.75 Other Peru 8.42 8.5 8.33 8.25 8.25
8 │ 88.67 NA Ethiopia 8.25 8.33 8.5 8.33 8.5
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
1333 │ 80.17 NA India 7.58 7.42 7.42 7.42 7.5
1334 │ 79.33 Arusha United States 7.92 7.5 7.42 7.42 7.42
1335 │ 78.75 NA Ecuador 7.75 7.58 7.33 5.08 7.83
1336 │ 78.08 NA Ecuador 7.5 7.67 7.75 5.17 5.25
1337 │ 77.17 NA United States 7.33 7.33 7.17 7.5 7.17
1338 │ 75.08 NA India 7.42 6.83 6.75 7.25 7.0
1339 │ 73.75 NA Vietnam 6.75 6.67 6.5 6.92 6.83
1324 rows omitted
Wrangling Data
Population vs. Sample
Simple Random Sampling
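As a minimal sketch of the idea (not the article's own pandas code), simple random sampling draws a fixed number of records without replacement, so every record is equally likely to be picked. The toy `population` list and the helper name `simple_random_sample` are assumptions made up for illustration; only the column name `total_cup_points` comes from the dataset.

```python
import random

# Toy population: hypothetical records, NOT the real coffee ratings.
population = [
    {"id": i, "country_of_origin": country, "total_cup_points": 80.0 + (i % 10)}
    for i, country in enumerate(["Ethiopia", "Brazil", "Peru", "India"] * 5)
]


def simple_random_sample(records, n, seed=None):
    """Draw n records without replacement; each record is equally likely."""
    rng = random.Random(seed)  # a seeded generator keeps the draw reproducible
    return rng.sample(records, n)


sample = simple_random_sample(population, n=5, seed=2023)

# Compare the point estimate from the sample with the population parameter.
population_mean = sum(r["total_cup_points"] for r in population) / len(population)
sample_mean = sum(r["total_cup_points"] for r in sample) / len(sample)
print(f"population mean: {population_mean:.2f}, sample mean: {sample_mean:.2f}")
```

With the pandas data frame loaded earlier, the same kind of draw should be available through `coffee_population_py.sample(n=..., random_state=...)`.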
Stratified Random Sampling
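The sketch below illustrates stratified random sampling as described in the introduction: split the population into strata, then draw a simple random sample inside each stratum. It assumes proportional allocation (the same fraction from every stratum), which is only one of several possible allocation rules. The toy data and the helper name `stratified_sample` are hypothetical; `country_of_origin` is the dataset's column name.

```python
import random
from collections import defaultdict

# Toy population with unequal stratum sizes (hypothetical, not the real data).
counts = {"Ethiopia": 8, "Brazil": 6, "Peru": 4, "India": 2}
population = [
    {"id": f"{country}-{i}", "country_of_origin": country}
    for country, size in counts.items()
    for i in range(size)
]


def stratified_sample(records, key, fraction, seed=None):
    """Proportional allocation: sample the same fraction from every stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[record[key]].append(record)

    sample = []
    for label in sorted(strata):  # sorted for a reproducible stratum order
        members = strata[label]
        # Take at least one record so that no stratum is left out entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


sample = stratified_sample(population, key="country_of_origin", fraction=0.5, seed=2023)

# Count how many records each stratum contributed.
sampled_counts = defaultdict(int)
for record in sample:
    sampled_counts[record["country_of_origin"]] += 1
print(dict(sampled_counts))
```

In pandas, a comparable result should be obtainable with `coffee_population_py.groupby("country_of_origin").sample(frac=..., random_state=...)`. Note that `groupby` drops rows with a missing grouping key by default.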
Systematic Sampling
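Systematic sampling picks records at a fixed interval from an ordered list, starting from a random position inside the first interval. The following is a rough sketch under that definition. The toy data and the helper name `systematic_sample` are assumptions for illustration, not part of any library.

```python
import random

# Toy population of 20 ordered records (hypothetical, not the real data).
population = [{"id": i, "total_cup_points": 90.0 - 0.5 * i} for i in range(20)]


def systematic_sample(records, n, seed=None):
    """Pick every k-th record after a random start, where k = len(records) // n."""
    interval = len(records) // n
    if interval < 1:
        raise ValueError("sample size cannot exceed the population size")
    rng = random.Random(seed)
    start = rng.randrange(interval)  # random offset within the first interval
    return [records[start + i * interval] for i in range(n)]


sample = systematic_sample(population, n=5, seed=2023)
print([record["id"] for record in sample])
```

One caveat: if the row order contains a periodic pattern that lines up with the interval, the sample can be biased. Shuffling the rows first removes that risk, at which point the procedure behaves like simple random sampling.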
Cluster Sampling
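In cluster sampling, whole groups are selected at random rather than individual records. The sketch below shows the one-stage variant, assuming that every record in a chosen cluster is kept; a two-stage design would sample again inside each chosen cluster. The toy data and the helper name `cluster_sample` are hypothetical.

```python
import random

# Toy population grouped by country (hypothetical, not the real data).
counts = {"Ethiopia": 8, "Brazil": 6, "Peru": 4, "India": 2}
population = [
    {"id": f"{country}-{i}", "country_of_origin": country}
    for country, size in counts.items()
    for i in range(size)
]


def cluster_sample(records, key, n_clusters, seed=None):
    """One-stage cluster sampling: choose clusters at random, keep all their records."""
    rng = random.Random(seed)
    # Sort the labels so the draw does not depend on set iteration order.
    clusters = sorted({record[key] for record in records})
    chosen = set(rng.sample(clusters, n_clusters))
    return [record for record in records if record[key] in chosen]


sample = cluster_sample(population, key="country_of_origin", n_clusters=2, seed=2023)
print(sorted({record["country_of_origin"] for record in sample}), len(sample))
```

Unlike stratified sampling, only some groups appear in the result. That is cheaper when visiting every group is costly, but it usually comes with higher sampling error.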
References
- Scheaffer, R. L., Mendenhall, W., & Ott, R. L. (2006). Elementary Survey Sampling (6th ed.). Thomson Brooks/Cole.
- Hildebrand, D. K., Ott, R. L., & Gray, J. B. (2005). Basic Statistical Ideas for Managers (2nd ed.). Thomson Brooks/Cole.