Data-Informed Thinking + Doing

Random Sampling

Exploring the simple, stratified, systematic, and cluster sampling methods on coffee rating data—using R, Python, and Julia.

The objective of random sampling is to make inferences about a large population from the information contained in a small sample of that population, while optimizing the sampling for cost, logistics, and statistical significance (Scheaffer et al., 2006). Various sampling methods should be considered so that every member of the population has a known, nonzero chance of being included in the sample, biased samples are avoided, and errors are minimized.
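To make the idea of inferring a population quantity from a small sample concrete, here is a minimal sketch using only Python's standard library. The population is synthetic (1,000 made-up scores, not the coffee dataset), and the mean, spread, seed, and sample size are all assumptions chosen for illustration.

```python
import random
import statistics

# Hypothetical population: 1,000 synthetic scores (NOT the coffee dataset).
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.gauss(82.0, 2.5) for _ in range(1000)]

# Draw a small simple random sample without replacement.
sample = rng.sample(population, k=50)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# Estimated standard error of the sample mean
# (this simple form ignores the finite population correction).
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (standard error ~ {standard_error:.2f})")
```

In practice the population mean is unknown; it is computed here only so the sample estimate can be compared against it.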

Common sampling methods include simple random sampling, where each member has an equal probability of selection, and stratified random sampling, where the population is divided into distinct groups or strata and samples are drawn from each stratum. Other methods include systematic sampling, where members are selected at a fixed interval from an ordered list after a random start, and cluster sampling, where the population is divided into groups and entire groups are randomly selected. Each is suited to specific study designs and population characteristics.
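The four methods can be sketched in a few lines each. The following is a dependency-free illustration on a hypothetical toy population of 120 records with made-up origin labels; the helper names (simple_random_sample, stratified_sample, systematic_sample, cluster_sample, group_by) are my own for this sketch and are not taken from any package used later in this post.

```python
import random
from collections import defaultdict

rng = random.Random(2023)  # fixed seed for reproducibility

# Hypothetical toy population: 120 records, each tagged with a made-up origin.
origins = ["Ethiopia", "Brazil", "Peru", "India"]
population = [{"id": i, "origin": origins[i % 4]} for i in range(120)]

def group_by(rows, key):
    """Split rows into lists keyed by the value of row[key]."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups

def simple_random_sample(rows, n):
    """Every subset of size n is equally likely to be drawn."""
    return rng.sample(rows, k=n)

def stratified_sample(rows, key, fraction):
    """Draw the same fraction from every stratum (proportional allocation)."""
    picked = []
    for stratum in group_by(rows, key).values():
        picked.extend(rng.sample(stratum, k=round(len(stratum) * fraction)))
    return picked

def systematic_sample(rows, n):
    """Take every k-th row after a random start within the first interval."""
    interval = len(rows) // n
    start = rng.randrange(interval)
    return rows[start::interval][:n]

def cluster_sample(rows, key, n_clusters):
    """Pick whole groups at random, then keep every row in the chosen groups."""
    groups = group_by(rows, key)
    chosen = rng.sample(sorted(groups), k=n_clusters)
    return [row for name in chosen for row in groups[name]]

srs = simple_random_sample(population, 12)
strat = stratified_sample(population, "origin", 0.1)
syst = systematic_sample(population, 12)
clust = cluster_sample(population, "origin", 2)

print(len(srs), len(strat), len(syst), len(clust))
```

Note the trade-off this exposes: the stratified sample is guaranteed to contain every origin, while the cluster sample contains only the origins that happened to be chosen, which is cheaper to collect but usually less precise.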

Let’s dive into these techniques using a coffee rating dataset.

Getting Started

If you are interested in reproducing this work, here are the versions of R, Python, and Julia used, along with the respective packages for each. Additionally, Leland Wilkinson’s approach to data visualization (Grammar of Graphics) has been adopted for this work. Finally, my coding style here is verbose so that it is easy to trace where functions, methods, and variables originate, and to make this a learning experience for everyone, including me.

R.version.string
[1] "R version 4.2.3 (2023-03-15)"
require(devtools)
devtools::install_version("dplyr", version="1.1.2", repos="http://cran.us.r-project.org")
devtools::install_version("tibble", version="3.2.1", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
devtools::install_version("infer", version = "1.0.4", repos = "http://cran.us.r-project.org")
devtools::install_version("epiDisplay", version="3.5.0.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(tibble)
library(ggplot2)
library(infer)
import sys
print(sys.version)
3.11.4 (v3.11.4:d2340ef257, Jun  6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)]
!pip install pandas==2.0.3
!pip install plotnine==0.12.1
import pandas
import plotnine
using InteractiveUtils
InteractiveUtils.versioninfo()
Julia Version 1.9.2
Commit e4ee485e909 (2023-07-05 09:39 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin22.4.0)
  CPU: 8 × Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores
Environment:
  DYLD_FALLBACK_LIBRARY_PATH = /Library/Frameworks/R.framework/Resources/lib:/Library/Java/JavaVirtualMachines/jdk1.8.0_241.jdk/Contents/Home/jre/lib/server
using Pkg
Pkg.add(name="CSV", version="0.10.11")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.8")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.4.0")
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly

Importing and Examining Dataset

coffee_population_r <- read.csv("../../dataset/coffee-ratings.csv", stringsAsFactors=TRUE)
utils::str(coffee_population_r)
'data.frame':	1339 obs. of  8 variables:
 $ total_cup_points : num  90.6 89.9 89.8 89 88.8 ...
 $ variety          : Factor w/ 29 levels "Arusha","Blue Mountain",..: NA 16 3 NA 16 NA 16 NA NA 16 ...
 $ country_of_origin: Factor w/ 36 levels "Brazil","Burundi",..: 9 9 10 9 9 1 25 9 9 9 ...
 $ aroma            : num  8.67 8.75 8.42 8.17 8.25 8.58 8.42 8.25 8.67 8.08 ...
 $ flavor           : num  8.83 8.67 8.5 8.58 8.5 8.42 8.5 8.33 8.67 8.58 ...
 $ aftertaste       : num  8.67 8.5 8.42 8.42 8.25 8.42 8.33 8.5 8.58 8.5 ...
 $ body             : num  8.5 8.42 8.33 8.5 8.42 8.25 8.25 8.33 8.33 7.67 ...
 $ balance          : num  8.42 8.42 8.42 8.25 8.33 8.33 8.25 8.5 8.42 8.42 ...
utils::head(x=coffee_population_r, n=8)
  total_cup_points variety country_of_origin aroma flavor aftertaste body balance
1               91    <NA>          Ethiopia   8.7    8.8        8.7  8.5     8.4
2               90   Other          Ethiopia   8.8    8.7        8.5  8.4     8.4
3               90 Bourbon         Guatemala   8.4    8.5        8.4  8.3     8.4
4               89    <NA>          Ethiopia   8.2    8.6        8.4  8.5     8.2
5               89   Other          Ethiopia   8.2    8.5        8.2  8.4     8.3
6               89    <NA>            Brazil   8.6    8.4        8.4  8.2     8.3
7               89   Other              Peru   8.4    8.5        8.3  8.2     8.2
8               89    <NA>          Ethiopia   8.2    8.3        8.5  8.3     8.5
utils::tail(x=coffee_population_r, n=8)
     total_cup_points variety country_of_origin aroma flavor aftertaste body balance
1332               80    <NA>             India   7.7    7.7        7.5  7.6     7.5
1333               80    <NA>             India   7.6    7.4        7.4  7.4     7.5
1334               79  Arusha     United States   7.9    7.5        7.4  7.4     7.4
1335               79    <NA>           Ecuador   7.8    7.6        7.3  5.1     7.8
1336               78    <NA>           Ecuador   7.5    7.7        7.8  5.2     5.2
1337               77    <NA>     United States   7.3    7.3        7.2  7.5     7.2
1338               75    <NA>             India   7.4    6.8        6.8  7.2     7.0
1339               74    <NA>           Vietnam   6.8    6.7        6.5  6.9     6.8
coffee_population_py = pandas.read_csv("../../dataset/coffee-ratings.csv")
coffee_population_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1339 entries, 0 to 1338
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   total_cup_points   1339 non-null   float64
 1   variety            1113 non-null   object 
 2   country_of_origin  1338 non-null   object 
 3   aroma              1339 non-null   float64
 4   flavor             1339 non-null   float64
 5   aftertaste         1339 non-null   float64
 6   body               1339 non-null   float64
 7   balance            1339 non-null   float64
dtypes: float64(6), object(2)
memory usage: 83.8+ KB
coffee_population_py.head(n=8)
   total_cup_points  variety country_of_origin  aroma  flavor  aftertaste  body  balance
0             90.58      NaN          Ethiopia   8.67    8.83        8.67  8.50     8.42
1             89.92    Other          Ethiopia   8.75    8.67        8.50  8.42     8.42
2             89.75  Bourbon         Guatemala   8.42    8.50        8.42  8.33     8.42
3             89.00      NaN          Ethiopia   8.17    8.58        8.42  8.50     8.25
4             88.83    Other          Ethiopia   8.25    8.50        8.25  8.42     8.33
5             88.83      NaN            Brazil   8.58    8.42        8.42  8.25     8.33
6             88.75    Other              Peru   8.42    8.50        8.33  8.25     8.25
7             88.67      NaN          Ethiopia   8.25    8.33        8.50  8.33     8.50
coffee_population_py.tail(n=8)
      total_cup_points variety country_of_origin  aroma  flavor  aftertaste  body  balance
1331             80.17     NaN             India   7.67    7.67        7.50  7.58     7.50
1332             80.17     NaN             India   7.58    7.42        7.42  7.42     7.50
1333             79.33  Arusha     United States   7.92    7.50        7.42  7.42     7.42
1334             78.75     NaN           Ecuador   7.75    7.58        7.33  5.08     7.83
1335             78.08     NaN           Ecuador   7.50    7.67        7.75  5.17     5.25
1336             77.17     NaN     United States   7.33    7.33        7.17  7.50     7.17
1337             75.08     NaN             India   7.42    6.83        6.75  7.25     7.00
1338             73.75     NaN           Vietnam   6.75    6.67        6.50  6.92     6.83
coffee_population_jl = CSV.File("../../dataset/coffee-ratings.csv") |> DataFrames.DataFrame
1339×8 DataFrame
  Row │ total_cup_points  variety   country_of_origin  aroma    flavor   aftertaste  body     balance
      │ Float64           String31  String31           Float64  Float64  Float64     Float64  Float64
──────┼───────────────────────────────────────────────────────────────────────────────────────────────
    1 │            90.58  NA        Ethiopia              8.67     8.83        8.67     8.5      8.42
    2 │            89.92  Other     Ethiopia              8.75     8.67        8.5      8.42     8.42
    3 │            89.75  Bourbon   Guatemala             8.42     8.5         8.42     8.33     8.42
    4 │            89.0   NA        Ethiopia              8.17     8.58        8.42     8.5      8.25
    5 │            88.83  Other     Ethiopia              8.25     8.5         8.25     8.42     8.33
    6 │            88.83  NA        Brazil                8.58     8.42        8.42     8.25     8.33
    7 │            88.75  Other     Peru                  8.42     8.5         8.33     8.25     8.25
    8 │            88.67  NA        Ethiopia              8.25     8.33        8.5      8.33     8.5
  ⋮   │        ⋮             ⋮              ⋮             ⋮        ⋮         ⋮          ⋮        ⋮
 1333 │            80.17  NA        India                 7.58     7.42        7.42     7.42     7.5
 1334 │            79.33  Arusha    United States         7.92     7.5         7.42     7.42     7.42
 1335 │            78.75  NA        Ecuador               7.75     7.58        7.33     5.08     7.83
 1336 │            78.08  NA        Ecuador               7.5      7.67        7.75     5.17     5.25
 1337 │            77.17  NA        United States         7.33     7.33        7.17     7.5      7.17
 1338 │            75.08  NA        India                 7.42     6.83        6.75     7.25     7.0
 1339 │            73.75  NA        Vietnam               6.75     6.67        6.5      6.92     6.83
                                                                                     1324 rows omitted

Wrangling Data

Population vs. Sample

Simple Random Sampling

Stratified Random Sampling

Systematic Sampling

Cluster Sampling


References

  • Scheaffer, R. L., Mendenhall, W., & Ott, R. L. (2006). Elementary Survey Sampling (6th ed.). Thomson Brooks/Cole.
  • Hildebrand, D. K., Ott, R. L., & Gray, J. B. (2005). Basic Statistical Ideas for Managers (2nd ed.). Thomson Brooks/Cole.