Data Wrangling Categorical Data
Cleaning up common messiness issues in the DC Comics dataset using Julia, Python, and R.
“Data wrangling, which is also commonly referred to as data munging, transformation, manipulation, janitor work, etc., can be a painstakingly laborious process…up to 80% of data analysis is spent on the process of cleaning and preparing data.” (Boehmke, 2016).
Data analysts and data engineers can always improve in this area and so reduce the time they spend on data wrangling. Let’s look at common data wrangling techniques applied to categorical data.
Getting Started
If you are interested in reproducing this work, here are the versions of Julia, Python, and R used, along with the packages for each. The visualizations follow Leland Wilkinson’s Grammar of Graphics approach. Finally, my coding style here is deliberately verbose, so that it is easy to trace where functions, methods, and variables come from and to make this a learning experience for everyone, including me.
VERSION
v"1.9.0"
import Pkg
Pkg.add(name="CSV", version="0.10.10")
Pkg.add(name="DataFrames", version="1.5.0")
Pkg.add(name="CategoricalArrays", version="0.10.8")
Pkg.add(name="Colors", version="0.12.10")
Pkg.add(name="Cairo", version="1.0.5")
Pkg.add(name="Gadfly", version="1.3.4")
using CSV
using DataFrames
using CategoricalArrays
using Colors
using Cairo
using Gadfly
import platform
platform.python_implementation() + " " + platform.python_version()
'CPython 3.9.6'
!pip install numpy==1.24.3
!pip install pandas==2.0.1
!pip install plotnine==0.12.1
import numpy
import pandas
import plotnine
cat(R.version$version.string, R.version$nickname)
R version 4.2.3 (2023-03-15) Shortstop Beagle
require(devtools)
devtools::install_version("dplyr", version="1.1.1", repos="http://cran.us.r-project.org")
devtools::install_version("ggplot2", version="3.4.2", repos="http://cran.us.r-project.org")
library(dplyr)
library(ggplot2)
Importing and Examining Dataset
dc_jl = CSV.File("../../dataset/comic-characters/dc-wikia-data.csv") |> DataFrames.DataFrame
6896×13 DataFrame
Row │ page_id name urlslug ID ALIGN EYE HAIR SEX GSM ALIVE APPEARANCES FIRST APPEARANCE YEAR
│ Int64 String String String31? String31? String31? String31? String31? String31? String31? Int64? String15? Int64?
──────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 1422 Batman (Bruce Wayne) \\/wiki\\/Batman_(Bruce_Wayne) Secret Identity Good Characters Blue Eyes Black Hair Male Characters missing Living Characters 3093 1939, May 1939
2 │ 23387 Superman (Clark Kent) \\/wiki\\/Superman_(Clark_Kent) Secret Identity Good Characters Blue Eyes Black Hair Male Characters missing Living Characters 2496 1986, October 1986
3 │ 1458 Green Lantern (Hal Jordan) \\/wiki\\/Green_Lantern_(Hal_Jor… Secret Identity Good Characters Brown Eyes Brown Hair Male Characters missing Living Characters 1565 1959, October 1959
4 │ 1659 James Gordon (New Earth) \\/wiki\\/James_Gordon_(New_Eart… Public Identity Good Characters Brown Eyes White Hair Male Characters missing Living Characters 1316 1987, February 1987
5 │ 1576 Richard Grayson (New Earth) \\/wiki\\/Richard_Grayson_(New_E… Secret Identity Good Characters Blue Eyes Black Hair Male Characters missing Living Characters 1237 1940, April 1940
6 │ 1448 Wonder Woman (Diana Prince) \\/wiki\\/Wonder_Woman_(Diana_Pr… Public Identity Good Characters Blue Eyes Black Hair Female Characters missing Living Characters 1231 1941, December 1941
7 │ 1486 Aquaman (Arthur Curry) \\/wiki\\/Aquaman_(Arthur_Curry) Public Identity Good Characters Blue Eyes Blond Hair Male Characters missing Living Characters 1121 1941, November 1941
8 │ 1451 Timothy Drake (New Earth) \\/wiki\\/Timothy_Drake_(New_Ear… Secret Identity Good Characters Blue Eyes Black Hair Male Characters missing Living Characters 1095 1989, August 1989
9 │ 71760 Dinah Laurel Lance (New Earth) \\/wiki\\/Dinah_Laurel_Lance_(Ne… Public Identity Good Characters Blue Eyes Blond Hair Female Characters missing Living Characters 1075 1969, November 1969
10 │ 1380 Flash (Barry Allen) \\/wiki\\/Flash_(Barry_Allen) Secret Identity Good Characters Blue Eyes Blond Hair Male Characters missing Living Characters 1028 1956, October 1956
11 │ 403631 GenderTest \\/wiki\\/GenderTest Secret Identity Good Characters Blue Eyes Blond Hair Female Characters missing Living Characters 1028 1956, October 1956
12 │ 1459 Alan Scott (New Earth) \\/wiki\\/Alan_Scott_(New_Earth) Secret Identity Good Characters Blue Eyes Blond Hair Male Characters missing Deceased Characters 969 1940, July 1940
13 │ 1905 Barbara Gordon (New Earth) \\/wiki\\/Barbara_Gordon_(New_Ea… Secret Identity Good Characters Blue Eyes Red Hair Female Characters missing Living Characters 951 1967, January 1967
14 │ 1386 Jason Garrick (New Earth) \\/wiki\\/Jason_Garrick_(New_Ear… Public Identity Good Characters Blue Eyes Brown Hair Male Characters missing Living Characters 951 1940, January 1940
15 │ 23383 Lois Lane (New Earth) \\/wiki\\/Lois_Lane_(New_Earth) Public Identity Good Characters Blue Eyes Black Hair Female Characters missing Living Characters 934 1938, June 1938
16 │ 1456 Alfred Pennyworth (New Earth) \\/wiki\\/Alfred_Pennyworth_(New… Public Identity Good Characters Blue Eyes Black Hair Male Characters missing Living Characters 930 1943, April 1943
17 │ 1849 Carter Hall (New Earth) \\/wiki\\/Carter_Hall_(New_Earth) Secret Identity Good Characters Blue Eyes Brown Hair Male Characters missing Living Characters 803 1940, January 1940
18 │ 4320 Kyle Rayner (New Earth) \\/wiki\\/Kyle_Rayner_(New_Earth) Secret Identity Good Characters Green Eyes Black Hair Male Characters missing Living Characters 716 1994, January 1994
19 │ 1706 Raymond Palmer (New Earth) \\/wiki\\/Raymond_Palmer_(New_Ea… Public Identity Good Characters Brown Eyes missing Male Characters missing Living Characters 706 1961, October 1961
20 │ 1480 Alexander Luthor (New Earth) \\/wiki\\/Alexander_Luthor_(New_… Public Identity Bad Characters Green Eyes missing Male Characters missing Living Characters 677 1986, October 1986
21 │ 1556 Roy Harper (New Earth) \\/wiki\\/Roy_Harper_(New_Earth) Secret Identity Neutral Characters Green Eyes Red Hair Male Characters missing Living Characters 654 1941, November 1941
22 │ 1580 Kara Zor-L (Earth-Two) \\/wiki\\/Kara_Zor-L_(Earth-Two) Secret Identity Good Characters Blue Eyes Blond Hair Female Characters missing Living Characters 635 1976, February 1976
23 │ 4849 Ted Grant (New Earth) \\/wiki\\/Ted_Grant_(New_Earth) Secret Identity missing Blue Eyes Black Hair Male Characters missing Living Characters 605 1942, January 1942
24 │ 1611 Garfield Logan (New Earth) \\/wiki\\/Garfield_Logan_(New_Ea… Public Identity Good Characters Green Eyes Green Hair Male Characters missing Living Characters 595 1965, November 1965
25 │ 1479 Guy Gardner (New Earth) \\/wiki\\/Guy_Gardner_(New_Earth) Public Identity Good Characters Blue Eyes Red Hair Male Characters missing Living Characters 593 1968, March 1968
26 │ 1582 Victor Stone (New Earth) \\/wiki\\/Victor_Stone_(New_Eart… Public Identity Good Characters Brown Eyes Black Hair Male Characters missing Living Characters 584 1980, October 1980
27 │ 14006 Kon-El (New Earth) \\/wiki\\/Kon-El_(New_Earth) Secret Identity Good Characters Blue Eyes Black Hair Male Characters missing Living Characters 560 1993, June 1993
28 │ 1484 Ralph Dibny (New Earth) \\/wiki\\/Ralph_Dibny_(New_Earth) Public Identity missing Blue Eyes Red Hair Male Characters missing Deceased Characters 558 1960, May 1960
29 │ 23391 James Olsen (New Earth) \\/wiki\\/James_Olsen_(New_Earth) Public Identity Good Characters Green Eyes Red Hair Male Characters missing Living Characters 557 1986, October 1986
30 │ 1478 John Stewart (New Earth) \\/wiki\\/John_Stewart_(New_Eart… Public Identity Good Characters Brown Eyes Black Hair Male Characters missing Living Characters 549 1971, December 1971
31 │ 1455 Joker (New Earth) \\/wiki\\/Joker_(New_Earth) Secret Identity Bad Characters Green Eyes Green Hair Male Characters missing Living Characters 517 1940, June 1940
32 │ 10163 Franklin Rock (New Earth) \\/wiki\\/Franklin_Rock_(New_Ear… Public Identity Good Characters Blue Eyes missing Male Characters missing Living Characters 492 1959, April 1959
33 │ 1872 Garth (New Earth) \\/wiki\\/Garth_(New_Earth) Public Identity Good Characters Purple Eyes Black Hair Male Characters missing Deceased Characters 487 1960, February 1960
34 │ 1616 Rex Mason (New Earth) \\/wiki\\/Rex_Mason_(New_Earth) Secret Identity missing Black Eyes missing Male Characters missing Living Characters 470 1965, January 1965
35 │ 1577 Zatanna Zatara (New Earth) \\/wiki\\/Zatanna_Zatara_(New_Ea… Public Identity Good Characters Blue Eyes Black Hair Female Characters missing Living Characters 439 1964, November 1964
36 │ 6791 Aztar (New Earth) \\/wiki\\/Aztar_(New_Earth) Secret Identity missing White Eyes missing Male Characters missing Living Characters 436 1940, February 1940
37 │ 1481 Theodore Kord (New Earth) \\/wiki\\/Theodore_Kord_(New_Ear… Secret Identity missing Blue Eyes Brown Hair Male Characters missing Deceased Characters 429 1986, February 1986
38 │ 1559 Michael Jon Carter (New Earth) \\/wiki\\/Michael_Jon_Carter_(Ne… Public Identity Good Characters Blue Eyes Blond Hair Male Characters missing Living Characters 427 1986, February 1986
39 │ 1607 Cassandra Sandsmark (New Earth) \\/wiki\\/Cassandra_Sandsmark_(N… Public Identity Good Characters Blue Eyes Blond Hair Female Characters missing Living Characters 423 1996, January 1996
40 │ 1680 Kent Nelson (New Earth) \\/wiki\\/Kent_Nelson_(New_Earth) Secret Identity Good Characters Blue Eyes Blond Hair Male Characters missing Deceased Characters 422 1940, May 1940
41 │ 1662 Harvey Bullock (New Earth) \\/wiki\\/Harvey_Bullock_(New_Ea… Public Identity Good Characters Brown Eyes Black Hair Male Characters missing Living Characters 413 1974, July 1974
42 │ 1864 Rachel Roth (New Earth) \\/wiki\\/Rachel_Roth_(New_Earth) Secret Identity Good Characters Purple Eyes Black Hair Female Characters missing Living Characters 399 1980, October 1980
43 │ 37696 Helena Bertinelli (New Earth) \\/wiki\\/Helena_Bertinelli_(New… Secret Identity Good Characters Blue Eyes Black Hair Female Characters missing Living Characters 393 1989, April 1989
44 │ 1981 Wesley Dodds (New Earth) \\/wiki\\/Wesley_Dodds_(New_Eart… Secret Identity Good Characters Brown Eyes Brown Hair Male Characters missing Deceased Characters 391 1939, April 1939
45 │ 1514 Uxas (New Earth) \\/wiki\\/Uxas_(New_Earth) Public Identity Bad Characters Red Eyes missing Male Characters missing Deceased Characters 388 1970, December 1970
46 │ 1560 Nathaniel Adam (New Earth) \\/wiki\\/Nathaniel_Adam_(New_Ea… Secret Identity Good Characters Blue Eyes White Hair Male Characters missing Living Characters 386 1987, March 1987
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
6852 │ 14894 Shu (New Earth) \\/wiki\\/Shu_(New_Earth) missing missing missing missing Male Characters missing Living Characters missing 1977, April 1977
6853 │ 2614 Ch'al Andar (New Earth) \\/wiki\\/Ch%27al_Andar_(New_Ear… Secret Identity Good Characters Blue Eyes Blond Hair Male Characters missing Living Characters missing 1975, March 1975
6854 │ 54248 M'nagalah (New Earth) \\/wiki\\/M%27nagalah_(New_Earth) Secret Identity Bad Characters missing missing Genderless Characters missing Living Characters missing 1974, February 1974
6855 │ 285581 Mrblonde267\\/Buddy Blank (New E… \\/wiki\\/User:Mrblonde267\\/Bud… missing missing missing missing missing missing missing missing 1974, October 1974
6856 │ 263173 Dragorin (New Earth) \\/wiki\\/Dragorin_(New_Earth) Public Identity Bad Characters Red Eyes Black Hair Male Characters missing Living Characters missing 1971, October 1971
6857 │ 1466 Ra's al Ghul (New Earth) \\/wiki\\/Ra%27s_al_Ghul_(New_Ea… Secret Identity Bad Characters Green Eyes Grey Hair Male Characters missing Deceased Characters missing 1971, June 1971
6858 │ 7196 Bartholomew Lash (New Earth) \\/wiki\\/Bartholomew_Lash_(New_… Public Identity Good Characters Blue Eyes Red Hair Male Characters missing Living Characters missing 1968, August 1968
6859 │ 10536 Angel Beatrix O'Day (New Earth) \\/wiki\\/Angel_Beatrix_O%27Day_… Public Identity Good Characters Blue Eyes Blond Hair Female Characters missing Living Characters missing 1968, September 1968
6860 │ 289360 Cobalt (New Earth) \\/wiki\\/Cobalt_(New_Earth) Secret Identity Good Characters Photocellular Eyes missing Male Characters missing Living Characters missing 1968, May 1968
6861 │ 289362 Gallium (New Earth) \\/wiki\\/Gallium_(New_Earth) Secret Identity Good Characters Photocellular Eyes missing Male Characters missing Living Characters missing 1968, May 1968
6862 │ 289357 Iridia (New Earth) \\/wiki\\/Iridia_(New_Earth) Secret Identity Good Characters Photocellular Eyes missing Female Characters missing Living Characters missing 1968, May 1968
6863 │ 289358 Osmium (New Earth) \\/wiki\\/Osmium_(New_Earth) Secret Identity Good Characters Photocellular Eyes missing Male Characters missing Living Characters missing 1968, May 1968
6864 │ 289361 Silver (New Earth) \\/wiki\\/Silver_(New_Earth) Secret Identity Good Characters Photocellular Eyes missing Male Characters missing Living Characters missing 1968, May 1968
6865 │ 18688 T'Charr (New Earth) \\/wiki\\/T%27Charr_(New_Earth) Secret Identity missing missing missing Male Characters missing Deceased Characters missing 1968, June 1968
6866 │ 289359 Zinc (New Earth) \\/wiki\\/Zinc_(New_Earth) Secret Identity Good Characters Photocellular Eyes missing Male Characters missing Living Characters missing 1968, May 1968
6867 │ 162822 Baron Tyrano (New Earth) \\/wiki\\/Baron_Tyrano_(New_Eart… Secret Identity Bad Characters Blue Eyes missing Male Characters missing Living Characters missing 1967, July 1967
6868 │ 10025 Brains (New Earth) \\/wiki\\/Brains_(New_Earth) missing Good Characters missing missing Male Characters missing Living Characters missing 1967, April 1967
6869 │ 10030 Cracker (New Earth) \\/wiki\\/Cracker_(New_Earth) missing Good Characters missing missing Male Characters missing Living Characters missing 1967, June 1967
6870 │ 10031 Hard Head (New Earth) \\/wiki\\/Hard_Head_(New_Earth) missing Good Characters missing Black Hair Male Characters missing Living Characters missing 1967, April 1967
6871 │ 10032 Zig-Zag (New Earth) \\/wiki\\/Zig-Zag_(New_Earth) missing Good Characters missing missing Male Characters missing Living Characters missing 1967, June 1967
6872 │ 228659 Dragonfly (New Earth) \\/wiki\\/Dragonfly_(New_Earth) Secret Identity Bad Characters missing Black Hair Female Characters missing Living Characters missing 1966, June 1966
6873 │ 129755 Carl Bradford (New Earth) \\/wiki\\/Carl_Bradford_(New_Ear… missing Bad Characters missing missing Male Characters missing Living Characters missing 1966, January 1966
6874 │ 1449 Donna Troy (New Earth) \\/wiki\\/Donna_Troy_(New_Earth) Public Identity Good Characters Blue Eyes Black Hair Female Characters missing Living Characters missing 1965, July 1965
6875 │ 128098 Bartholomew Magan (New Earth) \\/wiki\\/Bartholomew_Magan_(New… missing Bad Characters missing missing Male Characters missing Living Characters missing 1963, September 1963
6876 │ 22325 James Moon (New Earth) \\/wiki\\/James_Moon_(New_Earth) missing missing missing missing Male Characters missing Living Characters missing 1962, March 1962
6877 │ 1383 Flash (Wally West) \\/wiki\\/Flash_(Wally_West) Secret Identity Good Characters Green Eyes Red Hair Male Characters missing Living Characters missing 1960, January 1960
6878 │ 1485 J'onn J'onzz (New Earth) \\/wiki\\/J%27onn_J%27onzz_(New_… Public Identity Good Characters Red Eyes missing Male Characters missing Living Characters missing 1955, November 1955
6879 │ 34617 Dorothea Tane (New Earth) \\/wiki\\/Dorothea_Tane_(New_Ear… missing missing missing Blond Hair Female Characters missing Living Characters missing 1948, August 1948
6880 │ 238641 Dmane (Earth-Two) \\/wiki\\/Dmane_(Earth-Two) missing Bad Characters Blue Eyes missing Male Characters missing Living Characters missing 1946, April 1946
6881 │ 258830 Maximillian O'Leary (New Earth) \\/wiki\\/Maximillian_O%27Leary_… Public Identity Good Characters missing Black Hair Male Characters missing Living Characters missing 1946, January 1946
6882 │ 1624 Doris Zuel (New Earth) \\/wiki\\/Doris_Zuel_(New_Earth) Secret Identity Bad Characters Green Eyes Red Hair Female Characters missing Living Characters missing 1944, June 1944
6883 │ 22701 Doris Lee (New Earth) \\/wiki\\/Doris_Lee_(New_Earth) Public Identity missing Brown Eyes Brown Hair Female Characters missing Deceased Characters missing 1941, April 1941
6884 │ 1581 Patrick O'Brian (New Earth) \\/wiki\\/Patrick_O%27Brian_(New… Secret Identity Good Characters Blue Eyes Black Hair Male Characters missing Living Characters missing 1941, August 1941
6885 │ 1473 Basil Karlo (New Earth) \\/wiki\\/Basil_Karlo_(New_Earth) Secret Identity Bad Characters Black Eyes Black Hair Male Characters missing Living Characters missing 1940, June 1940
6886 │ 1460 Catwoman (Selina Kyle) \\/wiki\\/Catwoman_(Selina_Kyle) Secret Identity Neutral Characters Green Eyes Black Hair Female Characters missing Living Characters missing 1940, June 1940
6887 │ 289378 Bedivere (New Earth) \\/wiki\\/Bedivere_(New_Earth) missing missing missing missing Male Characters missing Living Characters missing 1936, February 1936
6888 │ 283661 Herbert Hoover (New Earth) \\/wiki\\/Herbert_Hoover_(New_Ea… Public Identity Good Characters missing missing Male Characters missing Living Characters missing missing missing
6889 │ 283657 William Howard Taft (New Earth) \\/wiki\\/William_Howard_Taft_(N… Public Identity Good Characters missing missing Male Characters missing Living Characters missing missing missing
6890 │ 21655 Frank Fitzsimmons (New Earth) \\/wiki\\/Frank_Fitzsimmons_(New… Public Identity Good Characters missing Grey Hair Male Characters missing Living Characters missing missing missing
6891 │ 283482 James Garfield (New Earth) \\/wiki\\/James_Garfield_(New_Ea… Public Identity Good Characters missing missing Male Characters missing Living Characters missing missing missing
6892 │ 66302 Nadine West (New Earth) \\/wiki\\/Nadine_West_(New_Earth) Public Identity Good Characters missing missing Female Characters missing Living Characters missing missing missing
6893 │ 283475 Warren Harding (New Earth) \\/wiki\\/Warren_Harding_(New_Ea… Public Identity Good Characters missing missing Male Characters missing Living Characters missing missing missing
6894 │ 283478 William Harrison (New Earth) \\/wiki\\/William_Harrison_(New_… Public Identity Good Characters missing missing Male Characters missing Living Characters missing missing missing
6895 │ 283471 William McKinley (New Earth) \\/wiki\\/William_McKinley_(New_… Public Identity Good Characters missing missing Male Characters missing Living Characters missing missing missing
6896 │ 150660 Mookie (New Earth) \\/wiki\\/Mookie_(New_Earth) Public Identity Bad Characters Blue Eyes Blond Hair Male Characters missing Living Characters missing missing missing
6805 rows omitted
levels(dc_jl.ID)
3-element Vector{String31}:
"Identity Unknown"
"Public Identity"
"Secret Identity"
levels(dc_jl.ALIGN)
4-element Vector{String31}:
"Bad Characters"
"Good Characters"
"Neutral Characters"
"Reformed Criminals"
levels(dc_jl.EYE)
17-element Vector{String31}:
"Amber Eyes"
"Auburn Hair"
"Black Eyes"
"Blue Eyes"
"Brown Eyes"
"Gold Eyes"
"Green Eyes"
"Grey Eyes"
"Hazel Eyes"
"Orange Eyes"
"Photocellular Eyes"
"Pink Eyes"
"Purple Eyes"
"Red Eyes"
"Violet Eyes"
"White Eyes"
"Yellow Eyes"
levels(dc_jl.HAIR)
17-element Vector{String31}:
"Black Hair"
"Blond Hair"
"Blue Hair"
"Brown Hair"
"Gold Hair"
"Green Hair"
"Grey Hair"
"Orange Hair"
"Pink Hair"
"Platinum Blond Hair"
"Purple Hair"
"Red Hair"
"Reddish Brown Hair"
"Silver Hair"
"Strawberry Blond Hair"
"Violet Hair"
"White Hair"
levels(dc_jl.SEX)
4-element Vector{String31}:
"Female Characters"
"Genderless Characters"
"Male Characters"
"Transgender Characters"
levels(dc_jl.GSM)
2-element Vector{String31}:
"Bisexual Characters"
"Homosexual Characters"
levels(dc_jl.ALIVE)
2-element Vector{String31}:
"Deceased Characters"
"Living Characters"
dc_py = pandas.read_csv("../../dataset/comic-characters/dc-wikia-data.csv")
dc_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6896 entries, 0 to 6895
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 page_id 6896 non-null int64
1 name 6896 non-null object
2 urlslug 6896 non-null object
3 ID 4883 non-null object
4 ALIGN 6295 non-null object
5 EYE 3268 non-null object
6 HAIR 4622 non-null object
7 SEX 6771 non-null object
8 GSM 64 non-null object
9 ALIVE 6893 non-null object
10 APPEARANCES 6541 non-null float64
11 FIRST APPEARANCE 6827 non-null object
12 YEAR 6827 non-null float64
dtypes: float64(2), int64(1), object(10)
memory usage: 700.5+ KB
dc_py.head(n=8)
page_id name urlslug ID ALIGN EYE HAIR SEX GSM ALIVE APPEARANCES FIRST APPEARANCE YEAR
0 1422 Batman (Bruce Wayne) \/wiki\/Batman_(Bruce_Wayne) Secret Identity Good Characters Blue Eyes Black Hair Male Characters NaN Living Characters 3093.0 1939, May 1939.0
1 23387 Superman (Clark Kent) \/wiki\/Superman_(Clark_Kent) Secret Identity Good Characters Blue Eyes Black Hair Male Characters NaN Living Characters 2496.0 1986, October 1986.0
2 1458 Green Lantern (Hal Jordan) \/wiki\/Green_Lantern_(Hal_Jordan) Secret Identity Good Characters Brown Eyes Brown Hair Male Characters NaN Living Characters 1565.0 1959, October 1959.0
3 1659 James Gordon (New Earth) \/wiki\/James_Gordon_(New_Earth) Public Identity Good Characters Brown Eyes White Hair Male Characters NaN Living Characters 1316.0 1987, February 1987.0
4 1576 Richard Grayson (New Earth) \/wiki\/Richard_Grayson_(New_Earth) Secret Identity Good Characters Blue Eyes Black Hair Male Characters NaN Living Characters 1237.0 1940, April 1940.0
5 1448 Wonder Woman (Diana Prince) \/wiki\/Wonder_Woman_(Diana_Prince) Public Identity Good Characters Blue Eyes Black Hair Female Characters NaN Living Characters 1231.0 1941, December 1941.0
6 1486 Aquaman (Arthur Curry) \/wiki\/Aquaman_(Arthur_Curry) Public Identity Good Characters Blue Eyes Blond Hair Male Characters NaN Living Characters 1121.0 1941, November 1941.0
7 1451 Timothy Drake (New Earth) \/wiki\/Timothy_Drake_(New_Earth) Secret Identity Good Characters Blue Eyes Black Hair Male Characters NaN Living Characters 1095.0 1989, August 1989.0
dc_py.tail(n=8)
page_id name urlslug ID ALIGN EYE HAIR SEX GSM ALIVE APPEARANCES FIRST APPEARANCE YEAR
6888 283657 William Howard Taft (New Earth) \/wiki\/William_Howard_Taft_(New_Earth) Public Identity Good Characters NaN NaN Male Characters NaN Living Characters NaN NaN NaN
6889 21655 Frank Fitzsimmons (New Earth) \/wiki\/Frank_Fitzsimmons_(New_Earth) Public Identity Good Characters NaN Grey Hair Male Characters NaN Living Characters NaN NaN NaN
6890 283482 James Garfield (New Earth) \/wiki\/James_Garfield_(New_Earth) Public Identity Good Characters NaN NaN Male Characters NaN Living Characters NaN NaN NaN
6891 66302 Nadine West (New Earth) \/wiki\/Nadine_West_(New_Earth) Public Identity Good Characters NaN NaN Female Characters NaN Living Characters NaN NaN NaN
6892 283475 Warren Harding (New Earth) \/wiki\/Warren_Harding_(New_Earth) Public Identity Good Characters NaN NaN Male Characters NaN Living Characters NaN NaN NaN
6893 283478 William Harrison (New Earth) \/wiki\/William_Harrison_(New_Earth) Public Identity Good Characters NaN NaN Male Characters NaN Living Characters NaN NaN NaN
6894 283471 William McKinley (New Earth) \/wiki\/William_McKinley_(New_Earth) Public Identity Good Characters NaN NaN Male Characters NaN Living Characters NaN NaN NaN
6895 150660 Mookie (New Earth) \/wiki\/Mookie_(New_Earth) Public Identity Bad Characters Blue Eyes Blond Hair Male Characters NaN Living Characters NaN NaN NaN
dc_r <- read.csv("../../dataset/comic-characters/dc-wikia-data.csv", stringsAsFactors=TRUE)
str(object=dc_r)
'data.frame': 6896 obs. of 13 variables:
$ page_id : int 1422 23387 1458 1659 1576 1448 1486 1451 71760 1380 ...
$ name : Factor w/ 6896 levels "3g4 (New Earth)",..: 598 6007 2488 3002 5280 6771 378 6289 1695 2185 ...
$ urlslug : Factor w/ 6896 levels "\\/wiki\\/3g4_(New_Earth)",..: 598 6006 2488 3002 5279 6771 378 6288 1695 2185 ...
$ ID : Factor w/ 4 levels "","Identity Unknown",..: 4 4 4 3 4 3 3 4 3 4 ...
$ ALIGN : Factor w/ 5 levels "","Bad Characters",..: 3 3 3 3 3 3 3 3 3 3 ...
$ EYE : Factor w/ 18 levels "","Amber Eyes",..: 5 5 6 6 5 5 5 5 5 5 ...
$ HAIR : Factor w/ 18 levels "","Black Hair",..: 2 2 5 18 2 2 3 2 3 3 ...
$ SEX : Factor w/ 5 levels "","Female Characters",..: 4 4 4 4 4 2 4 4 2 4 ...
$ GSM : Factor w/ 3 levels "","Bisexual Characters",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ALIVE : Factor w/ 3 levels "","Deceased Characters",..: 3 3 3 3 3 3 3 3 3 3 ...
$ APPEARANCES : int 3093 2496 1565 1316 1237 1231 1121 1095 1075 1028 ...
$ FIRST.APPEARANCE: Factor w/ 775 levels "","1935, October",..: 16 456 157 462 21 34 40 487 262 130 ...
$ YEAR : int 1939 1986 1959 1987 1940 1941 1941 1989 1969 1956 ...
head(x=dc_r, n=8)
page_id name urlslug ID ALIGN EYE HAIR SEX GSM ALIVE APPEARANCES FIRST.APPEARANCE YEAR
1 1422 Batman (Bruce Wayne) \\/wiki\\/Batman_(Bruce_Wayne) Secret Identity Good Characters Blue Eyes Black Hair Male Characters Living Characters 3093 1939, May 1939
2 23387 Superman (Clark Kent) \\/wiki\\/Superman_(Clark_Kent) Secret Identity Good Characters Blue Eyes Black Hair Male Characters Living Characters 2496 1986, October 1986
3 1458 Green Lantern (Hal Jordan) \\/wiki\\/Green_Lantern_(Hal_Jordan) Secret Identity Good Characters Brown Eyes Brown Hair Male Characters Living Characters 1565 1959, October 1959
4 1659 James Gordon (New Earth) \\/wiki\\/James_Gordon_(New_Earth) Public Identity Good Characters Brown Eyes White Hair Male Characters Living Characters 1316 1987, February 1987
5 1576 Richard Grayson (New Earth) \\/wiki\\/Richard_Grayson_(New_Earth) Secret Identity Good Characters Blue Eyes Black Hair Male Characters Living Characters 1237 1940, April 1940
6 1448 Wonder Woman (Diana Prince) \\/wiki\\/Wonder_Woman_(Diana_Prince) Public Identity Good Characters Blue Eyes Black Hair Female Characters Living Characters 1231 1941, December 1941
7 1486 Aquaman (Arthur Curry) \\/wiki\\/Aquaman_(Arthur_Curry) Public Identity Good Characters Blue Eyes Blond Hair Male Characters Living Characters 1121 1941, November 1941
8 1451 Timothy Drake (New Earth) \\/wiki\\/Timothy_Drake_(New_Earth) Secret Identity Good Characters Blue Eyes Black Hair Male Characters Living Characters 1095 1989, August 1989
tail(x=dc_r, n=8)
page_id name urlslug ID ALIGN EYE HAIR SEX GSM ALIVE APPEARANCES FIRST.APPEARANCE YEAR
6889 283657 William Howard Taft (New Earth) \\/wiki\\/William_Howard_Taft_(New_Earth) Public Identity Good Characters Male Characters Living Characters NA NA
6890 21655 Frank Fitzsimmons (New Earth) \\/wiki\\/Frank_Fitzsimmons_(New_Earth) Public Identity Good Characters Grey Hair Male Characters Living Characters NA NA
6891 283482 James Garfield (New Earth) \\/wiki\\/James_Garfield_(New_Earth) Public Identity Good Characters Male Characters Living Characters NA NA
6892 66302 Nadine West (New Earth) \\/wiki\\/Nadine_West_(New_Earth) Public Identity Good Characters Female Characters Living Characters NA NA
6893 283475 Warren Harding (New Earth) \\/wiki\\/Warren_Harding_(New_Earth) Public Identity Good Characters Male Characters Living Characters NA NA
6894 283478 William Harrison (New Earth) \\/wiki\\/William_Harrison_(New_Earth) Public Identity Good Characters Male Characters Living Characters NA NA
6895 283471 William McKinley (New Earth) \\/wiki\\/William_McKinley_(New_Earth) Public Identity Good Characters Male Characters Living Characters NA NA
6896 150660 Mookie (New Earth) \\/wiki\\/Mookie_(New_Earth) Public Identity Bad Characters Blue Eyes Blond Hair Male Characters Living Characters NA NA
levels(dc_r$ID)
[1] "" "Identity Unknown" "Public Identity" "Secret Identity"
levels(dc_r$ALIGN)
[1] "" "Bad Characters" "Good Characters" "Neutral Characters" "Reformed Criminals"
levels(dc_r$EYE)
[1] "" "Amber Eyes" "Auburn Hair" "Black Eyes" "Blue Eyes" "Brown Eyes" "Gold Eyes" "Green Eyes" "Grey Eyes" "Hazel Eyes" "Orange Eyes" "Photocellular Eyes" "Pink Eyes" "Purple Eyes" "Red Eyes" "Violet Eyes" "White Eyes" "Yellow Eyes"
levels(dc_r$HAIR)
[1] "" "Black Hair" "Blond Hair" "Blue Hair" "Brown Hair" "Gold Hair" "Green Hair" "Grey Hair" "Orange Hair" "Pink Hair" "Platinum Blond Hair" "Purple Hair" "Red Hair" "Reddish Brown Hair" "Silver Hair" "Strawberry Blond Hair" "Violet Hair" "White Hair"
levels(dc_r$SEX)
[1] "" "Female Characters" "Genderless Characters" "Male Characters" "Transgender Characters"
levels(dc_r$GSM)
[1] "" "Bisexual Characters" "Homosexual Characters"
levels(dc_r$ALIVE)
[1] "" "Deceased Characters" "Living Characters"
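The Julia and R listings above print the distinct levels of each categorical column, but the Python listing stops at info(), head(), and tail(). A minimal sketch of the same inspection in pandas follows. It uses a small stand-in frame (dc_demo_py and its values are placeholders, not the real dataset) so that it runs on its own; with the real data the same loop would be applied to dc_py.

```python
import pandas

# Stand-in for a couple of columns of the DC dataset (placeholder values).
dc_demo_py = pandas.DataFrame({
    "EYE": ["Blue Eyes", "Auburn Hair", None, "Blue Eyes"],
    "ALIVE": ["Living Characters", "Deceased Characters", "Living Characters", None],
})

# Collect the sorted distinct non-missing values of each column,
# which mirrors what levels() reports in Julia.
distinct_values = {}
for column in ["EYE", "ALIVE"]:
    distinct_values[column] = sorted(dc_demo_py[column].dropna().unique())
    print(column, distinct_values[column])
```

Listing the values this way makes stray entries easy to spot, such as a hair color appearing among the eye colors.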
Renaming Columns According to Preferred Naming Convention
Clean data starts with making the data easy-to-use and interpretable by the consumers of that data. Following common style guides—such as the Julia, Python, and R style guides—will make us good data citizens.
# Convert column names to lowercase (rename! mutates dc_jl in place, so dc_clean_jl is the same DataFrame)
dc_clean_jl = DataFrames.rename!(dc_jl, lowercase.(names(dc_jl)));
# Rename column names
DataFrames.rename!(dc_clean_jl, [
:page_id, :name, :url_slug, :identity,
:align, :eye, :hair, :gender,
:gender_sexual_minority, :living_status, :appearances, :first_appearance_month,
:first_appearance_year
]);
# Convert column names to lowercase
dc_clean_py = dc_py.rename(columns=str.lower)
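The Python listing only lowercases the column names, while the Julia and R listings also rename several columns. One way the same renaming might look in pandas is sketched below, using a dictionary passed to DataFrame.rename. The frame dc_demo_py is a one-row placeholder with the dataset's column headers, introduced here only so the example is self-contained; the target names are taken from the Julia and R code.

```python
import pandas

# Placeholder frame with the same column headers as the DC dataset.
columns = ["page_id", "name", "urlslug", "ID", "ALIGN", "EYE", "HAIR", "SEX",
           "GSM", "ALIVE", "APPEARANCES", "FIRST APPEARANCE", "YEAR"]
dc_demo_py = pandas.DataFrame([[0] * len(columns)], columns=columns)

# Convert column names to lowercase
dc_demo_clean_py = dc_demo_py.rename(columns=str.lower)

# Rename column names (old name -> new name); columns not listed are kept as-is
dc_demo_clean_py = dc_demo_clean_py.rename(columns={
    "urlslug": "url_slug",
    "id": "identity",
    "sex": "gender",
    "gsm": "gender_sexual_minority",
    "alive": "living_status",
    "first appearance": "first_appearance_month",
    "year": "first_appearance_year",
})

print(list(dc_demo_clean_py.columns))
```

Note that pandas keeps the space in "FIRST APPEARANCE", whereas R's read.csv turns it into a dot, so the key to rename differs between the two languages.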
# Convert column names to lowercase
dc_clean_r <- dplyr::rename_with(dc_r, tolower)
# Rename column names
colnames(dc_clean_r)[which(names(dc_clean_r) == "urlslug")] <- "url_slug"
colnames(dc_clean_r)[which(names(dc_clean_r) == "id")] <- "identity"
colnames(dc_clean_r)[which(names(dc_clean_r) == "sex")] <- "gender"
colnames(dc_clean_r)[which(names(dc_clean_r) == "gsm")] <- "gender_sexual_minority"
colnames(dc_clean_r)[which(names(dc_clean_r) == "alive")] <- "living_status"
colnames(dc_clean_r)[which(names(dc_clean_r) == "first.appearance")] <- "first_appearance_month"
colnames(dc_clean_r)[which(names(dc_clean_r) == "year")] <- "first_appearance_year"
Converting Categorical to String (and Vice Versa)
# Convert string to categorical
dc_clean_jl.identity = CategoricalArray(dc_clean_jl.identity);
dc_clean_jl.align = CategoricalArray(dc_clean_jl.align);
dc_clean_jl.eye = CategoricalArray(dc_clean_jl.eye);
dc_clean_jl.hair = CategoricalArray(dc_clean_jl.hair);
dc_clean_jl.gender = CategoricalArray(dc_clean_jl.gender);
dc_clean_jl.gender_sexual_minority = CategoricalArray(dc_clean_jl.gender_sexual_minority);
dc_clean_jl.living_status = CategoricalArray(dc_clean_jl.living_status);
# Convert categorical to string
dc_clean_r$name <- as.character(dc_clean_r$name)
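For Python, the same two conversions could be sketched with pandas as follows. This assumes a two-row toy frame (dc_demo_py is a hypothetical name) rather than the full dataset:

```python
import pandas

# Toy frame for illustration only
dc_demo_py = pandas.DataFrame({
    "name": ["Batman (Bruce Wayne)", "Wonder Woman (Diana Prince)"],
    "identity": ["Secret Identity", "Public Identity"],
    "align": ["Good Characters", "Good Characters"],
})

# Convert string to categorical
for column in ["identity", "align"]:
    dc_demo_py[column] = dc_demo_py[column].astype("category")

# Convert categorical to string
identity_as_string = dc_demo_py["identity"].astype(str)

print(dc_demo_py.dtypes)
```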
Removing First N Characters of String
If we remove the first six characters of each value in the first_appearance_month column (and keep the characters starting from the 7th position), we will be left with the month (spelled out) in string format.
dc_clean_r$first_appearance_month <- as.character(dc_clean_r$first_appearance_month)
dc_clean_r$first_appearance_month <- substring(dc_clean_r$first_appearance_month, 7)
dc_clean_r$first_appearance_month <- as.factor(dc_clean_r$first_appearance_month)
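In Python, the same idea can be sketched with string slicing on a pandas Series. The three values below are a toy sample (the Series name first_appearance is hypothetical), and index 6 is the 7th character because Python counts from zero:

```python
import pandas

# Toy sample in the "YYYY, Month" format, plus one missing value
first_appearance = pandas.Series(["1939, May", "1986, October", None])

# Drop the first six characters ("YYYY, ") and keep the rest;
# missing values stay missing
first_appearance_month = first_appearance.str[6:]

print(first_appearance_month)
```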
Renaming Levels
DataFrames.replace!(dc_clean_jl.identity, "Identity Unknown" => "Unknown");
DataFrames.replace!(dc_clean_jl.identity, "Known to Authorities Identity" => "Known to Authorities");
DataFrames.replace!(dc_clean_jl.identity, "No Dual Identity" => "No Dual");
DataFrames.replace!(dc_clean_jl.identity, "Public Identity" => "Public");
DataFrames.replace!(dc_clean_jl.identity, "Secret Identity" => "Secret");
DataFrames.replace!(dc_clean_jl.align, "Bad Characters" => "Bad");
DataFrames.replace!(dc_clean_jl.align, "Good Characters" => "Good");
DataFrames.replace!(dc_clean_jl.align, "Neutral Characters" => "Neutral");
DataFrames.replace!(dc_clean_jl.align, "Reformed Criminals" => "Reformed Criminal");
DataFrames.replace!(dc_clean_jl.gender, "Agender Characters" => "Agender");
DataFrames.replace!(dc_clean_jl.gender, "Female Characters" => "Female");
DataFrames.replace!(dc_clean_jl.gender, "Genderfluid Characters" => "Gender Fluid");
DataFrames.replace!(dc_clean_jl.gender, "Genderless Characters" => "Genderless");
DataFrames.replace!(dc_clean_jl.gender, "Male Characters" => "Male");
DataFrames.replace!(dc_clean_jl.gender, "Transgender Characters" => "Transgender");
DataFrames.replace!(dc_clean_jl.gender_sexual_minority, "Bisexual Characters" => "Bisexual");
DataFrames.replace!(dc_clean_jl.gender_sexual_minority, "Genderfluid Characters" => "Gender Fluid");
DataFrames.replace!(dc_clean_jl.gender_sexual_minority, "Homosexual Characters" => "Homosexual");
DataFrames.replace!(dc_clean_jl.gender_sexual_minority, "Pansexual Characters" => "Pansexual");
DataFrames.replace!(dc_clean_jl.gender_sexual_minority, "Transgender Characters" => "Transgender");
DataFrames.replace!(dc_clean_jl.gender_sexual_minority, "Transvestites" => "Transvestite");
DataFrames.replace!(dc_clean_jl.living_status, "Deceased Characters" => "Deceased");
DataFrames.replace!(dc_clean_jl.living_status, "Living Characters" => "Living");
levels(dc_clean_r$identity)[levels(dc_clean_r$identity) == "Identity Unknown"] <- "Unknown"
levels(dc_clean_r$identity)[levels(dc_clean_r$identity) == "Public Identity"] <- "Public"
levels(dc_clean_r$identity)[levels(dc_clean_r$identity) == "Secret Identity"] <- "Secret"
levels(dc_clean_r$align)[levels(dc_clean_r$align) == "Bad Characters"] <- "Bad"
levels(dc_clean_r$align)[levels(dc_clean_r$align) == "Good Characters"] <- "Good"
levels(dc_clean_r$align)[levels(dc_clean_r$align) == "Neutral Characters"] <- "Neutral"
levels(dc_clean_r$align)[levels(dc_clean_r$align) == "Reformed Criminals"] <- "Reformed Criminal"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Amber Eyes"] <- "Amber"
# "Auburn Hair" is matched here as a level of the eye column
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Auburn Hair"] <- "Auburn"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Black Eyes"] <- "Black"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Blue Eyes"] <- "Blue"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Brown Eyes"] <- "Brown"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Gold Eyes"] <- "Gold"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Green Eyes"] <- "Green"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Grey Eyes"] <- "Grey"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Hazel Eyes"] <- "Hazel"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Orange Eyes"] <- "Orange"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Photocellular Eyes"] <- "Photocellular"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Pink Eyes"] <- "Pink"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Purple Eyes"] <- "Purple"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Red Eyes"] <- "Red"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Violet Eyes"] <- "Violet"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "White Eyes"] <- "White"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == "Yellow Eyes"] <- "Yellow"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Black Hair"] <- "Black"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Blond Hair"] <- "Blond"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Blue Hair"] <- "Blue"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Brown Hair"] <- "Brown"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Gold Hair"] <- "Gold"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Green Hair"] <- "Green"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Grey Hair"] <- "Grey"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Orange Hair"] <- "Orange"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Pink Hair"] <- "Pink"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Platinum Blond Hair"] <- "Platinum Blond"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Purple Hair"] <- "Purple"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Red Hair"] <- "Red"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Reddish Brown Hair"] <- "Reddish Brown"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Silver Hair"] <- "Silver"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Strawberry Blond Hair"] <- "Strawberry Blond"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "Violet Hair"] <- "Violet"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == "White Hair"] <- "White"
levels(dc_clean_r$gender)[levels(dc_clean_r$gender) == "Female Characters"] <- "Female"
levels(dc_clean_r$gender)[levels(dc_clean_r$gender) == "Genderless Characters"] <- "Genderless"
levels(dc_clean_r$gender)[levels(dc_clean_r$gender) == "Male Characters"] <- "Male"
levels(dc_clean_r$gender)[levels(dc_clean_r$gender) == "Transgender Characters"] <- "Transgender"
levels(dc_clean_r$gender_sexual_minority)[levels(dc_clean_r$gender_sexual_minority) == "Bisexual Characters"] <- "Bisexual"
levels(dc_clean_r$gender_sexual_minority)[levels(dc_clean_r$gender_sexual_minority) == "Homosexual Characters"] <- "Homosexual"
levels(dc_clean_r$living_status)[levels(dc_clean_r$living_status) == "Deceased Characters"] <- "Deceased"
levels(dc_clean_r$living_status)[levels(dc_clean_r$living_status) == "Living Characters"] <- "Alive"
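A possible Python counterpart uses the rename_categories accessor method in pandas. This is a sketch on a toy categorical Series (gender is a hypothetical stand-in for the real column), not the code used to produce the outputs in this article:

```python
import pandas

# Toy categorical column for illustration only
gender = pandas.Series(
    ["Male Characters", "Female Characters", "Male Characters"],
    dtype="category",
)

# Old level -> new level; levels not listed keep their names
gender = gender.cat.rename_categories({
    "Female Characters": "Female",
    "Male Characters": "Male",
})

print(gender.cat.categories)
```

Renaming at the category level changes each label once, instead of rewriting every row that carries it.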
Handling Missing Data
# Assign as Unknown
DataFrames.replace!(dc_clean_jl.identity, missing => "Unknown");
DataFrames.replace!(dc_clean_jl.align, missing => "Unknown");
DataFrames.replace!(dc_clean_jl.eye, missing => "Unknown");
DataFrames.replace!(dc_clean_jl.hair, missing => "Unknown");
DataFrames.replace!(dc_clean_jl.gender, missing => "Unknown");
DataFrames.replace!(dc_clean_jl.living_status, missing => "Unknown");
# Assign a specific value
DataFrames.replace!(dc_clean_jl.gender_sexual_minority, missing => "Non-GSM");
# Assign as Unknown
levels(dc_clean_r$identity)[levels(dc_clean_r$identity) == ""] <- "Unknown"
levels(dc_clean_r$align)[levels(dc_clean_r$align) == ""] <- "Unknown"
levels(dc_clean_r$eye)[levels(dc_clean_r$eye) == ""] <- "Unknown"
levels(dc_clean_r$hair)[levels(dc_clean_r$hair) == ""] <- "Unknown"
levels(dc_clean_r$gender)[levels(dc_clean_r$gender) == ""] <- "Unknown"
levels(dc_clean_r$living_status)[levels(dc_clean_r$living_status) == ""] <- "Unknown"
# Assign a specific value
levels(dc_clean_r$gender_sexual_minority)[levels(dc_clean_r$gender_sexual_minority) == ""] <- "Non-GSM"
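In pandas, empty CSV fields are read as missing values (NaN) by default rather than as an empty-string level, so the Python version looks closer to the Julia one. A minimal sketch on a toy Series (gsm is a hypothetical name); the new label has to be added as a category before fillna can use it:

```python
import pandas

# Toy categorical column with missing values, for illustration only
gsm = pandas.Series([None, "Bisexual", None], dtype="category")

# Register the replacement label as a category, then fill the gaps with it
gsm = gsm.cat.add_categories("Non-GSM").fillna("Non-GSM")

print(gsm)
```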
Assigning Order to Levels
Categorical data can be further divided into ordinal (with an order) and nominal (without an order) data. In this case, I’ve made the assumption of a natural progression (or order) in a character’s alignment, from Bad to Good.
dc_clean_jl.align = CategoricalArrays.levels!(dc_clean_jl.align, ["Bad", "Reformed Criminal", "Unknown", "Neutral", "Good"]);
dc_clean_r$align <- ordered(dc_clean_r$align, levels = c("Bad", "Reformed Criminal", "Unknown", "Neutral", "Good"))
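The same assumed ordering could be expressed in pandas with set_categories and ordered=True. The sketch below uses a toy Series (align and alignment_order are hypothetical names) and shows that comparisons then follow the declared order rather than alphabetical order:

```python
import pandas

# Assumed progression, lowest to highest, as in the Julia and R code
alignment_order = ["Bad", "Reformed Criminal", "Unknown", "Neutral", "Good"]

# Toy categorical column for illustration only
align = pandas.Series(["Good", "Bad", "Neutral", "Unknown"], dtype="category")

# Fix the level order and mark the column as ordinal
align = align.cat.set_categories(alignment_order, ordered=True)

# Comparisons use the declared order
above_reformed = align > "Reformed Criminal"
print(above_reformed.tolist())
```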
Removing Columns
dc_clean_jl = select!(dc_clean_jl, Not([:page_id, :url_slug]));
dc_clean_r <- subset(dc_clean_r, select = -c(page_id, url_slug))
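A Python sketch of the same step, assuming the columns have already been renamed to page_id and url_slug. The one-row frame and its values are placeholders, not real data:

```python
import pandas

# One-row stand-in frame; page_id and url_slug values are placeholders
dc_demo_py = pandas.DataFrame({
    "page_id": [0],
    "name": ["Batman (Bruce Wayne)"],
    "url_slug": ["placeholder"],
})

# Drop the identifier columns that are not needed for analysis
dc_demo_py = dc_demo_py.drop(columns=["page_id", "url_slug"])
remaining_columns = list(dc_demo_py.columns)
print(remaining_columns)
```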
Validating Clean Data
After data wrangling, here is what the clean data looks like:
show(dc_clean_jl, allcols = true)
6896×11 DataFrame
Row │ name identity align eye hair gender gender_sexual_minority living_status appearances first_appearance_month first_appearance_year
│ String Cat…? Cat…? Cat…? Cat…? Cat…? Cat…? Cat…? Int64? String15? Int64?
──────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Batman (Bruce Wayne) Secret Good Blue Eyes Black Hair Male Non-GSM Living 3093 1939, May 1939
2 │ Superman (Clark Kent) Secret Good Blue Eyes Black Hair Male Non-GSM Living 2496 1986, October 1986
3 │ Green Lantern (Hal Jordan) Secret Good Brown Eyes Brown Hair Male Non-GSM Living 1565 1959, October 1959
4 │ James Gordon (New Earth) Public Good Brown Eyes White Hair Male Non-GSM Living 1316 1987, February 1987
5 │ Richard Grayson (New Earth) Secret Good Blue Eyes Black Hair Male Non-GSM Living 1237 1940, April 1940
6 │ Wonder Woman (Diana Prince) Public Good Blue Eyes Black Hair Female Non-GSM Living 1231 1941, December 1941
7 │ Aquaman (Arthur Curry) Public Good Blue Eyes Blond Hair Male Non-GSM Living 1121 1941, November 1941
8 │ Timothy Drake (New Earth) Secret Good Blue Eyes Black Hair Male Non-GSM Living 1095 1989, August 1989
9 │ Dinah Laurel Lance (New Earth) Public Good Blue Eyes Blond Hair Female Non-GSM Living 1075 1969, November 1969
10 │ Flash (Barry Allen) Secret Good Blue Eyes Blond Hair Male Non-GSM Living 1028 1956, October 1956
11 │ GenderTest Secret Good Blue Eyes Blond Hair Female Non-GSM Living 1028 1956, October 1956
12 │ Alan Scott (New Earth) Secret Good Blue Eyes Blond Hair Male Non-GSM Deceased 969 1940, July 1940
13 │ Barbara Gordon (New Earth) Secret Good Blue Eyes Red Hair Female Non-GSM Living 951 1967, January 1967
14 │ Jason Garrick (New Earth) Public Good Blue Eyes Brown Hair Male Non-GSM Living 951 1940, January 1940
15 │ Lois Lane (New Earth) Public Good Blue Eyes Black Hair Female Non-GSM Living 934 1938, June 1938
16 │ Alfred Pennyworth (New Earth) Public Good Blue Eyes Black Hair Male Non-GSM Living 930 1943, April 1943
17 │ Carter Hall (New Earth) Secret Good Blue Eyes Brown Hair Male Non-GSM Living 803 1940, January 1940
18 │ Kyle Rayner (New Earth) Secret Good Green Eyes Black Hair Male Non-GSM Living 716 1994, January 1994
19 │ Raymond Palmer (New Earth) Public Good Brown Eyes Unknown Male Non-GSM Living 706 1961, October 1961
20 │ Alexander Luthor (New Earth) Public Bad Green Eyes Unknown Male Non-GSM Living 677 1986, October 1986
21 │ Roy Harper (New Earth) Secret Neutral Green Eyes Red Hair Male Non-GSM Living 654 1941, November 1941
22 │ Kara Zor-L (Earth-Two) Secret Good Blue Eyes Blond Hair Female Non-GSM Living 635 1976, February 1976
23 │ Ted Grant (New Earth) Secret Unknown Blue Eyes Black Hair Male Non-GSM Living 605 1942, January 1942
24 │ Garfield Logan (New Earth) Public Good Green Eyes Green Hair Male Non-GSM Living 595 1965, November 1965
25 │ Guy Gardner (New Earth) Public Good Blue Eyes Red Hair Male Non-GSM Living 593 1968, March 1968
26 │ Victor Stone (New Earth) Public Good Brown Eyes Black Hair Male Non-GSM Living 584 1980, October 1980
27 │ Kon-El (New Earth) Secret Good Blue Eyes Black Hair Male Non-GSM Living 560 1993, June 1993
28 │ Ralph Dibny (New Earth) Public Unknown Blue Eyes Red Hair Male Non-GSM Deceased 558 1960, May 1960
29 │ James Olsen (New Earth) Public Good Green Eyes Red Hair Male Non-GSM Living 557 1986, October 1986
30 │ John Stewart (New Earth) Public Good Brown Eyes Black Hair Male Non-GSM Living 549 1971, December 1971
31 │ Joker (New Earth) Secret Bad Green Eyes Green Hair Male Non-GSM Living 517 1940, June 1940
32 │ Franklin Rock (New Earth) Public Good Blue Eyes Unknown Male Non-GSM Living 492 1959, April 1959
33 │ Garth (New Earth) Public Good Purple Eyes Black Hair Male Non-GSM Deceased 487 1960, February 1960
34 │ Rex Mason (New Earth) Secret Unknown Black Eyes Unknown Male Non-GSM Living 470 1965, January 1965
35 │ Zatanna Zatara (New Earth) Public Good Blue Eyes Black Hair Female Non-GSM Living 439 1964, November 1964
36 │ Aztar (New Earth) Secret Unknown White Eyes Unknown Male Non-GSM Living 436 1940, February 1940
37 │ Theodore Kord (New Earth) Secret Unknown Blue Eyes Brown Hair Male Non-GSM Deceased 429 1986, February 1986
38 │ Michael Jon Carter (New Earth) Public Good Blue Eyes Blond Hair Male Non-GSM Living 427 1986, February 1986
39 │ Cassandra Sandsmark (New Earth) Public Good Blue Eyes Blond Hair Female Non-GSM Living 423 1996, January 1996
40 │ Kent Nelson (New Earth) Secret Good Blue Eyes Blond Hair Male Non-GSM Deceased 422 1940, May 1940
41 │ Harvey Bullock (New Earth) Public Good Brown Eyes Black Hair Male Non-GSM Living 413 1974, July 1974
42 │ Rachel Roth (New Earth) Secret Good Purple Eyes Black Hair Female Non-GSM Living 399 1980, October 1980
43 │ Helena Bertinelli (New Earth) Secret Good Blue Eyes Black Hair Female Non-GSM Living 393 1989, April 1989
44 │ Wesley Dodds (New Earth) Secret Good Brown Eyes Brown Hair Male Non-GSM Deceased 391 1939, April 1939
45 │ Uxas (New Earth) Public Bad Red Eyes Unknown Male Non-GSM Deceased 388 1970, December 1970
46 │ Nathaniel Adam (New Earth) Secret Good Blue Eyes White Hair Male Non-GSM Living 386 1987, March 1987
⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
6852 │ Shu (New Earth) Unknown Unknown Unknown Unknown Male Non-GSM Living missing 1977, April 1977
6853 │ Ch'al Andar (New Earth) Secret Good Blue Eyes Blond Hair Male Non-GSM Living missing 1975, March 1975
6854 │ M'nagalah (New Earth) Secret Bad Unknown Unknown Genderless Non-GSM Living missing 1974, February 1974
6855 │ Mrblonde267\\/Buddy Blank (New E… Unknown Unknown Unknown Unknown Unknown Non-GSM Unknown missing 1974, October 1974
6856 │ Dragorin (New Earth) Public Bad Red Eyes Black Hair Male Non-GSM Living missing 1971, October 1971
6857 │ Ra's al Ghul (New Earth) Secret Bad Green Eyes Grey Hair Male Non-GSM Deceased missing 1971, June 1971
6858 │ Bartholomew Lash (New Earth) Public Good Blue Eyes Red Hair Male Non-GSM Living missing 1968, August 1968
6859 │ Angel Beatrix O'Day (New Earth) Public Good Blue Eyes Blond Hair Female Non-GSM Living missing 1968, September 1968
6860 │ Cobalt (New Earth) Secret Good Photocellular Eyes Unknown Male Non-GSM Living missing 1968, May 1968
6861 │ Gallium (New Earth) Secret Good Photocellular Eyes Unknown Male Non-GSM Living missing 1968, May 1968
6862 │ Iridia (New Earth) Secret Good Photocellular Eyes Unknown Female Non-GSM Living missing 1968, May 1968
6863 │ Osmium (New Earth) Secret Good Photocellular Eyes Unknown Male Non-GSM Living missing 1968, May 1968
6864 │ Silver (New Earth) Secret Good Photocellular Eyes Unknown Male Non-GSM Living missing 1968, May 1968
6865 │ T'Charr (New Earth) Secret Unknown Unknown Unknown Male Non-GSM Deceased missing 1968, June 1968
6866 │ Zinc (New Earth) Secret Good Photocellular Eyes Unknown Male Non-GSM Living missing 1968, May 1968
6867 │ Baron Tyrano (New Earth) Secret Bad Blue Eyes Unknown Male Non-GSM Living missing 1967, July 1967
6868 │ Brains (New Earth) Unknown Good Unknown Unknown Male Non-GSM Living missing 1967, April 1967
6869 │ Cracker (New Earth) Unknown Good Unknown Unknown Male Non-GSM Living missing 1967, June 1967
6870 │ Hard Head (New Earth) Unknown Good Unknown Black Hair Male Non-GSM Living missing 1967, April 1967
6871 │ Zig-Zag (New Earth) Unknown Good Unknown Unknown Male Non-GSM Living missing 1967, June 1967
6872 │ Dragonfly (New Earth) Secret Bad Unknown Black Hair Female Non-GSM Living missing 1966, June 1966
6873 │ Carl Bradford (New Earth) Unknown Bad Unknown Unknown Male Non-GSM Living missing 1966, January 1966
6874 │ Donna Troy (New Earth) Public Good Blue Eyes Black Hair Female Non-GSM Living missing 1965, July 1965
6875 │ Bartholomew Magan (New Earth) Unknown Bad Unknown Unknown Male Non-GSM Living missing 1963, September 1963
6876 │ James Moon (New Earth) Unknown Unknown Unknown Unknown Male Non-GSM Living missing 1962, March 1962
6877 │ Flash (Wally West) Secret Good Green Eyes Red Hair Male Non-GSM Living missing 1960, January 1960
6878 │ J'onn J'onzz (New Earth) Public Good Red Eyes Unknown Male Non-GSM Living missing 1955, November 1955
6879 │ Dorothea Tane (New Earth) Unknown Unknown Unknown Blond Hair Female Non-GSM Living missing 1948, August 1948
6880 │ Dmane (Earth-Two) Unknown Bad Blue Eyes Unknown Male Non-GSM Living missing 1946, April 1946
6881 │ Maximillian O'Leary (New Earth) Public Good Unknown Black Hair Male Non-GSM Living missing 1946, January 1946
6882 │ Doris Zuel (New Earth) Secret Bad Green Eyes Red Hair Female Non-GSM Living missing 1944, June 1944
6883 │ Doris Lee (New Earth) Public Unknown Brown Eyes Brown Hair Female Non-GSM Deceased missing 1941, April 1941
6884 │ Patrick O'Brian (New Earth) Secret Good Blue Eyes Black Hair Male Non-GSM Living missing 1941, August 1941
6885 │ Basil Karlo (New Earth) Secret Bad Black Eyes Black Hair Male Non-GSM Living missing 1940, June 1940
6886 │ Catwoman (Selina Kyle) Secret Neutral Green Eyes Black Hair Female Non-GSM Living missing 1940, June 1940
6887 │ Bedivere (New Earth) Unknown Unknown Unknown Unknown Male Non-GSM Living missing 1936, February 1936
6888 │ Herbert Hoover (New Earth) Public Good Unknown Unknown Male Non-GSM Living missing missing missing
6889 │ William Howard Taft (New Earth) Public Good Unknown Unknown Male Non-GSM Living missing missing missing
6890 │ Frank Fitzsimmons (New Earth) Public Good Unknown Grey Hair Male Non-GSM Living missing missing missing
6891 │ James Garfield (New Earth) Public Good Unknown Unknown Male Non-GSM Living missing missing missing
6892 │ Nadine West (New Earth) Public Good Unknown Unknown Female Non-GSM Living missing missing missing
6893 │ Warren Harding (New Earth) Public Good Unknown Unknown Male Non-GSM Living missing missing missing
6894 │ William Harrison (New Earth) Public Good Unknown Unknown Male Non-GSM Living missing missing missing
6895 │ William McKinley (New Earth) Public Good Unknown Unknown Male Non-GSM Living missing missing missing
6896 │ Mookie (New Earth) Public Bad Blue Eyes Blond Hair Male Non-GSM Living missing missing missing
6805 rows omitted
dc_clean_py.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6896 entries, 0 to 6895
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 page_id 6896 non-null int64
1 name 6896 non-null object
2 urlslug 6896 non-null object
3 id 4883 non-null object
4 align 6295 non-null object
5 eye 3268 non-null object
6 hair 4622 non-null object
7 sex 6771 non-null object
8 gsm 64 non-null object
9 alive 6893 non-null object
10 appearances 6541 non-null float64
11 first appearance 6827 non-null object
12 year 6827 non-null float64
dtypes: float64(2), int64(1), object(10)
memory usage: 700.5+ KB
str(dc_clean_r)
'data.frame': 6896 obs. of 11 variables:
$ name : chr "Batman (Bruce Wayne)" "Superman (Clark Kent)" "Green Lantern (Hal Jordan)" "James Gordon (New Earth)" ...
$ identity : Factor w/ 3 levels "Unknown","Public",..: 3 3 3 2 3 2 2 3 2 3 ...
$ align : Ord.factor w/ 5 levels "Bad"<"Reformed Criminal"<..: 5 5 5 5 5 5 5 5 5 5 ...
$ eye : Factor w/ 18 levels "Unknown","Amber",..: 5 5 6 6 5 5 5 5 5 5 ...
$ hair : Factor w/ 18 levels "Unknown","Black",..: 2 2 5 18 2 2 3 2 3 3 ...
$ gender : Factor w/ 5 levels "Unknown","Female",..: 4 4 4 4 4 2 4 4 2 4 ...
$ gender_sexual_minority: Factor w/ 3 levels "Non-GSM","Bisexual",..: 1 1 1 1 1 1 1 1 1 1 ...
$ living_status : Factor w/ 3 levels "Unknown","Deceased",..: 3 3 3 3 3 3 3 3 3 3 ...
$ appearances : int 3093 2496 1565 1316 1237 1231 1121 1095 1075 1028 ...
$ first_appearance_month: Factor w/ 14 levels "","April","August",..: 11 13 13 5 2 4 12 3 12 13 ...
$ first_appearance_year : int 1939 1986 1959 1987 1940 1941 1941 1989 1969 1956 ...
Summary Statistics
Here are the summary statistics of the data.
DataFrames.describe(dc_clean_jl)
11×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Union… Any Union… Any Int64 Type
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ name 3g4 (New Earth) Zzlrrrzzzm (New Earth) 0 String
2 │ identity Unknown Secret 0 Union{Missing, CategoricalValue{…
3 │ align Bad Good 0 Union{Missing, CategoricalValue{…
4 │ eye Amber Eyes Unknown 0 Union{Missing, CategoricalValue{…
5 │ hair Black Hair Unknown 0 Union{Missing, CategoricalValue{…
6 │ gender Female Unknown 0 Union{Missing, CategoricalValue{…
7 │ gender_sexual_minority Bisexual Non-GSM 0 Union{Missing, CategoricalValue{…
8 │ living_status Deceased Unknown 0 Union{Missing, CategoricalValue{…
9 │ appearances 23.6251 1 6.0 3093 355 Union{Missing, Int64}
10 │ first_appearance_month 1935, October 2013, October 69 Union{Missing, String15}
11 │ first_appearance_year 1989.77 1935 1992.0 2013 69 Union{Missing, Int64}
dc_clean_py.describe()
page_id appearances year
count 6896.000000 6541.000000 6827.000000
mean 147441.209252 23.625134 1989.766662
std 108388.631149 87.378509 16.824194
min 1380.000000 1.000000 1935.000000
25% 44105.500000 2.000000 1983.000000
50% 141267.000000 6.000000 1992.000000
75% 213203.000000 15.000000 2003.000000
max 404010.000000 3093.000000 2013.000000
summary(dc_clean_r, maxsum = 50)
name identity align eye hair gender gender_sexual_minority living_status appearances first_appearance_month first_appearance_year
Length:6896 Unknown:2022 Bad :2895 Unknown :3628 Unknown :2274 Unknown : 125 Non-GSM :6832 Unknown : 3 Min. : 1 :213 Min. :1935
Class :character Public :2466 Reformed Criminal: 3 Amber : 5 Black :1574 Female :1967 Bisexual : 10 Deceased:1693 1st Qu.: 2 April :485 1st Qu.:1983
Mode :character Secret :2408 Unknown : 601 Auburn : 7 Blond : 744 Genderless : 20 Homosexual: 54 Alive :5200 Median : 6 August :634 Median :1992
Neutral : 565 Black : 412 Blue : 41 Male :4783 Mean : 24 December :632 Mean :1990
Good :2832 Blue :1102 Brown :1148 Transgender: 1 3rd Qu.: 15 February :544 3rd Qu.:2003
Brown : 879 Gold : 5 Max. :3093 Holiday : 2 Max. :2013
Gold : 9 Green : 42 NA's :355 January :537 NA's :69
Green : 291 Grey : 157 July :554
Grey : 40 Orange : 21 June :613
Hazel : 23 Pink : 11 March :539
Orange : 10 Platinum Blond : 2 May :524
Photocellular: 48 Purple : 32 November :452
Pink : 6 Red : 461 October :604
Purple : 14 Reddish Brown : 3 September:563
Red : 208 Silver : 3
Violet : 12 Strawberry Blond: 28
White : 116 Violet : 4
Yellow : 86 White : 346
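The describe() call in pandas covers only numeric columns by default. For categorical columns, per-level counts similar to the R summary can be sketched with value_counts. The Series below is a toy sample (living_status is a hypothetical stand-in), so the counts are not the real ones:

```python
import pandas

# Toy categorical column for illustration only
living_status = pandas.Series(
    ["Living", "Living", "Deceased"],
    dtype="category",
)

# Number of rows per level
level_counts = living_status.value_counts()
print(level_counts)
```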
If you want to take this further with univariate and bivariate analyses, an example can be found applied to the Marvel Comics dataset.
References
- Boehmke, B.C. (2016). Data Wrangling in R. Springer. https://doi.org/10.1007/978-3-319-45599-0
- Fivethirtyeight. (2014). Comic Characters. GitHub. https://github.com/fivethirtyeight/data/tree/master/comic-characters. (2019).