Read in my_data (.csv)

Base R can handle .csv and .txt files “out of the box”, but these can have limitations. Note: With Rmd, comments/text do not require the “#” comment symbol. Instead we specify where the code will go using “code chunks”, as below.

my_data <- read.csv("example_data.csv")

colnames(my_data)
## [1] "SEX"      "EDATTAIN" "EMPSTAT"  "AGE"
table(my_data$SEX)
## 
##    1    2 
## 2568 2432
table(my_data$EDATTAIN)
## 
##    1    2    3    4    5 
## 2488 1973  130  289  120
table(my_data$EMPSTAT)
## 
##    1    2    3    4    5 
## 1096 1873  121 1900   10

To read in different file types, we will need to download R packages.

Install Packages

R packages are add-ons that expand the functionality of base R. Its helpful to think of R as a series of books. “Base R” is the starter set of “books” you get when you first install R. install.packages() downloads a new “book” and adds it to your collection. This step needs to be done only once, or periodically to update.

Note: quotation marks (““) ARE REQUIRED to specify a package name.

if(!require("openxlsx")) {
  install.packages("openxlsx")
}
## Loading required package: openxlsx
if (!(require("haven"))) {
  install.packages("haven")
}
## Loading required package: haven
if(!require("ggplot2")) {
  install.packages("ggplot2")
}
## Loading required package: ggplot2
if (!(require("gtsummary"))) {
  install.packages("gtsummary")
}
## Loading required package: gtsummary

openxlsx provides functions to read and write Microsoft Excel files.

haven provides functions to read from various stats packages, including SPSS, Stata, and SAS. Notably, haven is able to read in values and value labels.

ggplot2 streamlines the creation of plots to visualize your data.

gtsummary streamlines the creation of tables to summarize your data.

Load Packages

In order to load the package, we use library(). This is like taking the book off your shelf and opening it to read/use. This step needs to be done each time you open R/RStudio, so it’s helpful to have a section at the top of the script that loads all packages needed.

Note: quotation marks (““) are NOT required for the library function.

library(haven)
library(openxlsx)
library(ggplot2)
library(gtsummary)

Read in my_data (.xlsx)

Excel can be convenient for data management, but like .csv files, it does not allow you to store a numeric code and a text label in a single cell. Understanding coded values will require referring to a separate metadata file.

my_data_xlsx <- read.xlsx("example_data.xlsx")

head(my_data_xlsx)
##   SEX EDATTAIN EMPSTAT AGE
## 1   1        2       2  29
## 2   1        1       2  17
## 3   1        2       2  32
## 4   1        1       4  69
## 5   2        1       1   8
## 6   2        2       4  27
#### Summarize (Frequency Tables)
table(my_data_xlsx$SEX)
## 
##    1    2 
## 2568 2432
table(my_data_xlsx$EDATTAIN)
## 
##    1    2    3    4    5 
## 2488 1973  130  289  120
table(my_data_xlsx$EMPSTAT)
## 
##    1    2    3    4    5 
## 1096 1873  121 1900   10

Read in my_data (with labels)

The haven package provides functions to read and write the formatted data files used by Stata, SPSS, and SAS. This package makes it easier to get your data into R, and also provides support for LABELED VALUES.

haven labels are complex; converting to the factor format, which is native to R, will preserve the labels and work more smoothly with other R functions.

We use as_factor from the haven package to simplify the labeled values.

my_data <- read_dta("example_data.dta")


head(my_data) # It's still numeric! and confusing
## # A tibble: 6 × 4
##          SEX                        EDATTAIN                   EMPSTAT   AGE
##    <dbl+lbl>                       <dbl+lbl>                 <dbl+lbl> <dbl>
## 1 1 [Male]   2 [Primary completed]           2 [Employed]                 29
## 2 1 [Male]   1 [Less than primary completed] 2 [Employed]                 17
## 3 1 [Male]   2 [Primary completed]           2 [Employed]                 32
## 4 1 [Male]   1 [Less than primary completed] 4 [Inactive]                 69
## 5 2 [Female] 1 [Less than primary completed] 1 [NIU (not in universe)]     8
## 6 2 [Female] 2 [Primary completed]           4 [Inactive]                 27
## as_factor in the haven package to convert to factors for further analysis

my_data$SEX <- as_factor(my_data$SEX)
my_data$EDATTAIN <- as_factor(my_data$EDATTAIN)
my_data$EMPSTAT <- as_factor(my_data$EMPSTAT)

head(my_data) # Labels present!
## # A tibble: 6 × 4
##   SEX    EDATTAIN                    EMPSTAT                 AGE
##   <fct>  <fct>                       <fct>                 <dbl>
## 1 Male   Primary completed           Employed                 29
## 2 Male   Less than primary completed Employed                 17
## 3 Male   Primary completed           Employed                 32
## 4 Male   Less than primary completed Inactive                 69
## 5 Female Less than primary completed NIU (not in universe)     8
## 6 Female Primary completed           Inactive                 27

Summarize your data

Frequency Tables

table(my_data$SEX)
## 
##   Male Female 
##   2568   2432
table(my_data$EDATTAIN)
## 
## Less than primary completed           Primary completed 
##                        2488                        1973 
##         Secondary completed        University completed 
##                         130                         289 
##                     Unknown 
##                         120
table(my_data$EMPSTAT)
## 
## NIU (not in universe)              Employed            Unemployed 
##                  1096                  1873                   121 
##              Inactive       Unknown/missing 
##                  1900                    10

Distribution of Numeric Variables

summary(my_data$AGE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   11.00   23.00   26.36   39.00   99.00

Two-Way Tables

table(my_data$EDATTAIN, my_data$SEX)
##                              
##                               Male Female
##   Less than primary completed 1270   1218
##   Primary completed           1019    954
##   Secondary completed           57     73
##   University completed         163    126
##   Unknown                       59     61

Subset observations

Use the subset() function specifying a DATA.FRAME followed by an expression that will return a LOGICAL (TRUE or FALSE) value.

sub_data <- subset(my_data, EDATTAIN != "Unknown")

table(sub_data$EDATTAIN) ## empty levels present
## 
## Less than primary completed           Primary completed 
##                        2488                        1973 
##         Secondary completed        University completed 
##                         130                         289 
##                     Unknown 
##                           0

R factors will keep empty levels, unless we explicitly tell R to drop them.

sub_data <- droplevels(sub_data) ## drop removed labels

# This time we save the resulting table to an object for further analysis
crosstab <- table(sub_data$EDATTAIN, sub_data$SEX)

Printing R objects

Calling an R object will print the contents of that object. Small objects may be printed in the console/Rmd report, but larger onces might get truncated:

## A reasonable table
crosstab
##                              
##                               Male Female
##   Less than primary completed 1270   1218
##   Primary completed           1019    954
##   Secondary completed           57     73
##   University completed         163    126
## Large objects may get truncated/summarized
my_data
## # A tibble: 5,000 × 4
##    SEX    EDATTAIN                    EMPSTAT                 AGE
##    <fct>  <fct>                       <fct>                 <dbl>
##  1 Male   Primary completed           Employed                 29
##  2 Male   Less than primary completed Employed                 17
##  3 Male   Primary completed           Employed                 32
##  4 Male   Less than primary completed Inactive                 69
##  5 Female Less than primary completed NIU (not in universe)     8
##  6 Female Primary completed           Inactive                 27
##  7 Male   Less than primary completed NIU (not in universe)     8
##  8 Female Primary completed           Employed                 20
##  9 Male   Less than primary completed Inactive                 12
## 10 Female Secondary completed         Employed                 30
## # … with 4,990 more rows

Writing out R objects

Options to write R objects out to various file types. Note the similarities and differences in syntax. In most file-writing functions, the first argument is the R object to be written out.

write.csv(crosstab, file =  "xtab_emp_by_sex.csv")

write.xlsx(crosstab, file = "xtab_emp_by_sex.xlsx")

write_dta(as.data.frame(crosstab), path = "xtab_emp_by_sex.dta") 
## note different syntax (argument name is `path` instead of `file`)

Base R vs Packages - Tables

Creating tables in base R can be a bit clunky. If you want a table of proportions or percentages, it may take a few steps.

crosstab ## print frequencies
##                              
##                               Male Female
##   Less than primary completed 1270   1218
##   Primary completed           1019    954
##   Secondary completed           57     73
##   University completed         163    126
ppn_crosstab <- prop.table(crosstab) ## get proportions
ppn_crosstab
##                              
##                                     Male     Female
##   Less than primary completed 0.26024590 0.24959016
##   Primary completed           0.20881148 0.19549180
##   Secondary completed         0.01168033 0.01495902
##   University completed        0.03340164 0.02581967
ppn_crosstab <- round(ppn_crosstab, 2) # round to 2 digits
ppn_crosstab
##                              
##                               Male Female
##   Less than primary completed 0.26   0.25
##   Primary completed           0.21   0.20
##   Secondary completed         0.01   0.01
##   University completed        0.03   0.03

Fancier Tables (gtsummary)

gtsummary produces nice all-in-one tables that integrate nicely into HTML reports created with R Markdown.

tbl_summary(sub_data, by = "SEX")
Characteristic Male, N = 2,5091 Female, N = 2,3711
EDATTAIN
Less than primary completed 1,270 (51%) 1,218 (51%)
Primary completed 1,019 (41%) 954 (40%)
Secondary completed 57 (2.3%) 73 (3.1%)
University completed 163 (6.5%) 126 (5.3%)
EMPSTAT
NIU (not in universe) 585 (23%) 511 (22%)
Employed 1,183 (47%) 583 (25%)
Unemployed 72 (2.9%) 49 (2.1%)
Inactive 667 (27%) 1,222 (52%)
Unknown/missing 2 (<0.1%) 6 (0.3%)
AGE 22 (10, 39) 23 (11, 39)
1 n (%); Median (IQR)

Base R vs Packages - Graphs

Basic Visualization (Base R)

barplot(crosstab)

Base R provides MANY built in functions, each with many options to customize your plots. The flexibility/customization comes at the trade-off of user input. Adding a legend takes two extra steps in Base R.

First, we use layout() to tell R to treat the plot window as a matrix(), consisting of 1 row and 4 columns: 1,1,2,2. The first plot will occupy the first two columns (as a merged column), while the second plot will occupy the last set of columns.

Make an empty plot (of a single point) to use as the base for the legend. “n” indicates “null” for many arguments to the plot function.

## plot matrix
layout(matrix(c(1,1,2,2), ncol = 4))

## barplot as normal
barplot(crosstab, col = rainbow(4), main = "Education by Sex")

## Make an empty plot 
plot(
  1,
  type = "n", ## do not plot the points
  bty = "n", ## do not plot a bounding box
  xaxt = "n", ## do not plot an X axis
  xlab = "", ## label the x axis as "", in other words, as blank
  yaxt = "n", ## do not plot a Y axis
  ylab ="", ## label the Y axis as "", in other words, as blank
)

legend(
  "center",
  legend = levels(sub_data$EDATTAIN),
  pch = 22,
  pt.bg = rainbow(4),
  inset = c(-0.2,0),
  title = "Ed Attainment"
)

Basic Visualization (ggplot2)

ggplot2 is EXTREMELY popular for quickly generating polished looking plots. It does use a slightly different syntax than Base R, but there is ample user-support.

qplot(x = SEX, fill = EDATTAIN, data = sub_data, geom = "bar")