Getting Started with socR
Daniel Russ
2024-12-08
GettingStarted.Rmd
I use socR almost every day to handle common tasks that involve occupational or industrial codes. The most common task I have involves dealing with coding systems. This vignette shows how I do common tasks with socR.
Create a coding system
This is often the first thing you have to do. I save my coding system data on GitHub Pages. It is public data, so feel free to use it. If you want to add a coding system to my GitHub repository, let me know. As long as there are no licensing issues, I’ll be happy to add it.
As an example, I will load the soc2000 system from https://danielruss.github.io/codingsystems/soc2000_all.csv. I actually use soc2010 in my work, but that comes with socR, as do a few others that I use often.
library(socR)
soc2000_all <- codingsystem("https://danielruss.github.io/codingsystems/soc2000_all.csv",name="soc2000")
soc2000_all
#> # Coding System: soc2000
#> code title Level Hierarchical_structure parent soc2d soc3d soc5d soc6d
#> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 11-0000 Manageme… 2 Major Group NA 11-0… NA NA NA
#> 2 11-1000 Top Exec… 3 Minor Group 11-00… 11-0… 11-1… NA NA
#> 3 11-1010 Chief Ex… 5 Broad Occupation 11-10… 11-0… 11-1… 11-1… NA
#> 4 11-1011 Chief Ex… 6 Detailed Occupation 11-10… 11-0… 11-1… 11-1… 11-1…
#> 5 11-1020 General … 5 Broad Occupation 11-10… 11-0… 11-1… 11-1… NA
#> 6 11-1021 General … 6 Detailed Occupation 11-10… 11-0… 11-1… 11-1… 11-1…
#> 7 11-1030 Legislat… 5 Broad Occupation 11-10… 11-0… 11-1… 11-1… NA
#> 8 11-1031 Legislat… 6 Detailed Occupation 11-10… 11-0… 11-1… 11-1… 11-1…
#> 9 11-2000 Advertis… 3 Minor Group 11-00… 11-0… 11-2… NA NA
#> 10 11-2010 Advertis… 5 Broad Occupation 11-20… 11-0… 11-2… 11-2… NA
#> # ℹ 1,379 more rows
A coding system is an S3 class that wraps a tibble. The coding system is required to have a column named code and a column named title. The other columns are optional; however, if you want to move up the code hierarchy, having the additional columns is useful. In this example, soc2000 has a Level column, which corresponds to the number of digits in the code, not counting trailing zeros (e.g. 11-0000 is a 2-digit code, Level=2, and 11-1010 is a 5-digit code, Level=5). The parent column is the immediate parent in the hierarchy of the coding system. The columns soc2d through soc6d are the codes at the various levels. My coding systems use NA to mark cases that don’t exist (e.g. the soc6d for 11-0000). The codingsystem also has a name, which is printed out for your use.
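The Level rule described above (count the digits, ignoring the hyphen and any trailing zeros) can be sketched in base R. The helper name code_level is hypothetical and is not part of socR; it only mirrors the rule for codes like the ones shown in this vignette and is not a general SOC parser.

```r
# Hypothetical helper, not part of socR: derive the level of a SOC-style code
# by dropping the hyphen and any trailing zeros, then counting what is left.
code_level <- function(codes) {
  digits <- gsub("-", "", codes, fixed = TRUE)  # "11-1010" -> "111010"
  digits <- sub("0+$", "", digits)              # "111010"  -> "11101"
  nchar(digits)                                 # "11101"   -> 5
}

code_level(c("11-0000", "11-1000", "11-1010", "11-1011"))
#> [1] 2 3 5 6
```

In practice the Level column already stored in the coding system is the one to rely on; the sketch only shows where its values come from.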
Here is the soc2010 coding system that comes with socR. There is also a soc2010_6d, which is deprecated and will be removed soon, since you can create it by filtering soc2010_all.
soc2010_all
#> # Coding System: soc2010
#> code title Level Hierarchical_structure parent soc2d soc3d soc5d soc6d
#> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 11-0000 Manageme… 2 Major Group NA 11-0… NA NA NA
#> 2 11-1000 Top Exec… 3 Minor Group 11-00… 11-0… 11-1… NA NA
#> 3 11-1010 Chief Ex… 5 Broad Group 11-10… 11-0… 11-1… 11-1… NA
#> 4 11-1011 Chief Ex… 6 Detailed Occupation 11-10… 11-0… 11-1… 11-1… 11-1…
#> 5 11-1020 General … 5 Broad Group 11-10… 11-0… 11-1… 11-1… NA
#> 6 11-1021 General … 6 Detailed Occupation 11-10… 11-0… 11-1… 11-1… 11-1…
#> 7 11-1030 Legislat… 5 Broad Group 11-10… 11-0… 11-1… 11-1… NA
#> 8 11-1031 Legislat… 6 Detailed Occupation 11-10… 11-0… 11-1… 11-1… 11-1…
#> 9 11-2000 Advertis… 3 Minor Group 11-00… 11-0… 11-2… NA NA
#> 10 11-2010 Advertis… 5 Broad Group 11-20… 11-0… 11-2… 11-2… NA
#> # ℹ 1,415 more rows
Changing to higher level codes
Given a vector of SOC codes, you may want to convert them to 2-digit SOCs. To do this, we use a function factory to create the appropriate function.
## create a function to convert a vector of codes to the 2-digit level
## notice we are using the column name that contains the 2-digit socs for
## each code
to_2d <- to_level(soc2000_all,soc2d)
to_2d(c("11-1021","11-1031"))
#> [1] "11-0000" "11-0000"
## let's do it for a tibble...
my_data <- tibble::tibble(resp_id=c("A13254","A33122"),soc2000=c("11-1021","11-1031")) |>
dplyr::mutate(soc2000_2d=to_2d(soc2000))
my_data
#> # A tibble: 2 × 3
#> resp_id soc2000 soc2000_2d
#> <chr> <chr> <chr>
#> 1 A13254 11-1021 11-0000
#> 2 A33122 11-1031 11-0000
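To make the function factory idea concrete, here is a minimal base R sketch of the pattern. It is not socR's implementation of to_level: the names toy, make_level_converter, and toy_to_2d are hypothetical, the table is a four-row stand-in for the real data, and the column is passed as a string rather than unquoted. Returning NA for an unknown code is a choice made for this sketch, not a statement about how to_level behaves.

```r
# A toy table with the two columns the sketch needs (not the real soc2000 data).
toy <- data.frame(
  code  = c("11-0000", "11-1020", "11-1021", "11-1031"),
  soc2d = c("11-0000", "11-0000", "11-0000", "11-0000"),
  stringsAsFactors = FALSE
)

# Hypothetical factory: build a named lookup vector once, then return a
# function that maps a vector of codes to the values in the chosen column.
make_level_converter <- function(table, column) {
  lookup <- stats::setNames(table[[column]], table$code)
  function(codes) unname(lookup[codes])
}

toy_to_2d <- make_level_converter(toy, "soc2d")
toy_to_2d(c("11-1021", "11-1031", "99-9999"))
#> [1] "11-0000" "11-0000" NA
```

The benefit of the factory is that the lookup is built once and captured in the returned function, so the converter can be reused inside dplyr::mutate without passing the coding system around.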
Checking for invalid codes
Sometimes you want to check if your data has invalid codes. socR has a few ways of checking codes. If you have a coding system, you can create a function using the provided factory method valid_code, which takes either a coding system or a vector of codes. This is why the data had to have a column named code: the codingsystem knows which column is the code column and can create a list of all the valid codes for you. If you want, you could replace the codingsystem object with a vector of valid codes.
is_valid_soc2000 <- valid_code(soc2000_all)
is_valid_soc2000( c("11-0000","11","11-1021","11-1030") )
#> [1] TRUE FALSE TRUE TRUE
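The same factory pattern can be sketched in base R for validation. This is an illustration under stated assumptions, not socR's valid_code: make_validator, toy_codes, and is_valid_toy are hypothetical names, and a plain data frame with a code column stands in for a codingsystem.

```r
# Hypothetical factory: accept either a data frame with a `code` column
# or a plain character vector of valid codes.
make_validator <- function(x) {
  valid <- if (is.data.frame(x)) x[["code"]] else x
  function(codes) codes %in% valid
}

# A tiny stand-in for the list of valid codes.
toy_codes <- c("11-0000", "11-1021", "11-1030")
is_valid_toy <- make_validator(toy_codes)

is_valid_toy(c("11-0000", "11", "11-1021", "11-1030"))
#> [1]  TRUE FALSE  TRUE  TRUE
```

Because the result is a logical vector, it can be used directly to subset a data set and list the rows whose codes need attention.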
Filtering a coding system
Sometimes you are not interested in the entire coding system, but only in the codes at a particular level. Since a codingsystem is a thin wrapper around a tibble, you can use some of the dplyr verbs (select and filter; I can add others if needed). Now you see why I named the variable soc2000_all. If you get odd errors when you filter, you may be using the wrong filter function. The stats package, which is loaded by default, has its own filter function, so call dplyr::filter explicitly if dplyr is not attached.
soc2000_5d <- soc2000_all |> dplyr::filter(Level == 5,name="soc2000_5d")
soc2000_5d
#> # Coding System: soc2000_5d
#> code title Level Hierarchical_structure parent soc2d soc3d soc5d soc6d
#> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 11-1010 Chief Ex… 5 Broad Occupation 11-10… 11-0… 11-1… 11-1… NA
#> 2 11-1020 General … 5 Broad Occupation 11-10… 11-0… 11-1… 11-1… NA
#> 3 11-1030 Legislat… 5 Broad Occupation 11-10… 11-0… 11-1… 11-1… NA
#> 4 11-2010 Advertis… 5 Broad Occupation 11-20… 11-0… 11-2… 11-2… NA
#> 5 11-2020 Marketin… 5 Broad Occupation 11-20… 11-0… 11-2… 11-2… NA
#> 6 11-2030 Public R… 5 Broad Occupation 11-20… 11-0… 11-2… 11-2… NA
#> 7 11-3010 Administ… 5 Broad Occupation 11-30… 11-0… 11-3… 11-3… NA
#> 8 11-3020 Computer… 5 Broad Occupation 11-30… 11-0… 11-3… 11-3… NA
#> 9 11-3030 Financia… 5 Broad Occupation 11-30… 11-0… 11-3… 11-3… NA
#> 10 11-3040 Human Re… 5 Broad Occupation 11-30… 11-0… 11-3… 11-3… NA
#> # ℹ 439 more rows
## you can check for valid 5-digit soc2000 codes
is_valid_5digit_soc2000 <- valid_code(soc2000_5d)
is_valid_5digit_soc2000( c("11-0000","11","11-1021","11-1030") )
#> [1] FALSE FALSE FALSE TRUE
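The earlier warning about the wrong filter function can be reproduced with base R alone. The sketch below uses a hypothetical two-row data frame named toy rather than a codingsystem, and it assumes no variable named Level exists in your workspace. It shows the kind of odd error you get when stats::filter is called with a dplyr-style condition.

```r
# A toy data frame (hypothetical, not a socR codingsystem).
toy <- data.frame(
  code  = c("11-1000", "11-1010"),
  Level = c(3L, 5L),
  stringsAsFactors = FALSE
)

# stats::filter() is a time-series filter; it does not look up `Level`
# inside the data frame, so a dplyr-style call fails.
result <- tryCatch(
  stats::filter(toy, Level == 5),
  error = function(e) e
)
inherits(result, "error")
#> [1] TRUE

# Base R subsetting gives the rows the dplyr-style call was meant to select.
toy$code[toy$Level == 5]
#> [1] "11-1010"
```

Writing dplyr::filter with the namespace prefix, as the examples in this vignette do, avoids the ambiguity.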
If you need a dplyr verb that I don’t support, ask and I might be able to add it. Otherwise, the workaround is to get the tibble from the codingsystem, which is the table entry of the S3 codingsystem object. Since you now have a tibble, you can continue working with it as any other tibble, or convert it back to a codingsystem using the as_codingsystem function. You will need to give the codingsystem a name, or it will default to something useless like coding system.
soc2000_3d <- soc2000_all$table |> dplyr::filter(Level == 3) |>
as_codingsystem(name="soc2000_3d")
soc2000_3d
#> # Coding System: soc2000_3d
#> code title Level Hierarchical_structure parent soc2d soc3d soc5d soc6d
#> <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 11-1000 Top Exec… 3 Minor Group 11-00… 11-0… 11-1… NA NA
#> 2 11-2000 Advertis… 3 Minor Group 11-00… 11-0… 11-2… NA NA
#> 3 11-3000 Operatio… 3 Minor Group 11-00… 11-0… 11-3… NA NA
#> 4 11-9000 Other Ma… 3 Minor Group 11-00… 11-0… 11-9… NA NA
#> 5 13-1000 Business… 3 Minor Group 13-00… 13-0… 13-1… NA NA
#> 6 13-2000 Financia… 3 Minor Group 13-00… 13-0… 13-2… NA NA
#> 7 15-1000 Computer… 3 Minor Group 15-00… 15-0… 15-1… NA NA
#> 8 15-2000 Mathemat… 3 Minor Group 15-00… 15-0… 15-2… NA NA
#> 9 17-1000 Architec… 3 Minor Group 17-00… 17-0… 17-1… NA NA
#> 10 17-2000 Engineers 3 Minor Group 17-00… 17-0… 17-2… NA NA
#> # ℹ 86 more rows