8 min read

Rectangling IRS 990s-Part-06


irs 990s xml
Working with xml list formats using IRS 990s

View raw source for this post

Summary

This posts converts an IRS 990 in xml format to tabular data.

Table of Contents

Overview

As a way to summarize the previous lessons, this post will take an IRS 990 and convert it from an xml list to a tabular form. The “990” is the IRS form that not-for-profit corporations file annually. The list was downloaded from Pro Publica’s “Nonprofit Explorer” and the organization is “Delivering Good, Inc” an organization that distributes goods and merchandise to those in need.

Getting Data

From the Pro Publica Nonprofit Explorer page, the link for the xml source can be obtained.

delivering_good_990 <- glue::glue(
    "https://s3.amazonaws.com/irs-form-990/202033219349317233_public.xml",
    "?_ga=2.16730738.538037686.1650110873-1696376179.1650110873"
)
# 'dg' = delivering good
xml_dg <- delivering_good_990 |> read_xml() |> as_list()

Inspect List

We’ll inspect the list using the native graphical user interface in Rstudio and the str() command.

Rstudio

The return data is within two branches: the “ReturnHeader” and the “ReturnData.” The information within the header is comprised of metadata regarding the return, which will prove useful if multiple returns are retrieved.

The other branch is the return data. Note that a number of the items of the list contain the word “schedule”. In tax practice, schedules are often interim worksheets and the values are then carried forward to the face sheet. Here, the face sheet is the “IRS990” and the schedules are of lesser interest.

The Rstudio window that allows for the inspection of a list.

Figure 1: The Rstudio window that allows for the inspection of a list.

str()

When working with large lists, the user should target parts of the list or limit its display. Otherwise, the console will be garbled. First, let’s look for the Employer Identification Number (“EIN”) a unique identifier for taxpayers by targeting a single item.

List of 1
 $ : chr "133300271"

Next, we’ll target part of the list and limit console output by level and length.

List of 11
 $ ReturnTs                 :List of 1
 $ TaxPeriodEndDt           :List of 1
 $ PreparerFirmGrp          :List of 3
 $ ReturnTypeCd             :List of 1
 $ TaxPeriodBeginDt         :List of 1
  [list output truncated]
 - attr(*, "binaryAttachmentCnt")= chr "0"

Rectangle preparer address

This effort took me an hour or better. There was a lot of trial and error but my efficiency improved as I became more familiar with the functions. I tended to rely on the unnest_auto() function to start and then switch that function for the one it recommended. The goal was to get the tibble to no longer have list-columns. Don’t get discouraged!

# insert into tibble
dt <- tibble(xml_dg = xml_dg)
# wrangle!
    dt |> 
    unnest_wider(xml_dg) |> 
    select(ReturnHeader) |> 
    unnest_wider(ReturnHeader) |> 
    unnest_wider(PreparerFirmGrp) |> 
    select(1:5) |> 
    unnest_wider(PreparerFirmName) |> 
    unnest_wider(PreparerUSAddress) |> 
    unnest(cols = everything()) |> 
    unnest(cols = everything())
# A tibble: 1 × 8
  ReturnTs       TaxPeriodEndDt PreparerFirmEIN BusinessNameLin… AddressLine1Txt
  <chr>          <chr>          <chr>           <chr>            <chr>          
1 2020-11-16T15… 2019-12-31     271728945       PKF O'CONNOR DA… 665 FIFTH AVEN…
# … with 3 more variables: CityNm <chr>, StateAbbreviationCd <chr>, ZIPCd <chr>

Build tibble

The IRS 990 return contains 297+ pieces of data for just one section of the list. Just winnowing it down to some key pieces of information is helpful. Here, a simple tibble was built using list references from the base package. Again, the Rstudio interface was very helpful and I copied and pasted the table together using the green arrow.

#990 get some info
tibble(
    principal_officer = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["PrincipalOfficerNm"]][[1]],
    address_1 = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["USAddress"]][["AddressLine1Txt"]][[1]],
    city = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["USAddress"]][["CityNm"]][[1]],
    state = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["USAddress"]][["StateAbbreviationCd"]][[1]],
    zip = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["USAddress"]][["ZIPCd"]][[1]],
    website = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["WebsiteAddressTxt"]][[1]],
    total_employees = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["TotalEmployeeCnt"]][[1]],
    gross_receipts = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["GrossReceiptsAmt"]][[1]],
    contributions = xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["CYContributionsGrantsAmt"]][[1]]
)
# A tibble: 1 × 9
  principal_officer address_1          city  state zip   website total_employees
  <chr>             <chr>              <chr> <chr> <chr> <chr>   <chr>          
1 LISA GURWITCH     266 W 37TH STREET… NEW … NY    10018 WWW.DE… 23             
# … with 2 more variables: gross_receipts <chr>, contributions <chr>

Chairperson

The 990 includes a listing of the top 5 highest employees, board officers and board members. This was accomplished and then filtered to just the chair to save vertical space.

# Executives, Officers & Board members 
xml_return <- xml_dg[["Return"]][["ReturnData"]][["IRS990"]]
xml_return |> 
    enframe() |> 
    dplyr::filter(name == "Form990PartVIISectionAGrp") |> 
    unnest_wider(value) |> 
    unnest(cols = everything()) |> 
    select(-name) |> 
    unnest(cols = everything())|> 
    filter(TitleTxt == "CHAIR")
# A tibble: 1 × 9
  PersonNm     TitleTxt AverageHoursPerWeekRt IndividualTrusteeOrDir… OfficerInd
  <chr>        <chr>    <chr>                 <chr>                   <chr>     
1 ANDREA WEISS CHAIR    0.10                  X                       X         
# … with 4 more variables: ReportableCompFromOrgAmt <chr>,
#   ReportableCompFromRltdOrgAmt <chr>, OtherCompensationAmt <chr>,
#   HighestCompensatedEmployeeInd <chr>

990 Data

return <- tibble(dg = xml_dg)
return |> 
    unnest_wider(dg) |> 
    select(ReturnData) |> 
    unnest_wider(ReturnData, names_sep = "_") |> 
    select(ReturnData_IRS990) |> 
    unnest_wider(ReturnData_IRS990, names_sep = "_") |> 
    select_if(~map(.x, length) == 1) |> 
    unnest(cols = everything()) |> 
    unnest(cols = everything())
# A tibble: 1 × 204
  ReturnData_IRS990_Principa… ReturnData_IRS9… ReturnData_IRS9… ReturnData_IRS9…
  <chr>                       <chr>            <chr>            <chr>           
1 LISA GURWITCH               140903914        0                X               
# … with 200 more variables: ReturnData_IRS990_WebsiteAddressTxt <chr>,
#   ReturnData_IRS990_TypeOfOrganizationCorpInd <chr>,
#   ReturnData_IRS990_FormationYr <chr>,
#   ReturnData_IRS990_LegalDomicileStateCd <chr>,
#   ReturnData_IRS990_ActivityOrMissionDesc <chr>,
#   ReturnData_IRS990_VotingMembersGoverningBodyCnt <chr>,
#   ReturnData_IRS990_VotingMembersIndependentCnt <chr>, …

Taxpayer Address

#address
xml_return <- xml_dg[["Return"]][["ReturnData"]][["IRS990"]][["USAddress"]]
xml_return |> 
    enframe() |> 
    unnest(value) |> 
    pivot_wider(names_from = name, values_from = value) |> 
    mutate(across(everything(), ~unlist(.))) |> 
    setNames(c("address", "city", "state", "zip")) |> 
    mutate(name = "Doing Good, Inc.") |> 
    select(name, everything())
# A tibble: 1 × 5
  name             address                      city     state zip  
  <chr>            <chr>                        <chr>    <chr> <chr>
1 Doing Good, Inc. 266 W 37TH STREET 22ND FLOOR NEW YORK NY    10018

Conclusion

Having devoted some effort to learn more about lists, I hope that I can tackle the data format with greater creativity and speed. Truthfully though, I’m simultaneously hoping there will be a “.csv” format offered too. Thanks for reading and thanks to all of those that share their knowledge about lists in R!

Acknowledgements

This blog post was made possible thanks to:

References

[1]
R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2022 [Online]. Available: https://www.R-project.org/
[2]
Y. Xie, C. Dervieux, and A. Presmanes Hill, Blogdown: Create blogs and websites with r markdown. 2022 [Online]. Available: https://CRAN.R-project.org/package=blogdown
[3]
L. Henry and H. Wickham, Purrr: Functional programming tools. 2020 [Online]. Available: https://CRAN.R-project.org/package=purrr
[4]
K. Müller and H. Wickham, Tibble: Simple data frames. 2021 [Online]. Available: https://CRAN.R-project.org/package=tibble
[5]
H. Wickham and M. Girlich, Tidyr: Tidy messy data. 2022 [Online]. Available: https://CRAN.R-project.org/package=tidyr
[6]
H. Wickham, J. Hester, and J. Ooms, xml2: Parse XML. 2021 [Online]. Available: https://CRAN.R-project.org/package=xml2

Disclaimer

The views, analysis and conclusions presented within this paper represent the author’s alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimed as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.

Reproducibility

─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.3 (2022-03-10)
 os       macOS Big Sur/Monterey 10.16
 system   x86_64, darwin17.0
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Chicago
 date     2022-04-25
 pandoc   2.14.1 @ /usr/local/bin/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
 package     * version    date (UTC) lib source
 assertthat    0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
 blogdown    * 1.9        2022-03-28 [1] CRAN (R 4.1.2)
 bookdown      0.25       2022-03-16 [1] CRAN (R 4.1.2)
 brio          1.1.3      2021-11-30 [1] CRAN (R 4.1.0)
 bslib         0.3.1.9000 2022-03-04 [1] Github (rstudio/bslib@888fbe0)
 cachem        1.0.6      2021-08-19 [1] CRAN (R 4.1.0)
 callr         3.7.0      2021-04-20 [1] CRAN (R 4.1.0)
 cli           3.2.0      2022-02-14 [1] CRAN (R 4.1.2)
 codetools     0.2-18     2020-11-04 [1] CRAN (R 4.1.3)
 colorspace    2.0-3      2022-02-21 [1] CRAN (R 4.1.2)
 crayon        1.5.1      2022-03-26 [1] CRAN (R 4.1.0)
 DBI           1.1.2      2021-12-20 [1] CRAN (R 4.1.0)
 desc          1.4.1      2022-03-06 [1] CRAN (R 4.1.2)
 devtools    * 2.4.3      2021-11-30 [1] CRAN (R 4.1.0)
 digest        0.6.29     2021-12-01 [1] CRAN (R 4.1.0)
 dplyr         1.0.8      2022-02-08 [1] CRAN (R 4.1.2)
 ellipsis      0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
 evaluate      0.15       2022-02-18 [1] CRAN (R 4.1.2)
 fansi         1.0.3      2022-03-24 [1] CRAN (R 4.1.2)
 fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
 fs            1.5.2      2021-12-08 [1] CRAN (R 4.1.0)
 generics      0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
 ggplot2     * 3.3.5      2021-06-25 [1] CRAN (R 4.1.0)
 ggthemes    * 4.2.4      2021-01-20 [1] CRAN (R 4.1.0)
 glue          1.6.2      2022-02-24 [1] CRAN (R 4.1.2)
 gtable        0.3.0      2019-03-25 [1] CRAN (R 4.1.0)
 htmltools     0.5.2      2021-08-25 [1] CRAN (R 4.1.0)
 jquerylib     0.1.4      2021-04-26 [1] CRAN (R 4.1.0)
 jsonlite      1.8.0      2022-02-22 [1] CRAN (R 4.1.2)
 knitr         1.38       2022-03-25 [1] CRAN (R 4.1.0)
 lifecycle     1.0.1      2021-09-24 [1] CRAN (R 4.1.0)
 magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.1.2)
 memoise       2.0.1      2021-11-26 [1] CRAN (R 4.1.0)
 munsell       0.5.0.9000 2021-10-19 [1] Github (cwickham/munsell@e539541)
 pillar        1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
 pkgbuild      1.3.1      2021-12-20 [1] CRAN (R 4.1.0)
 pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
 pkgload       1.2.4      2021-11-30 [1] CRAN (R 4.1.0)
 prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.1.0)
 processx      3.5.3      2022-03-25 [1] CRAN (R 4.1.0)
 ps            1.6.0      2021-02-28 [1] CRAN (R 4.1.0)
 purrr       * 0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
 R6            2.5.1      2021-08-19 [1] CRAN (R 4.1.0)
 remotes       2.4.2      2021-11-30 [1] CRAN (R 4.1.0)
 rlang         1.0.2      2022-03-04 [1] CRAN (R 4.1.2)
 rmarkdown     2.13       2022-03-10 [1] CRAN (R 4.1.2)
 rprojroot     2.0.3      2022-04-02 [1] CRAN (R 4.1.0)
 rstudioapi    0.13       2020-11-12 [1] CRAN (R 4.1.0)
 sass          0.4.1      2022-03-23 [1] CRAN (R 4.1.2)
 scales        1.1.1      2020-05-11 [1] CRAN (R 4.1.0)
 sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.1.0)
 stringi       1.7.6      2021-11-29 [1] CRAN (R 4.1.0)
 stringr       1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
 testthat      3.1.3      2022-03-29 [1] CRAN (R 4.1.2)
 tibble      * 3.1.6      2021-11-07 [1] CRAN (R 4.1.0)
 tidyr       * 1.2.0      2022-02-01 [1] CRAN (R 4.1.2)
 tidyselect    1.1.2      2022-02-21 [1] CRAN (R 4.1.2)
 usethis     * 2.1.5      2021-12-09 [1] CRAN (R 4.1.0)
 utf8          1.2.2      2021-07-24 [1] CRAN (R 4.1.0)
 vctrs         0.4.0      2022-03-30 [1] CRAN (R 4.1.2)
 withr         2.5.0      2022-03-03 [1] CRAN (R 4.1.0)
 xfun          0.30       2022-03-02 [1] CRAN (R 4.1.2)
 xml2        * 1.3.3      2021-11-30 [1] CRAN (R 4.1.0)
 yaml          2.3.5      2022-02-21 [1] CRAN (R 4.1.2)

 [1] /Library/Frameworks/R.framework/Versions/4.1/Resources/library

──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────