Summary
Everytime I run into a file with an.xml
extension, I cringe. Though, admittedly, it’s a file format that you have to be familiar with when it comes to sending and receiving data over the web. R
has a package xlm2
to assist in the conversion of nested data to tabular data.
Table of Contents
Overview
The task here is to take an easy and short xml
file and convert it to a dataframe. The xml2
package was specifically designed for the task and is the successor to the xml
package.
Background
Sanity Checks
Let’s face it: you wouldn’t be on this page if you could’ve figured it out using W3schools or the R
help menu. Let’s start with some super basic stuff. Have you opened the .xml
file using a browser? Do that first and see if you can get a sense of how the data is organized. Second, open the xml file in a linter and see if it’s formatted correctly. (Here’s an example) I’ve lost a ton of time double checking code only to figure out that some .json
was missing a “}”. Finally, open it in a good text editor and beautify the code. (I use Atom.) My last project, I “peeled” the outer layers of the onion to get at the data. It was hacky and lacked reproducibility, but it worked.
Definitions
First, it’s important to get some defintions first. (W3schools has a really helpful set of tutorials on the topic.) A “node” in general and speaking in a broad way is an HTML element.
DOM (Document Object Model) - a tree structure that represents the HTML of the website, and every HTML element is a “node”.
Element Nodes - model the actual HTML elements in the document.
Attribute Nodes - model the various attributes in the different HTML elements. Attributes include id, class, title and style.
Text Nodes - model the text content inside the different HTML elements.
Root Node - the node on the very top of the document tree, usually called the document node.
Parent Node - a node that has children; represents an element that has at least one other element or text nested inside it.
Child Node - a node that has a parent; represents an element or text that is nested inside another element. For example, a <p>
tag is often the child of a <div>
tag.
Example
Structure
XML structured data looks like this:
<?xml version="1.0" encoding="UTF-8"?> #xml prologue
<root>
<child>
<subchild>.....</subchild>
</child>
</root>
Xpath
XPath is a syntax for defining and navigating parts of an XML document. They can be used across many languages. Xpath has over 200 functions according to W3schools. See the W3 page for xpath syntax.
XML Data
All of the functions below are from the xlm2
package and the help page.
# get W3 data
url <- "https://www.w3schools.com/xml/simple.xml"
# read breakfast menu
bkfst <- read_xml(x = url)
str(bkfst)
List of 2
$ node:<externalptr>
$ doc :<externalptr>
- attr(*, "class")= chr [1:2] "xml_document" "xml_node"
xml_name(bkfst)
[1] "breakfast_menu"
See XML Data Structure
xml_structure(bkfst)
<breakfast_menu>
<food>
<name>
{text}
<price>
{text}
<description>
{text}
<calories>
{text}
<food>
<name>
{text}
<price>
{text}
<description>
{text}
<calories>
{text}
<food>
<name>
{text}
<price>
{text}
<description>
{text}
<calories>
{text}
<food>
<name>
{text}
<price>
{text}
<description>
{text}
<calories>
{text}
<food>
<name>
{text}
<price>
{text}
<description>
{text}
<calories>
{text}
# see text
xml_text(bkfst)
[1] "Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650Strawberry Belgian Waffles$7.95Light Belgian waffles covered with strawberries and whipped cream900Berry-Berry Belgian Waffles$8.95Light Belgian waffles covered with an assortment of fresh berries and whipped cream900French Toast$4.50Thick slices made from our homemade sourdough bread600Homestyle Breakfast$6.95Two eggs, bacon or sausage, toast, and our ever-popular hash browns950"
Extract XML
The real key is understanding how to define or set the “xpath” argument.
xml_find_all(bkfst, xpath = "//name")
{xml_nodeset (5)}
[1] <name>Belgian Waffles</name>
[2] <name>Strawberry Belgian Waffles</name>
[3] <name>Berry-Berry Belgian Waffles</name>
[4] <name>French Toast</name>
[5] <name>Homestyle Breakfast</name>
xml_text(xml_find_all(bkfst, xpath = "//name"))
[1] "Belgian Waffles" "Strawberry Belgian Waffles"
[3] "Berry-Berry Belgian Waffles" "French Toast"
[5] "Homestyle Breakfast"
Build table
library(tibble)
name <- xml_text(xml_find_all(bkfst, xpath = "//name"))
price <- xml_text(xml_find_all(bkfst, xpath = "//price"))
df <- tibble(name = name, price = price)
df
# A tibble: 5 x 2
name price
<chr> <chr>
1 Belgian Waffles $5.95
2 Strawberry Belgian Waffles $7.95
3 Berry-Berry Belgian Waffles $8.95
4 French Toast $4.50
5 Homestyle Breakfast $6.95
Lists and XML
Lists and xml are similar and easily converted within the xml2
package.
document <- list(root = list(parent = list(child = list(1))))
xmlDoc <- xml2::as_xml_document(document)
xml2::xml_structure(xmlDoc)
<root>
<parent>
<child>
{text}
{xml_nodeset (1)}
[1] <child>1</child>
Conclusion
If you can get a basic understanding of html web pages and xpath syntax, then you should be able to parse xml files efficiently with xlm2
.
Acknowledgements
(Get bibliographic stuff from “archetype hill”.)
References
Disclaimer
The views, analysis and conclusions presented within this paper represent the author’s alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimd as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.
Reproducibility
─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
setting value
version R version 3.6.3 (2020-02-29)
os macOS Catalina 10.15.7
system x86_64, darwin15.6.0
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2021-03-05
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0)
blogdown 1.1.14 2021-02-27 [1] Github (rstudio/blogdown@b0fa6ed)
bookdown 0.21 2020-10-13 [1] CRAN (R 3.6.3)
bslib 0.2.4 2021-01-25 [1] CRAN (R 3.6.2)
cachem 1.0.3 2021-02-04 [1] CRAN (R 3.6.2)
callr 3.5.1 2020-10-13 [1] CRAN (R 3.6.2)
cli 2.3.0 2021-01-31 [1] CRAN (R 3.6.2)
codetools 0.2-18 2020-11-04 [1] CRAN (R 3.6.2)
colorspace 2.0-0 2020-11-11 [1] CRAN (R 3.6.2)
crayon 1.4.1 2021-02-08 [1] CRAN (R 3.6.2)
DBI 1.1.1 2021-01-15 [1] CRAN (R 3.6.2)
desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0)
devtools * 2.3.2 2020-09-18 [1] CRAN (R 3.6.2)
digest 0.6.27 2020-10-24 [1] CRAN (R 3.6.2)
dplyr 1.0.4 2021-02-02 [1] CRAN (R 3.6.2)
ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.2)
evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 3.6.2)
fs 1.5.0 2020-07-31 [1] CRAN (R 3.6.2)
generics 0.1.0 2020-10-31 [1] CRAN (R 3.6.2)
ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 3.6.2)
ggthemes * 4.2.4 2021-01-20 [1] CRAN (R 3.6.2)
glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.2)
gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0)
htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 3.6.2)
jquerylib 0.1.3 2020-12-17 [1] CRAN (R 3.6.2)
jsonlite 1.7.2 2020-12-09 [1] CRAN (R 3.6.2)
knitr 1.31 2021-01-27 [1] CRAN (R 3.6.2)
lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0)
magrittr 2.0.1 2020-11-17 [1] CRAN (R 3.6.2)
memoise 2.0.0 2021-01-26 [1] CRAN (R 3.6.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0)
pillar 1.4.7 2020-11-20 [1] CRAN (R 3.6.2)
pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 3.6.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0)
pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.0)
processx 3.4.5 2020-11-30 [1] CRAN (R 3.6.2)
ps 1.5.0 2020-12-05 [1] CRAN (R 3.6.2)
purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.2)
R6 2.5.0 2020-10-28 [1] CRAN (R 3.6.2)
remotes 2.2.0 2020-07-21 [1] CRAN (R 3.6.2)
rlang 0.4.10 2020-12-30 [1] CRAN (R 3.6.2)
rmarkdown 2.7 2021-02-19 [1] CRAN (R 3.6.3)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 3.6.2)
sass 0.3.1 2021-01-24 [1] CRAN (R 3.6.2)
scales 1.1.1 2020-05-11 [1] CRAN (R 3.6.2)
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0)
stringi 1.5.3 2020-09-09 [1] CRAN (R 3.6.2)
stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0)
testthat 3.0.1 2020-12-17 [1] CRAN (R 3.6.2)
tibble 3.0.6 2021-01-29 [1] CRAN (R 3.6.2)
tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.2)
usethis * 2.0.1 2021-02-10 [1] CRAN (R 3.6.2)
vctrs 0.3.6 2020-12-17 [1] CRAN (R 3.6.2)
withr 2.4.1 2021-01-26 [1] CRAN (R 3.6.2)
xfun 0.21 2021-02-10 [1] CRAN (R 3.6.2)
xml2 * 1.3.2 2020-04-23 [1] CRAN (R 3.6.2)
yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0)
[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library