DC R Conference Recap

A summary of the 2019 DC R Conference at Georgetown University.
rstats
conference
Author

Amanda Peterson

Published

November 12, 2019

The overview: DC R 2019, hosted by Lander Analytics, was held Nov 7-9 at Georgetown University. The short, action-packed conference consisted of one day of workshops and two days of talks. Find the videos on the DC R website (available soon) and associated tweets via the #rstatsdc hashtag on Twitter. Also, see Jared Lander's blog post for a higher-level overview of the conference with lots of fun pics.

Highlights for me personally included Malorie Hughes' presentation on flexdashboard. Somehow, amid the hype around R Shiny, I had missed this dashboarding tool. I was also excited to learn about Dr Dasgupta's coursedown package, which I'm looking forward to testing out for the class that I teach. Another highlight was learning more about the implementation of Google's BERT model in R by Jon Harmon & Jonathan Bratt.

Read more: Below are my notes from selected talks. I've divided them into the following sections: Communication, Modeling, Spatial Analysis, General R Programming, and A Few Remaining Random Bits.

Communication (Dashboards, Websites, and Data Viz)

Coursedown: Managing Course Materials Using R Techniques, Dr Abhijit Dasgupta, @webbedfeet

Dashboarding Like a Boss, Malorie Hughes, @data_all_day

  • Don’t email your results. Create a dashboard!
  • Talk featured a demo of flexdashboard. See more info about flexdashboard on RStudio’s website.
  • Create using R Markdown. It just requires a different output type in the YAML header (see the sketch after this list).
  • See her GitHub repo for templates
  • Tips for making your output look pretty on the dashboard:
    • Use printr to make printing R output look better. Less effort than knitr::kable().
    • Use summarytools. It’s not great YET for R Markdown, but keep tabs on it.
    • For model output, use stargazer
    • DT::datatable() for interactive tables (see this function's advanced options for fancier tables)
    • For interactive plots: highcharter. Assign colors with colorize().
  • She included an export/download data button on her dashboard. See her code to learn how to do that.
  • See Malorie's official DC R dashboard, embedded in the original post.
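For reference, here's a minimal sketch of that YAML setup (the title is made up); the rest of the file is ordinary R Markdown, with level-3 headings becoming chart panels:

```yaml
---
title: "Results Dashboard"              # hypothetical title
output: flexdashboard::flex_dashboard
---
```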

Using networkD3 and R to Visualize and Explore Relationships in Data, Dr Ami Gates, @DrGates309

  • Talk focused on networkD3, which is nice for interactive networks.
  • Other options for visualizing networks: igraph, ggnet2() (part of GGally), and visNetwork
  • arules package for association mining in R. She used it to find associated words, considering both how often words appear together and how often they appear in the same document (in her example, the same tweet)
  • arulesViz - for visualization of association rules
  • She used the igraph package to massage the data before giving it to networkD3
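For flavor, here's a minimal networkD3 sketch with a made-up edge list:

```r
library(networkD3)

# made-up edge list; simpleNetwork() reads the first two columns
# as source and target by default
edges <- data.frame(
  from = c("data", "data", "rstats", "viz"),
  to   = c("rstats", "viz", "viz", "network")
)

# renders an interactive force-directed network htmlwidget
simpleNetwork(edges)
```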

Better DataViz in ggplot2: Tips, Tricks, and Examples, Alex Engler, @AlexCEngler

  1. ggplot2 + sf for bivariate plots: ggplot2::geom_sf(). See the "Drawing beautiful maps with R, sf, and ggplot2" tutorial on r-spatial.org.
  2. ggtext to add text to plots using markdown. Includes text formatting. Showed an example of how to highlight parts of title to match highlighted part of graph.
  3. gghighlight to highlight parts of graph with color (and grey out rest)
  4. ggplot2::annotation_custom() to add a plot on top of another: create a "grob" and drop it somewhere. Use it to zoom in on a map with a bounding box. (My comment: seems similar to cowplot.)
  5. geofacet::facet_geo() to create a plot per state, positioned in the location of the states on the map.
  6. ggplot2 extensions: See www.ggplot2-exts.org/gallery and the rayshader package.
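Here's a minimal sketch of tip 3 with gghighlight on a built-in dataset; the mpg threshold is arbitrary:

```r
library(ggplot2)
library(gghighlight)

# scatterplot where only high-mileage cars keep their color;
# everything else is greyed out
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  gghighlight(mpg > 25)
```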

Modeling

RBERT: Cutting Edge NLP in R, Jon Harmon, @JonTheGeek

  • BERT: Bidirectional Encoder Representations from Transformers. Transfer learning for natural language processing. Created by Google. RBERT is the R implementation.
  • Older embeddings have one vector per word, so they can't distinguish "I saw the branch on the bank." from "I saw the branch of the bank." BERT addresses this issue.
  • RBERTviz: Visualize attention and PCA. Lets you see how BERT is thinking. He showed an interactive demo; go watch the video!
  • BERT (base) has 12 layers.
  • RBERT is usable now (available on GitHub), but Jon and Jonathan are making it more user-friendly. Goal: on CRAN by end of 2019.
  • Works with tensorflow 2.0 (in testing)
  • Example shown: Used RBERT to pull out features for xgboost model.
  • Slides available here
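As a sketch of that downstream step, assuming the BERT features have already been extracted into a numeric matrix (the data below is a random placeholder):

```r
library(xgboost)

# placeholder for a matrix of BERT-derived document features
features <- matrix(rnorm(100 * 8), nrow = 100)
labels   <- rbinom(100, 1, 0.5)  # made-up binary outcome

# gradient-boosted classifier on the extracted features
fit <- xgboost(
  data      = features,
  label     = labels,
  nrounds   = 20,
  objective = "binary:logistic",
  verbose   = 0
)
```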

Optimizing Topic Models for Classification Tasks, Tommy Jones, @thos_jones

  • Performing PhD research to optimize LDA topic models
  • Cousin to LDA: text embeddings (considered state of the art). In his words, people stopped caring about LDA after deep learning came along. However, he thinks the work to determine what makes a good LDA model was never completed, and that LDA might be competitive with deep neural nets if tuned appropriately.
  • Walked through an example of classifying 20newsgroups
  • Tech stack: textmineR (maintained by Tommy), randomForest (simple so he could focus more on LDA), SigOptR, cloudml, magrittr, stringr, parallel (part of R-core)
  • SigOpt: ensemble of Bayesian optimizers for picking hyperparameters (more efficient than grid search). SigOptR is an R API for SigOpt.
  • Used unigrams; removed stopwords and words seen in fewer than 5 docs
  • Optimized for accuracy and coherence. LDA was way more accurate than LSA. Showed a plot of accuracy vs. topic coherence.
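For flavor, here's a minimal sketch of fitting an LDA model with textmineR; the corpus, k, and iteration counts are placeholders, not his settings:

```r
library(textmineR)

# placeholder corpus; his actual data was 20newsgroups
docs <- c("the quick brown fox jumps",
          "the lazy dog sleeps all day",
          "foxes and dogs play outside")

# unigram document-term matrix with English stopwords removed
dtm <- CreateDtm(
  doc_vec      = docs,
  doc_names    = seq_along(docs),
  ngram_window = c(1, 1),
  stopword_vec = stopwords::stopwords("en")
)

# fit an LDA topic model; k and the iteration counts are arbitrary
lda <- FitLdaModel(dtm = dtm, k = 2, iterations = 200, burnin = 150)
```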

Topic Modeling Consumer Complaints at the Federal Reserve, BJ Bloom, @bj_bloom

  • Goal: Use consumer complaints to inform risk analysis for Federal Reserve Board
  • Used LDA topic modeling
  • Main LDA packages in R:
    • lda - old, but fine
    • topicmodels - a special snowflake; you have to re-index to 0, tricky to use
    • stm - recommended; structural topic modeling, can be used to incorporate metadata into the topic model
    • textmineR - recommended; in his opinion, it includes diagnostic measures that make sense
  • In his words, “Don’t use tm. It had its place back in the day.” Instead, he’s a fan of tidytext.
  • Other R packages used: textrank (for pagerank)
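Since he favors tidytext over tm, here's a minimal sketch of a tidytext-style tokenize-and-count step (the complaint text is made up):

```r
library(dplyr)
library(tidytext)

# made-up complaint text
complaints <- tibble(
  id   = 1:2,
  text = c("fees were charged twice on my account",
           "the fees were never refunded")
)

# tokenize, drop stopwords, and count word frequencies
complaints %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```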

Spatial Analysis

The Care and Feeding of Spatial Data, Angela Li, @CivicAngela

  • Slides available here
  • She described the steps before the fancy plots and regression, focusing on the data.
  • Two types of spatial data:
    • Vector data - points, lines, polygons (she focused more on these)
    • Raster data - grids, pixels, cells
  • Geocoding: tmaptools::geocode_OSM("Georgetown University") returns coordinates, translating human-readable place names into computer-understandable spatial data. Note that OSM = "Open Street Map." Talk to your GIS librarian (or use ArcGIS if you have it) to get coordinates.
  • In her words: “Spatial file formats are confusing.”
  • Spatial data formats: .shp (shapefile made up of 4 files), .geojson (lots of web-based mapping), .gpkg (new and good, but not common yet), .csv, .tiff (for raster data)
  • Library she uses: sf package. st_read() or read_sf() to create a tidy sf data frame.
  • Her tech stack: GeoDa, QGIS, ArcGIS, PostGIS, CARTO, R
  • R packages she uses the most (packages in parens are the “comparable” non-spatial versions):
    • sf (readr, tidyr, dplyr)
    • tmap (ggplot2)
    • spdep (glmnet)
    • sp (data.table)
  • rayshader for 3D plotting.
  • Use mapview for GIS-like experience
  • Think beyond dots on a map. Are there clusters? Look at spatial auto-correlation and spatially constrained clustering.
  • Pre-release of rgeoda on GitHub
  • See #rspatial on Twitter
  • Most important slide she's ever made in her life (see the image in the original post).
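As a starting point for the sf workflow she described, here's a minimal sketch that reads a shapefile and plots it, using the sample data bundled with sf:

```r
library(sf)

# st_read() handles .shp, .geojson, .gpkg, and more; here we use
# the North Carolina shapefile that ships with the sf package
nc <- st_read(system.file("shape/nc.shp", package = "sf"))

plot(st_geometry(nc))  # quick base-graphics look at the polygons
```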

R is Not Just Basic Stats: Think Spatially!, Tatyana Tsvetovat, @t_tsvet

General R Programming

Ten Tremendous Tricks in the Tidyverse, David Robinson, @drob

Be sure to catch his #tidytuesday screencasts on YouTube. Ten tricks that people have liked from his screencasts (a combined example follows the list):

  1. count(): One of the most used EDA functions
  2. count(column, sort, wt, name): You can create a new variable within count(). For example, count(decade = 10 * (year %/% 10)) produces counts for each decade.
  3. add_count() adds a count column (really useful to combine with filter to get x with a minimum count)
  4. Use summarize() to create a list column. Looks like data %>% group_by() %>% summarize(). Example: summarize(test = list(t.test(avg_score))). Then use broom to visualize. See chapter 25 of R for Data Science.
  5. Use the combination of count(), fct_reorder() (to turn a column into an ordered factor), geom_col() (use instead of geom_bar(stat = "identity")), and coord_flip() to make sorted bar charts
  6. fct_lump() to combine least/most common levels together
  7. scale_x_log10() or scale_y_log10() to use a log scale. In his words: "I suspect much more data is log-normal than normal."
  8. tidyr::crossing() to find all combinations of multiple vectors. Example: crossing(a = 1:3, b = LETTERS, c = 10:20). Can be used in place of for-loops for simulations.
  9. separate() to split a column into multiple columns based on a regex
  10. extract() to extract information from a column based on a regex
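Here's a minimal sketch combining tricks 1, 5, and 6 on a built-in dataset (lumping to 8 levels is an arbitrary choice):

```r
library(dplyr)
library(forcats)
library(ggplot2)

# count manufacturers (trick 1), lump the rarest together (trick 6),
# then build a sorted horizontal bar chart (trick 5)
mpg %>%
  count(manufacturer = fct_lump(manufacturer, 8), sort = TRUE) %>%
  mutate(manufacturer = fct_reorder(manufacturer, n)) %>%
  ggplot(aes(manufacturer, n)) +
  geom_col() +
  coord_flip()
```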

Raising baby with R, Jared Lander, @jaredlander

  • Jared and his wife recently had a baby. He described his use of R to choose the baby’s name and analyze its eating, sleeping, and poo patterns.
  • See the babynames package for data about popularity of baby names
  • Analyzing time-series data: tsibble, fable, feasts. Use the model() function (from fable) to fit time-series models. Use autoplot() to plot the data. index_by() is the time-series equivalent of group_by().
  • fable.prophet package to use prophet in a fable framework. Used this to perform change-point detection.
  • Slides available here.
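Here's a minimal sketch of that tsibble/fable workflow; the feeding data below is entirely made up:

```r
library(tsibble)
library(fable)

# made-up daily feeding totals
feeds <- tsibble(
  day    = as.Date("2019-10-01") + 0:13,
  ounces = c(20, 22, 21, 24, 23, 25, 24, 26, 25, 27, 26, 28, 27, 29),
  index  = day
)

# fit a model with model(), forecast a week ahead, and plot
feeds %>%
  model(arima = ARIMA(ounces)) %>%
  forecast(h = "7 days") %>%
  autoplot(feeds)
```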

R and Python Coexisting in the Same Development Environment, Dan Chen, @chendaniely (Author of "Pandas for Everyone")

  • What he believes Python does better: environments, web development, and hardware
  • What he likes about R: communication (shiny, rmarkdown, ggplot2)
  • Python environments in R:
    conda_envs <- reticulate::conda_list()          # list available conda environments
    reticulate::use_condaenv(conda_envs$name[[1]])  # have to restart R if you want to change env again
  • Most data scientists using Python use Anaconda.
  • Don't mix Anaconda's R package installation with install.packages()
  • New model pipeline in R: rsample, recipes, parsnip, yardstick
  • Described workflow: use Python's joblib to "pickle" a model in Python, read it into R, then use Shiny to display the results
  • blogdown supports Jupyter notebooks
  • knitpy: a knitr implementation in Python
  • Use Apache Arrow to pass datasets between R and Python
  • Information on installing python here
  • Slides available here
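Here's a minimal sketch of the joblib-to-R step of that workflow via reticulate; the model file and feature matrix are hypothetical:

```r
library(reticulate)

# import Python's joblib and load a previously pickled model;
# the file path is hypothetical
joblib <- import("joblib")
model  <- joblib$load("model.joblib")

# score new data by calling the Python model's predict method from R
newdata <- matrix(rnorm(40), ncol = 4)  # placeholder feature matrix
preds   <- model$predict(r_to_py(newdata))
```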

Managing Your Cloud: Working with APIs, Marck Vaisman, @wahalulu

  • Discussed how to use R to interact with APIs (mostly for the consumption of data)
  • plumber for building and serving up APIs
  • Three packages to be aware of: httr (part of the tidyverse; more abstract and easier than the curl package), curl (a newer package), and jsonlite
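Here's a minimal sketch of consuming a JSON API with those packages; the endpoint URL and query parameters are hypothetical:

```r
library(httr)
library(jsonlite)

# request data from a hypothetical JSON endpoint
resp <- GET("https://api.example.com/data", query = list(limit = 10))
stop_for_status(resp)  # fail loudly on HTTP errors

# parse the JSON body into an R object
parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
```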

A Few Remaining Random Bits