
Finding Datasets in R

Saghir Bashir

Do you need to find an R dataset that you can use for teaching, presentations or reprex? WhatData is there to help you. You can try it out here.

A Dataset of R Datasets

We start by creating a dataset of the available datasets in R on your system (i.e. from both base R and from the packages installed on your system). The following code creates a tidy dataset (all_ds) with package name, dataset name, title and the object class.

library(tidyverse)
library(stringr)
library(DT)

# Function to catch the error for data that is not exported.
unexportedData <- function (x) {
  out <- tryCatch(class(eval(parse(text = x))), error = function(e) "NOT EXPORTED")
  return(out)
}

all_ds <- data(package = .packages(all.available = TRUE)) %>% 
  .$results %>%
  tibble::as_tibble() %>%
  dplyr::mutate(DataOrig = stringr::word(Item, 1)) %>%
  dplyr::mutate(pkgData = paste(Package, DataOrig, sep="::")) %>% 
  dplyr::arrange(pkgData) %>% 
  dplyr::mutate(Class = purrr::map(pkgData, unexportedData)) %>% 
  tidyr::unnest(Class) %>% 
  dplyr::filter(!str_detect(Class, "NOT EXPORTED")) %>% 
  dplyr::select(pkgData, Package, DataOrig, Title, Class) %>%
  dplyr::arrange(pkgData, Class) %>%
  dplyr::mutate(Val = Class) %>%
  tidyr::spread(key = Class, value=Val, fill = "") %>%
  tidyr::unite(Classes, c(-pkgData, -Package, -DataOrig, -Title), sep= " ")

Some key points for the code above:

  • data(package = .packages(all.available = TRUE)) identifies all the data objects in all the packages installed on your system, including those that are not exported.
  • Function unexportedData() catches any error raised while calling class() and labels the result “NOT EXPORTED” (the error could have another cause, but we do not distinguish between causes).
  • Observations are dropped when Class is “NOT EXPORTED”.
  • The remaining code is to create a tidy dataset.
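
As a minimal sketch of the first two points, the snippet below uses only the datasets package that ships with R, so it runs without any add-on packages. The helper is repeated so the snippet stands on its own, and noSuchData is a made-up name used purely to trigger the error branch.

```r
# Same helper as in the main code, repeated so this snippet is self-contained.
unexportedData <- function(x) {
  tryCatch(class(eval(parse(text = x))), error = function(e) "NOT EXPORTED")
}

# data() returns a list; its `results` element is a character matrix
# with one row per data object.
res <- data(package = "datasets")$results
colnames(res)
# "Package" "LibPath" "Item" "Title"

# An exported dataset: class() succeeds.
unexportedData("datasets::mtcars")
# "data.frame"

# A name that the package does not export (hypothetical): the error is caught.
unexportedData("datasets::noSuchData")
# "NOT EXPORTED"
```

Because the helper catches every error, a typo in a package name would be labelled “NOT EXPORTED” as well, which is the simplification noted in the list above.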

Find R Datasets

Interactive Data Table

DT::datatable() can be used to search for datasets and searches can be refined by including a search box on top of each column (using the option filter = "top").

all_ds %>% 
  select(-pkgData) %>% 
  DT::datatable(filter = "top")
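
The table can be tuned a little further. The sketch below is one possible variation, not part of the original workflow: it builds a small stand-in table (demo_ds, a hypothetical name) from the datasets package only, turns Package into a factor so that its column filter should offer a pick-list rather than a free-text box, and sets the page length. The rownames and options arguments are standard DT::datatable() arguments; the widget is only built if DT is installed.

```r
# Small stand-in for all_ds, built with base R only (demo_ds is a made-up name).
res <- data(package = "datasets")$results
demo_ds <- data.frame(
  Package  = factor(res[, "Package"]),        # factor: selectable filter in DT
  DataOrig = sub(" .*", "", res[, "Item"]),   # keep the first word of Item
  Title    = res[, "Title"],
  stringsAsFactors = FALSE
)

# Build the interactive table only when DT is available.
if (requireNamespace("DT", quietly = TRUE)) {
  widget <- DT::datatable(
    demo_ds,
    filter   = "top",                  # search box above each column
    rownames = FALSE,                  # hide the row-number column
    options  = list(pageLength = 25)   # show 25 rows per page
  )
  widget
}
```

The same arguments can be passed when piping all_ds into DT::datatable().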

Using dplyr::filter()

dplyr::filter() can also be used to search for datasets. Searches can be refined using regular expressions (regex).

# Find all tibbles.
all_ds %>% 
  filter(str_detect(Classes, "tbl_df"))

# Keep only the rows with class "ts" as a whole word, without matching strings that merely contain "ts" (such as "datasets").
all_ds %>% 
  filter(str_detect(Classes, regex("\\b(ts)\\b")))

# Find datasets whose title mentions any form of "sleep".
all_ds %>% 
  filter(str_detect(Title, regex("sleep", ignore_case = TRUE)))

# Time series data for econ(omic), stock or share data.
all_ds %>% 
  filter(str_detect(Title, regex("econ|stock|share", ignore_case = TRUE)) &
                  str_detect(Classes, regex("\\b(ts)\\b")))
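
To see what the word-boundary pattern buys you, the sketch below compares a plain "ts" pattern with the bounded one on a few made-up class strings. It uses base R's grepl() rather than str_detect() so that it has no package dependencies; for these inputs str_detect() with the same pattern is expected to give the same answers.

```r
# Made-up strings in the style of the Classes column (illustrative only).
classes <- c("mts ts matrix", "datasets", "ts", "tbl_ts tbl_df data.frame")

# Plain substring match: every string contains the letters "ts".
plain <- grepl("ts", classes)
plain
# TRUE TRUE TRUE TRUE

# Word-boundary match: "ts" must stand alone as a word.
# The underscore counts as a word character, so "tbl_ts" does not match.
bounded <- grepl("\\bts\\b", classes, perl = TRUE)
bounded
# TRUE FALSE TRUE FALSE
```

The parentheses in "\\b(ts)\\b" form a group and do not change which strings match here.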

Summary

We have presented one approach to finding R datasets that you can use for teaching, presentations or reprex. It is a useful way to discover (or be reminded about) datasets in R.