Saghir Bashir
Do you need to find an R dataset that you can use for teaching, presentations or reprex? WhatData is there to help you. You can try it out here.
A Dataset of R Datasets
We start by creating a dataset of the datasets available in R on your system (i.e. from both base R and the packages you have installed). The following code creates a tidy dataset (all_ds) with the package name, dataset name, title and object class.
library(tidyverse)
library(stringr)
library(DT)
# Function to catch the error for data that is not exported.
unexportedData <- function (x) {
out <- tryCatch(class(eval(parse(text = x))), error = function(e) "NOT EXPORTED")
return(out)
}
all_ds <- data(package = .packages(all.available = TRUE)) %>%
.$results %>%
tibble::as_tibble() %>%
dplyr::mutate(DataOrig = stringr::word(Item, 1)) %>%
dplyr::mutate(pkgData = paste(Package, DataOrig, sep="::")) %>%
dplyr::arrange(pkgData) %>%
dplyr::mutate(Class = purrr::map(pkgData, unexportedData)) %>%
tidyr::unnest(Class) %>%
dplyr::filter(!str_detect(Class, "NOT EXPORTED")) %>%
dplyr::select(pkgData, Package, DataOrig, Title, Class) %>%
dplyr::arrange(pkgData, Class) %>%
dplyr::mutate(Val = Class) %>%
tidyr::spread(key = Class, value=Val, fill = "") %>%
tidyr::unite(Classes, c(-pkgData, -Package, -DataOrig, -Title), sep= " ")
Some key points for the code above:
- data(package = .packages(all.available = TRUE)) identifies all the data objects in all the packages installed on your system, including those that are not exported.
- The function unexportedData() catches the errors from using class() and labels them as “NOT EXPORTED” (the errors could be due to other reasons, but we do not distinguish between them).
- Observations are dropped when Class is “NOT EXPORTED”.
- The remaining code creates a tidy dataset by combining each object's classes into a single Classes column.
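The first two points can be checked on a small scale. The sketch below uses only base R and the datasets package that ships with it. The helper name class_or_label and the object name no_such_object are made up for this illustration, and sub() stands in for stringr::word() so that nothing beyond base R is needed.

```r
# Inspect the raw listing for a single package; "datasets" ships with base R.
listing <- data(package = "datasets")$results

# The listing is a character matrix with one row per data object.
colnames(listing)
head(listing[, c("Package", "Item", "Title")], 3)

# Some items look like "beaver1 (beavers)": the object name followed by the
# name of the file it comes from. Keeping the first word gives the object name.
first_word <- sub(" .*$", "", listing[, "Item"])

# Same idea as unexportedData(): return the class, or a label on any error.
# class_or_label is a hypothetical name used only for this illustration.
class_or_label <- function(x) {
  tryCatch(class(eval(parse(text = x))), error = function(e) "NOT EXPORTED")
}

class_or_label("datasets::mtcars")          # an exported data frame
class_or_label("datasets::no_such_object")  # takes the error branch
```

Because any error is given the same label, a misspelled name is indistinguishable from an object that genuinely is not exported, which is the caveat noted in the second point.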
Find R Datasets
Interactive Data Table
DT::datatable() can be used to search for datasets, and searches can be refined by including a search box on top of each column (using the option filter = "top").
all_ds %>%
select(-pkgData) %>%
DT::datatable(filter = "top")
Using dplyr::filter()
dplyr::filter() can also be used to search for datasets. Searches can be refined using regular expressions (regex).
# Find all tibbles.
all_ds %>%
filter(str_detect(Classes, "tbl_df"))
# Keep the rows with class "ts" as a whole word, so that strings that merely contain "ts" (like "datasets") do not match.
all_ds %>%
filter(str_detect(Classes, regex("\\b(ts)\\b")))
# Find datasets with any form of "sleep" in their title.
all_ds %>%
filter(str_detect(Title, regex("sleep", ignore_case = TRUE)))
# Time series datasets whose titles mention econ(omic), stock or share.
all_ds %>%
filter(str_detect(Title, regex("econ|stock|share", ignore_case = TRUE)) &
str_detect(Classes, regex("\\b(ts)\\b")))
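The word boundaries in the "ts" searches above matter because a plain substring search also matches classes that only contain those letters. Below is a minimal sketch of the difference, using base R's grepl() in place of str_detect() so that it runs without extra packages. The strings in classes are made-up examples, not rows taken from all_ds.

```r
# Made-up class strings, similar in shape to the Classes column.
classes <- c("ts", "mts ts", "mts matrix", "tbl_df tbl data.frame")

# Plain substring match: "ts" is also found inside "mts".
substring_match <- grepl("ts", classes, fixed = TRUE)
substring_match   # TRUE TRUE TRUE FALSE

# Whole-word match: \\b marks a word boundary on each side of "ts".
whole_word_match <- grepl("\\bts\\b", classes, perl = TRUE)
whole_word_match  # TRUE TRUE FALSE FALSE
```

The third string is the one that separates the two searches: it contains "mts" but no standalone "ts".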
Summary
We have presented one approach to finding R datasets that you can use for teaching, presentations or reprex. It is a useful way to discover (or be reminded about) datasets in R.