We’ve received a number of questions from our users about dealing with the finer details of data sources on the web. Whether you’re reading data from local storage such as a csv file, a .Rdata store, or possibly a proprietary file format, you’ve most likely run into some issues in the past. Common problems include passing incorrect paths, files being too big for memory, or requiring several packages to read files in incompatible formats. Reading data from the web entails a whole other set of challenges. Although there are many ways to obtain data from the web, this post primarily deals with retrieving data from Application Programming Interfaces also known as APIs....
The iNaturalist project is a really cool way to both engage people in citizen science and collect species occurrence data. The premise is pretty simple, users download an app for their smartphone, and then can easily geo reference any specimen they see, uploading it to the iNaturalist website. It let’s users turn casual observations into meaningful crowdsourced species occurrence data. They also provide a nice robust API to access almost all of their data. We’ve developed a package rinat that can easily access all of that data in R. Our package spocc uses iNaturalist data as one of it’s sources, rinat provides an interface for all the features available in the API....
The rOpenSci projects aims to provide programmatic access to scientific data repositories on the web. A vast majority of the packages in our current suite retrieve some form of biodiversity or taxonomic data. Since several of these datasets have been georeferenced, it provides numerous opportunities for visualizing species distributions, building species distribution maps, and for using it analyses such as species distribution models. In an effort to streamline access to these data, we have developed a package called Spocc, which provides a unified API to all the biodiversity sources that we provide. The obvious advantage is that a user can interact with a common API and not worry about the nuances in syntax that differ between packages. As more data sources come online, users can access even more data without significant changes to their code. However, it is important to note that spocc will never replicate the full functionality that exists within specific packages. Therefore users with a strong interest in one of the specific data sources listed below would benefit from familiarising themselves with the inner working of the appropriate packages.
...
We recently pushed the first version of rnoaa to CRAN - version 0.1. NOAA has a lot of data, some of which is provided via the National Climatic Data Center, or NCDC. NOAA has provided access to NCDC climate data via a RESTful API - which is great because people like us can create clients for different programming languages to access their data programatically. If you are so inclined to write a bit of R code, this means you can get to NCDC data in the R environment where your workflow is reproducible, and you can connect data acquisition to a suite of tools for data manipulation (e.g., plyr), visualization (e.g., ggplot2), and statistics (e.g., lme4, etc.)....
Reproducible research involves the careful, annotated preservation of data, analysis code, and associated files, such that statistical procedures, output, and published results can be directly and fully replicated. As the push for reproducible research has grown, the R community has responded with an increasingly large set of tools for engaging in reproducible research practices (see, for example, the ReproducibleResearch Task View on CRAN). Most of these tools focus on improving one’s own workflow through closer integration of data analysis and report generation. But reproducible research also requires the persistent - and perhaps indefinite - storage of research files so that they can be used to recreate or modify future analyses and reports....