October 5, 2023

Monitoring technologies (e.g., satellites, smartphones, acoustic recorders) are creating vast amounts of data about our earth every day. These data hold promise to provide global insights on everything from biodiversity patterns to vegetation change at increasingly fine spatial and temporal resolution. Leveraging this information often requires us to work with data that is too big to fit in our computer’s “working memory” (RAM) or even to download to our computer’s hard drive. Some data, like high resolution remote sensing imagery, might be too large to download even when we have access to big compute infrastructure.

In this post, I will walk through some tools, terms, and examples to get started with cloud native workflows. Cloud native workflows allow us to remotely access and query large data from online sources or web services (i.e., the “cloud”), all while skipping the need to download large files. We’ll touch on a couple of common data storage types (e.g., Parquet files), storage and cataloging structures (e.g., S3 and STAC) and tools (e.g., Apache Arrow and vsicurl) to query these online data. We’ll focus on an example of accessing big tabular data but will also touch on spatial data. I’ll be sharing resources along the way for deeper dives on each of the topics and tools.

Example code will be provided in R and the case studies focus on environmental data, but these concepts are not specific to a programming language or research domain. I hope that this post provides a starting point for anyone interested in working with big data using cloud native approaches!

Tabular data: Querying data from cloud storage with Apache Arrow
Example: Plotting billions of biodiversity observations through time
The Global Biodiversity Information Facility (GBIF) synthesizes biodiversity data derived from a wide range of sources, from museum specimens to smartphone photos recorded on participatory science platforms. The database includes over a billion species occurrence records: probably not too big to download onto your machine’s hard drive, but likely too big to read into your working memory. Downloading the entire database and then figuring out how to work with it is really a hassle (you can trust me on that, or you can try it yourself!), so we will walk through a cloud native approach to collect and download only the data we need from this database.

Step 1. Finding and accessing the data
There are snapshots (i.e., versions) of the entire GBIF database on Amazon Web Services (AWS) S3. You can check them out and pick a version here. A lot of data is stored on AWS or other S3 compatible platforms (not just observation records of plants and animals), so the following workflow is not specific to GBIF.

What is S3? Simple Storage Service (S3) is just a structure of object storage accessed through a web service interface. S3 compatibility is a requirement for cloud-native workflows (what we are working towards here). Cloud-native applications use the S3 API to communicate with object storage. The GBIF snapshot we use is stored on Amazon Web Services (AWS) S3, but there are other S3 compatible object storage platforms (e.g., MinIO) that you can query with the same methods that we will explore below.

First, we’ll grab the S3 Uniform Resource Identifier (URI) for a recent GBIF snapshot (note the s3:// prefix at the start of the URI).

You may notice that it is a .parquet file.

What is a Parquet file? Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. You can read all about it here. Put simply, it’s an efficient way to store tabular data, the kind you might otherwise store in a .csv file.
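To make the “column-oriented” idea concrete, here is a minimal local sketch using the arrow package. The toy table and its values are made up for illustration; only the column names mirror the GBIF fields used later in this post.

```r
library(arrow)

# Toy occurrence table (illustrative values, not real GBIF records).
obs <- data.frame(
  kingdom     = c("Animalia", "Plantae", "Animalia", "Fungi"),
  countrycode = c("US", "US", "CA", "US"),
  year        = c(2001L, 2001L, 2002L, 2003L)
)

# Write the table to a temporary Parquet file.
path <- tempfile(fileext = ".parquet")
write_parquet(obs, path)

# Because Parquet stores data column by column, a reader can
# request only the columns it needs instead of the whole table.
two_cols <- read_parquet(path, col_select = c("kingdom", "year"))
print(names(two_cols))
```

The same selective reading is what keeps remote queries cheap: columns that a query never touches do not have to be transferred.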

We’ll use the R package arrow, which supports reading datasets from cloud storage without having to download them (allowing large queries to run locally by downloading only the parts of the dataset that are needed). You can read more about Apache Arrow if you’re interested in digging deeper into other applications (or in interfacing with Arrow from other programming languages like Python). Note that arrow doesn’t require that the data is a Parquet file: it also works with “csv”, “text” and “feather” file formats.

library(arrow)
library(tidyverse)

The open_dataset function in the arrow package connects us to the online Parquet file (in this case, the GBIF snapshot from above).

Notice that this doesn’t load the entire dataset into your computer’s memory but, conveniently, we can still check out the variable names (and even take a glimpse at a small subset of the data!)


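db <- open_dataset(gbif_snapshot) # gbif_snapshot: placeholder name for the S3 URI string chosen above
db # take a look at the variables we can query
glimpse(db) # see what a subset of the data looks like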

Step 2. Querying the data
The arrow package provides a dplyr interface to arrow datasets. We’ll use this interface to perform the query. Note that this approach is a bit limited, since only a subset of dplyr verbs is available for querying arrow dataset objects (there is a nice resource here). Verbs like filter, group_by, summarise, and select are all supported.
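As a small offline illustration of this interface, the sketch below builds a toy Parquet dataset in a temporary directory and queries it with dplyr verbs. The data are invented; the point is only that the query is described lazily and nothing is read into R until collect() is called.

```r
library(arrow)
library(dplyr)

# Toy occurrence table (illustrative values, not real GBIF records).
obs <- data.frame(
  kingdom     = c("Animalia", "Animalia", "Plantae", "Animalia", "Fungi"),
  countrycode = c("US", "US", "US", "CA", "US"),
  year        = c(2001L, 2001L, 2001L, 2002L, 2003L)
)

# Write it out as a Parquet dataset and connect to it without loading it.
dataset_dir <- tempfile()
write_dataset(obs, dataset_dir, format = "parquet")
ds <- open_dataset(dataset_dir)

# Describe the query with dplyr verbs; this does not run anything yet.
query <- ds |>
  filter(countrycode == "US") |>
  group_by(kingdom, year) |>
  count()

# collect() executes the query and returns the (small) result to R.
result <- collect(query)
print(result)
```

Swapping the temporary directory for an s3:// URI is, in principle, the only change needed to run the same kind of query against remote storage.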

So, let’s filter GBIF to all observations in the United States and get a count of observations for each kingdom per year. We use the collect() function to pull the results of our query into our R session.


gbif_US <- db |>
  filter(countrycode == "US") |>
  group_by(kingdom, year) |>
  count() |>
  collect()

Within minutes we can plot the number of observations by kingdom through time!

gbif_US |>
  drop_na() |>
  filter(year > 1990) |>
  ggplot(aes(x = year,
             y = log(n),
             color = kingdom)) +
  geom_line() + theme_bw()

There are many other applications of arrow that we won’t get into here (e.g., zero-copy R and Python data sharing). Check out this cheat sheet if you are interested in exploring more!

Raster data: STAC and vsicurl
Cloud native workflows are not limited to querying tabular data (e.g., Parquet, CSV files) as explored above, but can also be useful for working with spatiotemporal data. We’ll focus on reading and querying data using the SpatioTemporal Asset Catalog (STAC). STAC is just a common language to describe geospatial information, so that it can more easily be worked with, indexed, and discovered. More info here.

A lot of spatiotemporal data is indexed by STAC. For example, Microsoft’s Planetary Computer data catalog includes petabytes of environmental monitoring data in STAC format.

We’ll use the rstac library to make requests to the Planetary Computer’s STAC API (and, as with our tabular example, similar workflows are available in Python).

library(rstac)
library(sf)
library(terra)

Now, we search for data available on the Planetary Computer!

s_obj <- stac("https://planetarycomputer.microsoft.com/api/stac/v1/")

We can set a bounding box for our data search. Let’s say, the San Francisco Bay Area!

SF <- st_read("…/panorama/redlining/static/downloads/geojson/CAS…") # URL is truncated in the source; st_read() (sf) is assumed as the reader
bbox <- st_bbox(SF)
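For readers who want to try the bounding box step without the file above, here is a minimal sketch that builds a bounding box from two corner points with sf. The coordinates are illustrative values roughly around San Francisco, not taken from the original data.

```r
library(sf)

# Two illustrative corner points (longitude, latitude) in WGS84.
corners <- st_sfc(
  st_point(c(-122.6, 37.6)),
  st_point(c(-122.2, 38.0)),
  crs = 4326
)

# st_bbox() returns xmin, ymin, xmax, ymax for any sf geometry.
bbox <- st_bbox(corners)
print(bbox)
```

A STAC search expects the box in this same xmin, ymin, xmax, ymax order, which is why the output of st_bbox() can be handed to the search step directly.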

And we can take a look at a given “collection” of data (e.g., Landsat data):

it_obj <- s_obj %>%
  stac_search(collections = "landsat-c2-l2",
              bbox = bbox) %>%
  get_request() |>
  items_sign(sign_fn = sign_planetary_computer())

Instead of downloading the raster, we use the prefix /vsicurl/ and the download URL from above, passing that directly to the rast() function (which reads in a raster; from the terra package).
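The prefixing step itself is just string concatenation. In the sketch below, the URL is a made-up placeholder standing in for a signed asset link from the search results; it does not point at a real file.

```r
# Hypothetical asset URL (placeholder, not a real Landsat scene).
asset_url <- "https://example.com/landsat/scene_red.tif"

# GDAL treats paths starting with /vsicurl/ as remote files read over HTTP.
vsicurl_path <- paste0("/vsicurl/", asset_url)
print(vsicurl_path)

# With terra installed and a real URL, the raster could then be opened
# without downloading the whole file first:
# r <- terra::rast(vsicurl_path)
```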

We can do computation over the network too, for example calculating a vegetation index (NDVI) using a subset of bands in Landsat imagery. This allows us to download only the results into memory.
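NDVI is computed per pixel as (NIR - Red) / (NIR + Red). The sketch below applies that formula to two tiny made-up reflectance matrices in base R; with terra, the same arithmetic can be written on the red and near-infrared raster layers.

```r
# Toy 2 x 2 reflectance grids (illustrative values, not real Landsat data).
red <- matrix(c(0.10, 0.20, 0.30, 0.40), nrow = 2)
nir <- matrix(c(0.50, 0.60, 0.30, 0.20), nrow = 2)

# NDVI is computed element-wise: values near 1 suggest dense vegetation,
# values near 0 or below suggest bare ground, water, or built surfaces.
ndvi <- (nir - red) / (nir + red)
print(round(ndvi, 3))
```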

Cloud native workflows can be incredibly convenient and powerful when working with big data. Here, we walked through cloud native workflows for both tabular data and spatial data, using Arrow and vsicurl. This is a very brief introduction, but I hope the links provided help jumpstart your ability to use these tools in your own work!