Download data from a STAC API using R, rstac, and GDAL
This tutorial walks through querying a STAC API using the rstac R package, and downloading data from the API using rstac or GDAL (via sf). It assumes that you’re already familiar with R and with working with spatial data.
As all of the packages we’ll be using are available from CRAN, you can install them (if necessary) using `install.packages()`:

```r
install.packages("sf")
install.packages("rstac")
install.packages("terra")
```
STAC APIs are servers that provide access to a set of data which users can query and retrieve. You can find a partial list of STAC APIs at STAC Index, https://stacindex.org/. For this tutorial, we’ll be downloading data from Microsoft’s Planetary Computer, which currently provides more than 100 data sets for free via a central STAC API.
To start downloading data, we’ll first need to let rstac know which STAC API we want to query and download from. To do so, we’ll pass the URL of the Planetary Computer STAC API to the `rstac::stac()` function:

```r
stac_source <- rstac::stac(
  "https://planetarycomputer.microsoft.com/api/stac/v1"
)
stac_source
```

```
###RSTACQuery
- url: https://planetarycomputer.microsoft.com/api/stac/v1
- params:
- field(s): version, base_url, endpoint, params, verb, encode
```
As you can see, the output of `stac()` is an `RSTACQuery` object, which contains information about an HTTP query that we might want to run in the future. Under the hood, these objects are normal lists containing information about the query:

```r
str(stac_source)
```

```
List of 6
 $ version : NULL
 $ base_url: chr "https://planetarycomputer.microsoft.com/api/stac/v1"
 $ endpoint: NULL
 $ params  : list()
 $ verb    : chr "GET"
 $ encode  : NULL
 - attr(*, "class")= chr [1:2] "stac" "RSTACQuery"
```
But most of the time, you won’t need to worry about this internal representation; rstac provides many helper functions to access the elements of this list if needed.
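Since these query objects are plain lists, you can also peek at their elements directly with `$`. A quick sketch (the object is recreated here so the chunk is self-contained):

```r
# RSTACQuery objects are normal lists under the hood, so we can
# access their elements directly with `$`:
stac_source <- rstac::stac(
  "https://planetarycomputer.microsoft.com/api/stac/v1"
)
stac_source$base_url
#> [1] "https://planetarycomputer.microsoft.com/api/stac/v1"
stac_source$verb
#> [1] "GET"
```

That said, the helper functions in rstac are the more stable interface; the internal structure of these objects could change between package versions.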
It’s worth highlighting that this object is a representation of a *future* HTTP query, not the results of a query we’ve already run! To actually run these queries, we need to use `rstac::get_request()` (or `rstac::post_request()`, depending on which HTTP verb your STAC API expects). If we use `get_request()` to query the Planetary Computer STAC API, we get a brief description of what this API provides:
```r
rstac::get_request(stac_source)
```

```
###STACCatalog
- id: microsoft-pc
- description:
Searchable spatiotemporal metadata describing Earth science datasets hosted by the Microsoft Planetary Computer
- field(s): type, id, title, description, stac_version, conformsTo, links
```
Because the `RSTACQuery` object is a representation of a future query, we can use other functions in rstac to change our query parameters and fields before we actually make a request. For instance, we can use `rstac::collections()` to update our request to query the `/collections` endpoint of the Planetary Computer API, which lists all the available collections (which, to quote the STAC Collection specification, “describe a group of Items that share properties and metadata”):
```r
collections_query <- stac_source |>
  rstac::collections()
collections_query
```

```
###RSTACQuery
- url: https://planetarycomputer.microsoft.com/api/stac/v1
- params:
- field(s): version, base_url, endpoint, params, verb, encode
```
While it might not look like much has changed, under the hood our `collections_query` object has a new `collections` class to indicate that we’re querying the collections endpoint, not the top-level STAC endpoint:

```r
class(stac_source)
```

```
[1] "stac"       "RSTACQuery"
```

```r
class(collections_query)
```

```
[1] "collections" "RSTACQuery"
```
And as a result, when we use `get_request()` to turn this query specification into a query result, we get a list of the collections available from this API:
```r
available_collections <- rstac::get_request(collections_query)
available_collections
```

```
###STACCollectionList
- collections (122 item(s)):
- daymet-annual-pr
- daymet-daily-hi
- 3dep-seamless
- 3dep-lidar-dsm
- fia
- sentinel-1-rtc
- gridmet
- daymet-annual-na
- daymet-monthly-na
- daymet-annual-hi
- ... with 112 more collection(s).
- field(s): collections, links
```
These collections are a subset of the data sets available in the Planetary Computer data catalog; the Planetary Computer is organized so that each collection corresponds to a distinct data set in the catalog.
For our purposes today, we’re going to be querying the USGS Land Change Monitoring, Assessment, and Projection (LCMAP) collection, which provides (among other things) annual land cover classifications for the continental United States. We can query what items are available for this collection using the `rstac::stac_search()` function. We’ll limit our search to 2021 using the `datetime` argument, ask for up to 999 items (the most the Planetary Computer will return in a single request) using the `limit` argument, and only search within the LCMAP collection using the collection’s ID of `usgs-lcmap-conus-v13`:
```r
rstac::stac_search(
  q = stac_source,
  collections = "usgs-lcmap-conus-v13",
  datetime = "2021-01-01/2021-12-31",
  limit = 999
)
```

```
###STACItemCollection
- features (422 item(s)):
- LCMAP_CU_032003_2021_V13_CCDC
- LCMAP_CU_031006_2021_V13_CCDC
- LCMAP_CU_031004_2021_V13_CCDC
- LCMAP_CU_031003_2021_V13_CCDC
- LCMAP_CU_031002_2021_V13_CCDC
- LCMAP_CU_030007_2021_V13_CCDC
- LCMAP_CU_030006_2021_V13_CCDC
- LCMAP_CU_030005_2021_V13_CCDC
- LCMAP_CU_030004_2021_V13_CCDC
- LCMAP_CU_030003_2021_V13_CCDC
- ... with 412 more feature(s).
- assets:
browse, dates, lcachg, lcachg_metadata, lcpconf, lcpconf_metadata, lcpri, lcpri_metadata, lcsconf, lcsconf_metadata, lcsec, lcsec_metadata, rendered_preview, sclast, sclast_metadata, scmag, scmag_metadata, scmqa, scmqa_metadata, scstab, scstab_metadata, sctime, sctime_metadata, tilejson
- item's fields:
assets, bbox, collection, geometry, id, links, properties, stac_extensions, stac_version, type
```
We can see that there are 422 items inside this catalog for 2021. Collectively, these items contain all LCMAP data for the continental United States for 2021, with each item containing a number of assets covering a relatively small chunk of the nation. These assets are “object[s] that [contain] a URI to data associated with the Item that can be downloaded or streamed”, to quote the spec; here, assets are things like “primary land cover classification” or “a metadata object”.
To keep things simple, we’ll start off downloading data for a relatively small region, namely North Carolina’s Ashe County. We’ll use data included in the sf package to get the county’s geometry:
```r
ashe <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))[1, ]
sf::st_geometry(ashe) |> plot()
```
To filter our query down to just tiles intersecting this region, we’ll need to provide the bounding box of the county in WGS84 as a query parameter. We can use `sf::st_transform()` to reproject the county and `sf::st_bbox()` to find our bounding box:
```r
ashe_bbox <- ashe |>
  sf::st_transform(4326) |>
  sf::st_bbox()
ashe_bbox
```

```
     xmin      ymin      xmax      ymax 
-81.74091  36.23448 -81.23970  36.58977 
```
We can then pass this bounding box directly to the `bbox` argument of `stac_search()` to filter our results down further:
```r
stac_query <- rstac::stac_search(
  q = stac_source,
  collections = "usgs-lcmap-conus-v13",
  bbox = ashe_bbox,
  datetime = "2021-01-01/2021-12-31"
)
executed_stac_query <- rstac::get_request(stac_query)
executed_stac_query
```

```
###STACItemCollection
- features (1 item(s)):
- LCMAP_CU_025011_2021_V13_CCDC
- assets:
browse, dates, lcachg, lcachg_metadata, lcpconf, lcpconf_metadata, lcpri, lcpri_metadata, lcsconf, lcsconf_metadata, lcsec, lcsec_metadata, rendered_preview, sclast, sclast_metadata, scmag, scmag_metadata, scmqa, scmqa_metadata, scstab, scstab_metadata, sctime, sctime_metadata, tilejson
- item's fields:
assets, bbox, collection, geometry, id, links, properties, stac_extensions, stac_version, type
```
As luck would have it, Ashe County is entirely contained in a single LCMAP tile.
There are a few different ways to download the assets associated with this item. First, we could use the `rstac::assets_download()` function. Before we can download assets, however, we’re going to need to authenticate ourselves with the Planetary Computer. Note that this doesn’t require an account or registration; rather, it’s a way for Microsoft to uniquely identify your Planetary Computer session and impose rate limits if needed. This authentication process is built into rstac, via the `items_sign()` and `sign_planetary_computer()` functions:
```r
signed_stac_query <- rstac::items_sign(
  executed_stac_query,
  rstac::sign_planetary_computer()
)
signed_stac_query
```

```
###STACItemCollection
- features (1 item(s)):
- LCMAP_CU_025011_2021_V13_CCDC
- assets:
browse, dates, lcachg, lcachg_metadata, lcpconf, lcpconf_metadata, lcpri, lcpri_metadata, lcsconf, lcsconf_metadata, lcsec, lcsec_metadata, rendered_preview, sclast, sclast_metadata, scmag, scmag_metadata, scmqa, scmqa_metadata, scstab, scstab_metadata, sctime, sctime_metadata, tilejson
- item's fields:
assets, bbox, collection, geometry, id, links, properties, stac_extensions, stac_version, type
```
We didn’t need to log in or register anywhere, but now have authenticated our requests and will be able to download assets from this STAC API.
We’ll then use this signed query object with `assets_download()` to actually retrieve our assets. We’ll provide the names of the assets we want to download; let’s start by downloading `lcpri`, which contains the primary land cover classifications for this area:
```r
rstac::assets_download(signed_stac_query, "lcpri", output_dir = tempdir())
```
We’re downloading these into a temporary directory (using `tempdir()`) so that these example files will be cleaned up after we finish running this tutorial. In practice, you’ll likely want to set `output_dir` somewhere else, where the data will persist for as long as you need it.
The `assets_download()` function will save your data in a sub-directory of `output_dir` that corresponds to the URL the asset was downloaded from. In this case, that winds up being quite a long folder path, but once we type it out, we can load our downloaded raster into R and use it with terra and other raster packages as normal:
```r
output_file <- file.path(
  tempdir(),
  "lcmap",
  "CU",
  "V13",
  "025011",
  "2021",
  "LCMAP_CU_025011_2021_20220721_V13_CCDC",
  "LCMAP_CU_025011_2021_20220629_V13_LCPRI.tif"
) |>
  terra::rast()

terra::plot(output_file)
```
The `assets_download()` function is an easy and straightforward way to download assets associated with a STAC item, and by passing multiple asset names (or leaving the `asset_names` argument `NULL`) we could use it to download more than one asset at a time. This function can also handle any format an asset is stored in, making it a flexible way to download rasters, metadata, vectors, or whatever other data a STAC API provides.
The main downside of this approach, however, is that we can’t use it to download only specific parts of an asset, which means we can wind up downloading much more data than we need. For instance, look at how far our land cover raster extends beyond Ashe County’s borders:
```r
terra::plot(output_file)
ashe |>
  sf::st_transform(sf::st_crs(output_file)) |>
  sf::st_geometry() |>
  plot(add = TRUE, lwd = 3)
```
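If you’ve already downloaded a full tile this way, one option is to trim it down locally. A sketch using terra, assuming the `output_file` raster and `ashe` geometry from above (this isn’t part of the original workflow, just one way to handle the extra data):

```r
# Reproject the county boundary into the raster's CRS, then crop the
# raster to the county's extent and mask out pixels beyond its borders:
ashe_proj <- sf::st_transform(ashe, sf::st_crs(output_file))
ashe_lcpri <- output_file |>
  terra::crop(ashe_proj) |>
  terra::mask(terra::vect(ashe_proj))
terra::plot(ashe_lcpri)
```

Of course, this only saves disk space and plotting time, not bandwidth; the whole tile still had to be downloaded first.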
An alternative approach for downloading raster assets is to use GDAL’s virtual file system interface. If the STAC server stores the asset in a cloud-friendly format, such as a cloud-optimized GeoTIFF (COG), then GDAL can download only the necessary pieces of the raster, potentially speeding up download times and reducing bandwidth usage.
To download assets using GDAL, we need to extract the URL for our assets from our unsigned STAC query results. We can get the URL for the `lcpri` asset using `rstac::assets_url()`:
```r
lcpri_url <- rstac::assets_url(executed_stac_query, "lcpri")
lcpri_url
```

```
[1] "https://landcoverdata.blob.core.windows.net/lcmap/CU/V13/025011/2021/LCMAP_CU_025011_2021_20220721_V13_CCDC/LCMAP_CU_025011_2021_20220629_V13_LCPRI.tif"
```
In order for this URL to work with GDAL’s virtual filesystem interface, we’ll need to prepend a few things to the URL. Namely, we’ll need to add:

- `/vsicurl`, to specify that we want to use the HTTP-based virtual filesystem interface,
- `pc_url_signing=yes`, to use GDAL’s built-in method for signing requests to Microsoft’s Planetary Computer,
- and `pc_collection=usgs-lcmap-conus-v13`, to let GDAL know which collection we need authentication for.
All of these configuration parameters are documented in the official GDAL documentation.
We can write a small function that will paste these parameters on to any URL we want:
```r
make_vsicurl_url <- function(base_url) {
  paste0(
    "/vsicurl",
    "?pc_url_signing=yes",
    "&pc_collection=usgs-lcmap-conus-v13",
    "&url=",
    base_url
  )
}
```
We might have chosen to make this function a bit more flexible by adding a `collection` argument, so that we could sign requests for data in any collection we’d want. For our purposes here, however, this simpler function should suffice.
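For the curious, such a variant might look like the following. This is a hypothetical helper (the name `make_pc_vsicurl_url` and its `collection` argument are our own, not from rstac or GDAL), given a different name so it doesn’t shadow the simpler function used in the rest of the tutorial:

```r
# A more flexible variant that signs requests for any Planetary
# Computer collection, passed in via the `collection` argument:
make_pc_vsicurl_url <- function(base_url, collection) {
  paste0(
    "/vsicurl",
    "?pc_url_signing=yes",
    "&pc_collection=", collection,
    "&url=",
    base_url
  )
}

make_pc_vsicurl_url("https://example.com/raster.tif", "usgs-lcmap-conus-v13")
#> [1] "/vsicurl?pc_url_signing=yes&pc_collection=usgs-lcmap-conus-v13&url=https://example.com/raster.tif"
```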
We can then use this function to modify our LCMAP URL:
```r
lcpri_url <- make_vsicurl_url(lcpri_url)
```
Next, we’ll download the `lcpri` asset from this URL using `sf::gdal_utils()`, which provides access to GDAL’s C++ utilities. We’ll use the included interface to gdalwarp to both download and reproject our data, using the `t_srs` and `te` arguments to gdalwarp to control the spatial reference system and the extent of our downloaded raster:
```r
out_file <- tempfile(fileext = ".tif")
sf::gdal_utils(
  "warp",
  source = lcpri_url,
  destination = out_file,
  options = c(
    "-t_srs", sf::st_crs(ashe)$wkt,
    "-te", sf::st_bbox(ashe)
  )
)
```
This download runs a good bit faster than using `assets_download()`, and results in a smaller raster covering only our area of interest, reprojected into the same CRS as our original data:
```r
terra::rast(out_file) |>
  terra::plot()
ashe |>
  sf::st_geometry() |>
  plot(lwd = 3, add = TRUE)
```
At this point, we’ve walked through how to use rstac to query a STAC API, and how to use either rstac or GDAL to download assets found in these queries. It’s worth highlighting that so far we’ve only needed to write about 40 lines of code; here’s everything we’ve walked through so far, presented as a single script instead of in a bunch of chunks:
```r
ashe <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))[1, ]

ashe_bbox <- ashe |>
  sf::st_transform(4326) |>
  sf::st_bbox()

stac_query <- rstac::stac(
  "https://planetarycomputer.microsoft.com/api/stac/v1"
) |>
  rstac::stac_search(
    collections = "usgs-lcmap-conus-v13",
    bbox = ashe_bbox,
    datetime = "2021-01-01/2021-12-31"
  ) |>
  rstac::get_request()

make_vsicurl_url <- function(base_url) {
  paste0(
    "/vsicurl",
    "?pc_url_signing=yes",
    "&pc_collection=usgs-lcmap-conus-v13",
    "&url=",
    base_url
  )
}

lcpri_url <- make_vsicurl_url(rstac::assets_url(stac_query, "lcpri"))

out_file <- tempfile(fileext = ".tif")
sf::gdal_utils(
  "warp",
  source = lcpri_url,
  destination = out_file,
  options = c(
    "-t_srs", sf::st_crs(ashe)$wkt,
    "-te", sf::st_bbox(ashe)
  )
)

terra::rast(out_file) |>
  terra::plot()
ashe |>
  sf::st_geometry() |>
  plot(lwd = 3, add = TRUE)
```
What if we wanted to download more than one raster asset? We could pass multiple asset names to `assets_download()`. For instance, if we wanted to also download the secondary land cover classification for this area (called `lcsec`), we could write:
```r
rstac::assets_download(
  signed_stac_query,
  c("lcpri", "lcsec"),
  output_dir = tempdir(),
  overwrite = TRUE
)
```
Note that we needed to set `overwrite = TRUE` in order to re-download the `lcpri` asset that we had already retrieved.
These rasters are now somewhere in our `output_dir` directory. We can find their file paths using `list.files()` in order to load and visualize them:
```r
list.files(
  file.path(tempdir(), "lcmap"),
  recursive = TRUE,
  full.names = TRUE
) |>
  terra::rast() |>
  terra::plot()
```
If we wanted to download multiple assets using sf and GDAL, we’d need to run our `gdal_utils()` call for each asset we wanted to download. For instance, to download both `lcpri` and `lcsec`, we could run:
```r
vapply(
  make_vsicurl_url(rstac::assets_url(stac_query, c("lcpri", "lcsec"))),
  function(asset_url) {
    out_file <- tempfile(fileext = ".tif")
    sf::gdal_utils(
      "warp",
      source = asset_url,
      destination = out_file,
      options = c(
        "-t_srs", sf::st_crs(ashe)$wkt,
        "-te", sf::st_bbox(ashe)
      )
    )
    out_file
  },
  character(1)
) |>
  terra::rast() |>
  terra::plot()
```
What if we wanted to download the same asset from more than one item? For instance, what if we wanted to download `lcpri` for all of North Carolina?
We could construct our STAC query in almost exactly the same way as before, except using the entire `nc` data frame instead of just its first row:
```r
nc <- sf::read_sf(system.file("shape/nc.shp", package = "sf"))

nc_bbox <- nc |>
  sf::st_transform(4326) |>
  sf::st_bbox()

stac_query <- rstac::stac(
  "https://planetarycomputer.microsoft.com/api/stac/v1"
) |>
  rstac::stac_search(
    collections = "usgs-lcmap-conus-v13",
    bbox = nc_bbox,
    datetime = "2021-01-01/2021-12-31"
  ) |>
  rstac::get_request()

signed_query <- stac_query |>
  rstac::items_sign(
    rstac::sign_planetary_computer()
  )
signed_query
```

```
###STACItemCollection
- features (20 item(s)):
- LCMAP_CU_029012_2021_V13_CCDC
- LCMAP_CU_029011_2021_V13_CCDC
- LCMAP_CU_028013_2021_V13_CCDC
- LCMAP_CU_028012_2021_V13_CCDC
- LCMAP_CU_028011_2021_V13_CCDC
- LCMAP_CU_028010_2021_V13_CCDC
- LCMAP_CU_027013_2021_V13_CCDC
- LCMAP_CU_027012_2021_V13_CCDC
- LCMAP_CU_027011_2021_V13_CCDC
- LCMAP_CU_027010_2021_V13_CCDC
- ... with 10 more feature(s).
- assets:
browse, dates, lcachg, lcachg_metadata, lcpconf, lcpconf_metadata, lcpri, lcpri_metadata, lcsconf, lcsconf_metadata, lcsec, lcsec_metadata, rendered_preview, sclast, sclast_metadata, scmag, scmag_metadata, scmqa, scmqa_metadata, scstab, scstab_metadata, sctime, sctime_metadata, tilejson
- item's fields:
assets, bbox, collection, geometry, id, links, properties, stac_extensions, stac_version, type
```
As you can see, we now have 20 items to download, rather than just 1!
We barely need to change our code at all to download all 20 `lcpri` assets from these items using `assets_download()`. We’re simply going to set `progress = FALSE` in order to keep the output from this function clean. Other than that, we don’t need to change a thing:
```r
rstac::assets_download(
  signed_query,
  "lcpri",
  output_dir = tempdir(),
  overwrite = TRUE,
  progress = FALSE
)
```
These files are each saved into their own folder inside of `output_dir`. We can use `list.files()` to get the paths to each of them, and then `terra::sprc()` and `terra::mosaic()` to combine them into a single raster:
```r
lcpri <- list.files(
  file.path(tempdir(), "lcmap"),
  pattern = "LCPRI.tif",
  recursive = TRUE,
  full.names = TRUE
) |>
  lapply(terra::rast) |>
  terra::sprc() |>
  terra::mosaic()

terra::plot(lcpri)
nc |>
  sf::st_transform(sf::st_crs(lcpri)) |>
  sf::st_geometry() |>
  plot(lwd = 3, add = TRUE)
```
Note that we’ve lost our color palette due to `mosaic()`, but this is the same data we’ve been working with throughout the tutorial.
Alternatively, we can use our `gdal_utils()` call, without any edits, to download and merge these files together. Remember that we’re using our unsigned query results here!
```r
out_file <- tempfile(fileext = ".tif")
sf::gdal_utils(
  "warp",
  source = make_vsicurl_url(rstac::assets_url(stac_query, "lcpri")),
  destination = out_file,
  options = c(
    "-t_srs", sf::st_crs(nc)$wkt,
    "-te", sf::st_bbox(nc)
  )
)

terra::rast(out_file) |>
  terra::plot()
nc |>
  sf::st_geometry() |>
  plot(lwd = 3, add = TRUE)
```
Note that gdalwarp does not interpolate between overlapping pixels, and instead simply uses the value from whichever pixel it processed last. That’s fine here, where our tiles shouldn’t overlap (and pixel values should be identical if they do), but means you shouldn’t use gdalwarp to combine images from multiple time periods into a single composite.
It’s worth highlighting that these are just two methods among many for downloading this data; you may also find packages such as terra or gdalcubes useful for getting data from STAC APIs!