A new way to query mass amounts of geospatial raster data

Nov 5, 2013

Éric St-Jean

Ecometrica’s Our Ecosystem (OE) online geospatial analysis platform applies complex models to multiple layers of geospatial data and gives you results on arbitrary polygons containing many millions of cells within seconds. And that’s without using any of these tricks:

  1. pre-calculating the results of the models to create a cache layer of computed results
  2. pre-calculating intermediate results in advance at coarser zoom levels
  3. pre-calculating results for “likely query polygons”

OE does actually do #3 for pre-defined polygons so that certain queries will come back in a fraction of a second instead of seconds, but it can still re-calculate results from scratch within a few seconds. We might implement those tricks later on, and others do use those tricks, but using them translates to a less flexible system – changing any or all of the models or data or queries or polygons to be queried means a round of expensive pre-calculations, whereas OE needs none. How does this work?

Background

There are many ways to edit, combine, and query geospatial raster data. There are even quite a few for displaying such data online, including some really gorgeous ones. Some of those online mapping offerings are done in Flash or Silverlight, requiring a plugin, while others are done with pure web standards, and will work on most browsers on most devices. They all vary in functionality, ease of use and user experience, and will allow you to let your clients visualize any number of points, vectors, polygons, or raster data on top of beautiful base maps.

[Figure: OE NBM screenshot. Ecometrica’s Normative Biodiversity Metric, visible *and* queryable]

That’s all well and good for looking at pre-aggregated data in a geographical context, such as population by age group per city or state. And that does cover most of the use cases for maps online (apart from, well, wanting to look at actual maps online). However, many organisations need to do actual analysis on geographical data: for example, figuring out whether a given area’s forests are growing, or detecting land use for an area (farming, and which crop; wooded area, and what type of trees; bogs; and so on). Doing this requires one to:

  • download raw satellite data for the area
  • find or develop a model that gives you the data you need, from the data you have
  • purchase desktop Geographical Information System (GIS) software (such as ArcGIS)
  • apply this model to the data, for the areas you need
  • use the software to implement the model, define your areas, and apply the model
  • wait for the calculations to end
  • integrate the results into some kind of report
  • rinse and repeat for each area, date range, or model tweaks

Some organisations will hire people in-house for these tasks. Others will outsource the work. But, most of the time, the work will be done with these steps. In some cases, you might be lucky, and the metric you’re after is something others also want, so there’s either free data out there already, or data available for purchase for cheaper than if you had had to hire someone to do it. The data still won’t be for the area you need, however, so you’ll likely have to download the data, open it in some kind of GIS, import your areas, and run some type of aggregation. Or you’ll be using the value from an area, such as a state, that is close to your area of interest in the original data.
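To make that aggregation step concrete, here is a minimal sketch of what "run some type of aggregation" over an imported area amounts to. It is a naive illustration and not how OE works internally: the raster is assumed to be a plain list of rows with square cells and its origin at the lower-left corner, a cell counts as inside the area when its centre falls inside the polygon, and the names `point_in_polygon` and `zonal_sum` are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon [(x0, y0), (x1, y1), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zonal_sum(raster, polygon, origin=(0.0, 0.0), cell_size=1.0):
    """Sum the values of all cells whose centre lies inside the polygon.

    Assumes row 0 is the bottom row (y increases with the row index).
    Returns (total, number_of_cells_counted).
    """
    ox, oy = origin
    total = 0.0
    count = 0
    for row, values in enumerate(raster):
        cy = oy + (row + 0.5) * cell_size
        for col, value in enumerate(values):
            cx = ox + (col + 0.5) * cell_size
            if point_in_polygon(cx, cy, polygon):
                total += value
                count += 1
    return total, count


# A made-up 4x4 raster and a square area of interest covering its lower-left corner.
raster = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
area = [(0, 0), (2, 0), (2, 2), (0, 2)]
total, count = zonal_sum(raster, area)
print(total, count)
```

This loop visits every cell of the raster for every query, so its cost grows with the size of the data rather than the size of the area, which is one reason the conventional workflow involves waiting.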

Let’s give a simple example. Many companies buy carbon credits now, to offset the emissions caused by their activities (which Ecometrica Sustainability can measure accurately!). Ever wondered where those credits come from? Some will come from areas being forested, in Brazil for example, by a company. That company, let’s call them NewForest, bought rights to a large unforested area (an old farm, for example). They plant trees and sell the carbon credits to another organisation; let’s call it CleanCarbon. CleanCarbon buys credits from many organisations, not just NewForest, and then resells those credits to companies as carbon offsets. But how does CleanCarbon know for sure that NewForest is actually planting those trees? Or that the trees are actually growing and storing carbon?

They can send personnel on-site, but it would be near impossible to visit the entire site, for all the sites. They can manually look at satellite pictures, but how do they know that a green area is a tall forest, and not just bushes? So they might hire a GIS specialist, or outsource the work to a GIS company, who will then use models to estimate the above-ground carbon stored in the vegetation from satellite imagery in different bands. They’ll have to keep paying to do this regularly, say on a yearly basis, for all the areas they buy from.

OE Can Help You

Our Ecosystem is not simply an online mapping product. Frankly, others have made better products in that regard, and we acknowledge that: for example, we are switching our base maps to those provided by MapBox (if you want an easy way to create beautiful custom maps without ever touching any GIS software, that’s a great place to go, by the way).

However, we don’t think anybody has made something like OE. OE lets you solve CleanCarbon’s problem, and many others. It lets you run complicated data models on geographical data for any arbitrary polygon in seconds. In a case like our fictional CleanCarbon, they could pick one of the metrics we have already developed and get their own branded OE site, and get results for any area they’re interested in, including historical results, and receive automated reports on the growth of the areas they’ve invested in. They could also define an alert, something we are now implementing, so that any deforestation in the area (caused by tree cutting or forest fires) would trigger an alert to them. OE would allow them to calculate the results for any area they’d like, not just the pre-defined areas of interest, within seconds, and download the results as a spreadsheet or PDF report.

Of course, they also have the option of getting our science team to implement custom metrics that suit them better. How is that different from the current state of hiring a specialist or outsourcing the work? In this case, once the metric or model has been defined and put into OE, and the sources set up, OE will just carry on the work for them, and there’s no need to pay humans to re-do the actual work and run the reports. Furthermore, these new metrics will be fully queryable for any area, and CleanCarbon will get their alerts and reports automatically when new data comes in. Think of it as a one-time customisation fee to develop bespoke metrics, indicators, alerts and reports.

The Technology

So, how are we actually pulling this off, when even desktop software will make you wait for your results?

OE includes proprietary technology under the hood to do this: algorithms and storage formats that have evolved over years of research and development. In the very beginning, we started using off-the-shelf technology – such as PostGIS – that exists to store and query geographical information. Although these technologies are great for making proof-of-concept products, they start breaking down when you have millions of points, and become completely unusable when you must query from multiple layers of hundreds of millions of geolocated data points and/or complex polygons.

We then moved to geohashing, which is a way to encode geographical coordinates into a single string, whose length encodes the resolution of the coordinates. If you think about this well enough, you’re encoding cells – boxes. Your cells must fall on pre-determined locations for a given resolution, and the resolutions are also fixed, but this means that a single string refers to a cell of a given size at a given location. This, databases can actually index very efficiently and query rapidly. This was our second incarnation of the storage and querying engine, and it gave us over an order of magnitude in query speed and queryable number of cells.

Of course, I can’t at this point give away the keys to the castle, so I can’t really discuss the specifics of the technology that we developed after that. We went through some more iterations, every time improving by at least an order of magnitude in:

  • how much data we could store and select from in a single query
  • how many cells could end up in a given polygon for querying
  • how complex the polygon could be

for a given maximum time to return results. This time was always on the order of seconds or tens of seconds. Although when running a report in the background, users can easily deal with it taking minutes or even hours, OE needs to be able to give them the results, including full-blown reports, on any arbitrary polygon that they would draw or upload then immediately query, and in that use case, it’s simply not acceptable for the system to be taking minutes to give you the data you need.

Publishing Data

So far we have discussed using OE for querying and reporting metrics on geospatial data, on predefined or arbitrary areas, over several layers of raw or processed data. Another important use case is the dissemination of data. Say you’ve run an R&D project where you’ve developed a brand new way of quantifying regrowth. In the end, you might publish an article or two, and possibly make the raw and processed data available online, for example as a GeoTIFF of a per-cell regrowth indicator at a given point in time.

And then, that’s it. You just hope people will use it.

OE means it doesn’t have to be the end of your project. The model could be incorporated into OE as a new metric available to all, or available for sale to those who need it. Or you could make the model output available on OE, again to all for free, or for sale with a lower resolution preview version. Furthermore, everyone can query the data, and even use your data in their own models, without having to download huge files, and use complicated GIS software.

Upcoming

In coming posts we’ll discuss a new resource tracing system being developed, which allows organisations to track resources linked to locations, such as carbon credits or organic coffee, in a secure and traceable manner, and more!
