Multi-Target Spatial Prediction using the Meuse Dataset

Here we use the Meuse dataset, which is included in the pyspatialml package, as an example of fitting a spatial model and generating predictions. First, import the required packages:

from copy import deepcopy
from tempfile import NamedTemporaryFile

import geopandas as gpd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm

from pyspatialml import Raster
from pyspatialml.preprocessing import xy_coordinates, distance_to_corners
import pyspatialml.datasets.meuse as ms

We can access the datasets using the pyspatialml.datasets module:
predictor_files = ms.predictors
training_pts_file = ms.meuse

stack = Raster(predictor_files)
stack.names
dict_keys(['chnl_dist', 'dem', 'dist', 'ffreq', 'landimg2', 'landimg3', 'landimg4', 'mrvbf', 'rsp', 'slope', 'soil', 'twi'])
Pyspatialml implements pandas-style indexing for Raster objects, using Raster.loc to index by the name of the raster and Raster.iloc to select by index; the iloc method also accepts slices. Label-based indexing is also provided directly by the __getitem__ magic method, i.e. Raster[name] for a single layer, or Raster[(name1, name2)] for multiple layers.
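As a quick illustration, these indexing styles could be used as follows (a minimal sketch based on the API described above, using layer names from the stack we just created):

# label-based selection of a single layer
dem = stack.loc['dem']

# positional selection; slices are also accepted
first_three = stack.iloc[0:3]

# __getitem__ shorthand for one layer or a tuple of layers
dem = stack['dem']
subset = stack[('dem', 'slope')]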
For example we can remove layers from Raster object using the Raster.drop method, or by subsetting the raster:
stack.drop('ffreq')
Raster Object Containing 11 Layers
attribute values
0 names [chnl_dist, dem, dist, landimg2, landimg3, lan...
1 files [/Users/stevenpawley/GitHub/Pyspatialml/pyspat...
2 rows 104
3 cols 78
4 res (40.0, 40.0)
5 nodatavals [-99999.0, -99999.0, -1.0, -1.0, -1.0, -1.0, -...
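The same result can be achieved by subsetting, as mentioned above (a short sketch using the tuple-based indexing described earlier):

# keep every layer except 'ffreq' via label-based subsetting
keep = tuple(name for name in stack.names if name != 'ffreq')
subset = stack[keep]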
We can store matplotlib cmaps as an attribute within each layer in the Raster:
stack.chnl_dist.cmap = 'RdBu'
stack.dem.cmap = 'terrain'
stack.dist.cmap = 'Reds'
stack.landimg2.cmap = 'Greys'
stack.landimg3.cmap = 'Greys'
stack.landimg4.cmap = 'Greys'
stack.mrvbf.cmap = 'jet'
stack.rsp.cmap = 'gnuplot2'
stack.slope.cmap = 'PuRd'
stack.soil.cmap = 'Set2'
stack.twi.cmap = 'coolwarm'

Plot the predictors in the Raster object as a raster matrix:
mpl.style.use('seaborn-v0_8')

axs = stack.plot(figsize=(9, 7))

# the 'soil' layer (panel 10) is categorical, so limit its colorbar ticks to the class values
ax = axs.flatten()[10]
im = ax.images
im[0].colorbar.set_ticks([1, 2, 3])

# rotate the x tick labels on panel 8 to prevent overlap
ax = axs.flatten()[8]
ax.tick_params(axis='x', labelrotation=65)

plt.tight_layout()
plt.show()
Feature Engineering
We want the prediction results to depend on the spatial locations of the training data, so to include spatial information, coordinate grids can be generated and added to the Raster object:
xy_layer = xy_coordinates(
    layer=stack.iloc[0],
    file_path=NamedTemporaryFile(suffix=".tif").name
)
edms = distance_to_corners(
    layer=stack.iloc[0],
    file_path=NamedTemporaryFile(suffix=".tif").name
)

edms.rename(
    {old: new for (old, new) in zip(edms.names, ["tl", "tr", "bl", "br", "c"])},
    in_place=True
)
edms.plot()
plt.show()
Append them to the Raster object:
stack = stack.append([xy_layer, edms])

Plot the new predictors:
axs = stack.plot(figsize=(9, 7))
ax = axs.flatten()[10]
im = ax.images
im[0].colorbar.set_ticks([1,2,3])
ax = axs.flatten()[8]
ax.tick_params(axis='x', labelrotation=65)
plt.tight_layout()
plt.show()
The area filled by some of the grids differs. This doesn't matter for the prediction, because pixels in the Raster object that contain NaNs in any of the layers are removed. However, the plots would look cleaner if the grids shared the same extent, and we can use the Raster.intersect method to achieve this:
stack = stack.intersect()

axs = stack.plot(figsize=(9, 7))
ax = axs.flatten()[10]
im = ax.images
im[0].colorbar.set_ticks([1,2,3])
ax = axs.flatten()[8]
ax.tick_params(axis='x', labelrotation=65)
plt.tight_layout()
plt.show()
Read the Meuse Dataset
training_pts = gpd.read_file(training_pts_file)
training_pts.head()

|   | cadmium | copper | lead | zinc | elev | dist | om | ffreq | soil | lime | landuse | dist.m | geometry |
|---|---------|--------|------|------|------|------|----|-------|------|------|---------|--------|----------|
| 0 | 11.7 | 85.0 | 299.0 | 1022.0 | 7.909 | 0.001358 | 13.6 | 1 | 1 | 1 | Ah | 50.0 | POINT (181072.000 333611.000) |
| 1 | 8.6 | 81.0 | 277.0 | 1141.0 | 6.983 | 0.012224 | 14.0 | 1 | 1 | 1 | Ah | 30.0 | POINT (181025.000 333558.000) |
| 2 | 6.5 | 68.0 | 199.0 | 640.0 | 7.800 | 0.103029 | 13.0 | 1 | 1 | 1 | Ah | 150.0 | POINT (181165.000 333537.000) |
| 3 | 2.6 | 81.0 | 116.0 | 257.0 | 7.655 | 0.190094 | 8.0 | 1 | 2 | 0 | Ga | 270.0 | POINT (181298.000 333484.000) |
| 4 | 2.8 | 48.0 | 117.0 | 269.0 | 7.480 | 0.277090 | 8.7 | 1 | 2 | 0 | Ah | 380.0 | POINT (181307.000 333330.000) |
Plot the training points:
from mpl_toolkits.axes_grid1 import make_axes_locatable
fig, axs = plt.subplots(2, 3, figsize=(8.5, 7))
for i, (ax, target) in enumerate(zip(axs.ravel(), ['cadmium', 'copper', 'lead', 'zinc', 'om'])):
    ax.set_title(target.title())

    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="10%", pad=0.05)
    training_pts.plot(column=target, legend=True, ax=ax, cax=cax, cmap='viridis')

    if i != 0:
        ax.set_yticklabels([])

    if i != 3:
        ax.set_xticklabels([])
    else:
        ax.tick_params(axis='x', labelrotation=65)

# remove the unused sixth panel
fig.delaxes(axs.flatten()[i + 1])
plt.tight_layout()
plt.show()
Extract Raster Values at the Training Point Locations
Pixel values from a Raster object can be extracted using geometries within a geopandas.GeoDataFrame (points, lines, polygons) or by using labelled pixels from another raster with the same dimensions and CRS.
By default the extracted values are returned as a geopandas.GeoDataFrame that contains the data and the coordinates of the pixels:
training_df = stack.extract_vector(gdf=training_pts)

training_df.index = training_df.index.get_level_values("geometry_idx")
training_df = training_df.merge(
    training_pts.loc[:, ("lead", "cadmium", "copper", "zinc", "om")],
    left_index=True,
    right_index=True
)
training_df = training_df.dropna()
training_df.head()

| geometry_idx | chnl_dist | dem | dist | ffreq | landimg2 | landimg3 | landimg4 | mrvbf | rsp | slope | ... | tr | bl | br | c | geometry | lead | cadmium | copper | zinc | om |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 3214.0 | 0.001358 | 1.0 | 97.0 | 92.0 | 192.0 | 3.523824e-06 | 0.000000 | 1.423307 | ... | 12.369317 | 119.268608 | 100.717430 | 55.470715 | POINT (181072.000 333611.000) | 299.0 | 11.7 | 85.0 | 1022.0 | 13.6 |
| 1 | 79.849854 | 3402.0 | 0.012224 | 1.0 | 160.0 | 183.0 | 183.0 | 9.879866e-06 | 0.082085 | 1.286004 | ... | 13.928389 | 117.046997 | 98.858482 | 53.235325 | POINT (181025.000 333558.000) | 277.0 | 8.6 | 81.0 | 1141.0 | 14.0 |
| 2 | 0.000000 | 3277.0 | 0.103029 | 1.0 | 178.0 | 209.0 | 179.0 | 1.340742e-03 | 0.000000 | 0.674711 | ... | 10.295630 | 119.281181 | 98.412399 | 55.226807 | POINT (181165.000 333537.000) | 199.0 | 6.5 | 68.0 | 640.0 | 13.0 |
| 3 | 184.743164 | 3563.0 | 0.190094 | 1.0 | 114.0 | 135.0 | 152.0 | 6.547428e-07 | 0.192325 | 1.413479 | ... | 8.485281 | 120.208153 | 97.185387 | 56.035702 | POINT (181298.000 333484.000) | 116.0 | 2.6 | 81.0 | 257.0 | 8.0 |
| 4 | 16.768555 | 3406.0 | 0.277090 | 1.0 | 133.0 | 154.0 | 151.0 | 1.588824e-03 | 0.016689 | 0.531276 | ... | 11.661903 | 117.004272 | 93.193344 | 52.801514 | POINT (181307.000 333330.000) | 117.0 | 2.8 | 48.0 | 269.0 | 8.7 |
5 rows × 25 columns
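Extraction using labelled pixels from another raster, rather than vector geometries, might look like the following (a sketch only: the labelled raster file is hypothetical, and the extract_raster signature may differ between pyspatialml versions):

import rasterio

# hypothetical single-band raster of labelled pixels, aligned with the stack
with rasterio.open("labelled_pixels.tif") as src:
    labelled_df = stack.extract_raster(src)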
Developing a Machine Learning Model
Here we are going to create a machine learning pipeline that correctly handles categorical predictors via one-hot encoding:
stack.names
dict_keys(['chnl_dist', 'dem', 'dist', 'ffreq', 'landimg2', 'landimg3', 'landimg4', 'mrvbf', 'rsp', 'slope', 'soil', 'twi', 'x_coordinates', 'y_coordinates', 'tl', 'tr', 'bl', 'br', 'c'])
from sklearn.pipeline import Pipeline
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
soil_idx = [i for i, name in enumerate(stack.names) if name == 'soil']

trans = ColumnTransformer([
    ('ohe', OneHotEncoder(categories='auto', handle_unknown='ignore'), soil_idx)
], remainder='passthrough')

et = ExtraTreesRegressor(n_estimators=500, n_jobs=-1, random_state=1234)
et = Pipeline([
    ('preproc', trans),
    ('regressor', et)
])

Now we can separate our response and predictor variables and train the model:
X = training_df.loc[:, stack.names]
y = training_df.loc[:, ['lead', 'cadmium', 'copper', 'zinc', 'om']]
et.fit(X, y)
Pipeline(steps=[('preproc',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('ohe',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  [10])])),
                ('regressor',
                 ExtraTreesRegressor(n_estimators=500, n_jobs=-1,
                                     random_state=1234))])
To evaluate the performance of the model, we will use 10-fold cross validation:
from sklearn.model_selection import cross_validate

scores = cross_validate(et, X, y, scoring='neg_mean_squared_error', cv=10, n_jobs=1)
rmse = np.sqrt(-scores['test_score']).mean()

print("Our RMSE score is {}".format(rmse))
Our RMSE score is 105.19227221271413
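Because this is a multi-target model and the five responses have very different magnitudes, the pooled RMSE above is dominated by the targets with the largest values (zinc and lead). A per-target breakdown could be computed as follows (a sketch using scikit-learn's cross_val_predict; not part of the original workflow):

from sklearn.model_selection import cross_val_predict

# out-of-fold predictions for each of the five responses
y_pred = cross_val_predict(et, X, y, cv=10, n_jobs=1)

# RMSE for each target separately
rmse_per_target = np.sqrt(((y.to_numpy() - y_pred) ** 2).mean(axis=0))
for target, score in zip(y.columns, rmse_per_target):
    print("{}: {:.2f}".format(target, score))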
Feature Importances
# Build feature names matching the ColumnTransformer output: the one-hot
# encoded 'soil' classes come first, followed by the passthrough predictors
ohe = et.named_steps['preproc'].named_transformers_['ohe']
soil_names = ['soil{}'.format(int(cat)) for cat in ohe.categories_[0]]
passthrough = deepcopy(list(stack.names))
passthrough.remove('soil')
ohe_names = np.array(soil_names + passthrough)

mpl.style.use('ggplot')
fimp = et.named_steps['regressor'].feature_importances_
fig, ax = plt.subplots(figsize=(4, 6))
ax.barh(y=ohe_names[fimp.argsort()], width=fimp[fimp.argsort()])
ax.set_xlabel('Feature Importance Score')
plt.show()
Prediction on the Raster object
preds = stack.predict(et)
preds.rename(
    {old: new for old, new in zip(preds.names, ['lead', 'cadmium', 'copper', 'zinc', 'om'])},
    in_place=True
)

preds.lead.cmap = 'rainbow'
preds.cadmium.cmap = 'rainbow'
preds.copper.cmap = 'rainbow'
preds.zinc.cmap = 'rainbow'
preds.om.cmap = 'rainbow'
/Users/stevenpawley/Library/Caches/pypoetry/virtualenvs/pyspatialml-NqZ1tMUm-py3.11/lib/python3.11/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but OneHotEncoder was fitted with feature names
  warnings.warn(
Plot the results:
preds.plot(out_shape=(200, 200), title_fontsize=14, figsize=(10, 8))
plt.show()
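The prediction raster could also be saved to disk; a minimal sketch, assuming the Raster.write method available in recent pyspatialml versions (the output path is hypothetical):

# write the five prediction bands to a single GeoTiff (hypothetical path)
preds.write("meuse_predictions.tif")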