
How Databricks Powers Stantec’s Flood Predictor Engine


This is a collaborative post between Stantec and Databricks. We thank ML Operations Lead Assaad Mrad, Ph.D., and Data Scientist Jared Van Blitterswyk, Ph.D., of Stantec for their contributions.

 

At Stantec, we set out to develop the first ever rapid flood estimation product. Flood Predictor is built upon an ML algorithm trained on high-quality features. These features are the product of a feature engineering process that ingests data from a variety of open datasets, performs a series of geospatial computations, and publishes the resulting features to a feature store (Figure 1). Maintaining the reliability of the feature engineering pipeline and the observability of source, intermediate, and resultant datasets ensures our product can bring value to communities, disaster management professionals, and governments. In this blog post, we explain some of the challenges we have encountered in implementing production-grade geospatial feature engineering pipelines and how the Databricks suite of features has enabled our small team to deploy production workloads.

Figure 1: Data flow and machine learning overview for Flood Predictor, from raw data retrieval to flood inundation maps. Re-used from [2211.00636] Pi theorem formulation of flood mapping (arxiv.org).

The abundance of remote sensing data provides the potential for rapid, accurate, and data-driven flood prediction and mapping. Flood prediction is not easy, however, as every landscape has unique topography (e.g., slope), land use (e.g., paved residential), and land cover (e.g., vegetation cover and soil type). A successful model must be explainable (an engineering requirement) and general: able to perform well over a wide range of geographical regions. Our initial approach was to use direct derivatives of the raw data (Figure 2) with minimal processing such as normalization and encoding. However, we found that the predictions did not generalize well, model training was computationally expensive, and the model was not explainable. To address these issues, we leveraged the Buckingham Pi theorem to derive a set of dimensionless features (Figure 3); i.e., features that do not depend on absolute quantities but on ratios of combinations of hydrologic and topographic variables. Not only did this reduce the dimensionality of the feature space, but the new features capture the similarity in the flooding process across a wide range of topographies and climate regions.
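To make the idea of a dimensionless feature concrete, here is a minimal PySpark sketch that forms ratio-style features from per-pixel dimensional columns. The column names and the specific combinations are illustrative assumptions for this post, not the exact Pi groups derived in the paper:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical per-pixel table with dimensional columns (names are illustrative):
# hand_m (height above nearest drainage), rain_depth_m, flow_accum_m2, cell_size_m
pixels = spark.read.format("delta").load("/mnt/flood_predictor/raw_pixels")

dimensionless = (
    pixels
    # Ratio of storm depth to height above nearest drainage: both in meters,
    # so the result is dimensionless and comparable across watersheds.
    .withColumn("pi_inundation",
                F.col("rain_depth_m") / (F.col("hand_m") + F.lit(1e-6)))
    # Contributing area normalized by the square of the grid cell size.
    .withColumn("pi_drainage",
                F.col("flow_accum_m2") / (F.col("cell_size_m") ** 2))
)
```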

By combining these features with logistic regression or tree-based machine learning models, we are able to produce flood risk probability maps comparable to those from more robust and complex models. The combination of feature engineering and ML allows us to extend flood prediction to new regions where flood modeling is scarce or unavailable, and it provides the basis for rapid estimation of flood risk at large scale. However, modeling and feature engineering with large geospatial datasets is a complicated problem that can be computationally expensive. Many of these challenges were simplified and became more cost-effective by leveraging capabilities and compute resources within Databricks.

Figure 2: Illustrative maps of the dimensional features first used to build the initial model for the 071200040506 HUC12 (12-digit hydrologic unit code) from the categorization used by the United States Geological Survey (USGS Water Resources: About USGS Water Resources).

 

Figure 3: Maps of dimensionless indices for a single sub-watershed with a 12-digit hydrologic unit code of 071200040506 (HUC12). Compare to Figure 6 for visual confirmation of the predictive power of these dimensionless features.

Each geographical region consists of tens of millions of data points, with data compiled from several different data sources. Computing the dimensionless features requires a diverse set of capabilities (e.g., geospatial), substantial compute power, and a modular design where each "module" or job performs a constrained set of operations on an input with a specific schema. The substantial compute power required meant that we gravitated toward options that were cost-effective yet suitable for large amounts of data, and Databricks Delta Live Tables (DLT) was the answer. DLT brings highly configurable compute with advanced capabilities such as autoscaling and auto-shutdown to reduce costs.
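A minimal sketch of what one such "module" can look like as a Delta Live Tables definition is shown below. The table names, source datasets, and transformations are assumptions for illustration, not our production pipeline:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Per-pixel topographic attributes joined to a common grid")
def topography_features():
    # Hypothetical source table registered upstream in the same pipeline.
    terrain = dlt.read("raw_terrain")
    return terrain.withColumn("slope_pct", F.col("slope") * 100)

@dlt.table(comment="Dimensionless features derived from topography and rainfall")
def dimensionless_features():
    topo = dlt.read("topography_features")
    rain = dlt.read("raw_precipitation")
    return (
        topo.join(rain, on=["huc", "pixel_id"])
            .withColumn("pi_inundation",
                        F.col("rain_depth_m") / (F.col("hand_m") + F.lit(1e-6)))
    )
```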

During the conceptual development of Flood Predictor we placed emphasis on the ability to quickly iterate on data processing and feature creation, and less priority on maintainability and scalability. The result was a monolithic feature engineering codebase, where dozens of table transformations were performed within a few Python scripts and Jupyter notebooks; it was hard to debug the pipeline and monitor the computation. The push to productionize Flood Predictor made it necessary to address these limitations. Automating our feature engineering pipeline with DLT enabled us to implement and monitor data quality expectations in real time. An additional advantage is that DLT breaks the pipeline apart into views and tables on a visually pleasing diagram (Figure 4). Furthermore, we set up data quality expectations to catch bugs in our feature processing.
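Expectations in DLT are declared as decorators on a table definition. A small sketch, assuming illustrative column names and constraints:

```python
import dlt

@dlt.table(comment="Feature table with data quality checks applied")
# Warn-only check: track the fraction of rows with missing coordinates.
@dlt.expect("valid_coordinates", "latitude IS NOT NULL AND longitude IS NOT NULL")
# Hard stop: fail the pipeline update if any pixel has an implausible slope.
@dlt.expect_or_fail("plausible_slope", "slope >= 0 AND slope <= 90")
def validated_features():
    return dlt.read("dimensionless_features")
```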

Figure 4: Part of our Delta Live Tables pipeline implementation for feature engineering.

The high-level pipeline visualization and pipeline expectations make maintenance and diagnostics simpler: we are able to pinpoint failure points, go straight to the offending code and fix it, and load and validate intermediate data frames. For example, we discovered that our transformations were leading to pixels, or data points, being duplicated up to four times in the final feature set (Figure 5). This was more easily detected after setting an "expect or fail" condition on row duplication. The chart in Figure 5 automatically updates at each pipeline run as a dashboard in the Databricks SQL workspace. Within a few hours we were able to determine that the culprit was an edge case where data points at the edge of the maps were not properly geolocated.
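The aggregation behind a dashboard like Figure 5 can be approximated with something like the following; the table and column names here are assumptions for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

features = spark.read.table("flood_predictor.final_features")  # assumed table name

# Count how many times each pixel appears, keep only duplicated pixels, then
# tally them by duplication level (e.g., 2x or 4x) per subwatershed (HUC12).
duplicates = (
    features
    .groupBy("huc", "pixel_id")
    .agg(F.count("*").alias("dup_count"))
    .filter(F.col("dup_count") > 1)
    .groupBy("huc", "dup_count")
    .agg(F.count("*").alias("n_pixels"))
)
duplicates.show()
```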

Figure 5: Databricks dashboard based on a SQL query to count the number of duplicated rows and categorize (color) by the frequency of duplication (2 or 4). huc and HUC12 refer to the 12-digit hydrologic unit code delineating subwatersheds.

We now return to an earlier point: we want our flood prediction system to be as broadly applicable (general) as possible. How can we quantify the ability of a trained machine learning model to generalize beyond the regions within the training datasets? The spatial dependence of points in our datasets requires care when partitioning data into training and test sets. For this we use a variant of cross-validation called spatial cross-validation (e.g., https://www.nature.com/articles/s41467-020-18321-y) to compute evaluation metrics. The overarching idea of spatial cross-validation is to split the features and labels into a number of folds based on geographical location. Then, the model is trained successively on all but one fold, leaving a different fold out at each step. The evaluation metric (e.g., root mean squared error) is computed at each step, and a distribution is obtained. We had 10 subwatersheds with labels to train on, so we applied a 10-fold spatial cross-validation where each fold is a subwatershed. Spatial cross-validation is a pessimistic metric because it may leave out a fold with features that are not representative of the set of training folds, but that is exactly the high standard we want our dimensionless features and model to achieve.
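A sketch of leave-one-subwatershed-out cross-validation in PySpark, assuming a feature table with a `huc` column identifying the subwatershed, illustrative feature column names, and a binary `flooded` label:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()

data = spark.read.table("flood_predictor.training_features")  # assumed table name
feature_cols = ["pi_inundation", "pi_drainage", "pi_roughness"]  # illustrative names
data = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(data)

evaluator = MulticlassClassificationEvaluator(
    labelCol="flooded", predictionCol="prediction", metricName="f1")

scores = []
# Leave one subwatershed (HUC12) out at each step: 10 subwatersheds -> 10 folds.
for huc in [r["huc"] for r in data.select("huc").distinct().collect()]:
    train = data.filter(data.huc != huc)
    test = data.filter(data.huc == huc)
    model = LogisticRegression(labelCol="flooded", featuresCol="features").fit(train)
    scores.append((huc, evaluator.evaluate(model.transform(test))))
```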

With our features and evaluation process in place, the next step is training the model. Fortunately, training a statistical model on a large dataset is straightforward in Databricks. Ten subwatersheds comprise on the order of 100 million pixels, so the full training set does not fit into the memory of most compute nodes. We experimented with different types of models and hyperparameters on a subset of the full data on an interactive Databricks cluster. Once we settled on a given algorithm and parameter grid, we then defined a training job to take advantage of the lower cost of job clusters. We use PySpark's MLlib library to train a scalable Gradient-Boosted Tree classifier for Flood Predictor and let MLflow monitor and track the training job. Choosing the right metric to evaluate a model for flooding is important; for most events, the frequency of 'dry' pixels is much higher than that of flooded pixels (on the order of 10 to 1). We chose the harmonic mean of precision and recall (the F1 score) on the flooded pixels as the measure of model performance. This choice was made because of the large imbalance in our target labels and because classification-threshold invariance is not desirable for our problem.
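A minimal sketch of such a training job, using MLlib's GBTClassifier with MLflow tracking; the table names, feature columns, and hyperparameter values are assumptions for illustration:

```python
import mlflow
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()

train = spark.read.table("flood_predictor.training_features")  # assumed table
test = spark.read.table("flood_predictor.holdout_features")    # assumed table

feature_cols = ["pi_inundation", "pi_drainage", "pi_roughness"]  # illustrative
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
gbt = GBTClassifier(labelCol="flooded", featuresCol="features",
                    maxDepth=5, maxIter=50)
pipeline = Pipeline(stages=[assembler, gbt])

with mlflow.start_run(run_name="flood_predictor_gbt"):
    mlflow.log_params({"maxDepth": 5, "maxIter": 50})
    model = pipeline.fit(train)
    predictions = model.transform(test)
    # F1 on the flooded class handles the ~10:1 dry/flooded imbalance better
    # than accuracy or threshold-invariant metrics.
    f1 = MulticlassClassificationEvaluator(
        labelCol="flooded", predictionCol="prediction",
        metricName="f1").evaluate(predictions)
    mlflow.log_metric("f1", f1)
    mlflow.spark.log_model(model, "model")
```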

Unlike other types of data, a user requests geospatial data by delineating a bounding box, a shapefile, or by referencing geographical entities like cities. A typical request for Flood Predictor output specifies a bounding box in decimal degree coordinates and a precipitation depth, such as 3.4 inches. The server-side application ingests these inputs, in a JSON format, and queries the feature store for the pixel features within the requested bounding box. As is typical for geospatial data services, the returned output is not at the level of a single pixel but of pre-defined tiles containing a given number of pixels. In Flood Predictor's case, each tile contains 256×256 pixels. The advantage of this approach is that it proactively limits the data volume served by the API to maintain satisfactory response times and avoid overloading the database nodes. To achieve this design, we tag each pixel in the database with a tile ID that specifies the tile it belongs to. So, after the client query is ingested, the server-side application finds the tiles that intersect the requested bounding box and predicts flooding for those areas. With this design, Flood Predictor can return high-quality flood predictions for dozens of square miles at 3-10 meters of resolution within only minutes.
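A rough sketch of the tile-lookup step on the server side; the request shape, tile-index table, and column names are assumptions for illustration, not the actual Flood Predictor API:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Example client request: bounding box in decimal degrees plus a storm depth.
request = {
    "min_lon": -88.40, "min_lat": 41.60,
    "max_lon": -88.30, "max_lat": 41.70,
    "rain_depth_in": 3.4,
}

# Each pixel is tagged at ingestion with the ID of its 256x256-pixel tile, and
# each tile's bounding box is kept in a small lookup table (assumed name).
tiles = spark.read.table("flood_predictor.tile_index")
hit_tiles = tiles.filter(
    (tiles.max_lon >= request["min_lon"]) & (tiles.min_lon <= request["max_lon"]) &
    (tiles.max_lat >= request["min_lat"]) & (tiles.min_lat <= request["max_lat"])
).select("tile_id")

# Pull only the pixels belonging to intersecting tiles, then score them.
pixels = (spark.read.table("flood_predictor.feature_store")
          .join(hit_tiles, on="tile_id")
          .withColumn("rain_depth_in", F.lit(request["rain_depth_in"])))
```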

Figure 6: Flood Predictor output compared to the label from 2D hydraulic and hydrologic modeling for the 071200040506 HUC12 and a 10-year storm event. These are categorical maps where "1" denotes flooded areas. The resolution of the map is 3 meters, downsampled by a factor of 100 for visualization purposes.

Flooding is fundamentally a geospatial random process that is constrained by spatial patterns of topography and land cover and affects a significantly large number of communities and assets. This is why flood prediction has been the subject of numerous physical and statistical models of varying quality, yet there exists, to our knowledge, no product comparable to Flood Predictor. Databricks has been a decisive factor in Flood Predictor's success: it was by far the most cost-effective way for a small team to quickly develop a proof-of-concept prediction tool with accessible datasets, as well as to implement production-grade jobs and pipelines.

Backed by Databricks' end-to-end machine learning operations, Stantec's enterprise-grade Flood Predictor helps you get ahead of the next flooding events and save lives through well-designed predictive analytics. Check it out on the Stantec website: Flood Predictor (stantec.com) and on the Microsoft Azure marketplace.


