**2. 2. 1 Hazard inducing factor selection and feature matrix construction**

The various topographic factors extracted from the accuracy- and reliability-verified DEM are important inducing factors for geological disaster susceptibility analysis. Based on the open-source SAGA GIS of Conrad et al.^{[5]}, this paper systematically extracts a variety of inducing factors for the assessment of ice-edge hazards. These factors can be divided into basic topographic factors, hydrological factors, morphometric factors and geotechnical-hydraulic factors. Together with the lithology type and soil type, these topographic factors serve as the predisposing factors for the assessment of stone-sliding susceptibility and comprehensively depict the stone-sliding hazard-inducing environment, though some redundancy among them is inevitable. Drawing on the idea of data warehousing^{[6]}, feature integration can eliminate such redundancies and inconsistencies. The redundancy detection of discrete features is based on Pearson's hypothesis test of the \({\chi }^{2}\) statistic \(q\), which is calculated as Equation (1)^{[7]}. The null hypothesis \({H}_{0}\) of Pearson's \({\chi }^{2}\) test is that the two variables are independent of each other, and \({H}_{0}\) is accepted only if \(q<{\chi }_{1-p}^{2}\left(df\right)\), where \(p\) is the significance level and the degree of freedom is \(df=\left(c-1\right)\times \left(r-1\right)\)^{[7]}.

In the \({\chi }^{2}\) test, for discrete variables *A* and *B*, suppose *A* has *c* unique values \({a}_{1},{a}_{2},\dots ,{a}_{c}\), and *B* has *r* unique values \({b}_{1},{b}_{2},\dots ,{b}_{r}\). The *c* values of *A* form the columns and the *r* values of *B* form the rows of a contingency table, in which each value \({a}_{i}\) meets the corresponding value \({b}_{j}\) to form a joint event \(\left({A}_{i},{B}_{j}\right)\) occupying one cell of the matrix. The resulting statistic is the Pearson \({\chi }^{2}\) statistic, where \({o}_{ij}\) is the observed frequency of the joint event \(\left({A}_{i},{B}_{j}\right)\) and \({e}_{ij}\) is its expected frequency:

\(q=\sum _{i=1}^{c}\sum _{j=1}^{r}\frac{{\left({o}_{ij}-{e}_{ij}\right)}^{2}}{{e}_{ij}}\) (1)
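As an illustrative check of this test, the sketch below builds the contingency table for two discrete features and evaluates independence with SciPy's `chi2_contingency`; the feature values and the 0.05 significance level are toy choices, not the paper's data.

```python
# Hedged sketch of the chi-square redundancy test for two discrete features.
import numpy as np
from scipy.stats import chi2_contingency

def chi2_redundant(a, b, p=0.05):
    """Return True if discrete features a and b are statistically dependent
    (H0 of independence rejected at level p), i.e. one is redundant."""
    # Build the c x r contingency table of joint observed frequencies o_ij.
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(table, (a_idx, b_idx), 1)
    # chi2_contingency computes q = sum (o_ij - e_ij)^2 / e_ij and its p-value.
    q, p_value, df, expected = chi2_contingency(table)
    # Small p-value -> reject H0 (independence) -> features are redundant.
    return p_value < p

# Toy example: b is a pure relabeling of a, so the pair is fully dependent.
a = np.array([0, 0, 1, 1, 2, 2] * 20)
b = a + 10
```

A pair flagged as redundant would then have one member dropped before modeling.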

For any continuous numerical variables *A* and *B*, the classical Pearson correlation test is used for discrimination. The correlation statistic is \({r}_{A,B}\) or its squared form \({r}^{2}\). In a statistical machine learning model, it is necessary to avoid feeding pairs of continuous variables with large \({r}_{A,B}\) into the model at the same time. Figure 3 shows the set of hazard inducing factors obtained after the above redundancy detection.
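A minimal sketch of this screening for continuous features, assuming NumPy; the feature names, the synthetic data and the 0.9 threshold are illustrative assumptions, not values from the paper.

```python
# Flag continuous feature pairs whose |r_{A,B}| exceeds a chosen threshold.
import numpy as np

def correlated_pairs(X, names, threshold=0.9):
    """Return name pairs whose absolute Pearson correlation exceeds threshold."""
    r = np.corrcoef(X, rowvar=False)  # feature-by-feature correlation matrix
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j]))
    return pairs

# Toy data: the second column is (almost) a linear copy of the first.
rng = np.random.default_rng(0)
slope = rng.normal(size=200)
X = np.column_stack([slope,
                     2.0 * slope + 0.01 * rng.normal(size=200),
                     rng.normal(size=200)])
names = ["slope", "slope_scaled", "aspect"]
pairs = correlated_pairs(X, names)
```

Only one member of each flagged pair would then enter the feature matrix.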

**Fig. 3** The set of hazard inducing factors after redundancy detection

Statistical machine learning models often require the input samples to be subject to a certain distribution, or at least to certain input specifications. The semi-supervised learning method selected in this paper involves the calculation of Euclidean distance, so it is necessary to encode the discrete features and to standardize the continuous features.

The feature matrix contains 4 discrete variables and 9 continuous variables. The discrete variables include 3 categorical variables, i.e. lithology type, surface classification^{[8]} and soil type, as well as one binary variable, the friction-sensitive instability index. A OneHot encoder is used to encode the four discrete variables, converting them into 31-dimensional binary variables. The 31 binary features and the 9 continuous features together constitute a feature matrix of 40 features.
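The encoding step can be sketched with scikit-learn's `OneHotEncoder`, as below; the lithology, soil and surface values shown are invented placeholders (the paper's real 4 discrete variables expand to 31 binary columns, the toy table here to 8).

```python
# Hedged sketch of OneHot encoding the discrete part of the feature matrix.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy rows: [lithology, soil type, surface class, instability index].
discrete = np.array([
    ["granite", "loam", "bare", 0],
    ["schist",  "sand", "bare", 1],
    ["granite", "sand", "rock", 1],
], dtype=object)

enc = OneHotEncoder()
binary = enc.fit_transform(discrete).toarray()
# Each discrete column expands into one 0/1 column per unique value:
# 2 + 2 + 2 + 2 = 8 binary columns for this toy table.
```

The resulting 0/1 matrix is then stacked with the continuous features to form the full feature matrix.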

In the new feature matrix, there are both ratio scales and interval scales; some features are measured in radians while others are dimensionless, and the ranges of values are uneven, so standardization is required before modeling. The min-max normalization method is selected here for processing. For any feature \(X\), min-max normalization first constrains it to \(\stackrel{˙}{X}\) by using the maximum and minimum values of the sample (Equation 2), and then normalizes \(\stackrel{˙}{X}\) to the user-defined interval \([a,b]\) as \(\stackrel{¨}{X}\) (Equation 3), where *min* and *max* are functions returning the minimum and maximum values of the sequence, respectively.

\(\stackrel{˙}{X}=\left(X-min\left(X\right)\right)/\left(max\left(X\right)-min\left(X\right)\right)\) (2)

\(\stackrel{¨}{X}=\stackrel{˙}{X}×\left(b-a\right)+a\) (3)
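Equations (2) and (3) translate directly into a small helper; the sample values and target intervals below are illustrative only.

```python
# Direct implementation of Equations (2) and (3).
import numpy as np

def min_max_normalize(x, a=0.0, b=1.0):
    """Scale feature x to [0, 1] (Eq. 2), then map it to [a, b] (Eq. 3)."""
    x = np.asarray(x, dtype=float)
    x_dot = (x - x.min()) / (x.max() - x.min())  # Eq. (2)
    return x_dot * (b - a) + a                   # Eq. (3)

x = np.array([10.0, 15.0, 20.0])
```

With the default interval the result stays in [0, 1]; passing, say, a=-1, b=1 rescales to [-1, 1].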

Feature selection, feature integration, feature encoding and feature conversion constitute the complete process of feature engineering. The complete processing flow of feature encoding and conversion is shown in Figure 4.

**Fig. 4** Flow of feature encoding and conversion

**2. 2. 2 Algorithm selection and data expansion of stone-sliding survey**

It is difficult to collect a sufficient number of samples in a geological field survey to set up a supervised learning model, and driving a supervised model with such small samples is a challenge. This article addresses this problem by means of semi-supervised learning.

Semi-supervised learning can achieve good prediction results based on a small number of labeled training samples. Its principle is label propagation: a similarity graph is constructed over all samples in the input dataset in order to assign category labels to the unlabeled data. Compared with the label propagation method proposed by Zhu et al.^{[9]}, the method of Zhou et al.^{[10]} minimizes a loss function with a regularization term and is therefore more robust. This paper adopts Zhou's method, and the algorithm is implemented with Scikit-Learn^{[11]}. The available kernel methods are the radial basis function (RBF) and K-nearest neighbor (KNN). Both kernel functions calculate distance in Euclidean space. RBF can map features into a high-dimensional space, but with high time complexity and space complexity; KNN has higher computational efficiency and lower space complexity.

The data on the stone-sliding hazard survey were collected from both sides of Gaizi Valley in the summer of 2017. The data format is vector point. The shape and distribution characteristics of the stone-sliding slopes can be clearly seen by superimposing the hazard survey spots on the GF-1 orthophoto. The stone-sliding slopes formed on both sides of the road are fan-shaped and extend toward the valleys and highways, and some even spread to the other side. Figure 5a is a stone-sliding spot on the eastern slope of Kungay Mountain, and Figure 5b is a stone-sliding hazard spot on the west slope of Kongur Tagh in Gaizi Valley. As infrastructure such as roads and some villages is built directly on the stone-sliding slopes, these slopes pose a huge potential hazard.

**Fig. 5** Stone-sliding survey data superposed with GF-1 images

This study adopts a grid-based strategy for machine learning model construction; in other words, each grid point is a sample. The selected label-propagation semi-supervised learning algorithm is not demanding in sample volume, as long as the ratio of unlabeled to labeled samples is controlled at about 60:1, i.e. the grid points with category labels account for about 1/60 of the total. The profile of the stone-sliding survey can be clearly seen by superimposing the survey points on the GF-1 orthographic image (Fig. 5). The vector graph obtained from the stone-sliding survey data serves as the label data for model training, and the feature matrix serves as the initial feature set.

**2. 2. 3 Machine learning model training and parameter selection**

In order to balance prediction accuracy against algorithmic complexity (including time complexity and space complexity), K-nearest neighbor is chosen as the kernel function of the label propagation model. KNN is a typical distance-based machine learning algorithm, also known as lazy learning. For each unlabeled sample, it searches the training set for the K nearest neighbors and assigns the most frequently occurring category label among those neighbors to the unlabeled sample, iterating this process until all samples are assigned labels. The calculation process of K-nearest neighbor is shown in Fig. 6.

**Fig. 6** KNN-based semi-supervised learning calculation process

Obviously, an important hyperparameter of the KNN kernel is the number of neighborhood points K used in the neighborhood distance calculation. The K value directly affects the precision of the model, so finding the optimal K is crucial to parameter selection. The K value was adjusted while checking the accuracy on the validation set: when K=18, the accuracy on positive samples, the accuracy on negative samples and the global accuracy all reach their maximum. Another important parameter is the number of iterations; the model converges and stabilizes once the number of iterations reaches 30, so it is advisable to set the number of iterations to the convergence point.
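The K-selection procedure can be sketched as a simple sweep over candidate K values while scoring a held-out validation subset; the synthetic data, the candidate list and the validation indices below are illustrative assumptions (the paper's actual optimum was K = 18 on its own data).

```python
# Hedged sketch of choosing the KNN kernel's K by validation accuracy.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(4.0, 1.0, (200, 2))])
y_true = np.r_[np.zeros(200, dtype=int), np.ones(200, dtype=int)]

y = np.full(400, -1)
seed = np.r_[0:4, 200:204]     # the few labeled grid points
y[seed] = y_true[seed]
val = np.r_[100:150, 300:350]  # held-out points with known true labels

best_k, best_acc = None, -1.0
for k in (3, 7, 11, 15, 19):
    model = LabelSpreading(kernel="knn", n_neighbors=k).fit(X, y)
    acc = (model.transduction_[val] == y_true[val]).mean()
    if acc > best_acc:
        best_k, best_acc = k, acc
```

The same loop could track positive-sample, negative-sample and global accuracy separately, as the paper does, before fixing K.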

**2. 2. 4 Spatial Prediction of Stone-sliding Susceptibility**

The potential hazardous areas along the Gaizi Valley section of the China-Pakistan Highway can be obtained by applying the established model over the range of 2 kilometers on both sides of the road. Although no spectral or texture features from optical remote sensing data are included in the modeling features, the prediction results of the model in the 2-kilometer buffer zone on both sides of the Gaizi Valley highway match well with the visually interpreted hazard areas in the GF-1 orthoimage, though to the naked eye there remain many suspected misjudgments.