China-Pakistan Economic Corridor Zone II Versions EN2 Vol 4 (3) 2019
Data of stone-slide investigation and susceptibility distribution in Gaizi Valley
: 2018 - 11 - 19
: 2019 - 01 - 17
: 2019 - 08 - 20
Abstract & Keywords
Abstract: The Karakoram Highway passes through the periglacial area of Kongur Mountain along the northeastern edge of the Pamirs Plateau. On both sides of the highway along the Gaizi Valley, collapses and the secondary stone-sliding hazards they trigger occur frequently, and sand and stones driven by strong external forces block the traffic. Although stone-sliding surveys have been carried out along the domestic section of the Karakoram Highway, assessment or simulation work is rare, and no stone-sliding distribution data have been collected. In this study, principles and techniques of data mining are used to integrate, encode and convert the environmental factors of the hazard, and the resulting feature matrix is used for stone-sliding susceptibility modeling. The spatial distribution of stone-sliding within 2 km on both sides of the highway along the Gaizi Valley is predicted, with an overall accuracy of up to 80%.
Keywords: Karakoram Highway; Gaizi Valley; stone-sliding; susceptibility analysis; machine learning
Dataset Profile
Dataset name: Data of stone-slide investigation and susceptibility distribution in Gaizi Valley
Data authors: Tian Deyu, Zhang Yaonan, Han Liqing, Luo Lihui, Ai Minghao
Author contact: Zhang Yaonan (aonan@lzb.ac.cn)
Data time range: 2017
Geographic scope: 38°30′N–39°00′N, 75°00′E–75°30′E
Spatial resolution: 30 m
Data volume: 2694 KB
Data format: TIF, SHP
Data service system URL: http://www.crensed.ac.cn/portal/metadata/9cf11d7a-3b96-4ca7-a96a-57a5c03d6c86
Sources of funding: Data Sharing Fundamental Program for Construction of the National Science and Technology Infrastructure Platform (Y719H71006)
Dataset components: 1. Vector data on the stone-sliding survey spots in Gaizi Valley; 2. Raster data on stone-sliding susceptibility distribution in Gaizi Valley
1.   Introduction
The China-Pakistan Highway passes through the periglacial area of Kongur Mountain along the northeastern edge of the Pamirs Plateau (Fig. 1), which is a major component of the Kashgar River basin, a tributary of the Tarim River. On both sides of the highway along the Gaizi Valley, collapses and the secondary stone-sliding hazards they trigger occur frequently, and sand and stones driven by strong external forces block the traffic[1]. In particular, the stone-sliding slope is a natural collapsed slope that is widespread along the northern part of the China-Pakistan Highway. It extends along the road in a cone or slope shape.
Its shape, grain size, sedimentation and other features differ from those of the typical sand-slide slope along the Sichuan-Tibet Highway[2]. Wang Zongsheng investigated the stone-sliding hazard of the China-Pakistan Highway (domestic section), but few efforts have been made to assess or simulate the hazard or to collect stone-sliding distribution data. This work is therefore significant for simulation studies of the stone-sliding hazard, and it has reference and application value for engineering geological prevention.

Fig. 1   The diagrammatic sketch of Gaizi Valley basin
Machine learning is an important data-driven modeling approach that has been widely used in the assessment of geological hazards in recent years. Although rule-based methods have certain advantages when sample data are unavailable, machine learning is far better at optimizing solutions and simulating stochastic processes. The stone-sliding survey data are valuable on-site geological survey data collected in Gaizi Valley using GPS measurements. Based on these survey data, this paper builds a data-driven model to produce a spatial distribution dataset of stone-sliding slopes. The dataset has potential for reuse in disaster assessment research and engineering geology.
2.   Data collection and processing methods
2.1 Data sources and preprocessing
This study made a three-dimensional scatter plot (Fig. 2) of the ASTER GDEM product, the SRTM DEM product and the ASTER L1T single-view DEM around Muztagh Ata in the same coordinate space. The ASTER L1T single-view DEM (red in the figure) was obviously abnormal, while ASTER GDEM (green in the figure) and SRTM DEM (yellow in the figure) showed no obvious anomalies. However, the maximum value of SRTM DEM was more than 200 meters above the main peak of Muztagh Ata, whereas the maximum value of ASTER GDEM was closest to the true peak elevation. ASTER GDEM was therefore selected for the entire study.

Fig. 2   Comparison scatter diagram of elevation data
The soil type data are based on the FAO soil classification system of the Food and Agriculture Organization of the United Nations, with a horizontal resolution better than one kilometer[3]. Under this classification system, different soil types are defined by different criteria. For example, Leptosol (LP) is defined by the topographical conditions of soil formation and mainly refers to eroded highlands, while Calcisol (CL), Gypsisol (GY), Kastanozem (KS), Phaeozem (PH), etc. are defined according to the climate, time, organisms and other conditions during soil formation. The data are available free of charge from the Cold and Arid Science Data Center. The data source for China is the 1:1000000 soil data of the Second National Land Survey. The data for the study area were re-projected into the WGS-84 UTM Zone 43N coordinate system and resampled to a resolution of 30 m. According to the data, there are 12 soil types in addition to glaciers in the study area. A comparative analysis of the data shows that, first, the glacier extent (code 11930) in the data differs greatly from the latest glacier boundary extracted from a Landsat 8 Operational Land Imager (OLI) image of the snowmelt period, and second, all soil at the glacier margins is Leptosol. The soil type data for the study area were therefore corrected. After adjustment there are 5 soil types, namely Leptosol, Phaeozem, Gypsisol, Calcisol and Kastanozem, plus two additional classes: glacier and water body.
The regional lithological map comes from the global lithology vector data produced by Hartmann et al.[4] of the University of Hamburg, and the lithological data for the study area were compiled by the Xinjiang Bureau of Geo-exploration & Mineral Development in 1992. The data format is shapefile. The scale is 1:1500000 and there is no data in the glacier area. According to this schematic lithological map, the study area mainly contains Carbonate Sedimentary Rocks (SC) and Unconsolidated Sediments (SU), along with Mixed Sedimentary Rocks (SM) and Acid Plutonic Rocks (PA). Taking the 1:1.5 million regional lithological map as the ground truth overview and drawing on geological knowledge, the study extracts lithological indices from the ASTER TIR and Landsat OLI sensors to refine the lithological types of the regional map, and finally prepares a lithological map with a spatial resolution of 30 m.
The stone-sliding survey data are important ground truth data. Zhou Gongdan from the Institute of Mountain Hazards and Environment, CAS provided the survey data on geological disasters during the snowmelt period in 2017. The dataset contains the latitude and longitude of many hazard spots on stone-sliding slopes. The survey data were verified by indoor visual interpretation of orthophoto maps and were also used for machine learning modeling of stone-sliding hazards. Figure 1 gives an overview of the distribution of the disaster survey spots. The orthophoto map of the Kashgar-Islamabad section of the China-Pakistan Highway, produced by fusing and mosaicking GF-1 PMS data, is used as the survey base map and as auxiliary geological survey data.
The data used in the article are shown in Table 1.
Table 1   Data source list
| Serial number | Name | Data collection date | Usage | Type |
|---|---|---|---|---|
| 1 | Landsat 8 OLI image | July 20, 2016 | Remote sensing lithological mapping | Raster |
| 2 | ASTER TIR image | Composited over nearly 6 years of snowmelt periods | Remote sensing lithological mapping | Raster |
| 3 | GF-1 PMS image | Composited over nearly 6 years of snowmelt periods | Stone-sliding sample expansion | Raster |
| 4 | ASTER GDEM product | Unknown | Extraction of stone-sliding hazard inducing factors | Raster |
| 5 | The ground surface lithological map | 1992 | Remote sensing lithological mapping | Vector |
| 6 | Stone-sliding survey spot | The snowmelt period in 2017 | Training sample of stone-sliding assessment | Vector |
| 7 | The data on soil types | 2008 | Stone-sliding hazard inducing factors | Raster |
2.2 Data processing steps
2.2.1 Hazard inducing factor selection and feature matrix construction
Topographic factors extracted from an accurate and reliable DEM are important inducing factors in geological hazard susceptibility analysis. Using the open source SAGA GIS of Conrad et al.[5], this paper systematically extracts a variety of inducing factors for the assessment of periglacial hazards. These factors can be divided into basic topographic factors, hydrological factors, morphometric factors and geotechnical hydraulic factors. Together with lithology type and soil type, they serve as the predictive factors for the assessment of stone-sliding susceptibility and comprehensively describe the environment that induces stone-sliding hazards. Some redundancy among them is inevitable. Drawing on the idea of data warehousing[6], feature integration can eliminate redundancies and inconsistencies. The redundancy detection of discrete features is based on Pearson's \(\chi^{2}\) hypothesis test with statistic \(q\), which is calculated as in Equation (1)[7]. The null hypothesis \(H_{0}\) of Pearson's \(\chi^{2}\) test is that the two variables are independent of each other, and \(H_{0}\) is accepted only if \(q<\chi^{2}_{1-p}\left(df\right)\), where \(p\) is the level of significance and the degree of freedom is \(df=\left(c-1\right)\times\left(r-1\right)\)[7].
In the \(\chi^{2}\) test for discrete variables A and B, suppose A has c distinct values \(a_{1}, a_{2}, \ldots, a_{c}\) and B has r distinct values \(b_{1}, b_{2}, \ldots, b_{r}\). The c values of A form the columns and the r values of B form the rows of a contingency table. Each pairing of a value \(a_{i}\) with a value \(b_{j}\) is a joint event \(\left(A_{i}, B_{j}\right)\) that occupies one cell of the table. The Pearson \(\chi^{2}\) statistic is given below, where \(o_{ij}\) is the observed frequency of the joint event \(\left(A_{i}, B_{j}\right)\) and \(e_{ij}\) is the expected frequency of the joint event:
\(q=\sum_{i=1}^{c}\sum_{j=1}^{r}\frac{\left(o_{ij}-e_{ij}\right)^{2}}{e_{ij}}\) (1)
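The statistic in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `pearson_chi2`, the 2 x 2 table of counts and the choice of a 0.05 significance level are all assumptions made for the example. The constant 3.841 is the standard 0.95 quantile of the \(\chi^{2}\) distribution with one degree of freedom.

```python
import numpy as np

def pearson_chi2(table):
    """Return (q, df) for a contingency table of observed counts."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    n = observed.sum()
    # Expected frequency e_ij if the two variables were independent
    expected = row_totals @ col_totals / n
    q = ((observed - expected) ** 2 / expected).sum()
    r, c = observed.shape
    df = (r - 1) * (c - 1)
    return q, df

# Hypothetical counts for two discrete features with two values each
table = [[30, 10],
         [10, 30]]
q, df = pearson_chi2(table)

CRITICAL_95_DF1 = 3.841  # chi-square 0.95 quantile for df = 1
# Independence is rejected when q reaches the critical value,
# which flags the two features as redundant.
redundant = q >= CRITICAL_95_DF1
```

For tables with more degrees of freedom the critical value would come from a \(\chi^{2}\) table or a statistics library instead of the hard-coded constant.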
For continuous numerical variables A and B, the classical Pearson correlation test is used. The correlation statistic is \(r_{A,B}\) or its square \(r^{2}\). In a statistical machine learning model, pairs of continuous variables with a large \(r_{A,B}\) should not be fed into the model at the same time. Figure 3 shows the set of hazard inducing factors obtained after the above redundancy detection.
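A minimal sketch of such a correlation screen is given below. The threshold of 0.9, the function name `correlated_pairs` and the synthetic feature names are assumptions for illustration; the text does not state which cutoff was applied.

```python
import numpy as np

def correlated_pairs(features, names, threshold=0.9):
    """List feature pairs whose absolute Pearson r exceeds the threshold."""
    r = np.corrcoef(features, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs

# Synthetic stand-ins for continuous topographic factors
rng = np.random.default_rng(0)
slope = rng.normal(size=200)
steepness = 2.0 * slope + 0.01 * rng.normal(size=200)  # nearly a copy of slope
curvature = rng.normal(size=200)                        # unrelated feature
X = np.column_stack([slope, steepness, curvature])

pairs = correlated_pairs(X, ["slope", "steepness", "curvature"])
```

Only one member of each flagged pair would then be kept in the feature set.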

Fig. 3   The set of hazard inducing factors after redundancy detection
Statistical machine learning models often require the input samples to follow a certain distribution, or at least to meet certain input specifications. The semi-supervised learning method selected in this paper relies on Euclidean distance, so discrete features must be encoded and continuous features must be standardized.
The feature matrix contains 4 discrete variables and 9 continuous variables. The discrete variables comprise 3 categorical variables (lithology type, surface classification[8] and soil type) and one binary variable, the friction-sensitive instability index. The OneHot encoder is used to encode the four discrete variables, which converts them into 31 binary variables. The 31 binary features and the 9 continuous features constitute a feature matrix of 40 features.
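The encoding step can be illustrated with scikit-learn's `OneHotEncoder`. The four sample rows, the category values and the two continuous columns below are invented for the example and are much smaller than the 31 + 9 columns described above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical grid cells: (lithology, soil type, binary instability flag)
discrete = np.array([
    ["SC", "Leptosol", "0"],
    ["SU", "Calcisol", "1"],
    ["SU", "Leptosol", "0"],
    ["PA", "Gypsisol", "1"],
], dtype=object)

# Hypothetical continuous factors for the same four cells
continuous = np.array([[0.10, 35.0],
                       [0.40, 12.0],
                       [0.25, 20.0],
                       [0.90, 41.0]])

# Each discrete column expands into one binary column per distinct value
encoder = OneHotEncoder()
binary = encoder.fit_transform(discrete).toarray()

# Binary columns and continuous columns are stacked side by side
feature_matrix = np.hstack([binary, continuous])
```

Here the three discrete columns have 3, 3 and 2 distinct values, so they expand into 8 binary columns, and the stacked matrix has 10 columns.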
The new feature matrix contains both ratio-scale and interval-scale features. Some are in radians while others are unitless, and their value ranges differ widely, so standardization is required before modeling. Min-max normalization is used here. For any feature X, the method first rescales X to \(\dot{X}\) using the minimum and maximum values of the sample, and then maps \(\dot{X}\) to the user-defined interval [a, b] as \(\ddot{X}\), where min and max return the minimum and maximum values of the sequence, respectively.
\(\dot{X}=\left(X-\min\left(X\right)\right)/\left(\max\left(X\right)-\min\left(X\right)\right)\) (2)
\(\ddot{X}=\dot{X}\times\left(b-a\right)+a\) (3)
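Equations (2) and (3) translate directly into code. The sketch below assumes a single feature held in a one-dimensional array; the function name `min_max_scale`, the sample values and the target interval [-1, 1] are illustrative choices.

```python
import numpy as np

def min_max_scale(x, a=0.0, b=1.0):
    """Rescale a feature to [a, b] following Equations (2) and (3)."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("a constant feature cannot be min-max scaled")
    x_dot = (x - x.min()) / span   # Equation (2): values in [0, 1]
    return x_dot * (b - a) + a     # Equation (3): values in [a, b]

# Hypothetical slope angles in degrees
slope_deg = [0.0, 15.0, 30.0, 60.0]
scaled = min_max_scale(slope_deg, a=-1.0, b=1.0)
# scaled is [-1.0, -0.5, 0.0, 1.0]
```

The guard against a zero span is an addition for robustness; a feature with a single value carries no information and would normally be dropped before this step.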
Feature selection, feature integration, feature coding and feature conversion together make up the complete feature engineering process. The processing flow of feature encoding and conversion is shown in Figure 4.

Fig. 4   Flow of feature encoding and conversion
2.2.2 Algorithm selection and data expansion of stone-sliding survey
In geological field surveys it is difficult to collect enough samples to build a supervised learning model, and training a supervised model on a small sample is a challenge. This article addresses the problem by means of semi-supervised learning.
Semi-supervised learning can achieve good prediction results from a small number of labeled training samples. The principle is label propagation, which constructs a similarity graph over all samples in the input dataset and uses it to assign category labels to unlabeled data. Compared with the label propagation method proposed by Zhu et al.[9], the method of Zhou et al.[10] minimizes a loss function that contains a regularization term, so it is more robust. This paper adopts Zhou's method, and the algorithm is implemented with Scikit-Learn[11]. The available kernels are the radial basis function (RBF) and K-nearest neighbor (KNN). Both kernels compute distances in Euclidean space. RBF can map features to a high-dimensional space but has high time and space complexity, whereas KNN is computationally more efficient and has lower space complexity.
The stone-sliding hazard survey data were collected on both sides of Gaizi Valley in the summer of 2017. The data format is vector point. The shape and distribution of the stone-sliding slopes can be clearly seen by superimposing the hazard survey spots on the GF-1 orthophoto. The stone-sliding slopes formed on both sides of the road are fan-shaped and extend to the valleys and highways, and some could even spread to the other side. Figure 5a shows a stone-sliding spot on the eastern slope of Kungay Mountain, and Figure 5b shows a stone-sliding hazard spot on the west slope of Kongur Tagh in Gaizi Valley. Because infrastructure such as roads and some villages is built directly on these slopes, the stone-sliding slopes pose a huge potential hazard.

Fig. 5   Stone-sliding survey data superposed with GF-1 images
This study adopts a grid-based strategy for building the machine learning model, in which each grid point is a sample. The selected label-propagation semi-supervised learning algorithm does not demand a large labeled sample. It is enough to keep the ratio of unlabeled to labeled grid points at about 60:1, that is, grid points with category labels account for about 1/60. The outline of each stone-sliding slope can be clearly seen by superimposing the survey points on the GF-1 orthoimage (Fig. 5). The vector graph derived from the stone-sliding survey data serves as the label data for model training, and the feature matrix is used as the initial feature set.
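The grid-to-sample step can be sketched as follows. The grid size, the random feature values and the placement of the labeled cells are invented so that the example reproduces a 60:1 ratio; the value -1 for unlabeled samples follows the scikit-learn convention, and `grid_to_samples` is a hypothetical helper name.

```python
import numpy as np

UNLABELED = -1  # scikit-learn's marker for samples without a label

def grid_to_samples(feature_stack, label_grid):
    """Flatten (bands, rows, cols) features and a (rows, cols) label grid."""
    bands, rows, cols = feature_stack.shape
    X = feature_stack.reshape(bands, rows * cols).T  # one row per grid cell
    y = label_grid.reshape(rows * cols)
    return X, y

rng = np.random.default_rng(0)
features = rng.random((40, 61, 100))   # 40 features on a 61 x 100 grid

labels = np.full((61, 100), UNLABELED)
labels[0, :50] = 1                     # hypothetical stone-sliding cells
labels[1, :50] = 0                     # hypothetical non-stone-sliding cells

X, y = grid_to_samples(features, labels)
n_unlabeled = int((y == UNLABELED).sum())
n_labeled = int((y != UNLABELED).sum())
ratio = n_unlabeled / n_labeled
```

With 6,000 unlabeled and 100 labeled cells the ratio is 60, and row k of `X` holds the 40 feature values of the grid cell at row k // 100, column k % 100.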
2.2.3 Machine learning model training and parameter selection
To balance prediction accuracy against algorithm complexity (both time and space complexity), K-nearest neighbor is chosen as the kernel function of the label propagation model. K-nearest neighbor is a typical distance-based machine learning algorithm, also known as lazy learning. For each unlabeled sample, it searches the training set for the K nearest neighbors and assigns the most frequent category label among them to the sample, iterating until all samples have been assigned labels. The calculation process of K-nearest neighbor is shown in Fig. 6.

Fig. 6   KNN-based semi-supervised learning calculation process
An important hyperparameter of the KNN kernel is K, the number of neighbors used in the neighborhood distance calculation. The K value directly affects the precision of the model, so finding the optimal K is crucial in parameter selection. K was adjusted while checking the accuracy on the validation set. When K=18, the accuracy on positive samples, the accuracy on negative samples and the global accuracy all reach their maximum. Another important parameter is the number of iterations. The model converges and becomes stable once the number of iterations reaches 30, so it is advisable to iterate until convergence.
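As a hedged illustration of this setup, the sketch below uses scikit-learn's `LabelSpreading`, which implements the regularized method of Zhou et al., with the KNN kernel, K=18 and a cap of 30 iterations as reported above. The two synthetic clusters stand in for the real feature matrix, so the resulting accuracy says nothing about the stone-sliding data.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters replace the real feature matrix
X = np.vstack([rng.normal(0.0, 0.3, size=(300, 2)),
               rng.normal(3.0, 0.3, size=(300, 2))])
y_true = np.repeat([0, 1], 300)

# Only a handful of samples carry labels; -1 marks unlabeled samples
y = np.full(600, -1)
y[:5] = 0
y[300:305] = 1

# KNN kernel with the K and iteration cap reported in the text
model = LabelSpreading(kernel="knn", n_neighbors=18, max_iter=30)
model.fit(X, y)

# transduction_ holds the label assigned to every input sample
accuracy = float((model.transduction_ == y_true).mean())
```

To mimic the parameter search described above, the fit could be repeated over a range of `n_neighbors` values and scored on held-out labeled cells.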
2.2.4 Spatial Prediction of Stone-sliding Susceptibility
The potentially hazardous areas of the China-Pakistan Highway along Gaizi Valley can be obtained by applying the established model to the area within 2 kilometers on both sides of the road. Although no spectral or texture features from optical remote sensing data are included in the modeling features, the predictions within the 2-kilometer buffer on both sides of the Gaizi Valley highway closely match the hazard areas visually interpreted from the GF-1 orthoimage, although many suspected misjudgments remain visible to the naked eye.
3.   Data Sample Description
Figure 7 shows the distribution of the stone-sliding survey spots and the spatial distribution of the prediction results of stone-sliding hazards. The purple area is the predicted stone-sliding zone.

Fig. 7   The Prediction of Stone-sliding Susceptibility
4.   Data Quality control and Evaluation
4.1 The Quality Control of Raw Data
The Landsat OLI data used in lithological mapping have undergone rigorous atmospheric correction, and the lithological index inversion of the ASTER TIR data is based directly on the at-sensor radiance, an approach that has been demonstrated in previous studies[12]. The quality of the DEM data is assured by comparing DEM products from different sources and selecting the product closest to the true elevation of the study area. The soil type data have been corrected with the Landsat OLI-derived glacier extent at the end of the snowmelt period to ensure that the data are faithful to the study area. The study refines the lithological classes of the 1:1.5 million lithological map of the study area through remote sensing mapping. The accuracy of remote sensing lithological mapping is ensured by setting a reasonable lower threshold for each band index, that is, by using the threshold suggested by the proposer of the band index or a higher threshold.
4.2 The Evaluation of Stone-sliding Prediction
Experiments show that the semi-supervised learning model trained on a small number of samples performs reasonably well. On the validation set, the accuracy on positive samples exceeds 57% and the accuracy on negative samples exceeds 99%; the global accuracy is close to 80% (Fig. 8). First, the high accuracy on negative samples raises the global accuracy of the model. Second, the per-class accuracies indicate that a grid point that is not a stone-sliding slope is almost never predicted as one, whereas a grid point that is a stone-sliding slope has a considerable chance of being predicted as a non-stone-sliding slope. In other words, although the prediction accuracy is not very high, non-stone-sliding slopes are rarely mistaken for stone-sliding slopes.
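The distinction between per-class and global accuracy can be made concrete with a small sketch. The ten validation labels below are invented, and the helper `class_accuracies` is hypothetical; the example only shows how missed positives with no false alarms produce the pattern described above.

```python
import numpy as np

def class_accuracies(y_true, y_pred):
    """Accuracy on positives, on negatives, and over all samples."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos = y_true == 1
    neg = y_true == 0
    return {
        "positive": float((y_pred[pos] == 1).mean()),
        "negative": float((y_pred[neg] == 0).mean()),
        "overall": float((y_pred == y_true).mean()),
    }

# Hypothetical validation set: 4 stone-sliding cells, 6 other cells
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Two stone-sliding cells are missed; no false alarms occur
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

scores = class_accuracies(y_true, y_pred)
# scores: positive 0.5, negative 1.0, overall 0.8
```

Because the overall figure depends on the mix of positive and negative cells in the validation set, it is worth reporting the per-class values alongside it, as the text does.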

Fig. 8   Model Accuracy Evaluation
5.   Data Usage and Recommendation
The data on stone-sliding survey spots are in the format of vector SHP, and the longitude and latitude information is recorded in the attribute table. The data on stone-sliding susceptibility distribution are saved in the format of raster TIF. The superposition of the two shows the best result. Commonly-used GIS and remote sensing software such as ArcGIS, QGIS, ENVI and ERDAS can support the reading and manipulation of the data.
We thank the National Special Environment and Function of Observation and Research Stations Shared Service Platform for project support, and Researcher Gordon G.D. Zhou from Chengdu Institute of Mountain Hazards and Environment, CAS for supplying the stone-sliding survey data.
[1] Wang Zongsheng. Survey and Assessment of the Glacial Geological Disasters along the China-Pakistan Economic Corridor (Domestic Section) [D]. Beijing: China University of Geosciences (Beijing), 2016.
[2] Yang Zhiquan, Zhu Yingyan, Liao Liping et al. Gravel-Sliding Slope along International Karakoram Highway (KKH) [J]. Geological Science and Technology Information, 2013, 32(6): 175-180.
[3] FISCHER G, VELTHUIZEN H V, SHAH M, et al. Global AgroEcological Assessment for Agriculture [C]// The Century, Rome, Food and Agriculture Organization of the United Nations, 2010.
[4] HARTMANN J, MOOSDORF N. The new global lithological map database GLiM: A representation of rock properties at the Earth surface [J]. Geochemistry Geophysics Geosystems, 2012, 13(12): 1-37.
[5] CONRAD O, BECHTEL B, BOCK M, et al. System for Automated Geoscientific Analyses (SAGA) v. 2.1.4 [J]. Geoscientific Model Development Discussions, 2015, 8(2): 2271-2312.
[6] HAN J, KAMBER M. Data Mining: Concepts and Techniques [M]. Amsterdam: Elsevier, 2011.
[7] PAPOULIS A, PILLAI S U. Probability, Random Variables, and Stochastic Processes, Fourth Edition [M]. NYC: McGraw-Hill, 2002.
[8] IWAHASHI J, PIKE R J. Automated classifications of topography from DEMs by an unsupervised nested-means algorithm and a three-part geometric signature [J]. Geomorphology, 2007, 86(3-4): 409-440.
[9] ZHU X, GHAHRAMANI Z. Learning from labeled and unlabeled data with label propagation [R]. School Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CMU-CALD-02-107, 2002.
[10] ZHOU D, BOUSQUET O, LAL T N, et al. Learning with local and global consistency[C]// International Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2003: 321-328.
[11] PEDREGOSA F, GRAMFORT A, MICHEL V, et al. Scikit-learn: Machine Learning in Python [J]. Journal of Machine Learning Research, 2011, 12: 2825-2830.
[12] YARBROUGH L D, EASSON G, KUSZMAUL J S. Using at-sensor radiance and reflectance tasseled cap transforms applied to change detection for the ASTER sensor [C]// International Workshop on the Analysis of Multi-temporal Remote Sensing Images. IEEE, 2005.
Data citation
Tian DY, Zhang YN, Han LQ, et al. Data of stone-slide investigation and susceptibility distribution in Gaizi Valley. National Special Environment and Function of Observation and Research Stations Shared Service Platform, 2018. (June 13, 2018). DOI: 10.12072/casnw.062.2019.db.
Article and author information
How to cite this article
Tian DY, Zhang YN, Han LQ, et al. Data of stone-slide investigation and susceptibility distribution in Gaizi Valley. China Scientific Data, 4 (2019). (August 18, 2019). DOI: 10.11922/csdata.2018.0078.zh.
Tian Deyu
Contribution: the modeling of stone-sliding susceptibility and paper writing
male, from Siziwang Banner, Inner Mongolia, master candidate, majoring in remote sensing application in cold and arid areas
Zhang Yaonan
Contribution: data quality control methods and paper writing
male, from Tianshui City, Gansu Province, doctoral candidate, researcher, majoring in geoscience big data.
Han Liqing
Contribution: The acquisition of field survey stone-sliding spots and cartographic visualization
male, from Puyang City, Henan Province, doctoral candidate, majoring in quantitative geological disaster remote monitoring
Kang Jianfang
Contribution: Data quality control and management
female, from Qin’an County, Gansu Province, master, engineer, majoring in big data application in cold and arid regions.
Luo Lihui
Contribution: Data quality control and cartographic visualization
male, from Changde City, Hunan Province, doctor, associate researcher, majoring in remote sensing application in cold and arid areas
Ai Minghao
Contribution: the production of GF-1 geological survey DOM basemap
male, from Jining City, Shandong Province, doctor, engineer, majoring in big data application in cold and arid regions.
Han Yufang
Contribution: Paper writing and modification
female, from Lintan County, Gansu Province, doctor, engineer, majoring in big data application in cold and arid regions.
Publication records
Published: Aug. 20, 2019 (Versions EN1)
Updated: Aug. 20, 2019 (Versions EN2)
Released: Jan. 17, 2019 (Versions ZH2)
Published: Aug. 20, 2019 (Versions ZH3)