This dataset maps the distribution of permafrost in Northeast China. It was produced with a machine learning model (random forest) trained on field observations from boreholes and test pits and driven by terrain, vegetation, meteorological, soil, and hydrological factors. Permafrost is distributed mainly in a "W" shape across the Greater and Lesser Khingan Mountains. In the high-latitude northern part of the Greater Khingan Mountains, permafrost occurs in patches or continuously and extends southward along the ridge line of the range to the vicinity of Alshan; the areas where it occurs lie at relatively high elevations. On the east side of the Greater Khingan Mountains and the north side of the Lesser Khingan Mountains, permafrost occurs mainly in a dendritic pattern along mountain valleys and lowlands. Given its accuracy, the dataset can serve as a calibration benchmark and historical reference for simulating permafrost in Northeast China under global warming. The data are provided in GeoTIFF format with a spatial resolution of approximately 1 km, in the WGS 1984 geographic coordinate system.
| Field | Value |
|---|---|
| Collection time | 2023/01/01 - 2024/12/31 |
| Collection place | Northeast China |
| Data size | 2.4 MiB |
| Data format | *.tif |
| Spatial resolution | 1 km |
| Temporal resolution | |
| Coordinate system | WGS84 |
Raw data: field observations from boreholes and test pits.
Environmental variable data: Five major categories of environmental variables including terrain, vegetation, climate, hydrology, and soil were selected as predictive factors.
Terrain factors: Extract elevation, slope, aspect, topographic wetness index (TWI), topographic position index (TPI), and terrain relief from a digital elevation model (DEM).
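As a rough illustration of how some of these terrain factors can be derived from a DEM grid, the sketch below computes slope, a 3x3 topographic position index, and 3x3 local relief with NumPy. The function name `terrain_factors`, the 3x3 window, and the edge handling are assumptions made for this example; the source does not state which tool or window size was used. It also assumes the elevations and the cell size share one linear unit (for example metres), which would require projecting a DEM stored in geographic coordinates first.

```python
import numpy as np

def terrain_factors(dem, cell_size):
    """Return slope (degrees), TPI and local relief for a 2-D DEM array.

    Assumes elevations and cell_size use the same linear unit (e.g. metres).
    """
    dem = np.asarray(dem, dtype=float)
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    dz_dy, dz_dx = np.gradient(dem, cell_size)
    slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

    # Stack the 3x3 neighbourhood of every cell; edge cells reuse the nearest value.
    padded = np.pad(dem, 1, mode="edge")
    rows, cols = dem.shape
    window = np.stack([
        padded[r:r + rows, c:c + cols]
        for r in range(3) for c in range(3)
    ])
    # TPI: elevation minus the mean of the 8 surrounding cells.
    neighbour_mean = (window.sum(axis=0) - dem) / 8.0
    tpi = dem - neighbour_mean
    # Relief: elevation range inside the 3x3 window.
    relief = window.max(axis=0) - window.min(axis=0)
    return slope_deg, tpi, relief
```

On a uniformly tilted plane the slope is constant and the interior TPI is zero, which makes a convenient sanity check before running the function on real data.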
Vegetation/hydrological factors: Use Landsat 8 remote sensing products to extract the normalized difference vegetation index (NDVI), enhanced vegetation index (EVI), and normalized difference water index (NDWI).
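The three indices are simple band ratios. The sketch below uses the standard NDVI and EVI formulas and, as an assumption, the green/near-infrared form of NDWI (McFeeters); the source does not say which NDWI variant was used. Inputs are assumed to be surface reflectance scaled to 0-1, taken from Landsat 8 bands 2 (blue), 3 (green), 4 (red), and 5 (near infrared). The names `spectral_indices` and `_ratio` are hypothetical.

```python
import numpy as np

def _ratio(numerator, denominator):
    """Element-wise division that yields NaN where the denominator is zero."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denominator != 0, numerator / denominator, np.nan)

def spectral_indices(blue, green, red, nir):
    """Return (ndvi, evi, ndwi) arrays from reflectance bands scaled to 0-1."""
    blue, green, red, nir = (
        np.asarray(band, dtype=float) for band in (blue, green, red, nir)
    )
    ndvi = _ratio(nir - red, nir + red)
    # EVI with the usual coefficients G=2.5, C1=6, C2=7.5, L=1.
    evi = 2.5 * _ratio(nir - red, nir + 6.0 * red - 7.5 * blue + 1.0)
    # Assumed variant: McFeeters NDWI, (green - NIR) / (green + NIR).
    ndwi = _ratio(green - nir, green + nir)
    return ndvi, evi, ndwi
```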
Meteorological factors: Land surface temperature (LST) and precipitation are taken from existing data products and are used to calculate thawing and freezing indices, which serve as key intermediate input variables.
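Freezing and thawing (melting) indices are commonly defined as degree-day sums: the accumulated magnitude of daily mean temperatures below 0 °C and above 0 °C over a year. The sketch below follows that common definition; it assumes daily mean temperatures in °C, and the function name is hypothetical. The source does not specify the exact formulation or time step used.

```python
def freeze_thaw_indices(daily_mean_temps_c):
    """Return (freezing_index, thawing_index) in degree-days (°C·d).

    freezing_index: sum of |T| over days with T < 0 °C
    thawing_index:  sum of T over days with T > 0 °C
    """
    freezing = sum(-t for t in daily_mean_temps_c if t < 0)
    thawing = sum(t for t in daily_mean_temps_c if t > 0)
    return freezing, thawing
```

Applied per grid cell to a year of daily values, this gives one freezing-index layer and one thawing-index layer that can be stacked with the other predictors.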
Data preprocessing: Spatially register and standardize all multi-source raster data. All layers use the WGS 1984 geographic coordinate system and are cropped to the boundary of the study area. All variables are resampled to a uniform spatial resolution of 1000 m and stored as GeoTIFF so that the multi-source data align cell by cell. Using the ArcGIS Extract Multi Values to Points tool, extract the environmental variable values at each sample point to build a high-dimensional "sample-environment feature" dataset. Each sample consists of a target variable (classification label: 1 for permafrost, 0 for seasonally frozen ground) and its feature vector. Check the extracted results for completeness and remove samples containing missing values (NoData) or outliers to ensure the quality of the model input.
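The extraction and cleaning step can be pictured with a small pure-Python sketch. It is not the ArcGIS tool itself: it assumes every layer shares the same grid (same upper-left corner and square pixel size in degrees), looks up the cell under each sample point, and discards points that fall outside the grid or on a NoData cell. Outlier screening is left out. The function names, the `-9999` NoData value in the usage example, and the coordinates are all hypothetical.

```python
import math

def sample_grid(grid, origin_lon, origin_lat, pixel_deg, lon, lat, nodata):
    """Return the cell value under (lon, lat), or None if outside or NoData.

    origin_lon / origin_lat are the coordinates of the grid's upper-left corner.
    """
    col = math.floor((lon - origin_lon) / pixel_deg)
    row = math.floor((origin_lat - lat) / pixel_deg)
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        return None
    value = grid[row][col]
    return None if value == nodata else value

def build_samples(points, layers, origin_lon, origin_lat, pixel_deg, nodata):
    """Build (features, labels) from (lon, lat, label) points.

    layers maps a variable name to a 2-D grid; feature columns follow the
    sorted variable names. Points with any missing value are skipped.
    """
    names = sorted(layers)
    features, labels = [], []
    for lon, lat, label in points:
        row = [
            sample_grid(layers[name], origin_lon, origin_lat,
                        pixel_deg, lon, lat, nodata)
            for name in names
        ]
        if any(value is None for value in row):
            continue  # drop incomplete samples
        features.append(row)
        labels.append(label)
    return features, labels
```

For example, with two 2x2 layers and four points, one point on a NoData cell and one outside the grid are dropped:

```python
layers = {"elev": [[100, 200], [300, -9999]], "ndvi": [[0.1, 0.2], [0.3, 0.4]]}
points = [(120.25, 49.75, 1), (120.75, 49.25, 0), (120.75, 49.75, 0), (125.0, 49.75, 1)]
features, labels = build_samples(points, layers, 120.0, 50.0, 0.5, -9999)
```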
Random forest model construction: Stratified random sampling is used to divide the dataset into a training set (70%) and a test set (30%). A random forest classification model is built in Python with the scikit-learn machine learning library. To address class imbalance, the class_weight parameter is set to 'balanced'. Key hyperparameters are optimized through grid search: the number of decision trees (n_estimators) is set to 1000, and the maximum depth (max_depth) and the minimum number of samples required to split a node (min_samples_split) are tuned in the same search. The random seed (random_state) is fixed to ensure reproducibility. The environmental variables serve as input features and the frozen-ground type as the label for model training.
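A minimal scikit-learn sketch of this workflow is shown below. It runs on synthetic data from `make_classification` rather than the real sample table, and the search grid, the 3-fold cross-validation, the F1 scoring, and the seed value are assumptions chosen so the example runs quickly. The source reports 1000 trees but gives neither the candidate grid nor the selected depth and split values. The function name `fit_permafrost_rf` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def fit_permafrost_rf(X, y, seed=42):
    """Stratified 70/30 split, then grid search over a balanced random forest."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Hypothetical search grid, kept small so the sketch runs quickly.
    param_grid = {
        "n_estimators": [50, 100],
        "max_depth": [None, 10],
        "min_samples_split": [2, 5],
    }
    search = GridSearchCV(
        RandomForestClassifier(class_weight="balanced", random_state=seed),
        param_grid,
        cv=3,          # assumed number of folds
        scoring="f1",  # assumed selection metric
    )
    search.fit(X_train, y_train)
    return search, X_test, y_test

# Synthetic stand-in for the sample table
# (1 = permafrost, 0 = seasonally frozen ground).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0
)
search, X_test, y_test = fit_permafrost_rf(X, y)
print(search.best_params_, search.score(X_test, y_test))
```

Passing `stratify=y` keeps the class proportions the same in the training and test sets, and `class_weight="balanced"` reweights classes inversely to their frequency during training.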
Accuracy evaluation: Calculate confusion matrix, Overall Accuracy, Precision, Recall, F1 Score, and Kappa coefficient.
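For a two-class problem, all of these measures follow from the four cells of the confusion matrix. The sketch below computes them in plain Python, treating label 1 (permafrost) as the positive class; the function name and the returned dictionary layout are choices made for this example.

```python
def binary_metrics(y_true, y_pred):
    """Confusion matrix and derived scores for labels in {0, 1} (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn

    overall_accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Kappa compares observed agreement with the agreement expected by chance.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    kappa = (overall_accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: true 0/1, cols: predicted 0/1
        "overall_accuracy": overall_accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "kappa": kappa,
    }
```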
The dataset was produced with a machine learning model and evaluated using the confusion matrix, overall accuracy, precision, recall, F1 score, and Kappa coefficient. The results indicate a high level of agreement between the model predictions and the observations.
| # | Number | Name | Type |
|---|---|---|---|
| 1 | 2022FY100700 | Survey of Permafrost Conditions and Freeze-Thaw Damage in the High-Latitude Regions of Northeast China | Basic Resource Survey Project |
This work is licensed under CC BY 4.0 (Creative Commons Attribution 4.0 International License).
| # | title | file size |
|---|---|---|
| 1 | 东北1km多年冻土分布图(2023-2024年).jpg | 2.1 MiB |
| 2 | 东北1km多年冻土分布图(2023-2024年).tif | 124.5 KiB |
| 3 | 东北1km多年冻土分布图(2023-2024年)_元数据.docx | 108.1 KiB |
| 4 | 东北1km多年冻土分布图(2023-2024年)_说明文档.docx | 23.0 KiB |
©Copyright 2005-. Northwest Institute of Eco-Environment and Resources, CAS.
Donggang West Road 320, Lanzhou, Gansu, China (730000)

