Abstract: Among the numerous attributes of scientific data, geographical attributes are the most intuitive way of expressing data. Geographical description of the data will help users to understand the data, as well as to access and use it. Based on the Database of Fishing Port Quantity, Distribution, Function and Current Status of the Fishery Science Data Sharing Center, this study uses a geographical attribute analysis method to construct a dataset on the geographical distribution of fishing ports in coastal China, and conducts quality control through data analysis. By promoting users' understanding and grasp of the data, the study lays the foundation for geographical data application. The method presented here also provides a template for geographical attribute analyses of similar datasets.
Keywords: fishing ports; geographical information; scientific data; geographical distribution
|Chinese title||2017 年我国沿海渔港地理分布数据集|
|English title||Geographical distribution of fishing ports in coastal China, 2017|
|Data corresponding author||Xu Shuo (email@example.com)|
|Data authors||Chen Mengjie, Xu Shuo, Liu Huiyuan, Jiang Qingzhao|
|Geographical scope||18°15'28" N – 48°17'53" N, 108°15'23" E – 130°14'51" E; specific areas include China’s coastal provinces|
|Spatial resolution||1000 m|
|Data volume||657 entries|
|Data service system||<http://www.sciencedb.cn/dataSet/handle/542>|
|Sources of funding||Special research project of the National Science and Technology Infrastructure Platform – “Agricultural Science Data Sharing Center” (2005DKA31800); Special research project of the National Science and Technology Infrastructure Platform – “Fishery Science Data Sharing Center” (2005DKA31800-03); Central Public-interest Scientific Institution Basal Research Fund, CAFS (2017) – “data summary analysis of the National Aquatic Science Data Center” (2016HY-ZC10); Central Public-interest Scientific Institution Basal Research Fund, CAFS (2016) – “Fisheries Engineering Data Architecture Research” (2016JC0110)|
|Dataset composition||The dataset consists of 657 data entries on the geographical distribution of fishing ports in China’s coastal areas.|
In the era of mobile Internet, geographical information applications, especially location-based ones, have developed extensively in areas such as transportation, shopping, and catering, and have greatly changed people's lifestyles. Public demand for geographic context information further drives the development of this technology [1–5]. In fishery science data, geographical attributes are implicit in data attributes and data details. Geographical attribute analysis and related applied research thus constitute an important direction for fishery science in promoting user understanding of the data. The Fishery Science Data Sharing Center (FSDSC) [6] brings together a wealth of fishery science datasets on its platform for open access and use. However, the number of user visits to a dataset depends largely on its display position on the platform, while associations among datasets and between data and users are not clearly defined. This weakens the service capability of the datasets. To address the mismatch between user interest and user visits, a common technical approach is to study personalized data services, data mining, machine learning, and so forth [7,8]. Another solution is to mine geographical attribute information from the data in order to establish a geographically relevant context linking the data and the user, to present the data to the user in the most intuitive form (a map), and to promote user associations. Taking fishing ports in coastal China (2017) as an example, this study collects the geographical attributes of the ports and builds a new dataset from them. The study is expected to lay a foundation for future data research and related support work.
The “Fishing Port Quantity, Distribution, Function and Current Status” (FPQDFCS) [9] database provides over 1300 attribute parameters of fishing ports throughout the country, including shelter levels and pier lengths, as well as geographical attributes in text form, such as descriptive information on Dalian Bay and on Qianyang Town of Donggang City. The database has distinct geographical attributes that can be converted by technical means into a format convenient for location marking. This study uses string processing tools, location analysis tools, the JS scripting language, and so forth to analyze FPQDFCS data and to obtain quantified geographical information. It provides data support for geographically related studies of fishing ports.
2.1 Data sources and profiles
This dataset is derived from the FPQDFCS database through a defined processing method, so it is a direct derivative of that database. FPQDFCS is based on the data on fishing ports in China's coastal areas released by the Ministry of Agriculture in 1990, which belong to the basic data of fishery science and technology. Detailed data contents are available at the FSDSC online platform (http://fishery.agridata.cn/grade3.asp?st=llsj&id=A040361). The FPQDFCS database has an integrity rate of 85.5%, calculated from the proportion of missing fields among all recorded fields. Geographic location data in particular are relatively complete, with an integrity rate of 99.2%. This high integrity is why the database was selected for processing.
With data collection, processing and storage completed, our dataset is now available on the Fisheries Scientific Data Platform (http://fishery.agridata.cn/grade3.asp?st=llsj&id=A040364).
2.2 Data acquisition and processing
2.2.1 Overall flowchart
Data collection and processing involve five stages: raw data preprocessing, geographic data collection, data processing, data association, and data verification. Figure 1 shows the overall data collection and processing flow. Each stage is detailed in the following subsections.
2.2.2 Raw data preprocessing
In the FPQDFCS database, each data entry consists of the name, geographical location, shelter level, pier length, revetment length, and breakwater length of the fishing port, together with data provider, time and date of updates. A view of the data details shows that geographic information is mainly included in the name and geographic location description, whereas other fields describe attributes or parameters of the fishing port itself. In the FPQDFCS database, geographical location of 11 data entries was left blank, and geographical information can only be identified by the fishing port name. For the rest of the data entries, a data field was added to combine the name and geographical location of the fishing port, so as to form complete geographical attribute information for further processing. As required by the address resolution tool, contents of the added data field “name+location” are spliced into a string array format, as shown in Figure 2.
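The preprocessing step can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the script used in the study: the field names `name` and `location`, the function `build_query_strings`, and the sample records are all hypothetical. Entries whose location is blank fall back to the port name alone.

```python
from typing import Dict, List


def build_query_strings(records: List[Dict[str, str]]) -> List[str]:
    """Combine port name and location text into one query string per entry."""
    queries = []
    for rec in records:
        name = rec.get("name", "").strip()
        location = rec.get("location", "").strip()
        # When the location field is blank, only the name carries
        # geographic information, so it is used on its own.
        queries.append(f"{name} {location}" if location else name)
    return queries


# Hypothetical sample records (not taken from the database).
sample = [
    {"name": "Example Port A", "location": "Example Town, Example City"},
    {"name": "Example Port B", "location": ""},
]
queries = build_query_strings(sample)
print(queries)
```

The resulting list plays the role of the "name+location" string array passed to the address resolution tool.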
2.2.3 Geographic data collection
After the raw data are preprocessed, our task is to further process geographical attribute information of the fishing ports, and to convert geographic location attributes into geographic coordinates.
The World Geodetic System 1984 (WGS-84) is the coordinate system used by the Global Positioning System (GPS). In China, all geographical location data must first be encrypted with the GCJ-02 coordinate system (also called the Mars coordinate system), developed by the State Bureau of Surveying and Mapping. Some online map applications, such as Baidu Maps, Gaode Maps, and Tencent Maps, provide coordinate extraction services through which natural-language geographical locations can be converted into latitude and longitude coordinates. For security reasons, however, they do not provide exact coordinates, because the data are encrypted before being released for public access. There may therefore be a certain deviation between the published and actual latitudes and longitudes, but this does not affect geographic location analysis or user-data correlation. For the same reason, the encrypted coordinates of the fishing ports collected in this study do not compromise information security.
The tool takes a string array of geographical locations as input and generates address information composed of serial numbers (No.), geographical locations, longitude values, and latitude values. We enter the preprocessed data into the tool's source code and run it to obtain output A (1046 entries). The execution time is about 10 minutes, including network response time and data processing time. Figure 3 shows the format of the output data.
To verify the results, the name and the geographic location of the fishing ports are each used separately as input to the tool, generating two further data sets: output B (about 900 entries) and output C (200 entries), respectively. Across the three data sets, 11.11% of the fishing port entries in the original database fail to yield geographic coordinates and require other methods. These data can be used to supplement data sources for the geographical distribution of fishing ports. Among the three outputs, A is obtained from the most detailed input; B covers a broader geographic scope semantically but gives relatively rough results with many duplicates; C records the most detailed location information but has the smallest output.
2.2.4 Data processing
The address resolution tool produces output in an unstructured text format, which is easy to read when the amount of data is small. As the amount increases, data query and processing become more complex, and the text needs to be converted into a structured format. A common conversion method is to use a program that reads and converts the data line by line. An easier and equally effective method is to use the data functions of Excel spreadsheets. After conversion to a structured format, a clearer correspondence can be established between the new and original datasets, for example by using the name of the fishing port as the association field.
Normalized output A is processed in Excel, which offers several techniques. One is to write a formula that extracts the name and geographic coordinates of the fishing port directly. Two functions are involved: the substring function MID and the character position search function FIND. For example, the formula for extracting the name of the fishing port is MID(A1, FIND(",", A1)+1, FIND(":", A1)-1-FIND(",", A1)), where A1 denotes one entry of output A. Another is to use a predefined operation: with predefined or user-specified separators, the text-to-columns tool splits the data into multiple columns. Figure 4 shows the splitting process. The data obtained in this way can be stored directly in a structured format, as shown in Figure 4.
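The line-by-line alternative mentioned above can be sketched in Python. The sketch assumes, based only on the MID/FIND formula, that each output line looks like `No,name:longitude,latitude`; the exact layout of the tool output, the function `parse_output_line`, and the sample line are assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class ParsedLine(NamedTuple):
    number: str
    name: str
    lng: float
    lat: float


def parse_output_line(line: str) -> Optional[ParsedLine]:
    """Split one 'No,name:lng,lat' line into fields; return None if malformed."""
    comma = line.find(",")
    colon = line.find(":", comma + 1)
    if comma == -1 or colon == -1:
        return None
    number = line[:comma].strip()
    # Same span as the Excel formula: the text between the first comma
    # and the following colon.
    name = line[comma + 1:colon].strip()
    try:
        lng_text, lat_text = line[colon + 1:].split(",")
        return ParsedLine(number, name, float(lng_text), float(lat_text))
    except ValueError:
        # Wrong number of coordinate fields or non-numeric values.
        return None


# Hypothetical sample line (not taken from output A).
parsed = parse_output_line("1,Example Port:121.5,38.9")
print(parsed)
```

Returning `None` for malformed lines keeps failed address resolutions visible so they can be handled separately.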
2.2.5 Data association
Once the data have been converted into a structured storage format, the main content of the dataset is in place. However, because the new dataset focuses on the geographical distribution of the fishing ports and does not contain the other parameters, it needs to be associated with the original FPQDFCS database.
The unique identifier of each fishing port in the original database can be obtained by parsing the URL of the port's page on the FSDSC platform. For example, parsing the URL of “Ocean Red Center Fishing Port of Dandong City” gives the unique identifier 2 (Figure 5).
Each data entry in the new dataset is therefore associated with the other data in the original database through the original database code and the data entry ID. The original database has the single code A040360. The ID of each entry is identified by comparing the names and geographical locations of the fishing ports in the two data sets. To this end, we use VLOOKUP in Excel to look up the name of each fishing port in the original database one by one and thereby obtain the ID for each data entry in the new dataset, as shown in Figure 6.
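URL parsing of this kind can be sketched with the Python standard library. The actual URL layout of the port pages is not given here, so the example URL and the query parameter name `id` are hypothetical placeholders.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_entry_id(url: str, param: str = "id") -> Optional[str]:
    """Return the value of one query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get(param)
    return values[0] if values else None


# Hypothetical URL; the real platform may use a different layout.
example_url = "http://example.org/detail.asp?db=A040360&id=2"
print(extract_entry_id(example_url))
```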
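The name-based lookup can also be expressed as a small Python sketch, given here as an equivalent of an exact-match VLOOKUP rather than the procedure actually used. The dictionary keys `name` and `id`, the function `attach_pre_ids`, and the sample rows are assumptions.

```python
from typing import Any, Dict, List


def attach_pre_ids(
    new_rows: List[Dict[str, Any]],
    original_rows: List[Dict[str, Any]],
    dbcode: str = "A040360",
) -> List[Dict[str, Any]]:
    """Add dbcode and preID to each new row by exact match on port name."""
    index: Dict[str, Any] = {}
    for row in original_rows:
        # Like an exact-match VLOOKUP, keep the first occurrence of a name.
        index.setdefault(row["name"], row["id"])
    linked_rows = []
    for row in new_rows:
        linked = dict(row)
        linked["dbcode"] = dbcode
        linked["preID"] = index.get(row["name"])  # None when no match is found
        linked_rows.append(linked)
    return linked_rows


# Hypothetical rows (not taken from either dataset).
original = [{"id": 2, "name": "Port X"}, {"id": 5, "name": "Port Y"}]
new = [{"name": "Port Y", "x": 1.0, "y": 2.0}, {"name": "Port Z", "x": 3.0, "y": 4.0}]
linked = attach_pre_ids(new, original)
print(linked)
```

Rows whose `preID` is `None` would need manual matching, for instance by comparing the location text.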
2.2.6 Data verification
Output A constitutes the main data of this dataset. Its input is the combination of the name and geographic location of each fishing port. Although this input is the most detailed description, it contains redundant and duplicate information that may lead to inaccurate or even erroneous results when entered into the tool. Entries whose error is too large need further processing.
With A as the main data of the fishing port geographical distribution dataset, A and B are validated against each other to test the plausibility of the data, and C compensates to a certain extent for the deficiencies of A and B.
(1) Comparison of the three output data groups
In the resulting data sets A, B, and C, the Euclidean distances between the geographical coordinate values range from 0 to 25, and the numerical distributions of A and B are shown in Figure 7.
Numerical Euclidean distance is abstract and needs to be converted into actual distance. The distance is calculated using the Baidu open platform's distance function map.getDistance(pointA, pointB). Computing pairwise distances among the A, B, and C data yields three sets of distances, as shown in Figure 8. Some distances are not available (N/A) because address resolution failed and no coordinate point was obtained. A deviation of 1 km is set as the threshold in data comparison. When the A-B deviation is within 1 km, the data are considered valid and can be stored in the final data set. When the A-B deviation exceeds 1 km, C is consulted: A is adopted if C is closer to A, and B is adopted otherwise.
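The comparison rule can be sketched in Python. Note that this sketch substitutes the haversine great-circle formula (mean Earth radius 6371 km) for Baidu's map.getDistance, so the distances will not match the platform's values exactly. Keeping A when A and B agree, and returning `None` when C is unavailable, are assumptions; the function names are hypothetical.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]  # (longitude, latitude) in degrees

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(p: Point, q: Point) -> float:
    """Great-circle distance between two points in kilometres."""
    lng1, lat1, lng2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def choose_coordinate(
    a: Point, b: Point, c: Optional[Point], threshold_km: float = 1.0
) -> Optional[Point]:
    """Apply the 1 km rule: accept A if A and B agree, otherwise let C decide."""
    if haversine_km(a, b) <= threshold_km:
        return a  # assumption: A, as the main data, is kept when A and B agree
    if c is None:
        return None  # assumption: no arbiter available, flag for manual review
    return a if haversine_km(c, a) < haversine_km(c, b) else b


# Hypothetical coordinates for illustration only.
a, b, c = (121.0, 38.0), (121.1, 38.0), (121.09, 38.0)
print(choose_coordinate(a, b, c))
```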
(2) Validation through other attribute data of the fishing ports
In the original FSDSC database, pier, revetment, and breakwater are described in units of length, reflecting the actual construction scale and size of the fishing port. These attributes matter for validating the new dataset because they allow the error range of certain entries to be expanded appropriately. For example, if the A-B deviation is 2 km and the fishing port has a size of 3 km, the deviation falls within the extent of the port, so the geographic data entry is within a reasonable range and can be entered into the final data set. Validation through other attribute data thus helps recover some of the data records. Through the above two-step data verification, a geographical distribution data set for fishing ports is finally formed, with 657 valid data entries.
This dataset consists of 657 data entries. Each data entry records six attributes: ID, name, x, y, dbcode, and preID. ID is the unique number of the data entry, usually expressed as an integer. Name is a text field containing the geographic location and name of the fishing port. The longitude of the fishing port is recorded as x, and the latitude is recorded as y. Dbcode and preID denote the code and serial number of the fishing port in the original database, respectively. Take the second entry of the dataset as an example (Figure 9). As the “name” field indicates, this entry is for Cheguan fishing port in Shitang Village, Shitang Town. It is located at longitude 117.641872 and latitude 31.93985. The data entry is derived from the 737th (preID) entry of the database coded A040360 (dbcode) in FSDSC.
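The expanded tolerance can be illustrated with a short Python sketch. How port size is derived from the pier, revetment, and breakwater lengths is not specified, so taking the largest of the three as the port extent is an assumption, as are the function names.

```python
def port_extent_km(pier_m: float, revetment_m: float, breakwater_m: float) -> float:
    """Approximate port size as the largest structure length (an assumption)."""
    return max(pier_m, revetment_m, breakwater_m) / 1000.0


def is_within_tolerance(
    deviation_km: float, port_size_km: float = 0.0, base_threshold_km: float = 1.0
) -> bool:
    """Accept a deviation if it is within the base threshold or the port size."""
    tolerance = max(base_threshold_km, port_size_km)
    return deviation_km <= tolerance


# The worked case from the text: a 2 km deviation for a port 3 km in size.
print(is_within_tolerance(2.0, port_size_km=3.0))
```

With these definitions, a 2 km deviation is accepted for a 3 km port but rejected for a port smaller than 2 km.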
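The record layout can be written down as a small Python data class. The sketch below is illustrative only: the class name, the Python attribute names, the coordinate range check, and the value `id=2` for the second entry are assumptions; the remaining values come from the example entry described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FishingPortEntry:
    id: int        # unique number of the data entry
    name: str      # geographic location and name of the fishing port
    x: float       # longitude in degrees
    y: float       # latitude in degrees
    dbcode: str    # code of the original database
    pre_id: int    # serial number of the port in the original database (preID)

    def __post_init__(self) -> None:
        # Basic sanity check on coordinate ranges.
        if not (-180.0 <= self.x <= 180.0 and -90.0 <= self.y <= 90.0):
            raise ValueError("coordinates out of range")


entry = FishingPortEntry(
    id=2,  # assumed from "the second entry"
    name="Cheguan fishing port, Shitang Village, Shitang Town",
    x=117.641872,
    y=31.93985,
    dbcode="A040360",
    pre_id=737,
)
print(entry)
```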
Data quality of this dataset depends on the integrity and accuracy of the FPQDFCS source data, and the processing accuracy of the geographic information analysis tools.
The FPQDFCS database is mainly derived from the official fishery port data of coastal China released in 1990 by the Ministry of Agriculture, which ensures a high level of data reliability and accuracy.
The geographic information analysis tools, on the other hand, introduce a certain amount of error into the geographic coordinates obtained. These errors are mainly caused by imprecise geographical descriptions in the source data. In addition, service providers may encrypt the coordinates of certain geographic locations to comply with relevant laws and regulations. The deviation must nevertheless remain within a commonsense range. The dataset is therefore validated against Baidu Map, where the fishing ports are entered and located so that latitudes and longitudes can be compared (http://api.map.baidu.com/lbsapi/getpoint/index.html). Results show that 20 locations in the dataset have a relatively large deviation, and the accuracy of the data is 96.97%.
By quantifying geographical locations, the dataset makes it easier to associate different data categories and to analyze and use the data. Two kinds of analysis are supported. The first is data association analysis: algorithmic tools from data mining, statistics, machine learning, and so forth can be used to analyze data characteristics, such as the relationship between fishing port location and features, or the common features of a certain group of fishing ports. The second is user-data correlation analysis: users and data can be correlated via user location to provide relevant services, such as information recommendation and other personalized services. Examples include recommending information of interest to users or recommending hotspots near a user's location.
An example of data application is shown in Figure 10. According to user location, all fishing ports within the threshold range are marked on Baidu Map. Location distance, data popularity, and so forth, can be used as indicators of thresholds, based on which recommendations can be made to reflect user interest.
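A distance-threshold filter of this kind can be sketched in Python. The sketch uses the haversine formula with a mean Earth radius of 6371 km instead of a map platform's distance service, and the function names, tuple layout, and sample ports are hypothetical.

```python
import math
from typing import List, Tuple

Port = Tuple[str, float, float]  # (name, longitude, latitude)


def distance_km(lng1: float, lat1: float, lng2: float, lat2: float) -> float:
    """Haversine great-circle distance in kilometres (approximation)."""
    lng1, lat1, lng2, lat2 = map(math.radians, (lng1, lat1, lng2, lat2))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def ports_within(
    user_lng: float, user_lat: float, ports: List[Port], radius_km: float
) -> List[Tuple[str, float]]:
    """Return (name, distance) for ports inside the radius, nearest first."""
    hits = []
    for name, lng, lat in ports:
        d = distance_km(user_lng, user_lat, lng, lat)
        if d <= radius_km:
            hits.append((name, d))
    return sorted(hits, key=lambda item: item[1])


# Hypothetical ports and user position.
ports = [("Near Port", 121.01, 38.0), ("Far Port", 125.0, 38.0)]
nearby = ports_within(121.0, 38.0, ports, radius_km=5.0)
print(nearby)
```

Other indicators mentioned above, such as data popularity, could be added as further filters or as secondary sort keys.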
Nian Y, Zhai S & Xue C. Design and development of the Bohai Sea fishery service system based on WebGIS. Remote Sensing Technology and Application 30 (2015): 391–398.
Jiang K. Social Media Mining and Application with Geographical Location Information. PhD dissertation. Hefei: University of Science and Technology of China, 2014.
Deng Z, Yu Y, Yuan X et al. Situation and development tendency of indoor positioning. China Communications 10 (2013): 42–55.
Tang K, Xu F & Shen C. Survey on location-based services. Application Research of Computers 29 (2012): 4432–4436.
Han H, Xiao H, Yang N et al. Exploring personalized service on the fishing platform in the scientific data application. Guangdong Agricultural Sciences 39 (2012): 151–154. DOI: 10.16768/j.issn.1004-874x.2012.02.024
Zhou A, Yang B, Jin Z et al. Location-based services: Architecture and process. Chinese Journal of Computers 34 (2011): 1155–1171.
Chen F, Yang C, Shen S et al. Research on mobile GIS based on LBS. Computer Engineering and Applications (2006): 200–202, 210.
1. Chen M, Xu S, Liu H et al. Geographical distribution of fishing ports in coastal China, 2017. Science Data Bank. DOI: 10.11922/sciencedb.542
How to cite this article
Chen M, Xu S, Liu H et al. Geographical distribution of fishing ports in coastal China, 2017. China Scientific Data 3 (2018), DOI: 10.11922/csdata.2017.0009.zh