Abstract: With an increasing number of scientific achievements being published, it is especially important to conduct scientific literature-based knowledge discovery and data mining (KDDM). As one of the most destructive disasters, flood has been the topic of a large body of scientific publications. On 1 July 2017, we collected and processed literature data on flood research and categorized the retrieved paper records into a Whole SCI Dataset (WS) and a High-cited SCI Dataset (HCS). This dataset can serve as basic data for bibliometric analysis to identify the status and future trends of global flood research during 2010 – 2017.
Keywords: literature datasets; flood; WS; HCS
|English title||Global literature database of flood research, 1990 – 2017|
|Data corresponding author||Li Guoqing (firstname.lastname@example.org)|
|Data authors||Zhang Hongyue, Li Guoqing, Huang Mingrui, Qing Xiuling|
|Time range||1990 – 2017|
|Data volume||6.46 MB (8370 records in Whole SCI Dataset and 156 records in High-cited SCI Dataset)|
|Data service system||<http://www.sciencedb.cn/dataSet/handle/591>|
|Sources of funding||National Key Research and Development Program of China (2016YFE0122600); International Partnership Program of the Chinese Academy of Sciences (131C11KYSB20160061)|
|Dataset composition||This dataset consists of two compressed (ZIP) files: "WS.zip" and "HCS.zip". "WS.zip" is the Whole SCI Dataset, which stores the full list of flood papers collected. "HCS.zip" is the High-cited SCI Dataset, which stores the papers with over 100 citations. The data are saved in XLS format.|
Flood is among the most destructive natural disasters and has been one of the most frequently occurring disasters worldwide in recent years. According to statistics of the German Reinsurance Company, flood has been one of the most significant natural disasters in the world.1 As floods occur frequently and cause widespread damage, it is difficult to identify a specific flood event by the generic hazard type. Therefore, the scientific publications on flood were taken as a group to illustrate the scientific findings.
Literature-based knowledge discovery has been applied in several research domains, such as medical and biological research2,3 and studies of information science and scientific development.4,5 Owing to the growing body of text and the open-access policies of many journals, literature mining is becoming useful for both hypothesis generation and scientific discovery. However, semantic heterogeneity still restricts literature-based data acquisition. To the best of our knowledge, studies that collect scientific publications on disaster events from online literature databases have so far been rare.
In order to unearth the hidden knowledge of flood research, literature-based discovery (LBD) was adopted to build the whole dataset and the high-cited dataset of flood studies by means of literature retrieval and processing.
Among the best-known web-based literature databases (including Google Scholar, Web of Science (WOS), Scopus and PubMed), WOS (Thomson Reuters) has been one of the most frequently used databases for scientific journals in the natural sciences in recent years.6 The Web of Science Core Collection comprises several citation indexes, including the Science Citation Index (SCI), the Social Sciences Citation Index (SSCI) and the Arts & Humanities Citation Index (AHCI).7 Despite the continual emergence of bibliometric databases in recent years, the Science Citation Index (SCI) is arguably the most reliable index for documenting scientific output.8
The data were obtained from the SCI database of the ISI Web of Science (now Thomson Scientific) Core Collection. While WOS guarantees a relatively stable search environment with clearly defined lists of indexed journals, its search conditions are restricted to metadata such as title, keywords and abstract.9 In order to obtain complete and accurate results, we narrowed down the literature queries by setting four parameters.
The first parameter is the research topic. In advanced search, the "Topic" field tag was set as "TS=(((flood near (event or hazard or disaster)) or (flood near/3 (inundation or damage or risk or zone))) not (volcano or basalt))", where 'near/n' finds records containing all terms within a certain number of words (n) of each other. Because volcano and basalt research (e.g., on flood basalts) also matches the term "flood", 'not (volcano or basalt)' was added to the search formula to exclude such records.
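For intuition, the logic of this topic query can be approximated in a few lines of Python. This is purely an illustrative sketch of the boolean/proximity semantics, not how Web of Science executes the search; the NEAR operator without an explicit /n is treated here as NEAR/15, its commonly documented default, and prefix matching loosely stands in for stemming.

```python
import re

def within_n_words(text, term_a, term_b, n):
    """Rough analogue of the WOS NEAR/n operator: True if a word starting
    with term_a occurs within n words of a word starting with term_b."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w.startswith(term_a)]
    pos_b = [i for i, w in enumerate(words) if w.startswith(term_b)]
    return any(abs(i - j) <= n for i in pos_a for j in pos_b)

def matches_flood_query(text):
    """Approximation of the TS query described above."""
    t = text.lower()
    if "volcano" in t or "basalt" in t:            # not (volcano or basalt)
        return False
    near_terms = ("event", "hazard", "disaster")   # flood near (...)
    near3_terms = ("inundation", "damage", "risk", "zone")  # flood near/3 (...)
    return (any(within_n_words(text, "flood", w, 15) for w in near_terms)
            or any(within_n_words(text, "flood", w, 3) for w in near3_terms))
```

For example, a title such as "flood risk assessment in urban areas" matches, while "Columbia River flood basalt province" is excluded by the negative clause.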
The second parameter is the time span, which was set to restrict the search results to papers published between January 2000 and December 2017.
The third parameter is publication type. "Article" was selected as the only publication type of this search. This is because other publication types (e.g., discussion, biographical item, editorial material) do not provide sufficient attribute information and are hence not very suitable for our discovery process.
The last parameter is the additional citation index: the Science Citation Index Expanded (SCIE) was added.
In order to obtain a full description of each paper, full records and cited references were downloaded in .txt format. The description fields include abstract, authors, country, publication year, institution, research area, Web of Science category, journal, title, source, and so on. See Figure 1 for more details.
Figure 1 Description fields of the articles collected
Notes: In the above figure, PT represents Publication Type (J = Journal; B = Book; S = Series; P = Patent); AU: Authors; AF: Author Full Name; TI: Document Title; SO: Publication Name; LA: Language; DT: Document Type; DE: Author Keywords; ID: Keywords Plus; AB: Abstract; C1: Author Address; RP: Reprint Address; EM: E-mail Address.
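For readers who want to work with such tagged exports directly, the two-letter field-tag layout shown in Figure 1 can be read with a short routine like the one below. This is a minimal sketch of the plain-text record format (tag lines, three-space continuation lines, and an "ER" end-of-record marker), not a complete WOS parser.

```python
def parse_wos_records(text):
    """Minimal parser sketch for Web of Science plain-text exports.
    Each record is a series of two-letter tagged lines (PT, AU, TI, ...);
    continuation lines start with three spaces; 'ER' ends a record."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if line.startswith("ER"):                      # end of record
            records.append(current)
            current, tag = {}, None
        elif line.startswith("   ") and tag:           # continuation line
            current[tag].append(line.strip())
        elif len(line) >= 2 and line[:2].isalpha() and line[:2].isupper():
            tag = line[:2]                             # new field tag
            current.setdefault(tag, []).append(line[3:].strip())
    return records
```

Each returned record maps a field tag (e.g., "TI") to its list of lines, which makes it straightforward to select specific attribute fields for later mining steps.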
A total of 15935 records were obtained on 1st January, 2018.
2.2 Data processing
The processing tools adopted in this study were Thomson Data Analyzer (TDA)10 and Microsoft Excel. TDA is a powerful text mining tool that can mine text information from multiple aspects and offers comprehensive, visualized analysis. Moreover, it organizes literature information systematically.11 Microsoft Excel was used to rearrange the exported data.
The technical framework employed by this study is presented as a flow chart in Figure 2. It includes two main steps: records removal and natural language processing (NLP). Raw data are downloaded from the literature databases, and the text data are organized into several attribute fields, including title, authors, abstract, keywords, journal, publication year and country, and so forth. Since the attribute fields reflect paper information, researchers can select specific fields to perform knowledge mining.
The raw records were first imported into TDA using an import filter and then preprocessed. The NLP modules in TDA were employed to process the metadata fields of the initial literature dataset, including title, abstract, authors, keywords and Keywords Plus. Word and phrase segmentation was further processed, including tokenization, stop-word removal, stemming, normalization and lemmatization, field merging, and so forth. The preprocessing module includes de-duplication and empty-record removal. For de-duplication, the abstract of each record was taken as a unique identifier: if two records shared exactly the same abstract, one of them was removed from the dataset, ensuring the uniqueness of each record's abstract. Empty-record removal excluded records with null fields: if a record contained no information in the title, abstract or keywords fields, it was removed. This screening process retained 8370 valid records.
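The de-duplication and empty-record-removal steps described above amount to a simple filter pass; a sketch follows. The field names (`title`, `abstract`, `keywords`) are illustrative assumptions, not the actual TDA configuration, and a record is dropped here only when all three fields are empty.

```python
def clean_records(records):
    """Sketch of the preprocessing step: drop records whose title, abstract
    and keywords are all empty, then de-duplicate records that share
    exactly the same abstract (the abstract acts as a unique identifier)."""
    seen, cleaned = set(), []
    for rec in records:
        if not (rec.get("title") or rec.get("abstract") or rec.get("keywords")):
            continue                                  # empty-record removal
        key = (rec.get("abstract") or "").strip().lower()
        if key and key in seen:
            continue                                  # duplicate abstract
        seen.add(key)
        cleaned.append(rec)
    return cleaned
```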
Literature analysis can be performed using natural language processing (NLP) and statistical methods. The goal of NLP is to use rules – either predefined or learned through supervised or unsupervised methods – to process text for specific purposes, such as translation, extraction of assertions and summarization, among others.
A total of 15935 papers were initially retrieved from the ISI Web of Science Core Collection, of which 7565 were excluded and the remaining 8370 records were included. The key attribute fields selected include title, abstract, keywords, country, document type, journal, publication year, research area, and times cited.
2.3 WS and HCS
In order to highlight the differences between high-cited papers and the whole set of articles, the papers were grouped into two sets (Figure 3): the Whole SCI Dataset (WS) and the High-cited SCI Dataset (HCS). WS contains all the research papers obtained after processing, and HCS contains the selected papers with over 100 citations.
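The WS/HCS split can be expressed as a simple filter on the citation count. The field name `times_cited` below is an assumption for illustration; it stands for the "times cited" attribute stored with each record.

```python
def split_datasets(records, threshold=100):
    """Sketch of the WS/HCS split described above: WS keeps every processed
    record, while HCS keeps only records cited more than `threshold` times."""
    ws = list(records)
    hcs = [r for r in ws if r.get("times_cited", 0) > threshold]
    return ws, hcs
```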
The user community of a research paper includes the authors who write it, the editors who review and decide on it, and the readers who read it. A published paper reflects the authors' research interests and the editors' recommendations, while high-cited papers are the published papers most popular among readers, since the citation count reflects a paper's impact. In this sense, WS represents the authors' interests and the editors' views, while HCS represents the readers' preferences.
The literature datasets are stored in two tables: WS.xls and HCS.xls. WS.xls represents the Whole SCI Dataset, and HCS.xls the High-cited SCI Dataset. There are 8370 records in WS.xls and 156 records in HCS.xls. Each record consists of 11 attribute fields: article ID, title, abstract, (publication) country, times cited, keywords (authors'), Keywords Plus, research area, document type, journal, and publication year. Taking WS.xls as a sample, its field statistics are shown in Table 1.
|Field||Number of items||Coverage (%)||Data type||Meta tags|
|ISI Unique Article Identifier||8370||100%||Number||Identity Number|
|Number of Authors||29||100%||Number|
|Document Type||6||100%||Document Type|
In order to guarantee the relevance of each record, we excluded those whose title or keywords did not contain "flood". In addition, duplicate records were removed from the dataset, as were records with empty fields. Keywords (authors') and Keywords Plus correspond to the keywords provided by the authors and by the Web of Science, respectively. During processing, stop words, punctuation and numbers were deleted. After data collection was completed, we manually checked the validity of the data and removed incomplete entries as well as entries irrelevant to flood disasters.
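The removal of stop words, punctuation and numbers mentioned above can be sketched as a small normalization function. The stop-word list here is a tiny illustrative subset, not the list actually used in TDA.

```python
import re

# Small illustrative stop-word subset; the real list used in practice is larger.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "for", "to", "is"}

def normalize_keywords(text):
    """Sketch of the keyword cleaning step: lower-case the text, replace
    punctuation and digits with spaces, then drop stop words."""
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]
```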
The volume of the scientific literature is large (dozens of millions of documents) and continues to grow rapidly (by over a million documents per year), so it is necessary to make a general assessment of the literature by scientometric methods. In recent years, scientometric methods have been adopted in studies of global remote sensing,12,13 night-time light remote sensing14 and the application of remote sensing to human health.15 To the best of our knowledge, there were no literature-based datasets for flood research before, and our dataset fills this gap. The data provided here can be used to analyze hot issues in flood studies. Analysis results based on the Whole SCI Dataset (WS) and the High-cited SCI Dataset (HCS) can be compared against each other to reveal potential knowledge in disaster research.
This work is supported by the National Key Research and Development Program of China (2016YFE0122600). We thank Dr. Huang Mingrui from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences, for her support in the collection of this dataset, and Qing Xiuling from the National Science Library, Chinese Academy of Sciences, for her suggestions on data retrieval and processing.
Syvitski JPM, Overeem I, Brakenridge GR et al. Floods, floodplains, delta plains — A satellite imaging approach. Sedimentary Geology 267 – 268 (2014): 1 – 14.
Hristovski D, Peterlin B, Mitchell JA et al. Using literature-based discovery to identify disease candidate genes. International Journal of Medical Informatics 74 (2005): 289 – 298.
Jensen LJ, Saric J & Bork P. Literature mining for the biologist: from information retrieval to biological discovery. Nature Reviews Genetics 7 (2006): 119 – 129.
He L & Li F. Topic discovery and trend analysis in scientific literature based on topic model. Journal of Chinese Information Processing 26 (2012): 109 – 115.
Zins C. Conceptual approaches for defining data, information, and knowledge. Journal of the American Society for Information Science and Technology 58 (2007): 479 – 493.
Vieira E & Gomes J. A comparison of Scopus and Web of Science for a typical university. Scientometrics 81 (2009): 587 – 600.
Bakkalbasi N, Bauer K, Glover J et al. Three options for citation tracking: Google Scholar, Scopus and Web of Science. Biomedical digital libraries 3 (2006): 7.
Perianes-Rodriguez A, Waltman L & van Eck NJ. Constructing bibliometric networks: A comparison between full and fractional counting. Journal of Informetrics 10 (2016): 1178 – 1195, DOI:10.1016/j.joi.2016.10.006
Feng H & Fang S. Research on the application of Thomson Data Analyzer to analyse the patent intelligence of science institutions. Information Science 26 (2008): 1833 – 1843.
Yang Y, Akers L, Klose T et al. Text mining and visualization tools – impressions of emerging capabilities. World Patent Information 30 (2008): 280 – 293.
Zhang H, Huang M, Qing X et al. Bibliometric analysis of global remote sensing research during 2010 – 2015. ISPRS International Journal of Geo-Information 6 (2017): 332.
Zhuang Y, Liu X, Nguyen T et al. Global remote sensing research trends during 1991 – 2010: A bibliometric analysis. Scientometrics 96 (2013): 203 – 219.
Hu K, Qi K, Guan Q et al. A scientometric visualization analysis for night-time light remote sensing research from 1991 to 2016. Remote Sensing 9 (2017): 802 – 809.
1. Zhang HY, Li GQ, Huang MR et al. A dataset of scientific literature on flood, 2010 – 2017. Science Data Bank. DOI: 10.11922/sciencedb.591