The processing tools adopted in this study were Thomson Data Analyzer (TDA)10 and Microsoft Excel. TDA is a powerful text mining tool that can extract textual information from multiple fields and provide comprehensive visualized analyses; it also enables the systematic organization of the retrieved literature information.11 Microsoft Excel was used to rearrange the exported data.
Figure 2 presents the data processing flowchart used in this study, which includes two major steps: record removal and field cleaning. Natural language processing (NLP) was adopted for data cleaning, mainly involving tokenization, stop word removal, stemming, lemmatization, and field merging.
Figure 2. Processing flowchart of the initial datasets (raw data)
Tokenization is the process of breaking a given string into a series of subsequences, such as words, keywords, and phrases.12 Each of these subsequences is called a “token”. In this process, special symbols such as punctuation are removed. In some cases, common words are of little value in matching documents to a user’s needs and therefore need to be removed from the vocabulary entirely; these words are called stop words.12 In literature retrieval, stemming and lemmatization differ in meaning. Stemming usually refers to a crude heuristic process that strips affixes from the ends of a word, often including derivational affixes.12 Lemmatization usually refers to the use of a vocabulary and morphological analysis to remove inflectional affixes,12 thereby returning the word to its original, dictionary form; the returned result is called a lemma. The raw data of this dataset were downloaded from the WoS database and stored in text format, and were then organized into several attribute fields, including title, authors, abstract, keywords, journal, publication year, and country. Because each attribute field reflects distinct paper information, researchers can select specific fields for knowledge mining.
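As a minimal sketch, the cleaning steps described above (tokenization, stop word removal, stemming, and lemmatization) can be illustrated in Python with the NLTK library; this is shown for illustration only, since the study performed these operations with TDA’s built-in modules, and the sample text and downloaded resources are assumptions.

```python
# Illustrative sketch of tokenization, stop word removal, stemming, and
# lemmatization using NLTK (not the tool used in the study, which was TDA).
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads of the required NLTK resources (assumed to be available).
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

text = "Text mining studies are growing rapidly; the studies cover many fields."

# Tokenization: break the string into word tokens and drop punctuation symbols.
tokens = [t for t in nltk.word_tokenize(text.lower()) if t not in string.punctuation]

# Stop word removal: drop common words of little value for retrieval.
stop_words = set(stopwords.words("english"))
content_tokens = [t for t in tokens if t not in stop_words]

# Stemming: crude heuristic stripping of affixes (e.g., "studies" -> "studi").
stems = [PorterStemmer().stem(t) for t in content_tokens]

# Lemmatization: vocabulary-based reduction to the dictionary form (lemma),
# e.g., "studies" -> "study".
lemmas = [WordNetLemmatizer().lemmatize(t) for t in content_tokens]

print(stems)
print(lemmas)
```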
The raw records obtained were first imported into TDA through an import filter, and the records were then segmented and stored in their respective fields. The literature analysis was performed with NLP and statistical methods. The NLP modules in TDA were used to process the metadata fields of the initial literature datasets, including title, abstract, authors, keywords, and keywords plus. The goal of NLP is to use rules to process text for specific purposes (such as translation, extraction of assertions, and summarization), where the rules may be predefined or learned through supervised or unsupervised methods.13 First, long sentences, including titles and abstracts, were segmented, and part-of-speech tagging was conducted on the segmented words and phrases, which were then further processed. Regular expressions are an efficient tool for extending retrieval, and the Fuzzy Matching Editor allows the user to tailor TDA’s cleanup algorithms to the requirements of the data sources; these two modules were adopted to perform the processing task. Preprocessing included de-duplication and empty record removal. De-duplication refers to the removal of duplicate records, with the abstract of each record serving as a unique identifier: if two records shared exactly the same abstract, one of them was removed from the dataset, thereby ensuring the uniqueness of each record. Empty record removal refers to the exclusion of paper records with null fields: if a record did not contain full information on title, abstract, or keywords, it was removed.
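The de-duplication and empty record removal steps can be sketched as follows, assuming the records are available as a table with title, abstract, and keyword fields; the pandas-based example below is only an illustration of the logic, as the study carried out these steps inside TDA.

```python
# Hedged sketch of preprocessing: empty record removal followed by
# de-duplication that uses the abstract as a unique identifier.
import pandas as pd

# Hypothetical export of WoS records with the metadata fields used here.
records = pd.DataFrame(
    {
        "title": ["Paper A", "Paper B", "Paper B (reprint)", None],
        "abstract": ["First study ...", "Second study ...", "Second study ...", "Fourth study ..."],
        "keywords": ["text mining", "bibliometrics", "bibliometrics", None],
    }
)

# Empty record removal: drop any record missing title, abstract, or keywords.
records = records.dropna(subset=["title", "abstract", "keywords"])

# De-duplication: keep only the first record carrying a given abstract.
records = records.drop_duplicates(subset="abstract", keep="first")

print(records)
```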
A total of 15,935 papers were initially retrieved from the ISI WoS Core Collection, of which 7,820 were removed and the remaining 8,115 records were included. Key attribute fields included title, abstract, keywords, country, document type, journal, publication year, research area, and times cited.
However, the field data contained inconsistencies ranging from spelling differences, whether intentional or accidental, to synonyms (e.g., “happy” and “glad”). Because accurate analysis relies on minimizing these inconsistencies, the keywords were first preprocessed through data cleaning using tools such as a number filter, punctuation eraser, stop word filter, English stemmer, and self-defined regex filter. Machine-assisted and rule-based recognition was then adopted to merge synonyms and reduce the size of the keyword list, as sketched below.
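The sketch below illustrates one possible form of this keyword cleaning and rule-based synonym merging; the stop list, synonym map, and sample keywords are assumptions for demonstration and do not reproduce the study’s actual TDA filters or thesaurus.

```python
# Illustrative keyword cleaning pipeline: number filter, punctuation eraser,
# stop word filter, and a small rule-based synonym map (all values assumed).
import re

STOP_WORDS = {"of", "the", "and", "for"}          # assumed small stop list
SYNONYMS = {"glad": "happy", "colour": "color"}   # assumed rule-based merges

def clean_keyword(keyword: str) -> str:
    """Normalize a raw keyword string into a canonical form."""
    kw = keyword.lower()
    kw = re.sub(r"\d+", " ", kw)       # number filter
    kw = re.sub(r"[^\w\s]", " ", kw)   # punctuation eraser (also splits hyphenated forms)
    tokens = [t for t in kw.split() if t not in STOP_WORDS]
    tokens = [SYNONYMS.get(t, t) for t in tokens]   # merge known synonyms
    return " ".join(tokens)

raw_keywords = ["Text Mining (2020)", "text-mining", "Colour of Images", "glad users"]
print(sorted({clean_keyword(k) for k in raw_keywords}))
# -> ['color images', 'happy users', 'text mining']
```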