Abstract: Typhoons are a category of natural disasters whose annual occurrence causes major loss of life and property in the Northwestern Pacific region. During typhoon events, social media serve as an effective tool for transmitting and acquiring disaster information in real time. Texts and photos from social media can serve as a form of crowdsourcing for extracting disaster loss information, analyzing human behaviors, and formulating responses. The dataset presented here consists of social media-based data collected from "Sina-Weibo" microblogs, "WeChat" articles, and "Baidu" news about the typhoon events in 2017, covering Typhoon "Merbok", "Roke", "Khanun", "Haitang", "Mawar", "Hato", "Nesat" and "Pakhar". We mainly collected text data from these social media platforms and websites, which were then cleaned to remove redundant and irrelevant entries. This dataset can be used for deeper disaster information mining of typhoon events.
Keywords: typhoon; social media; disaster reduction; data mining
|English title||A social media-based dataset of typhoon disasters, 2017|
|Data corresponding author||Xie Jibo (firstname.lastname@example.org)|
|Data authors||Yang Tengfei, Xie Jibo, Li Guoqing|
|Geographical scope||15°N – 30°N, 101°E – 132°E; specific areas include: southeast China and surrounding area|
|Data volume||1.70 GB (9749 texts from "Baidu" news and "WeChat" Subscription; 9601 records from "Sina-Weibo")|
|Data format||.html, .xls, .sql|
|Data service system||<http://www.sciencedb.cn/dataSet/handle/547>|
|Sources of funding||National Key R&D Program of China (2016YFE0122600); International Partnership Program of Chinese Academy of Sciences (131C11KYSB20160061)|
|Dataset composition||This dataset consists of two compressed (ZIP) files, which are "Data.zip" and "Classification example.zip". Among them, "Data.zip" is made up of eight subfolders, which are "Haitang", "Hato", "Khanun", "Mawar", "Merbok", "Nesat", "Pakhar", and "Roke". Social media data are stored in these subfolders in different formats, which include .html, .xls and .sql. "Classification example.zip" is made of seven subfolders which represent seven large categories of disaster losses, respectively. Each subfolder contains a few subfolders which represent small categories under corresponding large categories. These data are saved in XLS format.|
● XLS file: Texts from social media are stored in XLS format in a structured form.
● SQL file: Users can execute the SQL file in their own MySQL database to import the data which contain structured texts from social media.
● HTML file: It is used to store original web pages retrieved from "Baidu" news and "WeChat" Subscription.
● XLS file: It is used to store disaster loss data. Each file corresponds to a specific category of disaster loss.
Typhoons cause major losses to human life and property each year in the Northwestern Pacific region. How to quickly collect information and respond appropriately is an urgent problem faced by disaster relief departments. Crowdsourcing and citizen observation have been effective methods of obtaining disaster information; among these, social media, represented in particular by Twitter,1 Facebook,2 and micro-blog data,3 provide near real-time information during the disaster period. By making full use of the dynamic information collected from social media, disaster relief departments can get timely information about disaster events and people's responses to them. Research has been done on the mining of disaster information based on social media data. Evidence shows that people's behavior is greatly influenced by social media when disasters occur.4 A study commissioned by the American Red Cross5 found that more than half of the respondents believed that government agencies should monitor social media to acquire timely and effective disaster information. As to how to use social media data to mine valuable disaster information, Chae J et al.6 used Twitter data for hurricane disaster analysis, and the results provided support for government departments' policy decision-making. Some studies7,8 built disaster event classifiers based on microblog data for disaster event identification, which detected disasters through citizen observation. In addition, progress has been made in the spatio-temporal analysis of disasters,9,10 the characteristics of social responses to disasters,11 and the prediction and simulation of disaster trends,12,13 which has greatly improved the efficiency of disaster relief.
Collecting useful information about disaster events from social media is time-consuming and complicated because the content is unstructured. Although some social media platforms provide an API (Application Programming Interface) for public information access, they also restrict the information that can be acquired. For example, we cannot retrieve the micro-blog posts related to a specific disaster event, nor can we retrieve posts from a specified historical period directly through the API. In other words, the APIs of these platforms do not provide the corresponding retrieval functions, which increases the workload of subsequent data processing. Therefore, in our research project, we developed a toolkit to automatically harvest and process social media-based disaster information. We used the toolkit to generate a typhoon disaster dataset for 2017 based on several social media platforms. The dataset is mainly composed of text data from "Sina-Weibo" microblogs, "WeChat" Subscription and "Baidu" news. Figure 1 shows typhoon disaster data from "Sina-Weibo". The data contain textual descriptions and pictures of the disasters, as well as the time and location of data upload. They provide data support for disaster relief departments to follow the progress of a disaster in a timely manner.
The dataset records information on the following eight typhoon events: "Merbok", "Roke", "Khanun", "Haitang", "Mawar", "Hato", "Nesat" and "Pakhar" (Table 1).
The data from "WeChat" Subscription and "Sina-Weibo" are mostly from unofficial media and public uploads, which mainly describe the progression of a disaster based on public observation. To give a more comprehensive understanding of the disaster, we added data from "Baidu" news, which were released by official media and mainly contained disaster loss statistics, relief measures, etc. We used different methods to obtain data from the different sources. Keyword search was used to retrieve data from "WeChat" Subscription and "Baidu" news. For example, when "Typhoon Hato 2017" was entered, the "Baidu" search engine returned the news related to "Typhoon Hato" in 2017. The toolkit we developed was used to conduct the search and to automatically retrieve the relevant contents. Then, we parsed and cleaned these texts and stored them in the database in a structured form. The same method was used to obtain data from "WeChat" Subscription. For "Sina-Weibo", we used the advanced search function of the platform to obtain data related to the typhoon events. According to the track of each typhoon event (Figure 2), we selected the name of the typhoon plus the characters "台风 (Typhoon)" as the keywords for setting retrieval conditions.
2.2 Data collection process
We developed a social media data harvesting system with functions for data collection, parsing, cleaning, and management, as shown in Figure 3. We acquired data from different platforms by using the collection module, and then parsed them into a structured form. The HTML pages from "WeChat" Subscription and "Baidu" news were also stored in their original HTML format. Cleaning the data involved removing duplicated information, converting traditional Chinese into simplified Chinese, converting full-width characters into half-width characters, etc. Finally, these data were stored in a structured form. The structure of the data is shown in Table 2.
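Two of the cleaning steps named above, full-width to half-width conversion and duplicate removal, can be sketched as follows. This is a minimal illustration and not the authors' toolkit: the function names are hypothetical, the whitespace collapsing is an added assumption, and the conversion from traditional to simplified Chinese is omitted because it requires a character mapping table.

```python
def to_half_width(text: str) -> str:
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    space (U+3000) to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # The full-width block is offset from ASCII by 0xFEE0.
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean_records(texts):
    """Normalize each text and drop exact duplicates, keeping the first
    occurrence. Collapsing runs of whitespace is an assumption made here
    so that trivially different copies compare equal."""
    seen = set()
    cleaned = []
    for text in texts:
        normalized = " ".join(to_half_width(text).split())
        if normalized and normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```

Exact-match deduplication only catches identical copies after normalization; near-duplicates such as reposts with added comments would need a similarity measure.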
|File(.zip)||Folder||Folder||File(.xls, .sql, .html)||Notes|
|.html: Users can parse the page themselves according to their research needs.|
.sql: Users can execute the SQL file in their own MySQL database to import the data into it.
.xls: Users can use the data directly through the XLS file.
2.3 Data classification
Social media data contain a lot of disaster loss information, and different types of damage may be reported in the same post. For example, a text from "Sina-Weibo" reads, "After the typhoon, many trees were blown down and many cars were smashed." The text contains disaster loss information about the destruction of trees and cars, and we assigned this information to different categories of disaster losses. Below we provide a classification example according to the type of reported damage caused by the disaster. The raw data in this classification example are all from "Sina-Weibo" microblogs related to typhoon "Hato" in Zhuhai. Users can classify the rest of the data in the dataset by referring to the classification example or according to their specific research needs. The seven large categories are social effects, forestry, fisheries, traffic, electric power, communication and infrastructure damage. Each large category contains several small categories, as shown in Figure 4. For example, the category of social effects contains injuries and deaths, water shortage, building damage, and market shutdown. The classification example is shown in Table 3.
|Large category||Small categories||Number of posts|
|Social effects||Injuries and deaths||12|
|Forestry||Destruction of trees and plants||119|
|Fisheries||Loss of fishing ground||1|
|Damage of fishing boats||1|
|Electric power||Electric power cutoff||287|
|Damage of electric power equipment||4|
|Communication||Interruption of networks and signals||123|
|Infrastructure damage||Damage of street lamps, billboards, bridges, roads, and so on||34|
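Because one post can report several kinds of damage, classification is a multi-label task. The sketch below illustrates that idea with simple keyword matching. It is not the authors' procedure (the classification example was produced manually), and the keyword lists are hypothetical English stand-ins chosen only for illustration; the category names are taken from Table 3.

```python
# Hypothetical keyword lists; the real posts are in Chinese and the
# dataset's classification example was produced by hand.
CATEGORY_KEYWORDS = {
    "Forestry": ["tree", "plant"],
    "Electric power": ["power", "electricity"],
    "Communication": ["signal", "network"],
}


def classify(text: str) -> list:
    """Return every large category whose keywords appear in the text.
    A post may match zero, one, or several categories."""
    lowered = text.lower()
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]


labels = classify(
    "Many trees were blown down, and the power and the phone signal were cut off."
)
print(labels)
```

Substring matching of this kind is crude and will produce false positives, so it is better suited to pre-sorting posts for manual review than to producing final labels.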
Data fields for "Sina-Weibo" include ID, keyword, province, city, content, picture, location, release time, platform, number of forwards, comments, and number of likes, as shown in Table 4. Each column has a limit of no more than 140 characters. The topics of the dataset include property loss, traffic impact, casualties, power supply, communication impact, rescue arrangements, response measures, and public attitudes toward the typhoon, among others.
|Content||After the typhoon, Mr. Liu asked me out for a walk to experience the post-disaster Zhuhai. Almost no restaurant was open. Having looked for a long time, finally we found a restaurant which was open. We saw so many cars smashed, trees blown down, and yachts blown ashore. My little white car was scratched by the branches. How can I go to work tomorrow, since Hengqin is so far away? The last picture, as a tribute to our soldiers!|
|Release time||2017-08-23 22:25|
|Number of forwards||–|
|Number of likes||1|
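A record shaped like the sample above could be parsed into typed values as sketched below. The column names are assumed to match the labels shown in Table 4, which may differ from the headers in the actual XLS and SQL files, and treating "–" as a missing value is also an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Markers assumed to mean "no value recorded".
MISSING = {"", "-", "–"}


def parse_count(value: str) -> Optional[int]:
    """Convert a count field to int, or None when it is marked as missing."""
    value = value.strip()
    return None if value in MISSING else int(value)


@dataclass
class WeiboRecord:
    content: str
    release_time: datetime
    forwards: Optional[int]
    likes: Optional[int]


def parse_record(row: dict) -> WeiboRecord:
    """Build a typed record from a row keyed by the (assumed) column names."""
    return WeiboRecord(
        content=row["Content"],
        release_time=datetime.strptime(row["Release time"], "%Y-%m-%d %H:%M"),
        forwards=parse_count(row["Number of forwards"]),
        likes=parse_count(row["Number of likes"]),
    )


record = parse_record({
    "Content": "We saw so many cars smashed, trees blown down.",
    "Release time": "2017-08-23 22:25",
    "Number of forwards": "–",
    "Number of likes": "1",
})
```

Keeping missing counts as `None` rather than 0 preserves the distinction between a post with no forwards and a post whose count was not captured.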
Data fields for "Baidu" news include ID, title, link, source, release time, and keyword, as shown in Table 5. The fields for "WeChat" Subscription include ID, title, content, source, release time, and keyword, as shown in Table 6. The themes of the data include typhoon tracks, disaster loss statistics, government announcements, emergency measures, etc.
Keywords related to each typhoon event were diversified and optimized to maximize the retrieval of related information from each social media platform. After data collection was completed, we manually checked the validity of the data, and removed incomplete entries as well as entries irrelevant to the typhoon disaster. In addition, we established a database index system to avoid duplicate data. For disaster classification, three colleagues were assigned to classify the original data to ensure the accuracy of the final classification results. Prior to this, classification standards had been set up to minimize possible discrepancies. Finally, we randomly sampled 500 data entries from each platform and found an accuracy rate of nearly 100%.
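One common way to use a database index to avoid duplicate data is a unique index that rejects repeated rows at insert time. The sketch below shows the idea with Python's built-in `sqlite3` module standing in for MySQL. The table layout and the choice of indexed columns are hypothetical, since the text does not describe the actual index design.

```python
import sqlite3


def build_store() -> sqlite3.Connection:
    """Create an in-memory table with a unique index on (platform, content).
    The schema is hypothetical and only illustrates the technique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, platform TEXT NOT NULL, content TEXT NOT NULL)"
    )
    conn.execute("CREATE UNIQUE INDEX idx_posts_unique ON posts (platform, content)")
    return conn


def insert_posts(conn: sqlite3.Connection, rows) -> int:
    """Insert rows, silently skipping any that violate the unique index.
    Returns the number of rows actually inserted."""
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO posts (platform, content) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn.total_changes - before
```

In MySQL the corresponding statement is `INSERT IGNORE`. A long text column there would typically be indexed through a prefix or a hash column rather than in full.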
To our knowledge, no social media-based datasets existed for these typhoons before, and our dataset fills this gap. The data in our dataset can be analyzed to meet different needs of disaster research. For example, the disaster loss data presented here can be re-classified into different categories to support real-time evaluations of disaster losses. The data can also be used for further analysis of typhoon disasters, such as sentiment analysis of victims in the typhoon area, the extraction of buzzwords during typhoon transits, etc. In follow-up studies, we have used the texts in this dataset as a training corpus for automatic identification of typhoon disaster information, which achieved satisfactory results.
This work is supported by the National Key R&D Program of China (2016YFE0122600). We thank Edward T.-H. Chu, Associate Professor at National Yunlin University of Science and Technology, Taiwan, China for his advice on data collection. We thank Li Zhenyu from Shandong University of Science and Technology and Dr. Tian Chuanzhao from the Institute of Remote Sensing and Digital Earth, Chinese Academy of Sciences for their careful examination of our dataset.
Sakaki T, Okazaki M & Matsuo Y. Twitter analysis for real-time event detection and earthquake reporting system development. IEEE Transactions on Knowledge & Data Engineering 25 (2013): 919 – 931.
Bird D, Ling M & Haynes K. Flooding Facebook – the use of social media during the Queensland and Victorian floods. Australian Journal of Emergency Management 27 (2012): 27 – 33.
Wang YD, Li H, Wang T et al. Emergency information mining and analysis of emergency based on social media. Journal of Wuhan University 41 (2016): 290 – 297.
National Research Council (U.S.). Public Response to Alerts and Warnings Using Social Media: Report of a Workshop on Current Knowledge and Research Gaps. Washington, DC: The National Academies Press, 2013.
American Red Cross. Social media in disasters and emergencies. Available at: <http://i.dell.com/sites/content/shared-content/campaigns/en/Documents/red-cross-survey-social-media-in-disasters-aug-2010.pdf> [Accessed December 11, 2017].
Chae J, Thom D, Yun J et al. Public behavior response analysis in disaster events utilizing visual analytics of microblog data. Computers & Graphics 38 (2014): 51 – 60.
Zhou Y, Yang L, Walle BVD et al. Classification of microblogs for support[ing] emergency responses: Case Study [of] Yushu Earthquake in China, 2014. Proceedings of the 47th Hawaii International Conference on System Sciences, 2013: 1553 – 1562.
Qu Y, Huang C, Zhang P et al. Microblogging after a major disaster in China: a case study of the 2010 Yushu earthquake. Proceedings of ACM Conference on Computer Supported Cooperative Work, 2011: 25 – 34.
Chae J, Thom D, Jang Y et al. Special section on visual analytics: Public behavior response analysis in disaster events utilizing visual analytics of microblog data. Computers & Graphics 38 (2014): 51 – 60.
Chen Z, Gao T, Luo NX et al. Social media effectiveness to reflect the spatial and temporal distribution of natural disasters. Science of Surveying and Mapping 42 (2017): 44 – 48.
Liu HB & Zhai GF. A comparative study of the social response characteristics of different disasters based on social media information. Journal of Catastrophology 32 (2017): 187 – 193.
Stoové MA & Pedrana AE. Making the most of a brave new world: Opportunities and considerations for using Twitter as a public health monitoring tool. Preventive Medicine 63 (2014): 109 – 111.
1. Yang T, Xie J & Li G. A social media-based dataset of typhoon disasters, 2017. Science Data Bank. DOI: 10.11922/sciencedb.547
How to cite this article
Yang T, Xie J & Li G. A social media-based dataset of typhoon disasters, 2017. China Scientific Data 3 (2018), DOI: 10.11922/scdata.2017.0014.en