Data Paper Zone II Versions EN1 Vol 3 (3) 2018
Database of quantum chemical calculation results based on compounds molecular structure
Received: 2018-06-08
Released: 2018-08-02
Published: 2018-09-29
Abstract & Keywords
Abstract: Basic physical property data and thermodynamic data are currently lacking for a large number of compounds of known structure. To improve the completeness and usability of chemistry databases, this study uses Gaussian03 to perform geometry optimization, thermodynamic data analysis, and spectrum calculation on the structures of about 200,000 compounds, starting from compound structure data and compound profiles. Compound types, thermodynamic parameters, and spectral data are then extracted from the quantum calculation results according to international standards. The extracted quantities include the infrared absorption spectrum, dipole moment, exact polarizability, rotational temperature and rotational constants, zero-point vibrational energy, zero-point correction, molecular internal energy, molecular kinetic energy, enthalpy, free energy, heat capacity at constant volume, and entropy. A second round of processing (data analysis, mining, and duplicate checking) yields quantum chemical calculation data for 18,000 compounds, including spectral data for 5,321 compounds. To improve data (re)usability, all resulting data have been standardized and processed.
Keywords: compound structure; quantum chemical calculation; data analysis and processing; thermodynamic data; spectral data
Dataset Profile
Chinese title: 量化计算结果数据库
English title: A database of quantum chemical calculation results
Data corresponding author: Han Qingzhen (qzhan@ipe.ac.cn)
Data authors: Han Qingzhen, Zhao Yuehong, Wen Hao
Time range: 2015 – 2017
Data format: *.xls
Data service system:
Sources of funding: National Science & Technology Infrastructure Program of China – Fundamental Science Data Sharing Platform (DKA2017-12-02-05); CAS Informatization Program of the Thirteenth Five-Year Plan – "Key Database Construction and Application Services for the Discipline of Chemistry" (XXH1350303-103).
Database composition: The dataset consists of 13 subsets in total, covering the following aspects: Dipole moment (Debye), Exact polarizability, Approx. polarizability, Rotational temperatures (Kelvin), Rotational constants (GHz), Zero-point vibrational energy (kJ/mol), Zero-point correction (Hartree/Particle), Thermal energy (kJ/mol), Thermal enthalpy (kJ/mol), Thermal free energy (kJ/mol), Total molecular kinetic energy (kJ/mol), CV (J/Mol-Kelvin), S (J/Mol-Kelvin). The database contains one compressed file “Quantumdata.xls”, which stores the thermodynamic and chemical property data obtained from the quantum calculations.
1.   Introduction
With the growing demand for new materials and the improvement of research and development capabilities in recent years, a great many materials with specific properties have emerged to meet varied requirements. However, because data on the thermodynamic properties of these materials and their derivatives are insufficient, our knowledge of many compounds of known structure remains limited. Under these circumstances, researchers in chemical engineering simulation and molecular material design benefit when quantum chemical calculations are performed for the compounds already included in existing chemical databases and the corresponding thermodynamic and spectral data are made available. Developing a database of quantum chemical calculation results is therefore of great significance.
The database serves chemical engineering simulation and material design, fields in which the relevant data often cannot be obtained from existing literature or through experiments. For this reason, we adopted reliable quantum chemical simulation methods. On the one hand, the resulting data can be further tested through future calculations; on the other hand, they can be used in chemical engineering simulation, molecular design, water pollution treatment, air purification, and so on. The result is a database that works like a query library and serves its users free of charge. The database will gradually be developed into a sound data management and information service system that provides web-based service and information inquiry, and it will be integrated into ChemDB as an affiliated sub-library.
2.   Data collection and processing
2.1   Data pre-processing
2.2   Methods of thermodynamic data calculation
We use Gaussian03 [1] and Gaussian09 [2] with the B3LYP functional [3] and the 6-31G basis set, at a temperature of T = 298.15 K and a pressure of P = 101.3 kPa. Geometry optimization and frequency calculation are performed on all Gaussian input files to obtain the thermodynamic parameters of all the compounds. Batch processing then identifies the output files that terminated normally and converged, and extracts from them the optimized geometrical parameters and the complete thermodynamic parameters of the compounds: dipole moment, exact polarizability, approx. polarizability, rotational temperature, rotational constants, zero-point vibrational energy, zero-point correction, molecular internal energy, molecular kinetic energy, enthalpy, free energy, heat capacity at constant volume, and entropy. After unit conversion and standardization [4], the data are compiled into a standard table (Table 1) and stored in the database of quantum chemical calculation results for online access. The flowchart for data calculation and database building is shown in Figure 1.
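The batch filtering and extraction step described above might be sketched as follows. This is a minimal illustration, not the authors' script: the marker strings and labels are those commonly printed in Gaussian 03/09 log files and should be checked against real output, and the field names in `LABELS` are hypothetical.

```python
import re

# Marker strings as commonly printed in Gaussian 03/09 log files
# (an assumption of this sketch; verify against your own outputs).
NORMAL_END = "Normal termination of Gaussian"
CONVERGED = "Stationary point found"

# Hypothetical field names mapped to the labels that precede each value.
LABELS = {
    "zero_point_correction": "Zero-point correction=",
    "thermal_energy_correction": "Thermal correction to Energy=",
    "thermal_enthalpy_correction": "Thermal correction to Enthalpy=",
    "thermal_free_energy_correction": "Thermal correction to Gibbs Free Energy=",
}

NUMBER = re.compile(r"-?\d+\.\d+")


def extract_thermo(log_text):
    """Return extracted values (Hartree/particle), or None if the job must be skipped."""
    lines = [ln for ln in log_text.splitlines() if ln.strip()]
    # Keep only jobs whose last line reports normal termination
    # and whose optimization reached a stationary point.
    if not lines or NORMAL_END not in lines[-1] or CONVERGED not in log_text:
        return None
    values = {}
    for line in lines:
        for field, label in LABELS.items():
            if label in line:
                match = NUMBER.search(line.split(label, 1)[1])
                if match:
                    values[field] = float(match.group())
    return values
```

Calling `extract_thermo` on the text of each output file and discarding the `None` results reproduces the "normal termination and convergence" filter in spirit; the remaining quantities (dipole moment, polarizability, rotational constants) would need their own labels.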
Table 1   Thermodynamic parameters of the compounds and their units
Thermodynamic parameters | Symbols and Definitions | Units
Dipole Moment | Dipole Moment (μ) | Debye
Exact Polarizability | Exact Polarizability | -
Approx Polarizability | Approx Polarizability | -
Rotational Temperatures | Rotational Temperatures | K
Rotational Constants | Rotational Constants | GHz
Zero-point Vibrational Energy | Zero-point Vibrational Energy | kJ∙mol−1
Zero-point Correction | Zero-point Correction | Hartree/Particle
Thermal Energies | Thermal Energies | kJ∙mol−1
Thermal Enthalpies | Thermal Enthalpies | kJ∙mol−1
Thermal Free Energies | Thermal Free Energies | kJ∙mol−1
Total Molecular Kinetic Energy | Total Molecular Kinetic Energy (298.15 K) | kJ∙mol−1
Heat | q or Q | J
Work | w or W | J
Inner Energy | | J
Enthalpy | | J
Thermodynamic Temperature | T | K
Entropy | | J∙K−1
Gibbs Free Energy | | J
Isobaric Heat Capacity | | J∙K−1
Isothermal Capacity | | J∙mol−1∙K−1
Heat Capacity Ratio | $$\gamma\ (\mathrm{or}\ k) = C_p / C_v$$ | -
Compression Factor | | -
Chemical Potential | | J∙mol−1
Standard Chemical Potential | | J∙mol−1
Standard Reaction Gibbs Free Energy | | J∙mol−1
Reaction Affinity | | J∙mol−1
Standard Reaction Enthalpy | | J∙mol−1
Standard Reaction Entropy | | J∙mol−1∙K−1
Equilibrium Constant | | -

Fig.1   Flowchart showing data calculation and database building
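The unit conversion step in this workflow can be illustrated with a short sketch. It assumes the raw values are in the units Gaussian typically prints (Hartree/particle, kcal/mol, cal/(mol K), GHz); the function names are hypothetical, and the constants are standard values rounded as shown.

```python
# Conversion constants (standard values; rounding is this sketch's choice).
HARTREE_TO_KJ_PER_MOL = 2625.4996   # 1 Hartree/particle expressed in kJ/mol
CAL_TO_J = 4.184                    # thermochemical calorie in joules
PLANCK_H = 6.62607015e-34           # Planck constant, J*s
BOLTZMANN_K = 1.380649e-23          # Boltzmann constant, J/K


def hartree_to_kj_per_mol(energy_hartree):
    """Convert an energy from Hartree/particle to kJ/mol."""
    return energy_hartree * HARTREE_TO_KJ_PER_MOL


def kcal_to_kj(value_kcal):
    """Convert kcal/mol to kJ/mol (the same factor converts cal to J)."""
    return value_kcal * CAL_TO_J


def rotational_temperature_k(rot_const_ghz):
    """Rotational temperature theta = h*B/k_B for a rotational constant B in GHz."""
    return PLANCK_H * rot_const_ghz * 1e9 / BOLTZMANN_K
```

The last function also gives a cheap consistency check between two database fields: a stored rotational temperature should match the value computed from the stored rotational constant.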
2.3   Methods of optical spectrum calculation
Spectrum calculation is performed on the converted input files of the compounds to obtain the optimized geometry, the complete frequency analysis data, and the spectral data listed in Table 2. Batch processing identifies the output files that terminated normally and converged, and the corresponding compounds are extracted. GaussSum 2.2 is then used to extract the vibrational frequencies of the compounds and the corresponding infrared absorption data. Finally, Gnuplot is used to plot all the related figures, which are packed into a compressed file named “Spectum.zip”. The results are also stored in the database of quantum chemical calculation results for online access.
Table 2   Spectral data of the compounds and their units
Spectrum parameters | Symbols and Definitions | SI units
Wavelength | λ | m
Refractive index | n | -
Frequency | ν | Hz
Angular (circular) frequency | ω = 2πν | s−1, rad∙s−1
Wavenumber | $$\tilde{\nu} = \frac{\nu}{c_0} = \frac{1}{n\lambda}$$ (in vacuum); $$\sigma = \frac{1}{\lambda}$$ (in a medium) | m−1; m−1
Planck constant | h | J∙s
Absorption ratio/factor | | -
Absorbance | | -
Transition wavenumber | | m−1
Transition frequency | | Hz
Electronic term | | m−1
Vibrational term | | m−1
Rotational term | | m−1
Rotational constants | $$\tilde{A}, \tilde{B}, \tilde{C}$$ with $$\tilde{A} = \frac{h}{8\pi^{2} c I_A}$$ (as wavenumber); $$A, B, C$$ with $$A = \frac{h}{8\pi^{2} I_A}$$ (as frequency) | m−1; Hz
Asymmetry parameter | | -
Harmonic vibration wavenumber | $$\omega_e ;\ \omega_r$$ | m−1
Molecular electric dipole moment | $$p\ (\mathrm{or}\ \mu)$$, $$E_p = -p \cdot E$$ | C∙m
Molecular magnetic dipole moment | $$m\ (\mathrm{or}\ \mu)$$, $$E_p = -m \cdot B$$ | J∙T−1
Molecular transition dipole moment | $$M\ (\mathrm{or}\ R) = \int \psi'^{*}\, p\, \psi''\, d\tau$$ | C∙m
Chemical shift | | -
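To show how a list of calculated vibrational frequencies and infrared intensities becomes a plottable absorption curve, here is a minimal sketch that applies Lorentzian broadening to a stick spectrum. It is not GaussSum's implementation: the line shape, the default width, and the function names are assumptions made for illustration, and the wavenumbers are taken in cm−1, as is customary for infrared spectra.

```python
import math

SPEED_OF_LIGHT_CM_PER_S = 2.99792458e10  # c0 in cm/s


def wavenumber_to_frequency_hz(wavenumber_cm1):
    """Frequency nu = c0 * wavenumber, for a vacuum wavenumber given in cm^-1."""
    return wavenumber_cm1 * SPEED_OF_LIGHT_CM_PER_S


def angular_frequency(frequency_hz):
    """Angular frequency omega = 2 * pi * nu."""
    return 2.0 * math.pi * frequency_hz


def lorentzian_spectrum(peaks, x_min, x_max, step=1.0, fwhm=10.0):
    """Broaden (wavenumber, intensity) sticks into a curve of (x, y) points.

    Each stick contributes intensity * hw^2 / ((x - position)^2 + hw^2),
    where hw is half of the full width at half maximum.
    """
    half_width = fwhm / 2.0
    n_points = int(round((x_max - x_min) / step)) + 1
    curve = []
    for i in range(n_points):
        x = x_min + i * step
        y = sum(
            intensity * half_width ** 2 / ((x - position) ** 2 + half_width ** 2)
            for position, intensity in peaks
        )
        curve.append((x, y))
    return curve
```

The resulting list of points can be written to a two-column text file and plotted with Gnuplot or any other plotting tool.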
2.4   Database use facilitation
The database of quantum chemical calculation results is integrated into ChemDB as a sub-library, which provides diverse means of data query and online global access. For consistency with ChemDB, the database adopts the same labels, such as ID, CASRN, InChIKey and SRN. The retrieving and restoring methods are listed in Table 3.
Table 3   Retrieving and restoring of the compound labels
Labels | Retrieving and Restoring Methods
ID | The compound ID comes from the CAS RN of the compound, subject to check code verification. Compounds whose CAS RN cannot be determined are named in the format “B+serial number”, such as “B2000166”.
CAS RN | The ID and CAS RN in the chemical database are collected from various data literature and are then subject to check code verification.
InChIKey | The InChIKey of the compound is generated using InChI Software Version 1.02, shared by the International Union of Pure and Applied Chemistry.
SRN | The SRN is generated by the compound structure login system of the compound reference library. It is a decimal integer consisting of an ontological part and a check digit (see Note).
Note: The SRN check code is generated using the Mod 11 calculation of the ISO 7064:1983 standard.
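The check code verification applied to CAS RNs can be illustrated with the publicly documented CAS check-digit rule: the digits before the check digit are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the final digit. The sketch below is an illustration rather than the database's own code; `compound_id` is a hypothetical helper that mimics the “B+serial number” fallback described in Table 3.

```python
import re

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit(body_digits):
    """Weighted digit sum (weights 1, 2, 3, ... from the right) modulo 10."""
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas_rn(cas_rn):
    """True if the string has the CAS RN layout and a matching check digit."""
    match = CAS_PATTERN.match(cas_rn)
    if not match:
        return False
    body = match.group(1) + match.group(2)
    return cas_check_digit(body) == int(match.group(3))


def compound_id(cas_rn, serial_number):
    """Hypothetical helper: use a verified CAS RN, else 'B' + serial number."""
    if cas_rn and is_valid_cas_rn(cas_rn):
        return cas_rn
    return f"B{serial_number}"
```

For benzaldehyde (100-52-7), the weighted sum is 2×1 + 5×2 + 0×3 + 0×4 + 1×5 = 17, and 17 mod 10 = 7, which matches the final digit.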
3.   Description of data samples
To date, calculations have been run on more than 200,000 compounds, yielding thermodynamic data for 25,000 compounds. The quantum chemical calculation results of about 18,000 compounds have been added to the database. As more compounds are calculated, the amount of data will continue to grow.
By analyzing the contents of the calculation results, we determined the elements of the database, which serve as the basis for designing the database structure. The index structure of the database presents the required elements and their order, as confirmed through targeted analysis. The index structure is described in Table 4.
Table 4   Index structure of the database
Sample structure record (MOL format):
OseChemX C.20110210.111413.257D
8 8 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.2124 0.7000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.4249 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.6373 0.7000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-3.6373 2.1000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-2.4249 2.8000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.2124 2.1000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.2124 0.7000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
3 4 2 0
4 5 1 0
5 6 2 0
6 7 1 0
2 7 2 0
1 8 2 0
M END

Index | Sample | Note
Compound English Name | benzaldehyde | Optional
Compound Chinese Name | benzaldehyde | Mandatory
Compound Formula | C7H6O | Optional
CAS RN | 100-52-7 | Optional
Internal Number | E-PN-001 | Mandatory
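As a consistency check between the structure record and the formula field in Table 4, the following sketch reads the atom and bond lines of the sample connection table and derives a molecular formula. It rests on several assumptions: fields are split on whitespace (a real V2000 file is fixed-width), only bond orders 1 to 3 occur, and implicit hydrogens are estimated from the typical valences in `VALENCE`. All names are hypothetical.

```python
from collections import Counter

# Typical valences used to estimate implicit hydrogens (assumption of this sketch).
VALENCE = {"C": 4, "N": 3, "O": 2}

# Counts line, atom lines and bond lines of the benzaldehyde sample in Table 4.
SAMPLE_LINES = [
    "8 8 0 0 0 0 0 0 0 0999 V2000",
    "0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "-1.2124 0.7000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "-2.4249 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "-3.6373 0.7000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "-3.6373 2.1000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "-2.4249 2.8000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "-1.2124 2.1000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0",
    "1.2124 0.7000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0",
    "1 2 1 0", "2 3 1 0", "3 4 2 0", "4 5 1 0",
    "5 6 2 0", "6 7 1 0", "2 7 2 0", "1 8 2 0",
]


def formula_from_connection_table(lines):
    """Count heavy atoms and implicit hydrogens; returns e.g. {'C': 7, 'O': 1, 'H': 6}."""
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])
    symbols = [line.split()[3] for line in lines[1:1 + n_atoms]]
    bond_order_sum = [0] * n_atoms
    for line in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order = (int(token) for token in line.split()[:3])
        bond_order_sum[first - 1] += order
        bond_order_sum[second - 1] += order
    formula = Counter(symbols)
    # Every unused valence is assumed to be filled by a hydrogen atom.
    formula["H"] += sum(
        max(VALENCE[symbol] - used, 0)
        for symbol, used in zip(symbols, bond_order_sum)
    )
    return dict(formula)
```

For the sample record this gives seven carbons, six hydrogens and one oxygen, in agreement with the formula C7H6O listed in the table.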
The table of required elements (Table 5) standardizes and structures the data fields. It serves as the basis for analyzing and designing the database structure and as a reference for program writing.
The columns of the table of required elements are as follows:
a) Element Chinese Name: Chinese name of the data field in the database.
b) Element English Name: English name of the data field in the database.
c) Data Type: numerical data are stored as int type, textual data as text type, and other data as varchar type.
d) Size: space required by the element (unit: byte). An English character or Arabic numeral takes one byte, and a Chinese character takes two bytes.
Table 5   Required elements
Element Chinese Name | Element English Name | Data Type | Size
化合物结构 | Str_compound | Image | -
化合物InChI_Key | InChI key | varchar | 27
化合物InChI码 | InChI | text | -
分子式 | Formula | varchar | 100
中文名称 | Name_CN | varchar | 100
英文名称 | Name_EN | varchar | 150
中文别名 | OthName_CN | text | -
英文别名 | OthName_EN | text | -
化合物CAS登录号 | CASRN | int | 10
内部id | ID | varchar | 25
分子量 | Mol_weight | float | -
吉布斯自由能 | G | float | -
偶极矩 | Dipole | float | -
绝对极化率 | Exact Polarizability | float | -
相对极化率 | Approx Polarizability | float | -
转动温度 | Rotational Temperatures | float | -
转动常数 | Rotational Constants | float | -
零点振动能 | Zero-point Vibrational Energy | float | -
零点校正 | Zero-point Correction | float | -
分子热能 | Thermal Energies | float | -
分子反应焓 | Thermal Enthalpies | float | -
分子反应自由能 | Thermal Free Energies | float | -
分子总动能 | Total Molecular Kinetic Energy (298.15 K) | float | -
热力学温度 | T | float | -
熵 | S | float | -
等容热容 | Cv | float | -
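A subset of the required elements in Table 5 could be expressed as a relational table along the following lines. This is only a sketch using SQLite through Python's standard library: the paper does not name the database engine, the table name `compound` and the column `InChI_key` are adapted for illustration, and storing the CAS RN as an integer without hyphens is an assumption suggested by the int type in Table 5.

```python
import sqlite3

# Subset of the Table 5 fields; names and types follow the table where possible.
SCHEMA = """
CREATE TABLE compound (
    ID          VARCHAR(25) PRIMARY KEY,
    InChI_key   VARCHAR(27),
    Formula     VARCHAR(100),
    Name_CN     VARCHAR(100),
    Name_EN     VARCHAR(150),
    CASRN       INTEGER,
    Mol_weight  FLOAT,
    Dipole      FLOAT,
    S           FLOAT,
    Cv          FLOAT
)
"""


def create_database():
    """Create an in-memory database holding the sketch schema."""
    connection = sqlite3.connect(":memory:")
    connection.execute(SCHEMA)
    return connection


def find_by_casrn(connection, casrn):
    """Return (Name_EN, Formula, Dipole) for a CAS RN given without hyphens."""
    cursor = connection.execute(
        "SELECT Name_EN, Formula, Dipole FROM compound WHERE CASRN = ?",
        (casrn,),
    )
    return cursor.fetchone()


if __name__ == "__main__":
    db = create_database()
    # Identity fields come from the Table 4 sample; property values are left
    # empty (NULL) because this sketch does not reproduce any results.
    db.execute(
        "INSERT INTO compound (ID, Formula, Name_EN, CASRN) VALUES (?, ?, ?, ?)",
        ("E-PN-001", "C7H6O", "benzaldehyde", 100527),
    )
    print(find_by_casrn(db, 100527))
```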
4.   Data quality control and assessment
The basic principle and strategy of the database construction rest on the reliability of the collected data. Mature, verified theoretical methods and calculation models are used for the quantum chemical calculation of the compound structures to ensure that reliability. During data collection, calculation results that are obviously wrong or that violate the basic laws of physical chemistry have been eliminated. Data are input, updated and processed in batches about every half year, while manual input is carried out monthly. The database of quantum chemical calculation results sets the following requirements on its data:
a) Time range: the structure calculation of each compound must converge within 100 hours.
b) Discipline scope: the compounds come mainly from chemical subject databases.
c) Data amount: about 18,000 compounds have been accumulated so far, and the amount will continue to grow as the server calculates new data sources.
d) Data accuracy: usually 5 decimal places are retained.
e) Language: English and Chinese.
f) Data type: the basic data types include text, number, picture, custom binary format, etc.
At present, the input and output data types of the database include text, numerical, image and custom binary formats. Both the data types and the values are within the normative range of the database system. The optimized results of the quantum chemical calculation are filtered and extracted, and then sorted using EXCEL. To ensure data quality, typos and type errors have been checked and eliminated, and data formats have been converted where appropriate.
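The screening described in this section might look like the following sketch. The paper does not list its exact criteria, so the specific checks here (positive entropy, positive heat capacity, non-negative zero-point energy, duplicate detection by InChIKey) and the record keys are assumptions chosen for illustration; only the rounding to 5 decimal places is taken from the text.

```python
def clean_records(records, decimals=5):
    """Drop duplicate or physically implausible records and round float values.

    Each record is a dict with the hypothetical keys 'inchi_key', 'S', 'Cv'
    and 'zpve' (zero-point vibrational energy).
    """
    seen_keys = set()
    kept = []
    for record in records:
        if record["inchi_key"] in seen_keys:
            continue  # the same structure was already accepted
        if record["S"] <= 0 or record["Cv"] <= 0 or record["zpve"] < 0:
            continue  # violates basic thermodynamic expectations
        seen_keys.add(record["inchi_key"])
        kept.append({
            key: round(value, decimals) if isinstance(value, float) else value
            for key, value in record.items()
        })
    return kept
```

Keeping the checks in one function makes it easy to tighten or extend the criteria as new kinds of faulty results are identified.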
5.   Data usage and recommendation
The quantum chemical calculation results are organized into 13 categories: Dipole Moment, Exact Polarizability, Approx Polarizability, Rotational Temperatures, Rotational Constants, Zero Point Vibrational Energy, Zero Point Correction, Thermal Energies, Thermal Enthalpies, Thermal Free Energies, Total Molecular Kinetic Energy, CV and Entropy. The database also provides infrared spectra for some of the structures. Users can query the corresponding thermodynamic calculation data by entering the CAS RN, structure, SRN, or InChIKey of a compound, which removes the need for tedious molecular structure model construction, calculation analysis, results extraction, and so on. The database thus facilitates the evaluation and analysis, structural design, and chemical and industrial application of new materials.
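A query front end that accepts several kinds of identifiers first has to decide which kind it was given. The sketch below shows one way to do this with regular expressions; it is not the database's own logic. The CAS RN and InChIKey layouts are the standard ones, while treating an SRN as a plain string of decimal digits is an assumption based on its description as a decimal integer. Structure queries are outside the scope of this sketch.

```python
import re

CAS_RN = re.compile(r"^\d{2,7}-\d{2}-\d$")
INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")
SRN = re.compile(r"^\d+$")  # assumed layout: decimal digits only


def classify_identifier(text):
    """Return 'CASRN', 'InChIKey', 'SRN' or 'unknown' for a query string."""
    text = text.strip()
    if CAS_RN.match(text):
        return "CASRN"
    if INCHIKEY.match(text):
        return "InChIKey"
    if SRN.match(text):
        return "SRN"
    return "unknown"
```

The returned label can then select the column to search, so that the user types a single query string regardless of the identifier type.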
References
1. Frisch MJ, Trucks GW, Schlegel HB et al. Gaussian 03, Revision B.03. Pittsburgh, PA: Gaussian Inc, 2003.
2. Frisch MJ, Trucks GW, Schlegel HB et al. Gaussian 09, Revision A.02. Pittsburgh, PA: Gaussian Inc, 2009.
3. Becke AD. Density-functional thermochemistry. III. The role of exact exchange. Journal of Chemical Physics 98(1993): 5648-5652.
4. Jiang L. The Chemical Data Sources Integration Research Based on Ontology of ChDR. Master’s Thesis, University of Chinese Academy of Sciences (Beijing), 2015.
Data citation
1. Han Q, Zhao Y & Wen H. A database of quantum chemical calculation results. Science Data Bank. DOI: 10.11922/sciencedb.630 (2018).
Article and author information
Han Q, Zhao Y & Wen H. Database of quantum chemical calculation results based on compounds molecular structure. China Scientific Data 3(2018). DOI: 10.11922/csdata.2018.0037.zh
Han Qingzhen (qzhan@ipe.ac.cn): database creation; calculation, analysis, update and maintenance of the quantum data. PhD, Associate Professor; research area: computational chemistry and chemical engineering.
Zhao Yuehong: database creation, operation and maintenance. PhD, Associate Professor; research area: computational chemistry and chemical engineering.
Wen Hao: database creation and development. PhD, Professor; research area: computational chemistry and chemical engineering.
National Science & Technology Infrastructure Program of China – Fundamental Science Data Sharing Platform (DKA2017-12-02-05);CAS Informatization Program of the Thirteenth Five-Year Plan – "Key Database Construction and Application Services for the Discipline of Chemistry" (XXH1350303-103)
Publication records
Published: Sept. 29, 2018 (Version EN1)
Released: Aug. 2, 2018 (Version ZH2)
Published: Sept. 29, 2018 (Version ZH3)