Tetrahymena thermophila Functional Genomics Database


Yang Wentao1,2, Wang Guangying1,2, Tian Miao1,2, Yuan Dongxia1, Miao Wei1, Zeng Honghui1, Xiong Jie1*

1. Key Laboratory of Biodiversity and Conservation of Hydrobios, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, P. R. China;

2. University of Chinese Academy of Sciences, Beijing 100049, P. R. China

*Email: xiongjie@ihb.ac.cn

Abstract: Tetrahymena thermophila is a unicellular eukaryote and a widely studied model organism: it has a well-characterized genetic background, is easy to culture in the laboratory, and can be manipulated with well-developed molecular techniques. High-throughput technologies and bioinformatics methods have greatly advanced functional genomics research in Tetrahymena. Using DNA microarray data, RNA-seq data and gene network analysis covering the three major physiological and developmental stages of Tetrahymena thermophila (growth, starvation and conjugation), we constructed the Tetrahymena thermophila Functional Genomics Database, which provides an important resource for Tetrahymena research.

Keywords: Tetrahymena thermophila; functional genomics; DNA microarray; RNA-seq; gene network

Database Profile

Chinese title: 嗜热四膜虫功能基因组数据库

English title: Tetrahymena thermophila Functional Genomics Database (TetraFGD)

Corresponding author: Xiong Jie (xiongjie@ihb.ac.cn)

Data author(s): Yang Wentao, Wang Guangying, Tian Miao, Yuan Dongxia, Miao Wei, Zeng Honghui, Xiong Jie

Data publishing entity: Institute of Hydrobiology, Chinese Academy of Sciences

Data format: SQL, BAM

Data volume: 6.12 GB

Data service system: http://tfgd.ihb.ac.cn, http://www.sciencedb.cn/dataSet/handle/10

Source(s) of funding: Special Fiscal Funds for Informatization of Chinese Academy of Sciences; “Integration and sharing project of scientific data resources” (XXH12504-3-14)

Database composition: The database consists of (1) RNA-seq data from the logarithmic growth phase, 3 hours of starvation, 15 hours of starvation, 2 hours after conjugation and 8 hours after conjugation; (2) DNA microarray datasets covering 3 physiological stages (growth, starvation and conjugation) at 20 time points; and (3) gene network datasets based on the results of the DNA microarray analysis.

1. Introduction

Tetrahymena thermophila is a unicellular eukaryote belonging to the ciliates (Ciliata), a group of protozoa. It is a widely studied model organism because it has a well-characterized genetic background, is easy to culture in the laboratory, and can be manipulated with well-developed molecular techniques. Like most ciliates, Tetrahymena thermophila has two nuclei, a macronucleus and a micronucleus. The micronucleus, the germline nucleus, is diploid with 5 pairs of chromosomes. Like the germ cells of multicellular organisms, it does not express its genes during the vegetative phase. The macronucleus, also known as the vegetative nucleus, is polyploid, is generated from the micronucleus, and contains about 225 transcriptionally active chromosomes[1]. Although the macronuclear genome of Tetrahymena thermophila was sequenced in 2006[1] and a corresponding database was established, functional genomics data for Tetrahymena thermophila remained relatively scarce.

The DNA microarray platform and high-throughput sequencing technology[2,3] provided new opportunities for functional genomics research in Tetrahymena. The Tetrahymena thermophila DNA microarray makes it possible to study gene expression under different physiological conditions at the whole-genome level. High-throughput RNA-seq, on the one hand, yields more comprehensive and detailed transcription information for a given physiological condition and, on the other hand, greatly improves the accuracy of gene model prediction.

In this study, we generated RNA-seq data for 6 Tetrahymena samples from the three major physiological and developmental stages (growth, starvation and conjugation) and DNA microarray data at 20 time points. Gene network data were then built from the expression profiles of the genes. By integrating all of these data, we constructed the Tetrahymena functional genomics database.

2. Data collection and processing

The Tetrahymena thermophila Functional Genomics Database can be accessed at http://tfgd.ihb.ac.cn and http://www.sciencedb.cn/dataSet/handle/10. It comprises 3 main datasets: RNA-seq, DNA microarray and gene network. The procedure of data collection and processing for this database is shown in Figure 1.

Figure 1 The flow chart of dataset generation and analysis

2.1 Sample preparation

Sample preparation mainly includes two parts:

(1) Cell culture and sample collection

Tetrahymena thermophila wild-type cell lines B2086 and CU428 were provided by Dr. P. J. Bruns. Both cell lines have an inbred B-type genetic background, similar to that of the SB210 cell line whose macronuclear genome sequence was used for microarray probe design. For growth, the cells were cultured in 1× SPP medium in a thermostatic shaker at 30 °C and 150 rpm. For starvation, the cells were placed in 10 mM Tris solution (pH 7.5). For conjugation, two cell lines of different mating types were starved for 18 hours and then mixed at a 1:1 ratio to a final cell density of 2×10⁵ cells/ml[2].

For the microarray analysis of growing cells, we sampled CU428 cultures at three densities, labeled low (L-l), medium (L-m) and high (L-h). L-l (about 1×10⁵ cells/ml) and L-m (about 3.5×10⁵ cells/ml) both correspond to the logarithmic growth phase under these culture conditions, whereas L-h (about 1×10⁶ cells/ml) corresponds to the stationary phase. For starved cells, we collected CU428 cells at a density of approximately 2×10⁵ cells/ml, washed them and transferred them to 10 mM Tris (pH 7.5). Cells were collected at 0, 3, 6, 9, 12, 15 and 24 hours of starvation and labeled S-0, S-3, S-6, S-9, S-12, S-15 and S-24, respectively. For conjugating cells, we starved CU428 and B2086 cells in Tris buffer for 18 hours and then mixed them at a density of 2×10⁵ cells/ml each. Cells were collected at 0, 2, 4, 6, 8, 10, 12, 14, 16 and 18 hours after mixing and labeled C-0, C-2, C-4, C-6, C-8, C-10, C-12, C-14, C-16 and C-18, respectively[2].

For RNA-seq analysis, the SB4217 (mating type Ⅴ) and SB4220 (mating type Ⅵ) cell lines were used. For the growth condition, SB4220 cells in the logarithmic growth phase (3.5×10⁶ cells/ml) were harvested. For the starvation condition, SB4217 cells starved for 3 hours and SB4220 cells starved for 3 hours and for 15 hours were collected. For the conjugation condition, the two cell lines were mixed at a 1:1 ratio and harvested 2 hours and 8 hours after mixing[3].

(2) RNA extraction

For both RNA-seq and microarray analysis, RNA was extracted with an RNA isolation kit from Qiagen according to the manufacturer’s instructions.

2.2 RNA-seq data collection and processing

RNA-seq data collection and processing mainly include two steps:

(1) Library preparation and sequencing

Poly(A)-tailed mRNA was isolated using Dynal magnetic beads and fragmented by heating to 94 °C. First-strand cDNA was synthesized from the mRNA template using random hexamer primers and reverse transcriptase. DNA polymerase and random hexamer primers were then added to synthesize double-stranded DNA from the cDNA template. The ends of the double-stranded DNA were repaired, an adenine nucleotide was added to the 3′ end, and the Illumina adapter was ligated. DNA fragments of 200-250 bp were isolated by gel electrophoresis, and the library was amplified by PCR using Phusion polymerase. The sequencing library was denatured with NaOH, diluted in hybridization buffer and loaded onto a single lane of an Illumina GA flow cell. Cluster formation, primer hybridization and paired-end sequencing were carried out with the reagents and methods recommended by Illumina[3].

(2) Data analysis

The Firecrest, Bustard and GERALD programs (Illumina) were used to analyze the sequencing images. Low-quality bases (sequencing quality value < 5) at the ends of reads were trimmed. The Tetrahymena thermophila macronuclear genome sequence, which comprises 1148 scaffolds, can be downloaded from http://ciliate.org/index.php/home/downloads. Sequencing reads were mapped to the reference genome with TopHat 1.1.4 (https://ccb.jhu.edu/software/tophat/index.shtml) to locate exon-exon junctions, using the parameters -i 10, -I 10000, --coverage-search, --microexon-search and -m 2. Reads that could not be mapped were aligned to the reference genome with the NUCmer program of MUMmer 3.0 (http://sourceforge.net/projects/mummer/files; parameters -c 25, -l 15, -g 10000) to find the exon-exon junctions that span two or more. Transcripts were assembled with Cufflinks version 0.9.2 (http://cole-trapnell-lab.github.io/cufflinks) using the parameters -I 10000 and --min-intron-length 10. Assembled transcripts were compared with the predicted gene structures using Cuffcompare from the Cufflinks package to verify and refine the gene models. Open reading frames in newly discovered transcripts were searched with the getorf program of EMBOSS (http://emboss.sourceforge.net/download) to find novel genes[3].
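The mapping and assembly settings above can be written down as command lines. The following is only a sketch that assembles the argument lists without running anything; it assumes the TopHat options -i/-I (minimum/maximum intron length) and the Cufflinks options -I and --min-intron-length as documented in their manuals, and every file and directory name is a hypothetical placeholder.

```python
# Sketch only: builds the command lines, does not execute them.
# All paths (index prefix, FASTQ files, output directories) are hypothetical.

def tophat_command(index_prefix, reads_1, reads_2, out_dir):
    """Argument list for paired-end mapping with the intron limits quoted in the text."""
    return [
        "tophat",
        "-i", "10",             # minimum intron length
        "-I", "10000",          # maximum intron length
        "--coverage-search",
        "--microexon-search",
        "-m", "2",              # mismatches allowed in the splice-site anchor
        "-o", out_dir,
        index_prefix, reads_1, reads_2,
    ]


def cufflinks_command(alignments, out_dir):
    """Argument list for transcript assembly with the same intron limits."""
    return [
        "cufflinks",
        "-I", "10000",
        "--min-intron-length", "10",
        "-o", out_dir,
        alignments,
    ]


if __name__ == "__main__":
    print(" ".join(tophat_command("tt_mac_index", "growth_1.fq", "growth_2.fq", "tophat_growth")))
    print(" ".join(cufflinks_command("tophat_growth/accepted_hits.bam", "cufflinks_growth")))
```

Keeping the commands as argument lists makes it easy to log the exact settings used for each sample before passing them to a process runner.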

For gene expression analysis of the RNA-seq data, the expression level of a gene is given as RPKM: the number of reads mapped to the exons of the gene per kilobase of exon length per million mapped reads. The number of reads for each transcript was counted with HTSeq (http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html)[3].
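As a worked illustration of the RPKM definition, the sketch below applies the usual formula RPKM = 10^9 × C / (N × L), where C is the number of reads mapped to the exons of a gene, N the total number of mapped reads and L the total exon length in base pairs. The function name and the example numbers are illustrative and are not taken from the database.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # Illustrative numbers: 500 reads on 2,000 bp of exon in a 10-million-read library.
    print(rpkm(500, 2000, 10_000_000))
```

Because the value is scaled by both gene length and library size, RPKM values can be compared between genes and between samples sequenced to different depths.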

2.3 DNA microarray data collection and processing

DNA microarray data collection and processing mainly include three steps:

(1) Sample labeling and probe design

cDNA synthesis and Cy3 labeling followed the NimbleGen system (Roche): double-stranded cDNA was synthesized from equal amounts of total RNA from each sample using the SuperScript II cDNA kit (Invitrogen, Carlsbad, CA) and labeled with the Cy3 fluorescent dye[2].

Based on the SB210 genome sequence and annotation from the J. Craig Venter Institute (http://www.tigr.org/tdb/e2k1/ttg), we constructed a whole-genome high-density oligonucleotide DNA microarray comprising 28,064 Tetrahymena thermophila sequences, including the 27,055 protein-coding genes predicted in the 2006 annotation. For each protein-coding sequence, the NimbleGen system (Roche) was used to design 13 or 14 oligonucleotide probes of 60 nt, and no mismatches with the target sequence were allowed[2].

(2) Microarray synthesis and hybridization

The Tetrahymena thermophila whole-genome oligonucleotide DNA microarray was synthesized by maskless photolithography with the NimbleGen system (Roche). The microarrays were hybridized with cDNA from samples under the different conditions, also using the NimbleGen system. Samples from the growth and starvation conditions were each analyzed in three replicates, and samples from the conjugation condition in two replicates[2].

(3) Data extraction and analysis

We scanned the hybridized arrays with the GenePix 4000B microarray scanner of the NimbleGen system (Roche) and extracted the scanning results with NimbleScan software. Expression values for each protein-coding gene were normalized by the RMA (Robust Multi-array Average) method[2].
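RMA combines background correction, quantile normalization across arrays and a robust summarization of the probes of each gene. The sketch below illustrates only the quantile normalization step on a toy example, to show how intensity distributions are made identical across arrays. It is not the NimbleScan implementation, and the function name and data are illustrative.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (one list of probe intensities per array).

    Every array ends up with the same set of values (the mean of the sorted
    intensities at each rank), while the rank order of probes within each
    array is preserved. Ties are broken by position.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [
        sum(a[rank] for a in sorted_arrays) / len(arrays) for rank in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])  # probe indices, low to high
        out = [0.0] * n_probes
        for rank, probe_index in enumerate(order):
            out[probe_index] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    toy = [[5, 2, 3, 4], [4, 1, 4, 2], [3, 4, 6, 8]]  # three arrays, four probes each
    for column in quantile_normalize(toy):
        print(column)
```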

2.4 Gene network data collection and processing

Gene network construction mainly includes two steps:

(1) Microarray data and gene filtering

The gene network was constructed from the genome-wide gene expression microarray datasets of Tetrahymena thermophila for the three major physiological stages (growth, starvation and conjugation), comprising 67 single-channel NimbleGen microarrays in total[4].

To remove low-information genes, we filtered on between-sample variation and minimal expression signal. A gene was removed only if it met both of the following criteria[4]:

(a) The difference between the highest and lowest expression values of the gene across the samples (Exp_highest-lowest) is less than the median of this difference over all genes (median Exp_highest-lowest).

(b) The average expression signal of the gene across the samples is less than the median of the average expression signals of all genes.

After filtering by these criteria, a total of 15,091 genes were used for gene network construction[4].
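The two filtering criteria can be sketched in a few lines. This is a minimal illustration on a toy matrix, assuming one row of expression values per gene and one column per sample; the function name is hypothetical and the code is not the authors' script.

```python
from statistics import mean, median


def low_information_genes(expression):
    """Return a list of booleans, True for genes that meet both removal criteria.

    expression: list of rows, one row of expression values per gene.
    """
    spreads = [max(row) - min(row) for row in expression]   # Exp_highest-lowest per gene
    averages = [mean(row) for row in expression]            # average signal per gene
    spread_cutoff = median(spreads)
    average_cutoff = median(averages)
    return [
        spread < spread_cutoff and average < average_cutoff  # criteria (a) and (b)
        for spread, average in zip(spreads, averages)
    ]


if __name__ == "__main__":
    toy = [
        [1.0, 1.1, 1.2],     # flat and weak: removed
        [1.0, 5.0, 9.0],     # variable: kept
        [10.0, 10.1, 10.2],  # flat but strong: kept
        [8.0, 12.0, 16.0],   # variable and strong: kept
    ]
    print(low_information_genes(toy))
```

Requiring both criteria means that a gene is kept if it is either clearly variable across samples or clearly expressed, so only flat, weakly expressed genes are discarded.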

(2) Network construction

The gene expression values of the 15,901 genes were log2-transformed, and the CLR algorithm (http://omictools.com/clr-s2342.html) was then used to construct the gene network[4-5].
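The scoring step of CLR (context likelihood of relatedness) can be sketched as follows. The sketch assumes that a symmetric matrix of pairwise mutual information between genes has already been estimated from the log2-transformed expression values (that estimation is not shown), and it follows the commonly described form of the algorithm: each pairwise value is converted to a z-score against the background distribution of each of the two genes, negative z-scores are set to zero, and the two are combined as the square root of the sum of squares. Details such as whether the diagonal is included in the background vary between implementations, and the function name is illustrative.

```python
import math


def clr_scores(mi):
    """Combined CLR z-scores from a symmetric mutual-information matrix (list of rows)."""
    n = len(mi)
    z = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row = mi[i]
        row_mean = sum(row) / n
        row_sd = math.sqrt(sum((value - row_mean) ** 2 for value in row) / n)
        if row_sd == 0:
            continue  # a constant row carries no signal; its z-scores stay at zero
        for j in range(n):
            # z-score of pair (i, j) against the background of gene i, floored at zero
            z[i][j] = max(0.0, (row[j] - row_mean) / row_sd)
    # combine the two one-sided z-scores of each pair
    return [[math.sqrt(z[i][j] ** 2 + z[j][i] ** 2) for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    toy_mi = [
        [0.0, 0.9, 0.1],
        [0.9, 0.0, 0.2],
        [0.1, 0.2, 0.0],
    ]
    for row in clr_scores(toy_mi):
        print([round(value, 3) for value in row])
```

Scoring each pair against the background of both genes is what lets CLR down-weight genes that show moderately high similarity to almost everything.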

3. Sample description

3.1 Database structure

According to the Tetrahymena thermophila genome and its annotation information, we constructed a gene-based relational database, which is used for searching the expression value of genes at different times and their interaction with other genes based on the gene names.

The database consists of three parts:

(1) RNA-seq database. This database contains the expression value of each gene under different conditions: logarithmic growth phase, 3 hours of starvation, 15 hours of starvation, 2 hours post-conjugation and 8 hours post-conjugation.

(2) DNA microarray database. This database contains the gene expression values of each gene in three major physiological conditions at a total of 20 time points, which include three different growth densities, 0, 3, 6, 9, 12, 15, 24 hours starvation and 0, 2, 4, 6, 8, 10, 12, 14, 16, 18 hours post-conjugation.

(3) Gene network database. This database contains the interaction relationship between the genes constructed based on gene expression level.

For ease of use, all three databases can be searched directly by gene name.

3.2 Data sample

The RNA-seq database contains a large amount of data. To illustrate the characteristics of the dataset, we searched the RNA-seq database with the gene TTHERM_00360310 as an example. The search shows where the reads map in the genome under the different conditions and returns the gene expression value (RPKM), as shown in Figure 2.

Figure 2 The GBrowse visualization of RNA-seq data using TTHERM_00360310 as an example

The DNA microarray data describe the expression values of protein-coding genes at the different time points of the three major physiological stages. The result of searching the microarray database with the same gene, TTHERM_00360310, is shown in Figure 3. The red line and the blue line represent expression values obtained with two different normalization methods.

Figure 3 Gene expression profiles generated by searching Microarray data using TTHERM_00360310 as an example

The gene network data describe the interaction relationships between genes. The result of searching the gene network database with the gene TTHERM_00360310 is shown in Table 1. The first two columns list the two genes of each interacting pair (the query gene may appear in either column), and the third column lists the Z-score obtained with the CLR algorithm. A higher Z-score indicates a higher likelihood of interaction between the two genes.

Table 1 Gene network information obtained by searching gene network database using TTHERM_00360310 as an example

Gene_ID            Gene_ID            Z-score
TTHERM_00360310    TTHERM_01332060    14.51
TTHERM_00360280    TTHERM_00360310    13.73
TTHERM_00360310    TTHERM_00360320    13.45
TTHERM_00360300    TTHERM_00360310    12.92
TTHERM_00360310    TTHERM_00752040    11.45
TTHERM_00360310    TTHERM_01153620    11.37
TTHERM_00360250    TTHERM_00360310    11.35
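Because the query gene can appear in either gene column, a lookup of its partners has to check both. The sketch below works on the rows of Table 1 held in memory; it illustrates the lookup logic only and does not reflect the actual schema of the database, which is not described here. The names EDGES and partners are hypothetical.

```python
# Rows of Table 1: (gene, gene, CLR Z-score)
EDGES = [
    ("TTHERM_00360310", "TTHERM_01332060", 14.51),
    ("TTHERM_00360280", "TTHERM_00360310", 13.73),
    ("TTHERM_00360310", "TTHERM_00360320", 13.45),
    ("TTHERM_00360300", "TTHERM_00360310", 12.92),
    ("TTHERM_00360310", "TTHERM_00752040", 11.45),
    ("TTHERM_00360310", "TTHERM_01153620", 11.37),
    ("TTHERM_00360250", "TTHERM_00360310", 11.35),
]


def partners(gene, edges):
    """Return (partner, z_score) pairs for a gene, strongest link first."""
    hits = []
    for gene_a, gene_b, z_score in edges:
        if gene_a == gene:
            hits.append((gene_b, z_score))
        elif gene_b == gene:
            hits.append((gene_a, z_score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    for partner, z_score in partners("TTHERM_00360310", EDGES):
        print(partner, z_score)
```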

4. Quality control and assessment

For sample preparation, the concentration of the extracted RNA was measured with a NanoDrop ND-1000 spectrophotometer, and the A260/A280 ratio was required to be in the range of 1.8 to 2.1. To ensure RNA quality, we routinely checked RNA integrity with a Bioanalyzer1000[2].

For the original reads obtained from RNA-seq, FASTX-Toolkit software (http://hannonlab.cshl.edu/fastx_toolkit) was applied to control the data quality.

In DNA microarray hybridization, nonspecific binding between probes and target sequences during sample preparation, microarray synthesis and processing makes background noise a major concern. To measure the background hybridization level, we designed 4,308 randomly generated oligonucleotide probes whose length and GC content were similar to those of the Tetrahymena thermophila microarray probes. After the background hybridization signal intensity was determined, the final hybridization signal intensity of each Tetrahymena thermophila microarray probe was obtained by subtracting the background value. The hybridization signals were normalized by the RMA (Robust Multi-array Average) method, and the expression values of each gene were obtained at the different time points of the three major physiological stages, giving an expression profile of Tetrahymena at the whole-genome level[2].

In gene network construction, it is necessary to assess the gene networks built with different methods. We constructed gene networks with three methods: the CLR (context likelihood of relatedness) algorithm, PCC (Pearson correlation coefficient) and SCC (Spearman correlation coefficient). For the PCC and SCC methods, the threshold is set on the correlation coefficient, whereas for the CLR method the threshold is set on the Z-score[4].

To determine the optimal method for gene network construction, we used yeast protein complex data to validate and assess the constructed gene networks. The yeast protein complex data can be downloaded from http://www.inetbio.org/yeastnet/downloadnetwork.php. This website provides up-to-date yeast protein complexes, both experimentally validated and computationally identified. The interactions within the four largest protein complexes in yeast (the large ribosomal subunit, the small ribosomal subunit, the 20S proteasome core particle and the 19S proteasome regulatory particle) were taken as the standard for evaluating the three network construction methods. The evaluation uses three parameters: accuracy (p), coverage (r) and the overall effect (F-score).

Accuracy (p) is the percentage of predicted interactions that are correct, that is, the number of correctly predicted interactions divided by the total number of predicted interactions.

Coverage (r) is the percentage of all correct interactions in the standard that are recovered by the network constructed with each method, that is, the number of correctly predicted interactions divided by the total number of interactions in the standard.

The overall effect (F-score) takes both the accuracy (p) and the coverage (r) into account.

The p, r and F-score values were calculated at different correlation coefficient thresholds (PCC and SCC methods) or Z-score thresholds (CLR method), and the overall performance of the CLR method was found to be the best[4].
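The three evaluation parameters can be illustrated with a short sketch. It assumes the usual definitions, p = TP / (TP + FP) and r = TP / (TP + FN), and takes the F-score to be the harmonic mean F = 2pr / (p + r); the harmonic-mean form is an assumption made for this illustration. Gene pairs are treated as unordered, and the function name and toy pairs are hypothetical.

```python
def evaluate_network(predicted_pairs, standard_pairs):
    """Return (p, r, F) for predicted interactions against a reference standard."""
    predicted = {frozenset(pair) for pair in predicted_pairs}  # unordered gene pairs
    standard = {frozenset(pair) for pair in standard_pairs}
    true_positives = len(predicted & standard)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(standard) if standard else 0.0
    f_score = 2 * p * r / (p + r) if (p + r) > 0 else 0.0  # assumed harmonic mean
    return p, r, f_score


if __name__ == "__main__":
    predicted = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
    standard = [("b", "a"), ("c", "b"), ("a", "c")]
    print(evaluate_network(predicted, standard))
```

Repeating this calculation over a range of thresholds for each method gives the curves from which the best-performing method can be chosen.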

5. Usage notes

This database uses MySQL for data storage. Its contents can be viewed directly by searching with keywords such as the gene ID or transcript ID. In addition, the database integrates an online BLAST search tool, which makes it possible to find the corresponding transcript information in the Tetrahymena thermophila RNA-seq database by sequence alignment.

Acknowledgments

We thank Professor Martin Gorovsky of the University of Rochester and Dr. Eileen Hamilton of the University of California, Santa Barbara for their valuable suggestions on DNA microarray design.

References

[1]  Eisen JA, Coyne RS, Wu M et al. Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biol. 4 (2006): e286.

[2]  Miao W, Xiong J, Bowen J et al. Microarray Analyses of Gene Expression during the Tetrahymena thermophila Life Cycle. PLoS ONE 4 (2009): e4429.

[3]  Xiong J, Lu X, Zhou Z et al. Transcriptome Analysis of the Model Protozoan, Tetrahymena thermophila, Using Deep RNA Sequencing. PLoS ONE 7 (2012): e30630.

[4]  Xiong J, Yuan D, Fillingham J et al. Gene Network Landscape of the Ciliate Tetrahymena thermophila. PLoS ONE 6 (2011): e20124.

[5]  Faith J, Hayete B, Thaden J et al. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol. 5 (2007): 54–66.

Data citation

1. Yang W, Wang G, Tian M et al. Tetrahymena thermophila functional genomic database. Science Data Bank, DOI:10.11922/sciencedb.180.10

Authors and contributions

Yang Wentao, Master; research area: genomics. Contribution: data analysis and data management of Tetrahymena genomics.

Wang Guangying, PhD; research area: genomics. Contribution: data analysis and data management of Tetrahymena genomics.

Tian Miao, PhD; research area: molecular biology. Contribution: data synthesis of Tetrahymena genomics.

Yuan Dongxia, Master; research area: molecular biology. Contribution: data generation of Tetrahymena genomics.

Miao Wei, PhD, Principal Investigator; research area: protozoology. Contribution: data management of Tetrahymena genomics.

Zeng Honghui, Master, Senior Engineer; research area: database and software development, bioinformatics. Contribution: database construction, maintenance and management.

Xiong Jie, PhD, Research Assistant; research area: genomics. Contribution: data collection, analysis and integration of Tetrahymena genomics.

 

How to cite this article: Yang W, Wang G, Tian M et al. Tetrahymena thermophila Functional Genomic Database. China Scientific Data 2 (2016). DOI: 10.11922/csdata.180.2015.0011
