Assessing the gene space in draft genomes






Genis Parra, Keith Bradnam, Zemin Ning, Thomas Keane, and Ian Korf


SUPPLEMENTARY MATERIAL


Supplemental Table S1. Comparison of the numbers of mapped CEGs and N50 contig/scaffold lengths in the simulated draft human genomes, and the original assemblies on which the simulations were based.


Values of “CEGs mapped original” give the number of the 248 CEGs that were mapped in various genome assemblies of four non-human vertebrate species. The contig and scaffold lengths of these assemblies were used to build simulated draft genomes by selecting equivalent lengths of sequence from the human genome sequence. The results of mapping the 248 CEGs in these simulated draft genomes are shown in the “CEGs mapped draft” columns. The average transcript lengths for the original species are 35 ± 68 Kb for guinea pig (C. porcellus), 30 ± 58 Kb for cow (B. taurus), 42 ± 90 Kb for macaque (M. mulatta) and 49 ± 112 Kb for chimpanzee (P. troglodytes). Some discrepancies are expected in genomes where transcripts are shorter than human transcripts (41 ± 100 Kb). Cow is the clearest example: its transcripts are 24% shorter on average, and we find on average 17.8 more genes in the original assemblies than in the simulated drafts. For chimpanzee, however, we find only 3.3 more genes in the original genome than in the simulated genome. Thus, the results obtained in the simulated drafts appear to be consistent with the data observed in the original genomes.



| Coverage | Species | Contig N50 | CEGs mapped, original (contigs) | CEGs mapped, draft (contigs) | Scaffold N50 | CEGs mapped, original (scaffolds) | CEGs mapped, draft (scaffolds) |
|---|---|---|---|---|---|---|---|
| 1.9x | C. porcellus | 3,107 | 60 | 52 | 51,922 | 114 | 105 |
| 3x | B. taurus | 4,185 | 112 | 88 | 13,515 | 174 | 125 |
| 4.2x | P. troglodytes | 13,099 | 149 | 142 | 2,425,220 | 227 | 225 |
| 5.3x | M. mulatta | 14,727 | 164 | 148 | 692,751 | 238 | 228 |
| 6x | B. taurus | 19,095 | 165 | 149 | 436,806 | 236 | 212 |
| 6.6x | P. troglodytes | 28,844 | 187 | 179 | 8,217,119 | 240 | 238 |
| 7.1x | B. taurus | 44,278 | 228 | 198 | 1,042,974 | 243 | 230 |
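
Table S1 reports N50 lengths without restating how the statistic is computed. As a reminder, the sketch below follows the usual definition: sort the sequences from longest to shortest and report the length at which the running total first reaches half of the total assembly length. The function name and the toy lengths are illustrative and are not taken from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running total to half of the
        # assembly length defines the N50.
        if running >= half_total:
            return length


# Toy example (hypothetical lengths): total = 300, half = 150.
toy_contigs = [100, 80, 50, 30, 20, 10, 10]
print(n50(toy_contigs))
```

Applied to the contig or scaffold lengths of an assembly, the same function would yield values comparable to the N50 columns above.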

Supplemental Table S2. Gene structure properties of the 248 CEGs and the six original genomes. Lengths of CDSs and primary transcripts (CDS + introns) are given for the complete set of genes and for the CEGs. Each length is followed by its standard deviation.




| Species | Genome size (Gb) | All genes: average gene size (bp) | All genes: average transcript size (bp) | CEGs: average gene size (bp) | CEGs: average transcript size (bp) |
|---|---|---|---|---|---|
| A. thaliana | 0.146 | 1,250 ± 738 | 2,036 ± 1,273 | 1,217 ± 841 | 2,329 ± 1,548 |
| C. elegans | 0.100 | 1,243 ± 1,209 | 3,195 ± 4,125 | 1,201 ± 831 | 2,554 ± 2,895 |
| D. melanogaster | 0.180 | 1,729 ± 1,579 | 5,201 ± 11,071 | 1,210 ± 850 | 1,861 ± 1,643 |
| H. sapiens | 3.253 | 1,696 ± 1,607 | 40,767 ± 100,652 | 1,259 ± 844 | 22,597 ± 25,138 |
| S. cerevisiae | 0.012 | 1,636 ± 1,250 | 1,649 ± 1,244 | 1,232 ± 849 | 1,261 ± 833 |
| S. pombe | 0.014 | 1,569 ± 1,248 | 1,644 ± 1,240 | 1,201 ± 831 | 1,309 ± 822 |



Supplemental Table S3. The high copy number of pseudogenes in vertebrate genomes may present problems when mapping CEGs and we may be classifying some pseudogenes as orthologs of CEGs. To test this we used information from the Yale Pseudogene Database (http://www.pseudogene.org), which contains information on 14,383 duplicated and 11,769 processed pseudogenes in the human genome (ncbi36 build). We first identified how much overlap there was between the set of human core gene predictions (orthologs + homologs) and the pseudogenes. A CEG was considered a pseudogene if at least 20% of its nucleotides overlapped with any of the annotated pseudogenes. Mapped CEGs in each of the six simulated genomes and published genome sequence (ncbi36) are listed along with the number of those mapped CEGs which correspond to known pseudogenes. Results are shown for partial and complete cutoff (see Methods section for more details).





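| Prediction cutoff | Assembly | Original number of mapped CEGs | Number of CEGs that correspond to known pseudogenes | % of pseudogenes predicted as CEGs |
|---|---|---|---|---|
| Partial | draft 1.9x | 98 | 7 | 5.9 |
| Partial | draft 3x | 146 | 7 | 4.6 |
| Partial | draft 4.2x | 208 | 3 | 1.5 |
| Partial | draft 5.3x | 214 | 5 | 2.3 |
| Partial | draft 6x | 197 | 1 | 0.5 |
| Partial | draft 6.6x | 229 | 5 | 2.1 |
| Partial | ncbi36 | 248 | 0 | 0 |
| Complete | draft 1.9x | 52 | 4 | 7.6 |
| Complete | draft 3x | 88 | 6 | 6.8 |
| Complete | draft 4.2x | 142 | 4 | 2.8 |
| Complete | draft 5.3x | 148 | 4 | 2.7 |
| Complete | draft 6x | 149 | 0 | 0 |
| Complete | draft 6.6x | 179 | 7 | 3.9 |
| Complete | ncbi36 | 248 | 0 | 0 |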
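
The 20% rule described in the Table S3 legend can be sketched as follows. This is a minimal illustration rather than the authors' code. It assumes that each CEG prediction and each pseudogene is a single half-open interval (start, end) on the same sequence, and it reads "any of the annotated pseudogenes" as overlap with an individual pseudogene rather than with the union of all pseudogenes. The function names and coordinates are hypothetical.

```python
def overlap_length(a, b):
    """Number of positions shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def is_pseudogene_hit(ceg, pseudogenes, min_fraction=0.20):
    """True if one pseudogene covers at least min_fraction of the CEG."""
    ceg_length = ceg[1] - ceg[0]
    if ceg_length <= 0:
        raise ValueError("CEG interval must have positive length")
    return any(
        overlap_length(ceg, pg) / ceg_length >= min_fraction
        for pg in pseudogenes
    )


# Hypothetical coordinates, for illustration only.
pseudogenes = [(900, 1100), (5000, 5600)]
print(is_pseudogene_hit((1000, 2000), pseudogenes))  # 100 of 1,000 nt overlap
print(is_pseudogene_hit((5500, 5900), pseudogenes))  # 100 of 400 nt overlap
```

The first prediction overlaps a pseudogene over 10% of its length and is kept as a CEG; the second overlaps over 25% and would be counted as a pseudogene under this reading of the rule.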


Supplemental Table S4. CEGs grouped by conservation. Group 1 represents the least conserved of all 248 CEGs, with the degree of conservation increasing in subsequent groups through to group 4. Lower and upper limits refer to the average identity from pairwise alignments of all proteins within each CEG.


| Conservation group | Number of CEGs | Lower limit | Upper limit | Average within-group pairwise identity | Average CDS length (bp) |
|---|---|---|---|---|---|
| 1 | 66 | 30.4% | 42.9% | 38.6% ± 2.9% | 425 ± 239 |
| 2 | 56 | 43% | 49.9% | 46.8% ± 2.1% | 413 ± 286 |
| 3 | 61 | 50% | 56.9% | 53.1% ± 1.9% | 448 ± 297 |
| 4 | 65 | 57% | 90.8% | 65.3% ± 5.9% | 381 ± 317 |
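
The group limits in Table S4 can be applied to the average pairwise identity of a CEG as in the sketch below. The limits are copied from the table. The function name is hypothetical, and treating each lower limit as inclusive (so that a value such as 42.95% falls in group 1) is an assumption, since the table does not say how boundary values were handled.

```python
# (group, lower limit, upper limit) in percent identity, from Table S4.
GROUP_LIMITS = [
    (1, 30.4, 42.9),
    (2, 43.0, 49.9),
    (3, 50.0, 56.9),
    (4, 57.0, 90.8),
]


def conservation_group(identity):
    """Return the conservation group (1-4) for an average identity in percent."""
    if not GROUP_LIMITS[0][1] <= identity <= GROUP_LIMITS[-1][2]:
        raise ValueError("identity outside the range covered by Table S4")
    # Walk from the most conserved group down; the first lower limit that
    # the identity reaches determines the group.
    for group, lower, _upper in reversed(GROUP_LIMITS):
        if identity >= lower:
            return group


# The within-group averages from Table S4 fall in their own groups.
for value in (38.6, 46.8, 53.1, 65.3):
    print(value, conservation_group(value))
```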



Supplemental Table S5. Proportion of CEGs in each conservation group that were mapped in the C. briggsae, T. gondii and H. sapiens contig assemblies (based on partially mapped genes; see Methods). The total number of partially mapped genes is also shown; as expected, genes that are only partially mapped account for a larger share of the mapped CEGs in low-coverage assemblies. Among the CEGs grouped by conservation, only T. gondii shows a notable increase in the proportion of mapped CEGs in the most conserved group (G4) compared to the least conserved group (G1).


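Genome sizes and gene counts: C. briggsae, 108 Mb, 19,296 genes; T. gondii, 63 Mb, 7,793 genes; H. sapiens, 3,253 Mb, 23,713 genes.

| Species | Assembly details | Mapped CEGs | Partially mapped CEGs | Map % G1 | Map % G2 | Map % G3 | Map % G4 |
|---|---|---|---|---|---|---|---|
| C. briggsae | 2x | 110 (44.3%) | 175 (70.6%) | 72.7 | 64.3 | 81.9 | 63.0 |
| C. briggsae | 4x | 200 (80.6%) | 234 (94.4%) | 93.8 | 91.1 | 96.7 | 96.9 |
| C. briggsae | 6x | 227 (91.5%) | 242 (97.6%) | 98.5 | 94.6 | 98.4 | 100 |
| C. briggsae | 8x | 231 (93.1%) | 241 (97.2%) | 96.9 | 98.2 | 98.4 | 96.9 |
| C. briggsae | 10x | 243 (97.9%) | 246 (99.2%) | 98.5 | 100 | 98.4 | 100 |
| T. gondii | 0.7x | 10 (4.0%) | 22 (8.8%) | 0 | 12.5 | 11.4 | 12.3 |
| T. gondii | 1x | 19 (7.6%) | 38 (15.3%) | 4.5 | 17.8 | 16.4 | 23.1 |
| T. gondii | 2x | 82 (33.0%) | 104 (41.9%) | 21.1 | 44.6 | 49.2 | 52.3 |
| T. gondii | 4x | 163 (65.3%) | 174 (70.2%) | 48.5 | 73.2 | 75.5 | 84.6 |
| T. gondii | 6x | 199 (80.2%) | 202 (81.4%) | 62.1 | 78.6 | 90.1 | 95.5 |
| T. gondii | 10x | 207 (83.5%) | 212 (85.5%) | 66.7 | 82.1 | 93.4 | 100 |
| H. sapiens | draft 1.9x | 52 (21.0%) | 98 (39.5%) | 39.4 | 35.7 | 47.5 | 35.4 |
| H. sapiens | draft 3x | 88 (35.5%) | 146 (58.9%) | 51.5 | 75.0 | 57.3 | 53.8 |
| H. sapiens | draft 4.2x | 142 (57.2%) | 208 (83.9%) | 83.3 | 85.7 | 85.2 | 81.5 |
| H. sapiens | draft 5.3x | 148 (59.7%) | 214 (86.3%) | 83.3 | 92.8 | 88.5 | 81.5 |
| H. sapiens | draft 6x | 149 (60.1%) | 197 (79.4%) | 71.2 | 80.3 | 85.2 | 81.5 |
| H. sapiens | draft 6.6x | 179 (72.2%) | 229 (92.3%) | 89.4 | 94.6 | 98.3 | 87.6 |
| H. sapiens | draft 7.1x | 198 (79.8%) | 226 (91.9%) | 87.8 | 96.4 | 90.1 | 90.7 |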


Supplemental Table S6. Sources of genome data. When no assembly or release name was available, the date on which the data were downloaded is listed. When multiple assemblies are listed for a species, the last one is the latest available and was used for most of this work (except for C. intestinalis, for which the v1.95 assembly was used).


| Species | Assembly/Release | Download Site |
|---|---|---|
| A. gambiae | AgamP3 | ftp://ftp.ensembl.org/pub/current_anopheles_gambiae |
| A. mellifera | Amel 4.0 | ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Amellifera/fasta/ |
| A. thaliana | TAIR6 | ftp://ftp.arabidopsis.org/home/tair/Sequences/ |
| B. taurus | v1.0, v2.0, v3.1 | ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/fasta/ |
| C. brenneri | v4.0 | ftp://genome.wustl.edu/pub/organism/Invertebrates |
| C. briggsae | Cb1 & Cb2 | ftp://ftp.sanger.ac.uk/pub/C.briggsae/ |
| C. elegans | WS160 | ftp://ftp.wormbase.org/pub/wormbase/data_freezes |
| C. remanei | v1.0 | ftp://genome.wustl.edu/pub/organism/Invertebrates |
| C. familiaris | CanFam 2.0 | ftp://ftp.broad.mit.edu/pub/assemblies/mammals/ |
| C. intestinalis | v1.00, v1.95 & v2.00 | http://genome.jgi-psf.org/ciona4/ciona4.info.html |
| C. porcellus | cavPor2 | ftp://ftp.broad.mit.edu/pub/assemblies/mammals/guineaPig |
| C. reinhardtii | v3.1 | http://genome.jgi-psf.org/Chlre3/Chlre3.download.ftp.html |
| D. melanogaster | Release 5 | http://www.fruitfly.org/sequence/ |
| F. catus | FelCat3 | http://www.broad.mit.edu/ftp/pub/assemblies/mammals/cat/ |
| G. gallus | v2.1 | ftp://ftp.ensembl.org/pub/gallus_gallus |
| G. lamblia | giardia14 | http://gmod.mbl.edu/perl/site/giardia |
| H. sapiens | ncbi 36 | ftp://ftp.ensembl.org/pub/current_human/ |
| L. africana | LoxAfr 1.0 | ftp://ftp.broad.mit.edu/pub/assemblies/mammals/ |
| M. mulatta | v1.0 | ftp://genome.wustl.edu/pub/organism/Primates/Macaca_mulatta |
| M. grisea | v5 | http://www.broad.mit.edu/annotation/genome/magnaporthe_grisea/Downloads.html |
| N. crassa | v7 | http://www.broad.mit.edu/annotation/genome/neurospora/Downloads.html |
| O. anatinus | v5.0 | ftp://genome.wustl.edu/pub/organism/Other_Vertebrates |
| O. sativa | Build 4.0 | http://rgp.dna.affrc.go.jp/E/IRGSP/Build4/build4.html |
| P. falciparum | v5.2 | http://www.plasmodb.org/download/release-5.2/ |
| P. trichocarpa | v1.0 | ftp://ftp.jgi-psf.org/pub/JGI_data/Poplar/assembly/v1.0/ |
| P. troglodytes | v1.1 & v2.1 | ftp://genome.wustl.edu/pub/organism/Primates/ |
| S. cerevisiae | Sep 2006 | ftp://genome-ftp.stanford.edu/pub/yeast/ |
| S. pombe | Sep 2006 | ftp://ftp.sanger.ac.uk/pub/yeast/pombe |
| T. gondii | v0.7, v1.0, v2.0, v2.1, v2.2 & v3.0 | http://v3-0.toxodb.org/restricted/data/Genome/nuc/ |
| T. rubripes | v4 | http://fugu.biology.qmul.ac.uk/Download/ |
| T. spiralis | v1 | ftp://genome.wustl.edu/pub/organism/Invertebrates/ |
| X. tropicalis | v4.1 | ftp://ftp.jgi-psf.org/pub/JGI_data/Xenopus_tropicalis |



Supplemental Figure S1. Histograms showing the distribution of lengths of known genes that are detected in assemblies with varying levels of sequence coverage. Annotations for C. briggsae (19,296 CDSs) and H. sapiens (23,713 CDSs) were mapped to various assemblies, and for each CDS we recorded what fraction of its length was present in either contigs or scaffolds. Note that in some low-coverage assemblies, we fail to find any fragment of some CDSs. The right-hand graph in each pair of graphs is a zoomed-in view. Scaffold panels are shown below the contig panels.



[Supplemental Figure S1 panels. Contig data: C. briggsae, H. sapiens. Scaffold data: C. briggsae, H. sapiens.]
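
The quantity plotted in Figure S1, the fraction of each CDS found in an assembly, can be sketched as follows. The legend does not describe how the fraction was computed, so this is only one plausible reading: it assumes that the matched regions of a CDS are available as half-open (start, end) segments in CDS coordinates and that overlapping segments are counted once. The function name and the example segments are hypothetical.

```python
def covered_fraction(cds_length, segments):
    """Fraction of a CDS covered by the union of half-open (start, end) segments."""
    if cds_length <= 0:
        raise ValueError("cds_length must be positive")
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(segments):
        start = max(start, current_end, 0)  # skip positions already counted
        end = min(end, cds_length)          # clip to the CDS
        if end > start:
            covered += end - start
            current_end = end
    return covered / cds_length


# A hypothetical 1,000 bp CDS matched by three segments, two of which overlap.
print(covered_fraction(1000, [(0, 300), (200, 500), (800, 900)]))
# A CDS with no matching fragment has a fraction of zero.
print(covered_fraction(1000, []))
```

Binning these fractions over all annotated CDSs would give a histogram of the kind the legend describes.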






















Supplemental Figure S2. Mapping of core genes and all genes in C. briggsae, H. sapiens, and T. gondii. X-axis shows percentage of 248 CEGs that were successfully mapped; Y-axis shows percentage of 23,713 (H. sapiens), 19,296 (C. briggsae), or 7,793 (T. gondii) genes contained in contigs from the genomes. Diagonal line shows line of unity. Each data point is labeled with the sequence coverage of its respective assembly. The 10x T. gondii data point represents mapping against scaffold sequences rather than contigs as no contig data was available for this assembly.




