Chromosome-Level Genome Assembly of The Greenfin Horse-Faced Filefish (Thamnaconus
Changlin Liu1,2, Kun Liu6, Xintian Liu7, Xuming Li8, Hongju Chen8, Siqing Chen1,2, Changwei Shao1,2, Zhishu
Lin9
1. Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, China
2. Laboratory for Marine Fisheries Science and Food Production Processes, Pilot National Laboratory for
3. National Demonstration Center for Experimental Fisheries Science Education, Shanghai Collaborative
Innovation for Aquatic Animal Genetics and Breeding, Shanghai Engineering Research Center of Agriculture,
5. Guangdong Provincial Key Laboratory of Fishery Ecology and Environment, South China Sea Fisheries
Correspondence:
Siqing Chen, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, China.
Email: [email protected]
Changwei Shao, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao,
China.
Email: [email protected]
Zhishu Lin, Qingdao Municipal Ocean Technology Achievement Promotion Center, Qingdao, China.
Email: [email protected]
This article has been accepted for publication and undergone full peer review but has not been
through the copyediting, typesetting, pagination and proofreading process, which may lead to
differences between this version and the Version of Record. Please cite this article as doi:
10.1111/1755-0998.13183
Abstract
The greenfin horse-faced filefish, Thamnaconus septentrionalis, is a valuable commercial
fish species that is widely distributed in the Indo-West Pacific Ocean. It has characteristic
blue-green fins, rough skin and a spine-like first dorsal fin. T. septentrionalis is of
conservation concern because its population has declined sharply, and it is also an important
marine aquaculture species in China. However, genomic resources for the filefish are lacking, and
no reference genome has been released. In this study, the first chromosome-level genome of T.
septentrionalis was constructed using nanopore sequencing and Hi-C technology. A total of 50.95
Gb of polished nanopore sequences were generated and assembled into a 474.31 Mb genome,
accounting for 96.45% of the estimated genome size of this filefish. The assembled genome
contained only 242 contigs, with a contig N50 of 22.46 Mb, a high level of contiguity relative
to other sequenced fish genomes. Hi-C scaffolding of the genome resulted in 20
pseudochromosomes containing 99.44% of the total assembled sequence. The genome contained
67.35 Mb of repeat sequences, accounting for 14.2% of the assembly. A total of 22,067
protein-coding genes were predicted, 94.82% of which were successfully annotated with putative
functions. Furthermore, a phylogenetic tree was constructed using 1,872 single-copy gene families,
and 67 unique gene families were identified in the filefish genome. This high-quality genome
assembly will be a valuable resource for future genomic, conservation and breeding
studies of T. septentrionalis.
Key words
Filefish, genome assembly, Oxford Nanopore sequencing, Hi-C
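The assembly statistics quoted in the abstract can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' pipeline: the `n50` helper and the toy contig lengths are hypothetical, and the 491.74 Mb estimated genome size is taken from the table footnote later in this document. Because the published inputs are rounded, the recomputed coverage may differ from the reported 96.45% in the last decimal place.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the total assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical, not from the filefish assembly).
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 40, because 50 + 40 = 90 >= 150 / 2

# Figures reported in this document, in Mb.
assembly_mb = 474.31          # assembled genome size
estimated_genome_mb = 491.74  # estimated genome size (table footnote)
repeats_mb = 67.35            # total repeat sequence

coverage_pct = 100 * assembly_mb / estimated_genome_mb
repeat_pct = 100 * repeats_mb / assembly_mb
print(f"assembly / estimated genome size: {coverage_pct:.2f}%")
print(f"repeat content: {repeat_pct:.1f}%")
```

A contig N50 of 22.46 Mb therefore means that contigs of at least 22.46 Mb account for at least half of the 474.31 Mb assembly.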
Acknowledgements
We thank Tianyuan Fisheries Co., Ltd., which provided the filefish samples.
This work was supported by the fund of the Key Laboratory of Open-Sea Fishery Development, Ministry
Data Accessibility
This Whole Genome Shotgun project has been deposited at DDBJ/ENA/GenBank under the
accession WFAA00000000. The version described in this paper is version WFAA01000000. Raw
sequencing reads and the genome assembly are available at GenBank under BioProject PRJNA565600.
Raw sequencing data (Nanopore, Illumina, Hi-C and RNA-seq) have been deposited in the SRA
(Sequence Read Archive) database as SRX6875837, SRX6862879, SRX6875660 and
SRX6875519, respectively. The genome assembly has been deposited in the assembly database as
GCA_009823395.1. Gene annotation, the transcriptome assembly and a high-quality genome-wide
Hi-C heatmap of the filefish have been deposited in FigShare,
doi:10.6084/m9.figshare.11874825.v1
(https://figshare.com/articles/Thamnaconus_septentrionalis_genome_supporting_data/11874825).
Author Contributions
S.C., C.S. and Z.L. designed and managed the project. L.B., F.L. and J.G. interpreted the data
and drafted the manuscript. P.W., S.Z., C.L. and X.L. prepared the materials. Q.C., J.L., K.L. and
H.C. performed the DNA extraction, RNA extraction and library construction. L.B., F.L., X.L.
and C.S. performed the bioinformatic analysis. All authors contributed to the final manuscript
editing.
† The coverage was calculated using an estimated genome size of 491.74 Mb.
Species T. septentrionalis Takifugu rubripes† Takifugu flavidus Tetraodon nigroviridis Mola mola
Sequencing technology Oxford Nanopore PacBio Sequel PacBio Sequel Plasmid library + BAC library Illumina Hiseq 2000
sequencing sequencing
† The assembly statistics of the other tetraodontiform genomes were obtained from the NCBI Assembly database. The GenBank assembly accession numbers are as follows:
Takifugu rubripes (GCA_901000725.2), Takifugu flavidus (GCA_003711565.2), Tetraodon nigroviridis (GCA_000180735.1), Mola mola (GCA_001698575.1).
‡ Numbers shown are percentages of complete BUSCO genes covered by the assemblies.
§ Numbers shown are percentages of the 458 core eukaryotic genes present in the assemblies.
Group 1 3 34,805,468
Group 2 3 34,142,503
Group 3 3 29,239,029
Group 4 13 27,092,115
Group 5 3 24,789,104
Group 6 7 24,144,372
Group 7 10 23,815,151
Group 8 3 23,107,901
Group 9 11 22,985,309
Group 10 5 23,048,615
Group 11 2 22,982,431
Group 12 6 23,025,906
Group 13 3 22,547,364
Group 14 11 22,005,842
Group 15 16 20,921,416
Group 16 3 20,603,809
Group 17 2 19,738,352
Group 18 5 17,694,734
Group 19 13 18,094,054
Group 20 25 16,862,837
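The per-group figures above can be summed to check the anchoring rate reported in the abstract. This is a minimal sketch under two assumptions: the middle column is read as the number of contigs in each group and the last column as the group length in base pairs (the column headers did not survive in this copy), and the denominator is the rounded 474.31 Mb assembly size from the abstract.

```python
# (assumed contig count, length in bp) for each Hi-C group, copied from the table.
groups = {
    1: (3, 34_805_468), 2: (3, 34_142_503), 3: (3, 29_239_029),
    4: (13, 27_092_115), 5: (3, 24_789_104), 6: (7, 24_144_372),
    7: (10, 23_815_151), 8: (3, 23_107_901), 9: (11, 22_985_309),
    10: (5, 23_048_615), 11: (2, 22_982_431), 12: (6, 23_025_906),
    13: (3, 22_547_364), 14: (11, 22_005_842), 15: (16, 20_921_416),
    16: (3, 20_603_809), 17: (2, 19_738_352), 18: (5, 17_694_734),
    19: (13, 18_094_054), 20: (25, 16_862_837),
}

total_contigs = sum(count for count, _ in groups.values())
total_bp = sum(length for _, length in groups.values())

# Rounded assembly size from the abstract (474.31 Mb), so the ratio is approximate.
assembly_bp = 474.31e6
anchored_pct = 100 * total_bp / assembly_bp

print(f"groups: {len(groups)}")
print(f"contigs placed (assumed column meaning): {total_contigs}")
print(f"anchored length: {total_bp:,} bp ({anchored_pct:.2f}% of assembly)")
```

The 20 groups sum to 471,646,312 bp, which is consistent with the 99.44% of assembled sequence reported as anchored to pseudochromosomes.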
Genscan 28,628
Augustus 44,749
GeneID 24,446
SNAP 58,914
PASA 30,768
TransDecoder 78,130
† The GenBank accession numbers were as follows: Takifugu rubripes (GCA_000180615.2), Tetraodon
FIGURE 2 The genome-wide Hi-C heatmap of the filefish. LG 1-20 are the abbreviations of Lachesis
the outgroup. The calibration times were L. oculatus-L. crocea (315∼322 Mya), L. maculatus-L. crocea
(95∼100 Mya) and T. flavidus-T. rubripes (4.22∼4.70 Mya). The estimated species divergence times (million
years ago) and their 95% confidence intervals are labeled at each branch node.