Keywords
Genome assembly, liftover, remap, copy number segment.
This article is included in the Python collection.
Genome assembly, liftover, remap, copy number segment.
The first draft version of human genome was published in 20011. In subsequent years, several new editions were released to perfect the quality of the genome assembly. The current version of human genome (GRCh38, UCSC hg38) was made available in 2013, with the latest revision (Grch38.p12) still containing more than 10 million unplaced bases (see NCBI website). Over the years, large numbers of genomic studies have been performed, generating data mapped to different versions of the reference genome. However, when performing genome analyses integrating data from multiple resources, it is imperative to convert all data to the same genomic coordinate system.
Two general methodologies are used for conversion between coordinates from different genome assemblies. The first approach is to re-align the original sequence data to the target assembly. This method could provide the best result but is very time consuming, and is not possible when the original sequence data is not available or does not consist of direct sequences (i.e. segmentation of array based data). Another approach is to convert the coordinates of genome data between assemblies by using a mapping file. This method, although bearing a side effect of minor information loss, for most applications provides a good balance between performance and accuracy.
Currently, three tools are in widespread use for the conversion between genome assemblies by coordinates: liftOver from University of California, Santa Cruz (referred as UCSC liftOver in the following article)2; CrossMap from Zhao3; and Remap from NCBI4. The UCSC liftOver tool exists in two flavours, both as web service and command line utility. It offers the most comprehensive selection of assemblies for different organisms with the capability to convert between many of them. CrossMap has the unique functionality to convert files in BAM/SAM or BigWig format. It generates almost identical results as UCSC liftOver, but is not optimised for converting genome coordinates between species. Remap provides for each organism a comprehensive list of major assemblies and the corresponding sub-versions. It can also perform cross species mapping, however, with only a limited number of organisms.
All those tools are efficient in coordinate conversion and provide almost identical results. However, as shown in Figure 1a, challenges arise when dealing with genome segments that are not continuous anymore in the target assembly; there, these three tools take on different strategies. CrossMap and UCSC liftOver break the segment into smaller segments and map them to different locations. Remap keeps the integrity of the segment and maps the span to the target assembly.
(a) When lifting a segment to another assembly, the landscape of the segment may be affected by indels and copy number variations, but the overall span of the segment does not change significantly. (b) When the end positions cannot be converted by the UCSC liftOver, the nearby regions will be searched for convertible positions as approximation. (c) Quality control checks for changes of chromosome or size to make sure the segment is converted properly.
In research such analysis of copy number variation (CNV) data, where the quantitative representation of a genomic range takes precedence over base-specific representation, the integrity of a continuous segment indicates the proper conversion between assemblies, but may not be a direct outcome of current re-mapping approaches. Although Remap can convert contiguous segments, it only provides web service with submission limits, which is difficult to use for large scale or pipelined applications. The limitation to single input files is a general limitation of those tools, which precludes their direct use in comparative studies which may require to work with tensor hundreds of thousands of files, as indicated through our own projects and requests from the research community.
In this article,we introduce segment_liftover, a tool to perform an integrity-preserving conversion of genomic segments data between genome assemblies. It features two major functional additions over existing tools: First, re-conversion by locus approximation, in instances where a precise conversion of genomic positions fails; and second, the capability to handle any number of files and optional integration into data processing pipelines with supporting features such as automatic file traversal, interruption resumption and detailed logging.
segment_liftover can convert both probe files and segment files at the same time or in separate runs. It starts from a structured directory or a list of files, then traverses and converts all files meeting the specified name pattern, and finally outputs to a designated directory. To convert a probe file, segment_liftover will first use the UCSC liftOver to convert the file, then apply an approximate conversion on probes that the UCSC liftOver failed to convert. To convert a segment file, segment_liftover will use the UCSC liftOver to convert the start and end positions of segments. A successful segment conversion needs to satisfy the following four criteria:
positionnew_start ≠ ∅ (1)
positionnew_end ≠ ∅ (2)
chromosomenew_start = chromosomenew_end (3)
Where β controls the threshold of the length ratio and the default value is set to 2. If criteria (1) or (2) fails, segment_liftover will apply an approximate conversion; if the conversion still fails, it is reported as unconvertible. If criteria (3) or (4) fails, the conversion is reported as rejected (Figure 1c). The reason of failure is recorded in log files.
When a position cannot be converted by the UCSC liftOver, segment_liftover will attempt an approximate conversion and try to find a convertible position in the adjacency (Figure 1b). The range and the resolution of the search is defined by parameters –range and –step_size, respectively.
The segment_liftover tool is implemented in Python. The package is available for both Linux and OSX. It requires a minimum of 2G memory and the capacity of running Python 3. We recommend an installation using pip in a Python virtual environment. segment_liftover requires and depends on the UCSC liftOver program. A chain file, which provides alignments from source to target assembly, is also required. The chain files between common human assemblies (hg18, hg19 and hg38) are included in the program package. Chain files of other species and assemblies are available from the USCS Genome Browser. Figure 2 illustrates the work-flow of segment_liftover.
(1) It can take either a folder or a file containing the list of files as the input. (2) It will try to convert by approximation when UCSC liftOver fails to convert a coordinate. (3) The directory structure will be kept in the output folder and detailed log files are also available.
In this section, we provide two examples of using segment_liftover to convert probes and segments, respectively. The two examples are part of the pipeline which updates the arrayMap database, a reference resource of somatic genome copy number variations in cancer5, from human genome assembly hg19 to hg38. In the first example, we converted 44,632 probe files and 44,471 segment files from hg19 to hg38. The probe data were generated from nine Affymetrix genotyping array platforms, which currently only support annotations for hg19. Circular binary segmentation (CBS) analysis (DNAcopy R-package) was used to infer copy number segments from log2 values of probes. The final segment files contain a list of genomic regions separated by their copy number values6. We ran the segment_liftover tool on a 12-core, 128GB RAM machine with 8 parallel processes. It took 42 hours to convert 44,632 probe files with 5.5 billion probe positions and 40 minutes to convert 44,471 segment files with 4.8 million segments.
Overall, more than 99.99% of probes and more than 99% of segments could be directly converted from hg19 to hg38 (Figure 3). As conversion of segments is more complicated and involves a quality control procedure to ensure the meaningfulness of the segment, it is expected to have a higher number of unconvertible segments than probes. As shown in Figure 4, the unconvertible regions are mainly around telomeres, centromeres, or other gene-sparse locations. In total, 38 genetic elements were found to be affected by the conversion (description in Supplementary Table 1).
Directly converted is the sum of successful conversion from UCSC liftOver; approximately converted is the sum of successful re-conversion in locus adjacency; converted but rejected is the sum of all rejections from quality control; unconvertible is the sum of everything that is not converted.
In the second example, we compared the performance of different conversion strategies using 1,000 samples randomly drawn from the first example (Table 1). Copy number segments were generated using four different strategies (Figure 5): (1) segments in hg19 were generated from probes in hg19 using the aforementioned standard pipeline; (2) segments in hg38 were generated with probes converted from hg19 to hg38 using the standard pipeline; (3) segments converted from hg19 to hg38 with approximate conversion; (4) segments converted from hg19 to hg38 without approximate conversion.
Table 2 shows the comparison of segments conversion in average between (3), (4) and (2) (complete table in Supplementary Table 2). Exact segment matches are categorized as "perfect"; "minor difference" is defined the same as condition (3) of quality control; the rest of the segments are categorized as "significant difference"; "sum" is the total number of segments of a sample in average. By comparing the sum of reference hg19, reference hg38 and approximation, it shows that the conversion result is very close to the result of the standard pipeline. The difference between converted and generated sums is much smaller than the difference between two generated sums of different genome versions. On average, the approximate conversion could rescue one additional segment per file. Finally, we zoom into a specific example on chromosome 9 in GSM276858 (illustrated in Figure 5). Because of the removal of probes from hg19 to hg38, the second segment will be lost without approximate conversion.
Segments directly processed from hg38 probes are used as reference (top right). hg19 segments converted with approximation (middle right) and without approximation (bottom right) are used for comparison.
perfect | minor difference | significant difference | sum | |
---|---|---|---|---|
reference hg19 | na | na | na | 218.42 |
reference hg38 | na | na | na | 215.04 |
approximation | 198.18 | 6.24 | 10.25 | 214.67 |
no approximation | 198.18 | 5.23 | 10.01 | 213.42 |
The two examples above demonstrated the efficiency and effectiveness of segment_liftover in processing large number of probe and segment files. It can provide conversion results that are similar to results generated from the standard pipeline. Moreover, with approximate conversion, the number of properly converted segments is slightly increased. In general, segment_liftover is able to provide reliable conversions and the ease of use.
Translation between genome versions of sequencing data is a tedious but crucial task in bioinformatics. With the functionalities of automated batching, approximate conversion and segment conversion, segment_liftover can dramatically reduce the complexity and workload of such data processing. Furthermore, segment_liftover’s detailed logs of execution result provide an easy and clear foundation for follow up analysis.
1. pip version: https://pypi.python.org/pypi/segment-liftover
2. Latest source code: https://github.com/baudisgroup/segment-liftover
3. Archived source code as at time of publication: https://dx.doi.org/10.5281/zenodo.11868037
4. Software license: MIT
BG is recipient of a grant from the China Scholarship Council (CSC 201606620032). QH is supported through the University of Zurich’s "CanDoc" program.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We thank Paula Carrio Cordo and the participants of the "Zurich Seminars in Bioinformatics" for helpful discussions.
Supplementary Table 1 : genetic elements affected by the conversion The 38 genetic elements that are affected by the segment_liftover conversion from hg19 to hg38.
Click here to access the data.
Supplementary Table 2 : complete conversion results of different strategies A complete list of conversion results of comparing different conversion strategies using 1000 random samples.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 2 (revision) 08 Jun 18 |
read | |
Version 1 14 Mar 18 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)