NGS Data Analysis

Uploaded by lucylit0666

Next-generation sequencing (NGS) generates massive amounts of raw data
that must be analyzed systematically to ensure accurate, reliable
results. The initial steps are handling FASTQ files, performing a
quality check, and pre-processing the data for downstream analysis.

1. FASTQ Files

What are FASTQ Files?

• FASTQ is a standard file format for storing raw sequence data
generated from NGS platforms (e.g., Illumina, Oxford Nanopore).
• It combines both nucleotide sequence data and quality scores in a
single file.

Structure of a FASTQ File:

Each sequence entry in a FASTQ file consists of 4 lines:

1. Sequence Identifier: Starts with @ followed by a unique sequence
identifier.
2. Sequence: The actual nucleotide sequence (A, T, G, C, N).
3. Plus (+) Line: A + symbol, optionally followed by a repeat of the
sequence identifier.
4. Quality Scores: ASCII-encoded quality scores, one character per
base in the sequence (typically Phred+33 encoding).

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT
+
!''*((((***+))%%%++)(%%%%).1***-+*
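
The four-line layout can be read with a few lines of Python. The sketch below is illustrative only: it assumes an uncompressed file with exactly four lines per record and Phred+33 quality encoding, and the names FastqRecord, read_fastq and phred_scores are hypothetical helpers, not part of any tool listed in these notes.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a strict four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the reader
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near " + repr(header))
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ in " + repr(header))
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


# A single record whose quality string has one character per base.
example = (
    "@SEQ_ID\n"
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGT\n"
    "+\n"
    "!''*((((***+))%%%++)(%%%%).1***-+*\n"
)
records = list(read_fastq(io.StringIO(example)))
first_scores = phred_scores(records[0].quality)
```

The length check matters in practice: a record whose quality string is longer or shorter than its sequence is invalid, and most downstream tools will reject it.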

Key Tools for Handling FASTQ Files:

• FastQC: Quality control checks.
• seqtk: Lightweight toolkit for FASTQ file manipulation.
• fastp: FASTQ pre-processing tool.

2. Quality Check (QC)

Why is Quality Check Important?

• Assesses the accuracy of raw sequencing data before analysis begins.
• Identifies poor-quality reads, adapter contamination, and other
sequencing artifacts.
• Prevents downstream errors in alignment, variant calling, or
assembly.

Key Metrics in Quality Control:

1. Per-base Sequence Quality: Quality scores across each nucleotide
position.
2. Per-sequence Quality Scores: Overall quality distribution of all
reads.
3. Adapter Content: Detects adapter sequences that may still be
present in reads.
4. GC Content: Checks that the GC distribution matches what is
expected for the organism; deviations can indicate contamination or
bias.
5. Read Length Distribution: Consistency in read lengths across
samples.
6. Duplicated Reads: Measures the duplication level; a high level can
indicate PCR duplicates.
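
Several of these metrics are simple to compute by hand, which helps when interpreting a QC report. The following is a minimal sketch, assuming Phred+33 quality strings and reads held in memory as (sequence, quality) pairs; the function names are hypothetical, and the duplication measure is a plain exact-match count, not the estimate FastQC reports.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Read = Tuple[str, str]  # (sequence, quality string)


def per_base_mean_quality(reads: Sequence[Read], offset: int = 33) -> List[float]:
    """Mean Phred score at each read position, over all reads that reach it."""
    totals: List[int] = []
    counts: List[int] = []
    for _, quality in reads:
        for position, char in enumerate(quality):
            if position == len(totals):  # first read long enough to reach this position
                totals.append(0)
                counts.append(0)
            totals[position] += ord(char) - offset
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


def gc_fraction(sequence: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def length_distribution(reads: Sequence[Read]) -> Dict[int, int]:
    """Map read length to the number of reads with that length."""
    return dict(Counter(len(sequence) for sequence, _ in reads))


def duplication_rate(reads: Sequence[Read]) -> float:
    """Fraction of reads whose exact sequence was already seen."""
    if not reads:
        return 0.0
    unique = len({sequence for sequence, _ in reads})
    return 1 - unique / len(reads)


sample_reads = [("GGCC", "IIII"), ("GGCC", "5555"), ("AT", "!!")]
```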

Quality Control Tools:

• FastQC: Comprehensive quality assessment.
• MultiQC: Aggregates multiple FastQC reports into a single summary.
• Trim Galore!: Wrapper around Cutadapt and FastQC that combines
adapter and quality trimming with QC.

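Example Output from FastQC:

FastQC flags each analysis module with one of three statuses:

• Green (pass): Good quality.
• Orange (warn): Warning; worth a closer look.
• Red (fail): Poor quality (requires intervention).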
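
The same three statuses can be read programmatically. The sketch below assumes the tab-separated summary.txt layout that FastQC writes alongside its HTML report (status, module name, file name on each line); treat that layout as an assumption to confirm against your own output. The helper names and the sample file name are hypothetical.

```python
from typing import Dict, List

# Mapping from status keyword to the colour shown in the report.
STATUS_COLOURS = {"PASS": "green", "WARN": "orange", "FAIL": "red"}


def parse_summary(text: str) -> Dict[str, str]:
    """Map each module name to its status, given summary.txt contents."""
    statuses: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        status, module, _filename = line.split("\t", 2)
        statuses[module] = status
    return statuses


def modules_needing_attention(statuses: Dict[str, str]) -> List[str]:
    """Return modules flagged as anything other than PASS, sorted by name."""
    return sorted(module for module, status in statuses.items() if status != "PASS")


# Hypothetical summary contents for a file called sample.fastq.
sample_summary = (
    "PASS\tBasic Statistics\tsample.fastq\n"
    "FAIL\tPer base sequence quality\tsample.fastq\n"
    "WARN\tAdapter Content\tsample.fastq\n"
)
summary = parse_summary(sample_summary)
flagged = modules_needing_attention(summary)
```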

3. Pre-processing

What is Pre-processing?

Pre-processing involves cleaning and preparing raw sequencing data for
downstream analysis. It includes:

1. Adapter Trimming
2. Quality Filtering
3. Read Trimming and Cropping
4. Removal of Contaminants
5. De-duplication

Key Steps in Pre-processing:

1. Adapter Trimming:

• Adapters are short sequences added during library preparation.
• Residual adapter sequences can interfere with alignment and
analysis.
• Tools:
o Cutadapt
o Trimmomatic
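
To make the idea concrete, here is a deliberately simplified sketch of 3' adapter trimming. It only finds exact matches (a full adapter anywhere in the read, or a partial adapter at the read's end), whereas tools such as Cutadapt and Trimmomatic also tolerate mismatches. The function name, the min_overlap parameter and the adapter value used in the example are illustrative assumptions; use the adapter sequence that matches your library kit.

```python
from typing import Tuple


def trim_adapter(sequence: str, quality: str, adapter: str,
                 min_overlap: int = 3) -> Tuple[str, str]:
    """Remove an exact 3' adapter match and everything after it."""
    cut = sequence.find(adapter)  # full adapter anywhere in the read
    if cut == -1:
        # Otherwise look for the longest adapter prefix that ends the read.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, quality  # no adapter found, read unchanged
    # Trim the quality string at the same position to keep lengths equal.
    return sequence[:cut], quality[:cut]


ADAPTER = "AGATCGGAAGAGC"  # example value only; depends on the library kit

full = trim_adapter("ACGTACGT" + ADAPTER + "TT", "I" * 23, ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
clean = trim_adapter("ACGTACGT", "I" * 8, ADAPTER)
```

The min_overlap threshold is a trade-off: a small value removes more partial adapters but also clips genuine bases that happen to match the start of the adapter.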

2. Quality Filtering:

• Removes reads with poor-quality scores.
• Filters based on:
o Minimum Phred score (e.g., Q30, an estimated error rate of 1 in
1,000 bases)
o Minimum read length (e.g., >50 bp)
• Tools:
o fastp
o PRINSEQ
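
A Phred score Q corresponds to an error probability of 10^(-Q/10), which is why Q30 means roughly one error per 1,000 bases. The sketch below applies the two filters named above using a mean-quality criterion. That criterion is a simplifying assumption: real tools each define their own rules, so thresholds here are not equivalent to any tool's settings. Function names and defaults are hypothetical, and Phred+33 encoding is assumed.

```python
def error_probability(phred: float) -> float:
    """Convert a Phred score to the estimated probability of a base-call error."""
    return 10 ** (-phred / 10)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a quality string (0.0 if empty)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def passes_filter(sequence: str, quality: str,
                  min_mean_quality: float = 30.0, min_length: int = 50) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough."""
    # Note: this keeps reads of exactly min_length bases.
    return len(sequence) >= min_length and mean_phred(quality) >= min_mean_quality


good = passes_filter("A" * 60, "I" * 60)       # Q40 throughout, 60 bp
low_quality = passes_filter("A" * 60, "5" * 60)  # Q20 throughout
too_short = passes_filter("A" * 40, "I" * 40)    # only 40 bp
```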

3. Read Trimming and Cropping:

• Trims poor-quality bases from the ends of reads.
• Crops reads to a specific length if required.
• Tools:
o Sickle
o Trim Galore!
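
Both operations can be sketched in a few lines. This version removes bases from each end until it meets one at or above the threshold, which is simpler than the sliding-window approach some trimmers use, so results will differ from those tools. Phred+33 encoding is assumed, and the function names and the Q20 default are illustrative choices.

```python
from typing import Tuple


def trim_low_quality_ends(sequence: str, quality: str, min_quality: int = 20,
                          offset: int = 33) -> Tuple[str, str]:
    """Strip bases below min_quality from both ends of a read."""
    start, end = 0, len(sequence)
    while start < end and ord(quality[start]) - offset < min_quality:
        start += 1
    while end > start and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[start:end], quality[start:end]


def crop(sequence: str, quality: str, length: int) -> Tuple[str, str]:
    """Keep only the first `length` bases of a read."""
    return sequence[:length], quality[:length]


# Scores for "!!IIII#!" are 0, 0, 40, 40, 40, 40, 2, 0.
trimmed = trim_low_quality_ends("ACGTACGT", "!!IIII#!")
cropped = crop("ACGTACGT", "IIIIIIII", 5)
```

A read that is low quality throughout comes back empty, so trimming is normally followed by a minimum-length filter.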

4. Removal of Contaminants:

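• Identifies and removes reads originating from non-target sources
(e.g., host genomes, bacterial contamination).
• Tools:
o Bowtie2 (align reads to a host or contaminant reference and keep
the reads that do not align)
o Kraken2 (classify reads taxonomically and discard unwanted taxa)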
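
The principle behind contaminant screening can be shown with a toy k-mer lookup: a read is flagged when enough of its k-mers also occur in a known contaminant sequence. This is only a conceptual sketch under strong assumptions (exact k-mers, forward strand only, everything held in memory); it does not reproduce how Bowtie2 or Kraken2 work, and all names, the k value and the threshold are hypothetical.

```python
from typing import Iterable, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """All distinct substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def build_index(contaminants: Iterable[str], k: int) -> Set[str]:
    """Union of the k-mers found in every contaminant sequence."""
    index: Set[str] = set()
    for sequence in contaminants:
        index |= kmers(sequence.upper(), k)
    return index


def contaminant_fraction(read: str, index: Set[str], k: int) -> float:
    """Fraction of the read's distinct k-mers that appear in the index."""
    read_kmers = kmers(read.upper(), k)
    if not read_kmers:  # read shorter than k
        return 0.0
    return len(read_kmers & index) / len(read_kmers)


def is_contaminant(read: str, index: Set[str], k: int = 8,
                   threshold: float = 0.5) -> bool:
    """Flag a read when at least `threshold` of its k-mers match the index."""
    return contaminant_fraction(read, index, k) >= threshold


# Made-up contaminant sequence, for illustration only.
contaminant = "ACGTACGTTTGGCCAAGGTTCCAA"
index = build_index([contaminant], k=8)
from_contaminant = is_contaminant(contaminant[2:20], index)
unrelated = is_contaminant("T" * 20, index)
```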

5. De-duplication:

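• PCR duplicates arise from library amplification and are usually
marked or removed to prevent bias.
• Picard and Samtools identify duplicates from alignment coordinates,
so they run on aligned reads (BAM files) rather than on raw FASTQ
files.
• Tools:
o Picard (MarkDuplicates)
o Samtools markdup (replaces the older rmdup command)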
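
For intuition, duplicates can also be removed directly from reads by exact sequence identity. Note that this is a different technique from the alignment-based tools listed above, and it will miss duplicates that differ by a sequencing error. The sketch assumes single-end reads held in memory as (identifier, sequence, quality) tuples; the function name is hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, str, str]  # (identifier, sequence, quality string)


def deduplicate(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each distinct sequence, preserving input order."""
    seen: Set[str] = set()
    kept: List[Record] = []
    for identifier, sequence, quality in records:
        if sequence in seen:
            continue  # exact duplicate of an earlier read
        seen.add(sequence)
        kept.append((identifier, sequence, quality))
    return kept


reads = [
    ("read1", "ACGTACGT", "IIIIIIII"),
    ("read2", "ACGTACGT", "55555555"),  # same sequence as read1
    ("read3", "TTGGCCAA", "IIIIIIII"),
]
unique_reads = deduplicate(reads)
```

Keeping the first occurrence is an arbitrary choice made for simplicity; keeping the copy with the highest mean quality would be an equally reasonable design.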

4. Workflow Summary

Step                      Purpose                      Tools
1. Quality Check (QC)     Assess raw data quality      FastQC, MultiQC
2. Adapter Trimming       Remove adapter sequences     Cutadapt, Trimmomatic
3. Quality Filtering      Remove low-quality reads     fastp, PRINSEQ
4. Read Trimming          Remove low-quality bases     Sickle, Trim Galore!
5. Contaminant Removal    Filter unwanted reads        Bowtie2, Kraken2
6. De-duplication         Remove PCR duplicates        Picard, Samtools

Final Output After Pre-processing:

• Cleaned FASTQ Files: High-quality reads, free from adapters and
contaminants.
• Quality Metrics Report: Confirms that the data meets downstream
analysis requirements.

Key Takeaways:

1. FASTQ Files: Store raw sequencing reads and quality scores.
2. Quality Check: Detects sequencing errors and biases using tools like
FastQC.
3. Pre-processing: Improves data quality by trimming adapters,
filtering low-quality reads, and removing contaminants.
4. Tools: Essential tools include FastQC, Cutadapt, Trimmomatic,
Bowtie2, and Picard.
5. Next Steps After Pre-processing: Alignment, variant calling,
transcriptome assembly, or metagenomic analysis.
