Assignment of Bioinformatics
Session: 2020-2024
DNA Database
A DNA database or DNA databank is a database of DNA profiles which can be used in the
analysis of genetic diseases, genetic fingerprinting for criminology, or genetic genealogy. DNA
databases may be public or private, the largest ones being national DNA databases.
Types of Databases
Interpol's DNA Gateway was established in 2002, and at the end of 2013 it held more than 140,000
DNA profiles from 69 member countries. Unlike other DNA databases, the DNA Gateway is used
only for information sharing and comparison: it does not link a DNA profile to any individual,
and the physical or psychological conditions of an individual are not included in the database.
CODIS
The United States national DNA database is called the Combined DNA Index System (CODIS).
CODIS consists of three levels of information: Local DNA Index Systems (LDIS), where DNA
profiles originate; State DNA Index Systems (SDIS), which allow laboratories within a state
to share information; and the National DNA Index System (NDIS), which allows states to
compare DNA information with one another.
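The three-level structure described above can be pictured as a small tree. The following is only an illustrative sketch, not the CODIS software: the class name `IndexSystem`, the site names, and the profile IDs are hypothetical, and it assumes for simplicity that every profile entered locally is passed up to every higher level, which real eligibility rules do not guarantee.

```python
from dataclasses import dataclass, field


@dataclass
class IndexSystem:
    """One level of the hierarchy (local, state, or national). Hypothetical model."""
    name: str
    parent: "IndexSystem | None" = None
    profile_ids: set = field(default_factory=set)

    def upload(self, profile_id: str) -> None:
        """Record a profile at this level and at every level above it."""
        node = self
        while node is not None:
            node.profile_ids.add(profile_id)
            node = node.parent


# Hypothetical layout: one national index, two state indexes, two local labs.
ndis = IndexSystem("NDIS")
sdis_a = IndexSystem("SDIS-A", parent=ndis)
sdis_b = IndexSystem("SDIS-B", parent=ndis)
ldis_1 = IndexSystem("LDIS-1", parent=sdis_a)
ldis_2 = IndexSystem("LDIS-2", parent=sdis_b)

# Profiles originate at the local level.
ldis_1.upload("P001")
ldis_2.upload("P002")

print(sorted(sdis_a.profile_ids))  # profiles visible within state A
print(sorted(ndis.profile_ids))    # profiles comparable across states
```

In this sketch the two local laboratories cannot see each other's profiles directly; a comparison between them is only possible at the national level, which mirrors the role the text assigns to NDIS.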
The CODIS software contains multiple databases, depending on the type of information
being searched against. Examples of these databases include missing persons, convicted
offenders, and forensic samples collected from crime scenes. Each state, and the federal system,
has different laws for the collection, upload, and analysis of information contained within its
database. However, for privacy reasons, the CODIS database does not contain any personal
identifying information, such as the name associated with the DNA profile. The uploading
agency is notified of any hits to its samples and is tasked with disseminating personal
information pursuant to its laws.
The first national DNA database in the United Kingdom, the National DNA Database (NDNAD),
was established in April 1995. By 2006, it contained 2.7 million DNA profiles (about 5.2%
of the UK population), as well as other information from individuals and crime scenes. In 2020 it
had 6.6 million profiles (5.6 million individuals excluding duplicates). The information is stored
in the form of a digital code, which is based on the nomenclature of each short tandem repeat (STR).
In 1995 the database originally had 6 STR markers for each profile; this rose to 10 markers
from 1999 and, from 2014, to 16 core markers and a gender identifier. Scotland has used 21 STR
loci, two Y-DNA markers and a gender identifier since 2014. In the UK, police have wide-ranging
powers to take DNA samples and retain them if the subject is convicted of a recordable offence.
Because of the large number of DNA profiles stored in NDNAD, "cold hits" may occur during DNA
matching: an unexpected match is found between an individual's DNA profile and an unsolved
crime-scene DNA profile. This can introduce a new suspect into the investigation, thus helping
to solve old cases.
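The idea of a cold hit can be sketched in a few lines. This is a simplified illustration, not the NDNAD matching procedure: the function names, the `min_loci` threshold, the person and scene identifiers, and all allele values are hypothetical. The locus names (D3S1358, vWA, FGA, TH01) are commonly used STR markers, chosen here only as examples. Each profile is modelled as a mapping from locus to a pair of repeat counts.

```python
def normalise(profile):
    """Store each locus as a sorted allele pair so (15, 17) equals (17, 15)."""
    return {locus: tuple(sorted(alleles)) for locus, alleles in profile.items()}


def is_match(profile_a, profile_b, min_loci=4):
    """Two profiles match if they agree at every shared locus (assumed rule)."""
    a, b = normalise(profile_a), normalise(profile_b)
    shared = set(a) & set(b)
    if len(shared) < min_loci:
        return False  # too few loci in common to compare
    return all(a[locus] == b[locus] for locus in shared)


def find_cold_hits(individuals, crime_scenes, min_loci=4):
    """Return (person_id, scene_id) pairs whose profiles match."""
    hits = []
    for scene_id, scene_profile in crime_scenes.items():
        for person_id, person_profile in individuals.items():
            if is_match(person_profile, scene_profile, min_loci):
                hits.append((person_id, scene_id))
    return hits


# Hypothetical data: allele values are illustrative only.
individuals = {
    "person-1": {"D3S1358": (15, 17), "vWA": (16, 18), "FGA": (21, 24), "TH01": (6, 9.3)},
    "person-2": {"D3S1358": (14, 15), "vWA": (17, 17), "FGA": (20, 22), "TH01": (7, 9)},
}
crime_scenes = {
    "scene-A": {"D3S1358": (17, 15), "vWA": (18, 16), "FGA": (24, 21), "TH01": (9.3, 6)},
    "scene-B": {"D3S1358": (16, 16), "vWA": (14, 19), "FGA": (19, 23), "TH01": (8, 8)},
}

hits = find_cold_hits(individuals, crime_scenes)
print(hits)  # [('person-1', 'scene-A')]
```

Real forensic matching also has to handle partial profiles, mixtures, and statistical weighting of a match, none of which this sketch attempts.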
In England and Wales, anyone arrested on suspicion of a recordable offence must submit a DNA
sample, the profile of which is then stored on the DNA database. Those not charged or not found
guilty have their DNA data deleted within a specified period of time. In Scotland, the law
similarly requires the DNA profiles of most people who are acquitted be removed from the
database.
The Australian national DNA database is called the National Criminal Investigation DNA
Database (NCIDD). By July 2018, it contained more than 837,000 DNA profiles. The database
originally used nine STR loci and a sex gene for analysis; this was increased to 18 core markers
in 2013. NCIDD combines all forensic data, including DNA profiles, advanced biometrics and cold cases.
The Canadian national DNA database is called the National DNA Data Bank (NDDB). It
was established in 1998 but first used in 2000. The legislation that Parliament enacted to govern
the use of this technology within the criminal justice system has been found by Canadian courts
to be respectful of the constitutional and privacy rights of suspects, and of persons found guilty
of designated offences.
Applications:
Criminal Investigation: The NDDB assists law enforcement agencies in solving crimes by
matching DNA from crime scenes with profiles stored in the database.
Link Crimes: Matches between DNA evidence from different crime scenes help to identify
serial offenders.
Exonerate the Innocent: DNA evidence can clear individuals wrongfully accused or
convicted of crimes.
Identify Missing Persons: DNA from missing persons or their relatives is compared with
unidentified human remains.
The NFI (Netherlands Forensic Institute), formerly known as the Gerechtelijk Laboratorium, has a
long tradition in the Netherlands as the main institute that provides forensic services to the
criminal justice chain.
Forensic Services
CBRN: The NFI provides governments with expertise to help them prepare for, and prevent
terrorism-related incidents with chemical, biological, radiological and nuclear agents, possibly in
combination with explosives.
Forensics in Nuclear Security: The NFI has taken the lead in coordinating an integrated
international forensic response to nuclear terrorism.
Wildlife Forensics: The NFI offers wildlife forensic services to governments and government
agencies in order to combat illegal trading in endangered, protected or dangerous species of flora
and fauna.
A medical DNA database is a DNA database of medically relevant genetic variations. It collects
individuals' DNA, which can be related to their medical records and lifestyle details. By
recording DNA profiles, scientists may discover interactions between the genetic environment
and the occurrence of certain diseases (such as cardiovascular disease or cancer), and thus find
new drugs or effective treatments for controlling these diseases. Such a database is often run in
collaboration with the National Health Service.
The Cancer Genome Atlas (TCGA) is a project to catalogue the genomic alterations responsible
for cancer using genome sequencing and bioinformatics. The overarching goal was to apply
high-throughput genome analysis techniques to improve the ability to diagnose, treat, and
prevent cancer through a better understanding of the genetic basis of the disease.
TCGA was supervised by the National Cancer Institute's Center for Cancer Genomics and the
National Human Genome Research Institute, funded by the US government. A three-year pilot
project, begun in 2006, focused on characterization of three types of human cancers:
glioblastoma multiforme, lung squamous carcinoma, and ovarian serous adenocarcinoma. In
2009, it expanded into phase II, which planned to complete the genomic characterization and
sequence analysis of 20–25 different tumor types by 2014. Ultimately, TCGA surpassed that
goal, characterizing 33 cancer types including 10 rare cancers.
Goals
The goal of TCGA's pilot project was to establish an infrastructure to collect, molecularly
characterize, and analyze 500 cancers and matched controls. The work required extensive
cooperation among a team of scientists from various institutions and assessment of multiple
burgeoning high-throughput technologies. TCGA wanted to not only generate high-quality and
biologically meaningful genomic data, but also make that data freely available to the cancer
research community.
GISAID
GISAID (the Global Initiative on Sharing All Influenza Data) is a global science initiative
established in 2008 to provide access to genomic data of influenza viruses. The database was
expanded to include the coronavirus responsible for the COVID-19 pandemic, as well as other
pathogens. The database has been described as "the world's largest repository of COVID-19
sequences". GISAID facilitates genomic epidemiology and real-time surveillance to monitor the
emergence of new COVID-19 viral strains across the planet.
GISAID's collection has also been described as "by far the world's largest database of
SARS-CoV-2 sequences". By mid-April 2021, GISAID's SARS-CoV-2 database had received over
1,200,000 submissions from researchers in over 170 different countries. Only three months later,
the number of uploaded SARS-CoV-2 sequences had doubled, to over 2.4 million. By late 2021, the
database contained over 5 million genome sequences; as of December 2021, over 6 million
sequences had been submitted; by April 2022, 10 million sequences had accumulated; and
in January 2023 the number reached 14.4 million.
In January 2020, the SARS-CoV-2 genetic sequence data was shared through GISAID.
Throughout the first year of the COVID-19 pandemic, most of the SARS-CoV-2 whole-genome
sequences that were generated and shared globally were submitted through GISAID. When the
SARS-CoV-2 Omicron variant was detected in South Africa, by quickly uploading the sequence
to GISAID, the National Institute for Communicable Diseases there was able to learn that
Botswana and Hong Kong had also reported cases possessing the same gene sequence.
National and forensic DNA databases are not available for non-police purposes. Because DNA
profiles can also be used for genealogical purposes, separate genetic genealogy databases have
been created to store DNA profiles from genealogical DNA tests.
GenBank
GenBank is a public database of annotated nucleotide sequences maintained by the National
Center for Biotechnology Information (NCBI). GenBank contains a large number of DNA sequences
gained from more than 140,000 registered organizations, and is updated every day to ensure a
uniform and comprehensive collection of sequence information. The sequences are mainly
obtained from individual laboratories or large-scale sequencing projects. The files stored in
GenBank are divided into different groups, such as BCT (bacterial), VRL (viruses), PRI
(primates), etc. People can access GenBank through NCBI's retrieval system, and then use the
BLAST function to identify a certain sequence within GenBank or to find the similarities
between two sequences.
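To make "similarity between two sequences" concrete, the sketch below computes percent identity and slides a short query along a longer sequence to find the best ungapped position. This is a deliberately naive stand-in and not the BLAST algorithm, which uses seeded heuristic local alignment with scoring and statistics. The function names and the example sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(seq_a.upper(), seq_b.upper()))
    return 100.0 * matches / len(seq_a)


def best_ungapped_hit(query, subject):
    """Slide the query along the subject; return (offset, identity) of the best window."""
    if len(query) > len(subject):
        raise ValueError("query must not be longer than subject")
    best_offset, best_identity = 0, -1.0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        identity = percent_identity(query, window)
        if identity > best_identity:  # keep the first best window on ties
            best_offset, best_identity = offset, identity
    return best_offset, best_identity


# Hypothetical sequences for illustration.
print(percent_identity("ACGT", "ACGA"))             # 75.0
print(best_ungapped_hit("ACGTAC", "TTACGTACGG"))    # (2, 100.0)
```

The exhaustive scan is fine for short strings but would be far too slow against a database of GenBank's size, which is why heuristic tools are used in practice.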
The 1000 Genomes Project (1KGP), which ran from January 2008 to 2015, was an international
research effort to establish the most detailed catalogue of human genetic variation at the time.
Scientists planned to sequence the genomes of at least one thousand anonymous healthy
participants from a number of different ethnic groups within the following three years, using
advancements in newly developed technologies. In 2010, the project finished its pilot phase,
which was described in detail in a publication in the journal Nature. In 2012, the sequencing of
1092 genomes was announced in a Nature publication. In 2015, two papers in Nature reported
results and the completion of the project and opportunities for future research.
The 1000 Genomes Project was designed to bridge the gap in knowledge between rare genetic
variants that have a severe effect, predominantly on simple traits (e.g. cystic fibrosis, Huntington
disease), and common genetic variants that have a mild effect and are implicated in complex traits
(e.g. cognition, diabetes, heart disease).
The primary goal of this project was to create a complete and detailed catalogue of human
genetic variations, which can be used for association studies relating genetic variation to disease.
The International HapMap Project was an organization that aimed to develop a haplotype map
(HapMap) of the human genome, to describe the common patterns of human genetic variation.
HapMap is used to find genetic variants affecting health, disease and responses to drugs and
environmental factors. The information produced by the project is made freely available for
research. The project was a collaboration among researchers at academic
centers, non-profit biomedical research groups and private companies in Canada, China
(including Hong Kong), Japan, Nigeria, the United Kingdom, and the United States.
It officially started with a meeting held from October 27 to 29, 2002, and was expected to take
about three years. It comprised two phases; the complete data obtained in Phase I were published
on October 27, 2005. The analysis of the Phase II dataset was published in October 2007.
Objectives
1. Identify SNPs: Systematically identify common genetic variations (SNPs) across different
populations.
2. Map Haplotypes: Determine how these SNPs are organized into blocks, called haplotypes,
which are inherited together.
3. Facilitate Research: Provide researchers with tools to find genes associated with diseases
and responses to pharmaceuticals and environmental factors.
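The second objective, grouping SNPs into haplotypes, can be illustrated with a toy count. This is a minimal sketch and not the HapMap analysis pipeline: it assumes the alleles are already phased (each string is one chromosome read across three neighbouring SNPs), and the function name and data are hypothetical.

```python
from collections import Counter


def haplotype_frequencies(chromosomes):
    """Return the observed frequency of each distinct allele combination."""
    counts = Counter(chromosomes)
    total = sum(counts.values())
    return {haplotype: n / total for haplotype, n in counts.items()}


# Hypothetical phased data: three biallelic SNPs (A/G, C/T, G/A) on ten chromosomes.
chromosomes = ["ACG"] * 5 + ["GTA"] * 4 + ["ACA"] * 1

frequencies = haplotype_frequencies(chromosomes)
possible = 2 ** 3  # combinations possible if the three SNPs varied independently

print(frequencies)  # {'ACG': 0.5, 'GTA': 0.4, 'ACA': 0.1}
print(f"{len(frequencies)} of {possible} possible haplotypes observed")
```

Because only a few of the possible combinations actually occur in this toy data, knowing the allele at one SNP largely predicts the others, which is the property a haplotype map exploits.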
China National GeneBank or CNGB (Chinese: 国家基因库) is China's first national-level gene
storage bank, approved and funded by the Chinese government. Based in the Dapeng Peninsula
of Shenzhen, CNGB's mission is to support public welfare, life science research and innovation,
as well as industry incubation, through effective bioresource conservation, digitalization and
utilization.
Application
Based on research projects supported by and resources stored in CNGB, CNGBdb has
developed multiple scientific databases to advance scientific discoveries in major research areas,
such as plants, animals, micro-organisms, health and diseases, etc. The databases provide not
only high-quality datasets but also specialized analysis tools.
23andMe Holding Co. is a publicly traded personal genomics and biotechnology company based
in South San Francisco, California. It is best known for providing a direct-to-consumer genetic
testing service in which customers provide a saliva sample that is laboratory analysed, using
single nucleotide polymorphism genotyping, to generate reports relating to the customer's
ancestry and genetic predispositions to health-related topics. The company's name is derived
from the 23 pairs of chromosomes in a diploid human cell.
Founded in 2006, 23andMe soon became the first company to begin offering autosomal DNA
testing for ancestry, which all other major companies now use. Its saliva-based direct-to-
consumer genetic testing business was named "Invention of the Year" by Time in 2008.
MyHeritage
MyHeritage is an online genealogy platform with web, mobile, and software products and
services, introduced by the Israeli company MyHeritage in 2003. Users of the platform can
build their family trees, upload and browse through photos, and search through over 19.9 billion
historical records, among other features.
As of 2023, the service supports 42 languages. In 2016, it launched a genetic testing service
called MyHeritage DNA, with more than 6.5 million DNA kits in the company's database by
March 2023.
DNA results are obtained from home test kits, which let users collect samples with cheek swabs.
The results provide DNA matching and ethnicity estimates.
MyHeritage has included on its website a series of image-editing tools, ranging from restoration
to colorization and animation of images. Some of them use artificial intelligence, such as the
Photo Enhancer and MyHeritage In Color, both launched in 2020, and the Photo Repair,
launched in 2021.
In May 2010, FamilyTreeDNA launched an autosomal, microarray-chip-based DNA test, which it
called Family Finder.
Family Finder allows customers to match relatives as distant as about fifth cousins. Family
Finder also includes a component called myOrigins. The results of this test provide the percentages
of a customer's DNA associated with general regions or specific ethnic groups (e.g. Western Europe,
Asia, Jewish, Native American).
In December 2018, FamilyTreeDNA changed its terms of service to allow law enforcement to
use its service to identify suspects of "a violent crime" (defined as child abduction, sexual
assault or homicide) or to identify the remains of victims. The company confirmed it was working
with the FBI on at least a handful of cases.