
CR Micro

Bioinformatics utilizes diverse databases and applications to analyze biological data. It involves developing software tools and databases to store, retrieve, and analyze large volumes of genetic information. Common databases include those for genomes, proteins, sequences, gene expression, and phylogeny. Applications include genomic analysis, proteomics, structural biology, and clinical genomics, offering insights across fields.

Uploaded by

ankitraj318

1. ~Appraise bioinformatics, explain different databases system, state 3 applications of bioinformatics.

Bioinformatics is an interdisciplinary field that combines biology, computer science, and mathematics to analyze and interpret biological data. It involves the development and application of software tools and databases to store, retrieve, and analyze large volumes of biological and genetic information.
Database Systems:
1. Genomic Databases: GenBank, Ensembl, NCBI Genome.
2. Protein Databases: UniProt, PDB, Pfam.
3. Sequence Databases and Search Tools: BLAST, FASTA.
4. Gene Expression Databases: GEO, ArrayExpress.
5. Phylogenetic Databases: Tree of Life, NCBI Taxonomy.
Applications:
1. Genomic Analysis: Identifying genes, regulatory elements, and variations for genetic diseases and evolutionary studies.
2. Proteomics and Structural Biology: Predicting protein structures, analyzing interactions, and aiding drug discovery.
3. Clinical Genomics: Analyzing patient DNA for diagnosis, risk assessment, and personalized medicine.
In summary, bioinformatics utilizes diverse databases and applications to analyze biological data, offering insights into genomics, proteomics, and personalized healthcare.

1. ~Establish the features and objectives of a biological database

Information contained in biological databases includes gene function, structure, localization (both cellular & chromosomal), clinical effects of mutations as well as similarities of biological sequences & structures.
4 objectives of biological databases:
- To make all relevant data available at one place.
- To store all relevant information easily.
- To make biological data available to the scientist.
- To update existing information easily.

2. ~Explain the significance of biological databases, focusing on sequence data. Provide examples of databases that store sequence data and discuss how researchers can utilize these resources.

Biological databases, particularly those storing sequence data, are essential for researchers to access and analyze genetic information. For example, the NCBI (National Center for Biotechnology Information) database contains vast amounts of DNA and protein sequence data, as well as associated metadata.
Researchers can utilize these databases to perform sequence alignments, identify genes, study genetic variation, and conduct phylogenetic analyses. These resources enable scientists to explore genetic relationships, investigate functional elements, and contribute to advances in fields like genomics and proteomics.

1. ~Describe the steps for BLAST and classify its different types.
Main steps of BLAST are:
Step 1: The first step is to create a lookup table or list of words from the query sequence. This step is also called seeding. BLAST takes the query sequence and breaks it into short segments called words.
Step 2: Search the database for exact matches to the list of words compiled in Step 1.
Step 3: BLAST then extends the matching words and scores their similarity. The matching of the words is scored by a given substitution matrix.
Step 4: Evaluate the significance of the extended hits from Step 3.
There are five types of BLAST that are differentiated based on the type of sequence (DNA or protein) of the query and database sequences. They are: BLASTN, BLASTP, BLASTX, TBLASTN and TBLASTX.

2. Estimate the characteristics and the applications of BLAST.
Several key features of BLAST make it a widely used tool in bioinformatics.
- BLAST is fast and efficient, making it possible to handle large databases of sequences.
- It is a flexible and versatile tool as it can be used to search for similarities in both nucleotide and protein sequences.
- It is highly sensitive, which allows the identification of even small similarities between sequences.
- It aims to identify regions of local similarity between the query sequence and the database sequence, rather than attempting to align the entire sequences.
- It has a user-friendly interface that makes it easy to input query sequences and interpret the results.
Applications of BLAST are:
- BLAST can be used to identify unknown sequences by comparing them with known sequences in a database, which helps in predicting the functions of proteins or genes.
- BLAST can also be used in phylogenetic analysis, which is important for understanding the evolutionary relationships between different species.
- BLAST can also be used to identify functionally conserved domains within proteins, which is important for predicting the functions of proteins.

3. ~Articulate the different types of phylogenetic tree.
- Rooted tree: Makes an inference about the most recent common ancestor of the leaves or branches of the tree.
- Un-rooted tree: Illustrates the relationships among the leaves or branches and does not make any assumption regarding the most recent common ancestor.
- Bifurcating tree: A phylogenetic tree in which each node splits into exactly two branches is referred to as a bifurcating tree. It can be further divided into rooted and unrooted bifurcating trees.
- Multifurcating tree: More than two branches can be found on a single node in a multifurcating tree, as the name suggests. It can likewise be divided into rooted and unrooted multifurcating trees.

3. ~Differentiate between primary, secondary and composite databases with examples of each.
Primary databases store and make data available to the public, acting as repositories. Example: GenBank, DDBJ.
Secondary databases make use of publicly available sequence data in primary databases to provide layers of information to DNA or protein sequence data. Example: UniProt Knowledgebase.
Composite databases are meant for keeping records of specific datasets meant for specific purposes and applications. Example: OMIM.

4. ~Infer Global alignment.
Global alignment is a method of comparing two sequences, which aligns the entire length of the sequences by maximizing the overall similarity. This method is used when comparing sequences that are of similar length. Global alignment is based on the Needleman-Wunsch algorithm. In global alignment, the sequences to be aligned are assumed to be generally similar over their entire length. Alignment is carried out from beginning to end of both sequences to find the best possible alignment across the entire length between the sequences. The two sequences are treated as potentially equivalent.

5. ~Describe the primary purpose of the NCBI database in the field of bioinformatics.
The NCBI database, or the National Center for Biotechnology Information database, serves as a central repository for a wide range of biological and genetic information. Its primary purpose is to provide researchers, scientists, and the public with access to data related to genetics, genomics, and other biological sciences. It hosts DNA and protein sequences, genomic data, literature references, and tools for sequence analysis. Researchers use NCBI to study genetic variations, conduct comparative genomics, and access valuable information for various biological research purposes.

6. ~Infer Local alignment and describe its application.
In local alignment, instead of attempting to align the entire length of the sequences, only the regions with the highest density of matches are aligned. This is useful for identifying short, conserved regions in protein or nucleotide sequences. Local alignment programs are based on the Smith-Waterman algorithm. Local alignment does not assume that the two sequences in question are similar over their entire length; rather, it only finds local regions with the highest level of similarity between the two sequences and aligns these regions without regard for the alignment of the rest of the sequence regions. There are three primary methods for producing local alignments: the dot matrix method, dynamic programming, and the word or k-tuple method.
Goal: See whether a substring in one sequence aligns well with a substring in the other.
Application:
1. Searching for local similarities in a large sequence (for example, a newly sequenced genome).
2. Searching for conserved domains or motifs.
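The local alignment idea described above can be sketched with a small Smith-Waterman implementation. This is only an illustrative sketch: the function name `smith_waterman` and the scoring values (match = 2, mismatch = -1, linear gap = -2) are assumptions chosen for the example, not values taken from these notes.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart anywhere: this is what makes it local.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```

Only the shared region ACGT is aligned; the dissimilar flanking regions are ignored, which is the behaviour that distinguishes local from global alignment.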

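The seeding and word-matching steps of BLAST (Steps 1 and 2 above) can be sketched as follows. This is a simplified illustration, not the actual BLAST implementation: the helper names `build_word_index` and `find_seeds` and the word size `w=3` are assumptions for the example, and only exact word matches are considered.

```python
from collections import defaultdict


def build_word_index(query, w=3):
    """Step 1 (seeding): map every length-w word of the query to its positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def find_seeds(query, subject, w=3):
    """Step 2: scan the database sequence for exact matches to the query words.

    Returns (query_position, subject_position) pairs; a real search would
    then extend and score each seed.
    """
    index = build_word_index(query, w)
    seeds = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            seeds.append((i, j))
    return seeds


print(find_seeds("ACGTA", "TTACGT"))  # [(0, 2), (1, 3)]
```

Each seed marks a place where extension and scoring (Steps 3 and 4) would begin.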
7. ~Develop the importance of studying Bioinformatics

Understanding Biological Processes: Unraveling molecular and genomic data enhances knowledge of genetics, evolution, and disease mechanisms, driving progress in medicine, agriculture, and environmental science.
Drug Discovery: Bioinformatics accelerates drug discovery by identifying targets, screening compounds, and predicting drug effects, streamlining the development process and reducing costs.
Personalized Medicine: Analyzing individual genetic profiles facilitates personalized medicine, optimizing treatment effectiveness and minimizing adverse effects.
Genomic Medicine: Accessible genome sequencing aids in identifying disease-causing mutations, understanding genetic bases of diseases, and developing genetic tests, advancing genomic medicine.

4. ~Biology is important in computer science. Analyze your answer with suitable examples.
Biology and computer science are two seemingly distinct fields, but there are several areas where they intersect and complement each other. Here are some justifications for the importance of biology in computer science, along with suitable examples:
Bioinformatics: Uses computational techniques for genomics and proteomics. Example: Genome sequencing employs algorithms for DNA sequence assembly, advancing medical research.
Computational Biology: Models and simulates biological processes, aiding drug discovery. Example: Computer models predict drug molecule interactions with biological targets, expediting drug development.
Machine Learning and AI: Analyzes biological datasets, predicts protein structures, and identifies drug candidates. Example: Deep learning predicts protein structures from amino acid sequences, aiding drug design.
Biological Networks: Computer science analyzes complex networks like gene regulatory systems. Example: Network analysis identifies key genes in disease pathways, offering therapeutic targets.
Phylogenetics: Uses computational algorithms to infer evolutionary relationships. Example: The Maximum Likelihood method reconstructs species evolutionary trees, aiding understanding of biodiversity.
In summary, biology and computer science collaborate to manage and analyze biological data, advancing understanding in medicine, biology, and technology development.

6. ~Illustrate K means.
The k-means algorithm is an iterative clustering technique that aims to partition a dataset into 'k' clusters. The key steps include initializing cluster centroids, assigning data points to the nearest centroid, updating centroids based on assigned points, and repeating these steps until convergence. The strengths of k-means include simplicity and scalability, but it assumes that clusters are spherical and equally sized, and it is sensitive to initialization and outliers.

2. ~Discuss Bayes theorem, Naïve Bayes classifier and neighbor joining algorithms.
Bayes' theorem is a fundamental concept in probability theory and statistics that has wide-ranging applications in various fields, including bioinformatics. It is a mathematical formula that describes how to update the probability of a hypothesis (an event or statement) based on new evidence.
Bayes' theorem is used to calculate conditional probabilities. It allows us to update our beliefs about the likelihood of a hypothesis being true when we obtain new data or evidence.
Formula: The formal expression of Bayes' theorem is as follows:
P(H|E) = [ P(E|H) * P(H) ] / P(E)
Where:
P(H|E) is the posterior probability of hypothesis H given evidence E.
P(E|H) is the probability of observing evidence E given that hypothesis H is true.
P(H) is the prior probability of hypothesis H before considering evidence E.
P(E) is the probability of observing evidence E.

1. ~Explain the process and significance of a microarray experiment in gene expression analysis. Discuss how microarray technology has transformed our understanding of gene regulation.
Microarray experiments involve hybridizing RNA samples to microarray chips containing thousands of gene probes, allowing simultaneous measurement of gene expression levels. They have significantly advanced our understanding of gene regulation by enabling the study of gene expression on a genome-wide scale. Researchers can identify differentially expressed genes under various conditions, discover biomarkers, and uncover regulatory networks. This technology has been instrumental in fields such as cancer research, where it has helped identify genes associated with specific cancer subtypes and potential therapeutic targets.

A microarray is a laboratory tool used to detect the expression of thousands of genes at the same time. DNA microarrays are microscope slides that are printed with thousands of tiny spots in defined positions, with each spot containing a known DNA sequence or gene. Often, these slides are referred to as gene chips or DNA chips. The DNA molecules attached to each slide act as probes to detect gene expression, which is also known as the transcriptome or the set of messenger RNA (mRNA) transcripts expressed by a group of genes.

3. ~Analyze the central dogma procedure in brief.
The central dogma of molecular biology is a fundamental concept that describes the flow of genetic information in biological systems. It consists of three main processes:
Replication: Copying DNA to produce identical molecules before cell division. DNA unwinds, and each strand serves as a template for a complementary strand, resulting in two identical DNA molecules.
Transcription: DNA information is used to synthesize RNA (mRNA) in the nucleus (eukaryotes) or nucleoid (prokaryotes). RNA polymerase reads DNA, producing a complementary mRNA strand.
Translation: mRNA guides protein synthesis on ribosomes in the cytoplasm. Ribosomes match mRNA codons with specific amino acids carried by tRNA, forming a polypeptide chain. This chain folds into a functional protein.
In essence, the Central Dogma explains genetic information flow: DNA to RNA (transcription) to protein (translation), forming a foundational framework for genetic expression in organisms.

1. Differentiate between cladogram and phylogenetic tree construction.
Cladograms and phylogenetic trees are functionally very similar, but they show different things. Cladograms do not indicate time or the amount of difference between groups, whereas phylogenetic trees often indicate time spans between branching points.

2. Discuss the structure of a Nucleotide in brief and mention its different types.
A nucleotide is the basic building block of nucleic acids (RNA and DNA). A nucleotide consists of a sugar molecule (either ribose in RNA or deoxyribose in DNA) attached to a phosphate group and a nitrogen-containing base. The bases used in DNA are adenine (A), cytosine (C), guanine (G) and thymine (T). In RNA, uracil (U) takes the place of thymine.

3. ~Describe the main objective of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) algorithm in data mining, and explain how it achieves this objective.
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm that can cluster large datasets by first generating a small and compact summary of the large dataset that retains as much information as possible. This smaller summary is then clustered instead of clustering the larger dataset. The BIRCH clustering algorithm consists of two stages:
- Building the CF Tree: BIRCH condenses large datasets into Clustering Feature (CF) entries, each represented as (N, LS, SS), denoting cluster size, linear sum, and squared sum. CF entries can be composed hierarchically, and the initial CF tree may be optionally condensed for efficiency.
- Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree. A CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a non-leaf node of a CF tree contains a pointer to a child node and a CF entry made up of the sum of CF entries in the child nodes. Optionally, we can refine these clusters.
Due to this two-step process, BIRCH is also called Two Step Clustering.

4. ~Explain the basic idea behind the DIANA (Divisive Analysis) clustering algorithm in data mining, and describe the key steps involved in its process.
DIANA is also known as the DIvisive ANAlysis clustering algorithm. It is the top-down form of hierarchical clustering, where all data points are initially assigned to a single cluster. The clusters are then split into the two least similar clusters. This is done recursively until cluster groups are formed that are distinct from each other.
In step 1 of the illustration, the blue outlined circle represents all the points assigned to a single cluster. Moving forward, it is divided into 2 red-colored clusters based on the distances/density of points. Now, we have two red-colored clusters in step 2. Lastly, in step 3 the two red clusters are each further divided into 2 black dotted clusters, again based on density and distances, to give us the final four clusters. Since the points in the respective four clusters are very similar to each other and very different when compared to the other cluster groups, they are not further divided. This is how we get DIANA clusters, or top-down hierarchical clusters.
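The (N, LS, SS) Clustering Feature used by BIRCH can be illustrated with a short sketch. It shows why CF entries can be composed hierarchically: two entries merge by simple addition, and the centroid and radius of a sub-cluster can be recovered without revisiting the original points. The class name `CF` and its methods are hypothetical, and storing SS as a single scalar (the sum of squared norms) is one common convention assumed here.

```python
import math
from dataclasses import dataclass


@dataclass
class CF:
    n: int       # N: number of points summarized
    ls: tuple    # LS: per-dimension linear sum of the points
    ss: float    # SS: sum of squared norms of the points (assumed scalar form)

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        # CF entries are additive, so a parent entry is the sum of its children.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root mean squared distance of the summarized points from the centroid.
        c = self.centroid()
        variance = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))


cf = CF.from_point((0, 0)).merge(CF.from_point((2, 0)))
print(cf.n, cf.ls, cf.ss, cf.centroid(), cf.radius())  # 2 (2, 0) 4 (1.0, 0.0) 1.0
```

Because only three quantities are kept per sub-cluster, the summary stays small regardless of how many points it absorbs.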
6. Identify the key characteristics of dynamic programming algorithms in the context of sequence alignment, and inspect why they are used.
Dynamic programming algorithms are characterized by their ability to solve complex problems by breaking them down into smaller overlapping subproblems. They are used in sequence alignment to find the optimal alignment by considering all possible alignments and selecting the best one.

7. In sequence alignment, analyze the primary purpose of heuristic alignment algorithms, and give an example of one such algorithm?
Heuristic alignment algorithms aim to find reasonably good alignments quickly, often by making simplifying assumptions. An example is the BLAST (Basic Local Alignment Search Tool) algorithm, which rapidly identifies local sequence similarities.

8. Identify the use of Neighbor Joining Algorithm in the context of phylogenetic tree construction?
The Neighbor Joining Algorithm is used to construct phylogenetic trees from distance matrices. It iteratively joins pairs of taxa or clusters based on their pairwise distances to build a hierarchical tree representing evolutionary relationships.

9. Relate the fundamental concept behind the dynamic programming algorithms in pairwise sequence alignment?
Dynamic programming algorithms, such as the Needleman-Wunsch and Smith-Waterman algorithms, use a matrix-based approach to find the optimal alignment by considering all possible alignment paths and choosing the one with the highest score.

10. Interpret some common challenges faced in the integration of biological data from various sources in systems biology studies?
Challenges include data heterogeneity, differing data formats, data quality issues, and the need for robust data integration methods to combine diverse biological datasets effectively.

5. Interpret what does Bayes' Theorem state.
Bayes' theorem describes the probability of occurrence of an event related to any condition. It is also considered for the case of conditional probability. Bayes' theorem is also known as the formula for the probability of “causes”.
P(H|E) = [ P(E|H) * P(H) ] / P(E)
Where:
P(H|E) is the posterior probability of hypothesis H given evidence E.
P(E|H) is the probability of observing evidence E given that hypothesis H is true.
P(H) is the prior probability of hypothesis H before considering evidence E.
P(E) is the probability of observing evidence E.

6. Identify the main idea behind the Naïve Bayes classifier, and infer that how does it handle feature independence?
The Naïve Bayes classifier assumes that all features are conditionally independent, given the class label. It calculates the probability of a data point belonging to a class by multiplying the individual conditional probabilities of each feature given that class, together with the prior probability of the class.

1. Find the key difference between k-means and k-medoid clustering algorithms?
K-means uses the mean (centroid) of data points in a cluster to represent it, whereas k-medoid uses the actual data point (medoid) to represent the cluster.
K-Means uses the average of all points in a cluster (centroid), which may not be an actual data point. K-Medoids selects an actual data point as the center (medoid) of the cluster.
Outlier Sensitivity: K-Means is sensitive to outliers. K-Medoids is more robust to outliers and noise.
Computational Complexity: K-Means is computationally less expensive compared to K-Medoids.
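As a worked illustration of the Bayes' theorem formula given above, the sketch below computes a posterior from a prior and two conditional probabilities. All numbers are hypothetical, the function name `posterior` is an assumption, and P(E) is expanded with the law of total probability over H and not-H, which the notes do not spell out.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H|E) = P(E|H) * P(H) / P(E), with P(E) summed over H and not-H."""
    evidence = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / evidence


# Hypothetical example: a variant is pathogenic in 1% of cases (prior);
# a predictor flags 90% of pathogenic and 5% of benign variants.
print(round(posterior(0.01, 0.90, 0.05), 4))  # 0.1538
```

Even with a fairly accurate predictor, the posterior stays modest because the prior is small, which is exactly the kind of belief update the theorem formalizes.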

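The difference between a centroid (k-means) and a medoid (k-medoids), and their differing sensitivity to outliers, can be seen on a toy cluster. This is a minimal sketch with made-up points; the helper names `centroid` and `medoid` are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math


def centroid(points):
    """Mean of the points: the k-means representative (may not be a data point)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def medoid(points):
    """Data point with the smallest total distance to all others (k-medoids)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Four tightly grouped points plus one outlier at (50, 50).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (50, 50)]
print(centroid(cluster))  # pulled toward the outlier, roughly (11.2, 11.2)
print(medoid(cluster))    # (2, 2), an actual member of the tight group
```

The medoid computation compares every pair of points, which also hints at why k-medoids is the more expensive of the two.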

2. Discover what are some common challenges faced in the integration of biological data.
Some common challenges in the integration of biological data include:
Data Heterogeneity: Biological data comes from various sources, such as genomics, proteomics, and clinical studies, each with different formats and standards, making it challenging to integrate.
Data Volume: The sheer volume of biological data generated is enormous, leading to issues related to storage, processing, and analysis.
Data Quality: Ensuring the accuracy and consistency of data from diverse sources is a significant challenge, as errors or inconsistencies can lead to misleading conclusions.
These challenges make the integration of biological data a complex task that requires specialized tools and techniques to address.

3. Outline the primary advantage of the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) clustering algorithm?
BIRCH excels in managing large datasets, a critical advantage in the era of big data. Its hierarchical structure and space-saving data summarization techniques enable it to handle extensive data efficiently. The algorithm iteratively reduces the data size and performs clustering at multiple levels, making it suitable for applications dealing with vast amounts of information. This efficiency is particularly valuable in scenarios where traditional clustering algorithms may struggle due to memory or processing constraints.

4. Interpret that what do grid-based clustering methods primarily rely on to divide the data space into cells or regions?
Grid-based methods rely on a grid structure to divide the data space into cells or regions, making them suitable for handling uniformly distributed data.

5. Analyze fundamental concept behind ISODATA (Iterative Self-Organizing Data Analysis Technique).
Grid-based clustering methods primarily rely on a structured grid to partition the data space into cells or regions. This approach is particularly effective when dealing with uniformly distributed data. The grid structure provides a systematic and organized way to divide the dataset into discrete units, making it easier to identify clusters and analyze spatial relationships. This method simplifies the clustering process, especially in scenarios where the distribution of data points is regular and can be aligned with the grid structure. The reliance on a grid facilitates efficient data organization and retrieval, contributing to the effectiveness of grid-based clustering methods in handling certain types of datasets.

11. In the context of bioinformatics, identify the significance of microarray experiments.
Microarray experiments are used to measure the expression levels of thousands of genes simultaneously, helping researchers study gene expression patterns under different conditions and gain insights into cellular processes.

12. Analyze some critical issues related to the design of a biological information system, especially when dealing with large-scale datasets?
Issues include data storage and retrieval efficiency, data security, scalability to handle large datasets, user-friendly interfaces for data access and analysis, and compliance with ethical and privacy standards.

13. Explain in a few sentences what a scoring model is in bioinformatics.
A scoring model in bioinformatics is a mathematical system used to assign scores or values to various biological sequence alignments. It helps determine how well two sequences align with each other, with higher scores indicating better alignment. Scoring models are essential in tasks such as sequence alignment, where they aid in identifying similarities and differences between biological sequences.

14. Express why are positive scores typically assigned to matching nucleotides or amino acids in scoring models for sequence alignment?
Positive scores are assigned to matching nucleotides or amino acids in scoring models because they reflect the idea that identical or similar sequences in biological molecules are biologically significant. A positive score encourages the alignment algorithm to prioritize regions of similarity, helping to identify homologous sequences and functional similarities between biological molecules.

15. Briefly describe the role of gap penalties in scoring models for sequence alignment.
Gap penalties in scoring models for sequence alignment are used to account for the introduction of gaps (insertions or deletions) in the alignment. Gap penalties are typically negative values. They discourage excessive gap creation, ensuring that the alignment algorithm favors alignments with fewer gaps. This helps to maintain biologically meaningful alignments by penalizing gaps that may not reflect true evolutionary relationships.

16. Explain the significance of GenBank, one of the primary databases hosted by NCBI.
GenBank is a critical component of the NCBI database. It is a repository for DNA and RNA sequences submitted by researchers worldwide. The significance of GenBank lies in its role as a comprehensive and freely accessible collection of genetic information. Researchers can deposit their sequences into GenBank, allowing others to access and use this data for various research purposes, including gene discovery, phylogenetic studies, and understanding the genetic basis of diseases. GenBank promotes data sharing, collaboration, and scientific advancement in the field of molecular biology and genetics.
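The scoring model ideas above (positive match scores, negative mismatch scores, negative gap penalties) can be combined into a small global alignment scorer in the style of Needleman-Wunsch. This is a sketch under stated assumptions: the function name `needleman_wunsch_score`, the score values, and the linear (per-position) gap penalty are illustrative choices, not values from these notes.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j];
    # aligning a prefix against nothing costs one gap per character.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two characters, or open a gap in either sequence.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))  # 7: seven matches
print(needleman_wunsch_score("ACGT", "ACT"))         # 1: three matches, one gap
```

Making the gap penalty more negative pushes the optimum toward alignments with fewer gaps, which is the role of gap penalties described in the notes.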
