Hands-On Data Science for
Biologists Using Python
Typeset in Times
by MPS Limited, Dehradun
Contents
Preface
Author Bio
3. Biopython
Introduction
Installing Biopython
Biopython Seq Class
Parsing Sequence Files
Writing Files
Pairwise Sequence Alignment
BLAST with Biopython
Multiple Sequence Alignment
Construction of a Phylogenetic Tree
Handling PDB Files
Exercise
7. Hands-On Projects
Differential Gene Expression Analysis
Quality Control
Normalization
Differential Expression Analysis
Cluster Map
Gene Enrichment Analysis
SNP Analysis
Exercise
Index
Preface
Data science is rapidly becoming a vital discipline involving the use of big data to extract meaningful
information. With the advent of high throughput technologies in the field of healthcare, it is becoming
increasingly imperative for life science researchers to analyze the massive amount of data being
generated. Researchers with little or no computational skills often find the task challenging. In order to
overcome this challenge, we have meticulously drafted this book, using illustrative examples, as a
stepwise guide to ease newcomers from the field of life sciences to the field of data science. We have
chosen Python as our programming language of choice because of its easy accessibility on all operating
systems, versatility, comprehensible interface, ease of use, object-oriented features, and wide range of
applicability.
This book will serve as a beginner’s guide for anyone interested in the basics of programming, data
science, and Machine Learning. Every topic has an intuitive explanation of concepts and is accompanied
by the implementation of the concepts using biological examples. This book can also serve as a
handbook for biological data analysis using standard Python code templates for model building,
supported by supplementary files for each chapter. The text is made as interactive as possible,
with accompanying Jupyter Notebooks for every section to help readers practice the code on their local
systems. Each chapter is specially designed with examples.
The book is divided into two sections. The first section deals with an introduction to basic Python
programming and a hands-on tutorial for data handling. Chapters in this section elaborate on the usage
of some of the basic Python libraries and packages. One of the important libraries for life sciences data -
Biopython - is explained in this section with examples of reading and writing various biological file
formats, performing Pairwise and Multiple Sequence Alignments, handling protein and sequence data,
etc. The subsequent chapters elaborate on data handling using NumPy and Pandas, data visualization
techniques, and dimensionality reduction methods that are common to all data analyses, and also provide
illustrative examples for biological data.
Machine Learning is an integral part of many research projects today and has numerous applications.
Almost all disciplines of technology have been transformed by Machine
Learning and artificial neural networks, and life sciences are no exception, with Machine Learning applications in
fields ranging from agriculture to diagnostics to personalized medicine to drug development to biological
imaging - and the list keeps growing. The second section of the book deals with the Python implementation of
Machine Learning algorithms. Chapters in this section contain an introduction to Machine Learning to
make readers comfortable with the various terminologies used in Machine Learning. This section also
explores popular supervised and unsupervised Machine Learning algorithms - such as logistic regression,
k-nearest neighbors, decision trees, random forests, support vector machines, artificial neural networks,
convolutional neural networks, natural language processing, and k-means clustering - and shows their
implementation in Python.
The book is written considering the need for biologists to learn programming in light of handling
massive data, analyzing it, and deriving useful insights from it. I hope our readers will benefit from this
hands-on book on data science for biologists using Python.
Author Bio
Dr. Yasha Hasija (B.Tech, M.Tech, Ph.D.) is an Associate Professor at the Department of
Biotechnology and the Associate Dean of Alumni Affairs at the Delhi Technological University. Her
research interests include genome informatics, genome annotation, microbial informatics, integration of
genome-scale data for systems biology, and personalized genomics. Several of her works have been
published in international journals of high repute, and she has made noteworthy contributions in the area
of biotechnology and bioinformatics as author and editor of notable books. Her expertise, through her
book chapters and conference papers, is of significance to other academic scholarship and teaching. She
is also on the editorial boards of numerous international journals.
Dr. Hasija’s work has brought her recognition and several prestigious awards - including the Human
Gene Nomenclature Award at the Human Genome Meeting (2010) held in Montpellier, France. She is
the project investigator for several research projects sponsored by the Government of India - including
DST-SERB, CSIR-OSDD, and DBT. As Dr. Hasija continues conducting research, her passion for
finding the translational implications of her findings grows.
Mr. Rajkumar Chakraborty (B.Tech, M.Tech) received his Bachelor of Technology degree in
Biotechnology from the Bengal College of Engineering and Technology, West Bengal, India and
completed his Master of Technology degree in Bioinformatics at the Delhi Technological
University, Delhi, India. He is currently pursuing his Ph.D. in the field of bioinformatics. He was
part of the 4-member team that won the “Promising Innovative Implementable Idea Award” at the
SAMHAR-COVID19 Hackathon 2020 for a solution towards drug repurposing against
COVID-19. His research interests are in applied Machine Learning and the integration of big data in
biological science.
1
Python: Introduction and Environment Setup
Programming skills are a valuable asset for any biologist. Many programming languages have been
developed. Some are designed for specific tasks, such as instantaneous computation, website creation, or
database generation, and some are general-purpose programming languages that
were developed to be used in a variety of application domains. Python is one example of a
general-purpose programming language. Guido van Rossum developed it as a hobby project in the Netherlands
around 30 years ago and named it after the famous British comedy series “Monty Python’s
Flying Circus”. Now, Python has applications in various domains like data science, web development,
data visualization, and desktop applications, to name a few. Python is one of the most popular
programming languages in the data science and Machine Learning area, and it is community-driven.
Since it has a gentle learning curve, many experts recommend it to beginners as their
first programming language. Above all, Python has a simple, readable, English-like syntax
which is easily understood by users. For example, if one wants to find the proportion of the
amino acid Leucine (symbol “L”) in a protein sequence, the following Python
code will do that:
protein = 'MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERY'
leu_content = protein.count('L') / len(protein)
print(leu_content)
The code reads very much like the English language. The first line stores the protein sequence. The
second line calculates the proportion of Leucine residues (denoted by the letter “L”) by counting the number of times
“L” appears in the sequence and then dividing that count by the total length of the sequence. The last line
prints the value, which turns out to be approximately 0.109 (5 Leucine residues out of 46).
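The same counting idea can be extended to every residue at once. The following is only a minimal sketch: it assumes the standard-library `collections.Counter`, and the helper name `residue_fractions` is hypothetical (it does not come from the book).

```python
from collections import Counter


def residue_fractions(sequence):
    """Return a dict mapping each residue letter to its fraction of the sequence."""
    counts = Counter(sequence)   # e.g. {'L': 5, 'M': 1, ...}
    total = len(sequence)
    return {residue: count / total for residue, count in counts.items()}


protein = 'MKLFWLLFTIGFCWAQYSSNTQQGRTSIVHLFEWRWVDIALECERY'
fractions = residue_fractions(protein)

# The Leucine fraction matches protein.count('L') / len(protein)
print(fractions['L'])
```

Because all fractions are computed in one pass, the same dictionary can be queried for any other amino acid without counting again.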
Thanks to the readability of Python code, learners can concentrate on the concepts of programming
and on the problems themselves rather than on learning the syntax of the language. Python is community-driven and has
one of the largest communities, so it has evolved to include many important libraries that are either
pre-installed or freely available to install. These libraries help in the quick and efficient development of
complex applications, because common functionality does not need to be written from scratch.
Another advantage of learning Python is that it can be used for various purposes thanks to the
development of popular libraries, such as:
• Frameworks like Django, Flask, and Pylons are used for creating static and dynamic websites.
• Libraries like Pandas, NumPy, and Matplotlib are available for data science and visualization.
• Scikit-Learn and TensorFlow are advanced libraries for Machine Learning and deep learning.
• Desktop applications can be built using packages like PyQt, PyGObject (GTK), and wxPython, among others.
• Modules like BeeWare or Kivy are taking the lead in mobile applications.
Learning programming is much like learning a new language; we first have to understand the
vocabulary and syntax. Next, we learn how to construct some meaningful but terse sentences.
Using those sentences, we then form paragraphs, and finally, we write our own story. In this book,
we will start with Python syntax and vocabulary. Then, we will construct small programs with
biological relevance to help biologists learn programming with problems that are important
to them.
Installing Python
We are using Python 3.7, which was the current stable version of Python at the time of writing. Most
operating systems either already have Python installed by default, or it can be downloaded from the Python
Software Foundation’s website (https://www.python.org/), where it is freely available. After installing
Python, open the Python Shell in Windows or type “python3” in the terminal of Mac or Linux, and
then enter a first instruction at the prompt.
Our first instruction is simple - to print “Welcome to Python”. If it runs correctly, then Python has
been successfully installed and we are all set and ready to go!
will appear. Under the notebook section, choose “Python 3” to create a Python 3-compatible Jupyter
Notebook (Figure 1.2).
As indicated here, the current name of the notebook is “untitled”. To rename it, click on the “untitled”
text itself. The cells will run Python 3 code, as Python 3 was selected as the kernel. Try writing the
same code that we typed earlier in the Python terminal:
print('Welcome to Python')
Click on the “Run” button above or press “Shift” + “Enter” to execute the code.
The following output should appear in the notebook:
Welcome to Python
If a Notebook has several cells and the user runs them in order, the variables, imports, and functions
defined in one cell remain available in the cells that are run afterwards. This makes it simple to split the
code into logical pieces without needing to reimport libraries, recreate variables, or redefine
functions in each cell.
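As a rough illustration of this sharing, the sketch below marks two hypothetical cells with comments; in a real Notebook, each part would be typed into its own cell and run in order. The variable names are assumptions chosen for the example.

```python
# Cell 1: import a library and define a variable once
import math

gene_length = 1300

# Cell 2: the import and the variable from Cell 1 are still available here
log_length = math.log10(gene_length)
print(round(log_length, 2))
```

If the kernel is restarted, this shared state is lost and Cell 1 has to be run again before Cell 2.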
The Jupyter Notebook has a few menus that the user can utilize to connect with their Notebook. The
menus are as follows:
• File
The File menu is used to create new notebooks, save notebooks, and open previously saved
notebooks. A Jupyter Notebook is typically saved in the “.ipynb” format, but the user can also save it
in other formats by using the “Download as” option. Options for saving checkpoints are also
available in this menu.
• Edit
The Edit menu consists of typical editing options like cut, copy, paste, merge cells, and others.
• View
The View menu is useful in toggling the header and the toolbar.
• Insert
The Insert menu is used for inserting cells below and above the current cell.
• Cell
The Cell menu contains options for running cells and changing the cell type.
• Kernel
The Kernel option is mostly used in debugging to interrupt and to restart the Python 3 kernel.
• Widgets
JavaScript widgets can be added to our cells to create dynamic content using Python. This menu is
for saving and clearing the widget state.
• Help
The Help menu is used for learning about Jupyter Notebook, its documentation, shortcuts, etc.
We can also add rich content to a Jupyter Notebook by changing the cell type to “Markdown” using the
“Cell” menu. Markdown is a lightweight markup language that also accepts HTML, and it is used
for styling text, inserting math equations, etc. To learn more about Jupyter Notebook, the user can
always refer to the documentation.
Errors in Python
Python gives rather detailed error messages that pinpoint the statement, and the library, in which the
error occurred. Understanding and correcting errors is sometimes bothersome, but this process hones one into a
successful programmer. There are different types of errors. Some are detected by Python, which
reports them as error messages or warnings; others go unnoticed by Python, so the program
runs but produces unexpected results. Three major types of errors are discussed here:
Syntax Errors: These are the simplest errors to understand and correct. They usually
happen when the user breaks the grammar rules of Python, so that Python cannot make sense of the
statement. Python will tell the user the line and the position at which it got confused
and ask the user to correct it. For learners, this is the most common error message.
Structuring the statements correctly is an essential requirement for proper execution.
Logic Errors: These are errors that Python cannot detect; the program runs but produces
unexpected results. They happen when the user’s statements are grammatically correct but do not
mean what was intended. Logic errors are bugs, and the debugging process will help here. The user
must look through all the steps to find the bug.
Semantic Errors: These mistakes happen when the user gives a grammatically correct
statement in the proper order, but the statement asks for something that cannot be carried out. For example,
the user may try to add a number to a string or subtract a string from a number. This kind of operation is
not possible, and Python will raise an error (a “TypeError” in this case) pointing out the offending
operation or statement.
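The three kinds of errors described above can be sketched in a few lines. This is an illustrative example only: the variable names are assumptions, and the exact wording of Python's messages varies between versions, so the messages are printed rather than quoted here.

```python
caught = []  # names of the errors Python reported

# 1. Syntax error: the statement breaks Python's grammar (missing closing parenthesis).
#    compile() is used so that the broken statement can be shown inside a working script.
try:
    compile("print('Welcome to Python'", "<example>", "exec")
except SyntaxError as error:
    caught.append(type(error).__name__)
    print("Syntax error:", error.msg)

# 2. An impossible operation: adding a number to a string raises a TypeError.
try:
    result = 4 + "ATGC"
except TypeError as error:
    caught.append(type(error).__name__)
    print("Type error:", error)

# 3. Logic error: Python runs this without complaint, but the formula is wrong.
gene_length = 1300
intron_length = 350
wrong_mrna_length = gene_length + intron_length  # should have been a subtraction
print(wrong_mrna_length)  # no error message, just an unintended value
```

Only the first two cases produce a message; the third must be found by checking the result against what was expected.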
Readers will encounter a lot of errors, and correcting these requires the skill of asking the questions
discussed above.
With this prerequisite knowledge and environment setup, we are now ready to take a deep dive into
the exciting world of the Python language and start writing our programs. In this book, the code is
explained step by step so that it is understandable and applicable in solving
individual problems. Data science techniques sometimes require a great depth of mathematical and
statistical understanding, which is beyond the scope of this book. However, we will provide the
necessary intuitive explanation in every section. We hope that this wealth of knowledge helps
readers understand and appreciate the usefulness of programming for a biologist, the features of the Python
language, and the combination of Python and data science for biologists, and ultimately discover a fun
way of learning all of this.
Exercise
>sp|Q9SE35|20-107
QSIADLAAANLSTEDSKSAQLISADSSDDASDSSVESVDAASSDVSGSSVESVDVSGSSL
ESVDVSGSSLESVDDSSEDSEEEELRIL
In this chapter, we will go through a basic overview of Python programming, which is
a prerequisite for any form of data analysis. Variables and operators, strings, lists and tuples, dictionaries,
conditions, loops, functions, and objects are some of the topics covered in this chapter.
Let us begin with a simple piece of Python syntax, which is used for commenting a statement. If a
statement or a line begins with “#”, then Python will ignore it. Comments are useful for making code
self-explanatory. We will use comments wherever required to make the code more
understandable to readers.
Code:
# Let's print 'Hi there!'
print('Hi there!')
Output:
Hi there!
After executing the above code in the Jupyter Notebook by pressing “Ctrl” + “Enter”, the first statement
or line starting with “#” will be ignored, and as a result, we will see only “Hi there!”. The first line
is thus a comment which describes what the code does.
There are two important ways in which Python represents a number: int and float. Integers (int) are whole
numbers, like 1, 3, -4, 0, etc., whereas floats are decimal numbers, such as 1.0, 3.14, -2.33, etc.
Think of it this way: between 0 and 1 there are only two whole numbers, 0 and 1, but there are
infinitely many decimal numbers between 0.0 and 1.0, so a float can only store a value to a limited precision.
Next, we have the Boolean datatype, which is either “True” or “False”; these values are used
in making conditions, which we will learn about shortly. Lastly, “str”, or the string datatype, is the datatype that
biologists will need and encounter the most - whether it is DNA, RNA, or protein sequences or
names, most of them are text, or strings. Therefore, we have a separate section for strings in this chapter.
It is important to mention here that string-type data always remains inside quotes, i.e. (‘<string data>’).
For example, ‘ATGAATGC’ will be a string for Python.
To know the datatype of values, we can write “type(<value>)” to get its datatype in the Jupyter
coding cell.
Code:
print(type(4)) # integer, or a whole number
print(type(4.0)) # floating point, or decimal number
print(type(True)) # boolean, or a True/False
print(type('ATGAATGC')) # string, or 'a piece of text'
Output:
<class 'int'>
<class 'float'>
<class 'bool'>
<class 'str'>
In the output, we can see that 4 is of the “int” type, whereas 4.0 is of the “float” type. For now, ignore the
word “class”. We will learn about this in the succeeding parts of this book. The key takeaways here are
the datatypes and the method for identifying the datatype of a value in Python.
Code:
# Addition
print(4 + 6) #Integer + Integer
print(4 + 6.0) #Integer + Float
# Subtraction
print(6-3) #Integer - Integer
print(6-3.0) #Integer - Float
# Multiplication
print(2 * 5) #Integer * Integer
print(2 * 5.0) #Integer * Float
# Division
print(24/3) #Integer / Integer
# Power
print(2**8) #Integer ** Integer
print(2**8.0) #Integer ** Float
# '%' is the modulo (remainder) operator; it gives the remainder left over
# when the first number is divided by the second.
print(8 % 3) #Integer % Integer
Output:
10
10.0
3
3.0
10
10.0
8.0
256
256.0
2
Basic Python Programming
Operators
In this section, we will discuss some of the standard operators in Python. We are already familiar with
some of the operators, like “+”, “-”, “*”, “/”, “=”, and “**”.
TABLE 2.1
Some Common Operators in Python.
Symbol    Name
+         Addition
-         Subtraction
*         Multiplication
/         Division
**        Power
%         Modulo
=         Assignment
Operations with an integer and a float always return float-type results, and operations with two
integers return integers, except for division with “/”, which always returns a float. We can obtain
an integer result from dividing two integers by using the floor (integer) division operator “//”, which
rounds the quotient down to the nearest whole number.
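The difference between the two division operators is easiest to see side by side. The sketch below uses arbitrary example numbers; the variable names are assumptions made for illustration.

```python
true_division = 24 / 3        # "/" always returns a float: 8.0
floor_division = 24 // 3      # "//" with two integers returns an integer: 8

uneven_true = 7 / 2           # 3.5
uneven_floor = 7 // 2         # 3 (the fractional part is dropped)
negative_floor = -7 // 2      # -4 (floor division rounds down, not towards zero)

print(true_division, type(true_division))
print(floor_division, type(floor_division))
print(uneven_true, uneven_floor, negative_floor)
```

Note that "//" rounds towards negative infinity, which matters when negative numbers are involved.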
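Python evaluates arithmetic operators in the following order of precedence, commonly remembered as “PEMDAS”:
1. Parentheses ()
2. Exponent **
3. Multiplication *
4. Division / // %
5. Addition +
6. Subtraction -
Multiplication and the division operators share the same precedence, as do addition and subtraction;
operators of equal precedence are evaluated from left to right. For example, try to evaluate “2 + 5*4/2”.
According to “PEMDAS”, first calculate “5*4 = 20”, then “20/2 = 10.0”, and lastly “2 + 10.0 = 12.0”. To
override this order, use parentheses, just as when solving equations with pen and
paper.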
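The order of evaluation can be checked directly in a Notebook cell. The following sketch reuses the expression from the text; the variable names are hypothetical.

```python
# 5*4 = 20, then 20/2 = 10.0, then 2 + 10.0 = 12.0
without_parentheses = 2 + 5 * 4 / 2

# Parentheses are evaluated first: (2 + 5) = 7, then 7*4 = 28, then 28/2 = 14.0
with_parentheses = (2 + 5) * 4 / 2

# Equal precedence is resolved left to right: (24 / 4) * 2, not 24 / (4 * 2)
left_to_right = 24 / 4 * 2

# The exponent binds more tightly than multiplication: 3**2 = 9, then 2*9 = 18
power_first = 2 * 3 ** 2

print(without_parentheses, with_parentheses, left_to_right, power_first)
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit to the reader.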
Variables
Variables in Python are like the variables of algebra in mathematics. We can think of a variable as a box
with a name on it that can hold a value of any datatype. A variable takes on all the properties of
the value stored inside it. Variables consist of two parts: the name and the value. We assign a name to
a value by using the assignment operator “=”. The name is on the left side, and the value is on the
right side.
Code:
length_of_gene = 1300
print (length_of_gene)
Output:
1300
Once we assign a variable, we can recall it by name. In the example below, the variables
“length_of_gene” and “length_of_introns” are assigned and then used for finding the mRNA length,
which is stored in another variable called “length_of_mrna”.
Code:
length_of_gene = 1300
length_of_introns = 350
length_of_mrna = length_of_gene - length_of_introns
print(length_of_mrna)
Output:
950
From this point forward, we will use these variables in other programs.
Variables make our programs easier to read, and they are reusable. For example, if the user
has to use a long protein or nucleotide sequence, it would not be wise to type it out every time.
Instead, we can assign it to a variable and reuse the variable whenever it is required. Variables can be
assigned to other variables and reassigned to different values at any time.
Let us explain this in code:
Code:
some_var = 100
another_var = some_var
some_var = 300
length_of_gene, length_of_introns = 1300, 350
In the code, “another_var” is assigned the value of “some_var”, and in the next line “some_var” is
reassigned to another value; “another_var” still holds 100. When a new value is assigned to a variable,
the old value is forgotten and cannot be retrieved through that variable. A variable can also be reassigned
to a value of a different datatype. For example, a variable containing an integer can be reassigned to a string
and vice versa. Many other programming languages do not allow this. In the last statement,
two variables are assigned values in the same statement, a convenience that not every
programming language offers. Last but not least, variable names
are case sensitive - for example, a variable named “protein_id” cannot be referred to as “Protein_ID” or
“PROTEIN_ID”.
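These properties of variables can be tried out in a few lines. The sketch below is illustrative only; the names and values (such as 'gene_A' and 'P00001') are made up for the example.

```python
# Reassigning a variable to a value of a different datatype
sample = 1300
first_type = type(sample).__name__     # 'int'
sample = 'ATGAATGC'
second_type = type(sample).__name__    # 'str'
print(first_type, second_type)

# Assigning two variables in a single statement
gene_name, gene_length = 'gene_A', 1300
print(gene_name, gene_length)

# Variable names are case sensitive: Protein_ID was never defined
protein_id = 'P00001'
name_error_raised = False
try:
    print(Protein_ID)
except NameError as error:
    name_error_raised = True
    print('NameError:', error)
```

The NameError in the last step shows that Python treats "protein_id" and "Protein_ID" as two unrelated names.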
TABLE 2.2
Keywords in Python
False     else       import    pass       yield
None      break      except    in         raise
True      class      finally   is         return
and       continue   for       lambda     try
as        def        from      nonlocal   while
assert    del        global    not        with
          elif       if        or
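Keywords such as those in Table 2.2 are reserved by Python and cannot be used as variable names. As a small sketch, the standard-library `keyword` module can be used to check a candidate name; the exact list of keywords depends on the Python version, so the full list is printed rather than assumed.

```python
import keyword

# True for reserved words, False for names that are free to use
class_is_reserved = keyword.iskeyword('class')
protein_id_is_reserved = keyword.iskeyword('protein_id')
print(class_is_reserved, protein_id_is_reserved)

# The complete list of keywords for the Python version in use
print(keyword.kwlist)
```

Checking a name this way is handy when a variable such as "lambda" or "class" seems natural for the data but is rejected by Python.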
Most Python programmers name their variables according to the following guidelines:
• Most variables should be in “snake_case”, which means there is an underscore between words.
• Most variables are in lowercase, except for constants.
• CamelCase is conventionally used for class names, while function names normally follow “snake_case”.
Please note that we have a dedicated section for classes and functions in this chapter, so just
remember this part for now.
Strings
For computer programmers, strings are collections of characters or, more commonly, any text. In
bioinformatics studies, handling strings is very common - for example, when reading sequence files, finding patterns in
sequences, mining data from text, processing data from various file formats, etc. By enclosing a sequence
of characters between a pair of single quotes, double quotes, triple-single quotes, or triple-double quotes, a
string object can be constructed in Python. While characters enclosed between single or double quotes can
only span a single line, characters between triple-single or triple-double quotes can span multiple lines.
Let us take a look at the following example:
Code:
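# A string within a pair of single quotes
seq_1 = 'ATGCGTCA'
print(seq_1)
print('---------')
# A string within a pair of double quotes
seq_2 = "ATGCGTCA"
print(seq_2)
print('---------')
# A string within a pair of triple single quotes
seq_3 = '''ATGCGTCA'''
print(seq_3)
print('---------')
# A string within a pair of triple double quotes
seq_4 = """ATGCGTCA"""
print(seq_4)
print('---------')
# A string within a pair of triple single quotes, can have multiple lines
seq_5 = '''MALNSGSPPA
IGPYYENHGY'''
print(seq_5)
print('---------')
# A string within a pair of triple double quotes, can have multiple lines
seq_6 = """IGPYYENHGY
IGPYYENHGY"""
print(seq_6)
Output:
ATGCGTCA
---------
ATGCGTCA
---------
ATGCGTCA
---------
ATGCGTCA
---------
MALNSGSPPA
IGPYYENHGY
---------
IGPYYENHGY
IGPYYENHGY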
The characters should be enclosed within the same type of quote - usually single or double quotes -
for defining a string datatype.
Escape Sequence    Meaning
\\                 Backslash (\)
\'                 Single quote (')
\"                 Double quote (")
\a                 ASCII Bell (BEL)
\b                 ASCII Backspace (BS)
\f                 ASCII Formfeed (FF)
\n                 ASCII Linefeed (LF)
\r                 ASCII Carriage Return (CR)
\t                 ASCII Horizontal Tab (TAB)
\v                 ASCII Vertical Tab (VT)
\ooo               ASCII character with octal value ooo
\xhh…              ASCII character with hex value hh…
Although most of these are not commonly used, we will try out some of the examples.
Code:
# Escape Sequence Characters
print('Hey Ashok, "How\'re you?"') #escaping single quotes
print('---------')
print('First line\nSecond line') #escaping new line
print('---------')
print('\\') #escaping Backslash
Output:
Hey Ashok, "How're you?"
---------
First line
Second line
---------
\
In this example, applications of escape characters are shown. They are mostly used when writing text
files using Python.
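To make the link with text files concrete, here is a minimal sketch that builds a small tab-separated table using "\t" and "\n". It writes to an in-memory buffer (the standard-library `io.StringIO`) instead of a real file so that nothing is created on disk; the column names and values are hypothetical.

```python
import io

buffer = io.StringIO()                 # behaves like a text file opened for writing
buffer.write('gene\tlength\n')         # header row: columns separated by a tab
buffer.write('gene_A\t1300\n')         # data row, ended by a newline

table_text = buffer.getvalue()
print(table_text)                      # the tab and newline are rendered as whitespace
print(repr(table_text))                # repr() shows the escape sequences explicitly

rows = table_text.splitlines()         # one string per line, without the '\n'
print(rows)
```

Writing to an actual file works the same way once the buffer is replaced by a file object opened in write mode.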
String Indexing:
A string is a collection or sequence of characters, so it is possible in Python to grab single characters as
well as a part of the text by using their indexes. For grabbing the character, we have to place the index
number inside the square bracket pair after the string name.
Given below is an example of String Indexing in the DNA sequence “ATGCGTCA” to print the
second nucleotide.
It may be noted that the index of any string starts with 0, starting with the leftmost character -
meaning that the index of first nucleotide “A” is 0, that of the second nucleotide “T” is 1, and so on. In
backward indexing, the indexing starts with -1 from the rightmost character, meaning the backward
index of last nucleotide “A” is -1, that of second last nucleotide “C” is -2, and so on.
Code:
dna_seq = 'ATGCGTCA'
print(dna_seq[1])
Output:
T
In the output we got “T”, but the first nucleotide was “A”. It is because, unlike our customary
practice, Python counting starts with zero.
Another example of character index for Python is shown in the figure below, where the first row is the
sequence; the second row is the forward index of nucleotides, and the third row shows the backward
index (Figure 2.1):
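Below is the code for extracting the first and the second-to-last characters of the string:
Code:
# Extracting the first nucleotide
dna_seq = 'ATGCGTCA'
print(dna_seq[0])
print('---------')
# Extracting the second-to-last nucleotide with a backward index
print(dna_seq[-2])
Output:
A
---------
C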
To grab a part of a text or string, the notation used is “string_name[start:end]”, where “start” is the
starting index and “end” is the index up to which the slice extends, not including the character at “end” itself.
• dna_seq[3:6] is “CGT” - characters starting at index 3 and extending up to but not including
index 6
• dna_seq[3:] is “CGTCA” - leaving either index blank defaults to the start or the end of the
string
• dna_seq[:] is “ATGCGTCA” - leaving both fields empty always produces a copy of the whole string
• dna_seq[1:100] is “TGCGTCA” - an index that is too big is truncated to the string length
• dna_seq[:-4] is “ATGC” - selecting up to but not including the last four characters
• dna_seq[-4:] is “GTCA” - starting with the fourth character from the right end and going to the right end
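The slices listed above can be verified in one go. This short sketch collects each slice in a dictionary keyed by the slice notation; the dictionary itself is just a convenience for the example.

```python
dna_seq = 'ATGCGTCA'

slices = {
    '[3:6]': dna_seq[3:6],      # index 3 up to, but not including, index 6
    '[3:]': dna_seq[3:],        # index 3 to the end of the string
    '[:]': dna_seq[:],          # a copy of the whole string
    '[1:5]': dna_seq[1:5],      # index 1 up to, but not including, index 5
    '[1:100]': dna_seq[1:100],  # an end index beyond the string is truncated
    '[:-4]': dna_seq[:-4],      # everything except the last four characters
    '[-4:]': dna_seq[-4:],      # the last four characters
}

for notation, result in slices.items():
    print('dna_seq' + notation, '->', result)
```

Unlike single-character indexing, slicing never raises an error for out-of-range positions; it simply returns what is available.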
String Concatenation
There are a few ways to concatenate, or join, strings. The easiest and most common way is to
add them together using the plus symbol (+).
Code:
#String concatenation
dna_1 = 'ATGCGTCA'
dna_2 = 'ACTGCGTC'
full_dna = dna_1 + dna_2
print('The sequence of DNA is ' + full_dna + '.')
Output:
The sequence of DNA is ATGCGTCAACTGCGTC.
We can join any number of strings using the “+” operator. An important thing to note here is that
every operand must be a string. For example, if we add a string and an integer, as in
“ACTGCGTC” + 4, Python raises a TypeError stating that “str” and “int” cannot be
concatenated. To append a number, we first convert it to the “str” type with the str(number) function.
While we cannot add an integer to a string, we can repeat a string by using the “*”
operator with an “int” value. For example, “ACTGCGTC”*2 doubles the string into
“ACTGCGTCACTGCGTC”.
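The behavior described above can be reproduced with a small sketch. The variable names dna, copies, label, and repeated are our own and serve only as an illustration.

```python
dna = 'ACTGCGTC'
copies = 4

try:
    label = dna + copies        # str + int is not allowed
except TypeError as err:
    print('TypeError:', err)

label = dna + str(copies)       # convert the number to a string first
repeated = dna * 2              # repetition with an int is allowed
print(label)
print(repeated)
```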
Basic Python Programming 15
Commands in Strings
Various commands are available to make the desired modifications in strings or to carry out analyses. We
will discuss some of the most common methods in this section. Remember that these methods do not
modify the string itself but, rather, produce a new string, because the string is an immutable datatype.
Let us return to the Jupyter Notebook and try out the following code:
Code:
#Converting a string into lowercase letters
dna_seq = 'ATGCGTCA'
print(dna_seq.lower())
print('---------')
print(dna_seq)
Output:
atgcgtca
---------
ATGCGTCA
In the example above, the lower() method is used. It converts the string to lowercase letters. We can
also observe that the original variable “dna_seq” is unchanged after the lower() method is applied to it.
In the same way, the str.upper() method converts the string to uppercase letters.
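As a brief illustration of upper() and of string immutability, consider the sketch below. The variable name upper_seq is our own; the final step shows one common way to obtain a changed string, namely building a new one and rebinding the name.

```python
dna_seq = 'atgcgtca'
upper_seq = dna_seq.upper()   # returns a new string
print(upper_seq)              # ATGCGTCA
print(dna_seq)                # atgcgtca, the original is unchanged

try:
    dna_seq[0] = 'G'          # strings do not support item assignment
except TypeError as err:
    print('TypeError:', err)

# To "change" a string, build a new one and rebind the variable name
dna_seq = 'G' + dna_seq[1:]
print(dna_seq)                # Gtgcgtca
```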
A few more commands for analyzing strings include count(), find(), and len(). Their usage is described below:
Code:
dna_seq = 'ATGCGTCA'
# str.count() counts all the occurrences of the selected string in the parent string
print(dna_seq.count('A'))
# str.find() returns the index of the first occurrence of the selected string
print(dna_seq.find('GT'))
# len() returns the length of the string
print(len(dna_seq))
Output:
2
4
8
In the above examples, “len()” is a built-in function rather than a string method, and it returns the
length of the string. Another important method is str.split(), which is very frequently used for
extracting data from delimited text file formats such as CSV and TSV. CSV stands for comma-separated
values, where the values of each column are separated by a comma delimiter, and TSV stands for
tab-separated values, where the values of each column are separated by a tab delimiter.
Figure 2.2 is an example of a CSV-formatted file, where the first row is known as the header row and
consists of column names, and the rest of the rows are instances that have values separated by a comma for
each column. We can extract the values of each row by treating the row as a string and applying the str.split() method:
Code:
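# str.split()
first_row = '6,148,72,35,0,33.6,0.627,50,1'
# Unpack the nine values into variables named after the columns;
# the parentheses let the assignment span several lines
(Pregnancies, Glucose, BloodPressure,
 SkinThickness, Insulin, BMI, DiabetesPedigreeFunction,
 Age, Outcome) = first_row.split(',')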
print(Glucose)
print(BloodPressure)
print(Insulin)
print('---------')
print(first_row.split(','))
Output:
148
72
0
---------
['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
Let us study the above code line by line. First, we assign the first observation (i.e., the second row of the
CSV file shown in Figure 2.2) to a variable named “first_row”. Second, we use Python's multiple-assignment
feature to create one variable per column and give each the corresponding value from the first observation.
Here the split(‘,’) method takes a string and returns a list of the values that were separated by
commas. We can then print the variables named after the columns in the header of the CSV file. The last
line of the output is the list of split values; note that every value is still a string. The list is a particular
datatype in Python, which we are going to discuss in the next section. We are barely scratching the surface,
and it should be noted that there are other useful methods available for the string datatype. Readers can
always refer to the Python documentation to find all of the methods available for strings.
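Because split() returns strings, numeric work requires a conversion step. The sketch below is illustrative only: the header string is an assumption reconstructed from the variable names used above (the actual header in Figure 2.2 may differ), and it uses a list comprehension and a dictionary, which go beyond what this section has introduced.

```python
# Assumed header, reconstructed from the variable names used in the text
header = ('Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,'
          'BMI,DiabetesPedigreeFunction,Age,Outcome')
first_row = '6,148,72,35,0,33.6,0.627,50,1'

# split() returns strings, so convert each value to a number
values = [float(v) for v in first_row.split(',')]
# Pair each column name with its value
record = dict(zip(header.split(','), values))

print(record['Glucose'])   # 148.0
print(record['BMI'])       # 33.6

# The same approach works for TSV data by splitting on the tab character
tsv_row = first_row.replace(',', '\t')
print(tsv_row.split('\t') == first_row.split(','))   # True
```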
Lists
While strings are a collection or sequence of characters, lists are a series of values that resemble
arrays in other programming languages but are comparatively more flexible. Values in lists are known
as items or elements. Some essential features of lists are:
• Lists are ordered. A list preserves the order in which items were inserted, and the items can be
retrieved in that order later.
• Objects in a list can be accessed with an index.
• Lists can contain any entity - numbers, strings, tuples, and even other lists.
• Lists are mutable, i.e., they can be modified. New items can be added, and existing items can be
removed or revised.
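The features listed above can be seen in a short sketch. The variable names drugs and mixed are our own, and the methods shown (append and remove) are only two of the list methods Python provides.

```python
drugs = ['Metformin', 'Acarbose']

drugs.append('Canagliflozin')     # add a new item at the end
drugs[1] = 'Dapagliflozin'        # revise an existing item by its index
drugs.remove('Metformin')         # remove an item by its value
print(drugs)                      # ['Dapagliflozin', 'Canagliflozin']

# A list can mix numbers, strings, tuples, and even other lists
mixed = [1, 'ATGC', 4.5, (2, 3), ['nested', 'list']]
print(mixed[4][0])                # nested
print(mixed[-1] is mixed[4])      # True
```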
There are different ways to build a new list. The simplest is to enclose the elements in square brackets
(‘[’ and ‘]’) and separate them with commas.
Code:
# Creating an empty list (named my_list to avoid shadowing Python's built-in list type)
my_list = []
# Assigning a list with values to the same name
my_list = [1, 2, 3, 4, 5]
# Printing the list
print(my_list)
Output:
[1, 2, 3, 4, 5]
Lists can hold objects of any datatype and can be assigned to any variable.
Code:
# Adding values irrespective of their datatype: integer, string, float
my_list = [1, 2, 3, 'Metformin', 4.0, 4/2]
print(my_list)
Output:
[1, 2, 3, 'Metformin', 4.0, 2.0]
Code:
# Creating a list called drug_name
drug_name = ['Metformin', 'Acarbose', 'Canagliflozin', 'Dapagliflozin']
print(drug_name)
Output:
['Metformin', 'Acarbose', 'Canagliflozin', 'Dapagliflozin']
The forward indices of the four elements are 0, 1, 2, 3, and the corresponding backward indices are
-4, -3, -2, -1.
Code:
# Accessing the elements in the list
print(drug_name[0]) # Metformin
vexations—the dark weather of life, that beset even such a humble
career as mine.
So much for the introduction—and now to business.
The following letter is very welcome. Can Harriet venture to tell
us who the author of this capital riddle really is?
Newport, March 28, 1842.
Friend Merry:
In looking over, a few days since, some old papers
belonging to my father, I found the following riddle. My father
informs me that it was written many years ago, by a school-
boy of his, then about fifteen years old, and who now
occupies a prominent place in the literary and scientific world.
If you think it will serve to amuse your many black-eyed and
blue-eyed readers, you will, by giving it a place in the
Museum, much oblige a blue-eyed subscriber to, and a
constant reader of, your valuable and interesting Magazine.
Harriet.
riddle.
If you think the above worthy a place, you can publish it.
You may hear from me again soon. My sheet is full, so I have
but to subscribe myself,
Very respectfully,
W. F. W.
Dear Sir:
My little daughter has handed me the following puzzle to
send to you for your next number, which please insert, and
oblige
A Subscriber.
I offer my best thanks for the letters from the following friends:
“One of your blue-eyed readers in New York;” “A little subscriber in
Canandaigua,” whom I shall always be happy to hear from; E. D. H
——s, of Saugus; C. W., of Millbury; C. A. S. and L. B. S., of
Sandwich; L. W——e, and W. B. W——e; and “A Subscriber.”
S. L.’s letter about the postage, dated Utica, April 22, was duly
received.
H. E. M. thinks that Puzzle No. 5, in the April number, is either a
hoax, or that the solution is Nantucket. We think it is a little of both:
that is, that our friend who sent it to us intended it for Nantucket;
but about that time it was “all fools day,” and the unlucky types of
the printer seem to have made a very good puzzle, as sent to us,
into “an April fool.”
ROBERT MERRY’S
MUSEUM.
edited by
S. G. GOODRICH,
VOLUME IV.
BOSTON:
BRADBURY, SODEN, & CO.,
No. 10 School Street, and 127 Nassau Street, New York. 1842.
CONTENTS OF VOLUME IV.
JULY TO DECEMBER, 1842.
Entered, according to Act of Congress, in the year 1842, by S. G. Goodrich, in the Clerk’s
Office of the
District Court of Massachusetts.
KNIGHTS TEMPLARS.
MERRY’S MUSEUM.
V O L U M E I V . — N o . 1 .
chapter ix.
Limby Lumpy was the only son of his mamma. His father was called
the “pavier’s assistant;” for he was so large and heavy, that, when
he used to walk through the streets, the men who were ramming
the stones down, with a large wooden rammer, would say, “Please to
walk over these stones, sir.” And then the men would get a rest.
Limby was born on the 1st of April; I do not know how long ago;
but, before he came into the world, such preparations were made!
There was a beautiful cradle; and a bunch of coral, with bells on it;
and lots of little caps; and a fine satin hat; and nice porringers for
pap; and two nurses to take care of him. He was, too, to have a little
chaise, when he grew big enough; after that, he was to have a
donkey, and then a pony. In short, he was to have the moon for a
plaything, if it could be got; and as to the stars, he would have had
them, if they had not been too high to reach.
Limby made a rare to do when he was a little baby. But he never
was a little baby—he was always a big baby; nay, he was a big baby
till the day of his death.
“Baby Big,” his mamma used to call him; he was “a noble baby,”
said his aunt; he was “a sweet baby,” said old Mrs. Tomkins, the
nurse; he was “a dear baby,” said his papa,—and so he was, for he
cost a good deal. He was “a darling baby,” said his aunt, by the
mother’s side; “there never was such a fine child,” said everybody,
before the parents; when they were at another place, they called
him “a great, ugly, fat child.”
We call it polite in this world to say a thing to please people,
although we think exactly the contrary. This is one of the things the
philosopher Democrates, that you may have heard of, would have
laughed at.
Limby was almost as broad as he was long. He had what some
people call an open countenance; that is, one as broad as a full
moon. He had what his mamma called beautiful auburn locks, but
what other people said were carroty;—not before the mother, of
course.
Limby had a flattish nose and a widish mouth, and his eyes were
a little out of the right line. Poor little dear, he could not help that,
and, therefore, it was not right to laugh at him.
Everybody, however, laughed to see him eat his pap; for he would
not be fed with the patent silver pap-spoon which his father bought
him; but used to lay himself flat on his back, and seize the pap-boat
with both hands, and never let go of it till its contents were fairly in
his dear little stomach.
So Limby grew bigger and bigger every day, till at last he could
scarcely draw his breath, and was very ill; so his mother sent for
three apothecaries and two physicians, who looked at him,—told his
mamma there were no hopes; the poor child was dying of over-
feeding. The physicians, however, prescribed for him—a dose of
castor oil!
His mamma attempted to give him the castor oil; but Limby,
although he liked sugar plums, and cordial, and pap, and
sweetbread, and oysters, and other things nicely dished up, had no
fancy for castor oil, and struggled, and kicked, and fought, every
time his nurse or mamma attempted to give it to him.
“Limby, my darling boy,” said his mamma, “my sweet cherub, my
only dearest, do take the oily poily—there’s a ducky, deary—and it
shall ride in a coachy poachy.”
“Oh! the dear baby,” said the nurse, “take it for nursey. It will take
it for nursey—that it will.”