Sayan Mukhopadhyay and Pratip Samanta

Advanced Data Analytics Using Python


With Architectural Patterns, Text and Image
Classification, and Optimization Techniques
2nd ed.
Sayan Mukhopadhyay
Kolkata, West Bengal, India

Pratip Samanta
Kolkata, West Bengal, India

ISBN 978-1-4842-8004-1 e-ISBN 978-1-4842-8005-8


https://doi.org/10.1007/978-1-4842-8005-8

© Sayan Mukhopadhyay, Pratip Samanta 2018, 2023

Apress Standard

The use of general descriptive names, registered names, trademarks,
service marks, etc. in this publication does not imply, even in the
absence of a specific statement, that such names are exempt from the
relevant protective laws and regulations and therefore free for general
use.

The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been
made.

This Apress imprint is published by the registered company APress
Media, LLC, part of Springer Nature.
The registered company address is: 1 New York Plaza, New York, NY
10004, U.S.A.
The reason for the success of this book is that it has original research, so I
dedicate it to the person from whom I learned how to do research: Dr.
Debnath Pal, IISc.
—Sayan Mukhopadhyay
Introduction
We are living in the data science/artificial intelligence era. To thrive in
this environment, where data drives decision-making in everything
from business to government to sports and entertainment, you need
the skills to manage and analyze huge amounts of data. Together we
can use this data to make the world better for everyone. In fact, humans
have yet to find everything we can do using this data. So, let us explore!
Our objective for this book is to empower you to become a leader in
this data-transformed era. With this book you will learn the skills to
develop AI applications and make a difference in the world.
This book is intended for advanced users, because we have
incorporated some advanced analytics topics. Important machine
learning models and deep learning models are explained with coding
exercises and real-world examples.
All the source code used in this book is available for download at
https://github.com/apress/advanced-data-analytics-python-2e.
Happy reading!
Any source code or other supplementary material referenced by the
author in this book is available to readers on GitHub
(github.com/apress). For more detailed information, please visit
www.apress.com/source-code.
Acknowledgments
Thanks to Labonic Chakraborty (Ripa) and Soumili Chakraborty.
Table of Contents
Chapter 1:​A Birds Eye View to AI System
OOP in Python
Calling Other Languages in Python
Exposing the Python Model as a Microservice
High-Performance API and Concurrent Programming
Choosing the Right Database
Summary
Chapter 2:​ETL with Python
MySQL
How to Install MySQLdb?​
Database Connection
INSERT Operation
READ Operation
DELETE Operation
UPDATE Operation
COMMIT Operation
ROLL-BACK Operation
Normal Forms
First Normal Form
Second Normal Form
Third Normal Form
Elasticsearch
Connection Layer API
Neo4j Python Driver
neo4j-rest-client
In-Memory Database
MongoDB (Python Edition)
Import Data into the Collection
Create a Connection Using pymongo
Access Database Objects
Insert Data
Update Data
Remove Data
Cloud Databases
Pandas
ETL with Python (Unstructured Data)
Email Parsing
Topical Crawling
Summary
Chapter 3:​Feature Engineering and Supervised Learning
Dimensionality Reduction with Python
Correlation Analysis
Principal Component Analysis
Mutual Information
Classifications with Python
Semi-Supervised Learning
Decision Tree
Which Attribute Comes First?​
Random Forest Classifier
Naïve Bayes Classifier
Support Vector Machine
Nearest Neighbor Classifier
Sentiment Analysis
Image Recognition
Regression with Python
Least Square Estimation
Logistic Regression
Classification and Regression
Intentionally Bias the Model to Over-Fit or Under-Fit
Dealing with Categorical Data
Summary
Chapter 4:​Unsupervised Learning:​Clustering
K-Means Clustering
Choosing K:​The Elbow Method
Silhouette Analysis
Distance or Similarity Measure
Properties
General and Euclidean Distance
Squared Euclidean Distance
Distance Between String-Edit Distance
Similarity in the Context of a Document
Types of Similarity
Example of K-Means in Images
Preparing the Cluster
Thresholding
Time to Cluster
Revealing the Current Cluster
Hierarchical Clustering
Bottom-Up Approach
Distance Between Clusters
Top-Down Approach
Graph Theoretical Approach
How Do You Know If the Clustering Result Is Good?​
Summary
Chapter 5:​Deep Learning and Neural Networks
Backpropagation
Backpropagation Approach
Other Algorithms
TensorFlow
Network Architecture and Regularization Techniques
Updatable Model and Transfer Learning
Recurrent Neural Network
LSTM
Reinforcement Learning
TD0
TDλ
Example of Dialectic Learning
Convolution Neural Networks
Summary
Chapter 6:​Time Series
Classification of Variation
Analyzing a Series Containing a Trend
Curve Fitting
Removing Trends from a Time Series
Analyzing a Series Containing Seasonality
Removing Seasonality from a Time Series
By Filtering
By Differencing
Transformation
To Stabilize the Variance
To Make the Seasonal Effect Additive
To Make the Data Distribution Normal
Stationary Time Series
Stationary Process
Autocorrelation and the Correlogram
Estimating Autocovariance and Autocorrelation Functions
Time-Series Analysis with Python
Useful Methods
Autoregressive Processes
Estimating Parameters of an AR Process
Mixed ARMA Models
Integrated ARMA Models
The Fourier Transform
An Exceptional Scenario
Missing Data
Summary
Chapter 7:​Analytics at Scale
Hadoop
MapReduce Programming
Partitioning Function
Combiner Function
HDFS File System
MapReduce Design Pattern
A Notes on Functional Programming
Spark
PySpark
Updatable Machine Learning and Spark Memory Model
Analytics in the Cloud
Internet of Things
Essential Architectural Patterns for Data Scientists
Scenario 1:​Hot Potato Anti-Pattern
Scenario 2:​Proxy and Layering Patterns
Thank You
Index
About the Authors
Sayan Mukhopadhyay
has more than 13 years of industry
experience and has been associated with
companies such as Credit Suisse, PayPal,
CA Technologies, CSC, and Mphasis. He
has a deep understanding of applications
for data analysis in domains such as
investment banking, online payments,
online advertising, IT infrastructure, and
retail. His area of expertise is in applying
high-performance computing in
distributed and data-driven
environments such as real-time analysis,
high-frequency trading, and so on.
He earned his engineering degree in
electronics and instrumentation from
Jadavpur University and his master’s
degree in research in computational and data science from IISc in
Bangalore.

Pratip Samanta
is a principal AI engineer/researcher
with more than 11 years of experience.
He has worked for several software
companies and research institutions. He
has published conference papers and has
been granted patents in AI and natural
language processing. He is also
passionate about gardening and
teaching.
About the Technical Reviewer
Joos Korstanje
is a data scientist with more than five
years of industry experience in
developing machine learning tools, of
which a large part is forecasting models.
He currently works at Disneyland Paris
where he develops machine learning for
a variety of tools.
© Sayan Mukhopadhyay, Pratip Samanta 2023
S. Mukhopadhyay, P. Samanta, Advanced Data Analytics Using Python
https://doi.org/10.1007/978-1-4842-8005-8_1

1. A Birds Eye View to AI System


Sayan Mukhopadhyay1 and Pratip Samanta1

(1) Kolkata, West Bengal, India

In this book, we assume that you are familiar with Python programming. In this introductory chapter, we
explain why a data scientist should choose Python as a programming language. Then we highlight some
situations where Python may not be the ideal choice. Finally, we describe some best practices for application
development and give some coding examples that a data scientist may need in their day-to-day job.

OOP in Python
In this section, we explain some features of object-oriented programming (OOP) in a Python context.
The most basic element of any modern application is an object. To a programmer or architect, the world
is a collection of objects. Objects consist of two types of members: attributes and methods. Members can be
private, public, or protected. Classes are data types of objects. Every object is an instance of a class. A class
can be inherited in child classes. Two classes can be associated using composition.
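Python has no keywords for public, private, or protected, so encapsulation (hiding a member from the
outside world) is not enforced by the language. By convention, a leading underscore marks a member as
internal, and a double leading underscore triggers name mangling. Like C++, Python supports multilevel and
multiple inheritance. Unlike Java, it has no abstract keyword; abstract classes and methods are declared
with the abc module from the standard library.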
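As a minimal sketch of these conventions, the following example declares an abstract class with the standard abc module and shows that "private" members are hidden only by naming. The class names Shape and Square are illustrative assumptions and do not come from the book's code.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    """Abstract base class: it cannot be instantiated directly."""

    def __init__(self, name):
        self.name = name          # public by convention
        self._sides = 0           # single underscore: internal by convention
        self.__secret = "hidden"  # double underscore: mangled to _Shape__secret

    @abstractmethod
    def area(self):
        """Subclasses must override this method."""


class Square(Shape):
    def __init__(self, side):
        super().__init__("square")
        self._sides = 4
        self.side = side

    def area(self):
        return self.side ** 2


sq = Square(3)
print(sq.area())            # 9
print(sq._Shape__secret)    # the "private" attribute is still reachable

try:
    Shape("generic")        # an abstract class refuses instantiation
except TypeError as err:
    print("TypeError:", err)
```

Name mangling only renames the attribute, so it guards against accidental clashes in subclasses rather than against deliberate access.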
The following code implements an object-oriented question-answering system without any
machine learning. The program’s input is a set of dialogs in input.txt, as shown here:

glob is I
prok is V
pish is X
tegj is L
glob glob Silver is 34 Credits
glob prok Gold is 57800 Credits
pish pish Iron is 3910 Credits
how much is pish tegj glob glob ?
how many Credits is glob prok Silver ?
how many Credits is glob prok Gold ?
how many Credits is glob prok Iron ?
how much wood could a woodchuck chuck if a woodchuck could chuck wood?
The program has a knowledge base in config.txt:
I,1,roman
V,5,roman
X,10,roman
L,50,roman
C,100,roman
D,500,roman
M,1000,roman

Based on this input and the configuration, the program writes the answers to the questions in input.txt to
standard output, as shown here:
pish tegj glob glob is 42
glob prok Silver is 68 Credits
glob prok Gold is 57800 Credits
glob prok Iron is 782 Credits
I have no idea what you are talking about
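To see where these numbers come from, the arithmetic can be sketched in a few lines. This is a simplified illustration, not the book's implementation: the function names roman_to_int and words_to_int and the alias dictionary are assumptions made for the example, and the sketch performs none of the validation that the classes below add.

```python
# Values from config.txt
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Aliases declared in input.txt ("glob is I", and so on)
alias = {"glob": "I", "prok": "V", "pish": "X", "tegj": "L"}


def roman_to_int(symbols):
    """Convert a sequence of Roman symbols to an integer."""
    total = 0
    for i, sym in enumerate(symbols):
        value = ROMAN[sym]
        # A smaller value placed before a larger one is subtracted (IV = 4).
        if i + 1 < len(symbols) and value < ROMAN[symbols[i + 1]]:
            total -= value
        else:
            total += value
    return total


def words_to_int(words):
    """Translate alias words to Roman symbols, then to an integer."""
    return roman_to_int([alias[w] for w in words.split()])


print(words_to_int("pish tegj glob glob"))   # XLII -> 42

# "glob glob Silver is 34 Credits": II units of Silver cost 34 credits
silver = 34 / words_to_int("glob glob")      # 17.0 credits per unit
print(words_to_int("glob prok") * silver)    # IV units of Silver -> 68.0
```

The same reasoning gives Iron a unit price of 3910 / 20 = 195.5 credits, so four units cost 782 credits.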
The parsing logic is in the Observer class.

import operator

# This class verifies the validity of the input.
class Observer(object):

    def __init__(self, cmpiler):
        # frequency of each symbol (kept per instance so counts are not shared)
        self.length = {}
        # most frequent symbol
        self.symbol = ''
        # count of the most frequent symbol
        self.count = 0
        # calling class
        self.compiler = cmpiler

    def initialize(self, arr):
        for i in range(len(arr)):
            self.length[arr[i]] = 0

    # increase the count for each occurrence of a symbol
    def increment(self, symbol):
        self.length[symbol] = self.length[symbol] + 1

    # calculate the most frequent symbol and its count
    def calculate(self):
        self.symbol, self.count = max(self.length.items(),
                                      key=operator.itemgetter(1))

    # verify that a symbol worth 5, 50, or 500 (V, L, D) is not subtracted
    def verifySubstract(self, current):
        # strip trailing zeros so that 5, 50, and 500 all reduce to 5
        while current != 0 and current % 10 == 0:
            current = current / 10
        if current == 5:
            raise Exception("Wrong Substraction")

    def evaluate(self):
        # check whether the maximum repetition crosses the limit
        if self.count > 3:
            raise Exception("Repeat more than 3")
        # check whether the symbol is known
        if self.symbol not in self.compiler.symbol_map:
            raise Exception("Wrong Symbol")
        # check whether a non-repeatable symbol (V, L, D) is repeated
        self.symbol, unit = self.compiler.evaluateSymbol(self.symbol)
        while self.symbol != 0 and self.symbol % 10 == 0:
            self.symbol = self.symbol / 10
        if self.count > 1 and self.symbol == 5:
            raise Exception("Wrong Symbol repeated")

    # check whether the input sentence is properly formed
    def evaluateSentence(self, line):
        if "is" not in line:
            return "I have no idea what you are talking about"
The compilation logic is in the compiler class, as shown here:

import sys
from observer import Observer

class compilerTrader(object):
    # mapping of symbols to (score, unit)
    symbol_map = {}
    # list of valid numeric values
    valid_values = []

    # read the config and initialize the class members
    def __init__(self, config_path):
        with open(config_path) as f:
            for line in f:
                if ',' in line:
                    symbol, value, unit = line.strip().split(',')
                    self.symbol_map[symbol] = float(value), unit
                    self.valid_values.append(float(value))

    # evaluate the final numerical score with its unit for a symbol
    def evaluateSymbol(self, symbol):
        while symbol not in self.valid_values:
            symbol, unit = self.symbol_map[symbol]
        return float(symbol), unit

    # compile the information in a line
    def compile_super(self, line):
        obs = Observer(self)
        if 'is' in line:
            fields = line.split(' is ')
            value = fields[-1]
            var = fields[0]
            # one symbol and one value, for example "glob is I"
            if ' ' not in var and value in self.symbol_map:
                self.symbol_map[var] = int(self.symbol_map[value][0]), 'roman'
            # logic for a value with a unit, for example "34 Credits"
            elif ' ' in value:
                fields = value.split(' ')
                user_unit = fields[-1]
                if ' ' not in var:
                    self.symbol_map[var] = int(fields[0]), user_unit
                else:
                    # logic for multiple symbols in the input
                    total = int(fields[0])
                    factor = 0
                    arr = var.split(' ')
                    obs.initialize(arr)
                    for i in range(len(arr)):
                        obs.increment(arr[i])
                        # check the bound first so arr[i+1] is never out of range
                        if (i < len(arr) - 1 and arr[i] in self.symbol_map
                                and arr[i+1] in self.symbol_map):
                            current, current_unit = self.evaluateSymbol(arr[i])
                            next_value, next_unit = self.evaluateSymbol(arr[i+1])
                            if current >= next_value:
                                factor = factor + current
                            else:
                                obs.verifySubstract(current)
                                factor = factor - current
                        else:
                            if arr[i] in self.symbol_map:
                                current, current_unit = self.evaluateSymbol(arr[i])
                                factor = factor + current
                            else:
                                # unknown symbol: derive its unit price
                                self.symbol_map[arr[i]] = total/factor, user_unit
                                self.valid_values.append(total/factor)
                    obs.calculate()
                    obs.evaluate()
The answering logic is in the answer layer, which calls Observer and compiler. The answering
class inherits the compiler class.

import sys
from observer import Observer
from compiler import compilerTrader

class answeringTrader(compilerTrader):

    def __init__(self, config_path):
        super().__init__(config_path)

    # compile the information in a line
    def compile(self, line):
        super().compile_super(line)

    # answer the query in a line
    def answer(self, line):
        obs = Observer(self)
        if 'is' in line:
            values = line.split(' is ')[-1]
            ans = 0
            arr = values.split(' ')
            unit = ''
            obs.initialize(arr)
            for i in range(len(arr)):
                # skip punctuation tokens such as the trailing "?"
                if arr[i] in "?.,!;":
                    continue
                obs.increment(arr[i])
                if i < len(arr)-2:
                    if arr[i] in self.symbol_map and arr[i+1] in self.symbol_map:
                        current, current_unit = self.evaluateSymbol(arr[i])
                        next_value, next_unit = self.evaluateSymbol(arr[i+1])
                        if current >= next_value:
                            ans = ans + current
                        else:
                            if next_unit == 'roman':
                                obs.verifySubstract(current)
                                ans = ans - current
                            else:
                                ans = ans + current
                else:
                    # last symbol before the "?": a numeral or a priced item
                    if arr[i] in self.symbol_map:
                        current, unit = self.evaluateSymbol(arr[i])
                        if unit != 'roman':
                            ans = ans * current
                        else:
                            ans = ans + current

            obs.calculate()
            obs.evaluate()
            values = values.replace("?", "is ")
            if unit == 'roman':
                unit = ''
            return(values + str(ans) + ' ' + unit)
Finally, the main program calls the answering class and the observer, and then it performs the task
and does unit testing on the logic.

import sys
import unittest

sys.path.append('./answerLayer')
sys.path.append('./compilerLayer')
sys.path.append('./utilityLayer')

from answer import answeringTrader
from observer import Observer

# client interface for the framework
class ClientTrader(object):
    trader = None

    def __init__(self, config_path):
        self.trader = answeringTrader(config_path)

    # process an input string
    def process(self, input_string):
        obs = Observer(self.trader)
        valid = obs.evaluateSentence(input_string)
        if valid is not None:
            return valid
        if input_string.strip()[-1] == '?':
            return self.trader.answer(input_string)
        else:
            return self.trader.compile(input_string)

# unit test cases
class TestTrader(unittest.TestCase):
    trader = None

    def setUp(self):
        pass

    # test case for a unit other than roman
    def test_answer_unit(self):
        ans = self.trader.process("how many Credits is glob prok Silver ?")
        self.assertEqual(ans.strip(), "glob prok Silver is 68.0 Credits")

    # test case with only roman symbols
    def test_answer_roman(self):
        ans = self.trader.process("how much is pish tegj glob glob ?")
        self.assertEqual(ans.strip(), "pish tegj glob glob is 42.0")

    # test case if the repetition of a symbol exceeds the maximum limit (3)
    def test_exception_over_repeat(self):
        with self.assertRaises(Exception) as context:
            ans = self.trader.process("how much is pish tegj glob glob glob glob ?")
        self.assertTrue("Repeat more than 3" in str(context.exception))

    # test case if a wrong symbol is repeated, i.e., V, L, or D
    def test_exception_unproper_repeat(self):
        with self.assertRaises(Exception) as context:
            ans = self.trader.process("how much is pish tegj D D ?")
        # either the subtraction check or the repetition check rejects this input
        self.assertTrue("Wrong" in str(context.exception))

    # test case if a wrong symbol is subtracted, i.e., V, L, or D
    def test_wrong_substraction(self):
        with self.assertRaises(Exception) as context:
            ans = self.trader.process("how much is V X ?")
        self.assertTrue("Wrong Substraction" in str(context.exception))

    # test case if the query is not properly formatted
    def test_wrong_format_query(self):
        ans = self.trader.process("how much wood could a woodchuck chuck if a woodchuck could chuck wood ?")
        self.assertEqual(ans.strip(), "I have no idea what you are talking about")

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage is : " + sys.argv[0] + " <input file path> <config file path>")
        sys.exit(0)
    tr = ClientTrader(sys.argv[2])
    with open(sys.argv[1]) as f:
        for line in f:
            response = tr.process(line.strip())
            if response is not None:
                print(response)
    TestTrader.trader = tr
    unittest.main(argv=[sys.argv[0]], exit=False)
You can run this program with the following command:

python client.py input.txt config.txt

Calling Other Languages in Python


Now we will describe how to use other languages from Python. There are two examples here. The first is
calling R code from Python, which some use cases require. For example, R provides a ready-made HoltWinters
function for the Holt-Winters method in time series analysis. If you want to reuse it rather than replace
it in Python, you can call R code from Python using the rpy2 module, as shown here:

import rpy2.robjects as ro
ro.r('data(input)')
ro.r('x <-HoltWinters(input data frame)')

(You can use the example data given in the time series chapter.)


Sometimes you need to call Java code from Python. For example, say you are working on a named-entity
recognition problem in the field of natural language processing (NLP); some text is given as input, and you
have to recognize the names in the text. Python’s NLTK package does have a named-entity recognition
function, but its accuracy is often not sufficient. Stanford NLP is a better choice here, but it is written
in Java. You can solve this problem in two ways.
One option is to call Java at the command line using Python code. You need to install Java with your
package manager (yum or apt-get) before calling it.
For Windows, it is recommended that you install the JRE from
https://adoptium.net/temurin/releases/?version=8. You can also install the JRE from
another distribution. The installation will automatically create JAVA_HOME. If it does not, you need to set
JAVA_HOME as a system variable, and the value should be the location of the Java installation folder, for
example, JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-8.0.345.1-hotspot\.

import subprocess
subprocess.call(['java', '-cp', '*',
                 'edu.stanford.nlp.sentiment.SentimentPipeline',
                 '-file', 'foo.txt'])

Please place foo.txt in the same folder where you run the Python code.
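subprocess.call returns only the exit code. If you also need the text that the external program prints, a sketch along the following lines may help. The helper name run_command is an assumption for illustration, and the Python interpreter is used as a stand-in child process so that the example runs without Java installed.

```python
import subprocess
import sys


def run_command(args, timeout=60):
    """Run an external program and return its standard output as text."""
    completed = subprocess.run(
        args,
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode the output bytes to str
        check=True,           # raise CalledProcessError on a nonzero exit code
        timeout=timeout,      # give up if the program hangs
    )
    return completed.stdout


# Stand-in command; replace it with the java argument list shown above.
output = run_command([sys.executable, "-c", "print('hello from a child process')"])
print(output.strip())
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.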

Alternatively, you can expose Stanford NLP as a web service and call it as a service. (Before running this
code, you’ll need to download the Stanford NLP JAR file available with the book’s source code.)

# assumes the pycorenlp client package, which provides StanfordCoreNLP
from pycorenlp import StanfordCoreNLP

# example input (hypothetical); replace it with your own text
sentence = 'The service was quick and the food was great.'
nlp = StanfordCoreNLP('http://127.0.0.1:9000')
output = nlp.annotate(sentence, properties={
    "annotators": "tokenize,ssplit,parse,sentiment",
    "outputFormat": "json",
    # Only split the sentence at End Of Line. We assume that this method
    # only takes in one single sentence.
    "ssplit.eolonly": "true",
    # Setting enforceRequirements to skip some annotators and make the
    # process faster
    "enforceRequirements": "false"
})

You will see a more detailed example of Stanford NLP in Chapter 2.

Exposing the Python Model as a Microservice


You can expose a Python model as a microservice so that others can use it from their own code. The best
way to do this is to expose your model as a web service. As an example, the following code exposes a deep
learning model using Flask:

from flask import Flask, request, g
from flask_cors import CORS
import tensorflow as tf
from sqlalchemy import *
from sqlalchemy.orm import sessionmaker
import pygeoip
from pymongo import MongoClient
import json
import datetime as dt
import ipaddress
import math

app = Flask(__name__)
CORS(app)

# before each request: open the score table, the frequency store,
# the GeoIP lookup, and the saved model
@app.before_request
def before():
    # SQLAlchemy 1.x style (bound metadata and autoload)
    db = create_engine('sqlite:///score.db')
    metadata = MetaData(db)
    g.scores = Table('scores', metadata, autoload=True)
    Session = sessionmaker(bind=db)
    g.session = Session()
    client = MongoClient()
    g.db = client.frequency
    g.gi = pygeoip.GeoIP('GeoIP.dat')
    # TensorFlow 1.x API: restore the saved graph and its variables
    sess = tf.Session()
    new_saver = tf.train.import_meta_graph('model.obj.meta')
    new_saver.restore(sess, tf.train.latest_checkpoint('./'))
    all_vars = tf.get_collection('vars')
    g.dropped_features = str(sess.run(all_vars[0]))
    g.b = sess.run(all_vars[1])[0]
    return

def get_hour(timestamp):
    return dt.datetime.utcfromtimestamp(timestamp / 1e3).hour

# look up the stored score for one feature value (0.0 if unknown)
def get_value(session, scores, feature_name, feature_value):
    s = scores.select((scores.c.feature_name == feature_name) &
                      (scores.c.feature_value == feature_value))
    rs = s.execute()
    row = rs.fetchone()
    if row is not None:
        return float(row['score'])
    else:
        return 0.0

@app.route('/predict', methods=['POST'])
def predict():
    input_json = request.get_json(force=True)
    features = ['size', 'domain', 'client_time', 'device', 'ad_position',
                'client_size', 'ip', 'root']
    predicted = 0
    feature_value = ''
    for f in features:
        if f not in g.dropped_features:
            if f == 'ip':
                # request.remote_addr is already a str in Python 3
                feature_value = str(ipaddress.IPv4Address(
                    ipaddress.ip_address(request.remote_addr)))
            else:
                feature_value = input_json.get(f)
            if f == 'ip':
                if 'geo' not in g.dropped_features:
                    geo = g.gi.country_name_by_addr(feature_value)
                    predicted = predicted + get_value(g.session,
                                                      g.scores, 'geo', geo)
    return str(math.exp(predicted + g.b) - 1)

app.run(debug=True, host='0.0.0.0')
This code exposes a deep learning model as a Flask web service. A JavaScript client sends a request
with web user parameters such as the IP address, ad size, ad position, and so on, and the service returns
the price of the ad as a response. The features are categorical. You will learn how to convert them into
numerical scores in Chapter 3. These scores are stored in an in-memory database. The service fetches the
scores from the database, sums them, and replies to the client. The scores are updated in real time in each
iteration of training of the deep learning model. The service uses MongoDB to store how often each IP
address visits a site. This frequency is an important parameter because a user coming to a site for the
first time is really searching for something, which is not true for a user whose frequency is greater than
5. The number of IP addresses is huge, so they are stored in a distributed MongoDB database.
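The lookup-and-sum step described above can be sketched with only the standard library. This is an illustration under stated assumptions, not the service itself: the table contents, the feature values, and the function signatures are made up for the example, and an in-memory SQLite database stands in for the score store.

```python
import math
import sqlite3

# In-memory score table with made-up values
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scores (feature_name TEXT, feature_value TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("size", "300x250", 0.4), ("device", "mobile", 0.1), ("geo", "India", 0.2)],
)


def get_value(conn, feature_name, feature_value):
    """Return the stored score for one categorical value, or 0.0 if unseen."""
    row = conn.execute(
        "SELECT score FROM scores WHERE feature_name = ? AND feature_value = ?",
        (feature_name, feature_value),
    ).fetchone()
    return float(row[0]) if row is not None else 0.0


def predict(conn, request_features, bias=0.0):
    """Sum the per-feature scores and map the sum back to a price."""
    total = sum(get_value(conn, n, v) for n, v in request_features.items())
    return math.exp(total + bias) - 1


price = predict(conn, {"size": "300x250", "device": "mobile", "geo": "Mars"})
print(round(price, 4))
```

The final exp(x) - 1 mirrors the return statement of the Flask handler; it suggests that the model was fitted on log(1 + price), although the text does not state this. Unseen values such as the "Mars" entry simply contribute a score of zero.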

High-Performance API and Concurrent Programming


Flask is a good choice when you are building a general solution that also serves a graphical user interface
(GUI). But if high performance is the most critical requirement of your application, then Falcon is a
strong choice. The following code is an example of the same model shown previously, exposed by the Falcon
framework. Another improvement we made in this code is that we implemented multithreading, so the score
lookups are executed concurrently. In addition to the Falcon-specific changes, you should note the major
change of parallelizing the scoring calls using a thread pool class (ThreadPoolExecutor).

import falcon
from falcon_cors import CORS
import json
from sqlalchemy import *
from sqlalchemy.orm import sessionmaker
import pygeoip
from pymongo import MongoClient
import json
import datetime as dt
import ipaddress
import math
from concurrent.futures import *
from sqlalchemy.engine import Engine
from sqlalchemy import event
import sqlite3
@event.listens_for(Engine, "connect")
def set_sqlite_pragma(dbapi_connection, connection_record):
cursor = dbapi_connection.cursor()
cursor.execute("PRAGMA cache_size=100000")
cursor.close()
class Predictor(object):
def __init__(self,domain):
db1 = create_engine('sqlite:///score_' + domain + '0test.db')
metadata1 = MetaData(db1)
self.scores = Table('scores', metadata1, autoload=True)
client = MongoClient(connect=False,maxPoolSize=1)
self.db = client.frequency
self.gi = pygeoip.GeoIP('GeoIP.dat')
self.high = 1.2
self.low = .8
def get_hour(self,timestamp):
return dt.datetime.utcfromtimestamp(timestamp / 1e3).hour
def get_score(self, featurename, featurevalue):
pred = 0
s = self.scores.select((self.scores.c.feature_name ==
featurename) & (self.scores.c.feature_value == featurevalue))
rs = s.execute()
row = rs.fetchone()
if row is not None:
pred = pred + float(row['score'])
res = self.db.frequency.find_one({"ip" : ip})
freq = 1
if res is not None:
freq = res['frequency']
pred2, prob2 = self.get_score('frequency', str(freq))
return (pred1 + pred2), (prob1 + prob2)

conn = sqlite3.connect('multiplier.db')
cursor = conn.execute("SELECT high,low from multiplier
where domain='" + value + "'")
row = cursor.fetchone()
if row is not None:
self.high = row[0]
self.low = row[1]
return self.get_score(f, value)
def on_post(self, req, resp):
input_json = json.loads(req.stream.read(),encoding='utf-8')
input_json['ip'] = unicode(req.remote_addr)
pred = 1
prob = 1
with ThreadPoolExecutor(max_workers=8) as pool:
future_array = {
pool.submit(self.get_value,f,input_json[f]) : f for f in input_json}
for future in as_completed(future_array):
pred1, prob1 = future.result()
pred = pred + pred1
prob = prob - prob1
resp.status = falcon.HTTP_200
res = math.exp(pred)-1
if res < 0:
res = 0
prob = math.exp(prob)
if(prob <= .1):
prob = .1
if(prob >= .9):
prob = .9
multiplier = self.low + (self.high -self.low)*prob
pred = multiplier*pred
resp.body = str(pred)
cors =
CORS(allow_all_origins=True,allow_all_methods=True,allow_all_headers=True)
wsgi_app = api = falcon.API(middleware=[cors.middleware])
f = open('publishers1.list')
for domain in f:
domain = domain.strip()
p = Predictor(domain)
url = '/predict/' + domain
api.add_route(url, p)
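The thread-pool fan-out used in on_post can be isolated in a small standalone sketch. The SCORES dictionary, the get_score function, and the sleep that imitates a database round trip are assumptions made for illustration; only ThreadPoolExecutor and as_completed are taken from the listing above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

# Made-up per-feature scores standing in for the score database
SCORES = {("size", "300x250"): 0.4, ("device", "mobile"): 0.1}


def get_score(feature_name, feature_value):
    """Look up one score; the sleep imitates a slow database call."""
    time.sleep(0.05)
    return SCORES.get((feature_name, feature_value), 0.0)


def total_score(features, max_workers=8):
    """Submit one lookup per feature and add the results as they finish."""
    total = 0.0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(get_score, name, value): name
            for name, value in features.items()
        }
        for future in as_completed(futures):
            total += future.result()
    return total


result = total_score({"size": "300x250", "device": "mobile", "root": "unknown"})
print(result)
```

Threads help here because each lookup spends most of its time waiting on I/O; for CPU-bound scoring, the global interpreter lock would limit the gain and a process pool would be the usual alternative.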
Having covered design patterns in Python a bit, let’s now take a look at some essential architecture
patterns for data scientists.

Choosing the Right Database


Before we go, we’ll leave a note for managers on which database is best for which case.
A relational database (MySQL, Oracle, SQL Server) is the preferable choice when data is highly structured
and entities have clear and strict relationships. MongoDB, on the other hand, is a better choice when data
is unstructured and unorganized.
Elasticsearch or Solr is a better choice when data contains lengthy text fields and you’re executing lots
of substring searches on those fields. With Elasticsearch, you also get a free data visualization tool
called Kibana as well as an ETL tool called Logstash; such full-stack data analytics solutions are popular.
Data must sometimes be represented as a graph. In that situation, a graph database is required. Neo4j is a
popular graph database that comes with many utility tools at a low price.
Sometimes you need to build an application quickly. In that situation, an embedded database like SQLite,
which can also run entirely in memory, can be used. However, SQLite does not support updating your database
from a remote host.
You’ll learn more about databases in Chapter 2.

Summary
In this chapter, we discussed fundamental engineering principles for data scientists, which are covered in
more detail in separate chapters. The question-answering example can help you understand how to organize
your code. The basic rule is to not put everything into one class. Divide your code into several classes
and use parent-child relationships where they exist. Then you learned how to use Python to call other
languages’ code. We provided two examples, one calling R code and one calling Java code. Then we showed you
how to expose your model as a REST API and make it perform well by using concurrent programming. Following
that, we covered significant architectural patterns for data scientists.
© Sayan Mukhopadhyay, Pratip Samanta 2023
S. Mukhopadhyay, P. Samanta, Advanced Data Analytics Using Python
https://doi.org/10.1007/978-1-4842-8005-8_2

2. ETL with Python


Sayan Mukhopadhyay1 and Pratip Samanta1

(1) Kolkata, West Bengal, India

Every data science professional has to extract, transform, and load (ETL) data from different data
sources. In this chapter, we will discuss how to perform ETL with Python for a selection of
popular databases. For a relational database, we’ll cover MySQL. As an example of a document
database, we will cover Elasticsearch. For a graph database, we’ll cover Neo4j, and for NoSQL,
we’ll cover MongoDB. We will also discuss the Pandas framework, which was inspired by R’s data
frame concept.
ETL is based on a process in which data is extracted from multiple sources, transformed into
specific formats through steps that involve cleaning and enrichment, and finally loaded into its
target destination. The following are the details of each process:
1. Extract: During data extraction, source data is pulled from a variety of sources and moved to a
staging area, making the data available to subsequent stages in the ETL process. After that,
the data undergoes the cleaning and enrichment stage, also known as data cleansing.

2. Transform: In this stage, the source data is matched to the format of the target system. This
includes steps such as changing data types, combining fields, splitting fields, etc.

3. Load: This stage is the final ETL stage. Here, data is loaded into the data warehouse in an
automated manner and can be periodically updated. Once completed, the data is ready for
data analysis.

The previous processes are important in any data analytics work. Once the data goes through
the ETL processes, it becomes possible to analyze the data, find insights, and so on.
We will discuss various types of ETL throughout this chapter. We discussed in Chapter 1 that
data is not an isolated thing. We need to fetch the data from some application, which is extraction,
and we need to load it somewhere, which is a database. In this chapter and the next, we will
discuss various feature engineering techniques that transform the data from one form to another.
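The three ETL stages can be sketched end to end in a few lines. The following is a minimal illustration rather than a production pipeline: the inline CSV text, the cleaning rules, and the function names extract, transform, and load are all assumptions made for this example, and an in-memory SQLite table stands in for the target database.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a real source system.
RAW_CSV = """name,sur_name,roll_no
 sayan ,mukhopadhyay,1
pratip,samanta,2
,missing,3
"""

def extract(text):
    """Extract: pull rows from the source into a staging list."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean the rows and match the target column types."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if not name:  # drop records that fail a basic quality check
            continue
        cleaned.append((name.title(), row["sur_name"].strip().title(), int(row["roll_no"])))
    return cleaned

def load(rows, conn):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS student (name TEXT, sur_name TEXT, roll_no INTEGER)")
    conn.executemany("INSERT INTO student VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, sur_name, roll_no FROM student ORDER BY roll_no").fetchall()
print(loaded)
```

Keeping each stage in its own function makes it possible to swap the source or the target without touching the cleaning logic.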

MySQL
MySQLdb is an API in Python developed to work on top of the MySQL C interface.

How to Install MySQLdb?


First you need to install the Python MySQLdb module on your machine. Then run the following
script:

#!/usr/bin/python
import MySQLdb
If you get an ImportError exception, that means the module was not installed properly.
The following are the instructions to install the MySQL Python module:

$ gunzip MySQL-python-1.2.2.tar.gz
$ tar -xvf MySQL-python-1.2.2.tar
$ cd MySQL-python-1.2.2
$ python setup.py build
$ python setup.py install

You can download the tar.gz file from https://dev.mysql.com/downloads/connector/python/.
You need to download it to your working folder.
For Windows, please select the MySQL installer file from
https://dev.mysql.com/downloads/installer/. Once it’s downloaded, double-click
the file to install it and select MySQL Connector/Python as one of the products to install. For
details, you can visit
https://dev.mysql.com/doc/connector-python/en/connector-python-installation-binary.xhtml.

Database Connection
Before connecting to a MySQL database, make sure you do the following:
1. You need to access a database called TEST with the SQL command use test;.

2. In TEST you need a table named STUDENT; use the SQL command create table
student(name varchar(20), sur_name varchar(20), roll_no int);.

3. STUDENT needs three fields: NAME, SUR_NAME, and ROLL_NO.

4. There needs to be a user in TEST that has complete access to the database.

If you do not do these steps properly, you will get an exception in the next Python code.

INSERT Operation
The following code carries out the SQL INSERT statement for the purpose of creating a record in
the STUDENT table:

#!/usr/bin/python
import MySQLdb
# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")
# prepare a cursor object using cursor() method
cursor = db.cursor()
# Prepare SQL query to INSERT a record into the database.
sql = """INSERT INTO STUDENT(NAME,
         SUR_NAME, ROLL_NO)
         VALUES ('Sayan', 'Mukhopadhyay', 1)"""
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Rollback in case there is any error
    db.rollback()
# disconnect from server
db.close()

READ Operation
The following code fetches data from the STUDENT table and prints it:

#!/usr/bin/python
import MySQLdb
# Open database connection and prepare a cursor, as in the INSERT example
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")
cursor = db.cursor()
# Prepare SQL query to SELECT all records from the table.
sql = "SELECT * FROM STUDENT"
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Fetch all the rows as a sequence of tuples.
    results = cursor.fetchall()
    for row in results:
        fname = row[0]
        lname = row[1]
        roll_no = row[2]
        # Now print fetched result
        print("name=%s,surname=%s,id=%d" % (fname, lname, roll_no))
except MySQLdb.Error:
    print("Error: unable to fetch data")
# disconnect from server
db.close()

DELETE Operation
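The following code deletes the row with ROLL_NO = 1 from the STUDENT table:

#!/usr/bin/python
import MySQLdb
# Open database connection and prepare a cursor, as in the INSERT example
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")
cursor = db.cursor()
# Prepare SQL query to DELETE required records
sql = "DELETE FROM STUDENT WHERE ROLL_NO = 1"
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Rollback in case there is any error
    db.rollback()
# disconnect from server
db.close()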

UPDATE Operation
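The following code changes the SUR_NAME field from Mukhopadhyay to Mukherjee:

#!/usr/bin/python
import MySQLdb
# Open database connection and prepare a cursor, as in the INSERT example
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")
cursor = db.cursor()
# Prepare SQL query to UPDATE required records
sql = """UPDATE STUDENT SET SUR_NAME = 'Mukherjee'
         WHERE SUR_NAME = 'Mukhopadhyay'"""
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Rollback in case there is any error
    db.rollback()
# disconnect from server
db.close()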

COMMIT Operation
The commit operation tells the database to finalize the modifications; after this operation,
they cannot be reverted.

ROLL-BACK Operation
If you are not satisfied with any of the uncommitted modifications and you want to reverse them,
you can call the rollback() method.
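The interplay of commit and rollback can be tried without a MySQL server. The sketch below swaps in Python's built-in sqlite3 module, whose connection objects expose the same commit() and rollback() methods defined by the Python DB-API that MySQLdb also follows. The table mirrors STUDENT, and the in-memory database is an assumption made only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, sur_name TEXT, roll_no INTEGER)")
conn.execute("INSERT INTO student VALUES ('Sayan', 'Mukhopadhyay', 1)")
conn.commit()  # the insert is now permanent

conn.execute("UPDATE student SET sur_name = 'Mukherjee' WHERE roll_no = 1")
conn.rollback()  # discard the uncommitted update

sur_name = conn.execute("SELECT sur_name FROM student WHERE roll_no = 1").fetchone()[0]
print(sur_name)
```

The committed insert survives, while the update that was rolled back leaves no trace.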
The following is a complete example of accessing MySQL data through Python. It collects a
complete description of the data stored in a CSV file or MySQL database.
The code asks for the data source type, either MySQL or text. For MySQL, it asks for
the IP address, credentials, and database name, shows all tables in the database, and lists the
fields of a table once that table is selected. Similarly, for a text file it asks for a path and shows
the user all the columns in the file the path points to. The listing here covers the MySQL case.

# importing modules and opening the config file

import MySQLdb
import sys

out = open('Config1.txt', 'w')
print("Enter the Data Source Type:")
print("1. MySql")
print("2. Exit")
while True:
    data1 = sys.stdin.readline().strip()
    if int(data1) == 1:
        out.write("source begin" + "\n" + "type=mysql\n")

        # taking inputs from user

        print("Enter the ip:")
        ip = sys.stdin.readline().strip()
        out.write("host=" + ip + "\n")
        print("Enter the database name:")
        db = sys.stdin.readline().strip()
        out.write("database=" + db + "\n")
        print("Enter the user name:")
        usr = sys.stdin.readline().strip()
        out.write("user=" + usr + "\n")
        print("Enter the password:")
        passwd = sys.stdin.readline().strip()
        out.write("password=" + passwd + "\n")

        # making a connection and executing a query

        connection = MySQLdb.connect(ip, usr, passwd, db)
        cursor = connection.cursor()
        query = "show tables"
        cursor.execute(query)
        data = cursor.fetchall()
        tables = []

        # collecting the table names

        for row in data:
            for field in row:
                tables.append(field.strip())
        for i in range(len(tables)):
            print(i, tables[i])
        tb = tables[int(sys.stdin.readline().strip())]
        out.write("table=" + tb + "\n")
        query = "describe " + tb
        cursor.execute(query)
        data = cursor.fetchall()
        columns = []
        for row in data:
            columns.append(row[0].strip())
        for i in range(len(columns)):
            print(columns[i])
        print("Not index; choose the exact column names separated by comma")
        cols = sys.stdin.readline().strip()
        out.write("columns=" + cols + "\n")
        cursor.close()
        connection.close()
        out.write("source end" + "\n")
        print("Enter the Data Source Type:")
        print("1. MySql")
        print("2. Exit")
    if int(data1) == 2:
        # option 2 closes the config file and ends the session
        out.close()
        sys.exit()
Before we move on from relational databases, let’s talk about database normalization.

Normal Forms
Database normal forms are principles for organizing your data in an optimal way.
Every table in a database can be in one of the normal forms that we’ll go over next. Aside from
primary keys (PKs) and foreign keys (FKs), you want to have as little repetition as possible. The
rest of the information should be taken from other tables.
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)

First Normal Form


There are no repeating groups in first normal form, since only single values are allowed at the
intersection of each row and column.
To get to first normal form, remove the repeating group and establish two new relations in place
of the relation that contained it.
For unique identification, the new relation’s PK is a combination of the old relation’s PK
and an attribute from the newly formed relation.
To demonstrate the procedure for 1NF, we’ll use the Student_Grade_Report table, which
comes from a School database.

Student_Grade_Report (StudentNo, StudentName, Major, CourseNo,
CourseName, InstructorNo, InstructorName, InstructorLocation, Grade)

1. The repeating group in the Student_Grade_Report table contains the course information. A
student can enroll in a variety of courses.

2. Remove the repeating group. In this case, that’s each student’s course information.

3. Determine your new table’s PK.

4. The PK (StudentNo and CourseNo) must uniquely identify the attribute values.

Student (StudentNo, StudentName, Major)
StudentCourse (StudentNo, CourseNo, CourseName, InstructorNo,
InstructorName, InstructorLocation, Grade)

Second Normal Form


For the second normal form, the relation must first be in 1NF. If the PK consists of a single
attribute, the relation is automatically in 2NF.
If the relation has a composite PK, then each nonkey attribute must be fully dependent on the
entire PK, not just a portion of it (i.e., there can’t be any partial dependency).
1. As it has a single-column PK, the Student table is already in 2NF.

2. When looking at the StudentCourse table, you can observe that not all of the attributes,
especially the course details, are fully dependent on the PK. The grade is the sole
attribute that is fully dependent on the composite PK (StudentNo, CourseNo).

3. Create a new table containing the course details.

4. Determine the new table’s PK.


The three new tables are as follows:
Student (StudentNo, StudentName, Major)
CourseGrade (StudentNo, CourseNo, Grade)
CourseInstructor (CourseNo, CourseName, InstructorNo,
InstructorName, InstructorLocation)
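A partial dependency can be spotted mechanically on sample data. The following sketch checks whether one set of columns functionally determines another in a list of rows. The helper name holds and the sample rows are hypothetical, and a check on sample data can only refute a dependency, never prove it for all future data.

```python
def holds(rows, determinant, dependent):
    """Return True if `determinant` functionally determines `dependent` in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[col] for col in determinant)
        value = tuple(row[col] for col in dependent)
        # The first value seen for a key must match every later one
        if seen.setdefault(key, value) != value:
            return False
    return True

student_course = [
    {"StudentNo": 1, "CourseNo": "C101", "CourseName": "Mechanics", "Grade": "A"},
    {"StudentNo": 2, "CourseNo": "C101", "CourseName": "Mechanics", "Grade": "B"},
    {"StudentNo": 1, "CourseNo": "C102", "CourseName": "Optics", "Grade": "B"},
]

# CourseName depends on CourseNo alone, which is only part of the PK
partial = holds(student_course, ["CourseNo"], ["CourseName"])
# Grade is not determined by CourseNo alone
grade_on_course = holds(student_course, ["CourseNo"], ["Grade"])
# Grade is determined by the whole composite PK
grade_on_pk = holds(student_course, ["StudentNo", "CourseNo"], ["Grade"])
print(partial, grade_on_course, grade_on_pk)
```

Because CourseName is determined by CourseNo alone, it belongs in a table keyed by CourseNo, while Grade stays with the composite key.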

Third Normal Form


The relation must be in second normal form to be in third normal form. All transitive
dependencies must be eliminated as well; a nonkey attribute cannot be functionally dependent on
another nonkey attribute.
This is the process for achieving 3NF:
1. From each table with a transitive dependency, remove the attributes that are transitively
dependent.

2. Make a new table containing the removed attributes and their determinant.

3. Inspect the new and updated tables to ensure that each table has a determinant and that no
table has improper dependencies.

Take a look at the four new tables:

Student (StudentNo, StudentName, Major)
CourseGrade (StudentNo, CourseNo, Grade)
Course (CourseNo, CourseName, InstructorNo)
Instructor (InstructorNo, InstructorName, InstructorLocation)
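To see that nothing is lost by normalizing, the four tables can be created and joined back into the original report. This is a minimal sketch using an in-memory SQLite database. The column types and the single sample row per table are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (StudentNo INTEGER PRIMARY KEY, StudentName TEXT, Major TEXT);
CREATE TABLE Instructor (InstructorNo INTEGER PRIMARY KEY, InstructorName TEXT,
                         InstructorLocation TEXT);
CREATE TABLE Course (CourseNo TEXT PRIMARY KEY, CourseName TEXT,
                     InstructorNo INTEGER REFERENCES Instructor(InstructorNo));
CREATE TABLE CourseGrade (StudentNo INTEGER REFERENCES Student(StudentNo),
                          CourseNo TEXT REFERENCES Course(CourseNo),
                          Grade TEXT,
                          PRIMARY KEY (StudentNo, CourseNo));
INSERT INTO Student VALUES (1, 'Asha', 'Physics');
INSERT INTO Instructor VALUES (10, 'Dr. Roy', 'Room 201');
INSERT INTO Course VALUES ('C101', 'Mechanics', 10);
INSERT INTO CourseGrade VALUES (1, 'C101', 'A');
""")

# Rebuild one row of Student_Grade_Report by following the foreign keys
report = conn.execute("""
SELECT s.StudentNo, s.StudentName, s.Major, c.CourseNo, c.CourseName,
       i.InstructorNo, i.InstructorName, i.InstructorLocation, g.Grade
FROM CourseGrade g
JOIN Student s ON s.StudentNo = g.StudentNo
JOIN Course c ON c.CourseNo = g.CourseNo
JOIN Instructor i ON i.InstructorNo = c.InstructorNo
""").fetchall()
print(report)
```

Each fact, such as an instructor's location, is now stored once, so updating it touches a single row.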

There should be no anomalies left once the tables are in third normal form. For this example,
consider the dependency diagram in Figure 2-1. As previously said, the first step is to eliminate
repeating groups.

Student (StudentNo, StudentName, Major)
StudentCourse (StudentNo, CourseNo, CourseName, InstructorNo,
InstructorName, InstructorLocation, Grade)

Figure 2-1 Dependency diagram

Review the dependencies in Figure 2-1, which summarizes the normalization procedure for
the School database.
The following are the abbreviations used in Figure 2-1:
PD stands for partial dependency.
TD stands for transitive dependency.
FD stands for full dependency. (FD stands for functional dependence in most cases. Figure 2-1
is the only place where FD is used as an abbreviated form for full dependence.)
A relational database is valuable when structured data and a strict relationship between the
fields are maintained. But what if you do not have structured data in which a strict relationship
between fields has been maintained? That’s where Elasticsearch comes in.

Elasticsearch
You’ll find that data is often unstructured, meaning you may end up with a mix of image data,
sensor data, and other forms of data. To analyze this data, we first need to store it. MySQL and
other SQL-based databases are not good at storing unstructured data. So here we introduce a
different kind of storage, which is mainly used to handle unstructured textual data.
Elasticsearch is a Lucene-based database, which makes it easy to store and search text data.
Its query interface is a REST API endpoint. The Elasticsearch (ES) low-level Python client gives a
direct mapping from Python to ES REST endpoints. One of the big advantages of Elasticsearch is
that it provides a full-stack solution for data analysis in one place. Elasticsearch is the database.
It has a configurable front end called Kibana, a data collection tool called Logstash, and an
enterprise security feature called Shield.
The client has attributes called cat, cluster, indices, ingest, nodes, snapshot, and tasks that
give access to instances of CatClient, ClusterClient, IndicesClient, IngestClient,
NodesClient, SnapshotClient, and TasksClient, respectively. These instances are the
only supported way to get access to these classes and their methods.
You can specify your own connection class, which can be used by providing the
connection_class parameter.

# create connection to local host using the ThriftConnection


Es1=Elasticsearch(connection_class=ThriftConnection)

Installation commands for Elasticsearch are given here:

curl -fsSL https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
sudo apt update
sudo apt install elasticsearch

You can start and stop Elasticsearch in Ubuntu with these commands:

service elasticsearch start
service elasticsearch stop

You can check the status with this command:

service elasticsearch status

The following code creates a connection that inspects the cluster on its own:

# create connection that will automatically inspect the cluster to get
# the list of active nodes. Start with nodes running on 'esnode1' and
# 'esnode2'
Es1=Elasticsearch(
['esnode1', 'esnode2'],
# sniff before doing anything
sniff_on_start=True,
# refresh nodes after a node fails to respond
sniff_on_connection_fail=True,
# and also every 30 seconds
sniffer_timeout=30
)

Different hosts can have different parameters (hostname, port number, SSL option); you can
use one dictionary per node to specify them.

# connect to localhost directly and another node using SSL on port 443
# and an url_prefix. Note that ``port`` needs to be an int.
Es1=Elasticsearch([
{'host':'localhost'},
{'host':'othernode','port':443,'url_prefix':'es','use_ssl':True},
])

SSL client authentication is also supported (see Urllib3HttpConnection for a detailed
description of the options); an example is given here:

Es1=Elasticsearch(
['localhost:443','other_host:443'],
# turn on SSL
use_ssl=True,
# make sure we verify SSL certificates (off by default)
verify_certs=True,
# provide a path to CA certs on disk
ca_certs='path to CA_certs',
# PEM formatted SSL client certificate
client_cert='path to clientcert.pem',
# PEM formatted SSL client key
client_key='path to clientkey.pem'
)

Connection Layer API


Many classes are responsible for dealing with the Elasticsearch cluster. The default
subclasses being used can be overridden by passing parameters to the
Elasticsearch class. Every argument given to the client is passed on to Transport,
ConnectionPool, and Connection.
As an example, if you want to use your own implementation of the
ConnectionSelector class, you just need to pass in the selector_class parameter.
The API wraps the raw REST API as closely as possible, which includes the
distinction between the required and optional arguments to the calls. This implies that the
code distinguishes between positional and keyword arguments; I advise you to use
keyword arguments for all calls to be consistent and safe. An API call is successful (and
returns a response) if Elasticsearch returns a 2XX response. Otherwise, an instance of
TransportError (or a more specific subclass) is raised. You can see other exceptions and
error states in the exceptions module. If you do not want an exception to be raised, you can
always pass in an ignore parameter with either a single status code that should be ignored or a
list of them.

from elasticsearch import Elasticsearch

es=Elasticsearch()
# ignore 400 caused by IndexAlreadyExistsException when creating an index
es.indices.create(index='test-index',ignore=400)
# ignore 404 and 400
es.indices.delete(index='test-index',ignore=[400,404])

Neo4j Python Driver


Many systems, such as network topologies and social networks, are naturally described as graphs,
and problems in such systems are often easier to solve when the data is represented as a graph.
Neo4j is a database that stores data in the form of a graph and provides a graphical interface for
running queries. The Neo4j Python driver is officially supported by Neo4j and connects to the
database through the binary protocol. It tries to remain minimalistic but at the same time be
idiomatic to Python.

pip install neo4j-driver

from neo4j.v1 import GraphDatabase, basic_auth
driver11 = GraphDatabase.driver("bolt://localhost",
                                auth=basic_auth("neo4j", "neo4j"))
session11 = driver11.session()
session11.run("CREATE (a:Person {name:'Sayan', title:'Mukhopadhyay'})")
result11 = session11.run("MATCH (a:Person) WHERE a.name = 'Sayan' "
                         "RETURN a.name AS name, a.title AS title")
for record in result11:
    print("%s %s" % (record["title"], record["name"]))
session11.close()

neo4j-rest-client
The main objective of neo4j-rest-client is to make sure that the Python programmers already
using Neo4j locally through python-embedded are also able to access the Neo4j REST server. So,
the structure of the neo4j-rest-client API is completely in sync with python-embedded. But, a new
structure is brought in so as to arrive at a more Pythonic style and to augment the API with the
new features being introduced by the Neo4j team.

In-Memory Database
Another important class of databases is the in-memory database. This type stores and processes
the data in RAM. So, operations on the database are fast, and the data is volatile. SQLite is a
popular lightweight database that can run as an in-memory database, although by default it stores
data in a file on disk. In Python you can operate on SQLite with the built-in sqlite3 module or with
the sqlalchemy library. In Chapter 1’s Flask and Falcon example, I showed you how to select data
from SQLite. Here I will show how to store a Pandas data frame in SQLite:

from sqlalchemy import create_engine
import sqlite3

conn = sqlite3.connect('multiplier.db')
conn.execute('''CREATE TABLE if not exists multiplier
             (domain CHAR(50),
             low REAL,
             high REAL);''')
conn.close()
# SQLAlchemy URL of the SQLite file; replace the file name with your own
db_name = "sqlite:///your_db_name.db"
disk_engine = create_engine(db_name)
# df is an existing Pandas data frame
df.to_sql('scores', disk_engine, if_exists='replace')
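The volatility of an in-memory database is easy to observe with the sqlite3 module alone. In this sketch, the special name ":memory:" asks SQLite to keep the whole database in RAM. The make_db helper and the example.com row are hypothetical and serve only as illustration.

```python
import sqlite3

def make_db():
    """Open a fresh in-memory database with an empty multiplier table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE multiplier (domain CHAR(50), low REAL, high REAL)")
    return conn

conn = make_db()
conn.execute("INSERT INTO multiplier VALUES (?, ?, ?)", ("example.com", 0.5, 1.5))
conn.commit()
rows = conn.execute("SELECT domain, low, high FROM multiplier").fetchall()
conn.close()  # closing the connection discards the in-memory data

# A second in-memory connection starts empty
conn2 = make_db()
count_after_reopen = conn2.execute("SELECT COUNT(*) FROM multiplier").fetchone()[0]
conn2.close()
print(rows, count_after_reopen)
```

Passing a file name such as 'multiplier.db' instead of ":memory:" makes the same code persistent.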

MongoDB (Python Edition)


MongoDB is an open-source document database designed for superior performance, high
availability, and automatic scaling. MongoDB makes sure that object-relational mapping (ORM) is
not required to facilitate development. A record in MongoDB is a document, which is a data
structure made up of field and value pairs. These documents are akin to JSON objects. The
values of fields may include other documents, arrays, and arrays of documents.

{
"_id":ObjectId("01"),
"address": {
"street":"Siraj Mondal Lane",
"pincode":"743145",
"building":"129",
"coord": [ -24.97, 48.68 ]
},
"borough":"Manhattan",

Import Data into the Collection


mongoimport can be used from the system shell or a command prompt to load documents into a
collection in a database. With the --drop option, if the collection already exists in the database,
the operation drops the original collection first.

mongoimport --db test --collection restaurants --drop --file ~/downloads/primer-dataset.json

The mongoimport command connects to a MongoDB instance running on localhost on port
27017. The --file option specifies the file to import; here it’s
~/downloads/primer-dataset.json.
To import data into a MongoDB instance running on a different host or port, specify the hostname
or port in the mongoimport command with the --host or --port option.
MySQL has a similar load command, LOAD DATA.

Create a Connection Using pymongo


To create a connection, do the following:

from pymongo import MongoClient
client11 = MongoClient()

If no argument is passed to MongoClient, it defaults to the MongoDB instance
running on the localhost interface on port 27017.
A complete MongoDB URL may be given to define the connection, which includes the
host and port number. Let’s take a look at an example.
First, install Mongo using this command: yum/apt install mongo.
Then, launch MongoDB using this command: service mongo start.
The following code makes a connection to a MongoDB instance that runs on
myhostname and port 27017:

client11 = MongoClient("mongodb://myhostname:27017")

Access Database Objects


To assign the database named primer to the local variable db11, you can use either of the
following lines:

db11 = client11.primer
db11 = client11['primer']

Collection objects can be accessed directly by using the dictionary style or the attribute access
from a database object, as shown in the following two examples:

coll11 = db11.dataset
coll11 = db11['dataset']

Insert Data
You can place a document into a collection that doesn’t exist, and the following operation will
create the collection:

result = db11.address.insert_one(<<your JSON document>>)

Update Data
Here is how to update data:

result = db11.address.update_one(
    {"address.building": "129"},
    {"$set": {"address.street": "MG Road"}}
)
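The key "address.street" uses dot notation to reach a field inside an embedded document. The effect can be sketched on a plain Python dictionary. The set_path helper below is a hypothetical illustration of the idea, not part of pymongo and not how MongoDB implements $set.

```python
def set_path(document, dotted_key, value):
    """Set a nested field addressed with dot notation, creating levels as needed."""
    *parents, leaf = dotted_key.split(".")
    node = document
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return document

record = {
    "address": {"street": "Siraj Mondal Lane", "pincode": "743145", "building": "129"},
    "borough": "Manhattan",
}
set_path(record, "address.street", "MG Road")
print(record["address"])
```

Only the addressed field changes; the sibling fields of the embedded document are left as they were.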

Remove Data
To remove all documents from a collection, use this:

result = db11.restaurants.delete_many({})

Cloud Databases
Even though the cloud has its own chapter, we’d like to provide you with an overview of cloud
databases, particularly databases for big data. People prefer cloud databases when they want
their systems to scale automatically. Google BigQuery is a strong choice for querying your data.
Azure Synapse has a similar feature; however, it is significantly more expensive. You can store
data on S3, but if you want to run a query, you’ll need Athena, which is expensive. So, in modern
practice, data is often stored as a blob in S3, and everything is done in a Python application. The
drawback of this method is that tracking down an error in the data takes a long time. Amazon
Redshift can also handle a considerable quantity of big data and comes with a built-in BI tool.

Pandas
The goal of this section is to show some examples to enable you to begin using Pandas. These
illustrations have been taken from real-world data, along with any bugs and weirdness that are
inherent. Pandas is a framework inspired by the R data frame concept.
Please find the CSV file at the following link:

https://github.com/Apress/advanced-data-analytics-python-2e

To read data from a CSV file, use this:

import pandas as pd
broken_df=pd.read_csv('fetaure_engineering_data.csv')

To look at the first three rows, use this:

broken_df[:3]

To select a column, use this:

broken_df['MSSubClass']

To plot a column, use this:

broken_df['MSSubClass'].plot()

To get the maximum value in a column, use this, where MSSubClass is the column header:

MaxValue = broken_df['MSSubClass'].max()

There are many other methods such as sort_values and groupby in Pandas that are
useful when playing with structured data. Also, Pandas can load data directly from popular
databases, for example through read_sql for SQL databases.
One complex example with Pandas is shown next. For each column of the X data frame other than
root, the code replaces every distinct value with the average value of floor over the rows of df
that share that value and the same root.

import numpy as np

for col in X.columns:
    if col != 'root':
        avgs = df.groupby([col, 'root'], as_index=False)['floor'].aggregate(np.mean)
        for i, row in avgs.iterrows():
            k = row[col]
            v = row['floor']
            r = row['root']
            X.loc[(X[col] == k) & (X['root'] == r), col] = v
You can do any experiment in the Pandas framework with the data given for classification and
regression problems.

ETL with Python (Unstructured Data)


Dealing with unstructured data is an important task in modern data analysis. In this section, I will
cover how to parse emails, and I’ll introduce an advanced research topic called topical crawling.

Email Parsing
See Chapter 1 for a complete example of web crawling using Python.
Just as Beautiful Soup parses web pages, Python has a library for email parsing. The following is
example code to parse email data stored on a mail server. The inputs in the configuration are the
username and the number of mails already parsed for that user.
In the config file, you have to specify the mail path, the email folder, and, for each email user, the
index of the last mail parsed. For every newer mail, the code writes the From address, the To
address, the subject, the date, and the body of the email as one comma-separated line.

from email.parser import Parser
import os
import sys

conf = open(sys.argv[1])
config = {}
users = {}

# parsing the config file

for line in conf:
    if "," in line:
        fields = line.split(",")
        key = fields[0].strip().split("=")[1].strip()
        val = fields[1].strip().split("=")[1].strip()
        users[key] = val
    elif "=" in line:
        words = line.strip().split('=')
        config[words[0].strip()] = words[1].strip()
conf.close()

# extracting information from user email

for usr in users.keys():
    path = config["path"] + "/" + usr + "/" + config["folder"]
    files = os.listdir(path)
    # sort numerically so that the last-parsed index only moves forward
    for f in sorted(files, key=int):
        if int(f) > int(users[usr]):
            users[usr] = f
            path1 = path + "/" + f
            data = ""
            with open(path1) as myfile:
                data = myfile.read()
            if data != "":
                parser = Parser()
                email = parser.parsestr(data)
                out = ""
                out = (out + str(email.get('From')) + "," +
                       str(email.get('To')) + "," + str(email.get('Subject')) + "," +
                       str(email.get('Date')).replace(",", " "))
                if email.is_multipart():
                    for part in email.get_payload():
                        out = (out + "," +
                               str(part.get_payload()).replace("\n", " ").replace(",", " "))
                else:
                    out = (out + "," +
                           str(email.get_payload()).replace("\n", " ").replace(",", " "))
                print(out, "\n")

# updating the config file

conf = open(sys.argv[1], 'w')
conf.write("path=" + config["path"] + "\n")
conf.write("folder=" + config["folder"] + "\n")
for usr in users.keys():
    conf.write("name=" + usr + ",value=" + users[usr] + "\n")
conf.close()

The following is a sample config file for the preceding code:

path=/cygdrive/c/share/enron_mail_20110402/enron_mail_20110402/maildir
folder=Inbox
name=storey-g,value=142
name=ybarbo-p,value=775
name=tycholiz-b,value=602

Topical Crawling
Topical crawlers are intelligent crawlers that retrieve information about a given topic from
anywhere on the Web. They start with a URL and find the links present in the pages under it; then
they follow the new URLs. This bypasses the scalability limitations of universal search engines,
because the crawling process is distributed across users, queries, and even client computers.
Crawlers can use the available context to keep following links with the goal of systematically
locating highly relevant, focused pages.
Web searching is a complicated task. A large amount of machine learning work is applied to
finding the similarity between pages, and a crawl is usually bounded by a limit such as the
maximum number of URLs fetched or visited.

Crawling Algorithms
Figure 2-2 describes how the topical crawling algorithm works with its major components.
Figure 2-2 Topical crawling described
The starting URL of a topical crawler is known as the seed URL. There is another set of URLs
known as the target URLs, which are examples of desired output.
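The idea of starting from a seed URL and following the most promising links first can be sketched as a best-first search. The following toy example is an assumption-laden stand-in for a real crawler: the PAGES dictionary replaces the Web, a set of topic words replaces the target URLs, and word overlap replaces a learned similarity measure.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links)
PAGES = {
    "seed": ("python data news", ["a", "b"]),
    "a": ("cooking recipes and food", ["c"]),
    "b": ("python etl data pipelines", ["d"]),
    "c": ("travel food blog", []),
    "d": ("etl with python and mysql data", []),
}

def score(text, topic_words):
    """Fraction of topic words that occur in the page text."""
    words = set(text.split())
    return len(words & topic_words) / len(topic_words)

def topical_crawl(seed_url, topic_words, max_pages=4):
    """Visit the most promising known link first, up to max_pages pages."""
    frontier = [(-1.0, seed_url)]  # max-heap via negated priority
    visited = []
    seen = {seed_url}
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = PAGES[url]
        relevance = score(text, topic_words)
        visited.append((url, relevance))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Links inherit the relevance of the page they were found on
                heapq.heappush(frontier, (-relevance, link))
    return visited

visited = topical_crawl("seed", {"python", "etl", "data"})
order = [url for url, _ in visited]
print(order)
```

With a budget of four pages, the link found on the relevant page b is followed before the link found on the off-topic page a, so page c is never fetched.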
Another intriguing application of crawling is for a startup that wants to uncover crucial
keywords for every IP address. They acquire the user’s browsing history, taken from HTTP packet
headers, from the Internet service provider. After crawling the URLs visited by that IP, they
classify the words in the text using named-entity recognition (Stanford NLP), which can also be
implemented with the RNN explained in Chapter 5. All named entities and their types, such as
names of people, locations, and organizations, are recommended for the user.

import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger
import re
import json
import os
import socket
import struct

def ip2int(addr):
    # convert a dotted IPv4 address to an integer
    return struct.unpack("!I", socket.inet_aton(addr))[0]

def int2ip(addr):
    # convert an integer back to a dotted IPv4 address
    return socket.inet_ntoa(struct.pack("!I", addr))

java_path = '/usr/bin/java'
os.environ['JAVAHOME'] = java_path
os.environ['STANFORD_MODELS'] = '/home/ec2-user/stanford-ner.jar'
nltk.internals.config_java(java_path)

f = open("/home/ec2-user/data.csv")
res = []
stop_words = set(stopwords.words('english'))

for line in f:
    fields = line.strip().split(",")
    url = fields[1]
    ip = fields[-1]
    print(ip)
    print(url)
    tags_del = None
    # skip rows whose last field is not a valid IPv4 address
    try:
        ip = ip2int(ip)
    except Exception:
        continue
    print(ip)
    tagged = None
    try:
        # fetch the page and strip the markup
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, "html.parser")
        tags_del = s.get_text()
        if tags_del is None:
            continue
        no_html = re.sub('<[^>]*>', '', tags_del)
        # tag every token with its named-entity type
        st = StanfordNERTagger('/home/ec2-user/english.all.3class.distsim.crf.ser.gz',
                               '/home/ec2-user/stanford-ner.jar')
        tokenized = word_tokenize(no_html)
        tagged = st.tag(tokenized)
    except Exception:
        pass
    if tagged is None:
        continue
    for t in tagged:
        t = list(t)
        t[0] = t[0].replace(' ', '')
        t[-1] = t[-1].replace(' ', '')
        print(t)
        if t[0] in stop_words:
            continue
        unit = {}
        unit["ip"] = ip
        unit["word"] = t[0]
        unit["name_entity"] = t[-1]
        res.append(unit)

res_final = {}
res_final["result"] = res

with open('result.json', 'w') as fp:
    json.dump(res, fp)

Summary
In this chapter, we discussed different kinds of databases and their use cases, and we discussed
collecting text data from the Web and extracting information from different types of unstructured
data like email and web pages.
Another random document with
no related content on Scribd:
as we count sleep, but it is awake at last and its every
member is tingling with Chinese feeling—'China for the
Chinese and out with the foreigners!'

"The Boxer movement is doubtless the product of official


inspiration, but it has taken hold of the popular imagination
and will spread like wildfire all over the length and breadth
of the country: it is, in short, a purely patriotic volunteer
movement, and its object is to strengthen China—and for a
Chinese programme. Its first experience has not been
altogether a success as regards the attainment through
strength of proposed ends,—the rooting up of foreign cults and
the ejection of foreigners, but it is not a failure in respect of
the feeler it put out—will volunteering work?—or as an
experiment that would test ways and means and guide future
choice: it has proved how to a man the people will respond to
the call, and it has further demonstrated that the swords and
spears to which the prudent official mind confined the
initiated will not suffice, but must be supplemented or
replaced by Mauser rifles and Krupp guns: the Boxer patriot of
the future will possess the best weapons money can buy, and
then the 'Yellow Peril' will be beyond ignoring."

Robert Hart,
The Peking Legations
(Fortnightly Review, November, 1900).

CHINA: A. D. 1900 (March-April).
Proposed joint naval demonstration of the Powers
in Chinese waters.

On receipt of the telegram from Peking (March 10) recommending
a joint naval demonstration in North Chinese waters, the
British Ambassador at Paris was directed to consult the
Government of France on the subject, and did so. On the 13th,
he reported M. Delcassé, the French Minister for Foreign
Affairs, as saying that "he could not, of course, without
reflection and without consulting his colleagues, say what the
decision of the French Government would be as to taking part
in a naval demonstration, but at first sight it seemed to him
that it would be difficult to avoid acting upon a suggestion
which the Representatives of Five Powers, who ought to be good
judges, considered advisable." On the 16th, he wrote to Lord
Salisbury: "M. Delcassé informed me the day before yesterday
that he had telegraphed to Peking for more precise
information. I told him that I was glad to hear that no
precipitate action was going to be taken by France, and that I
believed that he would find that the United States' Government
would be disinclined to associate themselves with any joint
naval demonstration. I added that, although I had no
instructions to say so, I expected that Her Majesty's
Government would also adhere to their usual policy of
proceeding with great caution, and would be in no hurry to
take a step which only urgent necessity would render
advisable."

On the 23d of March, Sir Claude MacDonald telegraphed to Lord
Salisbury: "I learn that the Government of the United States
have ordered one ship-of-war to go to Taku for the purpose of
protecting American interests, that the Italian Minister has
been given the disposal of two ships, and the German Minister
has the use of the squadron at Kiao-chau for the same purpose.
With a view to protect British missionary as well as other
interests, which are far in excess of those of other Powers, I
would respectfully request that two of Her Majesty's ships be
sent to Taku."

On the 3d of April, the Tsung-li Yamên communicated to the
British Ambassador the following information, as to the
punishment of the murderers of Mr. Brooks, and of the
officials responsible for neglect to protect him: "Of several
arrests that had been made of persons accused of having been
the perpetrators of the crime or otherwise concerned in its
committal, two have been brought to justice and, at a trial at
which a British Consul was present, found guilty and sentenced
to be decapitated—a sentence which has already been carried
into effect. Besides this, the Magistrate of Feichen, and some
of the police authorities of the district, accounted to have
been guilty of culpable negligence in the protection of Mr.
Brooks, have been cashiered, or had other punishments awarded
them of different degrees of severity."

For some weeks after this the Boxer movement appears to have
been under constraint. Further outrages were not reported and
no expressions of anxiety appear in the despatches from
Peking. The proposal of a joint naval demonstration in the
waters of Northern China was not pressed.

Great Britain, Papers by Command:
China, Number 3, 1900, pages 6-17.

CHINA: A. D. 1900 (May-June).
Renewed activity of the "Boxers" and increasing gravity of
the situation at Peking.
Return of Legation guards.
Call upon the fleets at Taku for reinforcement and rescue.

About the middle of May the activity of the "Boxers" was
renewed, and a state of disorder far more threatening than
before was speedily made known. The rapid succession of
startling events during the next few weeks may be traced in
the following series of telegrams from the British Minister at
Peking to his chief:

"May 17.
The French Minister called to-day to inform me that the Boxers
have destroyed three villages and killed 61 Roman Catholic
Christian converts at a place 90 miles from Peking, near
Paoting-fu. The French Bishop informs me that in that
district, and around Tien-tsin and Peking generally, much
disorder prevails."

"May 18.
There was a report yesterday, which has been confirmed to-day,
that the Boxers have destroyed the London Mission chapel at
Kung-tsun, and killed the Chinese preacher. Kung-tsun is about
40 miles south-west of Peking."

"May 19.
At the Yamên, yesterday, I reminded the Ministers how I had
unceasingly warned them during the last six months how
dangerous it was not to take adequate measures in suppression
of the Boxer Societies. I said that the result of the apathy
of the Chinese Government was that now a Mission chapel, a few
miles distant from the capital, had been destroyed. The
Ministers admitted that the danger of the Boxer movement had
not previously appeared to them so urgent, but that now they
fully saw how serious it was. On the previous day an Imperial
Decree had been issued, whereby specified metropolitan and
provincial authorities were directed to adopt stringent
measures to suppress the Boxers. This, they believed, would
not fail to have the desired effect."

"May 21.
All eleven foreign Representatives attended a meeting of the
Diplomatic Body held yesterday afternoon, at the instance of
the French Minister. The doyen was empowered to write, in the
name of all the foreign Representatives, a note to the Yamên
to the effect that the Diplomatic Body, basing their demands
on the Decrees already issued by the Palace denunciatory of
the Boxers, requested that all persons who should print,
publish, or disseminate placards which menaced foreigners, all
individuals aiding and abetting, all owners of houses or
temples now used as meeting places for Boxers, should be
arrested. They also demanded that those guilty of arson,
murder, outrages, &c., together with those affording support
or direction to Boxers while committing such outrages, should
be executed. Finally, the publication of a Decree in Peking
and the Northern Provinces setting forth the above. The
foreign Representatives decided at their meeting to take
further measures if the disturbances still continued, or if a
favorable answer was not received to their note within five
days. The meeting did not decide what measures should be
taken, but the Representatives were generally averse to
bringing guards to Peking, and, what found most favour, was as
follows:—

With the exception of Holland, which has no ships in Chinese
waters, it was proposed that all the Maritime Powers
represented should make a naval demonstration either at
Shanhaikuan, or at the new port, Ching-wangtao, while, in case
of necessity, guards were to be held ready on board ship. My
colleagues will, I think, send these proposals as they stand
to their governments. As the Chinese Government themselves
seem to be sufficiently alarmed, I do not think that the above
measure will be necessary, but, should the occasion arise, I
trust that Her Majesty's Government will see fit to support
it. … I had a private interview with my Russian colleague, who
came to see me before the matter reached its acute stages. M. de
Giers said that there were only two countries with serious
interests in China: England and Russia. He thought that both
landing guards and naval demonstrations were to be
discouraged, as they give rise to unknown eventualities.
However, since the 18th instant, he admits that matters are
grave, and agreed at once to the joint note."

"May 24.
Her Majesty's Consul at Tien-tsin reported by telegraph
yesterday that a Colonel in charge of a party of the Viceroy's
cavalry was caught, on the 22nd instant, in an ambuscade near
Lai-shui, which is about 50 miles south-west of Peking. The
party were destroyed."

"May 25.
Tsung-li Yamên have replied to the note sent by the doyen of
the Corps Diplomatique, reported in my telegram of the 21st
May. They state that the main lines of the measures already in
force agree with those required by the foreign
Representatives, and add that a further Decree, which will
direct efficacious action, is being asked for. The above does
not even promise efficacious action, and, in my personal
opinion, is unsatisfactory."

"May 27.
At the meeting of the Corps Diplomatique, which took place
yesterday evening, we were informed by the French Minister
that all his information led him to believe that a serious
outbreak, which would endanger the lives of all European
residents in Peking, was on the point of breaking out. The
Italian Minister confirmed the information received by M.
Pichon. The Russian Minister agreed with his Italian and
French colleagues in considering the latest reply of the Yamên
to be unsatisfactory, adding that, in his opinion, the Chinese
Government was now about to adopt effective measures. That the
danger was imminent he doubted, but said that it was not
possible to disregard the evidence adduced by the French
Minister. We all agreed with this last remark. M. Pichon then
urged that if the Chinese Government did not at once take
action guards should at once be brought up by the foreign
Representatives. Some discussion then ensued, after which it
was determined that a precise statement should be demanded
from the Yamên as to the measures they had taken, also that
the terms of the Edict mentioned by them should be
communicated to the foreign Representatives. Failing a reply
from the Yamên of a satisfactory nature by this afternoon, it
was resolved that guards should be sent for. Baron von
Ketteler, the German Minister, declared that he considered the
Chinese Government was crumbling to pieces, and that he did
not believe that any action based on the assumption of their
stability could be efficacious. The French Minister is, I am
certain, genuinely convinced that the danger is real, and
owing to his means of information he is well qualified to
judge. … I had an interview with Prince Ch'ing and the Yamên
Ministers this afternoon. Energetic measures are now being
taken against the Boxers by the Government, whom the progress
of the Boxer movement has, at last, thoroughly alarmed. The
Corps Diplomatique, who met in the course of the day, have
decided to wait another twenty-four hours for further
developments."

"May 29.
Some stations on the line, among others Yengtai, 6 miles from
Peking, together with machine sheds and European houses, were
burnt yesterday by the Boxers. The line has also been torn up
in places. Trains between this and Tien-tsin have stopped
running, and traffic has not been resumed yet. The situation
here is serious, and so far the Imperial troops have done
nothing. It was unanimously decided, at a meeting of foreign
Representatives yesterday, to send for guards for the
Legations, in view of the apathy of the Chinese Government and
the gravity of the situation. Before the meeting assembled,
the French Minister had already sent for his."

"May 30.
Permission for the guards to come to Peking has been refused
by the Yamên. I think, however, that they may not persist in
their refusal. The situation in the meantime is one of extreme
gravity. The people are very excited, and the soldiers
mutinous. Without doubt it is now a question of European life
and property being in danger here. The French and Russians are
landing 100 men each. French, Russian, and United States'
Ministers, and myself, were deputed to-day at a meeting of the
foreign Representatives to declare to the Tsung-li Yamên that
the foreign Representatives must immediately bring up guards
for the protection of the lives of Europeans in Peking in view
of the serious situation and untrustworthiness of the Chinese
troops. That the number would be small if facilities were
granted, but it must be augmented should they be refused, and
serious consequences might result for the Chinese Government
in the latter event. In reply, the Yamên stated that no
definite reply could be given until to-morrow afternoon, as
the Prince was at the Summer Palace. As the Summer Palace is
within an hour's ride we refused to admit the impossibility of
prompt communication and decision, and repeated the warning
already given of the serious consequences which would result
if the Viceroy at Tien-tsin did not receive instructions this
evening in order that the guards might be enabled to arrive
here to-morrow. The danger will be greatest on Friday, which
is a Chinese festival."

"May 31.
Provided that the number does not exceed that of thirty for
each Legation, as on the last occasion, the Yamên have given
their consent to the guards coming to Peking. … It was decided
this morning, at a meeting of the foreign Representatives, to
at once bring up the guards that are ready. These probably
include the British, American, Italian, and Japanese."

"June 1.
British, American, Italian, Russian, French and Japanese
guards arrived yesterday. Facilities were given, and there
were no disturbances. Our detachment consists of three
officers and seventy-five men, and a machine gun."

"June 2.
The city is comparatively quiet, but murders of Christian
converts and the destruction of missionary property in
outlying districts occur every day, and the situation still
remains serious. The situation at the Palace is, I learn from
a reliable authority, very strained. The Empress-Dowager does
not dare to put down the Boxers, although wishing to do so, on
account of the support given them by Prince Tuan, father of
the hereditary Prince, and other conservative Manchus, and
also because of their numbers. Thirty Europeans, most of whom
were Belgians, fled from Paoting-fu via the river to
Tien-tsin. About 20 miles from Tien-tsin they were attacked by
Boxers.
A party of Europeans having gone to their rescue from
Tien-tsin severe fighting ensued, in which a large number of
Boxers were killed. Nine of the party are still missing,
including one lady. The rest have been brought into Tien-tsin.
The Russian Minister, who came to see me to-day, said he
thought it most imperative that the foreign Representatives
should be prepared for all eventualities, though he had no
news confirming the above report. He said he had been
authorized by his Government to support any Chinese authority
at Peking which was able and willing to maintain order in case
the Government collapsed."

"June 4.
I am informed by a Chinese courier who arrived to-day from
Yung-Ching, 40 miles south of Peking, that on the 1st June the
Church of England Mission at that place was attacked by the
Boxers. He states that one missionary, Mr. Robinson, was
murdered, and that he saw his body, and that another, Mr.
Norman, was carried off by the Boxers. I am insisting on the
Chinese authorities taking immediate measures to effect his
rescue. Present situation at Peking is such that we may at any
time be besieged here with the railway and telegraph lines
cut. In the event of this occurring, I beg your Lordship will
cause urgent instructions to be sent to Admiral Seymour to
consult with the officers commanding the other foreign
squadrons now at Taku to take concerted measures for our
relief. The above was agreed to at a meeting held to-day by
the foreign Representatives, and a similar telegram was sent
to their respective Governments by the Ministers of Austria,
Italy, Germany, France, Japan, Russia, and the United States,
all of whom have ships at Taku and guards here. The telegram
was proposed by the French Minister and carried unanimously.
It is difficult to say whether the situation is as grave as
the latter supposes, but the apathy of the Chinese Government
makes it very serious."

"June 5.
I went this afternoon to the Yamên to inquire of the Ministers
personally what steps the Chinese Government proposed to take
to effect the punishment of Mr. Robinson's murderers and the
release of Mr. Norman. I was informed by the Ministers that
the Viceroy was the responsible person, that they had
telegraphed to him to send troops to the spot, and that that
was all they were able to do in the matter. They did not
express regret or show the least anxiety to effect the relief
of the imprisoned man, and they displayed the greatest
indifference during the interview. I informed them that the
Chinese Government would be held responsible by Her Majesty's
Government for the criminal apathy which had brought about
this disgraceful state of affairs. I then demanded an
interview with Prince Ching, which is fixed for to-morrow, as
I found it useless to discuss the matter with the Yamên. This
afternoon I had an interview with the Prince and Ministers of
the Yamên. They expressed much regret at the murder of Messrs.
Robinson and Norman, and their tone was fully satisfactory in
this respect. … No attempt was made by the Prince to defend
the Chinese Government, nor to deny what I had said. He could
say nothing to reassure me as to the safety of the city, and
admitted that the Government was reluctant to deal harshly
with the movement, which, owing to its anti-foreign character,
was popular. He stated that they were bringing 6,000 soldiers
from near Tien-tsin for the protection of the railway, but it
was evident that he doubted whether they would be allowed to
fire on the Boxers except in the defence of Government
property, or if authorized whether they would obey. He gave me
to understand, without saying so directly, that he has
entirely failed to induce the Court to accept his own views as
to the danger of inaction. It was clear, in fact, that the Yamên
wished me to understand that the situation was most serious,
and that, owing to the influence of ignorant advisers with the
Empress-Dowager, they were powerless to remedy it."

"June 6.
Since the interview with the Yamên reported in my preceding
telegram I have seen several of my colleagues. I find they all
agree that, owing to the now evident sympathy of the
Empress-Dowager and the more conservative of her advisers with
the anti-foreign movement, the situation is rapidly growing
more serious. Should there be no change in the attitude of the
Empress, a rising in the city, ending in anarchy, which may
produce rebellion in the provinces, will be the result,
'failing an armed occupation of Peking by one or more of the
Powers.' Our ordinary means of pressure on the Chinese
Government fail, as the Yamên is, by general consent, and
their own admission, powerless to persuade the Court to take
serious measures of repression. Direct representations to the
Emperor and Dowager-Empress from the Corps Diplomatique at a
special audience seems to be the only remaining chance of
impressing the Court."

"June 7.
There is a long Decree in the 'Gazette' which ascribes the
recent trouble to the favour shown to converts in law suits
and the admission to their ranks of bad characters. It states
that the Boxers, who are the objects of the Throne's sympathy
equally with the converts, have made use of the anti-Christian
feeling aroused by these causes, and that bad characters among
them have destroyed chapels and railways which are the
property of the State. Unless the ringleaders among such bad
characters are now surrendered by the Boxers they will be
dealt with as disloyal subjects, and will be exterminated.
Authorization will be given to the Generals to effect arrests,
exercising discrimination between leaders and their followers.
It is probable that the above Decree represents a compromise
between the conflicting opinions which exist at Court. The
general tone is most unsatisfactory, though the effect may be
good if severe measures are actually taken. The general
lenient tone, the absence of reference to the murder of
missionaries, and the justification of the proceedings of the
Boxers by the misconduct of Christian converts are all
dangerous factors in the case."

"June 8.
A very bad effect has been produced by the Decree reported in
my immediately preceding telegram. There is no prohibition of
the Boxers drilling, which they now openly do in the houses of
the Manchu nobility and in the temples. This Legation is full
of British refugees, mostly women and children, and the London
and Church of England Missions have been abandoned. I trust
that the instructions requested in my telegrams of the 4th and
5th instant have been sent to the Admiral. I have received the
following telegram, dated noon to-day, from Her Majesty's Consul
at Tien-tsin:

'By now the Boxers must be near Yang-tsun. Last night the
bridge, which is outside that station, was seen to be on fire.
General Nieh's forces are being withdrawn to Lutai, and 1,500
of them have already passed through by railway. There are now
at Yang-tsun an engine and trucks ready to take 2,000 more
men.' Lutai lies on the other side of Tien-tsin, and at some
distance. Should this information be correct, it means that an
attempt to protect Peking has been abandoned by the only force
on which the Yamên profess to place any reliance. The 6,000
men mentioned in my telegram
of the 5th instant were commanded by General Nieh."

"Tong-ku, June 10.
Vice-Admiral Sir E. Seymour to Admiralty.
Following telegram received from Minister at Peking:
'Situation extremely grave. Unless arrangements are made for
immediate advance to Peking it will be too late.'

"In consequence of above, I am landing at once with all
available men, and have asked foreign officers' co-operation."

Great Britain, Papers by Command:
China, Number 3, 1900, pages 26-45.

CHINA: A. D. 1900 (June 10-26).
Bombardment and capture of Taku forts by the allied fleets.
Failure of first relief expedition started for Peking.

The following is from an official report by Rear-Admiral
Bruce of the British Navy, dated at Taku June 17, 1900:

"On my arrival here on the 11th inst. I found a large fleet,
consisting of Russian, German, French, Austrian, Italian,
Japanese, and British ships. In consequence of an urgent
telegram from Her Majesty's Minister at Peking, Vice-Admiral
Sir Edward H. Seymour, K. C. B., Commander-in-Chief, had
started at 3 o'clock the previous morning (10th June), taking
with him a force of 1,375 of all ranks, being reinforced by
men from the allied ships as they arrived, until he commanded
not less than 2,000 men. At a distance of some 20 to 30 miles
from Tientsin—but it is very difficult to locate the place, as
no authentic record has come in—he found the railway destroyed
and sleepers burned, &c., and every impediment made by
supposed Boxers to his advance. Then his difficulties began,
and it is supposed that the Boxers, probably assisted by
Chinese troops, closed in on his rear, destroyed
railway-lines, bridges, &c., and nothing since the 13th inst.
has passed from Commander-in-Chief and his relief force and
Tientsin, nor vice versa up to this date. …

"During the night of the 14th inst. news was received that all
railway-carriages and other rolling stock had been ordered to
be sent up the line for the purpose of bringing down a Chinese
army to Tong-ku. On receipt of this serious information a
council of Admirals was summoned by Vice-Admiral Hiltebrandt,
Commander-in-Chief of the Russian Squadron, and the German,
French, United States Admirals, myself, and the Senior
Officers of Italy, Austria, and Japan attended; and it was
decided to send immediate orders to the captains of the allied
vessels in the Peiho River (three Russian, two German, one
United States, one Japanese, one British—'Algerine') to
prevent any railway plant being taken away from Tong-ku, or
the Chinese army reaching that place, which would cut off our
communication with Tientsin; and in the event of either being
attempted they were to use force to prevent it, and to destroy
the Taku Forts. By the evening, and during the night of 15th
inst., information arrived that the mouth of the Peiho River
was being protected by electric mines. On receipt of this,
another council composed of the same naval officers was held
in the forenoon of 16th June on board the 'Rossia,' and in
consequence of the gravity of the situation, and information
having also arrived that the forts were being provisioned and
reinforced, immediate notice was sent to the Viceroy of Chili
at Tientsin and the commandant of the forts that, in
consequence of the danger to our forces up the river, at
Tientsin, and on the march to Peking by the action of the
Chinese authorities, we proposed to temporarily occupy the
Taku Forts, with or without their good will, at 2 a.m. on the
17th inst." Early on Sunday, 17th June, "the Taku Forts opened
fire on the allied ships in the Peiho River, which continued
almost without intermission until 6.30 a.m., when all firing
had practically ceased and the Taku Forts were stormed and in
the hands of the Allied Powers, allowing of free communication
with Tientsin by water, and rail when the latter is repaired."

The American Admiral took no part in this attack on the forts
at Taku, "on the ground that we were not at war with China and
that a hostile demonstration might consolidate the
anti-foreign elements and strengthen the Boxers to oppose the
relieving column."

From the point to which the allied expedition led by Admiral
Seymour fought its way, and at which it was stopped by the
increasing numbers that opposed it, it fell back to a position
near Hsiku, on the right bank of the Peiho. There the allies
drove the Chinese forces from an imperial armory and took
possession of the buildings, which gave them a strong
defensive position, with a large store of rice for food, and
enabled them to hold their ground until help came to them from
Tientsin, on the 25th. They were encumbered with no less than 230
wounded men, which made it impossible for them, in the
circumstances, to fight their way back without aid; though the
distance was so short that the return march was accomplished, on
the 26th, between 3 o'clock and 9 of the same morning. In his
report made the following day Admiral Seymour says: "The
number of enemy engaged against us in the march from Yungtsin
to the Armoury near Hsiku cannot be even estimated; the
country alongside the river banks is quite flat, and consisted
of a succession of villages of mud huts, those on the
out-skirts having enclosures made of dried reeds; outside,
high reeds were generally growing in patches near the village,
and although trees are very scarce away from the River,
alongside it they are very numerous; these with the graves,
embankments for irrigation and against flood, afforded cover
to the enemy from which they seldom exposed themselves,
withdrawing on our near approach. Had their fire not been
generally high it would have been much more destructive than
it was. The number of the enemy certainly increased gradually
until the Armoury near Hsiku was reached, when General Nieh's
troops and the Boxers both joined in the attack. In the early
part of the expedition the Boxers were mostly armed with
swords and spears, and not with many firearms; at the
engagement at Langfang on 18th, and afterwards, they were
armed with rifles of late pattern; this together with banners
captured and uniform worn, shows that they had either the
active or covert support of the Chinese Government, or some of
its high officials."

CHINA: A. D. 1900 (June 11-29).
Chinese Imperial Edicts.

"On June 11 Mr. Sugiyama, the Chancellor of the Japanese
Legation, was brutally murdered [in Peking] by the soldiers of
General Tung-fuh-siang. Two days later the following Imperial
edict was published in the 'Peking Gazette': 'On June 11 the
Japanese Chancellor was murdered by brigands outside the
Yung-ting Mên. On hearing this intelligence we were
exceedingly grieved. Officials of neighbouring nations
stationed in Peking ought to be protected in every possible
way, and now, especially, extra diligence ought to be
displayed to prevent such occurrences when banditti are as
numerous as bees. We have repeatedly commanded the local
officials to ensure the most efficient protection in their
districts, yet, in spite of our frequent orders, we have this
case of the murder of the Japanese Chancellor occurring in the
very capital of the Empire. The civil and military officials
have assuredly been remiss in not clearing their districts of
bad characters, or immediately arresting such persons, and we
hereby order every Yamên concerned to set a limit of time for
the arrest of the criminals, that they may suffer the extreme
penalty. Should the time expire without any arrest being
effected, the severest punishment will assuredly be inflicted
upon the responsible persons.' It is needless to add that the
'criminals' were never arrested and the 'responsible persons'
were never punished. In the same 'Gazette' another decree
condemns the 'Boxer brigands' who have recently been causing
trouble in the neighbourhood of the capital, who have been
committing arson and murder and revenging themselves upon the
native converts. Soldiers and 'Boxers,' it says, have leagued
together to commit acts of murder and arson, and have vied
with one another in disgraceful acts of looting and robbery.
The 'Boxers' are to disband, desperadoes are to be arrested,
ringleaders are to be seized, but the followers may be allowed
to disband.

"Similar decrees on the 14th and 15th show alarm at the result
of the 'Boxer' agitation and lawlessness within the city.
Nothing so strong against the 'Boxers' had previously been
published. Fires were approaching too closely to the Imperial
Palace. No steps had been taken by the Court to prevent the
massacre and burning of Christians and their property in the
country, but on the 16th the great Chien Mên gate fronting the
Palace had been burned and the smoke had swept over the
Imperial Courts. Yet even in these decrees leniency is shown
to the 'Boxers,' for they are not to be fired upon, but are,
if guilty, to be arrested and executed. On June 17th the edict
expresses the belief of the Throne that:—'All foreign
Ministers ought to be really protected. If the Ministers and
their families wish to go for a time to Tien-tsin, they must
be protected on the way. But the railroad is not now in
working order. If they go by the cart road it will be
difficult, and there is fear that perfect protection cannot be
offered. They would do better, therefore, to abide here in
peace as heretofore and wait till the railroad is repaired,
and then act as circumstances render expedient.'

"Two days later an ultimatum was sent to the Ministers
ordering them to leave Peking within 24 hours. On the 20th
Baron von Ketteler was murdered and on June 21 China
published, having entered upon war against the whole world,
her Apologia:—

'Ever since the foundation of the Dynasty, foreigners coming
to China have been kindly treated. In the reigns Tao Kuang,
and Hsien Feng, they were allowed to trade and they also asked
leave to propagate their religion, a request that the Throne
reluctantly granted. At first they were amenable to Chinese
control, but for the past 30 years they have taken advantage
of China's forbearance to encroach on China's territory and
trample on Chinese people and to demand China's wealth. Every
concession made by China increased their reliance on violence.
They oppressed peaceful citizens and insulted the gods and
holy men, exciting the most burning indignation among the
people. Hence the burning of chapels and slaughter of converts
by the patriotic braves. The Throne was anxious to avoid war, and
issued edicts enjoining the protection of Legations and pity
to the converts. The decrees declaring 'Boxers' and converts
to be equally the children of the State were issued in the
hope of removing the old feud between people and converts.
Extreme kindness was shown to the strangers from afar. But
these people knew no gratitude and increased their pressure. A
despatch was yesterday sent by Du Chaylard, calling us to
deliver up the Ta-ku Forts into their keeping, otherwise they
would be taken by force. These threats showed their aggressive
intention. In all matters relating to international
intercourse, we have never been wanting in courtesies to them,
but they, while styling themselves civilized States, have acted
without regard for right, relying solely on their military
force. We have now reigned nearly 30 years, and have treated
the people as our children, the people honouring us as their
deity, and in the midst of our reign we have been the
recipients of the gracious favour of the Empress-Dowager.
Furthermore, our ancestors have come to our aid, and the gods
have answered our call, and never has there been so universal
a manifestation of loyalty and patriotism. With tears have we
announced war in the ancestral shrines. Better to enter on the
struggle and do our utmost than seek some measures of
self-preservation involving eternal disgrace. All our
officials, high and low, are of one mind, and there have
assembled without official summons several hundred thousand
patriotic soldiers (I Ping "Boxers"). Even children carrying
spears in the service of the State. Those others relying on
crafty schemes, our trust is in Heaven's justice. They depend
on violence, we on humanity. Not to speak of the righteousness
of our cause, our provinces number more than 20, our people over
400,000,000, and it will not be difficult to vindicate the
dignity of our country.' The decree concludes by promising
heavy rewards to those who distinguish themselves in battle or
subscribe funds, and threatening punishment to those who show
cowardice or act treacherously.

"In the same 'Gazette' Yü Lu reports acts of war on the part
of the foreigners, when, after some days' fighting, he was
victorious. 'Perusal of his memorial has given us great
comfort,' says the Throne. Warm praise is given to the
'Boxers,' 'who have done great service without any assistance
either of men or money from the State. Marked favour will be
shown them later on, and they must continue to show their
devotion.' On the 24th presents of rice are sent to the
'Boxers.' Leaders of the 'Boxers' are appointed by the
Throne—namely, Prince Chuang, and the Assistant Grand
Secretary Kang-Yi to be in chief command, and Ying Nien and
Duke Lan (the brother of Prince Tuan, the father of the Crown
Prince) to act in cooperation with them, while another high
post is given to Wen Jui."

London Times, October 16, 1900
(Peking Correspondence).

Very different in tone to the imperial decree of June 21,
quoted above, was one issued a week later (June 29), and sent
to the diplomatic representatives of the Chinese government in
Europe and America. As published by Minister Wu Ting-fang, at
Washington, on the 11th of July, it was in the following
words:

"The circumstances which led to the commencement of fighting
between Chinese and foreigners were of such a complex,
confusing and unfortunate character as to be entirely
unexpected. Our diplomatic representatives abroad, owing to
their distance from the scene of action, have had no means of
knowing the true state of things, and accordingly cannot lay
the views of the government before the ministers for foreign
affairs of the respective Powers to which they are accredited.
Now we take this opportunity of going fully into the matter
