(Download PDF) Advanced Data Analytics Using Python With Architectural Patterns Text and Image Classification and Optimization Techniques 2Nd Edition Sayan Mukhopadhyay Full Chapter PDF
Sayan Mukhopadhyay and Pratip Samanta
Pratip Samanta
Kolkata, West Bengal, India
Apress Standard
The publisher, the authors and the editors are safe to assume that the
advice and information in this book are believed to be true and accurate
at the date of publication. Neither the publisher nor the authors or the
editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been
made.
Pratip Samanta
is a principal AI engineer/researcher
with more than 11 years of experience.
He has worked for several software
companies and research institutions. He
has published conference papers and has
been granted patents in AI and natural
language processing. He is also
passionate about gardening and
teaching.
About the Technical Reviewer
Joos Korstanje
is a data scientist with more than five
years of industry experience in
developing machine learning tools, of
which a large part is forecasting models.
He currently works at Disneyland Paris
where he develops machine learning for
a variety of tools.
© Sayan Mukhopadhyay, Pratip Samanta 2023
S. Mukhopadhyay, P. Samanta, Advanced Data Analytics Using Python
https://doi.org/10.1007/978-1-4842-8005-8_1
In this book, we assume that you are familiar with Python programming. In this introductory chapter, we
explain why a data scientist should choose Python as a programming language. Then we highlight some
situations where Python may not be the ideal choice. Finally, we describe some best practices for application
development and give some coding examples that a data scientist may need in their day-to-day job.
OOP in Python
In this section, we explain some features of object-oriented programming (OOP) in a Python context.
The most basic element of any modern application is an object. To a programmer or architect, the world
is a collection of objects. Objects consist of two types of members: attributes and methods. Members can be
private, public, or protected. Classes are data types of objects. Every object is an instance of a class. A class
can be inherited in child classes. Two classes can be associated using composition.
Python has no keywords for public, private, or protected, so encapsulation (hiding a member from the
outside world) is not enforced by the language; it relies on naming conventions such as a leading
underscore. Like C++, it supports multilevel and multiple inheritance. Unlike Java, it has no abstract
keyword; abstract classes and methods are declared with the abc module from the standard library.
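The points above can be illustrated with a minimal sketch. The class names Shape and Square are hypothetical and do not come from the book's code; the sketch only assumes the standard library abc module and Python's underscore naming conventions.

```python
from abc import ABC, abstractmethod


class Shape(ABC):
    def __init__(self, name):
        self.name = name          # public by convention
        self._sides = 0           # "protected" by convention only
        self.__secret = "hidden"  # name-mangled to _Shape__secret

    @abstractmethod
    def area(self):
        """Subclasses must implement this method."""


class Square(Shape):
    def __init__(self, side):
        super().__init__("square")
        self._sides = 4
        self.side = side

    def area(self):
        return self.side ** 2


sq = Square(3)
print(sq.area())
# The "private" attribute is still reachable through its mangled name.
print(sq._Shape__secret)
try:
    Shape("abstract")
except TypeError as err:
    # A class with an unimplemented abstract method cannot be instantiated.
    print("cannot instantiate:", err)
```

The sketch suggests that hiding in Python is a convention rather than a guarantee, while abstractness is checked at instantiation time.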
In the following code, we are describing an object-oriented question-answering system without any
machine learning. The program’s input is a set of dialogs in input.txt, as shown here:
glob is I
prok is V
pish is X
tegj is L
glob glob Silver is 34 Credits
glob prok Gold is 57800 Credits
pish pish Iron is 3910 Credits
how much is pish tegj glob glob ?
how many Credits is glob prok Silver ?
how many Credits is glob prok Gold ?
how many Credits is glob prok Iron ?
how much wood could a woodchuck chuck if a woodchuck could chuck wood?
The program has a knowledge base in config.txt:
I,1,roman
V,5,roman
X,10,roman
L,50,roman
C,100,roman
D,500,roman
M,1000,roman
Based on this input and the configuration, the program writes the answers to the questions in input.txt to
standard output, as shown here:
pish tegj glob glob is 42
glob prok Silver is 68 Credits
glob prok Gold is 57800 Credits
glob prok Iron is 782 Credits
I have no idea what you are talking about
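To make the arithmetic behind these answers concrete, here is a minimal sketch that is separate from the book's classes. The function names roman_to_int and alias_value are hypothetical; the symbol values mirror config.txt and the aliases mirror the first four lines of input.txt.

```python
# Values mirror the roman entries of config.txt.
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Aliases mirror the "glob is I" style lines of input.txt.
ALIASES = {"glob": "I", "prok": "V", "pish": "X", "tegj": "L"}


def roman_to_int(numeral):
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtract when a smaller symbol precedes a larger one (for example IV, XL).
        if i + 1 < len(numeral) and value < ROMAN[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def alias_value(words):
    return roman_to_int("".join(ALIASES[w] for w in words))


# "pish tegj glob glob" is XLII.
print(alias_value(["pish", "tegj", "glob", "glob"]))

# "glob glob Silver is 34 Credits" gives the price of one unit of Silver.
silver = 34 / alias_value(["glob", "glob"])
# "glob prok Silver" is IV units of Silver.
print(alias_value(["glob", "prok"]) * silver)
```

The first value is 42 and the second is 68.0, matching the first two answers in the output above.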
The parsing logic is in the Observer class.
import operator

class Observer(object):
    def __init__(self, cmpiler):
        self.compiler = cmpiler

    def evaluate(self):
        # check whether the maximum repetition crosses the limit
        if self.count > 3:
            raise Exception("Repeat more than 3")
        # check whether the symbol is valid
        if self.symbol not in self.compiler.symbol_map:
            raise Exception("Wrong Symbol")
        # check whether a non-repeatable symbol (V, L, D) is repeated:
        # strip trailing zeros so that 5, 50, and 500 all reduce to 5
        self.symbol, unit = self.compiler.evaluateSymbol(self.symbol)
        while self.symbol % 10 == 0:
            self.symbol = self.symbol // 10
        if self.count > 1 and self.symbol == 5:
            raise Exception("Wrong Symbol repeated")

    # check whether the input sentence is proper
    def evaluateSentence(self, line):
        if "is" not in line:
            return "I have no idea what you are talking about"
The compilation logic is in the compiler class, as shown here:
import sys

class compilerTrader(object):
    # store the mapping of symbols to score and unit
    symbol_map = {}
    # store the list of valid symbols
    valid_values = []
import sys
class answeringTrader(compilerTrader):
obs.calculate()
obs.evaluate()
values = values.replace("?" , "is ")
if unit == 'roman':
unit = ''
return(values + str(ans) + ' ' + unit)
Finally, the main program calls the answering class and the observer, and then it performs the task
and does unit testing on the logic.
import sys
import unittest
sys.path.append('./answerLayer')
sys.path.append('./compilerLayer')
sys.path.append('./utilityLayer')

class TestTrader(unittest.TestCase):
    def setUp(self):
        pass

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage is : " + sys.argv[0] + " <input file path> <config file path>")
        sys.exit(0)
    tr = ClientTrader(sys.argv[2])
    with open(sys.argv[1]) as f:
        for line in f:
            response = tr.process(line.strip())
            if response is not None:
                print(response)
    TestTrader.trader = tr
    unittest.main(argv=[sys.argv[0]], exit=False)
You can run this program with the following command:
import rpy2.robjects as ro
ro.r('data(input)')
ro.r('x <-HoltWinters(input data frame)')
import subprocess
subprocess.call(['java', '-cp', '*', 'edu.stanford.nlp.sentiment.SentimentPipeline', '-file', 'foo.txt'])
Please place foo.txt in the same folder where you run the Python code.
You can expose Stanford NLP as a web service and call it as a service. (Before running this code, you’ll
need to download the Stanford nlp JAR file available with the book’s source code.)
nlp = StanfordCoreNLP('http://127.0.0.1:9000')
output = nlp.annotate(sentence, properties={
    "annotators": "tokenize,ssplit,parse,sentiment",
    "outputFormat": "json",
    # Only split the sentence at End Of Line. We assume that this method only
    # takes in one single sentence.
    "ssplit.eolonly": "true",
    # Setting enforceRequirements to skip some annotators and make the process
    # faster
    "enforceRequirements": "false"
})
import falcon
from falcon_cors import CORS
import json
from sqlalchemy import *
from sqlalchemy.orm import sessionmaker
import pygeoip
from pymongo import MongoClient
import datetime as dt
import ipaddress
import math
from concurrent.futures import *
from sqlalchemy.engine import Engine
from sqlalchemy import event
import sqlite3

@event.listens_for(Engine, "connect")
def set_sqlite_pragma(dbapi_connection, connection_record):
    # Enlarge the SQLite page cache on every new connection
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA cache_size=100000")
    cursor.close()

class Predictor(object):
    def __init__(self, domain):
        db1 = create_engine('sqlite:///score_' + domain + '0test.db')
        metadata1 = MetaData(db1)
        self.scores = Table('scores', metadata1, autoload=True)
        client = MongoClient(connect=False, maxPoolSize=1)
        self.db = client.frequency
        self.gi = pygeoip.GeoIP('GeoIP.dat')
        self.high = 1.2
        self.low = .8

    def get_hour(self, timestamp):
        # timestamp is given in milliseconds since the epoch
        return dt.datetime.utcfromtimestamp(timestamp / 1e3).hour

    def get_score(self, featurename, featurevalue):
        pred = 0
        s = self.scores.select((self.scores.c.feature_name == featurename) &
                               (self.scores.c.feature_value == featurevalue))
        rs = s.execute()
        row = rs.fetchone()
        if row is not None:
            pred = pred + float(row['score'])
res = self.db.frequency.find_one({"ip" : ip})
freq = 1
if res is not None:
freq = res['frequency']
pred2, prob2 = self.get_score('frequency', str(freq))
return (pred1 + pred2), (prob1 + prob2)
conn = sqlite3.connect('multiplier.db')
# Bind the value as a query parameter instead of concatenating it into the SQL text
cursor = conn.execute("SELECT high, low FROM multiplier WHERE domain = ?", (value,))
row = cursor.fetchone()
if row is not None:
    self.high = row[0]
    self.low = row[1]
return self.get_score(f, value)
    def on_post(self, req, resp):
        input_json = json.loads(req.stream.read())
        input_json['ip'] = str(req.remote_addr)
        pred = 1
        prob = 1
        # Score every input feature in parallel and accumulate the results
        with ThreadPoolExecutor(max_workers=8) as pool:
            future_array = {
                pool.submit(self.get_value, f, input_json[f]): f for f in input_json}
            for future in as_completed(future_array):
                pred1, prob1 = future.result()
                pred = pred + pred1
                prob = prob - prob1
        resp.status = falcon.HTTP_200
        res = math.exp(pred) - 1
        if res < 0:
            res = 0
        prob = math.exp(prob)
        # Clamp the probability to the range [0.1, 0.9]
        if prob <= .1:
            prob = .1
        if prob >= .9:
            prob = .9
        multiplier = self.low + (self.high - self.low) * prob
        pred = multiplier * pred
        resp.body = str(pred)

cors = CORS(allow_all_origins=True, allow_all_methods=True, allow_all_headers=True)
wsgi_app = api = falcon.API(middleware=[cors.middleware])

# Register one Predictor route per publisher domain
f = open('publishers1.list')
for domain in f:
    domain = domain.strip()
    p = Predictor(domain)
    url = '/predict/' + domain
    api.add_route(url, p)
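The concurrency pattern used in on_post can be isolated in a small, self-contained sketch. Everything here is an assumption made for illustration: the SCORES table stands in for the database lookups, and the names score_feature and total_score are hypothetical. Only ThreadPoolExecutor and as_completed come from the listing above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical per-feature scores standing in for the database lookups.
SCORES = {
    ("country", "IN"): 0.2,
    ("hour", "10"): 0.1,
    ("browser", "chrome"): 0.05,
}


def score_feature(name, value):
    # Unknown feature values contribute nothing to the total.
    return SCORES.get((name, value), 0.0)


def total_score(features):
    total = 0.0
    with ThreadPoolExecutor(max_workers=4) as pool:
        # One task per feature; the dictionary maps each future to its feature name.
        futures = {pool.submit(score_feature, n, v): n for n, v in features.items()}
        # Results are consumed in completion order, not submission order.
        for future in as_completed(futures):
            total += future.result()
    return total


request = {"country": "IN", "hour": "10", "browser": "safari"}
print(total_score(request))
```

Threads help here because each lookup spends most of its time waiting on I/O; for CPU-bound scoring, a process pool would likely be the better fit.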
Having covered design patterns in Python a bit, let’s now take a look at some essential architecture
patterns for data scientists.
Summary
In this chapter, we discussed fundamental engineering principles for data scientists, which are covered in
separate chapters. The question-answering example can help you understand how to organize your code.
The basic rule is to not put everything into one class. Divide your code into several classes and use parent-
child relationships where they exist. Then you learned how to use Python to call code written in other
languages; we provided two examples, one calling R and one calling Java. Then we showed you how to
expose your model as a REST API and make it perform well by using concurrent programming. Following
that, we covered significant architectural patterns for data scientists.
© Sayan Mukhopadhyay, Pratip Samanta 2023
S. Mukhopadhyay, P. Samanta, Advanced Data Analytics Using Python
https://doi.org/10.1007/978-1-4842-8005-8_2
Every data science professional has to extract, transform, and load (ETL) data from different data
sources. In this chapter, we will discuss how to perform ETL with Python for a selection of
popular databases. For a relational database, we’ll cover MySQL. As an example of a document
database, we will cover Elasticsearch. For a graph database, we’ll cover Neo4j, and for NoSQL,
we’ll cover MongoDB. We will also discuss the Pandas framework, which was inspired by R’s data
frame concept.
ETL is a process in which data is extracted from multiple sources, transformed into
specific formats, which involves cleaning and enrichment, and finally loaded into its target destination.
The following are the details of each process:
1. Extract: During data extraction, source data is pulled from a variety of sources and moved to a
staging area, making the data available to subsequent stages in the ETL process. After that,
the data undergoes the cleaning and enrichment stage, also known as data cleansing.
2. Transform: In this stage, the source data is matched to the format of the target system. This
includes steps such as changing data types, combining fields, splitting fields, etc.
3. Load: This stage is the final ETL stage. Here, data is loaded into the data warehouse in an
automated manner and can be periodically updated. Once completed, the data is ready for
data analysis.
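The three stages above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the RAW text, the scores table, and the function names extract, transform, and load are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice it would come from files, APIs, or databases.
RAW = "name,score\n alice ,10\nbob,\ncarol,7\n"


def extract(text):
    # Extract: pull the raw records into a staging structure.
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    # Transform: clean the records and match the types of the target table.
    cleaned = []
    for row in rows:
        if not row["score"]:
            continue  # drop rows with a missing score
        cleaned.append((row["name"].strip().title(), int(row["score"])))
    return cleaned


def load(rows, conn):
    # Load: write the cleaned records into the target table.
    conn.execute("CREATE TABLE IF NOT EXISTS scores (name TEXT, score INTEGER)")
    conn.executemany("INSERT INTO scores VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
print(conn.execute("SELECT name, score FROM scores ORDER BY name").fetchall())
```

Keeping each stage in its own function makes it possible to test the cleaning rules without touching the source or the target.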
The previous processes are important in any data analytics work. Once the data goes through
the ETL processes, it becomes possible to analyze the data, find insights, and so on.
We will discuss various types of ETL throughout this chapter. We discussed in Chapter 1 that
data is not an isolated thing. We need to fetch the data from some application, which is extraction,
and we need to load it somewhere, which is a database. In this chapter and the next, we will
discuss various feature engineering techniques that transform the data from one form to another.
MySQL
MySQLdb is an API in Python developed to work on top of the MySQL C interface.
#!/usr/bin/python
import MySQLdb
If you get an import error exception, that means the module was not installed properly.
The following are the instructions to install the MySQL Python module:
$ gunzip MySQL-python-1.2.2.tar.gz
$ tar -xvf MySQL-python-1.2.2.tar
$ cd MySQL-python-1.2.2
$ python setup.py build
$ python setup.py install
Database Connection
Before connecting to a MySQL database, make sure you do the following:
1. You need a database called TEST; select it with the SQL command use test;.
2. In TEST you need a table named STUDENT; create it with the SQL command create table
student(name varchar(20), sur_name varchar(20), roll_no int);.
4. There needs to be a user in TEST that has complete access to the database.
If you do not do these steps properly, you will get an exception in the next Python code.
INSERT Operation
The following code carries out the SQL INSERT statement for the purpose of creating a record in
the STUDENT table:
#!/usr/bin/python
import MySQLdb

# Open database connection
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")
# Prepare a cursor object using the cursor() method
cursor = db.cursor()
# Prepare SQL query to INSERT a record into the database
sql = """INSERT INTO STUDENT(NAME,
         SUR_NAME, ROLL_NO)
         VALUES ('Sayan', 'Mukhopadhyay', 1)"""
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Roll back in case there is any error
    db.rollback()
# Disconnect from the server
db.close()
READ Operation
The following code fetches data from the STUDENT table and prints it:
#!/usr/bin/python
import MySQLdb

# Open database connection and prepare a cursor, as in the INSERT example
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")
cursor = db.cursor()
# Prepare SQL query to SELECT all records from the table
sql = "SELECT * FROM STUDENT"
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Fetch all the rows as a sequence of tuples
    results = cursor.fetchall()
    for row in results:
        fname = row[0]
        lname = row[1]
        roll_no = row[2]
        # Now print the fetched result
        print("name=%s,surname=%s,id=%d" % (fname, lname, roll_no))
except MySQLdb.Error:
    print("Error: unable to fetch data")
db.close()
DELETE Operation
The following code deletes the row with ROLL_NO = 1 from the STUDENT table:
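#!/usr/bin/python
import MySQLdb

# Open database connection and prepare a cursor, as in the INSERT example
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")
cursor = db.cursor()
# Prepare SQL query to DELETE required records
sql = "DELETE FROM STUDENT WHERE ROLL_NO = 1"
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Roll back in case there is any error
    db.rollback()
db.close()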
UPDATE Operation
The following code changes the SUR_NAME value from Mukhopadhyay to Mukherjee:
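#!/usr/bin/python
import MySQLdb

# Open database connection and prepare a cursor, as in the INSERT example
db = MySQLdb.connect("localhost", "user", "passwd", "TEST")
cursor = db.cursor()
# Prepare SQL query to UPDATE required records
sql = """UPDATE STUDENT SET SUR_NAME='Mukherjee'
         WHERE SUR_NAME='Mukhopadhyay'"""
try:
    # Execute the SQL command
    cursor.execute(sql)
    # Commit your changes in the database
    db.commit()
except MySQLdb.Error:
    # Roll back in case there is any error
    db.rollback()
db.close()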
COMMIT Operation
The commit operation tells the database to finalize the pending modifications. After
this operation, the changes can no longer be reverted with a rollback.
ROLL-BACK Operation
If you are not completely convinced about any of the uncommitted modifications and you want to
reverse them, then you can call the rollback() method.
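The difference between the two operations can be tried without a MySQL server. The following sketch is an assumption-laden stand-in: it uses the standard library sqlite3 module and an in-memory database instead of MySQLdb, because both follow the Python DB-API and expose commit() and rollback() on the connection. The second student row is illustrative data only.

```python
import sqlite3

# An in-memory SQLite database stands in for the MySQL TEST database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, sur_name TEXT, roll_no INTEGER)")
conn.commit()

conn.execute("INSERT INTO student VALUES (?, ?, ?)", ("Sayan", "Mukhopadhyay", 1))
conn.commit()    # the first row is now permanent

conn.execute("INSERT INTO student VALUES (?, ?, ?)", ("Pratip", "Samanta", 2))
conn.rollback()  # the uncommitted second insert is discarded

rows = conn.execute("SELECT name, roll_no FROM student").fetchall()
print(rows)
```

Only the committed row remains after the rollback, which is the behavior the INSERT, DELETE, and UPDATE listings rely on in their except branches.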
The following is a complete example of accessing MySQL data through Python. It gives the
complete description of the data stored in a CSV file or MySQL database.
This code asks for the data source type, either MySQL or text. If you choose MySQL, it asks for
the IP address, credentials, and database name, shows all tables in the database, and offers the
table’s fields once a table is selected. Similarly, for a text file it asks for a path and shows the user
all the columns of the files it points to.
import MySQLdb
import sys

out = open('Config1.txt', 'w')
print("Enter the Data Source Type:")
print("1. MySql")
print("2. Exit")
while True:
    data1 = sys.stdin.readline().strip()
    if int(data1) == 1:
        out.write("source begin" + "\n" + "type=mysql\n")
Normal Forms
Database normal forms are the principles to organize your data in an optimum way.
Every table in a database can be in one of the normal forms that we’ll go over next. Apart from the
primary key (PK) and foreign key (FK), you want to have as little repetition as possible. The rest of
the information should be taken from other tables.
First normal form (1NF)
Second normal form (2NF)
Third normal form (3NF)
1. The recurring group in the Student Grade Report table contains the course information. A
student can enroll in a variety of courses.
2. Get rid of the group that keeps repeating itself. That’s each student’s course information in
this situation.
4. The attribute value must be identified uniquely by the PK (StudentNo and CourseNo).
2. When looking at the Student Course table, you can observe that not all of the characteristics,
especially the course details, are completely dependent on the PK. The grade is the sole
attribute that is entirely reliant on xxx.
3. Inspect new and updated tables to ensure that each table has a determinant and that no
tables have improper dependencies.
There should be no anomalies in the third normal form at this point. For this example,
consider the dependency diagram in Figure 2-1. As previously said, the first step is to eliminate
repeating groups.
Review the dependencies in Figure 2-1, which summarizes the normalization procedure for
the School database.
The following are the abbreviations used in Figure 2-1:
PD stands for partially dependent.
TD stands for transitive dependence.
FD stands for full dependency. (FD stands for functional dependence in most cases. Figure 2-1
is the only place where FD is used as an abbreviated form for full dependence.)
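As a rough sketch of where the normalization procedure ends up, the following code builds one possible third-normal-form layout in an in-memory SQLite database. The keys StudentNo and CourseNo and the grade attribute come from the text; the table names, the other columns, and the sample rows are assumptions made for illustration and may differ from Figure 2-1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_no INTEGER PRIMARY KEY, student_name TEXT);
CREATE TABLE course  (course_no TEXT PRIMARY KEY, course_name TEXT);
-- The grade depends on the whole composite key (student_no, course_no).
CREATE TABLE student_course (
    student_no INTEGER REFERENCES student(student_no),
    course_no  TEXT REFERENCES course(course_no),
    grade      TEXT,
    PRIMARY KEY (student_no, course_no)
);
""")

# Hypothetical sample data: one student enrolled in two courses.
conn.execute("INSERT INTO student VALUES (1, 'Asha')")
conn.executemany("INSERT INTO course VALUES (?, ?)",
                 [("C1", "Databases"), ("C2", "Statistics")])
conn.executemany("INSERT INTO student_course VALUES (?, ?, ?)",
                 [(1, "C1", "A"), (1, "C2", "B")])

# The original grade report is recovered with joins instead of repeated columns.
report = conn.execute("""
    SELECT s.student_name, c.course_name, sc.grade
    FROM student_course sc
    JOIN student s ON s.student_no = sc.student_no
    JOIN course c ON c.course_no = sc.course_no
    ORDER BY c.course_no
""").fetchall()
print(report)
```

Student and course details are stored once each, so renaming a course means updating a single row rather than every enrollment.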
A relational database is valuable when structured data and a strict relationship between the
fields are maintained. But what if you do not have structured data in which a strict relationship
between fields has been maintained? That’s where Elasticsearch comes in.
Elasticsearch
You’ll find that data is often unstructured, meaning you may end up with a mix of image data,
sensor data, and other forms of data. To analyze this data, we first need to store it. MySQL and other
SQL-based databases are not good at storing unstructured data. So here we introduce a different kind
of storage, which is mainly used to handle unstructured textual data.
Elasticsearch is a Lucene-based database, which makes it easy to store and search text data.
Its query interface is a REST API endpoint. The Elasticsearch (ES) low-level client gives a direct
mapping from Python to ES REST endpoints. One of the big advantages of Elasticsearch is that it
provides a full-stack solution for data analysis in one place. Elasticsearch is the database. It has a
configurable front end called Kibana, a data collection tool called Logstash, and an enterprise
security feature called Shield.
The client has attributes called cat, cluster, indices, ingest, nodes, snapshot, and tasks that
provide instances of CatClient, ClusterClient, IndicesClient, IngestClient,
NodesClient, SnapshotClient, and TasksClient, respectively. These instances are the
only supported way to get access to these classes and their methods.
You can specify your own connection class, which can be used by providing the
connection_class parameter.
Different hosts can have different parameters (hostname, port number, SSL option); you can
use one dictionary per node to specify them.
from elasticsearch import Elasticsearch

Es1 = Elasticsearch(
    ['localhost:443', 'other_host:443'],
    # turn on SSL
    use_ssl=True,
    # make sure we verify SSL certificates (off by default)
    verify_certs=True,
    # provide a path to CA certs on disk
    ca_certs='path to CA_certs',
    # PEM formatted SSL client certificate
    client_cert='path to clientcert.pem',
    # PEM formatted SSL client key
    client_key='path to clientkey.pem'
)
neo4j-rest-client
The main objective of neo4j-rest-client is to make sure that the Python programmers already
using Neo4j locally through python-embedded are also able to access the Neo4j REST server. So,
the structure of the neo4j-rest-client API is completely in sync with python-embedded. But, a new
structure is brought in so as to arrive at a more Pythonic style and to augment the API with the
new features being introduced by the Neo4j team.
In-Memory Database
Another important class of databases is an in-memory database. This type stores and processes
the data in RAM. So, operations on the database are fast, and the data is volatile. SQLite, although
file-based by default, is a popular database that can also run in memory. In Python you can operate
on SQLite through the built-in sqlite3 module or the sqlalchemy library. In Chapter 1’s Flask and
Falcon example, I showed you how to select data from SQLite. Here I will show how to store a Pandas
data frame in SQLite:
{
"_id":ObjectId("01"),
"address": {
"street":"Siraj Mondal Lane",
"pincode":"743145",
"building":"129",
"coord": [ -24.97, 48.68 ]
},
"borough":"Manhattan",
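from pymongo import MongoClient

client11 = MongoClient("mongodb://myhostname:27017")
# A database can be accessed with attribute style ...
db11 = client11.primer
# ... or with dictionary style
db11 = client11['primer']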
Collection objects can be accessed directly by using the dictionary style or the attribute access
from a database object, as shown in the following two examples:
Coll11 = db11.dataset
coll = db11['dataset']
Insert Data
You can place a document into a collection that doesn’t exist, and the following operation will
create the collection:
Update Data
Here is how to update data:
result = db.address.update_one(
    {"building": "129"},
    {"$set": {"address.street": "MG Road"}}
)
Remove Data
To expunge all documents from a collection, use this:
result=db.restaurants.delete_many({})
Cloud Databases
Even though the cloud has its own chapter, we’d like to provide you with an overview of cloud
databases, particularly databases for large data. People prefer cloud databases when they want
their systems to scale automatically. Google BigQuery is a strong tool for querying your data.
Azure Synapse has a similar feature; however, it is significantly more expensive. You can store
data on S3, but if you want to run a query, you’ll need Athena, which is expensive. So, in modern
practice, data is stored as a blob in S3, and everything is done in a Python application. If there is
an error in locating the data, this method takes a long time. Amazon Redshift can also handle a
considerable quantity of large data and comes with a built-in BI tool.
Pandas
The goal of this section is to show some examples to enable you to begin using Pandas. These
illustrations have been taken from real-world data, along with any bugs and weirdness that are
inherent. Pandas is a framework inspired by the R data frame concept.
Please find the CSV file at the following link:
https://github.com/Apress/advanced-data-analytics-python-2e
import pandas as pd
broken_df=pd.read_csv('fetaure_engineering_data.csv')
broken_df[:3]
There are many other methods such as sort_values and groupby in Pandas that are
useful when playing with structured data. Also, Pandas can load data from popular
databases such as MongoDB and Google BigQuery through adapter libraries.
One complex example with Pandas is shown next. In the X data frame for each distinct column
value, find the average value of the floor grouping by the root column.
Email Parsing
See Chapter 1 for a complete example of web crawling using Python.
Just as Beautiful Soup parses web pages, Python has a library for email parsing. The following is
example code to parse email data stored on a mail server. The inputs in the configuration are the
username and the number of mails to parse for the user.
In this code, you specify the email user, the email folder, and the index of the mail in the config;
the code writes the from address, the subject, and the date of each email to a CSV file.
conf = open(sys.argv[1], 'w')
conf.write("path=" + config["path"] + "\n")
conf.write("folder=" + config["folder"] + "\n")
for usr in users.keys():
    conf.write("name=" + usr + ",value=" + users[usr] + "\n")
conf.close()
path=/cygdrive/c/share/enron_mail_20110402/enron_mail_20110402/maildir
folder=Inbox
name=storey-g,value=142
name=ybarbo-p,value=775
name=tycholiz-b,value=602
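The parsing step itself can be sketched with the standard library email package. This is not the book's listing: the message text, the addresses, and the function name mail_to_row are hypothetical, and the CSV is written to an in-memory buffer rather than to a file.

```python
import csv
import io
from email import message_from_string

# Hypothetical raw message; real input would be read from the mail folder.
RAW_MAIL = """From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers
Date: Mon, 02 Apr 2001 10:15:00 -0700

Please find the numbers attached.
"""


def mail_to_row(raw):
    # Parse the headers and keep only the fields written to the CSV file.
    msg = message_from_string(raw)
    return [msg["From"], msg["Subject"], msg["Date"]]


out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["from", "subject", "date"])
writer.writerow(mail_to_row(RAW_MAIL))
print(out.getvalue())
```

The csv module quotes the date field automatically because it contains a comma, so the output stays a valid three-column file.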
Topical Crawling
Topical crawlers are intelligent crawlers that retrieve information from anywhere on the Web.
They start with a URL and then find links present in the pages under it; then they look at new
URLs, bypassing the scalability limitations of universal search engines. This is done by
distributing the crawling process across users, queries, and even client computers. Crawlers can
use the context available to infinitely loop through the links with a goal of systematically locating
a highly relevant, focused page.
Web searching is a complicated task. A large chunk of machine learning work is being applied
to find the similarity between pages, such as the maximum number of URLs fetched or visited.
Crawling Algorithms
Figure 2-2 describes how the topical crawling algorithm works with its major components.
Figure 2-2 Topical crawling described
The starting URL of a topical crawler is known as the seed URL. There is another set of URLs
known as the target URLs, which are examples of desired output.
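A toy version of the idea can be written without any network access. In this sketch, everything is an assumption made for illustration: the WEB dictionary stands in for real pages, a set of topic words stands in for the target URLs, and relevance is a simple word-overlap ratio rather than a learned similarity measure.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "seed": ("python data analytics", ["a", "b"]),
    "a": ("python pandas tutorial", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("pandas data frames in python", []),
    "d": ("more recipes", []),
}


def relevance(text, topic_words):
    # Fraction of topic words that appear in the page text.
    words = set(text.split())
    return len(words & topic_words) / len(topic_words)


def topical_crawl(seed, topic_words, threshold=0.5, max_pages=10):
    frontier, visited, relevant = deque([seed]), set(), []
    # max_pages caps the number of pages visited so the crawl always ends.
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited or url not in WEB:
            continue
        visited.add(url)
        text, links = WEB[url]
        if relevance(text, topic_words) >= threshold:
            relevant.append(url)
            # Only links found on on-topic pages are followed.
            frontier.extend(links)
    return relevant


print(topical_crawl("seed", {"python", "pandas"}))
```

Because links on off-topic pages are never followed, page "d" is not visited at all, which is the main saving a topical crawler has over a universal one.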
Another intriguing application of crawling is for a startup that wants to uncover crucial
keywords for every IP address. They acquire the user’s browsing history, taken from HTTP packet
headers, from the Internet service provider. After crawling the URLs visited by that IP, they classify
the words in the text using named-entity recognition (Stanford NLP), which can also be
implemented with the RNN explained in Chapter 5. All named entities and their types, such as
names of people, locations, and organizations, are then recorded for the user.
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger
import re
import json
import os
import socket
import struct

def ip2int(addr):
    # Convert a dotted IPv4 string to an unsigned 32-bit integer
    return struct.unpack("!I", socket.inet_aton(addr))[0]

def int2ip(addr):
    # Convert an unsigned 32-bit integer back to a dotted IPv4 string
    return socket.inet_ntoa(struct.pack("!I", addr))

java_path = '/usr/bin/java'
os.environ['JAVAHOME'] = java_path
os.environ['STANFORD_MODELS'] = '/home/ec2-user/stanford-ner.jar'
nltk.internals.config_java(java_path)

f = open("/home/ec2-user/data.csv")
res = []
stop_words = set(stopwords.words('english'))
for line in f:
    fields = line.strip().split(",")
    url = fields[1]
    ip = fields[-1]
    print(ip)
    print(url)
    tags_del = None
    try:
        ip = ip2int(ip)
    except Exception:
        # Skip rows whose last field is not a valid IPv4 address
        continue
    print(ip)
    tagged = None
    try:
        # Download the page and strip the markup
        code = requests.get(url)
        plain = code.text
        s = BeautifulSoup(plain, 'html.parser')
        tags_del = s.get_text()
        if tags_del is None:
            continue
        no_html = re.sub('<[^>]*>', '', tags_del)
        # Tag every token with its named-entity type
        st = StanfordNERTagger(
            '/home/ec2-user/english.all.3class.distsim.crf.ser.gz',
            '/home/ec2-user/stanford-ner.jar')
        tokenized = word_tokenize(no_html)
        tagged = st.tag(tokenized)
    except Exception:
        pass
    if tagged is None:
        continue
    for t in tagged:
        t = list(t)
        t[0] = t[0].replace(' ', '')
        t[-1] = t[-1].replace(' ', '')
        print(t)
        if t[0] in stop_words:
            continue
        unit = {}
        unit["ip"] = ip
        unit["word"] = t[0]
        unit["name_entity"] = t[-1]
        res.append(unit)
res_final = {}
res_final["result"] = res
Summary
In this chapter, we discussed different kinds of databases and their use cases, and we discussed
collecting text data from the Web and extracting information from different types of unstructured
data such as email and web pages.
Chinese Government was crumbling to pieces, and that he did
not believe that any action based on the assumption of their
stability could be efficacious. The French Minister is, I am
certain, genuinely convinced that the danger is real, and
owing to his means of information he is well qualified to
judge. … I had an interview with Prince Ch'ing and the Yamên
Ministers this afternoon. Energetic measures are now being
taken against the Boxers by the Government, whom the progress
of the Boxer movement has, at last, thoroughly alarmed. The
Corps Diplomatique, who met in the course of the day, have
decided to wait another twenty-four hours for further
developments."
"May 29.
Some stations on the line, among others Yengtai, 6 miles from
Peking, together with machine sheds and European houses, were
burnt yesterday by the Boxers. The line has also been torn up
in places. Trains between this and Tien-tsin have stopped
running, and traffic has not been resumed yet. The situation
here is serious, and so far the Imperial troops have done
nothing. It was unanimously decided, at a meeting of foreign
Representatives yesterday, to send for guards for the
Legations, in view of the apathy of the Chinese Government and
the gravity of the situation. Before the meeting assembled,
the French Minister had already sent for his."
"May 30.
Permission for the guards to come to Peking has been refused
by the Yamên. I think, however, that they may not persist in
their refusal. The situation in the meantime is one of extreme
gravity. The people are very excited, and the soldiers
mutinous. Without doubt it is now a question of European life
and property being in danger here. The French and Russians are
landing 100 men each. French, Russian, and United States'
Ministers, and myself, were deputed to-day at a meeting of the
foreign Representatives to declare to the Tsung-li Yamên that
the foreign Representatives must immediately bring up guards
for the protection of the lives of Europeans in Peking in view
of the serious situation and untrustworthiness of the Chinese
troops. That the number would be small if facilities were
granted, but it must be augmented should they be refused, and
serious consequences might result for the Chinese Government
in the latter event. In reply, the Yamên stated that no
definite reply could be given until to-morrow afternoon, as
the Prince was at the Summer Palace. As the Summer Palace is
within an hour's ride we refused to admit the impossibility of
prompt communication and decision, and repeated the warning
already given of the serious consequences which would result
if the Viceroy at Tien-tsin did not receive instructions this
evening in order that the guards might be enabled to arrive
here to-morrow. The danger will be greatest on Friday, which
is a Chinese festival."
"May 31.
Provided that the number does not exceed that of thirty for
each Legation, as on the last occasion, the Yamên have given
their consent to the guards coming to Peking. … It was decided
this morning, at a meeting of the foreign Representatives, to
at once bring up the guards that are ready. These probably
include the British, American, Italian, and Japanese."
"June 1.
British, American, Italian, Russian, French and Japanese
guards arrived yesterday. Facilities were given, and there
were no disturbances. Our detachment consists of three
officers and seventy-five men, and a machine gun."
"June 2.
The city is comparatively quiet, but murders of Christian
converts and the destruction of missionary property in
outlying districts occur every day, and the situation still
remains serious. The situation at the Palace is, I learn from
a reliable authority, very strained. The Empress-Dowager does
not dare to put down the Boxers, although wishing to do so, on
account of the support given them by Prince Tuan, father of
the hereditary Prince, and other conservative Manchus, and
also because of their numbers. Thirty Europeans, most of whom
were Belgians, fled from Paoting-fu via the river to
Tien-tsin. About 20 miles from Tien-tsin they were attacked by
Boxers.
A party of Europeans having gone to their rescue from
Tien-tsin severe fighting ensued, in which a large number of
Boxers were killed. Nine of the party are still missing,
including one lady. The rest have been brought into Tien-tsin.
The Russian Minister, who came to see me to-day, said he
thought it most imperative that the foreign Representatives
should be prepared for all eventualities, though he had no
news confirming the above report. He said he had been
authorized by his Government to support any Chinese authority
at Peking which was able and willing to maintain order in case
the Government collapsed."
"June 4.
I am informed by a Chinese courier who arrived to-day from
Yung-Ching, 40 miles south of Peking, that on the 1st June the
Church of England Mission at that place was attacked by the
Boxers. He states that one missionary, Mr. Robinson, was
murdered, and that he saw his body, and that another, Mr.
Norman, was carried off by the Boxers. I am insisting on the
Chinese authorities taking immediate measures to effect his
rescue. Present situation at Peking is such that we may at any
time be besieged here with the railway and telegraph lines
cut. In the event of this occurring, I beg your Lordship will
cause urgent instructions to be sent to Admiral Seymour to
consult with the officers commanding the other foreign
squadrons now at Taku to take concerted measures for our
relief. The above was agreed to at a meeting held to-day by
the foreign Representatives, and a similar telegram was sent
to their respective Governments by the Ministers of Austria,
Italy, Germany, France, Japan, Russia, and the United States,
all of whom have ships at Taku and guards here. The telegram
was proposed by the French Minister and carried unanimously.
It is difficult to say whether the situation is as grave as
the latter supposes, but the apathy of the Chinese Government
makes it very serious."
"June 5.
I went this afternoon to the Yamên to inquire of the Ministers
personally what steps the Chinese Government proposed to take
to effect the punishment of Mr. Robinson's murderers and the
release of Mr. Norman. I was informed by the Ministers that
the Viceroy was the responsible person, that they had
telegraphed to him to send troops to the spot, and that that
was all they were able to do in the matter. They did not
express regret or show the least anxiety to effect the relief
of the imprisoned man, and they displayed the greatest
indifference during the interview. I informed them that the
Chinese Government would be held responsible by Her Majesty's
Government for the criminal apathy which had brought about
this disgraceful state of affairs. I then demanded an
interview with Prince Ching, which is fixed for to-morrow, as
I found it useless to discuss the matter with the Yamên. This
afternoon I had an interview with the Prince and Ministers of
the Yamên. They expressed much regret at the murder of Messrs.
Robinson and Norman, and their tone was fully satisfactory in
this respect. … No attempt was made by the Prince to defend
the Chinese Government, nor to deny what I had said. He could
say nothing to reassure me as to the safety of the city, and
admitted that the Government was reluctant to deal harshly
with the movement, which, owing to its anti-foreign character,
was popular. He stated that they were bringing 6,000 soldiers
from near Tien-tsin for the protection of the railway, but it
was evident that he doubted whether they would be allowed to
fire on the Boxers except in the defence of Government
property, or if authorized whether they would obey. He gave me
to understand, without saying so directly, that he has
entirely failed to induce the Court to accept his own views as
to the danger of inaction. It was clear, in fact, that the Yamên
wished me to understand that the situation was most serious,
and that, owing to the influence of ignorant advisers with the
Empress-Dowager, they were powerless to remedy it."
"June 6.
Since the interview with the Yamên reported in my preceding
telegram I have seen several of my colleagues. I find they all
agree that, owing to the now evident sympathy of the
Empress-Dowager and the more conservative of her advisers with
the anti-foreign movement, the situation is rapidly growing
more serious. Should there be no change in the attitude of the
Empress, a rising in the city, ending in anarchy, which may
produce rebellion in the provinces, will be the result,
'failing an armed occupation of Peking by one or more of the
Powers.' Our ordinary means of pressure on the Chinese
Government fail, as the Yamên is, by general consent, and
their own admission, powerless to persuade the Court to take
serious measures of repression. Direct representations to the
Emperor and Dowager-Empress from the Corps Diplomatique at a
special audience seem to be the only remaining chance of
impressing the Court."
"June 7.
There is a long Decree in the 'Gazette' which ascribes the
recent trouble to the favour shown to converts in law suits
and the admission to their ranks of bad characters. It states
that the Boxers, who are the objects of the Throne's sympathy
equally with the converts, have made use of the anti-Christian
feeling aroused by these causes, and that bad characters among
them have destroyed chapels and railways which are the
property of the State. Unless the ringleaders among such bad
characters are now surrendered by the Boxers they will be
dealt with as disloyal subjects, and will be exterminated.
Authorization will be given to the Generals to effect arrests,
exercising discrimination between leaders and their followers.
It is probable that the above Decree represents a compromise
between the conflicting opinions which exist at Court. The
general tone is most unsatisfactory, though the effect may be
good if severe measures are actually taken. The general
lenient tone, the absence of reference to the murder of
missionaries, and the justification of the proceedings of the
Boxers by the misconduct of Christian converts are all
dangerous factors in the case."
"June 8.
A very bad effect has been produced by the Decree reported in
my immediately preceding telegram. There is no prohibition of
the Boxers drilling, which they now openly do in the houses of
the Manchu nobility and in the temples. This Legation is full
of British refugees, mostly women and children, and the London
and Church of England Missions have been abandoned. I trust
that the instructions requested in my telegrams of the 4th and
5th instant have been sent to the Admiral. I have received the
following telegram, dated noon to-day, from Her Majesty's Consul
at Tien-tsin:
'By now the Boxers must be near Yang-tsun. Last night the
bridge, which is outside that station, was seen to be on fire.
General Nieh's forces are being withdrawn to Lutai, and 1,500
of them have already passed through by railway. There are now
at Yang-tsun an engine and trucks ready to take 2,000 more
men.' Lutai lies on the other side of Tien-tsin, and at some
distance. Should this information be correct, it means that an
attempt to protect Peking has been abandoned by the only force
on which the Yamên profess to place any reliance. The 6,000
men mentioned in my telegram of the 5th instant were commanded
by General Nieh."
"During the night of the 14th inst. news was received that all
railway-carriages and other rolling stock had been ordered to
be sent up the line for the purpose of bringing down a Chinese
army to Tong-ku. On receipt of this serious information a
council of Admirals was summoned by Vice-Admiral Hiltebrandt,
Commander-in-Chief of the Russian Squadron, and the German,
French, United States Admirals, myself, and the Senior
Officers of Italy, Austria, and Japan attended; and it was
decided to send immediate orders to the captains of the allied
vessels in the Peiho River (three Russian, two German, one
United States, one Japanese, one British—'Algerine') to
prevent any railway plant being taken away from Tong-ku, or
the Chinese army reaching that place, which would cut off our
communication with Tientsin; and in the event of either being
attempted they were to use force to prevent it, and to destroy
the Taku Forts. By the evening, and during the night of 15th
inst., information arrived that the mouth of the Peiho River
was being protected by electric mines. On receipt of this,
another council composed of the same naval officers was held
in the forenoon of 16th June on board the 'Rossia,' and in
consequence of the gravity of the situation, and information
having also arrived that the forts were being provisioned and
reinforced, immediate notice was sent to the Viceroy of Chili
at Tientsin and the commandant of the forts that, in
consequence of the danger to our forces up the river, at
Tientsin, and on the march to Peking by the action of the
Chinese authorities, we proposed to temporarily occupy the
Taku Forts, with or without their good will, at 2 a.m. on the
17th inst." Early on Sunday, 17th June, "the Taku Forts opened
fire on the allied ships in the Peiho River, which continued
almost without intermission until 6.30 a.m., when all firing
had practically ceased and the Taku Forts were stormed and in
the hands of the Allied Powers, allowing of free communication
with Tientsin by water, and rail when the latter is repaired."
"Similar decrees on the 14th and 15th show alarm at the result
of the 'Boxer' agitation and lawlessness within the city.
Nothing so strong against the 'Boxers' had previously been
published. Fires were approaching too closely to the Imperial
Palace. No steps had been taken by the Court to prevent the
massacre and burning of Christians and their property in the
country, but on the 16th the great Chien Mên gate fronting the
Palace had been burned and the smoke had swept over the
Imperial Courts. Yet even in these decrees leniency is shown
to the 'Boxers,' for they are not to be fired upon, but are,
if guilty, to be arrested and executed. On June 17th the edict
expresses the belief of the Throne that:—'All foreign
Ministers ought to be really protected. If the Ministers and
their families wish to go for a time to Tien-tsin, they must
be protected on the way. But the railroad is not now in
working order. If they go by the cart road it will be
difficult, and there is fear that perfect protection cannot be
offered. They would do better, therefore, to abide here in
peace as heretofore and wait till the railroad is repaired,
and then act as circumstances render expedient.'