Python Coursera 1

Data Processing Using Python

Data Retrieval and Representation


ZHANG Li/Dazhuang
Nanjing University
Department of Computer Science and Technology
Department of University Basic Computer Teaching

Data Processing Using Python

LOCAL DATA RETRIEVAL

Data Retrieval with Python

How to get local data?

Open, read/write, and close a file.

• Open the file first, then read or write.
• Read files
• Write files
• Why files need to be closed

Open File

Source
>>> f1 = open('d:\\infile.txt')
>>> f2 = open(r'd:\outfile.txt', 'w')
>>> f3 = open('record.dat', 'wb', 0)
file_obj = open(filename, mode='r', buffering=-1, …)
• mode is an optional parameter with default value ‘r’
• buffering is an optional integer used to set the buffering policy. Pass 0
to switch buffering off (only allowed in binary mode), 1 to select line
buffering (only usable in text mode), and an integer > 1 to indicate the
size in bytes of a fixed-size chunk buffer.

open() Modes

Mode  Function
r     Open for reading (default)
w     Open for writing, truncating the file first
a     Open for writing, appending to the end of the file if it exists
x     Open for exclusive creation, failing if the file already exists
b     Binary mode
+     Open a disk file for updating (reading and writing)
t     Text mode (default)
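
The modes in the table can be tried out on a scratch file. The following is a minimal sketch, not part of the original slides; the helper name demo_modes and the file name modes_demo.txt are hypothetical, and a temporary directory is used so that nothing on disk is overwritten.

```python
import os
import tempfile


def demo_modes():
    """Exercise the 'w', 'a', 'x', and 'r' modes on a scratch file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'modes_demo.txt')  # hypothetical file name

        with open(path, 'w') as f:   # 'w' creates the file (or truncates it)
            f.write('first\n')
        with open(path, 'a') as f:   # 'a' appends to the end
            f.write('second\n')

        try:
            open(path, 'x')          # 'x' refuses to overwrite an existing file
            exclusive_failed = False
        except FileExistsError:
            exclusive_failed = True

        with open(path, 'r') as f:   # 'r' is the default mode
            lines = f.readlines()
    return lines, exclusive_failed


lines, exclusive_failed = demo_modes()
print(lines)             # ['first\n', 'second\n']
print(exclusive_failed)  # True
```

Modes can be combined, for example 'rb' for binary reading or 'a+' for appending and reading.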

File-Related Functions

Return value
• open() returns a file object.
• A file object is iterable.
• Methods are available to read, write, and close files:
  – f.read(), f.write(), f.readline(), f.readlines(), f.writelines()
  – f.close()
  – f.seek()

Write a File: f.write()

• file_obj.write(str)
  − Write a string to the file.

Source
>>> f = open('firstpro.txt', 'w')
>>> f.write('Hello, World!')
>>> f.close()

Contents of firstpro.txt:
Hello, World!

Source
>>> with open('firstpro.txt', 'w') as f:
...     f.write('Hello, World!')
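
Why files need to be closed: closing flushes buffered data to disk and releases the file handle. The sketch below is an illustration added here, not taken from the slides. It compares an explicit close() with the with statement, which closes the file automatically when the block ends. It writes to a temporary directory, and the names g, h, and the closed_* variables are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'firstpro.txt')

    f = open(path, 'w')
    f.write('Hello, World!')
    closed_before = f.closed    # False: the file is still open
    f.close()                   # flushes the buffer and releases the file
    closed_after = f.closed     # True

    with open(path, 'w') as g:
        g.write('Hello, World!')
    closed_by_with = g.closed   # True: the with block closed the file on exit

    with open(path) as h:
        content = h.read()

print(closed_before, closed_after, closed_by_with)  # False True True
print(content)                                      # Hello, World!
```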

Read a File: f.read()

• file_obj.read(size)
  − Read at most size characters from the file (size bytes in binary mode) and return them.
• file_obj.read()
  − Read to the end of the file and return a string.

Source
>>> with open('firstpro.txt') as f:
...     p1 = f.read(5)
...     p2 = f.read()
...     print(p1, p2)
...
Output:
Hello , World!
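
Each call to read() continues from where the previous one stopped, which is why print() shows a space between the two pieces. A small sketch of this behaviour, assuming a scratch copy of firstpro.txt in a temporary directory (the third read, p3, is added here for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'firstpro.txt')
    with open(path, 'w') as f:
        f.write('Hello, World!')

    with open(path) as f:
        p1 = f.read(5)   # the first 5 characters
        p2 = f.read()    # everything that is left
        p3 = f.read()    # nothing left: an empty string

print(repr(p1))  # 'Hello'
print(repr(p2))  # ', World!'
print(repr(p3))  # ''
```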

Other Read/Write Functions

• file_obj.readlines()
• file_obj.readline()
• file_obj.writelines()

File
# Filename: companies_a.py
with open('companies.txt') as f:
    cNames = f.readlines()
    print(cNames)

Output:
['GOOGLE Inc.\n', 'Microsoft Corporation\n', 'Apple Inc.\n', 'Facebook, Inc.']

Example

Add sequence numbers 1, 2, 3, … to the lines in the file companies.txt, and write the result to another file, scompanies.txt.

File
# Filename: revcopy.py
with open('companies.txt') as f1:
    cNames = f1.readlines()
for i in range(0, len(cNames)):
    cNames[i] = str(i+1) + ' ' + cNames[i]
with open('scompanies.txt', 'w') as f2:
    f2.writelines(cNames)

Output (contents of scompanies.txt):
1 GOOGLE Inc.
2 Microsoft Corporation
3 Apple Inc.
4 Facebook, Inc.

Other File-Related Functions

• file_obj.seek(offset, whence=0)
  − Move the file pointer offset bytes from whence (an optional parameter with default value 0; 0 stands for the beginning of the file, 1 means the current position, and 2 means the end).

File
# Filename: companies_b.py
s = 'Tencent Technology Company Limited'
with open('companies.txt', 'a+') as f:
    f.writelines('\n')
    f.writelines(s)
    f.seek(0)  # move back to the beginning; otherwise readlines() returns []
    cNames = f.readlines()
    print(cNames)
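
The effect of seek() can be seen by reading before and after moving the file pointer. The following is a sketch under stated assumptions: it uses a two-line scratch copy of companies.txt in a temporary directory, and the variable names at_end, from_start, and tail are illustrative. In text mode only seeks relative to the beginning are generally supported, so the seek relative to the end is done in binary mode.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'companies.txt')
    with open(path, 'w') as f:
        f.write('GOOGLE Inc.\nApple Inc.')

    with open(path, 'a+') as f:
        f.write('\nTencent Technology Company Limited')
        at_end = f.readlines()      # pointer is at the end: nothing to read
        f.seek(0)                   # whence defaults to 0 (beginning of file)
        from_start = f.readlines()

    with open(path, 'rb') as f:     # binary mode allows seeking from the end
        f.seek(-7, 2)               # 7 bytes before the end (whence=2)
        tail = f.read()

print(at_end)      # []
print(from_start)  # ['GOOGLE Inc.\n', 'Apple Inc.\n', 'Tencent Technology Company Limited']
print(tail)        # b'Limited'
```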

Standard Files

• When a program begins, the following three standard files are available:
  1. stdin: standard input
  2. stdout: standard output
  3. stderr: standard error

Source
>>> newcName = input('Enter the name of new company: ')
Enter the name of new company: Alibiabia
>>> print(newcName)
Alibiabia

Source
>>> import sys
>>> sys.stdout.write('hello')

Data Processing Using Python

INTERNET DATA RETRIEVAL

Data Retrieval with Python

How to get data on the Internet?

Crawl the webpage, then interpret the content.
• Crawling
  • urllib built-in module
    – urllib.request
  • Requests (third-party library)
  • Scrapy framework (third-party; covers both crawling and interpreting)
• Interpreting
  • BeautifulSoup library
  • re module

Requests Library

• Requests is a simple, user-friendly third-party HTTP library for Python.
• Requests official site: http://www.python-requests.org/
• Basic method
  requests.get(): request the resource at a given URL, corresponding to GET in HTTP.

Respect the crawling protocol robots.txt.

Source
>>> import requests
>>> r = requests.get('https://book.douban.com/subject/1084336/comments/')
>>> r.status_code
200
>>> print(r.text)

# Add the headers argument to get() because the website has been updated
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
r = requests.get('https://book.douban.com/subject/1084336/comments/', headers = headers)
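
One way to respect robots.txt before crawling is the standard-library module urllib.robotparser. The sketch below is an assumption-laden illustration, not part of the original slides: the rules in robots_lines, the crawler name 'MyCrawler', and the example.com URLs are all hypothetical, and the rules are parsed from a list so that no network access is needed. A real crawler would typically load the site's own file with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download the
# actual file from the target site instead.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

# 'MyCrawler' is a made-up user agent name used only for this sketch.
allowed_public = parser.can_fetch('MyCrawler', 'https://example.com/subject/1/comments/')
allowed_private = parser.can_fetch('MyCrawler', 'https://example.com/private/data')

print(allowed_public)   # True
print(allowed_private)  # False
```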

Dow Jones Constituents

http://finance.yahoo.com/q/cp?s=%5EDJI+Component

http://money.cnn.com/data/dow30/

Get Dow Jones Constituents with Requests

• The data includes multiple strings:
  – 'AXP', 'American Express Company', '77.77'
  – 'BA', 'The Boeing Company', '177.83'
  – 'CAT', 'Caterpillar Inc.', '96.39'
  – …

File
# Filename: dji.py
import requests
# named r rather than re, so the name does not clash with the re module
r = requests.get('http://money.cnn.com/data/dow30/')  # the url may change
print(r.text)

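Interpreting Webpages

• BeautifulSoup is a Python library which helps extract data from HTML or XML files.
  • Official website: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
• re is the regular expression module.
  • Reference: https://docs.python.org/3.5/library/re.html

BeautifulSoup: soup.find_all('span', 'short') matches elements such as
<span class="short">I have lost count of how many times I have reread this. Every time I come back to it after a while, I gain something new. My heart softens and the fog in my mind clears. I pay more attention to others and care more about the world; how dull selfishness is. I think that writing a book which warms people's hearts and helps people in difficulty is more meaningful than many other things in the world.</span>

re: the pattern '<span class="user-stars allstar(.*?) rating"' matches elements such as
<span class="user-stars allstar50 rating" title="力荐"></span>
(the title value 力荐 means "highly recommended")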
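
The regular expression above can be applied with re.findall(). A minimal sketch, assuming the page text is already available as a string: the first span is the one shown on the slide, while the second span (allstar40) is a made-up sample added only to show multiple matches.

```python
import re

# The first span comes from the slide; the second is a hypothetical sample.
html = ('<span class="user-stars allstar50 rating" title="力荐"></span>'
        '<span class="user-stars allstar40 rating" title="推荐"></span>')

pattern = re.compile(r'<span class="user-stars allstar(.*?) rating"')
ratings = pattern.findall(html)          # the text captured by (.*?)
rating_values = [int(r) for r in ratings]

print(ratings)        # ['50', '40']
print(rating_values)  # [50, 40]
```

The non-greedy group (.*?) captures as little as possible, so each match stops at the first following " rating".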

Data Processing Using Python

SEQUENCE

Sequence

• aStr = 'Hello, World!'


• aList = [2, 3, 5, 7, 11]
• aTuple = ('Sunday', 'happy' )
• pList = [('AXP', 'American Express Company', '78.51'),
('BA', 'The Boeing Company', '184.76'),
('CAT', 'Caterpillar Inc.', '96.39'),
('CSCO', 'Cisco Systems, Inc.', '33.71'),
('CVX', 'Chevron Corporation', '106.09')]

• Strings
• Lists
• Tuples

Sequence in Python

Forward index    0         1          2            3           4         5           6
week             'Monday'  'Tuesday'  'Wednesday'  'Thursday'  'Friday'  'Saturday'  'Sunday'
Backward index   -7        -6         -5           -4          -3        -2          -1

Sequence access mode
Forward index:   0, 1, 2, …, N-2, N-1
Backward index:  -N, -(N-1), -(N-2), …, -2, -1
• Elements are accessed by index offset, starting from 0.
• One or multiple elements can be accessed at a time.

Sequence-Related Functions

Standard operators
• Value comparison
• Object identity comparison
• Boolean operation

Sequence operators
• Get: seq[index]
• Repeat: seq * expr
• Concatenate: seq1 + seq2
• Membership test: obj in seq

Built-in functions
• Sequence type conversion
• Functions available for sequence types (enumerate, reversed, sorted, zip, …)

Standard Operators

Source
>>> 'apple' < 'banana'
True
>>> [1,3,5] != [2,4,6]
True
>>> aTuple = ('BA', 'The Boeing Company', '184.76')
>>> bTuple = aTuple
>>> bTuple is not aTuple
False
>>> '86.40' < '122.64' and 'apple' > 'banana'
False

Standard Operators

Value comparison:            <   >   <=   >=   ==   !=
Object identity comparison:  is, is not
Boolean operation:           not, and, or

Sequence Operators

Source
>>> week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> print(week[1], week[-2], '\n', week[1:4], '\n', week[:6], '\n', week[::-1])
Tuesday Saturday
['Tuesday', 'Wednesday', 'Thursday']
['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
['Sunday', 'Saturday', 'Friday', 'Thursday', 'Wednesday', 'Tuesday', 'Monday']
>>> 'apple' * 3
'appleappleapple'
>>> 'pine' + 'apple'
'pineapple'
>>> 'BA' in ('BA', 'The Boeing Company', '184.76')
True

Sequence Operators

x in s
x not in s
s+t
s * n, n * s
s[i]
s[i:j]
s[i:j:k]

Sequence Type Conversion

• list()
• str()
• tuple()

Source
>>> list('Hello, World!')
['H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '!']
>>> tuple("Hello, World!")
('H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '!')

Available Functions for Sequences

enumerate()   len()   max()   min()   reversed()   sorted()   sum()   zip()

Source
>>> aStr = 'Hello, World!'
>>> len(aStr)
13
>>> sorted(aStr)
[' ', '!', ',', 'H', 'W', 'd', 'e', 'l', 'l', 'l', 'o', 'o', 'r']
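
The remaining functions in the list work in a similar way. A short sketch, using a shortened version of the stock data from the slides; the prices are written as floats here (an assumption made for this example) so that max(), min(), and sum() operate on numbers rather than strings.

```python
codes = ['AXP', 'BA', 'CAT']
prices = [78.51, 184.76, 96.39]   # floats here, unlike the strings on the slides

numbered = list(enumerate(codes, start=1))  # pairs of (position, item)
pairs = list(zip(codes, prices))            # items combined position by position
backwards = list(reversed(codes))           # reversed() returns an iterator
highest = max(prices)
lowest = min(prices)
total = sum(prices)

print(numbered)   # [(1, 'AXP'), (2, 'BA'), (3, 'CAT')]
print(pairs)      # [('AXP', 78.51), ('BA', 184.76), ('CAT', 96.39)]
print(backwards)  # ['CAT', 'BA', 'AXP']
print(highest, lowest)  # 184.76 78.51
print(round(total, 2))
```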

Data Processing Using Python

STRING

Different Formats of String

lf = [('AXP', 'American Express Company', '78.51'),
      ('BA', 'The Boeing Company', '184.76'),
      ('CAT', 'Caterpillar Inc.', '96.39'),
      ('CSCO', 'Cisco Systems, Inc.', '33.71'),
      ('CVX', 'Chevron Corporation', '106.09')]

Source
>>> aStr = 'The Boeing Company'
>>> bStr = "The Boeing Company "
>>> cStr = "I'm a student."
>>> dStr = '''The Boeing
... company'''

Example

Replace "World" in "Hello, World!" with "Python", and count the number of punctuation marks.

File
# Filename: puncount.py
aStr = "Hello, World!"
bStr = aStr[:7] + "Python!"
count = 0
for ch in bStr[:]:
    if ch in ',.!?':
        count += 1
print(count)

Output:
2

String and Output Format

Output:
Punctuation marks = 2
Output:
2
Output:
There are 2 punctuation marks.

Two ways to produce the last output:
print('There are %d punctuation marks. ' % (count))
    General form: format_string % (arguments_to_convert)
print('There are {0:d} punctuation marks. '.format(count))
    General form: format_string.format(arguments_to_convert)

Type Specifier

Type  Meaning
b     Binary format. Outputs the number in base 2.
o     Octal format. Outputs the number in base 8.
x     Hex format. Outputs the number in base 16, using lower-case letters for the digits above 9 (upper-case if 'X' is used).
c     Character. Converts the integer to the corresponding Unicode character before printing.
d     Decimal integer. Outputs the number in base 10.
f     Fixed point. Displays the number as a fixed-point number. The default precision is 6.
e     Exponent notation. Prints the number in scientific notation using the letter 'e' to indicate the exponent. The default precision is 6.

Other Available Formats

Symbol  Description
+m.nf   Output the number with its sign, keeping n digits after the decimal point, with a total width of m (if the number is longer than m, the width constraint is ignored)
<       Forces the field to be left-aligned, filling the right with spaces by default
0>5d    Forces the field to be right-aligned, filling the left with 0, with a total width of 5
^       Forces the field to be centered within the available space
{{}}    Output {}

[Alignment][Sign][Minimum width][.Precision][Type]

>>> age, height = 21, 1.758
>>> print("Age:{0:<5d}, Height:{1:5.2f}".format(age, height))
Age:21   , Height: 1.76
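
The specifiers from the two tables can be checked one by one. The sketch below is an added illustration; value and n are arbitrary sample numbers, and the last entry uses an f-string (available from Python 3.6), which is not covered on the slides but accepts the same format specifiers.

```python
value = 3.14159
n = 42

rows = [
    '{:+8.2f}'.format(value),                        # sign, width 8, 2 decimals
    '{:<6d}|'.format(n),                             # left-aligned in 6 columns
    '{:0>5d}'.format(n),                             # right-aligned, padded with 0
    '{:^6d}|'.format(n),                             # centered in 6 columns
    '{:b} {:o} {:x} {:X}'.format(10, 10, 255, 255),  # bases 2, 8, 16
    '{:c}'.format(65),                               # code point to character
    '{:e}'.format(1234.5),                           # scientific notation
    '{{}}'.format(),                                 # literal braces
    f'{value:8.2f}',                                 # f-string, same specifier
]

for row in rows:
    print(repr(row))
# '   +3.14'
# '42    |'
# '00042'
# '  42  |'
# '1010 12 ff FF'
# 'A'
# '1.234500e+03'
# '{}'
# '    3.14'
```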

Use format() to Output Formatted Strings

Source
>>> cCode = ['AXP', 'BA', 'CAT', 'CSCO', 'CVX']
>>> cPrice = ['78.51', '184.76', '96.39', '33.71', '106.09']
>>> for i in range(5):
...     print('{:<8d}{:8s}{:8s}'.format(i, cCode[i], cPrice[i]))
...
0       AXP     78.51
1       BA      184.76
2       CAT     96.39
3       CSCO    33.71
4       CVX     106.09
>>> print('I get {:d}{{}}!'.format(32))
I get 32{}!

String Application

Determine whether the string "acdhdca" is a palindrome, and whether 354435 is a palindrome.

File
# Filename: compare.py
sStr = "acdhdca"
if sStr == ''.join(reversed(sStr)):
    print('Yes')
else:
    print('No')

File
# Filename: compare.py
import operator
sStr = "acdhdca"
if operator.eq(sStr, ''.join(reversed(sStr))) == 1:
    print('Yes')
else:
    print('No')

An alternative test using slicing: sStr == sStr[::-1]
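
The slide also asks about the number 354435, which the listings do not cover. One possible approach, sketched here with a hypothetical helper named is_palindrome, is to convert the value to a string first and then apply the slicing test.

```python
def is_palindrome(value):
    """Return True if value reads the same forwards and backwards."""
    text = str(value)            # lets numbers be handled like strings
    return text == text[::-1]    # compare with the reversed copy


print(is_palindrome('acdhdca'))  # True
print(is_palindrome(354435))     # False (reversed it reads 534453)
```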

Useful Methods for Strings

capitalize() center() count() encode() endswith() find()

format() index() isalnum() isalpha() isdigit() islower()

isspace() istitle() isupper() join() ljust() lower()

lstrip() maketrans() partition() replace() rfind() rindex()

rjust() rpartition() rstrip() split() splitlines() startswith()

strip() swapcase() title() translate() upper() zfill()

Application of String

Some downloaded content has the following format:

What do you think of this saying "No pain, No gain"?

For the content between the double quotes, first determine whether it is in title format, then convert the string into title format and output it.

File
# Filename: totitle.py
aStr = 'What do you think of this saying "No pain, No gain"?'
lindex = aStr.index('\"', 0, len(aStr))
rindex = aStr.rindex('\"', 0, len(aStr))
tempStr = aStr[lindex+1:rindex]
if tempStr.istitle():
    print('It is title format.')
else:
    print('It is not title format.')
print(tempStr.title())

An alternative way to extract the quoted text: tempStr = aStr.split("\"")[1]

Escape Characters

Character           Meaning
\0                  Null character
\a                  ASCII Bell (BEL)
\b                  ASCII Backspace (BS)
\t                  ASCII Horizontal Tab (TAB)
\n                  ASCII Linefeed (LF)
\v                  ASCII Vertical Tab (VT)
\f                  ASCII Formfeed (FF)
\r                  ASCII Carriage Return (CR)
\"                  Double quote (")
\'                  Single quote (')
\\                  Backslash (\)
\ (at end of line)  Backslash and newline ignored
\ooo                Character with octal value ooo
\xXX                Character with hex value XX

Source
>>> aStr = '\101\t\x41\n'
>>> bStr = '\141\t\x61\n'
>>> print(aStr, bStr)
A       A
 a      a

Data Processing Using Python

LIST

List

• A list is a scalable (resizable) container object.

Source
>>> aList = list('Hello.')
>>> aList
['H', 'e', 'l', 'l', 'o', '.']
>>> aList = list('hello.')
>>> aList
['h', 'e', 'l', 'l', 'o', '.']
>>> aList[0] = 'H'
>>> aList
['H', 'e', 'l', 'l', 'o', '.']

• A list can contain different types of objects.

Source
>>> bList = [1, 2, 'a', 3.5]

Format of List

• aList = [1, 2, 3, 4, 5]
• names = ['Zhao', 'Qian', 'Sun', 'Li']
• bList = [3, 2, 1, 'Action']
• pList = [('AXP', 'American Express Company', '78.51'),
('BA', 'The Boeing Company', '184.76'),
('CAT', 'Caterpillar Inc.', '96.39'),
('CSCO', 'Cisco Systems, Inc.', '33.71'),
('CVX', 'Chevron Corporation', '106.09')]

List

A school holds a singing competition. Each singer's score is decided by 10 judges and the audience. The rule is to remove the highest and the lowest of the 10 judges' scores, then average the remaining scores together with the audience score.
Judges: 9, 9, 8.5, 10, 7, 8, 8, 9, 8, and 10
Audience: 9
Compute the final result.

File
# Filename: scoring.py
jScores = [9, 9, 8.5, 10, 7, 8, 8, 9, 8, 10]
aScore = 9
jScores.sort()
jScores.pop()
jScores.pop(0)
jScores.append(aScore)
aveScore = sum(jScores)/len(jScores)
print(aveScore)

jScores after sort(), after the two pop() calls, and after append():
[7, 8, 8, 8, 8.5, 9, 9, 9, 10, 10]
[8, 8, 8, 8.5, 9, 9, 9, 10]
[8, 8, 8, 8.5, 9, 9, 9, 10, 9]

Output:
8.72222222222

List

Merge the weekday list (['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']) with the weekend list (['Saturday', 'Sunday']), add sequence numbers, and display the result.

File
# Filename: week.py
week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
weekend = ['Saturday', 'Sunday']
week.extend(weekend)
for i, j in enumerate(week):
    print(i+1, j)

Output:
1 Monday
2 Tuesday
3 Wednesday
4 Thursday
5 Friday
6 Saturday
7 Sunday

List Methods

append()   copy()   count()   extend()   index()   insert()   pop()   remove()   reverse()   sort()

Parameters
list.sort(key=None, reverse=False)

Source
>>> numList = [3, 11, 5, 8, 16, 1]
>>> fruitList = ['apple', 'banana', 'pear', 'lemon', 'avocado']
>>> numList.sort(reverse = True)
>>> numList
[16, 11, 8, 5, 3, 1]
>>> fruitList.sort(key = len)
>>> fruitList
['pear', 'apple', 'lemon', 'banana', 'avocado']

List Comprehension

List comprehensions (list comps)
• Dynamically create lists
• Easy, flexible, and useful

[ expression for expr1 in sequence1
             for expr2 in sequence2 ...
             for exprN in sequenceN
             if condition ]

Generator expression
• Lazy evaluation

Source
>>> [x for x in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [x ** 2 for x in range(10)]
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
>>> [x ** 2 for x in range(10) if x ** 2 < 50]
[0, 1, 4, 9, 16, 25, 36, 49]
>>> [(x+1, y+1) for x in range(2) for y in range(2)]
[(1, 1), (1, 2), (2, 1), (2, 2)]
>>> sum(x for x in range(10))
45
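
Lazy evaluation means that a generator expression produces its values one at a time, on demand, instead of building the whole list up front. A short sketch of the difference, using illustrative variable names:

```python
# A list comprehension builds every element immediately.
squares_list = [x ** 2 for x in range(10) if x ** 2 < 50]

# A generator expression (round brackets) computes values only when asked.
squares_gen = (x ** 2 for x in range(10) if x ** 2 < 50)

first = next(squares_gen)      # computes just the first value
second = next(squares_gen)     # then the second
rest = list(squares_gen)       # consumes whatever is left
exhausted = list(squares_gen)  # a generator can be iterated only once

print(squares_list)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(first, second) # 0 1
print(rest)          # [4, 9, 16, 25, 36, 49]
print(exhausted)     # []
```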

Data Processing Using Python

TUPLE

Tuple

• Basic operations on tuples are similar to those on lists.

Source
>>> 2014
2014
>>> 2014,
(2014,)

Source
>>> bTuple = (['Monday', 1], 2, 3)
>>> bTuple
(['Monday', 1], 2, 3)
>>> bTuple[0][1]
1
>>> len(bTuple)
3
>>> bTuple[1:]
(2, 3)

Tuple

• List elements are mutable.
• Tuple elements are immutable.

Source
>>> aList = ['AXP', 'BA', 'CAT']
>>> aTuple = ('AXP', 'BA', 'CAT')
>>> aList[1] = 'Alibiabia'
>>> print(aList)
['AXP', 'Alibiabia', 'CAT']
>>> aTuple[1] = 'Alibiabia'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> print(aTuple)
('AXP', 'BA', 'CAT')
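
Immutability applies to the tuple's own slots, not to the objects stored in them. The sketch below is an added illustration that reuses the bTuple value from the earlier slide: replacing an element fails, but a list held inside the tuple can still be changed in place.

```python
bTuple = (['Monday', 1], 2, 3)

try:
    bTuple[1] = 20             # replacing a tuple element is not allowed
    replacing_failed = False
except TypeError:
    replacing_failed = True

bTuple[0][1] = 7               # allowed: this mutates the list inside the tuple

print(replacing_failed)  # True
print(bTuple)            # (['Monday', 7], 2, 3)
```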

Tuple

• Type of function

Source
>>> aList = [3, 5, 2, 4]
>>> aList
[3, 5, 2, 4]
>>> sorted(aList)
[2, 3, 4, 5]
>>> aList
[3, 5, 2, 4]
>>> aList.sort()
>>> aList
[2, 3, 4, 5]

Source
>>> aTuple = (3, 5, 2, 4)
>>> sorted(aTuple)
[2, 3, 4, 5]
>>> aTuple
(3, 5, 2, 4)
>>> aTuple.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'sort'

Application of Tuple

Where to use?

Variable-Length Positional Parameter (Tuple)

Parameter types in Python functions:
• Positional or keyword parameter
• Positional-only parameter
• Variable-length positional parameter
• Variable-length keyword parameter with default value

Source
>>> def foo(args1, args2 = 'World!'):
...     print(args1, args2)
...
>>> foo('Hello,')
Hello, World!
>>> foo('Hello,', args2 = 'Python!')
Hello, Python!
>>> foo(args2 = 'Apple!', args1 = 'Hello,')
Hello, Apple!

Source
>>> def foo(args1, *argst):
...     print(args1)
...     print(argst)
...
>>> foo('Hello,', 'Wangdachui', 'Niuyun', 'Linling')
Hello,
('Wangdachui', 'Niuyun', 'Linling')
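
The variable-length keyword parameter from the list has no example on the slides. A possible sketch follows: extra positional arguments are collected into a tuple and extra keyword arguments into a dict. The parameter name argsd and the keyword names greeting and count are made up for this illustration.

```python
def foo(args1, *argst, **argsd):
    """Return the arguments as Python collected them."""
    # argst is a tuple of the extra positional arguments,
    # argsd is a dict of the extra keyword arguments.
    return args1, argst, argsd


result = foo('Hello,', 'Wangdachui', 'Niuyun', greeting='Hi', count=2)
print(result)
# ('Hello,', ('Wangdachui', 'Niuyun'), {'greeting': 'Hi', 'count': 2})
```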

Tuple as a Return Type

Number of return values   Return type
0                         None
1                         object
>1                        tuple

Source
>>> def foo():
...     return 1, 2, 3
...
>>> foo()
(1, 2, 3)
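
A returned tuple is usually unpacked straight into separate names. The following sketch uses two hypothetical functions, min_max_mean and nothing, to show the tuple case and the case with no return value.

```python
def min_max_mean(values):
    """Return three results at once; Python packs them into a tuple."""
    return min(values), max(values), sum(values) / len(values)


def nothing():
    """A function without a return statement returns None."""
    pass


result = min_max_mean([9, 8.5, 10, 7])
low, high, mean = result      # tuple unpacking

print(result)           # (7, 10, 8.625)
print(low, high, mean)  # 7 10 8.625
print(nothing())        # None
```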

Summary
