Chapter 3 Python For Data Science
Unit-03
Capturing, Preparing
and Working with data
Outline
Looping
readlinesfor.py
f = open('college.txt')
lines = f.readlines()
for l in lines:
    print(l)

Output
Madhuben & Bhanubhai Patel Institute of Technology - Anand
Beyond Vitthal Udyognagar Anand,
Gujarat-388121, INDIA
How to write a path?
We can pass a relative path as the argument to the open method; alternatively, we can specify an absolute path.
To specify an absolute path:
In Windows, f = open('D:\\folder\\subfolder\\filename.txt')
In Mac & Linux, f = open('/user/folder/subfolder/filename.txt')
We are supposed to close the file once we are done using it, with the close() method.
closefile.py
f = open('college.txt')
data = f.read()
print(data)
f.close()
Handling errors using the "with" keyword
We may have a typo in the filename, or the file we specified may have been moved or deleted; in such cases there will be an error while running the program.
To handle such situations we can open the file using the with keyword.
fileusingwith.py
with open('college.txt') as f:
    data = f.read()
    print(data)
When we open a file using with, we do not need to close it explicitly; it is closed automatically when the block exits.
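A quick way to confirm this behaviour is to check the file object's closed attribute after the with block (a small self-contained sketch using a temporary file in place of college.txt):

```python
import os
import tempfile

# create a throwaway file to read (a stand-in for college.txt)
path = os.path.join(tempfile.mkdtemp(), 'college.txt')
with open(path, 'w') as f:
    f.write('MBIT')

with open(path) as f:
    data = f.read()

# the file was closed automatically when the with block exited
print(data)      # MBIT
print(f.closed)  # True
```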
Example : Write file in Python
write() method will write the specified data to the file.
writedemo.py
with open('college.txt', 'a') as f:
    f.write('Hello world')
If we open a file in 'w' mode it will overwrite the data in the existing file, or will create a new file if the file does not exist.
If we open a file in 'a' mode it will append the data at the end of the existing file, or will create a new file if the file does not exist.
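The difference between the two modes can be seen in a short sketch (using a temporary file so it is safe to run anywhere):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

with open(path, 'w') as f:   # 'w' creates the file (or overwrites it)
    f.write('Hello')
with open(path, 'a') as f:   # 'a' appends at the end of the existing file
    f.write(' world')
with open(path) as f:
    appended = f.read()      # 'Hello world'

with open(path, 'w') as f:   # opening with 'w' again discards the old content
    f.write('fresh')
with open(path) as f:
    overwritten = f.read()   # 'fresh'

print(appended, '|', overwritten)
```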
Reading CSV files without any library functions
A comma-separated values (CSV) file is a delimited text file that uses a comma to separate values.
Each line of the file is a data record; each record consists of one or more fields, separated by commas.
Example : Book1.csv
studentname,enrollment,cpi
abcd,123456,8.5
bcde,456789,2.5
cdef,321654,7.6

readlines.py
with open('Book1.csv') as f:
    rows = f.readlines()
    isFirstLine = True
    for r in rows:
        if isFirstLine:
            isFirstLine = False
            continue
        cols = r.split(',')
        print('Student Name = ', cols[0], end=" ")
        print('\tEn. No. = ', cols[1], end=" ")
        print('\tCPI = \t', cols[2])

We can use Microsoft Excel to access CSV files.
In the later sessions we will access CSV files using pandas.
Unit-03.01
Let's Learn
NumPy
NumPy
NumPy (Numerical Python) is a Python library used to manipulate arrays.
Almost all data-science libraries in Python rely on NumPy as one of their main building blocks.
NumPy provides functions for domains like linear algebra, Fourier transforms, etc.
NumPy is incredibly fast, as it has bindings to C libraries.
Install :
conda install numpy
OR pip install numpy
NumPy Array
The most important object defined in NumPy is an N-dimensional array type called ndarray.
It describes a collection of items of the same type; items in the collection can be accessed
using a zero-based index.
An instance of ndarray class can be constructed in many different ways, the basic ndarray can
be created as below.
syntax
import numpy as np
a = np.array(sequence)  # e.g. a list or a tuple

numpyarray.py
import numpy as np
a = np.array(['MBIT', 'College', 'Anand'])
print(type(a))
print(a)

Output
<class 'numpy.ndarray'>
['MBIT' 'College' 'Anand']
NumPy Array (Cont.)
arange(start,end,step) function will create NumPy array starting from start till end (not included)
with specified steps.
numpyarange.py
import numpy as np
b = np.arange(0, 10, 1)
print(b)

Output
[0 1 2 3 4 5 6 7 8 9]
zeros(n) function will return a NumPy array of the given shape, filled with zeros.

numpyzeros.py
import numpy as np
c = np.zeros(3)
print(c)
c1 = np.zeros((3,3))  # shape must be given as a tuple
print(c1)

Output
[0. 0. 0.]
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
ones(n) function will return NumPy array of given shape, filled with ones.
NumPy Array (Cont.)
eye(n) function will create 2-D NumPy array with ones on the diagonal and zeros elsewhere.
numpyeye.py
import numpy as np
b = np.eye(3)
print(b)

Output
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
linspace(start,stop,num) function will return evenly spaced numbers over a specified interval.
numpylinspace.py
import numpy as np
c = np.linspace(0, 1, 11)
print(c)

Output
[0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
Note: in the arange function we give start, stop & step, whereas in the linspace function we
give start, stop & the number of elements we want.
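The two calls below produce the same ten numbers, showing the difference in how the spacing is specified:

```python
import numpy as np

a = np.arange(0, 10, 1)     # start, stop (excluded), step  -> 10 elements
b = np.linspace(0, 9, 10)   # start, stop (included), count -> 10 elements
print(a)  # [0 1 2 3 4 5 6 7 8 9]
print(b)  # [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
```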
Array Shape in NumPy
We can grab the shape of ndarray using its shape property.
numpyshape.py
import numpy as np
b = np.zeros((3,3))
print(b.shape)

Output
(3, 3)
Note: when reshaping, the number of elements must stay the same — the product of the new dimensions must equal the original element count.
Example: here we have an old one-dimensional array of 10 elements and the reshaped shape is (5,2); 5 * 2 = 10, so it is a valid reshape.
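A minimal sketch of the reshape rule described above:

```python
import numpy as np

old = np.arange(10)        # one-dimensional array of 10 elements
new = old.reshape(5, 2)    # 5 * 2 = 10, so this is a valid reshape
print(new.shape)           # (5, 2)
# old.reshape(3, 4) would raise ValueError, because 3 * 4 != 10
```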
NumPy Random
rand(d0, d1, …, dn) function will create an n-dimensional array with random data drawn from a
uniform distribution; if we do not specify any parameter it will return a random float.

numpyrand.py
import numpy as np
r1 = np.random.rand()
print(r1)
r2 = np.random.rand(3, 2)  # dimensions, not a tuple
print(r2)

Output
0.23937253208490505
[[0.58924723 0.09677878]
 [0.97945337 0.76537675]
 [0.73097381 0.51277276]]
randint(low, high, num) function will create a one-dimensional array of num random integers
between low (inclusive) and high (exclusive).

numpyrandint.py
import numpy as np
r3 = np.random.randint(1, 100, 10)
print(r3)

Output
[78 78 17 98 19 26 81 67 23 24]
We can reshape the array in any shape using reshape method, which we learned in previous
slide.
NumPy Random (Cont.)
randn(d0, d1, …, dn) function will create an n-dimensional array with random data drawn from the
standard normal distribution; if we do not specify any parameter it will return a random float.

numpyrandn.py
import numpy as np
r1 = np.random.randn()
print(r1)
r2 = np.random.randn(3, 2)  # dimensions, not a tuple
print(r2)

Output
-0.15359861758111037
[[ 0.40967905 -0.21974532]
 [-0.90341482 -0.69779498]
 [ 0.99444948 -1.45308348]]
Note: the rand function generates random numbers using a uniform distribution, whereas the
randn function generates random numbers using the standard normal distribution.
We are going to see the difference using a visualization technique (as data scientists, we have
to use visualization techniques to convince the audience).
Visualizing the difference between rand & randn
We are going to use matplotlib library to visualize the difference.
You need not worry if you do not follow the matplotlib syntax yet; we are going to learn it in detail in Unit-4.
matplotdemo.py
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
samplesize = 100000
uniform = np.random.rand(samplesize)
normal = np.random.randn(samplesize)
plt.hist(uniform, bins=100)
plt.title('rand: uniform')
plt.show()
plt.hist(normal, bins=100)
plt.title('randn: normal')
plt.show()
Aggregations
min() function will return the minimum value from the ndarray; there are two ways in which we
can use the min function, examples of both are given below.

numpymin.py
import numpy as np
l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
a = np.array(l)
print('Min way1 = ', a.min())
print('Min way2 = ', np.min(a))

Output
Min way1 = 1
Min way2 = 1
max() function will return the maximum value from the ndarray; there are two ways in which we
can use the max function, examples of both are given below.

numpymax.py
import numpy as np
l = [1,5,3,8,2,3,6,7,5,2,9,11,2,5,3,4,8,9,3,1,9,3]
a = np.array(l)
print('Max way1 = ', a.max())
print('Max way2 = ', np.max(a))

Output
Max way1 = 11
Max way2 = 11
Aggregations (Cont.)
NumPy support many aggregation functions such as min, max, argmin, argmax, sum, mean, std,
etc…
numpyagg.py
import numpy as np
l = [7,5,3,1,8,2,3,6,11,5,2,9,10,2,5,3,7,8,9,3,1,9,3]
a = np.array(l)
print('Min = ', a.min())
print('ArgMin = ', a.argmin())
print('Max = ', a.max())
print('ArgMax = ', a.argmax())
print('Sum = ', a.sum())
print('Mean = ', a.mean())
print('Std = ', a.std())

Output
Min = 1
ArgMin = 3
Max = 11
ArgMax = 8
Sum = 122
Mean = 5.304347826086956
Std = 3.042235771223635
Using axis argument with aggregate functions
When we apply an aggregate function to a multidimensional ndarray without arguments, it is
applied over all elements, across every dimension (axis).
numpyaxis.py
import numpy as np
array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print('sum = ', array2d.sum())

Output
sum = 45
If we want to get sum of rows or cols we can use axis argument with the aggregate functions.
numpyaxis.py
import numpy as np
array2d = np.array([[1,2,3],[4,5,6],[7,8,9]])
print('sum (cols) = ', array2d.sum(axis=0))  # vertical
print('sum (rows) = ', array2d.sum(axis=1))  # horizontal

Output
sum (cols) = [12 15 18]
sum (rows) = [ 6 15 24]
Single V/S Double bracket notations
There are two ways in which you can access an element of a multi-dimensional array; examples of
both methods are given below.

numpybrackets.py
import numpy as np
arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
print('double = ', arr[2][1])  # double bracket notation
print('single = ', arr[2,1])   # single bracket notation

Output
double = h
single = h
Both methods are valid and give exactly the same answer, but single bracket notation is
recommended: with double bracket notation NumPy first creates a temporary sub-array for the
third row and then fetches the second column from it.
Single bracket notation is also easier to read and write while programming.
Slicing ndarray
Slicing in python means taking elements from one given index to another given index.
Similar to Python List, we can use same syntax array[start:end:step] to slice ndarray.
Default start is 0
Default end is length of the array
Default step is 1
numpyslice1d.py
import numpy as np
arr = np.array(['a','b','c','d','e','f','g','h'])
print(arr[2:5])
print(arr[:5])
print(arr[5:])
print(arr[2:7:2])
print(arr[::-1])

Output
['c' 'd' 'e']
['a' 'b' 'c' 'd' 'e']
['f' 'g' 'h']
['c' 'e' 'g']
['h' 'g' 'f' 'e' 'd' 'c' 'b' 'a']
Array Slicing Example
Example: a 5 x 5 array

a = [[ 1  2  3  4  5]
     [ 6  7  8  9 10]
     [11 12 13 14 15]
     [16 17 18 19 20]
     [21 22 23 24 25]]

a[2][3]    = 14
a[2,3]     = 14
a[2]       = [11 12 13 14 15]
a[0:2]     = the first two rows
a[0:2:2]   = row 0 only
a[::-1]    = the rows in reverse order
a[1:3,1:3] = [[ 7  8]
              [12 13]]
a[3:,:3]   = [[16 17 18]
              [21 22 23]]
a[:,::-1]  = the columns in reverse order
Slicing multi-dimensional array
Slicing a multi-dimensional array works the same as for a single-dimensional array, with the help
of the single bracket notation we learned earlier; let's see an example.
numpyslice2d.py
import numpy as np
arr = np.array([['a','b','c'],['d','e','f'],['g','h','i']])
print(arr[0:2, 0:2])   # first two rows and cols
print(arr[::-1])       # reversed rows
print(arr[:, ::-1])    # reversed cols
print(arr[::-1, ::-1]) # complete reverse

Output
[['a' 'b']
 ['d' 'e']]
[['g' 'h' 'i']
 ['d' 'e' 'f']
 ['a' 'b' 'c']]
[['c' 'b' 'a']
 ['f' 'e' 'd']
 ['i' 'h' 'g']]
[['i' 'h' 'g']
 ['f' 'e' 'd']
 ['c' 'b' 'a']]
Warning : Array slicing is mutable!
A slice is a view, not a copy: when we slice an array and apply an operation to the slice, it also
changes the original array, because slicing does not copy the data.
Example,
numpyslicemut.py
import numpy as np
arr = np.array([1,2,3,4,5])
arrsliced = arr[0:3]

arrsliced[:] = 2  # broadcasting

print('Original Array = ', arr)
print('Sliced Array = ', arrsliced)

Output
Original Array = [2 2 2 4 5]
Sliced Array = [2 2 2]
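If we want an independent copy we can call the copy() method on the slice; changes made to the copy then leave the original array untouched:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
arrcopy = arr[0:3].copy()   # an explicit copy instead of a view
arrcopy[:] = 2              # broadcasting affects only the copy

print('Original Array = ', arr)    # [1 2 3 4 5]
print('Copied Array = ', arrcopy)  # [2 2 2]
```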
NumPy Arithmetic Operations
numpyop.py
import numpy as np
arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])

arradd1 = arr1 + 2      # addition of matrix and scalar
arradd2 = arr1 + arr2   # addition of two matrices
print('Addition Scalar = ', arradd1)
print('Addition Matrix = ', arradd2)

arrsub1 = arr1 - 2      # subtraction of scalar from matrix
arrsub2 = arr1 - arr2   # subtraction of two matrices
print('Subtraction Scalar = ', arrsub1)
print('Subtraction Matrix = ', arrsub2)

arrdiv1 = arr1 / 2      # division of matrix by scalar
arrdiv2 = arr1 / arr2   # element-wise division of two matrices
print('Division Scalar = ', arrdiv1)
print('Division Matrix = ', arrdiv2)

Output
Addition Scalar = [[3 4 5]
 [3 4 5]
 [3 4 5]]
Addition Matrix = [[5 7 9]
 [5 7 9]
 [5 7 9]]
Subtraction Scalar = [[-1 0 1]
 [-1 0 1]
 [-1 0 1]]
Subtraction Matrix = [[-3 -3 -3]
 [-3 -3 -3]
 [-3 -3 -3]]
Division Scalar = [[0.5 1.  1.5]
 [0.5 1.  1.5]
 [0.5 1.  1.5]]
Division Matrix = [[0.25 0.4  0.5 ]
 [0.25 0.4  0.5 ]
 [0.25 0.4  0.5 ]]
NumPy Arithmetic Operations (Cont.)
numpyop2.py
import numpy as np
arr1 = np.array([[1,2,3],[1,2,3],[1,2,3]])
arr2 = np.array([[4,5,6],[4,5,6],[4,5,6]])

arrmul1 = arr1 * 2      # multiply matrix by scalar
arrmul2 = arr1 * arr2   # element-wise multiply (note: NOT matrix multiplication)
print('Multiply Scalar = ', arrmul1)
print('Multiply Matrix = ', arrmul2)

# In order to do matrix multiplication
arrmatmul = np.matmul(arr1, arr2)
print('Matrix Multiplication = ', arrmatmul)
# OR
arrdot = arr1.dot(arr2)
print('Dot = ', arrdot)
# OR (Python 3.5+)
arrpy3dot5plus = arr1 @ arr2
print('Python 3.5+ support = ', arrpy3dot5plus)

Output
Multiply Scalar = [[2 4 6]
 [2 4 6]
 [2 4 6]]
Multiply Matrix = [[ 4 10 18]
 [ 4 10 18]
 [ 4 10 18]]
Matrix Multiplication = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Dot = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Python 3.5+ support = [[24 30 36]
 [24 30 36]
 [24 30 36]]
Sorting Array
The np.sort() function returns a sorted copy of the input array, while the arr.sort() method
sorts the array in place.

syntax
import numpy as np
# arr = our ndarray
np.sort(arr, axis, kind, order)
# OR arr.sort()

Parameters
arr = array to sort
axis = axis along which to sort (default -1, the last axis)
kind = sorting algorithm to use ('quicksort' <- default, 'mergesort', 'heapsort')
order = field(s) to sort on (when the array has multiple fields)
Example :
numpysort.py
import numpy as np
arr = np.array(['MBIT','Anand','College','of','Engineering'])
print("Before Sorting = ", arr)
arr.sort()  # in place; np.sort(arr) would return a sorted copy instead
print("After Sorting = ", arr)

Output
Before Sorting = ['MBIT' 'Anand' 'College' 'of' 'Engineering']
After Sorting = ['Anand' 'College' 'Engineering' 'MBIT' 'of']
Sort Array Example
numpysort2.py
import numpy as np
dt = np.dtype([('name', 'S10'), ('age', int)])
arr2 = np.array([('MBIT',200),('ABC',300),('XYZ',100)], dtype=dt)
arr2.sort(order='name')
print(arr2)

Output
[(b'ABC', 300) (b'MBIT', 200) (b'XYZ', 100)]
Conditional Selection
Similar to arithmetic operations, when we apply a comparison operator to a NumPy array, it is
applied to each element in the array, and a new boolean NumPy array is created with values
True or False.
numpycond1.py
import numpy as np
arr = np.random.randint(1, 100, 10)
print(arr)
boolArr = arr > 50
print(boolArr)

Output
[25 17 24 15 17 97 42 10 67 22]
[False False False False False True False False True False]
numpycond2.py
import numpy as np
arr = np.random.randint(1, 100, 10)
print("All = ", arr)
boolArr = arr > 50
print("Filtered = ", arr[boolArr])

Output
All = [31 94 25 70 23 9 11 77 48 11]
Filtered = [94 70 77]
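Conditions can also be combined with the element-wise operators & (and) and | (or); the parentheses are required because & binds tighter than the comparisons. A sketch on a fixed array (so the result is reproducible):

```python
import numpy as np

arr = np.array([31, 94, 25, 70, 23, 9, 11, 77, 48, 11])
mask = (arr > 20) & (arr < 75)   # element-wise AND of two boolean arrays
print(arr[mask])                 # [31 25 70 23 48]
```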
Python for Data Science (PDS) (3150713)
Unit-03.02
Let's Learn
Pandas
Pandas
Pandas is an open-source library built on top of NumPy.
It allows for fast data cleaning, preparation and analysis.
It excels in performance and productivity.
It also has built-in visualization features.
It can work with data from a wide variety of sources.
Install :
conda install pandas
OR pip install pandas
Outline
Pandas
✓ Series
✓ Data Frames
✓ Accessing text, CSV, Excel files using pandas
✓ Accessing SQL Database
✓ Missing Data
✓ Group By
✓ Merging, Joining & Concatenating
✓ Operations
Series
A Series is a one-dimensional array with axis labels.
It supports both integer- and label-based indexing, but the index must be of a hashable type.
If we do not specify an index, it will assign a zero-based integer index.
syntax Parameters
import pandas as pd data = array like Iterable
s = pd.Series(data,index,dtype,copy=False) index = array like index
dtype = data-type
copy = bool, default is False
pandasSeries.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print(s)

Output
0     1
1     3
2     5
3     7
4     9
5    11
dtype: int64
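If we pass our own labels through the index parameter, the elements can then be accessed by label as well as by position (a small sketch with made-up subject marks):

```python
import pandas as pd

marks = pd.Series([85, 47, 31], index=['PDS', 'Algo', 'SE'])
print(marks['PDS'])    # label-based access -> 85
print(marks.iloc[0])   # positional access  -> 85
```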
Series (Cont.)
We can access the elements inside a Series just like an array, using square bracket notation.

pdSeriesEle.py
import pandas as pd
s = pd.Series([1, 3, 5, 7, 9, 11])
print("S[0] = ", s[0])
b = s[0] + s[1]
print("Sum = ", b)

Output
S[0] = 1
Sum = 4
Deleting Row

dfDelRow.py
df.drop('103', inplace=True)
print(df)

Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
104   66    83  70   50
105   65    88  87   87
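The DataFrame df used on these slides is created on an earlier slide; it can be reconstructed from the outputs shown here (marks copied from the tables above; the integer index used below is an assumption for this sketch — the original may use string labels):

```python
import pandas as pd

# subject marks per enrollment number, copied from the slide outputs
df = pd.DataFrame(
    {'PDS':  [0, 85, 35, 66, 65],
     'Algo': [23, 47, 34, 83, 88],
     'SE':   [93, 31, 6, 70, 87],
     'INS':  [46, 12, 89, 50, 87]},
    index=[101, 102, 103, 104, 105])
print(df)
```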
Data Frames (Cont.)
Creating new column

dfCreateCol.py
df['total'] = df['PDS'] + df['Algo'] + df['SE'] + df['INS']
print(df)

Output
     PDS  Algo  SE  INS  total
101    0    23  93   46    162
102   85    47  31   12    175
103   35    34   6   89    164
104   66    83  70   50    269
105   65    88  87   87    327

Deleting column

dfDelCol.py
df.drop('total', axis=1, inplace=True)
print(df)

Output
     PDS  Algo  SE  INS
101    0    23  93   46
102   85    47  31   12
103   35    34   6   89
104   66    83  70   50
105   65    88  87   87
Data Frames (Cont.)
Getting Subset of Data Frame

dfGrabSubSet.py
print(df.loc[[101,104], ['PDS','INS']])

Output
     PDS  INS
101    0   46
104   66   50
Then, create a database connection string and create an engine using it.

createEngine.py
from sqlalchemy import create_engine
db_connection_str = 'mysql+pymysql://username:password@host/dbname'
db_connection = create_engine(db_connection_str)
Read from MySQL Database (Cont.)
After getting the engine, we can run any SQL query using the pd.read_sql method.
read_sql is a generic method which can be used to read from any SQL database (MySQL, MSSQL,
Oracle, etc.).
readSQLDemo.py
1 df = pd.read_sql('SELECT * FROM cities', con=db_connection)
2 print(df)
Output
CityID CityName CityDescription CityCode
0 1 Rajkot Rajkot Description here RJT
1 2 Ahemdabad Ahemdabad Description here ADI
2 3 Surat Surat Description here SRT
What Is Web Scraping?
Web scraping is the process of gathering information from the Internet. Even copying and
pasting the lyrics of your favorite song is a form of web scraping! However, the words “web
scraping” usually refer to a process that involves automation. Some websites don’t like it
when automatic scrapers gather their data, while others don’t mind.
Challenges of Web Scraping
The Web has grown organically out of many sources. It combines many different
technologies, styles, and personalities, and it continues to grow to this day. In other
words, the Web is a hot mess! Because of this, you’ll run into some challenges when
scraping the Web:
• Variety: Every website is different. While you’ll encounter general structures that repeat
themselves, each website is unique and will need personal treatment if you want to
extract the relevant information.
• Durability: Websites constantly change. Say you’ve built a shiny new web scraper that
automatically cherry-picks what you want from your resource of interest. The first time
you run your script, it works flawlessly. But when you run the same script only a short
while later, you run into a discouraging and lengthy stack of tracebacks!
An Alternative to Web Scraping: APIs
Some website providers offer application programming interfaces (APIs) that allow you to
access their data in a predefined manner. With APIs, you can avoid parsing HTML. Instead,
you can access the data directly using formats like JSON and XML. HTML is primarily a way
to present content to users visually.
When you use an API, the process is generally more stable than gathering the data through
web scraping. That’s because developers create APIs to be consumed by programs rather
than by human eyes.
The front-end presentation of a site might change often, but such a change in the
website’s design doesn’t affect its API structure. The structure of an API is usually more
permanent, which means it’s a more reliable source of the site’s data.
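As a tiny illustration of why an API response is easier to consume than HTML, the JSON below needs no parsing of markup at all (the payload is invented for this sketch, not from a real endpoint):

```python
import json

# hypothetical API response body
payload = '{"jobs": [{"title": "Python Developer", "location": "Remote"}]}'

data = json.loads(payload)   # JSON text -> Python dicts and lists
first = data["jobs"][0]
print(first["title"], '-', first["location"])  # Python Developer - Remote
```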
Parse HTML Code With Beautiful Soup
Beautiful Soup is a Python library for parsing structured data. It allows you to interact
with HTML in a similar way to how you interact with a web page using developer tools. The
library exposes a couple of intuitive functions you can use to explore the HTML you
received. To get started, use your terminal to install Beautiful Soup:
$ python -m pip install beautifulsoup4
Step 1: Inspect Your Data Source
Before you write any Python code, you need to get to know the website that you want to
scrape. That should be your first step for any web scraping project you want to tackle.
You’ll need to understand the site structure to extract the information that’s relevant for
you. Start by opening the site you want to scrape with your favorite browser.
Step 2: Scrape HTML Content From a Page
Now that you have an idea of what you’re working with, it’s time to start using Python. First,
you’ll want to get the site’s HTML code into your Python script so that you can interact with it.
For this task, you’ll use Python’s requests library.
$ python -m pip install requests
import requests
URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
print(page.text)
This code issues an HTTP GET request to the given URL. It retrieves the HTML data that the
server sends back and stores that data in a Python object.
If you print the .text attribute of page, then you’ll notice that it looks just like the HTML that you
inspected earlier with your browser’s developer tools. You successfully fetched the static site
content from the Internet! You now have access to the site’s HTML from within your Python
script.
Static Websites
The website that you’re scraping in this tutorial serves static HTML content. In this
scenario, the server that hosts the site sends back HTML documents that already contain
all the data that you’ll get to see as a user.
<div class="card">
  <div class="card-content">
    <div class="media">
      <div class="media-left">
        <figure class="image">
          <img
            src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg"
          />
        </figure>
      </div>
      <div class="media-content">
      </div>
    </div>
    <div class="content">
      <p>
        <time datetime="2021-04-08">2021-04-08</time>
      </p>
    </div>
    <footer class="card-footer">
      <a
        href="https://www.realpython.com"
        target="_blank"
      >
        <!-- ... -->
      </a>
    </footer>
  </div>
</div>
The HTML you'll encounter will sometimes be confusing. Luckily, the HTML of this job board
has descriptive class names on the elements that you're interested in, such as class="title",
class="company", and class="location".
$ python -m pip install beautifulsoup4
Then, import the library in your Python script and create a Beautiful Soup object:
import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

The page you're working with contains an HTML element with the ID ResultsContainer:

<div id="ResultsContainer">
  <!-- all the job listings -->
</div>
Beautiful Soup allows you to find that specific HTML element by its ID:
results = soup.find(id="ResultsContainer")
For easier viewing, you can prettify any Beautiful Soup object when you print it out. If you call
.prettify() on the results variable that you just assigned above, then you’ll see all the HTML
contained within the <div>:
print(results.prettify())
When you use the element’s ID, you can pick out one element from among the rest of the HTML.
Now you can work with only this specific part of the page’s HTML. It looks like the soup just got a
little thinner! However, it’s still quite dense.
Find Elements by HTML Class Name
You’ve seen that every job posting is wrapped in a <div> element with the class card-content.
Now you can work with your new object called results and select only the job postings in it.
These are, after all, the parts of the HTML that you’re interested in! You can do this in one line
of code:
job_elements = results.find_all("div", class_="card-content")
Here, you call .find_all() on a Beautiful Soup object, which returns an iterable containing all the
HTML for all the job listings displayed on that page.
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element)
    print(company_element)
    print(location_element)
    print()
Each job_element is another BeautifulSoup() object. Therefore, you can use the same methods
on it as you did on its parent element, results.
With this code snippet, you’re getting closer and closer to the data that you’re actually
interested in. Still, there’s a lot going on with all those HTML tags and attributes floating
around: