BeautifulSoup is a class in the bs4 module of Python. It was built to parse HTML and XML documents.
Installing bs4 (short for beautifulsoup)
It is easy to install BeautifulSoup using pip. The bs4 package on PyPI is a thin wrapper that pulls in beautifulsoup4, the actual library. Just run the command below in your command shell.
pip install bs4
Running the above command in your terminal produces output similar to the following:
C:\Users\rajesh>pip install bs4
Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Requirement already satisfied: beautifulsoup4 in c:\python\python361\lib\site-packages (from bs4) (4.6.0)
Building wheels for collected packages: bs4
  Building wheel for bs4 (setup.py) ... done
  Stored in directory: C:\Users\rajesh\AppData\Local\pip\Cache\wheels\a0\b0\b2\4f80b9456b87abedbc0bf2d52235414c3467d8889be38dd472
Successfully built bs4
Installing collected packages: bs4
Successfully installed bs4-0.0.1
To verify that BeautifulSoup is installed on your machine, start Python in the same terminal and import it:
C:\Users\rajesh>python
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from bs4 import BeautifulSoup
>>>
If the import completes without an error, the installation was successful.
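Beyond a bare import, a slightly stronger check is to print the installed version and parse a tiny document. The snippet below is a minimal sketch; the one-line HTML string is a made-up sample, not part of the original tutorial.

```python
import bs4
from bs4 import BeautifulSoup

# Report which version of the library Python actually picked up
print("bs4 version:", bs4.__version__)

# Parse a tiny, made-up document with the parser from the standard library
soup = BeautifulSoup("<p>hello</p>", "html.parser")
text = soup.p.get_text()
print(text)
```

If both lines print without a traceback, the library is importable and able to parse HTML.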
Example 1
Find all the links from an HTML document
Assume we have an HTML document and want to collect all the reference links in it. First, we store the document as a string, as shown below:
html_doc = '''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.rediff.com'></a>'''
Next, we create a soup object by passing the html_doc variable to the BeautifulSoup constructor, together with the name of the parser to use.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
Now that we have the soup object, we can call the methods of the BeautifulSoup class on it, for example to find every occurrence of a tag in html_doc and read the values of its attributes.
for tag in soup.find_all('a'):
    print(tag.get('href'))
The code above loops over every <a> tag in the html_doc string and prints the value of its href attribute.
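To make the idea of tag attributes more concrete, the short sketch below prints the full attribute dictionary of each tag next to the result of get('href'). The sample markup is hypothetical and chosen so that one tag has no href at all.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the tutorial's html_doc
sample = "<a href='www.python.org' title='Python'></a><a id='no-link'></a>"
soup = BeautifulSoup(sample, "html.parser")

tags = soup.find_all("a")
for tag in tags:
    # tag.attrs is a dict mapping attribute names to their values;
    # tag.get() returns None when the attribute is missing
    print(tag.attrs, tag.get("href"))
```

Using get() rather than tag['href'] avoids a KeyError for tags that lack the attribute.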
Below is our complete code to get all the links from the html_doc string.
from bs4 import BeautifulSoup

# The HTML document, stored as a string
html_doc = '''<a href='www.Tutorialspoint.com'></a>
<a href='www.nseindia.com.com'></a>
<a href='www.codesdope.com'></a>
<a href='www.google.com'></a>
<a href='www.facebook.com'></a>
<a href='www.wikipedia.org'></a>
<a href='www.twitter.com'></a>
<a href='www.microsoft.com'></a>
<a href='www.github.com'></a>
<a href='www.nytimes.com'></a>
<a href='www.youtube.com'></a>
<a href='www.reddit.com'></a>
<a href='www.python.org'></a>
<a href='www.stackoverflow.com'></a>
<a href='www.amazon.com'></a>
<a href='www.rediff.com'></a>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Print the href attribute of every <a> tag
for tag in soup.find_all('a'):
    print(tag.get('href'))
Result
www.Tutorialspoint.com
www.nseindia.com.com
www.codesdope.com
www.google.com
www.facebook.com
www.wikipedia.org
www.twitter.com
www.microsoft.com
www.github.com
www.nytimes.com
www.youtube.com
www.reddit.com
www.python.org
www.stackoverflow.com
www.amazon.com
www.rediff.com
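When the links are needed as data rather than printed, the same logic can be wrapped in a small function that returns a list. This is only a sketch: the function name extract_links and the sample string are hypothetical, and passing href=True to find_all() restricts the search to <a> tags that actually have an href attribute.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href value of every <a> tag that has one (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips <a> tags without an href attribute
    return [tag["href"] for tag in soup.find_all("a", href=True)]


# Made-up sample; the last tag has no href and is ignored
sample = "<a href='www.python.org'></a> <a href='www.github.com'></a> <a></a>"
links = extract_links(sample)
print(links)
```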
Example 2
Print all the links from a website that contain a specific string (for example: python)
The program below prints every URL found on a specific website whose link contains "python".
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Download the page and read the raw HTML
html = urlopen("https://www.python.org")
content = html.read()

# Name the parser explicitly so that bs4 does not have to guess one
soup = BeautifulSoup(content, 'html.parser')

# Look only at <a> tags that have an href attribute
for a in soup.find_all('a', href=True):
    if re.search('python', a['href']):
        print("Python URL:", a['href'])
Result
Python URL: https://docs.python.org
Python URL: https://pypi.python.org/
Python URL: https://www.facebook.com/pythonlang?fref=ts
Python URL: https://brochure.getpython.info/
Python URL: https://docs.python.org/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.python.org/
Python URL: https://docs.python.org/faq/
Python URL: https://wiki.python.org/moin/Languages
Python URL: https://python.org/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://www.python.org/psf/codeofconduct/
Python URL: https://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: //docs.python.org/3/tutorial/controlflow.html#defining-functions
Python URL: //docs.python.org/3/tutorial/introduction.html#lists
Python URL: https://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator
Python URL: //docs.python.org/3/tutorial/
Python URL: //docs.python.org/3/tutorial/controlflow.html
Python URL: /downloads/release/python-373/
Python URL: https://docs.python.org
Python URL: //jobs.python.org
Python URL: https://blog.python.org
Python URL: https://feedproxy.google.com/~r/PythonInsider/~3/Joo0vg55HKo/python-373-is-now-available.html
Python URL: https://feedproxy.google.com/~r/PythonInsider/~3/N5tvkDIQ47g/python-3410-is-now-available.html
Python URL: https://feedproxy.google.com/~r/PythonInsider/~3/n0mOibtx6_A/python-3.html
Python URL: /events/python-events/805/
Python URL: /events/python-events/817/
Python URL: /events/python-user-group/814/
Python URL: /events/python-events/789/
Python URL: /events/python-events/831/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: /success-stories/building-an-open-source-and-cross-platform-azure-cli-with-python/
Python URL: https://wiki.python.org/moin/TkInter
Python URL: https://www.wxpython.org/
Python URL: https://ipython.org
Python URL: #python-network
Python URL: https://brochure.getpython.info/
Python URL: https://docs.python.org/3/license.html
Python URL: https://wiki.python.org/moin/BeginnersGuide
Python URL: https://devguide.python.org/
Python URL: https://docs.python.org/faq/
Python URL: https://wiki.python.org/moin/Languages
Python URL: https://python.org/dev/peps/
Python URL: https://wiki.python.org/moin/PythonBooks
Python URL: https://wiki.python.org/moin/
Python URL: https://www.python.org/psf/codeofconduct/
Python URL: https://planetpython.org/
Python URL: /events/python-events
Python URL: /events/python-user-group/
Python URL: /events/python-events/past/
Python URL: /events/python-user-group/past/
Python URL: https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event
Python URL: https://devguide.python.org/
Python URL: https://bugs.python.org/
Python URL: https://mail.python.org/mailman/listinfo/python-dev
Python URL: #python-network
Python URL: https://github.com/python/pythondotorg/issues
Python URL: https://status.python.org/
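The result mixes absolute URLs, relative paths such as /events/python-events, protocol-relative links such as //jobs.python.org, and repeated entries. One possible way to tidy this up is sketched below, under stated assumptions: it runs on a small made-up HTML string instead of the live site, BASE_URL is an assumed base address, and the filtering is done by passing a compiled regular expression to find_all() instead of testing each href inside the loop. Relative links are resolved with urljoin from the standard library.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base address used to resolve relative links
BASE_URL = "https://www.python.org/"

# Made-up sample standing in for the downloaded page
sample = """
<a href="https://docs.python.org">Docs</a>
<a href="/events/python-events">Events</a>
<a href="//jobs.python.org">Jobs</a>
<a href="#python-network">Network</a>
<a href="/about/">About</a>
<a href="/events/python-events">Events again</a>
"""

soup = BeautifulSoup(sample, "html.parser")

seen = set()
absolute_links = []
# A compiled pattern as the href filter keeps only links matching "python"
for a in soup.find_all("a", href=re.compile("python")):
    url = urljoin(BASE_URL, a["href"])  # make the link absolute
    if url not in seen:                 # skip duplicates, keep first-seen order
        seen.add(url)
        absolute_links.append(url)

for url in absolute_links:
    print("Python URL:", url)
```

The /about/ link is dropped because it does not contain "python", and the repeated events link is printed only once.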