
MANIPAL UNIVERSITY JAIPUR (MUJ)

MASTER OF BUSINESS ADMINISTRATION

SEMESTER 4

DADS404
DATA SCRAPPING



Unit 1
Introduction to Data Scraping and Wrangling

Table of Contents

1. Introduction
   1.1 Learning Objectives
2. What is Data Scraping? (SAQ 1)
3. Tools for Data Scraping (SAQ 2)
4. Ethical Considerations in Web Scraping (SAQ 3)
5. Data Wrangling: An Overview (SAQ 4)
6. Summary
7. Glossary
8. Terminal Questions
9. Answer Keys
10. Suggested Books and E-References


1. INTRODUCTION

In today's digital world, data has become an invaluable resource for organizations and
individuals alike. Whether you're trying to make informed decisions, gain insights into
trends and patterns, or understand your customers and competitors, data can be the key to
unlocking valuable knowledge. As an aspiring professional or analyst, it's essential to
understand how to gather and manipulate data effectively. This unit will introduce you to
the fundamentals of data scraping and wrangling, equipping you with the skills and
knowledge needed to harness the power of data.

Data scraping is the process of extracting data from websites or other sources, while data
wrangling involves transforming raw data into a structured and usable format for analysis.
By learning these techniques, you'll be able to gather and process data from various sources,
enabling you to derive meaningful insights and make more informed decisions. This unit will
cover key concepts and tools related to data scraping and wrangling, as well as ethical
considerations to ensure responsible data collection practices.

1.1 Learning Objectives

After studying this unit, you will be able to:

❖ Understand the importance of data scraping and wrangling in the data analysis process


❖ Learn about various tools and methods for data scraping


❖ Evaluate the ethical considerations in web scraping
❖ Familiarize yourself with the key steps involved in data wrangling
❖ Examine the capabilities and uses of popular data wrangling tools and libraries in R and
Python

By the end of this unit, you will have a solid foundation in data scraping and wrangling,
preparing you to tackle more advanced techniques and applications in subsequent units. So,
let's dive in and start exploring the exciting world of data scraping and wrangling!


2. WHAT IS DATA SCRAPING?


Data scraping, also known as web scraping, is a technique used to extract information from
websites and convert it into a structured format, such as a spreadsheet, a database, or a JSON
file. This process enables businesses, researchers, and individuals to access vast amounts of
data available on the internet and analyze it for various purposes, such as market research,
sentiment analysis, competitor analysis, or data-driven decision-making.

Data scraping can be performed through two primary methods:

Manual scraping: This method involves manually copying and pasting data from websites
into a spreadsheet or another structured format. This approach is suitable for small-scale
scraping tasks or when dealing with websites that are difficult to scrape using automated
techniques. However, manual scraping can be time-consuming and prone to errors,
especially when handling large amounts of data.

Automated scraping: This method employs software, scripts, or tools to extract data from
websites automatically, without human intervention. Automated scraping is generally
faster and more efficient than manual scraping. It can handle large amounts of data and can
be scheduled to run at specific intervals to keep the extracted data up-to-date. However,
automated scraping may require technical skills to create, implement, and maintain the
scripts or tools used in the process.

The data scraping process generally involves the following steps:

1. Identifying the target website(s) and the specific data to be extracted.


2. Inspecting the website's structure, such as its HTML code, to determine how the data
is organized and how it can be accessed.
3. Writing a script or using a tool to navigate the website, locate the desired data, and
extract it.
4. Storing the extracted data in a structured format, such as a CSV file, an Excel
spreadsheet, or a database.
5. Performing data cleaning and wrangling tasks, if necessary, to ensure the quality and
usability of the scraped data.
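
To make these steps concrete, here is a minimal sketch in Python using the requests and Beautiful Soup libraries (introduced in the next section). The URL and the CSS selectors are hypothetical placeholders; a real scraper would use selectors discovered by inspecting the target page's HTML in step 2.

    import csv

    import requests
    from bs4 import BeautifulSoup

    # Steps 1-2: identify the target page (hypothetical URL) and fetch its HTML.
    url = "https://example.com/products"
    response = requests.get(url, timeout=10)
    response.raise_for_status()

    # Step 3: parse the HTML and locate the desired data.
    soup = BeautifulSoup(response.text, "html.parser")
    rows = []
    for item in soup.select(".product"):    # hypothetical CSS class
        name = item.select_one(".name")     # hypothetical CSS class
        price = item.select_one(".price")   # hypothetical CSS class
        if name and price:
            rows.append([name.get_text(strip=True), price.get_text(strip=True)])

    # Step 4: store the extracted data in a structured format (CSV).
    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "price"])
        writer.writerows(rows)

    # Step 5: data cleaning and wrangling would follow (see Section 5).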


Data scraping has numerous applications across various industries, including finance,
marketing, healthcare, sports, and more. However, it is essential to be aware of the ethical
and legal considerations associated with web scraping and ensure that the scraping activities
respect copyright, data privacy, and other relevant laws and guidelines.

SELF-ASSESSMENT QUESTIONS - 1
1) What is data scraping?
a) The process of cleaning data
b) The process of extracting information from websites
c) The process of creating visualizations from data
d) The process of storing data in a database

2) Which method of data scraping is faster and more efficient?


a) Manual scraping
b) Automated scraping
c) Both are equally efficient
d) Neither method is efficient


3. TOOLS FOR DATA SCRAPING

There is a wide range of tools and technologies available for data scraping, catering to
different levels of expertise and project requirements. Some popular tools and methods for
data scraping include:

[Figure: The five categories of data scraping tools are browser extensions, programming libraries, web scraping services, application programming interfaces (APIs), and custom-built web crawlers and scrapers.]

• Browser extensions: Browser extensions are simple, easy-to-use tools that can be added
to a web browser, such as Google Chrome or Mozilla Firefox, to extract data from
websites. These extensions often provide a graphical user interface for users with limited
programming skills, allowing them to perform basic data scraping tasks without writing
any code. Examples of popular browser extensions for data scraping include Web
Scraper and Data Miner.
• Programming libraries: For users with programming skills, libraries in various
programming languages can be used to create custom scripts for data scraping. These
libraries often provide more flexibility and control over the scraping process compared
to browser extensions, allowing users to handle more complex scraping tasks or deal
with websites that employ anti-scraping techniques. Some popular libraries for data
scraping include Beautiful Soup and Scrapy in Python, and rvest in R.
• Web scraping services: Web scraping services are online platforms that provide data
scraping as a service, often with a graphical user interface that enables users with limited
programming skills to create and run scraping projects. These services often provide
additional features, such as scheduling, data storage, or data cleaning and wrangling.
Examples of popular web scraping services include Import.io, ParseHub, and Octoparse.
• Application Programming Interfaces (APIs): Some websites offer APIs that allow users
to access their data in a structured format without scraping the HTML content of the site.
Using APIs is often more reliable and efficient than web scraping, as the data is provided
in a standardized format and the website owner explicitly permits data access. However,
using APIs generally requires programming skills and may be subject to usage limits or
fees. Examples of popular APIs for data access include the Twitter API, the Google Maps
API, and the OpenWeatherMap API (a minimal request sketch follows this list).
• Custom-built web crawlers and scrapers: In some cases, it may be necessary to build
custom web crawlers and scrapers from scratch to handle specific requirements or
challenges posed by the target website(s). Custom-built scrapers can be developed using
programming languages such as Python, Java, or C#, and often require a deep
understanding of web technologies, such as HTML, CSS, JavaScript, and HTTP.
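
As a brief illustration of the API route mentioned above, the following sketch requests JSON from a placeholder endpoint using Python's requests library. The URL, parameters, and response fields are assumptions, not any real service's documented API; real APIs such as the OpenWeatherMap API additionally require a registered API key.

    import requests

    # Hypothetical endpoint and parameters; consult the real API's
    # documentation for actual URLs, parameters, and authentication.
    url = "https://api.example.com/v1/weather"
    params = {"city": "Jaipur", "units": "metric"}

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors

    # The API returns structured JSON directly, so no HTML parsing is needed.
    data = response.json()
    print(data)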
When selecting a tool or method for data scraping, it is essential to consider factors such as
the scale and complexity of the scraping project, the technical skills of the user, the target
website's structure and anti-scraping mechanisms, and any legal or ethical considerations
associated with the data being scraped.


SELF-ASSESSMENT QUESTIONS - 2
3) Which tool can be used to create custom scripts for data scraping in Python?
a) rvest
b) Beautiful Soup
c) Web Scraper
d) Data Miner
4) What is an API?
a) A browser extension for data scraping
b) A programming library for data scraping
c) An online platform for data scraping
d) An interface for accessing structured data from a website


4. ETHICAL CONSIDERATIONS IN WEB SCRAPING


Web scraping can provide valuable insights and data, but it's crucial to consider the ethical
implications of this practice. Ethical web scraping involves respecting the rights, privacy, and
resources of the target website and its users. Some key ethical considerations when scraping
websites include:

[Figure: Key ethical considerations in web scraping are data privacy and consent, copyright and intellectual property, website terms of service, server load and performance, robots.txt, and transparency and disclosure.]

• Data privacy and consent: When extracting personal or sensitive information from
websites, it's essential to ensure that you have the necessary permissions and are
complying with data privacy regulations such as the General Data Protection Regulation
(GDPR). Extracting data without consent may lead to legal issues and potential harm to
individuals whose information is collected.
• Copyright and intellectual property: Web content is often protected by copyright and
intellectual property laws. Make sure you respect these rights and only scrape publicly
available data or data for which you have explicit permission to access. Scraping
copyrighted material without permission may result in legal consequences.
• Website terms of service: Many websites have terms of service that outline the allowed
and prohibited uses of their content. Before scraping a website, review its terms of
service to ensure that your intended use of the scraped data is in compliance with these
rules. Violating a website's terms of service may lead to legal disputes or being blocked
from accessing the site.
• Server load and performance: Web scraping can generate a significant amount of server
load on the target website, potentially slowing it down or even causing it to crash. Be
mindful of the frequency and volume of your scraping requests to avoid overloading the
server. Consider using techniques like throttling your requests, randomizing request
intervals, or accessing the site during off-peak hours to minimize the impact on the
website's performance.
• Robots.txt: Many websites have a robots.txt file that provides guidelines for web crawlers
and scrapers. Always review and respect the rules outlined in this file; ignoring its
directives may lead to being blocked from accessing the site or other negative
consequences (see the sketch after this list).
• Transparency and disclosure: If you are collecting data for research or other purposes
that might be shared publicly, consider disclosing the fact that the data was obtained
through web scraping. Transparency helps ensure the credibility of your work and
informs users about the methods used to collect the information.
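
Two of these points, the robots.txt check and request throttling, can be implemented directly in code. The sketch below uses Python's standard urllib.robotparser and time modules; the site URL, page paths, and user-agent string are hypothetical placeholders.

    import time
    from urllib.robotparser import RobotFileParser

    # Fetch and parse the site's robots.txt (hypothetical site).
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    urls = ["https://example.com/page1", "https://example.com/page2"]
    for url in urls:
        # Respect the robots.txt rules before fetching each page.
        if rp.can_fetch("MyScraperBot", url):  # hypothetical user agent
            print("Allowed:", url)
            # ... fetch and parse the page here ...
            time.sleep(2)  # throttle: pause between requests to limit server load
        else:
            print("Disallowed by robots.txt:", url)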
By adhering to these ethical guidelines, you can ensure that your web scraping activities are
conducted responsibly and minimize the potential risks associated with this practice. Always
be aware of the legal and ethical landscape in the relevant jurisdictions and stay informed
about changes in regulations and best practices in web scraping.


SELF-ASSESSMENT QUESTIONS - 3
5) Which file provides guidelines for web crawlers and scrapers on a website?
a) guidelines.txt
b) rules.txt
c) robots.txt
d) crawler.txt
6) What is one potential consequence of web scraping on the target website?
a) Increased server load
b) Reduced server load
c) Faster page loading times
d) Improved website design


5. DATA WRANGLING: AN OVERVIEW


Data wrangling, also known as data munging, is the process of transforming
raw data into a structured and usable format for analysis, visualization, or further
processing. Data wrangling is a critical step in the data analysis process, as it ensures the
quality, reliability, and accuracy of the data being analyzed. The data wrangling process
typically consists of several steps, including:

[Figure: The data wrangling process consists of data import, data cleaning, data transformation, data enrichment, and data export.]

1. Data import: The first step in the data wrangling process is loading data from various
sources, such as spreadsheets, databases, web APIs, or scraped web content, into a
programming environment for analysis. Depending on the data source, different tools
and libraries may be used to facilitate data import.
2. Data cleaning: Once the data has been imported, it is essential to identify and correct
errors, inconsistencies, and inaccuracies in the data. This step may involve tasks such
as removing duplicate records, filling in missing values, correcting data entry errors,
or converting data types to ensure consistency.
3. Data transformation: After cleaning the data, it may be necessary to convert it into a
suitable format or structure for analysis or visualization. This step can involve
reshaping the data, aggregating it, normalizing it, or performing other
transformations that make it more accessible and meaningful for the intended
purpose.
4. Data enrichment: In some cases, it may be beneficial to add new variables or features
to the dataset to enhance its analytical value. Data enrichment can involve calculating
new metrics, creating dummy variables, merging data from different sources, or
deriving new insights from the existing data.
5. Data export: Once the data has been cleaned, transformed, and enriched, it is usually
saved in a structured format, such as a CSV file, an Excel spreadsheet, or a database,
for further analysis, visualization, or reporting. Data export may also involve
converting the data into different formats to make it more accessible or compatible
with various tools and platforms.
Both R and Python offer powerful tools and libraries for data wrangling, which can simplify
and streamline the process:

• In R, the 'dplyr' package is a popular and versatile tool for data wrangling tasks. It
provides a set of functions for performing common data manipulation tasks, such as
filtering, sorting, aggregating, and joining data. The 'tidyr' package is another useful
R library for reshaping and cleaning data.
• In Python, the 'pandas' library is widely used for data wrangling tasks. It offers a
comprehensive set of data manipulation and transformation functions, including
tools for handling missing data, merging and reshaping datasets, and performing
aggregation operations. The 'numpy' library can also be used for various data
manipulation tasks, particularly when dealing with numerical data.
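
As a compact illustration of the five wrangling steps in pandas, here is a sketch in Python; the file names and column names (quantity, region, revenue) are hypothetical placeholders rather than a dataset from this unit.

    import pandas as pd

    # 1. Import: load raw data from a CSV file (hypothetical file name).
    df = pd.read_csv("sales_raw.csv")

    # 2. Clean: drop duplicates, fill missing values, fix a data type.
    df = df.drop_duplicates()
    df["quantity"] = df["quantity"].fillna(0).astype(int)

    # 3. Transform: aggregate revenue by region.
    summary = df.groupby("region", as_index=False)["revenue"].sum()

    # 4. Enrich: derive a new metric from the existing columns.
    summary["revenue_share"] = summary["revenue"] / summary["revenue"].sum()

    # 5. Export: save the result for further analysis or reporting.
    summary.to_csv("sales_by_region.csv", index=False)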
By mastering the techniques and tools involved in data wrangling, you can ensure that the
data you work with is reliable, accurate, and suitable for analysis, enabling you to derive more
meaningful insights and make more informed decisions.


SELF-ASSESSMENT QUESTIONS - 4
7) What is the primary goal of data wrangling?
a) To extract data from websites
b) To transform raw data into a structured format for analysis
c) To visualize data in charts and graphs
d) To store data in databases
8) Which Python library is commonly used for data wrangling?
a) NumPy
b) pandas
c) SciPy
d) matplotlib


6. SUMMARY
In this unit, we introduced the concepts of data scraping and wrangling, which are essential
skills for anyone working with data in various domains. Data scraping involves extracting
data from websites or other sources, while data wrangling focuses on transforming raw data
into a structured and usable format for analysis, visualization, or further processing.

We explored different tools and methods used for data scraping, including browser
extensions, programming libraries, web scraping services, APIs, and custom-built scrapers.
We also discussed the ethical considerations involved in web scraping, emphasizing the
importance of respecting data privacy, copyright, terms of service, server load, robots.txt, and
transparency.

Throughout the unit, we delved into the key steps of the data wrangling process, including
data import, cleaning, transformation, enrichment, and export. We examined the capabilities
and uses of popular data wrangling tools and libraries in R and Python, such as 'dplyr' and
'pandas', which are widely used in the data science community.

By understanding the fundamentals of data scraping and wrangling, you now have the
knowledge and skills to collect and manipulate data effectively, allowing you to derive
meaningful insights and make more informed decisions in your professional or personal
endeavours.


7. GLOSSARY
• Data: In today's world, data is essentially the lifeblood of decision making and
knowledge. It's all about collecting and manipulating this data in a way that brings value
to organizations and individuals.
• Data Scraping: This is a way of getting hold of data by extracting it from websites or
other sources. It's one of the key methods to gather data, especially in the digital era.
• Data Wrangling: Once you've scraped your data, you need to make it usable. That's
where data wrangling comes in. It involves cleaning up and transforming raw data into a
structured format that you can analyze.
• Web Scraping Services: These are platforms that simplify the data scraping process.
They often come with a user-friendly interface, making it easy to scrape data without
deep programming knowledge.
• APIs: Application Programming Interfaces, or APIs, are another way to access data
without needing to scrape a website's content. It's a method that many websites offer,
providing structured data ready for use.
• Custom-built web crawlers and scrapers: These are personalized tools developed to
meet specific data scraping needs. They are built from scratch and often require a deep
understanding of web technologies.
• Ethical considerations in web scraping: With the power to scrape data comes
responsibility. It's essential to respect the privacy, rights, and resources of the websites
and people whose data is being scraped.
• R and Python libraries for Data Wrangling: Both R and Python offer powerful tools for
data wrangling, helping to streamline the process. They offer functions for data
manipulation, handling missing data, merging and reshaping datasets, and more.
• Data Analysis: This is the final destination of your data journey. After gathering data
through scraping, and refining it through wrangling, you're now ready to analyze it. The
insights drawn here can be invaluable for making informed decisions and gaining deeper
understanding of various phenomena.


8. TERMINAL QUESTIONS
1. What is the primary goal of data scraping?
2. Describe the difference between manual scraping and automated scraping.
3. What are some common applications of data scraping across different industries?
4. List four popular tools or methods for data scraping and briefly describe their key
features.
5. What are the advantages and limitations of using APIs for data extraction compared to
web scraping?
6. Explain how data privacy regulations like GDPR can impact web scraping activities.
7. Why is it essential to review a website's terms of service before performing web scraping?
8. What are some techniques to minimize the impact of web scraping on the target website's
server load and performance?
9. Explain the purpose of a robots.txt file and its importance in web scraping.
10. Describe the key steps involved in the data wrangling process.
11. What are some common data cleaning tasks that might be performed during the data
wrangling process?
12. Explain the concept of data transformation and provide some examples of data
transformation tasks.
13. What is data enrichment, and why might it be beneficial to add new variables or features
to a dataset?
14. How does the 'dplyr' package in R facilitate data wrangling tasks?
15. Describe the capabilities of the 'pandas' library in Python for data wrangling.
16. Provide examples of how data scraping can be used in market research and competitor
analysis.
17. Discuss the potential legal consequences of scraping copyrighted material without
permission.
18. Explain the role of data wrangling in ensuring the quality, reliability, and accuracy of data
being analyzed.
19. What are some challenges that users may face when importing data from various sources,
and how can these challenges be addressed?
20. Describe the importance of transparency and disclosure when collecting data through
web scraping for research or public sharing purposes.


9. ANSWERS

SELF ASSESSMENT QUESTIONS


1. Answer: b) The process of extracting information from websites
2. Answer: b) Automated scraping
3. Answer: b) Beautiful Soup
4. Answer: d) An interface for accessing structured data from a website
5. Answer: c) robots.txt
6. Answer: a) Increased server load
7. Answer: b) To transform raw data into a structured format for analysis
8. Answer: b) pandas

TERMINAL QUESTIONS
1. Refer to Section 2
2. Refer to Section 2
3. Refer to Section 2
4. Refer to Section 3
5. Refer to Section 3
6. Refer to Section 4
7. Refer to Section 4
8. Refer to Section 4
9. Refer to Section 4
10. Refer to Section 5
11. Refer to Section 5
12. Refer to Section 5
13. Refer to Section 5
14. Refer to Section 5
15. Refer to Section 5
16. Refer to Section 2
17. Refer to Section 4
18. Refer to Section 5


19. Refer to Section 5
20. Refer to Section 4

10. SUGGESTED BOOKS AND E-REFERENCES

BOOKS:
• Mitchell, R. (2018). Web Scraping with Python: A Comprehensive Guide to Data
Collection Solutions. O'Reilly Media, Inc.
• Wickham, H. & Grolemund, G. (2016). R for Data Science: Import, Tidy, Transform,
Visualize, and Model Data. O'Reilly Media, Inc.
