Data Science UNIT 1 Final
• Data science is the deep study of massive amounts of data, which involves
extracting meaningful insights from raw, structured, and unstructured data.
• Online systems and payment portals capture more data in the fields of e-
commerce, medicine, finance, and every other aspect of human life. We
have text, audio, video, and image data available in vast quantities.
NEED FOR DATA SCIENCE:
DATA SCIENCE JOB ROLES
• Data Scientist
• Data Analyst
• Machine learning expert
• Data engineer
• Data Architect
• Data Administrator
• Business Analyst
• Business Intelligence Manager
DATA NATURE: QUANTITATIVE VS
QUALITATIVE
• Structured data is often referred to as quantitative data. It means
that such data commonly contains precise numbers or textual
elements that can be counted. The analysis methods are clear and
easy to apply. Among them are:
• classification or arranging stored items of data into similar classes
based on common features,
• regression or investigation of the relationships and dependencies
between variables, and
• data clustering or organizing the data points into specific groups
based on various attributes.
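A minimal sketch of the regression technique named above, fitting a least-squares line to an invented toy dataset in plain Python (the spend/sales numbers are illustrative only):

```python
# Simple least-squares linear regression on a toy structured dataset.
# Illustrates the "regression" method listed above; all data is made up.

def linear_regression(xs, ys):
    """Return slope and intercept of the least-squares fit y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Toy data: advertising spend vs. sales (perfectly linear for clarity).
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]          # follows y = 2x + 1
slope, intercept = linear_regression(spend, sales)
print(slope, intercept)           # -> 2.0 1.0
```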
STRUCTURED DATA USE CASE
EXAMPLES
• Online booking : Different hotel booking and ticket reservation services leverage the
advantages of the pre-defined data model as all booking data such as dates, prices,
destinations, etc. fit into a standard data structure with rows and columns.
• ATMs : Any ATM is a great example of how relational databases and structured data work.
All the actions a user can do follow a pre-defined model.
• Inventory control systems : There are lots of variants of inventory control systems
companies use, but they all rely on a highly organized environment of relational
databases.
• Banking and accounting : Different companies and banks must process and record
huge amounts of financial transactions. Consequently, they make use of traditional
database management systems to keep structured data in place.
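The booking example above can be sketched with Python's built-in sqlite3 module; the table schema and rows here are hypothetical, standing in for a real reservation database:

```python
import sqlite3

# Hypothetical booking table: structured data with a fixed schema of
# rows and columns, as in the reservation systems described above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings (id INTEGER PRIMARY KEY, "
    "destination TEXT, date TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO bookings (destination, date, price) VALUES (?, ?, ?)",
    [("Paris", "2024-06-01", 120.0),
     ("Tokyo", "2024-06-03", 210.0),
     ("Paris", "2024-06-05", 95.0)],
)
# A typical structured query: average price per destination.
rows = conn.execute(
    "SELECT destination, AVG(price) FROM bookings "
    "GROUP BY destination ORDER BY destination"
).fetchall()
print(rows)  # -> [('Paris', 107.5), ('Tokyo', 210.0)]
```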
UNSTRUCTURED DATA USE CASE
EXAMPLES
• Sound recognition. Call centers use speech recognition to identify
customers and collect information about their queries and emotions.
• Image recognition. Online retailers take advantage of image recognition
so that customers can shop from their phones by posting a photo of the
desired item.
• Text analytics. Manufacturers make use of advanced text analytics to
examine warranty claims from customers and dealers and elicit specific
items of important information for further clustering and processing.
• Chatbots. Using natural language processing (NLP) for text analysis,
chatbots help different companies boost customer satisfaction with their
services. Depending on the question input, customers are routed to the
corresponding representatives who can provide comprehensive answers.
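A toy sketch of the routing idea above; the keywords and department names are invented, and real chatbots use far richer NLP than simple keyword matching:

```python
# Toy keyword-based router, sketching how a chatbot might send a
# customer query to the right department. Keywords and departments
# are invented for illustration.

ROUTES = {
    "refund": "billing",
    "invoice": "billing",
    "password": "tech support",
    "login": "tech support",
    "delivery": "shipping",
}

def route_query(text):
    """Return the department whose keyword appears first, else 'general'."""
    lowered = text.lower()
    for keyword, department in ROUTES.items():
        if keyword in lowered:
            return department
    return "general"

print(route_query("I cannot login to my account"))   # -> tech support
print(route_query("Where is my delivery?"))          # -> shipping
```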
TYPES OF DATA
QUANTITATIVE OR NUMERICAL DATA
• Height of a person
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price
DATA SCIENCE COMPONENTS:
DATA SCIENCE LIFECYCLE
HISTORY OF DATA SCIENCE
• https://data-flair.training/blogs/data-science-in-banking/
DATA SCIENCE LIFE CYCLE
STEPS IN DATA SCIENCE
• Obtaining the Data: This stage involves using technical tools like MySQL to
process and gather the data. It can even be in simpler file formats such as
Microsoft Excel. Languages like Python and R can even import datasets directly
into a data science program.
• Scrubbing the Data: This stage involves cleaning raw data to retain only the relevant
part of the processed data. The noise is also scrubbed off, and the data is refined,
converted, and consolidated.
• Exploring the Data: This stage consists of examining the generated data. The data
and its properties are inspected since different data types demand specific
treatments. Descriptive statistics are then computed to extract the features and test
the significant variables.
• Modeling the Data: The dataset is refined further, and only the essential components
are kept. Only relevant values are kept and tested to predict accurate results.
• Interpreting the Data: At this stage, the final product is interpreted for the client or
business to analyze if it meets the requirement or answers a business question. The
insights are shared with everyone, and the results of the final stage are visualized.
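The five stages above can be compressed into a small end-to-end sketch on invented data (the readings and the outlier rule are illustrative only):

```python
from statistics import mean, stdev

# A compressed sketch of the lifecycle: obtain -> scrub -> explore ->
# model -> interpret. The raw readings are invented.

raw = ["23", "19", "", "31", "n/a", "27", "25"]            # obtained

# Scrub: drop blanks and non-numeric noise, convert types.
clean = [float(v) for v in raw if v.replace(".", "").isdigit()]

# Explore: descriptive statistics on the cleaned values.
stats = {"mean": mean(clean), "stdev": stdev(clean)}

# "Model": flag values more than one standard deviation above the mean.
threshold = stats["mean"] + stats["stdev"]
flagged = [v for v in clean if v > threshold]

# Interpret: report the summary and the flagged outliers.
print(stats["mean"], flagged)  # -> 25.0 [31.0]
```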
BIG DATA
• Big data is a collection of large datasets that cannot be processed
using traditional computing techniques. It is not a single technique
or a tool, rather it has become a complete subject, which involves
various tools, techniques and frameworks.
• What Comes Under Big Data?
• Big data involves the data produced by different devices and
applications. Given below are some of the fields that come under
the umbrella of Big Data.
• Black Box Data − It is a component of helicopters, airplanes, jets, etc.
It captures the voices of the flight crew, recordings of microphones and
earphones, and the performance information of the aircraft.
BIG DATA
• Social Media Data − Social media such as Facebook and Twitter hold
information and the views posted by millions of people across the globe.
• Stock Exchange Data − The stock exchange data holds information
about the ‘buy’ and ‘sell’ decisions made by customers on the shares of
different companies.
• Power Grid Data − The power grid data holds information about the power
consumed by a particular node with respect to a base station.
• Transport Data − Transport data includes model, capacity, distance and
availability of a vehicle.
• Search Engine Data − Search engines retrieve lots of data from
different databases.
BIG DATA EXAMPLES
TRAITS(CHARACTERISTICS) OF BIG DATA
• Big Data contains a large amount of data that cannot be
processed by traditional data storage or processing units.
• The data flow would exceed 150 exabytes per day before
replication.
THE CHARACTERISTICS OF BIG DATA
• Big Data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was collected only from databases and
spreadsheets, but these days data comes in a variety of forms: PDFs, emails,
audio, social media posts, photos, videos, etc.
• The data is categorized as below:
• Structured data: Structured data follows a defined schema with all the required
columns. It is in a tabular form and is stored in relational database management systems.
• Semi-structured: In Semi-structured, the schema is not appropriately defined,
e.g., JSON, XML, CSV, TSV, and email. OLTP (Online Transaction Processing)
systems are built to work with semi-structured data. It is stored in relations, i.e., tables.
• Unstructured Data: All the unstructured files, log files, audio files,
and image files are included in the unstructured data. Some organizations have much
data available but do not know how to derive value from it, since the data is
raw.
• Quasi-structured Data: Textual data with inconsistent formats that can be
formatted with some effort, time, and tools.
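The structured vs. semi-structured distinction above can be sketched with Python's standard json module: a JSON document has keys and nesting but no rigid tabular schema, so records may differ in fields (the records here are invented):

```python
import json

# Semi-structured data: JSON has structure (keys, nesting) but no fixed
# schema; note the second record has an extra "phone" field.
doc = '''
[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "email": "ravi@example.com", "phone": "555-0101"}
]
'''
records = json.loads(doc)

# Flatten into a consistent tabular (structured) form, filling gaps.
table = [(r["name"], r["email"], r.get("phone", "")) for r in records]
print(table)
```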
VARIETY
VERACITY
WHAT IS WEB SCRAPING USED FOR?
• Web Scraping has multiple applications across various industries. Let’s check out
some of these now!
1. Price Monitoring
• Web Scraping can be used by companies to scrape the product data for their products
and competing products as well to see how it impacts their pricing strategies.
Companies can use this data to fix the optimal pricing for their products so that they
can obtain maximum revenue.
2. Market Research
• Web scraping can be used for market research by companies. High-quality web
scraped data obtained in large volumes can be very helpful for companies in
analyzing consumer trends and understanding which direction the company should
move in the future.
WHAT IS WEB SCRAPING USED FOR?
3. News Monitoring
• Web scraping news sites can provide detailed reports on the current news to a
company. This is even more essential for companies that are frequently in the
news or that depend on daily news for their day-to-day functioning. After all, news
reports can make or break a company in a single day!
4. Sentiment Analysis
• If companies want to understand the general sentiment for their products among
their consumers, then Sentiment Analysis is a must. Companies can use web
scraping to collect data from social media websites such as Facebook and Twitter
as to what the general sentiment about their products is. This will help them in
creating products that people desire and moving ahead of their competition.
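A toy lexicon-based sentiment scorer sketching the idea above; the word lists are invented, and production sentiment analysis uses trained models rather than fixed word lists:

```python
# Toy lexicon-based sentiment scoring for scraped social-media text.
# Word lists are invented for illustration.

POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "terrible", "hate", "broken"}

def sentiment(text):
    """Label text positive/negative/neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this product, it is great"))  # -> positive
print(sentiment("terrible battery, hate it"))         # -> negative
```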
5. Email Marketing
• Companies can also use web scraping for email marketing. They can collect email
IDs from various sites using web scraping and then send bulk promotional and
marketing emails to all the people owning these email IDs.
WEB SCRAPING
HOW DOES WEB SCRAPING WORK?
• These are the steps to perform web scraping. Let's understand how
web scraping works.
Step -1: Find the URL that you want to scrape
• First, you should understand the data requirements of your project. A
webpage or website contains a large amount of information, so you should scrape
only the relevant information. In simple words, the developer should be familiar
with the data requirements.
Step - 2: Inspecting the Page
• The data is extracted in raw HTML format, which must be carefully parsed to reduce
the noise in the raw data. In some cases, data can be as simple as a name and address
or as complex as high-dimensional weather and stock market data.
Step - 3: Write the code
• Write a code to extract the information, provide relevant information, and run the
code.
Step - 4: Store the data in the file
• Finally, store the extracted data in a required format such as CSV, JSON, or a
database.
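Steps 2 and 3 can be sketched with Python's standard-library HTMLParser; the sample markup below is a stand-in for a real fetched page (fetching itself would use a library such as urllib or requests):

```python
from html.parser import HTMLParser

# Sample markup standing in for a real fetched page; the class names
# and products are invented.
SAMPLE_HTML = """
<ul>
  <li class="product">Laptop</li>
  <li class="product">Phone</li>
  <li class="ad">Sponsored link</li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collect the text of every <li class="product"> element."""
    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "product") in attrs:
            self.in_product = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_product = False

    def handle_data(self, data):
        if self.in_product and data.strip():
            self.products.append(data.strip())

parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.products)  # -> ['Laptop', 'Phone']
```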
TECHNIQUES OF WEB SCRAPING
• Techniques of Web Scraping: There are two ways of extracting data from
websites, the Manual extraction technique, and the automated extraction technique.
• Manual Extraction Techniques: Manually copy-pasting the site content comes
under this technique. Though tedious, time-consuming, and repetitive, it is an effective
way to scrape data from sites that have strong anti-scraping measures such as bot detection.
• Automated Extraction Techniques: Web scraping software is used to
automatically extract data from sites based on user requirement.
• HTML Parsing: Parsing means making something understandable by analyzing it part by
part; that is, converting information from one form into another that is easier to work
with. HTML parsing means taking in the code and extracting relevant information from it
based on the user's requirements. Mainly executed using JavaScript, the targets, as the
name suggests, are HTML pages.
• DOM Parsing: The Document Object Model is the official recommendation of the World Wide Web
Consortium. It defines an interface that enables a user to modify and update the style, structure,
and content of the XML document.
• Web Scraping Software: Nowadays, many web scraping tools are available, or can be
custom-built to a user's needs, to extract the desired information from millions of websites.
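A DOM-style parsing sketch using Python's standard xml.etree.ElementTree, showing the read-and-modify interface described above; the XML catalog is invented:

```python
import xml.etree.ElementTree as ET

# Invented XML document to demonstrate DOM-style access and updates.
xml_doc = """
<catalog>
  <book id="b1"><title>Data Science 101</title><price>30</price></book>
  <book id="b2"><title>Web Scraping</title><price>25</price></book>
</catalog>
"""
root = ET.fromstring(xml_doc)

# Read content by navigating the tree structure.
titles = [b.find("title").text for b in root.findall("book")]
print(titles)  # -> ['Data Science 101', 'Web Scraping']

# Update the tree in place: raise every price by 5.
for price in root.iter("price"):
    price.text = str(int(price.text) + 5)
print([p.text for p in root.iter("price")])  # -> ['35', '30']
```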
DIFFERENT TYPES OF WEB SCRAPERS
1. DESCRIPTIVE ANALYSIS
• Descriptive Analysis looks at data and analyzes past events for insight as to how
to approach future events. It looks at the past performance and understands the
performance by mining historical data to understand the cause of success or
failure in the past. Almost all management reporting such as sales, marketing,
operations, and finance uses this type of analysis.
• Example: Let’s take the example of DMart: we can look at a product’s history
and find out which products have sold more or which products are in large
demand by looking at product sales trends, and based on this analysis we
can decide to stock that item in large quantities
for the coming year.
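The retail example above can be sketched as a frequency count over an invented transaction log, the simplest form of mining historical sales data:

```python
from collections import Counter

# Invented transaction log: one entry per item sold. Counting the
# entries gives the best-selling products, as in the DMart example.
sales_log = ["rice", "oil", "rice", "soap", "rice", "oil", "tea"]

counts = Counter(sales_log)
top_two = counts.most_common(2)
print(top_two)  # -> [('rice', 3), ('oil', 2)]
```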
2. DIAGNOSTIC ANALYSIS
• Keep it Succinct: Organize data in a way that makes it easy for different
audiences to skim through it to find the information most relevant to them.
• Make it Visual: Use data visualization techniques, such as tables and charts, to
communicate the message clearly.
• Include an Executive Summary: This allows someone to analyze your findings
upfront and harness your most important points to influence their decisions.
DATA ANALYSIS TOOLS
1. SAS
• SAS is a statistical software suite developed by the SAS
Institute for performing advanced analytics, multivariate
analysis, business intelligence, data management, and
predictive analytics.
2. Microsoft Excel
• It is an important spreadsheet application that can be useful
for recording expenses, charting data, performing easy
manipulation and lookups, and generating pivot tables to
provide summarized reports of large datasets
that contain significant findings.
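A pivot-table-style grouping like the Excel feature described above can be sketched in plain Python; the expense rows are invented:

```python
from collections import defaultdict

# Invented expense rows: (category, amount). Grouping and summing by
# category mirrors what a pivot table produces in a spreadsheet.
expenses = [
    ("travel", 120.0),
    ("food", 45.5),
    ("travel", 80.0),
    ("office", 200.0),
    ("food", 30.0),
]

totals = defaultdict(float)
for category, amount in expenses:
    totals[category] += amount

print(dict(totals))  # -> {'travel': 200.0, 'food': 75.5, 'office': 200.0}
```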
DATA ANALYSIS TOOLS
3. R
• It is one of the leading programming languages for performing
complex statistical computations and graphics. It is a free and open-
source language that can be run on various UNIX platforms,
Windows, and macOS. It also has a command-line interface that is
easy to use.
4. Python
• It is a powerful high-level programming language that is used for
general-purpose programming. Python supports both structured and
functional programming methods.
DATA ANALYSIS TOOLS
5. Tableau Public
• Tableau Public is free software developed by the public company
“Tableau Software” that allows users to connect to any spreadsheet or
file and create interactive data visualizations.
6. RapidMiner
• RapidMiner is an extremely versatile data science platform developed
by “RapidMiner Inc”. The software emphasizes lightning-fast data
science capabilities and provides an integrated environment for the
preparation of data and application of machine learning, deep
learning, text mining, and predictive analytical techniques.
ANALYSIS VS REPORTING
TYPES OF DATA ANALYSIS METHODS