
Chapter-1

Introduction to Data Analytics


Prepared by: Assistant Professor Manthan Rankaja
Definition of Data Analytics
• Data Analytics involves the use of specialized systems and software to
analyze data and draw insights from it.
• In the era of big data, analytics help organizations make informed
decisions, predict trends, and understand customer behavior.
Applications of Data Analytics
• Various industries where data analytics is applied: Healthcare
(predicting disease outbreaks), Finance (fraud detection), Retail
(customer segmentation), and many more.
• Real-world examples of data analytics: Netflix’s recommendation
system, credit card fraud detection, etc.
Types of Data Analytics
• Descriptive: Analyzes historical data to understand what has
happened.
• Diagnostic: Digs deeper into data to understand the root cause of the
outcome.
• Predictive: Uses statistical models and forecasting techniques to
understand what is likely to happen in the future.
• Prescriptive: Uses optimization and simulation algorithms to advise
on possible outcomes.
Descriptive Analytics
• Definition: Descriptive Analytics deals with the analysis of historical
data to understand changes that have occurred in a business.
• Use cases: Sales trend analysis, Social media trend analysis.
• Examples: Monthly revenue report, Social media post reach analysis.
Diagnostic Analytics
• Definition: Diagnostic Analytics is a form of advanced analytics that
examines data to answer the question “Why did it happen?”.
• Use cases: Sales decline analysis, Customer churn analysis.
• Examples: Analyzing customer feedback to understand a drop in
product sales, Studying customer behavior data to understand churn.
Predictive Analytics
• Definition: Predictive Analytics uses statistical techniques and
machine learning algorithms to forecast future outcomes.
• Use cases: Customer lifetime value prediction, Predictive
maintenance.
• Examples: Using past purchase history to predict a customer’s future
purchase, Predicting machine failure using sensor data.
Prescriptive Analytics
• Definition: Prescriptive Analytics goes beyond predicting future
outcomes by also suggesting actions to benefit from the predictions.
• Use cases: Supply chain optimization, Personalized marketing.
• Examples: Optimizing delivery routes in real-time to save costs,
Personalizing marketing messages based on customer behavior
prediction.
Types of Data
• Structured data: Data that is organized and formatted so it’s easily
readable.
• For example, a database of customer information where data is
organized in rows and columns.
• Unstructured data: Data that doesn’t follow a specified format. For
example, emails, social media posts, etc.
• Semi-structured data: A mix of structured and unstructured data. For
example, a document which contains metadata.
Structured Data
• Definition: Structured data is highly organized and formatted in a way
so it’s easily searchable in relational databases.
• Examples:
Customer databases, Excel spreadsheets, etc.
• Advantages:
Easy to enter, store, query, and analyze.
• Disadvantages:
Requires a lot of time and resources to maintain.
Not suitable for complex, interconnected data.
Unstructured Data
• Definition: Unstructured data is not organized in a pre-defined
manner or does not have a pre-defined data model. It is difficult to
process and analyze.
• Examples: Word documents, PDFs, emails, audio files, etc.
• Advantages: Can capture nuanced information. More flexible as it
does not require a predefined schema.
• Disadvantages: Difficult to analyze and process. Requires more
storage space.
Semi-Structured Data
• Definition: Semi-structured data does not reside in a rigid schema
like a relational table, but contains tags or markers that separate
its elements, falling between structured and unstructured data.
• Examples: XML files, JSON files, etc.
• Advantages: More flexible than structured data, while still being
easier to analyze than unstructured data.
• Disadvantages: Can be more complex to work with and manage
compared to structured data.
• XML: eXtensible Markup Language
<person>
  <name>John Doe</name>
  <email>[email protected]</email>
  <age>30</age>
</person>
JSON: JavaScript Object Notation

{
  "person": {
    "name": "John Doe",
    "email": "[email protected]",
    "age": 30
  }
}
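Both formats can be read directly with Python's standard library. The sketch below parses a record like the slide's `<person>` example with `xml.etree.ElementTree` and `json`; the email address is an invented placeholder, since the slide's address is redacted.

```python
import json
import xml.etree.ElementTree as ET

xml_text = """
<person>
  <name>John Doe</name>
  <email>jdoe@example.com</email>
  <age>30</age>
</person>
"""

json_text = '{"person": {"name": "John Doe", "email": "jdoe@example.com", "age": 30}}'

# XML: each child element becomes a (tag, text) pair; values stay strings.
root = ET.fromstring(xml_text)
person_from_xml = {child.tag: child.text for child in root}

# JSON: the nested object maps directly onto Python dicts, keeping types.
person_from_json = json.loads(json_text)["person"]

print(person_from_xml["age"])   # "30" — a string in XML
print(person_from_json["age"])  # 30 — an integer in JSON
```

Note the difference in the `age` field: XML carries everything as text, while JSON preserves the numeric type.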
Data Sources
• Explanation:
• Data sources are the locations, files, databases, or services where
data comes from.
• Understanding data sources is important as the quality and reliability
of the data can greatly impact the results of data analysis.
Databases
• Explanation: Databases are structured sets of data. They are a
common source of data for analytics.
• Discussion: There are different types of databases,
• such as SQL (relational databases) and
• NoSQL (non-relational databases like MongoDB).
• Examples: Customer information in a SQL database, product
information in a NoSQL database.
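A minimal sketch of querying a relational (SQL) database from Python, using the standard-library `sqlite3` module with a throwaway in-memory database; the `customers` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Alice", "Pune"), ("Bob", "Mumbai"), ("Carol", "Pune")],
)

# A typical analytics query: count customers per city.
rows = conn.execute(
    "SELECT city, COUNT(*) FROM customers GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('Mumbai', 1), ('Pune', 2)]
conn.close()
```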
Web Data
• Explanation: Web data refers to data that is obtained from the
internet. This can include data scraped from websites, data from
social media platforms, etc.
• Discussion: Different types of web data include text data, user
behaviour data, transactional data, etc.
• Examples: Tweets scraped from Twitter for sentiment analysis,
product reviews scraped from e-commerce websites.
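As a self-contained illustration of extracting web data, the sketch below pulls review text out of an HTML fragment with the standard-library `html.parser`; the HTML snippet and the `review` class name are invented, and a real scraper would first download the page (e.g. with `urllib`) and must respect the site's terms of use.

```python
from html.parser import HTMLParser

html_page = """
<div class="review">Great product!</div>
<div class="ad">Buy now</div>
<div class="review">Battery life is poor.</div>
"""

class ReviewExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # Flag <div class="review"> elements so their text is captured.
        if tag == "div" and ("class", "review") in attrs:
            self.in_review = True

    def handle_data(self, data):
        if self.in_review and data.strip():
            self.reviews.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_review = False

parser = ReviewExtractor()
parser.feed(html_page)
print(parser.reviews)  # ['Great product!', 'Battery life is poor.']
```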
Sensor Data
• Explanation: Sensor data is data that is collected by sensors, which
can be anything from temperature sensors to motion sensors.
• Discussion: Different types of sensor data include time series data,
spatial data, etc.
• This data is often used in IoT (Internet of Things) applications.
• Examples: Temperature data from a weather station, accelerometer
data from a smartphone
Data Collection Types
• Primary data collection involves gathering new data directly from the
source,
• while secondary data collection involves using data that already
exists, such as data from existing databases or data collected by
others.
Data Collection Methods
• Explanation: Data collection methods refer to how we obtain data.
• Common methods include surveys, where we ask people for
information;
• experiments, where we observe outcomes under controlled
conditions;
• and observations, where we collect data about real-world behavior.
Data Preprocessing
• Definition: Data preprocessing is the process of cleaning and
transforming raw data into an understandable format.
• It’s a crucial step before data analysis or data modeling.
• Overview:
• Preprocessing involves data cleaning (removing noise and
inconsistencies),
• data transformation (normalizing data),
• and data integration (combining data from various sources).
Data Cleaning
• Definition: Data cleaning involves handling missing values, removing
duplicates, and treating outliers.
• It ensures the quality of the data and improves the accuracy of the
insights derived from it.
• Discussion: Techniques include imputation for handling missing
values, deduplication for removing duplicate data, and outlier
detection methods for identifying and handling anomalies in the data.
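The three techniques just listed can be sketched with pandas on a toy table; the data, the median-based imputation, and the simple spend threshold used as an outlier rule are all invented for illustration — real pipelines choose imputation and outlier methods to suit the data.

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D"],
    "spend":    [100.0, 250.0, 250.0, None, 5000.0],
})

# 1. Deduplication: drop exact duplicate rows (the repeated "B" row).
df = df.drop_duplicates()

# 2. Imputation: fill the missing spend with the median,
#    which is robust to the 5000.0 outlier.
df["spend"] = df["spend"].fillna(df["spend"].median())

# 3. Outlier handling: here, a simple absolute threshold
#    (IQR fences or z-scores are common alternatives).
clean = df[df["spend"] <= 1000]
print(clean)
```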
Data Transformation
• Definition: Data transformation involves changing the format,
structure, or values of data to prepare it for analysis.
• It can involve
• normalization (scaling data to a small, specified range),
• standardization (shifting the distribution of each attribute to have a
mean of zero and a standard deviation of one),
• binning (converting numerical variables into categorical
counterparts).
• Discussion: These techniques help in reducing the complexity of data
and making data compatible for analysis.
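Of the techniques above, binning is the simplest to illustrate directly; the sketch below converts a numerical age column into categorical groups, with bin edges and labels invented for illustration.

```python
def age_bin(age):
    # Map a numerical age onto an (invented) categorical band.
    if age < 18:
        return "minor"
    elif age < 65:
        return "adult"
    return "senior"

ages = [12, 25, 40, 70]
print([age_bin(a) for a in ages])  # ['minor', 'adult', 'adult', 'senior']
```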
Normalization

• Normalization involves scaling data to fit within a small, specified
range, typically between 0 and 1. This is useful when you want to
ensure that all features contribute equally to the analysis. The
formula for min-max normalization is:

• x' = (x - min) / (max - min)

• Example: [ 10, 20, 30, 40, 50 ] → [ 0, 0.25, 0.5, 0.75, 1 ]
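A minimal Python version of the min-max formula, reproducing the example above:

```python
def min_max_normalize(values):
    # x' = (x - min) / (max - min), scaling every value into [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_normalize([10, 20, 30, 40, 50]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```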


Standardization
• Standardization transforms data to have a mean of zero and a
standard deviation of one. This is useful when you want to compare
data that have different units or scales. The formula for
standardization (the z-score) is:

• z = (x - μ) / σ, where μ is the mean and σ is the standard deviation

• Example: [ 10, 20, 30, 40, 50 ] → [ -1.41, -0.71, 0, 0.71, 1.41 ]
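A minimal Python version of the z-score formula, using the population standard deviation (which is what the example above uses):

```python
import math

def standardize(values):
    # z = (x - mean) / std, with the population standard deviation.
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

print([round(z, 2) for z in standardize([10, 20, 30, 40, 50])])
# [-1.41, -0.71, 0.0, 0.71, 1.41]
```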


Data Integration
• Definition: Data integration involves combining data from different
sources and providing users with a unified view of the data.
• Discussion: This process becomes significant in a variety of situations,
which include both
• commercial (when two similar companies need to merge their
databases) and
• scientific (combining research findings from different bioinformatics
repositories, for example) applications.
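As a sketch of the unified-view idea, the example below joins two toy tables with a pandas merge; the table names (a CRM list and billing totals, both keyed on `customer_id`) and their contents are invented for illustration.

```python
import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 3],
                    "name": ["Alice", "Bob", "Carol"]})
billing = pd.DataFrame({"customer_id": [1, 2],
                        "total_spend": [120.0, 80.0]})

# Left join keeps every CRM customer, even those without billing records
# (Carol gets NaN for total_spend in the unified view).
unified = crm.merge(billing, on="customer_id", how="left")
print(unified)
```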
Data Analytics Tools
• Data analytics tools are software applications used to process and
analyze data. They help data analysts manage and interpret data from
various sources
• We will be discussing the features and use cases of popular data
analytics tools like R, Python, and SAS.
SAS
• Introduction to SAS:
SAS (Statistical Analysis System) is a software suite developed by SAS
Institute for advanced analytics, business intelligence, data
management, and predictive analytics.
• Key features and use cases of SAS in data analytics:
SAS provides a graphical point-and-click user interface for non-technical
users and more advanced options through the SAS language.
It is widely used in the corporate world.
R
• Introduction to R:
• R is a programming language and free software environment for
statistical computing and graphics.
It is widely used among statisticians and data miners for developing
statistical software and data analysis.
• Key features and use cases of R in data analytics:
R provides a wide variety of statistical and graphical techniques and is
highly extensible.
It is used in fields like healthcare, finance, academia, etc.
Python
• Python is a high-level, interpreted programming language. It is known
for its simplicity and readability, making it a popular choice for
beginners and experts in data analytics.
• Python has powerful libraries for data manipulation and analysis like
pandas, NumPy, and SciPy.
• It is used in various domains like web development, machine learning,
AI, and more.
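As a quick taste of those libraries, the snippet below builds a small invented sales table with pandas and computes an aggregate per region; pandas runs on top of NumPy arrays under the hood.

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "West", "East", "West"],
    "revenue": [100, 200, 150, 60],
})

# Average revenue per region via a groupby aggregation.
avg = sales.groupby("region")["revenue"].mean()
print(avg)  # East: 125.0, West: 130.0
```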
Data Analytics Technologies
• Data analytics technologies refer to the frameworks and systems used
to process and analyze large datasets. They are designed to handle
big data and are essential for advanced analytics.
• Discussion on various technologies such as Hadoop, Spark, etc.: We
will be discussing the features and use cases of popular data analytics
technologies like Hadoop and Spark.
Hadoop
• Hadoop is an open-source software framework for storing data and
running applications on clusters of commodity hardware.
• It provides massive storage for any kind of data, enormous processing
power, and the ability to handle virtually limitless concurrent tasks or
jobs.
• Key features and use cases of Hadoop in data analytics: Hadoop is
known for its scalability, cost-effectiveness, flexibility, and fault
tolerance.
• It is used in various industries like finance, healthcare, media, etc.
Spark
• Introduction to Spark: Spark is an open-source, distributed computing
system used for big data processing and analytics.
• It provides an interface for programming entire clusters with implicit
data parallelism and fault tolerance.
• Key features and use cases of Spark in data analytics: Spark is known
for its speed, ease of use, and versatility.
• It can be used for various tasks like batch processing, real-time data
streaming, machine learning, etc.
