Exploratory Data Analysis With Python
UNIT – I
Introduction to Data Science: Introduction to Data Science - Data Science
Stages - Data Science Ecosystem - Tools used in Data Science - Data Science
Workflow - Automated methods for Data Collection - Overview of Data -
Sources of Data - Big Data - Data Categorization.
USES:
Data Science is used in almost every major industry. Here are some
examples:
• Predicting customer preferences for personalized recommendations.
• Detecting fraud in financial transactions.
• Forecasting sales and market trends.
• Enhancing healthcare with predictive diagnostics and personalized
treatments.
• Identifying risks and opportunities in investments.
STAGES OF DATA SCIENCE (OR) DATA SCIENCE WORKFLOW
Data science follows a structured process that helps derive insights
from data:
• Problem Identification
• Data Collection
• Data Cleaning and Preprocessing
• Exploratory Data Analysis (EDA)
• Modeling
• Evaluation
• Deployment and Communication
Problem Identification:
The first step in the data science project life cycle is to identify the problem that
needs to be solved. This involves understanding the business requirements and
the goals of the project. Once the problem has been identified, the data science
team will plan the project by determining the data sources, the data collection
process, and the analytical methods that will be used.
Data Collection:
The second step in the data science project life cycle is data collection. This
involves collecting the data that will be used in the analysis. The data science
team must ensure that the data is accurate, complete, and relevant to the problem
being solved.
Modeling:
The fifth step in the data science project life cycle is model building. This
involves building a predictive model that can be used to make predictions based
on the data analysis. The data science team will use the insights and patterns from
the data analysis to build a model that can predict future outcomes.
Evaluation:
The sixth step in the data science project life cycle is model evaluation. This
involves evaluating the performance of the predictive model to ensure that it is
accurate and reliable. The data science team will test the model using a validation
dataset to determine its accuracy and performance.
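As a minimal illustration of the modeling and evaluation stages, the sketch below trains a simple classifier and scores it on a held-out validation set. The iris dataset, logistic regression model, and 80/20 split are illustrative assumptions, not a prescribed recipe.

# A minimal sketch of the modeling and evaluation stages with scikit-learn.
# The dataset, model, and split ratio are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as a validation set for evaluation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Modeling: fit a predictive model on the training data.
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluation: measure accuracy on data the model has not seen.
print("Validation accuracy:", accuracy_score(y_val, model.predict(X_val)))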
4. Data Visualization
Visualization helps make data insights accessible and actionable.
Business Intelligence Tools:
• Tableau: A popular tool for creating data visualizations, dashboards, and
interactive graphs.
• Power BI: A Microsoft tool widely used for data analysis and visualization.
• Looker: Another powerful tool used for data analysis and business
intelligence.
Python Libraries:
• Matplotlib: A key library for creating basic graphs and charts for data
visualization.
• Seaborn: Built on top of Matplotlib, it helps create more attractive and
informative graphs.
• Plotly: A powerful library for creating interactive visualizations.
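As a small illustration, the sketch below draws a basic line chart with Matplotlib, styled by Seaborn. The monthly sales figures are invented sample data.

# A minimal sketch of basic plotting with Matplotlib and Seaborn.
# The sales figures below are invented sample data.
import matplotlib.pyplot as plt
import seaborn as sns

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 135, 128, 150]

sns.set_theme()                      # apply Seaborn's default styling
plt.plot(months, sales, marker="o")  # basic line chart with Matplotlib
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Units Sold")
plt.show()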
Cloud Platforms
Cloud services offer scalability, flexibility, and advanced analytics tools.
1. Amazon Web Services (AWS):
o S3: A cloud storage service used for data storage.
o SageMaker: An AWS service used to create, train, and deploy
machine learning and artificial intelligence models.
2. Google Cloud Platform (GCP):
o BigQuery: A Google Cloud service used for analyzing large
datasets.
3. Microsoft Azure:
o Azure ML: A service on Azure used to manage machine
learning workflows, train models, and deploy them.
These tools are chosen based on the specific requirements of a project, the size
of the data, and the complexity of the tasks involved.
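For example, a Python script can store a dataset in AWS S3 using the boto3 library, as in the sketch below. The bucket name and file paths are placeholders, and valid AWS credentials must be configured for it to run.

# A minimal sketch of storing a dataset in AWS S3 with the boto3 library.
# The bucket name and file names are placeholders; AWS credentials must
# already be configured for this to run.
import boto3

s3 = boto3.client("s3")

# Upload a local CSV file to the (hypothetical) bucket "my-data-bucket".
s3.upload_file("sales.csv", "my-data-bucket", "raw/sales.csv")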
1. Web Scraping
One common method is web scraping: extracting data from websites using
scripts or dedicated tools, often applied to tasks such as monitoring
e-commerce prices or gathering reviews. A minimal sketch appears after the
list below.
• Tools: Beautiful Soup, Scrapy, Selenium.
• Applications:
o Collecting product prices from e-commerce sites.
o Gathering reviews or comments from social media or forums.
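The sketch below uses requests and Beautiful Soup to pull product names from a page. The URL and the "product-title" CSS class are hypothetical placeholders; real sites use different markup, and scraping should respect a site's terms of service.

# A minimal web-scraping sketch with requests and Beautiful Soup.
# The URL and the "product-title" CSS class are hypothetical placeholders;
# a real site will use different markup.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every element marked with the product-title class.
for tag in soup.find_all(class_="product-title"):
    print(tag.get_text(strip=True))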
Data Characteristics:
1. Accuracy:
o This refers to how close the data is to the actual, real-world value.
Accurate data ensures reliable and truthful conclusions can be
drawn from it.
o Example: If you're recording temperatures, accurate data means it
reflects the actual temperature measured.
2. Completeness:
o Complete data includes all necessary and relevant information
needed to draw meaningful insights.
o Example: If you're tracking sales, missing data points like a sale
amount or a date would make the dataset incomplete.
3. Consistency:
o Consistent data means that the information is the same across
different databases or systems, without contradictions.
o Example: If a customer's name is spelled differently in different
parts of the system, that data is inconsistent.
4. Timeliness:
o Timely data is up-to-date and relevant for the current time period
or decision-making process.
o Example: If you're tracking stock prices, timely data reflects the
most recent market conditions, not outdated figures.
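These characteristics can be checked programmatically. The pandas sketch below flags missing values (completeness) and inconsistent spellings (consistency) in a small invented dataset.

# A minimal sketch of checking completeness and consistency with pandas.
# The customer records below are invented sample data.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Anna", "anna", "Ben", "Ben"],
    "amount": [120.0, None, 75.5, 75.5],
})

# Completeness: count missing values in each column.
print(df.isna().sum())

# Consistency: normalizing case reveals "Anna"/"anna" as the same name.
print(df["customer"].str.lower().nunique(), "distinct customers after normalizing")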
Sources of Data
Data can come from numerous sources, both structured and unstructured, and
these sources can be categorized by their origin.
1. Primary Data:
• Definition: This is data that is directly collected for a specific purpose by
the researcher or organization.
• Examples:
o Surveys: Asking people questions to collect information.
o Interviews: One-on-one or group conversations to gather insights.
o Experiments: Controlled tests to gather data on specific variables.
o Observations: Directly watching and recording behaviors or events.
2. Secondary Data:
• Definition: This data has already been collected by someone else and is
repurposed for new research or analysis.
• Examples:
o Government Reports: Data published by government agencies on
various topics.
o Academic Papers: Studies and research conducted by scholars.
o Market Research Reports: Data provided by firms that analyze
consumer behavior, trends, and markets.
3. Internal Data:
• Definition: This is data generated within an organization, often related to
its operations and activities.
• Examples:
o Company Sales Data: Information on the company’s sales
performance.
o Employee Records: Data about the organization’s staff, such as
performance, attendance, and salaries.
o Financial Transactions: Data related to the company’s income,
expenses, and profits.
4. External Data:
• Definition: This is data that comes from outside an organization,
typically from third-party sources.
• Examples:
o Social Media Data: Information gathered from platforms like
Facebook, Twitter, Instagram, etc.
o Public Datasets: Data made available by government bodies,
NGOs, or research organizations.
o Data from External Partners: Data shared by other companies or
entities that the organization collaborates with.
5. Big Data Sources:
• Definition: Big Data refers to large volumes of data generated at high
speeds, often in real-time, that require advanced processing techniques.
• Examples:
o Social Media Platforms: Data generated from user interactions,
posts, likes, comments, etc.
o IoT Devices: Data collected from Internet of Things devices like
smart home devices, sensors, and wearables.
o E-commerce Platforms: Data from online shopping activities,
including customer preferences, purchase history, and browsing
behavior.
Big Data
Big data refers to extremely large datasets that are complex, varied, and grow at
an exponential rate. Traditional data processing tools cannot efficiently manage,
store, or analyze these datasets. The concept of big data is commonly defined by
the 5 V's:
1. Volume: Refers to the massive size of data generated from various
sources, such as social media, IoT devices, and e-commerce platforms.
For example, Facebook generates terabytes of data daily.
2. Velocity: Indicates the speed at which data is generated and processed.
Real-time data, like stock market feeds or live social media updates,
highlights the need for fast processing.
3. Variety: Describes the different types of data, including structured
(databases), semi-structured (JSON, XML), and unstructured (images,
videos, social media posts).
4. Veracity: Reflects the uncertainty and reliability of data. Data must be
cleansed and validated to ensure accuracy for analysis.
5. Value: Highlights the importance of extracting meaningful insights from
raw data for business or societal benefits.
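To illustrate the Variety dimension, the sketch below flattens a semi-structured JSON record into a tabular (structured) form with pandas. The social-media record is invented sample data.

# A minimal sketch of handling semi-structured (JSON) data in Python.
# The social-media record below is invented sample data.
import json
import pandas as pd

raw = '{"user": "alice", "post": {"likes": 42, "comments": 3}}'
record = json.loads(raw)

# Flatten the nested JSON into a tabular (structured) form.
df = pd.json_normalize(record)
print(df)  # columns: user, post.likes, post.comments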
Applications of Big Data
• Healthcare: Analyzing patient data for improved diagnosis and
personalized treatments.
• E-commerce: Optimizing customer experience through personalized
recommendations.
• Finance: Fraud detection and risk analysis in real-time.
• Transportation: Enhancing logistics and traffic management with IoT
data.
Challenges in Big Data
1. Data storage and management require scalable solutions like Hadoop or
cloud platforms.
2. Data security and privacy concerns arise due to the sensitive nature of
data.
3. Integration of varied datasets from multiple sources is complex.