Introduction to Data Science
Data science uses a variety of tools and methods, such as machine learning, statistical modeling,
and data visualization, to analyze and make predictions from data.
“Data Science is a process of extraction, preparation, analysis, visualization, and maintenance of
information”.
Example: predicting an outcome, such as who will be the next President of the USA.
Suppose we want to travel from station A to station B by car. We need to make some decisions, such as:
1. Which route will get us to the destination fastest?
2. Which route is likely to have no traffic jams?
3. Which route will be the most cost-effective?
All these decision factors act as input data, and we derive an appropriate answer from them; this process is called data analysis.
Benefits of Data Science
Improves business predictions
Interpretation of complex data
Better decision making
Product innovation
Improves data security
Development of user-centric products
Data:
Data science is all about experimenting on raw or structured data.
Its insights help to improve business, launch new products, or try out different experiments.
Data comes in various categories, each with its own qualities and characteristics; these categories are called data types.
Types of data:
Qualitative: Nominal, Ordinal
Quantitative: Discrete, Continuous
2. Ordinal Data
Ordinal data have a natural ordering, in which values are ordered by their position on a scale.
Ordinal data only show the order of a sequence.
These data are used for observations such as customer satisfaction or happiness, but we cannot perform arithmetic on them (a short sketch follows the examples below).
Examples of Ordinal Data:
Feedback, experience, or satisfaction on a scale of 1 to 10
Grades in the exam (A, B, C, D, etc.)
Ranking in a competition (First, Second, Third, etc.)
Education Level (Higher, Secondary, Primary)
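As a small illustrative sketch (not from the notes), ordinal data can be represented in pandas as an ordered categorical, which preserves the position of each value on the scale while the labels themselves still do not support arithmetic. The rating labels below are assumptions.

```python
import pandas as pd

# Hypothetical satisfaction ratings collected on an ordered scale
ratings = pd.Series(["Low", "High", "Medium", "Low", "High"])

# Declare the natural ordering so comparisons respect the scale
ordered = pd.Categorical(ratings, categories=["Low", "Medium", "High"], ordered=True)

print(ordered.codes)                 # integer positions on the scale: [0 2 1 0 2]
print(ordered.min(), ordered.max())  # "Low" "High"
```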
Quantitative Data
Quantitative data can be expressed in numerical values; they are countable and suitable for statistical analysis.
Example: price of a smartphone, discount, processor of a smartphone, RAM, internal storage.
Examples of Quantitative Data:
Height or weight of a person or object
Room Temperature or Time
Scores and Marks (Ex: 59, 80, 60, etc.)
1. Discrete Data
"Discrete" means separate. Discrete data contain integer or whole-number values; they cannot be broken into decimal or fractional values.
Discrete data are countable and have finite values; their subdivision is not possible.
Examples of Discrete Data:
The age of a person in whole years, such as 18 or 19, not 20.8
Total numbers of students present in a class
Cost of a cell phone
The total number of players in a team
Days in a week
2. Continuous Data
Continuous data are in the form of fractional numbers and represent information that can be divided into ever smaller values.
The continuous variable can take any value within a range.
Examples of Continuous Data:
Height of a person, like 5.5 ft or 8.2 ft
Time taken to finish the work
Wi-Fi Range
Market share price
Domain experts and data scientists are the key persons in problem identification.
A domain expert has in-depth knowledge of the subject and knows exactly what problem needs to be solved.
A data scientist understands the subject, identifies the problem, and provides possible solutions to it.
Example: If a business wants to reduce credit loss, it needs to find out the factors that affect it.
Models capture the characteristics, patterns, and relationships within the data.
These models help to find patterns, make predictions, and produce reliable results.
The choice of model depends on:
Accuracy of the model
Amount of data
Time and space constraints
Scalability of the model
Firstly, models are tested on dummy data that is similar to the actual data.
Model evaluation:
The trained model is tested on unused datasets and evaluated for performance.
If the desired results are not achieved, we iterate on the model until it performs acceptably.
The goal is to build a model that can accurately predict the target variable from a set of features known as predictors (a minimal evaluation sketch follows).
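As a hedged sketch of this evaluation step (assuming scikit-learn; the synthetic data, logistic regression model, and accuracy metric are illustrative choices, not part of the notes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative data: predictors X and target y
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)       # fit on training data
predictions = model.predict(X_test)                      # predict on unseen data
print("Accuracy:", accuracy_score(y_test, predictions))  # evaluate; iterate if too low
```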
6. Model deployment:
Model deployment is the process of putting a model into production.
The deployed model makes predictions that are available to users, developers, or systems, who can then make data-driven business decisions and interact with their applications.
After careful evaluation and modifications, the data model will become ready to provide the
results in real time.
It is deployed in the desired channel and format.
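As a hedged illustration (not prescribed by the notes), one common deployment pattern is to serialize the trained model and load it inside the serving application. The sketch below uses scikit-learn and joblib; the model, data, and file name are assumptions.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small illustrative model, then persist it for deployment
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)
joblib.dump(model, "model.joblib")   # serialize the fitted model to disk

# In the deployed service (for example, behind a web API), load once and reuse
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:3]))         # predictions served to users or systems
```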
Applications of Data Science in various fields:
Data science has changed almost every industry: in medicine, it helps predict patient side effects; in sports, it is used to analyze athletic performance; in transportation, route-optimization models capture typical rush-hour and weekend traffic patterns.
E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use data science to create a better user experience through personalized recommendations. We get suggestions for products similar to our choices based on our past data, and we also get recommendations based on the most purchased, most rated, and most searched products, etc.
Product Recommendation
Product recommendations can guide customers to buy related products. For example, a seller might bundle products together and offer a discount, such as bundling shampoo and conditioner and giving a discount on the pair.
Healthcare
The healthcare industry uses data science to build tools that detect and cure disease. Data science helps in various branches of healthcare, such as medical image analysis, development of new drugs, genetics and genomics, predictive modeling for diagnosis, and providing virtual assistance to patients.
Transportation
The objective of the transportation or Logistics industry is to ensure the efficient and safe
movement of people or goods from one location to another with the best-optimized route,
optimized delivery time, and price. Self-driving or autonomous car systems are also used to reduce the number of accidents.
Fraud and Risk Detection
Data science helps judge whether a given transaction is fraudulent. It can analyze an individual customer's financial information, loans the person has taken in the past, overall income, and debts. It also helps to classify and segment transaction data to find patterns that might predict fraud.
Image Recognition
Image recognition is the process of identifying and classifying faces, objects, colors, patterns, shapes, traffic signboards, etc. For example, you can unlock your smartphone using a scan of your face or thumb: the system detects the face, classifies it as a human face, and then decides whether the phone belongs to the actual owner.
Speech recognition
Speech recognition allows you to speak out the message and automatically convert it to text.
Some of the best speech recognition products are Chatbots, Google Voice, Siri, Cortana,
Alexa, Google Assistant, etc.
Search Engines
Google, Yahoo, Bing, Ask, etc. search engines take the query as input and apply various Data
Science techniques to provide the most relevant results to the user within a fraction of a
second.
Future Forecasting
Weather forecasting and other kinds of future forecasting are done based on various types of data collected from many sources.
Advantages/ Disadvantages of Data Science
Advantages: Better decision-making; Improved efficiency; Enhanced customer experience; Predictive analytics; Innovation and new discoveries
Disadvantages: Data privacy concerns; Bias in data; Misinterpretation of data; Data quality issues; Cost and time
3. Data theft by employees or other internal users, such as contractors or partners
4. Human errors such as accidentally sending sensitive data to someone unauthorized to see it
Data Security Issues:
Data and Model privacy
Data quality and integrity
Model robustness
Lack of data visibility
Misconfiguration and Leaving data open and unprotected
Unauthorized access to data and Cyberattacks
Denial-of-service attacks
Hijacking of accounts
Insecure Interfaces and APIs
Malicious insiders
Data loss
Careless data management
1. Fake Data
Fake data makes it difficult to detect other security issues in the system and can cause the loss of client data. It can confuse fraud identification and halt business processes.
2. Data Cleaning Failure
It can reduce the quality of the database and also create the potential for breaches.
3. Data Masking Issues
The data masking process ensures the separation of confidential information from the actual data. If masking fails, someone can reconstruct the database and use the confidential data, which is a massive risk to all the sensitive information your organization handles.
4. Loss of Data Access Control
Different users can have different access levels to data. It can be challenging to manage all the access in a company, and losing control of data access means losing data confidentiality.
5. Model poisoning
Model poisoning is an attack on a model's training data that manipulates the outcome of the model. Threat actors can try to inject malicious data into the training set, which causes the model to misclassify data and make bad decisions, so the model cannot work properly.
6. Insider threats
Insiders can access your company's sensitive or confidential information and use it for their own benefit.
The way to handle data science security issues is to follow data security basics such as the following (a short encryption sketch appears after this list):
Encryption ensures that the data is unreadable by unauthorized parties, even if they access the
storage or the network.
Authentication verifies the identity of the users or systems that access the data,
Authorization defines the level of access and permissions they have.
Auditing tracks and logs the data access and usage activities, which can help detect and prevent
breaches or misuse.
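As a hedged illustration of the encryption point above (the notes do not prescribe a specific library), the sketch below uses the Fernet recipe from the `cryptography` package for symmetric encryption; the sample record is an assumption.

```python
from cryptography.fernet import Fernet

# Symmetric encryption: without the key, the stored bytes are unreadable
key = Fernet.generate_key()   # keep this secret, e.g. in a key-management service
fernet = Fernet(key)

token = fernet.encrypt(b"customer_id=123, balance=4500")  # ciphertext safe to store
print(fernet.decrypt(token))                               # only key holders can read it
```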
Data science roles:
Data Scientist:
Data scientists are responsible for finding insights, patterns, and trends in the data.
A data scientist is responsible for collecting and handling raw data, analyzing and interpreting it, implementing various statistical procedures, and visualizing the data to generate business insights.
Data scientist's roles and responsibilities:
Identifying the required data sets for the analysis
Collecting large data sets from various sources
Perform predictive analysis
Searching for patterns and trends in data that impact the business
Using data visualization tools to create charts and dashboards
Collaborating with IT and business teams.
Data Scientist Skills:
Linear algebra, calculus and Statistics
Programming knowledge in Python, R, Scala
Knowledge of data preprocessing
AI/ML
Relational database management systems like SQL
Natural Language Processing algorithms
Data visualization libraries such as Matplotlib and Seaborn
Deep Learning frameworks (e.g., TensorFlow)
Strong communication and presentation skills
Data Analyst:
Data Analysts are responsible for preparing, transforming, managing, processing, and
visualizing the data for business growth.
They mainly deal with the analysis and visualization of the data.
They work on structured, unstructured, and semi-structured data to generate reports, identify patterns and valuable insights, and produce data visualizations that are easily read by business users.
Data Analyst Roles and Responsibilities:
Conduct surveys to collect raw data.
Extracting data from primary and secondary data sources using automated tools
Performing data analysis and visualizing data in the form of graphs and reports
Use statistical methodologies and procedures to make reports
Analyzing data and predicting trends that impact the organization
Data Engineer:
Data engineers are responsible for developing, constructing, and managing data pipelines and data models.
Data engineers also update the existing systems with newer or upgraded versions of the current
technologies to improve the efficiency of the databases.
They deal with many responsibilities related to data, such as storage, reliability, durability, backup, cleaning, availability, etc.
Data Engineer Roles and Responsibilities
Programming language skills in R/Python
Knowledge of Various tools like SAS, Qlikview, Tableau, Excel, etc.
Relational database systems like SQL
Experience of data extraction from many sources
Understanding of quantitative techniques, sampling, and statistical software
Building and maintaining data pipelines
Data Engineer Skills
At least one programming language, such as Python
Understanding of data modeling and warehousing
Big Data tools (Hadoop Stack such as HDFS, M/R, Hive, Pig, etc.)
ETL (Extract, Transform, Load) tools, NoSQL, Apache Spark, and relational DBMSs
Data Architect:
Data architects are also responsible for design patterns, data modeling, blueprints for data
management service-oriented integration, and business intelligence.
The data can be easily integrated, centralized, and protected with the best security measures.
They organize and manage data at both the macro and micro levels.
A data architect develops the systems and tools that are used by data scientists, analysts, machine learning engineers, and artificial intelligence experts.
Data Architect Roles and Responsibilities
Creating and implementing a data strategy for the business
Auditing the performance of data management systems regularly to improve them
Explaining complex technical issues to non-technical staff
Ensuring the accessibility and accuracy of data
Data Architect Skills
Programming languages like Java, Python, R, SQL
Knowledge of data warehouses, data governance, and big data analytics.
Data visualization tools.
Data flow and integration automation
ML Engineer:
A Machine Learning Engineer is responsible for adapting machine learning models for
performing classification and regression tasks.
They develop highly efficient machine learning models to assist data scientists in assessing,
analyzing, and organizing large amounts of data.
A machine learning engineer has knowledge of various techniques like classification, regression, clustering, and deep learning algorithms.
Machine Learning Engineer Roles and Responsibilities
Designing, building, and testing machine learning systems
Examining and presenting data
Improving the performance of ML models by changing various parameters.
Machine Learning Engineer Skills
Strong mathematical and statistical foundation.
Natural language processing.
Deep expertise in technologies like Python, Java, SQL, Scala, or C++.
Querying and processing datasets, building regression models, and creating and testing hypotheses
Machine learning algorithms.
BI Developer:
BI or Business Intelligence Developer is responsible for maintaining business interfaces which
include data visualization, future prediction, etc., and helps businesses to set their future goals.
Business Analyst
A business analyst is responsible for improving existing business processes, operations, products, services, and software through data analysis; they identify problems and develop solutions.
A business analyst acts as a bridge between business and technology within the organization.
They closely work with stakeholders to understand their needs, gather and analyze data, and
develop strategies to optimize business performance.
Their work ensures businesses operate efficiently, effectively, and profitably.
Business Analysts Roles and Responsibilities
Analysing, designing, and implementing new systems, processes, or technologies to achieve
desired outcomes.
Improving existing business procedures.
Conducting detailed business analysis, outlining problems, opportunities, and solutions
Budget and Pricing analysis
Execute quality assurance
Share significant discoveries and ideas with the product team.
Business Analysts Skills
Knowledge of business
Data visualization tools such as Power BI and Tableau
Expert in MS PowerPoint and MS Excel for documentation purposes
Excellent critical thinking, problem-solving, and decision-making skills
Knowledge of statistics and probability
Statistician
A statistician has sound knowledge of statistical models, theories, techniques, and data organization, and is responsible for extracting valuable insights from data.
They gather, organize, analyze, and evaluate data.
Statistician Roles and Responsibilities:
Collecting, analyzing, and interpreting data
Assessing results and predicting trends and relationships using statistical methodologies
Designing data collection processes
Consulting on organizational and business strategy on the basis of data
Statistician Skills
Excellent knowledge of R, Python, SQL, and MATLAB.
Expertise in statistical theories, machine learning methods, and database management models.
Proficiency with statistical software, such as SPSS.
Ability to communicate with other departments to coordinate data collection
Expertise in company operations and industry knowledge
Database Administrator
A database administrator manages the database and is responsible for continuously monitoring it to guarantee efficient functioning, data security, and proper user access and permissions.
They are also responsible for data availability by performing frequent backups, retrieving data when necessary, and testing databases to ensure reliable operation.
Database Administrator Roles and Responsibilities
Work on design and development of database
Maintain and safeguard sensitive business data in collaboration with the IT security team.
Build database software to store and manage data.
Data archiving
Work in collaboration with programmers, project managers, and other team members
Make the essential data available and accessible using cloud servers.
Database Administrator Skills
Excellent knowledge of SQL
Understanding of database backup, recovery, security, and design
Proficient with at least one database management system, such as IBM DB2, Oracle, Microsoft
SQL Server, or MySQL
Solid problem-solving and analytical skills
****************
Question:
1) Define Data Science. How is data science beneficial to us?
2) What is data security? What are the major components of data security? Discuss various data security issues.
3) Describe Exploratory Data Analysis and its role in data science.
4) Explain the different stages of data science.
5) Describe any five applications of data science in detail.
6) Who is a data scientist? Differentiate between a data scientist and a business analyst.
Data Collection is the process of collecting information from relevant sources in order to find a solution to the
given statistical enquiry. Collection of Data is the first and foremost step in a statistical investigation.
Here, statistical enquiry means an investigation made by any agency on a topic in which the investigator
collects the relevant quantitative information. In simple terms, a statistical enquiry is the search of truth by
using statistical methods of collection, compiling, analysis, interpretation, etc. The basic problem for any
statistical enquiry is the collection of facts and figures related to this specific phenomenon that is being
studied. Therefore, the basic purpose of data collection is collecting evidence to reach a sound and clear
solution to a problem.
Data collection is the process of measuring and gathering information on the variables of interest in a systematic fashion, so that questions about the data can be answered and used in research of various types. Data collection is a common
feature of study in various disciplines, such as marketing, statistics, economics, sciences, etc. The methods of
collecting data may vary according to subjects, but the ultimate aim of the study and honesty in data
collection are of the same importance in all matters of study.
Thus, Data is a tool that helps an investigator in understanding the problem by providing him with the
information required. Data can be classified into two types; viz., Primary Data and Secondary
Data. Primary Data is the data collected by the investigator from primary sources for the first time from
scratch. However, Secondary Data is the data already in existence that has been previously collected by
someone else for other purposes. It does not include any real-time data as the research has already been done
on that information.
Methods of Collecting Data
There are two different methods of collecting data: Primary Data Collection and Secondary Data Collection.
There are a number of methods of collecting primary data. Some of the common methods are as follows:
1. Direct Personal Investigation: As the name suggests, the method of direct personal investigation involves
collecting data personally from the source of origin. In simple words, the investigator makes direct contact
with the person from whom he/she wants to obtain information. This method can attain success only when the
investigator collecting data is efficient, diligent, tolerant and impartial. For example, direct contact with the
household women to obtain information about their daily routine and schedule.
2. Indirect Oral Investigation: In this method of collecting primary data, the investigator does not make
direct contact with the person from whom he/she needs information, instead they collect the data orally from
some other person who has the necessary required information. For example, collecting data of employees
from their superiors or managers.
3. Information from Local Sources or Correspondents: In this method, the investigator appoints correspondents or local persons at various places, who collect the data and furnish it to the investigator. With the help of correspondents and local persons, the investigators can cover a wide area.
4. Information through Questionnaires and Schedules: In this method of collecting primary data, the
investigator, while keeping in mind the motive of the study, prepares a questionnaire. The investigator can
collect data through the questionnaire in two ways:
Mailing Method: This method involves mailing the questionnaires to the informants for the collection of
data. The investigator attaches a letter with the questionnaire in the mail to define the purpose of the study
or research. The investigator also assures the informants that their information would be kept secret, and
then the informants note the answers to the questionnaire and return the completed file.
Enumerator’s Method: This method involves the preparation of a questionnaire according to the purpose
of the study or research. However, in this case, the enumerator reaches out to the informants himself with
the prepared questionnaire. Enumerators are not the investigators themselves; they are the people who help
the investigator in the collection of data.
Primary data is collected by researchers on their own and for the first time in a study. There are various ways
of collecting primary data, some of which are the following:
Interview: Interviews are the most used primary data collection method. In interviews a questionnaire
is used to collect data or the researcher may ask questions directly to the interviewee. The idea is to
seek information on concerning topics from the answers of the respondent. Questionnaires used can be sent via email, or details can be asked over telephonic interviews.
Delphi Technique: In this method, the researcher asks for information from the panel of experts. The
researcher may choose in-person research or questionnaires may be sent via email. At the end of the
Delphi technique, all data is collected according to the need of the research.
Projective techniques: Projective techniques are used in research that is private or confidential in a
manner where the researcher thinks that respondents won’t reveal information if direct questions are
asked. There are many types of projective techniques, such as the Thematic Apperception Test (TAT),
role-playing, cartoon completion, word association, and sentence completion.
Focus Group Interview: Here a few people gather to discuss the problem at hand. The number of
participants is usually between six to twelve in such interviews. Every participant expresses his own
insights and a collective unanimous decision is reached.
Questionnaire Method: Here a questionnaire is used for collecting data from a diverse group
population. A set of questions is used for the concerned research and respondents answer queries
related to the questionnaire directly or indirectly. This method can be either open-ended or closed-ended.
B. Methods of Collecting Secondary Data
Secondary data can be collected through different published and unpublished sources. Some of them are as
follows:
1. Published Sources
Government Publications: The government publishes different documents consisting of a variety of information or data released by the Ministries and the Central and State Governments in India as part of their routine activity. As the government publishes these statistics, they are fairly reliable for the investigator. Examples of government publications on statistics are the Annual Survey of Industries, the Statistical Abstract of India, etc.
Semi-Government Publications: Different Semi-Government bodies also publish data related to health,
education, deaths and births. These kinds of data are also reliable and used by different informants. Some
examples of semi-government bodies are Metropolitan Councils, Municipalities, etc.
Publications of Trade Associations: Various big trade associations collect and publish data from their
research and statistical divisions of different trading activities and their aspects. For example, data
published by Sugar Mills Association regarding different sugar mills in India.
Journals and Papers: Different newspapers and magazines provide a variety of statistical data in their
writings, which are used by different investigators for their studies.
International Publications: Different international organizations like IMF, UNO, ILO, World Bank, etc.,
publish a variety of statistical information which are used as secondary data.
Publications of Research Institutions: Research institutions and universities also publish their research
activities and their findings, which are used by different investigators as secondary data. For example, the National Council of Applied Economic Research, the Indian Statistical Institute, etc.
2. Unpublished Sources
Another source of collecting secondary data is unpublished sources. The data in unpublished sources is
collected by different government organizations and other organizations. These organizations usually collect data for their own use, and it is not published anywhere. For example, research work done by professors, professionals, and teachers, and records maintained by business and private enterprises.
Observation Method
The observation method is used when the study relates to behavioural science. This method is planned systematically and is subject to many controls and checks. There are different types of observations.
Interview Method
This method collects data in the form of verbal responses. It is carried out in two ways:
Personal Interview – In this method, a person known as an interviewer is required to ask questions
face to face to the other person. The personal interview can be structured or unstructured, direct
investigation, focused conversation, etc.
Telephonic Interview – In this method, an interviewer obtains information by contacting people on the
telephone to ask the questions or views, verbally.
Questionnaire Method
In this method, the set of questions are mailed to the respondent. They should read, reply and subsequently
return the questionnaire. The questions are printed in a definite order on the form. A good survey should have certain features.
Schedule Method
This method is similar to the questionnaire method, with a slight difference: enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove any misunderstandings that come up. Enumerators should be trained to perform their job
with hard work and patience.
Government publications
Public records
Historical and statistical documents
Business documents
Technical and trade journals
Unpublished data includes
Diaries
Letters
Unpublished biographies, etc.
Whether you’re collecting data for business or academic research, the first step is to identify the type of
data you need to collect and what method you’ll use to do so. In general, there are two data types —
primary and secondary — and you can gather both with a variety of effective collection methods.
Primary data refers to original, firsthand information, while secondary data refers to information retrieved
from already existing sources. Peter Drow, head of marketing at NCCuttingTools, explains that “original
findings are primary data, whereas secondary data refers to information that has already been reported in
secondary sources, such as books, newspapers, periodicals, magazines, web portals, etc.”
Both primary and secondary data-collection methods have their pros, cons, and particular use cases. Read
on for an explanation of your options and a list of some of the best methods to consider.
Primary data-collection methods
As mentioned above, primary data collection involves gathering original and firsthand source information.
Primary data-collection methods help researchers or service providers obtain specific and up-to-date
information about their research subjects. These methods involve reaching out to a targeted group of
people and sourcing data from them through surveys, interviews, observations, experiments, etc.
You can collect primary data using quantitative or qualitative methods. Let’s take a closer look at the two:
Quantitative data-collection methods involve collecting information that you can analyze numerically.
Closed-ended surveys and questionnaires with predefined options are usually the ways researchers collect
quantitative information. They can then analyze the results using mathematical calculations such as means,
modes, and grouped frequencies. An example is a simple poll. It’s easy to quickly determine or express the
number of participants who choose a specific option as a percentage of the whole.
Qualitative data collection involves retrieving nonmathematical data from primary sources. Unlike
quantitative data-collection methods where subjects are limited to predefined options, qualitative data-collection methods give subjects a chance to freely express their thoughts about the research topic. As a
result, the data researchers collect via these methods is unstructured and often nonquantifiable.
Here’s an important difference between the two: While quantitative methods focus on understanding
“what,” “who,” or “how much,” qualitative methods focus on understanding “why” and “how.” For
example, quantitative research on parents may show trends that are specific to fathers or mothers, but it
may not uncover why those trends exist.
Drow explains that applying quantitative methods is faster and cheaper than applying qualitative methods.
“It is simple to compare results because quantitative approaches are highly standardized. In contrast,
qualitative research techniques rely on words, sounds, feelings, emotions, colors, and other intangible
components.”
Drow emphasizes that the field of your study and the goals and objectives of your research will influence
your decision about whether to use quantitative or qualitative methodologies for data collection.
1. Surveys and questionnaires
While researchers often use the terms “survey” and “questionnaire” interchangeably, the two mean slightly different things.
A questionnaire refers specifically to the set of questions researchers use to collect information from
respondents. It may include closed-ended questions, which means respondents are limited to predefined
answers, or open-ended questions, which allow respondents to give their own answers.
A survey includes the entire process of creating questionnaires, collecting responses, and analyzing the
results.
2. Interviews
An interview is a conversation in which one participant asks questions and the other provides answers.
Interviews work best for small groups and help you understand the opinions and feelings of respondents.
Interviews may be structured or unstructured. Structured interviews are similar to questionnaires and
involve asking predetermined questions with specific multiple-choice answers. Unstructured interviews,
on the other hand, give subjects the freedom to provide their own answers. You can conduct interviews in
person or via recorded video or audio conferencing.
3. Focus groups
A focus group is a small group of people who have an informal discussion about a particular topic,
product, or idea. The researcher selects participants with similar interests, gives them topics to discuss,
and records what they say.
Focus groups can help you better understand the results of a large-group quantitative study. For example, a
survey of 1,000 respondents may help you spot trends and patterns, but a focus group of 10 respondents
will provide additional context for the results of the large-group survey.
4. Observation
Observation involves watching participants or their interactions with specific products or objects. It’s a
great way to collect data from a group when they’re unwilling or unable to participate in interviews —
children are a good example.
You can conduct observations covertly or overtly. The former involves discreetly observing people’s
behavior without their knowledge. This allows you to see them acting naturally. On the other hand, you have to conduct overt observation openly, and it may cause the subjects to behave unnaturally.
Advantages of primary data collection
1. Accuracy: You collect data firsthand from the target demographic, which leaves less room for error or misreporting.
2. Recency: Sourcing primary data ensures you have the most up-to-date information about the
research subject.
3. Control: You have full control over the data-collection process and can make adjustments where
necessary to improve the quality of the data you collect.
4. Relevance: You can ask specific questions that are directly relevant to your research.
5. Privacy: You can control access to the research results and maintain the confidentiality of respondents.
Disadvantages of primary data collection
1. Cost: Collecting primary data can be expensive, especially if you’re working with a large group.
2. Labor: Collecting raw data can be labor intensive. When you’re gathering data from large groups,
you need more skilled hands. And if you’re researching something arcane or unusual, it might be
difficult to find people with the appropriate expertise.
3. Time: Collecting primary data takes time. If you’re conducting surveys, for example, participants
have to fill out questionnaires. This could take anywhere from a few days to several months,
depending on the size of the study group, how you deliver the survey, and how quickly participants
respond. Post-survey activities, such as organizing and cleaning data to make it usable, also add up.
Secondary data collection involves retrieving already available data from sources other than the target
audience. When working with secondary data, the researcher doesn’t “collect” data; instead, they consult
secondary data sources.
Secondary data sources are broadly categorized into published and unpublished data. As the names
suggest, published data has been published and released for public or private use, while unpublished data
comprises unreleased private information that researchers or individuals have documented.
When choosing public data sources, Drow strongly recommends considering the date of publication, the
author’s credentials, the source’s dependability, the text’s level of discussion and depth of analysis, and
the impact it has had on the growth of the field of study.
Data that reputable organizations have collected from research is usually published online. Many of these
sources are freely accessible and serve as reliable data sources. But it’s best to search for the latest editions
of these publications because dated ones may provide invalid data.
2. Government records and publications
Periodically, government institutions collect data from people. The information can range from population
figures to organizational records and other statistical information such as age distribution. You can usually
find information like this in government libraries and use it for research purposes.
3. Industry and business records
Industries and trade organizations usually release revenue figures and periodic industry trends in quarterly or biannual publications. These records serve as viable secondary data sources since they’re industry-specific.
Previous business records, such as companies’ sales and revenue figures, can also be useful for research.
While some of this information is available to the public, you may have to get permission to access other
records.
4. Newspapers
Newspapers often publish data they’ve collected from their own surveys. Due to the volume of resources
you’ll have to sift through, some surveys may be relevant to your niche but difficult to find on paper.
Luckily, most newspapers are also published online, so looking through their online archives for specific
data may be easier.
5. Unpublished sources
These include diaries, letters, reports, records, and figures belonging to private individuals; these sources
aren’t in the public domain. Since authoritative bodies haven’t vetted or published the data, it can often be
unreliable.
Below are some of the benefits of secondary data-collection methods and their advantages over primary
methods.
1. Speed: Secondary data-collection methods are efficient because delayed responses and data
documentation don’t factor into the process. Using secondary data, analysts can go straight into
data analysis.
2. Low cost: Using secondary data is easier on the budget when compared to primary data collection.
Secondary data often allows you to avoid logistics and other survey expenses.
3. Volume: There are thousands of published resources available for data analysis. You can sift
through the data that several individual research efforts have produced to find the components that
are most relevant to your needs.
4. Ease of use: Secondary data, especially data that organizations and the government have
published, is usually clean and organized. This makes it easy to understand and extract.
5. Ease of access: It’s generally easier to source secondary data than primary data. A basic internet
search can return relevant information at little or no cost.
Disadvantages of secondary data collection
1. Lack of control: Using secondary data means you have no control over the survey process.
Already published data may not include the questions you need answers to. This makes it difficult
to find the exact data you need.
2. Lack of specificity: There may not be many available reports for new industries, and government
publications often have the same problems. Furthermore, if there’s no available data for the niche
your service specializes in, you’ll encounter problems using secondary data.
3. Lack of uniqueness: Using secondary sources may not give you the originality and uniqueness
you need from data. For instance, if your service or product hinges on innovation and uses an out-of-the-norm approach to problem-solving, you may be disappointed by the generic nature of the data you collect.
4. Age: Because user preferences change over time, data can evolve. The secondary data you retrieve
can become invalid. When this happens, it becomes difficult to source new data without conducting
a hands-on survey.
The errors that occur while collecting data are known as statistical errors. These depend on the sample size selected for the study. There are two types of statistical errors: Sampling Errors and Non-Sampling Errors.
1. Sampling Errors:
The errors which are related to the nature or size of the sample selected for the study are known as
Sampling Errors. If the size of the sample selected is very small or the nature of the sample is non-representative, then the estimated value may differ from the actual value of a parameter. This kind of error is a sampling error. For example, if the estimated value of a parameter is 10 while the actual value is 30, then the sampling error will be 10 − 30 = −20.
Sampling Error = Estimated Value – Actual Value
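As a small illustrative simulation (an assumption, using NumPy; not part of the notes), the sketch below shows how the sampling error of the sample mean tends to shrink as the sample size grows, consistent with the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=30, scale=10, size=100_000)  # actual (population) mean is about 30

for n in (10, 100, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    estimate = sample.mean()                             # estimated value from the sample
    error = estimate - population.mean()                 # sampling error = estimated - actual
    print(f"n={n:>6}  estimate={estimate:6.2f}  sampling error={error:+.2f}")
```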
2. Non-Sampling Errors:
The errors related to the collection of data are known as Non-Sampling Errors. The different types of Non-
Sampling Errors are Error of Measurement, Error of Non-response, Error of Misinterpretation, Error of
Calculation or Arithmetical Error, and Error of Sampling Bias.
i) Error of Measurement:
The reason behind the occurrence of Error of Measurement may be difference in the scale of measurement
and difference in the rounding-off procedure that is adopted by different investigators.
ii) Error of Non-response:
These errors arise when the respondents do not offer the information required for the study.
iii) Error of Misinterpretation:
These errors arise when the respondents misinterpret the questions given in the questionnaire.
iv) Error of Calculation or Arithmetical Error:
These errors occur while adding, subtracting, or multiplying figures of data.
v) Error of Sampling Bias:
These errors occur when, for one reason or another, a part of the target population cannot be included in the sample.
Note: If the field of investigation is larger or the size of the population is larger, then the possibility of the
occurrence of errors related to the collection of data is high. Besides, a non-sampling error is more serious
than a sampling error. It is because one can minimize the sampling error by opting for a larger sample size
which is not possible in the case of non-sampling errors.
Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming, and
integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve the
quality of the data and to make it more suitable for the specific data mining task.
Data preprocessing is an important step in the data mining process that involves cleaning and transforming
raw data to make it suitable for analysis. Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing
values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation,
removal, and transformation.
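As a minimal sketch of the cleaning step (assuming pandas; the toy data are illustrative), duplicates can be removed and a missing value imputed with the column mean:

```python
import pandas as pd
import numpy as np

# Illustrative raw data with a missing value and a duplicate row
df = pd.DataFrame({"age": [25, np.nan, 31, 31], "city": ["Pune", "Mumbai", "Delhi", "Delhi"]})

df = df.drop_duplicates()                       # removal of duplicate records
df["age"] = df["age"].fillna(df["age"].mean())  # imputation: fill missing age with the mean
print(df)
```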
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and semantics.
Techniques such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform the data
to have zero mean and unit variance. Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important information.
Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset, while feature extraction involves
transforming the data into a lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning, and
clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1 and 1.
Normalization is often used to handle data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
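A minimal sketch of min-max and z-score scaling with scikit-learn (an assumption; the notes do not name a library, and the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # min-max normalization: scaled to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # z-score standardization: zero mean, unit variance
```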
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results.
The specific steps involved in data preprocessing may vary depending on the nature of the data and the
analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more
accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data in a useful and efficient
format.
(b). Noisy Data:
Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into segments of equal size, and then various methods are applied to complete the task. Each segment is handled separately: one can replace all values in a segment with the segment mean, or boundary values can be used to complete the task (a small sketch follows this list).
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
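As a small sketch of smoothing by bin means, as described in the binning method above (assuming NumPy; the sorted values are a common textbook-style example):

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 4)  # equal-size segments of the sorted data

# Smoothing by bin means: every value in a segment is replaced by the segment mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```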
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining process. This
involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
3. Data Reduction:
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the
most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than using
the entire dataset. This can be useful for reducing the size of a dataset while still preserving the overall
trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the dataset,
either by removing features that are not relevant or by combining multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless compression to
reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that are most
relevant to the task at hand.
Note that data reduction involves a trade-off between accuracy and the size of the data: the more the data is reduced, the less accurate and less generalizable the resulting model may be. A short sketch of two reduction techniques follows.
Discretization
Data discretization refers to a method of converting a huge number of data values into smaller ones so that the
evaluation and management of data become easy. In other words, data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: supervised discretization and unsupervised discretization. Supervised discretization refers to a method in which the class information is used. Unsupervised discretization depends on how the operation proceeds, i.e., whether it uses a top-down splitting strategy or a bottom-up merging strategy.
Another example is web analytics, where we gather statistics about website visitors; for example, all visitors who visit the site from an IP address in India are grouped at the country level.
Histogram analysis
Histogram refers to a plot used to represent the underlying frequency distribution of a continuous data set.
Histogram assists the data inspection for data distribution. For example, Outliers, skewness representation,
normal distribution representation, etc.
Binning
Binning refers to a data smoothing technique that helps to group a huge number of continuous values into
smaller values. For data discretization and the development of idea hierarchy, this technique can also be used.
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm is executed by dividing the values of an attribute x into clusters to isolate a computational feature of x.
Decision Tree Analysis
Data discretization can also be done through decision tree analysis, in which a top-down slicing technique is used. It is a supervised procedure. In numeric attribute discretization, first you need to select the attribute that
has the least entropy, and then you need to run it with the help of a recursive process. The recursive process
divides it into various discretized disjoint intervals, from top to bottom, using the same splitting criterion.
By discretizing data with a linear regression technique, you can get the best neighboring interval, and then the large intervals are combined to develop a larger overlap to form the final 20 overlapping intervals. It is a supervised procedure.
Whenever we talk about data analysis, the term outliers often comes to mind. As the name suggests, "outliers" refer to data points that lie outside of what is expected. The important question is what you do with them. Whenever you analyze a data set, you will have some assumptions about how the data was generated; if you find data points that are likely to contain some form of error, these are outliers, and depending on the context, you may want to correct or remove those errors. The data mining process involves analyzing the data and making predictions from it. In 1969, Grubbs introduced the first definition of outliers.
Any unwanted error or variance in a previously measured variable is called noise. Before finding the outliers present in a data set, it is recommended to first remove the noise.
Types of Outliers
Global Outliers
Global outliers are also called point outliers. Global outliers are taken as the simplest form of outliers. When
data points deviate from all the rest of the data points in a given data set, it is known as the global outlier. In
most cases, outlier detection procedures are targeted at determining global outliers. For example, a single value that lies far away from all the other values in a data set is a global outlier (a small detection sketch follows).
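A minimal sketch of global-outlier detection using a simple z-score rule (an assumption; the notes do not prescribe a method, and the threshold of 2 is illustrative):

```python
import numpy as np

data = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 55.0])  # 55.0 deviates from all other points

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]                 # simple global-outlier rule
print(outliers)                                       # [55.]
```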
Collective Outliers
In a given set of data, when a group of data points deviates from the rest of the data set is called collective
outliers. Here, the particular set of data objects may not be outliers, but when you consider the data objects as
a whole, they may behave as outliers. To identify the types of different outliers, you need to go through
background information about the relationship between the behavior of outliers shown by different data
objects. For example, in an Intrusion Detection System, a DOS (denial-of-service) packet sent from one system to another is taken as normal behavior; however, if this happens across various computers simultaneously, it is considered abnormal behavior, and as a whole such events are called collective outliers.
Contextual Outliers
As the name suggests, "contextual" means the outlier is defined within a context. For example, in speech recognition, a single burst of background noise is a contextual outlier. Contextual outliers are also known as conditional
outliers. These types of outliers happen if a data object deviates from the other data points because of any
specific condition in a given data set. As we know, there are two types of attributes of objects of data:
contextual attributes and behavioral attributes. Contextual outlier analysis enables the users to examine
outliers in different contexts and conditions, which can be useful in various applications. For example, A
temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season. Still, it will behave like
a normal data point in the context of a summer season. Similarly, a low temperature value recorded in June is a contextual outlier, while the same value in December is not an outlier.
Outliers Analysis
Outliers are discarded in many data mining applications, but outlier analysis is still used in domains such as fraud detection and medicine. This is usually because events that occur rarely can carry much more significant information than events that occur regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to medical treatment can be analyzed through outlier analysis in data
mining.
The process in which the behavior of outliers in a dataset is identified is called outlier analysis. It is also known as "outlier mining" and is regarded as a significant task of data mining.
Machine Learning is one of the booming technologies across the world that enables computers/machines to
turn a huge amount of data into predictions. However, these predictions highly depend on the quality of the
data, and if we are not using the right data for our model, then it will not generate the expected result. In
machine learning projects, we generally divide the original dataset into training data and test data. We train
our model over a subset of the original dataset, i.e., the training dataset, and then evaluate whether it can
generalize well to the new or unseen dataset or test set. Therefore, train and test datasets are the two key
concepts of machine learning, where the training dataset is used to fit the model, and the test dataset is
used to evaluate the model.
In this topic, we are going to discuss train and test datasets along with the difference between both of them.
So, let's start with the introduction of the training dataset and test dataset in Machine Learning.
The training data is the biggest (in size) subset of the original dataset, which is used to train or fit the
machine learning model. Firstly, the training data is fed to the ML algorithms, which lets them learn how to
make predictions for the given task.
For example, for training a sentiment analysis model, the training data could consist of text samples, each labeled with its sentiment.
The training data varies depending on whether we are using Supervised Learning or Unsupervised Learning
Algorithms.
Once we train the model with the training dataset, it's time to test the model with the test dataset. This dataset
evaluates the performance of the model and ensures that the model can generalize well with the new or
unseen dataset. The test dataset is another subset of original data, which is independent of the training
dataset.
Need of Splitting dataset into Train and Test set
Splitting the dataset into train and test sets is one of the important parts of data pre-processing; by doing so, we can measure and improve the performance of our model and hence obtain better predictions.
We can understand it this way: if we train our model with a training set and then test it with a completely different test dataset, our model will not be able to understand the correlations between the features (a small splitting sketch follows this paragraph).
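A minimal sketch of the split using scikit-learn's train_test_split (the 80/20 ratio and the toy arrays are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # illustrative feature matrix (50 samples, 2 features)
y = np.arange(50)                   # illustrative target values

# Hold out 20% of the rows as the unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```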
Machine Learning algorithms enable the machines to make predictions and solve problems on the basis of
past observations or experiences. An algorithm takes these experiences or observations from the training data that is fed to it. Further, one of the great things about ML algorithms is that they can learn and
improve over time on their own, as they are trained with the relevant training data.
Once the model is trained enough with the relevant training data, it is tested with the test data. We can
understand the whole process of training and testing in three steps, which are as follows:
1. Feed: Firstly, we need to train the model by feeding it with training input data.
2. Define: Now, training data is tagged with the corresponding outputs (in Supervised Learning), and the
model transforms the training data into text vectors or a number of data features.
3. Test: In the last step, we test the model by feeding it with the test data/unseen dataset. This step
ensures that the model is trained efficiently and can generalize well.