Introduction to Data Science
Data science uses a variety of tools and methods, such as machine learning, statistical modeling,
and data visualization, to analyze and make predictions from data.
“Data Science is a process of extraction, preparation, analysis, visualization, and maintenance of
information”.
Example: predicting an outcome, such as who will be the next President of the USA.
Suppose we want to travel from station A to station B by car. We need to make some decisions, such as:
1. Which route will get us to the destination fastest?
2. Which route is likely to have no traffic jams?
3. Which route will be the most cost-effective?
All these decision factors act as input data, and we derive an appropriate answer from them; this process is called data analysis.
Benefits of Data Science
Improves business predictions
Interpretation of complex data
Better decision making
Product innovation
Improves data security
Development of user-centric products
Data:
Data science is all about experimenting on raw or structured data.
Its insights help to improve business, launch new products, or try out different experiments.
Data comes in various categories, each with its own qualities and characteristics; these categories are called data types.
Types of data:
Qualitative: Nominal, Ordinal
Quantitative: Discrete, Continuous
2. Ordinal Data
Ordinal data have a natural ordering, in which values are ordered by their position on a scale.
Ordinal data only show the order of a sequence.
These data are used for observations such as customer satisfaction or happiness, but we cannot perform arithmetic on them (a short sketch follows the examples below).
Examples of Ordinal Data:
Feedback, experience, or satisfaction on a scale of 1 to 10
Grades in the exam (A, B, C, D, etc.)
Ranking in a competition (First, Second, Third, etc.)
Education Level (Higher, Secondary, Primary)
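As a small illustrative sketch (not from the notes), ordinal data can be represented in pandas as an ordered categorical, which preserves the position of each value on the scale while the labels themselves still do not support arithmetic. The rating labels below are assumptions.

```python
import pandas as pd

# Hypothetical satisfaction ratings collected on an ordered scale
ratings = pd.Series(["Low", "High", "Medium", "Low", "High"])

# Declare the natural ordering so comparisons respect the scale
ordered = pd.Categorical(ratings, categories=["Low", "Medium", "High"], ordered=True)

print(ordered.codes)                 # integer positions on the scale: [0 2 1 0 2]
print(ordered.min(), ordered.max())  # "Low" "High"
```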
Quantitative Data
Quantitative data can be expressed in numerical values; they are countable and suitable for statistical analysis.
Example: price of a smartphone, discount, processor of a smartphone, RAM, internal storage.
Examples of Quantitative Data:
Height or weight of a person or object
Room Temperature or Time
Scores and Marks (Ex: 59, 80, 60, etc.)
1. Discrete Data
"Discrete" means separate. Discrete data contain integer or whole-number values; they cannot be broken into decimal or fractional values.
Discrete data are countable and have finite values; their subdivision is not possible.
Examples of Discrete Data:
The age of a person in whole years, such as 18 or 19, not 20.8
Total numbers of students present in a class
Cost of a cell phone
The total number of players in a team
Days in a week
2. Continuous Data
Continuous data are in the form of fractional numbers and represent information that can be divided into ever smaller values.
The continuous variable can take any value within a range.
Examples of Continuous Data:
Height of a person, like 5.5 ft or 8.2 ft
Time taken to finish the work
Wi-Fi Range
Market share price
Domain experts and data scientists are the key persons in problem identification.
A domain expert has in-depth knowledge of the subject and knows exactly what problem needs to be solved.
A data scientist understands the subject, identifies the problem, and provides possible solutions to it.
Example: If a business wants to reduce credit loss, it needs to find out the factors that affect it.
Models capture the characteristics, patterns, and relationships within the data.
These models help to find patterns, make predictions, and produce reliable results.
The choice of model depends on:
Accuracy of the model
Amount of data
Time and space constraints
Scalability of the model
Firstly, models are tested on dummy data that is similar to the actual data.
Model evaluation:
The trained model is tested on unused datasets and evaluated for performance.
If the desired results are not achieved, we iterate on the model until it performs acceptably.
The goal is to build a model that can accurately predict the target variable from a set of features known as predictors (a minimal evaluation sketch follows).
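As a hedged sketch of this evaluation step (assuming scikit-learn; the synthetic data, logistic regression model, and accuracy metric are illustrative choices, not part of the notes):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Illustrative data: predictors X and target y
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)       # fit on training data
predictions = model.predict(X_test)                      # predict on unseen data
print("Accuracy:", accuracy_score(y_test, predictions))  # evaluate; iterate if too low
```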
6. Model deployment:
Model deployment is the process of putting a model into production.
The deployed model makes predictions that are available to users, developers, or systems, who can then make data-driven business decisions and interact with their applications.
After careful evaluation and modifications, the data model will become ready to provide the
results in real time.
It is deployed in the desired channel and format.
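As a hedged illustration (not prescribed by the notes), one common deployment pattern is to serialize the trained model and load it inside the serving application. The sketch below uses scikit-learn and joblib; the model, data, and file name are assumptions.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a small illustrative model, then persist it for deployment
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)
joblib.dump(model, "model.joblib")   # serialize the fitted model to disk

# In the deployed service (for example, behind a web API), load once and reuse
loaded = joblib.load("model.joblib")
print(loaded.predict(X[:3]))         # predictions served to users or systems
```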
Applications of Data Science in various fields:
Data science has changed almost every industry: in medicine, it helps predict patient side effects; in sports, it is used to analyze athletic performance; in transportation, route-optimization models capture typical rush-hour and weekend traffic patterns.
E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use data science to create a better user experience through personalized recommendations. We get suggestions for products similar to our choices based on our past data, and we also get recommendations based on the most purchased, most rated, and most searched products, etc.
Product Recommendation
Product recommendations can guide customers to buy related products. For example, a seller might bundle products together and offer a discount, such as bundling shampoo and conditioner and giving a discount on the pair.
Healthcare
The healthcare industry uses data science to build tools that detect and cure disease. Data science helps in various branches of healthcare, such as medical image analysis, development of new drugs, genetics and genomics, predictive modeling for diagnosis, and providing virtual assistance to patients.
Transportation
The objective of the transportation or Logistics industry is to ensure the efficient and safe
movement of people or goods from one location to another with the best-optimized route,
optimized delivery time, and price. Self-driving or autonomous car systems are also used to reduce the number of accidents.
Fraud and Risk Detection
Data science helps judge whether a given transaction is fraudulent. It can analyze an individual customer's financial information, loans the person has taken in the past, overall income, and debts. It also helps to classify and segment transaction data to find patterns that might predict fraud.
Image Recognition
Image recognition is the process of identifying and classifying faces, objects, colors, patterns, shapes, traffic signboards, etc. For example, you can unlock your smartphone using a scan of your face or thumb: the system detects the face, classifies it as a human face, and then decides whether the phone belongs to the actual owner.
Speech recognition
Speech recognition allows you to speak out the message and automatically convert it to text.
Some of the best speech recognition products are Chatbots, Google Voice, Siri, Cortana,
Alexa, Google Assistant, etc.
Search Engines
Google, Yahoo, Bing, Ask, etc. search engines take the query as input and apply various Data
Science techniques to provide the most relevant results to the user within a fraction of a
second.
Future Forecasting
Weather forecasting and other kinds of future forecasting are done based on various types of data collected from many sources.
Advantages/ Disadvantages of Data Science
Advantages: Better decision-making; Improved efficiency; Enhanced customer experience; Predictive analytics; Innovation and new discoveries
Disadvantages: Data privacy concerns; Bias in data; Misinterpretation of data; Data quality issues; Cost and time
3. Data theft by employees or other internal users, such as contractors or partners
4. Human errors such as accidentally sending sensitive data to someone unauthorized to see it
Data Security Issues:
Data and Model privacy
Data quality and integrity
Model robustness
Lack of data visibility
Misconfiguration and Leaving data open and unprotected
Unauthorized access to data and Cyberattacks
Denial-of-service attacks
Hijacking of accounts
Insecure Interfaces and APIs
Malicious insiders
Data loss
Careless data management
1. Fake Data
Fake data makes it difficult to detect other security issues in the system and can cause the loss of client data. It can confuse fraud identification and halt business processes.
2. Data Cleaning Failure
It can reduce the quality of the database and also create the potential for breaches.
3. Data Masking Issues
The data masking process ensures the separation of confidential information from the actual data. If masking fails, someone can reconstruct the database and use the confidential data, which is a massive risk to all the sensitive information your organization handles.
4. Loss of Data Access Control
Different users can have different access levels to data. It can be challenging to manage all the access in a company, and losing control of data access means losing data confidentiality.
5. Model poisoning
Model poisoning is an attack on a model's training data that manipulates the outcome of the model. Threat actors can try to inject malicious data into the training set, which causes the model to misclassify data and make bad decisions, so the model cannot work properly.
6. Insider threats
Insiders can access your company's sensitive or confidential information and use it for their own benefit.
The way to handle data science security issues is to follow data security basics such as the following (a short encryption sketch appears after this list):
Encryption ensures that the data is unreadable by unauthorized parties, even if they access the
storage or the network.
Authentication verifies the identity of the users or systems that access the data,
Authorization defines the level of access and permissions they have.
Auditing tracks and logs the data access and usage activities, which can help detect and prevent
breaches or misuse.
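As a hedged illustration of the encryption point above (the notes do not prescribe a specific library), the sketch below uses the Fernet recipe from the `cryptography` package for symmetric encryption; the sample record is an assumption.

```python
from cryptography.fernet import Fernet

# Symmetric encryption: without the key, the stored bytes are unreadable
key = Fernet.generate_key()   # keep this secret, e.g. in a key-management service
fernet = Fernet(key)

token = fernet.encrypt(b"customer_id=123, balance=4500")  # ciphertext safe to store
print(fernet.decrypt(token))                               # only key holders can read it
```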
Data science roles:
Data Scientist:
Data scientists are responsible for finding insights, patterns, and trends in the data.
A data scientist is responsible for collecting and handling raw data, analyzing and interpreting it, implementing various statistical procedures, and visualizing the data to generate business insights.
Data scientist's roles and responsibilities:
Identifying the required data sets for the analysis
Collecting large data sets from various sources
Perform predictive analysis
Searching for patterns and trends in data that impact the business
Using data visualization tools to create charts and dashboards
Collaborating with IT and business teams.
Data Scientist Skills:
Linear algebra, calculus and Statistics
Programming knowledge in Python, R, Scala
Knowledge of data preprocessing
AI/ML
Relational database management systems like SQL
Natural Language Processing algorithms
Data visualization libraries such as Matplotlib and Seaborn
Deep Learning frameworks (e.g., TensorFlow)
Strong communication and presentation skills
Data Analyst:
Data Analysts are responsible for preparing, transforming, managing, processing, and
visualizing the data for business growth.
They mainly deal with the analysis and visualization of the data.
They work on structured, unstructured, and semi-structured data to generate reports, identify patterns and valuable insights, and produce data visualizations that are easily read by business users.
Data Analyst Roles and Responsibilities:
Conduct surveys to collect raw data.
Extracting data from primary and secondary data sources using automated tools
Performing data analysis and visualizing data in the form of graphs and reports
Use statistical methodologies and procedures to make reports
Analyzing data and predicting trends that impact the organization
Data Engineer:
Data engineers are responsible for developing, constructing, and managing data pipelines and data models.
Data engineers also update the existing systems with newer or upgraded versions of the current
technologies to improve the efficiency of the databases.
They deal with many responsibilities related to data, such as storage, reliability, durability, backup, cleaning, availability, etc.
Data Engineer Roles and Responsibilities
Programming language skills in R/Python
Knowledge of Various tools like SAS, Qlikview, Tableau, Excel, etc.
Relational database systems like SQL
Experience of data extraction from many sources
Understanding of quantitative techniques, sampling, and statistical software
Building and maintaining data pipelines
Data Engineer Skills
At least one programming language, such as Python
Understanding of data modeling and warehousing
Big Data tools (Hadoop Stack such as HDFS, M/R, Hive, Pig, etc.)
ETL (Extract, Transform, Load) tools, NoSQL, Apache Spark, and relational DBMSs
Data Architect:
Data architects are also responsible for design patterns, data modeling, blueprints for data
management service-oriented integration, and business intelligence.
The data can be easily integrated, centralized, and protected with the best security measures.
They organize and manage data at both the macro and micro levels.
A data architect develops the systems and tools that are used by data scientists, analysts, machine learning engineers, and artificial intelligence experts.
Data Architect Roles and Responsibilities
Creating and implementing a data strategy for the business
Auditing the performance of data management systems regularly to improve them
Explaining complex technical issues to non-technical staff
Ensuring the accessibility and accuracy of data
Data Architect Skills
Programming languages like Java, Python, R, SQL
Knowledge of data warehouses, data governance, and big data analytics.
Data visualization tools.
Data flow and integration automation
ML Engineer:
A Machine Learning Engineer is responsible for adapting machine learning models for
performing classification and regression tasks.
They develop highly efficient machine learning models to assist data scientists in assessing,
analyzing, and organizing large amounts of data.
A machine learning engineer has knowledge of various techniques like classification, regression, clustering, and deep learning algorithms.
Machine Learning Engineer Roles and Responsibilities
Designing, building, and testing machine learning systems
Examining and presenting data
Improving the performance of ML models by changing various parameters.
Machine Learning Engineer Skills
Strong mathematical and statistical foundation.
Natural language processing.
Deep expertise in technologies like Python, Java, SQL, Scala, or C++.
Querying and processing datasets, building regression models, and creating and testing hypotheses
Machine learning algorithms.
BI Developer:
BI or Business Intelligence Developer is responsible for maintaining business interfaces which
include data visualization, future prediction, etc., and helps businesses to set their future goals.
Business Analyst
A business analyst is responsible for improving existing business processes, operations, products, services, and software through data analysis; they identify problems and develop solutions.
A business analyst acts as a bridge between business and technology within the organization.
They closely work with stakeholders to understand their needs, gather and analyze data, and
develop strategies to optimize business performance.
Their work ensures businesses operate efficiently, effectively, and profitably.
Business Analysts Roles and Responsibilities
Analysing, designing, and implementing new systems, processes, or technologies to achieve
desired outcomes.
Improving existing business procedures.
Conducting detailed business analysis, outlining problems, opportunities, and solutions
Budget and Pricing analysis
Execute quality assurance
Share significant discoveries and ideas with the product team.
Business Analysts Skills
Knowledge of business
Data visualization tools such as Power BI and Tableau
Expert in MS PowerPoint and MS Excel for documentation purposes
Excellent critical thinking, problem-solving, and decision-making skills
Knowledge of statistics and probability
Statistician
A statistician has sound knowledge of statistical models, theories, techniques, and data organization, and is responsible for extracting valuable insights from data.
They gather, organize, analyze, and evaluate data.
Statistician Roles and Responsibilities:
Collecting, analyzing, and interpreting data
Assessing results and predicting trends and relationships using statistical methodologies
Designing data collection processes
Consulting on organizational and business strategy on the basis of data
Statistician Skills
Excellent knowledge of R, Python, SQL, and MATLAB.
Expertise in statistical theories, machine learning methods, and database management models.
Proficiency with statistical software, such as SPSS.
Ability to communicate with other departments to coordinate data collection
Expertise in company operations and industry knowledge
Database Administrator
A database administrator manages the database and is responsible for continuously monitoring it to guarantee efficient functioning, data security, and proper user access and permissions.
They are also responsible for data availability by performing frequent backups, retrieving data when necessary, and testing databases to ensure reliable operation.
Database Administrator Roles and Responsibilities
Work on design and development of database
Maintain and safeguard sensitive business data in collaboration with the IT security team.
Build database software to store and manage data.
Data archiving
Work in collaboration with programmers, project managers, and other team members
Make the essential data available and accessible using cloud servers.
Database Administrator Skills
Excellent knowledge of SQL
Understanding of database backup, recovery, security, and design
Proficient with at least one database management system, such as IBM DB2, Oracle, Microsoft
SQL Server, or MySQL
Solid problem-solving and analytical skills
****************
Question:
1) Define Data Science. How is data science beneficial to us?
2) What is data security? What are the major components of data security? Discuss various data security issues.
3) Describe Exploratory Data Analysis and its role in data science.
4) Explain the different stages of data science.
5) Describe any five applications of data science in detail.
6) Who is a data scientist? Differentiate between a data scientist and a business analyst.
Data Collection is the process of collecting information from relevant sources in order to find a solution to the
given statistical enquiry. Collection of Data is the first and foremost step in a statistical investigation.
Here, statistical enquiry means an investigation made by any agency on a topic in which the investigator
collects the relevant quantitative information. In simple terms, a statistical enquiry is the search of truth by
using statistical methods of collection, compiling, analysis, interpretation, etc. The basic problem for any
statistical enquiry is the collection of facts and figures related to this specific phenomenon that is being
studied. Therefore, the basic purpose of data collection is collecting evidence to reach a sound and clear
solution to a problem.
Data collection is the process of measuring and gathering information on the variables of interest in a systematic fashion, so that questions about the data can be answered and used in research of various types. Data collection is a common
feature of study in various disciplines, such as marketing, statistics, economics, sciences, etc. The methods of
collecting data may vary according to subjects, but the ultimate aim of the study and honesty in data
collection are of the same importance in all matters of study.
Thus, Data is a tool that helps an investigator in understanding the problem by providing him with the
information required. Data can be classified into two types; viz., Primary Data and Secondary
Data. Primary Data is the data collected by the investigator from primary sources for the first time from
scratch. However, Secondary Data is the data already in existence that has been previously collected by
someone else for other purposes. It does not include any real-time data as the research has already been done
on that information.
Methods of Collecting Data
There are two different methods of collecting data: Primary Data Collection and Secondary Data Collection.
There are a number of methods of collecting primary data. Some of the common methods are as follows:
1. Direct Personal Investigation: As the name suggests, the method of direct personal investigation involves
collecting data personally from the source of origin. In simple words, the investigator makes direct contact
with the person from whom he/she wants to obtain information. This method can attain success only when the
investigator collecting data is efficient, diligent, tolerant and impartial. For example, direct contact with the
household women to obtain information about their daily routine and schedule.
2. Indirect Oral Investigation: In this method of collecting primary data, the investigator does not make
direct contact with the person from whom he/she needs information, instead they collect the data orally from
some other person who has the necessary required information. For example, collecting data of employees
from their superiors or managers.
3. Information from Local Sources or Correspondents: In this method, the investigator appoints correspondents or local persons at various places, who collect the data and furnish it to the investigator. With the help of correspondents and local persons, the investigators can cover a wide area.
4. Information through Questionnaires and Schedules: In this method of collecting primary data, the
investigator, while keeping in mind the motive of the study, prepares a questionnaire. The investigator can
collect data through the questionnaire in two ways:
Mailing Method: This method involves mailing the questionnaires to the informants for the collection of
data. The investigator attaches a letter with the questionnaire in the mail to define the purpose of the study
or research. The investigator also assures the informants that their information would be kept secret, and
then the informants note the answers to the questionnaire and return the completed file.
Enumerator’s Method: This method involves the preparation of a questionnaire according to the purpose
of the study or research. However, in this case, the enumerator reaches out to the informants himself with
the prepared questionnaire. Enumerators are not the investigators themselves; they are the people who help
the investigator in the collection of data.
Primary data is collected by researchers on their own and for the first time in a study. There are various ways
of collecting primary data, some of which are the following:
Interview: Interviews are the most used primary data collection method. In interviews a questionnaire
is used to collect data or the researcher may ask questions directly to the interviewee. The idea is to
seek information on concerning topics from the answers of the respondent. Questionnaires used can be sent via email, or details can be asked over telephonic interviews.
Delphi Technique: In this method, the researcher asks for information from the panel of experts. The
researcher may choose in-person research or questionnaires may be sent via email. At the end of the
Delphi technique, all data is collected according to the need of the research.
Projective techniques: Projective techniques are used in research that is private or confidential in a
manner where the researcher thinks that respondents won’t reveal information if direct questions are
asked. There are many types of projective techniques, such as the Thematic Apperception Test (TAT),
role-playing, cartoon completion, word association, and sentence completion.
Focus Group Interview: Here a few people gather to discuss the problem at hand. The number of
participants is usually between six to twelve in such interviews. Every participant expresses his own
insights and a collective unanimous decision is reached.
Questionnaire Method: Here a questionnaire is used for collecting data from a diverse group
population. A set of questions is used for the concerned research and respondents answer queries
related to the questionnaire directly or indirectly. This method can be either open-ended or closed-ended.
B. Methods of Collecting Secondary Data
Secondary data can be collected through different published and unpublished sources. Some of them are as
follows:
1. Published Sources
Government Publications: The government publishes different documents consisting of a variety of information or data released by the Ministries and the Central and State Governments in India as part of their routine activity. As the government publishes these statistics, they are fairly reliable for the investigator. Examples of government publications on statistics are the Annual Survey of Industries, the Statistical Abstract of India, etc.
Semi-Government Publications: Different Semi-Government bodies also publish data related to health,
education, deaths and births. These kinds of data are also reliable and used by different informants. Some
examples of semi-government bodies are Metropolitan Councils, Municipalities, etc.
Publications of Trade Associations: Various big trade associations collect and publish data from their
research and statistical divisions of different trading activities and their aspects. For example, data
published by Sugar Mills Association regarding different sugar mills in India.
Journals and Papers: Different newspapers and magazines provide a variety of statistical data in their
writings, which are used by different investigators for their studies.
International Publications: Different international organizations like IMF, UNO, ILO, World Bank, etc.,
publish a variety of statistical information which are used as secondary data.
Publications of Research Institutions: Research institutions and universities also publish their research
activities and their findings, which are used by different investigators as secondary data. For example, the National Council of Applied Economic Research, the Indian Statistical Institute, etc.
2. Unpublished Sources
Another source of collecting secondary data is unpublished sources. The data in unpublished sources is
collected by different government organizations and other organizations. These organizations usually collect data for their own use, and it is not published anywhere. For example, research work done by professors, professionals, and teachers, and records maintained by business and private enterprises.
Observation Method
The observation method is used when the study relates to behavioural science. This method is planned systematically and is subject to many controls and checks. There are different types of observations.
Interview Method
This method collects data in the form of verbal responses. It is carried out in two ways:
Personal Interview – In this method, a person known as an interviewer is required to ask questions
face to face to the other person. The personal interview can be structured or unstructured, direct
investigation, focused conversation, etc.
Telephonic Interview – In this method, an interviewer obtains information by contacting people on the
telephone to ask the questions or views, verbally.
Questionnaire Method
In this method, the set of questions are mailed to the respondent. They should read, reply and subsequently
return the questionnaire. The questions are printed in a definite order on the form. A good survey should have certain features.
Schedule Method
This method is similar to the questionnaire method, with a slight difference: enumerators are specially appointed for the purpose of filling in the schedules. The enumerator explains the aims and objects of the investigation and may remove any misunderstandings that come up. Enumerators should be trained to perform their job
with hard work and patience.
Government publications
Public records
Historical and statistical documents
Business documents
Technical and trade journals
Unpublished data includes
Diaries
Letters
Unpublished biographies, etc.
Whether you’re collecting data for business or academic research, the first step is to identify the type of
data you need to collect and what method you’ll use to do so. In general, there are two data types —
primary and secondary — and you can gather both with a variety of effective collection methods.
Primary data refers to original, firsthand information, while secondary data refers to information retrieved
from already existing sources. Peter Drow, head of marketing at NCCuttingTools, explains that “original
findings are primary data, whereas secondary data refers to information that has already been reported in
secondary sources, such as books, newspapers, periodicals, magazines, web portals, etc.”
Both primary and secondary data-collection methods have their pros, cons, and particular use cases. Read
on for an explanation of your options and a list of some of the best methods to consider.
Primary data-collection methods
As mentioned above, primary data collection involves gathering original and firsthand source information.
Primary data-collection methods help researchers or service providers obtain specific and up-to-date
information about their research subjects. These methods involve reaching out to a targeted group of
people and sourcing data from them through surveys, interviews, observations, experiments, etc.
You can collect primary data using quantitative or qualitative methods. Let’s take a closer look at the two:
Quantitative data-collection methods involve collecting information that you can analyze numerically.
Closed-ended surveys and questionnaires with predefined options are usually the ways researchers collect
quantitative information. They can then analyze the results using mathematical calculations such as means,
modes, and grouped frequencies. An example is a simple poll. It’s easy to quickly determine or express the
number of participants who choose a specific option as a percentage of the whole.
Qualitative data collection involves retrieving nonmathematical data from primary sources. Unlike
quantitative data-collection methods where subjects are limited to predefined options, qualitative data-collection methods give subjects a chance to freely express their thoughts about the research topic. As a
result, the data researchers collect via these methods is unstructured and often nonquantifiable.
Here’s an important difference between the two: While quantitative methods focus on understanding
“what,” “who,” or “how much,” qualitative methods focus on understanding “why” and “how.” For
example, quantitative research on parents may show trends that are specific to fathers or mothers, but it
may not uncover why those trends exist.
Drow explains that applying quantitative methods is faster and cheaper than applying qualitative methods.
“It is simple to compare results because quantitative approaches are highly standardized. In contrast,
qualitative research techniques rely on words, sounds, feelings, emotions, colors, and other intangible
components.”
Drow emphasizes that the field of your study and the goals and objectives of your research will influence
your decision about whether to use quantitative or qualitative methodologies for data collection.
1. Surveys and questionnaires
While researchers often use the terms “survey” and “questionnaire” interchangeably, the two mean slightly different things.
A questionnaire refers specifically to the set of questions researchers use to collect information from
respondents. It may include closed-ended questions, which means respondents are limited to predefined
answers, or open-ended questions, which allow respondents to give their own answers.
A survey includes the entire process of creating questionnaires, collecting responses, and analyzing the
results.
2. Interviews
An interview is a conversation in which one participant asks questions and the other provides answers.
Interviews work best for small groups and help you understand the opinions and feelings of respondents.
Interviews may be structured or unstructured. Structured interviews are similar to questionnaires and
involve asking predetermined questions with specific multiple-choice answers. Unstructured interviews,
on the other hand, give subjects the freedom to provide their own answers. You can conduct interviews in
person or via recorded video or audio conferencing.
3. Focus groups
A focus group is a small group of people who have an informal discussion about a particular topic,
product, or idea. The researcher selects participants with similar interests, gives them topics to discuss,
and records what they say.
Focus groups can help you better understand the results of a large-group quantitative study. For example, a
survey of 1,000 respondents may help you spot trends and patterns, but a focus group of 10 respondents
will provide additional context for the results of the large-group survey.
4. Observation
Observation involves watching participants or their interactions with specific products or objects. It’s a
great way to collect data from a group when they’re unwilling or unable to participate in interviews —
children are a good example.
You can conduct observations covertly or overtly. The former involves discreetly observing people’s
behavior without their knowledge. This allows you to see them acting naturally. On the other hand, you have to conduct overt observation openly, and it may cause the subjects to behave unnaturally.
Advantages of primary data collection
1. Accuracy: You collect data firsthand from the target demographic, which leaves less room for error or misreporting.
2. Recency: Sourcing primary data ensures you have the most up-to-date information about the
research subject.
3. Control: You have full control over the data-collection process and can make adjustments where
necessary to improve the quality of the data you collect.
4. Relevance: You can ask specific questions that are directly relevant to your research.
5. Privacy: You can control access to the research results and maintain the confidentiality of respondents.
Disadvantages of primary data collection
1. Cost: Collecting primary data can be expensive, especially if you’re working with a large group.
2. Labor: Collecting raw data can be labor intensive. When you’re gathering data from large groups,
you need more skilled hands. And if you’re researching something arcane or unusual, it might be
difficult to find people with the appropriate expertise.
3. Time: Collecting primary data takes time. If you’re conducting surveys, for example, participants
have to fill out questionnaires. This could take anywhere from a few days to several months,
depending on the size of the study group, how you deliver the survey, and how quickly participants
respond. Post-survey activities, such as organizing and cleaning data to make it usable, also add up.
Secondary data collection involves retrieving already available data from sources other than the target
audience. When working with secondary data, the researcher doesn’t “collect” data; instead, they consult
secondary data sources.
Secondary data sources are broadly categorized into published and unpublished data. As the names
suggest, published data has been published and released for public or private use, while unpublished data
comprises unreleased private information that researchers or individuals have documented.
When choosing public data sources, Drow strongly recommends considering the date of publication, the
author’s credentials, the source’s dependability, the text’s level of discussion and depth of analysis, and
the impact it has had on the growth of the field of study.
Data that reputable organizations have collected from research is usually published online. Many of these
sources are freely accessible and serve as reliable data sources. But it’s best to search for the latest editions
of these publications because dated ones may provide invalid data.
2. Government records and publications
Periodically, government institutions collect data from people. The information can range from population
figures to organizational records and other statistical information such as age distribution. You can usually
find information like this in government libraries and use it for research purposes.
3. Industry and business records
Industries and trade organizations usually release revenue figures and periodic industry trends in quarterly or biannual publications. These records serve as viable secondary data sources since they’re industry-specific.
Previous business records, such as companies’ sales and revenue figures, can also be useful for research.
While some of this information is available to the public, you may have to get permission to access other
records.
4. Newspapers
Newspapers often publish data they’ve collected from their own surveys. Due to the volume of resources
you’ll have to sift through, some surveys may be relevant to your niche but difficult to find on paper.
Luckily, most newspapers are also published online, so looking through their online archives for specific
data may be easier.
5. Unpublished sources
These include diaries, letters, reports, records, and figures belonging to private individuals; these sources
aren’t in the public domain. Since authoritative bodies haven’t vetted or published the data, it can often be
unreliable.
Below are some of the benefits of secondary data-collection methods and their advantages over primary
methods.
1. Speed: Secondary data-collection methods are efficient because delayed responses and data
documentation don’t factor into the process. Using secondary data, analysts can go straight into
data analysis.
2. Low cost: Using secondary data is easier on the budget when compared to primary data collection.
Secondary data often allows you to avoid logistics and other survey expenses.
3. Volume: There are thousands of published resources available for data analysis. You can sift
through the data that several individual research efforts have produced to find the components that
are most relevant to your needs.
4. Ease of use: Secondary data, especially data that organizations and the government have
published, is usually clean and organized. This makes it easy to understand and extract.
5. Ease of access: It’s generally easier to source secondary data than primary data. A basic internet
search can return relevant information at little or no cost.
Disadvantages of secondary data collection
1. Lack of control: Using secondary data means you have no control over the survey process.
Already published data may not include the questions you need answers to. This makes it difficult
to find the exact data you need.
2. Lack of specificity: There may not be many available reports for new industries, and government
publications often have the same problems. Furthermore, if there’s no available data for the niche
your service specializes in, you’ll encounter problems using secondary data.
3. Lack of uniqueness: Using secondary sources may not give you the originality and uniqueness
you need from data. For instance, if your service or product hinges on innovation and uses an out-of-the-norm approach to problem-solving, you may be disappointed by the generic nature of the data you collect.
4. Age: Because user preferences change over time, data can evolve. The secondary data you retrieve
can become invalid. When this happens, it becomes difficult to source new data without conducting
a hands-on survey.
The errors that occur while collecting data are known as statistical errors. These depend on the sample size selected for the study. There are two types of statistical errors: Sampling Errors and Non-Sampling Errors.
1. Sampling Errors:
The errors which are related to the nature or size of the sample selected for the study are known as
Sampling Errors. If the size of the sample selected is very small or the nature of the sample is non-representative, then the estimated value may differ from the actual value of a parameter. This kind of error is a sampling error. For example, if the estimated value of a parameter is 10 while the actual value is 30, then the sampling error will be 10 − 30 = −20.
Sampling Error = Estimated Value – Actual Value
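As a small illustrative simulation (an assumption, using NumPy; not part of the notes), the sketch below shows how the sampling error of the sample mean tends to shrink as the sample size grows, consistent with the definition above.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=30, scale=10, size=100_000)  # actual (population) mean is about 30

for n in (10, 100, 10_000):
    sample = rng.choice(population, size=n, replace=False)
    estimate = sample.mean()                             # estimated value from the sample
    error = estimate - population.mean()                 # sampling error = estimated - actual
    print(f"n={n:>6}  estimate={estimate:6.2f}  sampling error={error:+.2f}")
```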
2. Non-Sampling Errors:
The errors related to the collection of data are known as Non-Sampling Errors. The different types of Non-
Sampling Errors are Error of Measurement, Error of Non-response, Error of Misinterpretation, Error of
Calculation or Arithmetical Error, and Error of Sampling Bias.
i) Error of Measurement:
The reason behind the occurrence of Error of Measurement may be difference in the scale of measurement
and difference in the rounding-off procedure that is adopted by different investigators.
ii) Error of Non-response:
These errors arise when the respondents do not offer the information required for the study.
iii) Error of Misinterpretation:
These errors arise when the respondents misinterpret the questions given in the questionnaire.
iv) Error of Calculation or Arithmetical Error:
These errors occur while adding, subtracting, or multiplying figures of data.
v) Error of Sampling Bias:
These errors occur when, for one reason or another, a part of the target population cannot be included in the sample.
Note: If the field of investigation is larger or the size of the population is larger, then the possibility of the
occurrence of errors related to the collection of data is high. Besides, a non-sampling error is more serious
than a sampling error. It is because one can minimize the sampling error by opting for a larger sample size
which is not possible in the case of non-sampling errors.
Data preprocessing is an important step in the data mining process. It refers to the cleaning, transforming, and
integrating of data in order to make it ready for analysis. The goal of data preprocessing is to improve the
quality of the data and to make it more suitable for the specific data mining task.
Data preprocessing is an important step in the data mining process that involves cleaning and transforming
raw data to make it suitable for analysis. Some common steps in data preprocessing include:
Data Cleaning: This involves identifying and correcting errors or inconsistencies in the data, such as missing
values, outliers, and duplicates. Various techniques can be used for data cleaning, such as imputation,
removal, and transformation.
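As a minimal sketch of the cleaning step (assuming pandas; the toy data are illustrative), duplicates can be removed and a missing value imputed with the column mean:

```python
import pandas as pd
import numpy as np

# Illustrative raw data with a missing value and a duplicate row
df = pd.DataFrame({"age": [25, np.nan, 31, 31], "city": ["Pune", "Mumbai", "Delhi", "Delhi"]})

df = df.drop_duplicates()                       # removal of duplicate records
df["age"] = df["age"].fillna(df["age"].mean())  # imputation: fill missing age with the mean
print(df)
```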
Data Integration: This involves combining data from multiple sources to create a unified dataset. Data
integration can be challenging as it requires handling data with different formats, structures, and semantics.
Techniques such as record linkage and data fusion can be used for data integration.
Data Transformation: This involves converting the data into a suitable format for analysis. Common
techniques used in data transformation include normalization, standardization, and discretization.
Normalization is used to scale the data to a common range, while standardization is used to transform the data
to have zero mean and unit variance. Discretization is used to convert continuous data into discrete categories.
Data Reduction: This involves reducing the size of the dataset while preserving the important information.
Data reduction can be achieved through techniques such as feature selection and feature extraction. Feature
selection involves selecting a subset of relevant features from the dataset, while feature extraction involves
transforming the data into a lower-dimensional space while preserving the important information.
Data Discretization: This involves dividing continuous data into discrete categories or intervals.
Discretization is often used in data mining and machine learning algorithms that require categorical data.
Discretization can be achieved through techniques such as equal width binning, equal frequency binning, and
clustering.
Data Normalization: This involves scaling the data to a common range, such as between 0 and 1 or -1 and 1.
Normalization is often used to handle data with different units and scales. Common normalization techniques
include min-max normalization, z-score normalization, and decimal scaling.
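A minimal sketch of min-max and z-score scaling with scikit-learn (an assumption; the notes do not name a library, and the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # min-max normalization: scaled to [0, 1]
print(StandardScaler().fit_transform(X).ravel())  # z-score standardization: zero mean, unit variance
```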
Data preprocessing plays a crucial role in ensuring the quality of data and the accuracy of the analysis results.
The specific steps involved in data preprocessing may vary depending on the nature of the data and the
analysis goals.
By performing these steps, the data mining process becomes more efficient and the results become more
accurate.
Preprocessing in Data Mining:
Data preprocessing is a data mining technique which is used to transform the raw data in a useful and efficient
format.
(b). Noisy Data:
Noisy data is meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is divided into segments of equal size, and then various methods are applied to complete the task. Each segment is handled separately: one can replace all values in a segment with the segment mean, or boundary values can be used to complete the task (a small sketch follows this list).
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
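As a small sketch of smoothing by bin means, as described in the binning method above (assuming NumPy; the sorted values are a common textbook-style example):

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = np.array_split(data, 4)  # equal-size segments of the sorted data

# Smoothing by bin means: every value in a segment is replaced by the segment mean
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
```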
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining process. This
involves following ways:
1. Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0 to 1.0)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval levels or conceptual levels.
3. Data Reduction:
Data reduction is a technique used in data mining to reduce the size of a dataset while still preserving the
most important information. This can be beneficial in situations where the dataset is too large to be
processed efficiently, or where the dataset contains a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work with, rather than using
the entire dataset. This can be useful for reducing the size of a dataset while still preserving the overall
trends and patterns in the data.
2. Dimensionality Reduction: This technique involves reducing the number of features in the dataset,
either by removing features that are not relevant or by combining multiple features into a single feature.
3. Data Compression: This technique involves using techniques such as lossy or lossless compression to
reduce the size of a dataset.
4. Data Discretization: This technique involves converting continuous data into discrete data by
partitioning the range of possible values into intervals or bins.
5. Feature Selection: This technique involves selecting a subset of features from the dataset that are most
relevant to the task at hand.
Note that data reduction involves a trade-off between accuracy and the size of the data: the more the data is reduced, the less accurate and less generalizable the resulting model may be. A short sketch of two reduction techniques follows.
Discretization
Data discretization refers to a method of converting a huge number of data values into smaller ones so that the
evaluation and management of data become easy. In other words, data discretization is a method of converting the attribute values of continuous data into a finite set of intervals with minimum data loss. There are two forms of data discretization: supervised discretization and unsupervised discretization. Supervised discretization refers to a method in which the class information is used. Unsupervised discretization depends on how the operation proceeds, i.e., whether it uses a top-down splitting strategy or a bottom-up merging strategy.
Another example is web analytics, where we gather statistics about website visitors; for example, all visitors who visit the site from an IP address in India are grouped at the country level.
Histogram analysis
Histogram refers to a plot used to represent the underlying frequency distribution of a continuous data set.
Histogram assists the data inspection for data distribution. For example, Outliers, skewness representation,
normal distribution representation, etc.
Binning
Binning refers to a data smoothing technique that helps to group a huge number of continuous values into
smaller values. For data discretization and the development of idea hierarchy, this technique can also be used.
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm is executed by dividing the values of an attribute x into clusters to isolate a computational feature of x.
Decision Tree Analysis
Data discretization can also be done through decision tree analysis, in which a top-down slicing technique is used. It is a supervised procedure. In numeric attribute discretization, first you need to select the attribute that
has the least entropy, and then you need to run it with the help of a recursive process. The recursive process
divides it into various discretized disjoint intervals, from top to bottom, using the same splitting criterion.
By discretizing data with a linear regression technique, you can get the best neighboring interval, and then the large intervals are combined to develop a larger overlap to form the final 20 overlapping intervals. It is a supervised procedure.
Whenever we talk about data analysis, the term outliers often comes to mind. As the name suggests, "outliers" refer to data points that lie outside of what is expected. The important question is what you do with them. Whenever you analyze a data set, you will have some assumptions about how the data was generated; if you find data points that are likely to contain some form of error, these are outliers, and depending on the context, you may want to correct or remove those errors. The data mining process involves analyzing the data and making predictions from it. In 1969, Grubbs introduced the first definition of outliers.
Any unwanted error or variance in a previously measured variable is called noise. Before finding the outliers present in a data set, it is recommended to first remove the noise.
Types of Outliers
Global Outliers
Global outliers are also called point outliers. Global outliers are taken as the simplest form of outliers. When
data points deviate from all the rest of the data points in a given data set, it is known as the global outlier. In
most cases, outlier detection procedures are targeted at determining global outliers. For example, a single value that lies far away from all the other values in a data set is a global outlier (a small detection sketch follows).
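A minimal sketch of global-outlier detection using a simple z-score rule (an assumption; the notes do not prescribe a method, and the threshold of 2 is illustrative):

```python
import numpy as np

data = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 55.0])  # 55.0 deviates from all other points

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]                 # simple global-outlier rule
print(outliers)                                       # [55.]
```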
Collective Outliers
In a given set of data, when a group of data points deviates from the rest of the data set is called collective
outliers. Here, the particular set of data objects may not be outliers, but when you consider the data objects as
a whole, they may behave as outliers. To identify the types of different outliers, you need to go through
background information about the relationship between the behavior of outliers shown by different data
objects. For example, in an Intrusion Detection System, a DOS (denial-of-service) packet sent from one system to another is taken as normal behavior; however, if this happens across various computers simultaneously, it is considered abnormal behavior, and as a whole such events are called collective outliers.
Contextual Outliers
As the name suggests, "contextual" means the outlier is defined within a context. For example, in speech recognition, a single burst of background noise is a contextual outlier. Contextual outliers are also known as conditional
outliers. These types of outliers happen if a data object deviates from the other data points because of any
specific condition in a given data set. As we know, there are two types of attributes of objects of data:
contextual attributes and behavioral attributes. Contextual outlier analysis enables the users to examine
outliers in different contexts and conditions, which can be useful in various applications. For example, A
temperature reading of 45 degrees Celsius may behave as an outlier in a rainy season. Still, it will behave like
a normal data point in the context of a summer season. Similarly, a low temperature value recorded in June is a contextual outlier, while the same value in December is not an outlier.
Outliers Analysis
Outliers are discarded in many data mining applications, but outlier analysis is still used in domains such as fraud detection and medicine. This is usually because events that occur rarely can carry much more significant information than events that occur regularly.
Other applications where outlier detection plays a vital role are given below.
Any unusual response that occurs due to medical treatment can be analyzed through outlier analysis in data
mining.
The process in which the behavior of outliers in a dataset is identified is called outlier analysis. It is also known as "outlier mining" and is regarded as a significant task of data mining.
Machine Learning is one of the booming technologies across the world that enables computers/machines to
turn a huge amount of data into predictions. However, these predictions highly depend on the quality of the
data, and if we are not using the right data for our model, then it will not generate the expected result. In
machine learning projects, we generally divide the original dataset into training data and test data. We train
our model over a subset of the original dataset, i.e., the training dataset, and then evaluate whether it can
generalize well to the new or unseen dataset or test set. Therefore, train and test datasets are the two key
concepts of machine learning, where the training dataset is used to fit the model, and the test dataset is
used to evaluate the model.
In this topic, we are going to discuss train and test datasets along with the difference between both of them.
So, let's start with the introduction of the training dataset and test dataset in Machine Learning.
The training data is the biggest (in size) subset of the original dataset, which is used to train or fit the
machine learning model. Firstly, the training data is fed to the ML algorithms, which lets them learn how to
make predictions for the given task.
For example, for training a sentiment analysis model, the training data could consist of text samples, each labeled with its sentiment.
The training data varies depending on whether we are using Supervised Learning or Unsupervised Learning
Algorithms.
Once we train the model with the training dataset, it's time to test the model with the test dataset. This dataset
evaluates the performance of the model and ensures that the model can generalize well with the new or
unseen dataset. The test dataset is another subset of original data, which is independent of the training
dataset.
Need of Splitting dataset into Train and Test set
Splitting the dataset into train and test sets is one of the important parts of data pre-processing; by doing so, we can measure and improve the performance of our model and hence obtain better predictions.
We can understand it this way: if we train our model with a training set and then test it with a completely different test dataset, our model will not be able to understand the correlations between the features (a small splitting sketch follows this paragraph).
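A minimal sketch of the split using scikit-learn's train_test_split (the 80/20 ratio and the toy arrays are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # illustrative feature matrix (50 samples, 2 features)
y = np.arange(50)                   # illustrative target values

# Hold out 20% of the rows as the unseen test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```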
Machine Learning algorithms enable the machines to make predictions and solve problems on the basis of
past observations or experiences. An algorithm takes these experiences or observations from the training data that is fed to it. Further, one of the great things about ML algorithms is that they can learn and
improve over time on their own, as they are trained with the relevant training data.
Once the model is trained enough with the relevant training data, it is tested with the test data. We can
understand the whole process of training and testing in three steps, which are as follows:
1. Feed: Firstly, we need to train the model by feeding it with training input data.
2. Define: Now, training data is tagged with the corresponding outputs (in Supervised Learning), and the
model transforms the training data into text vectors or a number of data features.
3. Test: In the last step, we test the model by feeding it with the test data/unseen dataset. This step
ensures that the model is trained efficiently and can generalize well.