Data Science
Types of data:
• Nominal data (categories with no inherent order, e.g., blood group or city)
• Ordinal data (categories with a meaningful order, e.g., satisfaction ratings)
• Discrete data (countable numeric values, e.g., number of children)
• Continuous data (measurable values on a continuous scale, e.g., height or weight)
Data Collection
1. The process of gathering and analyzing accurate data from various sources to find answers to research
problems, identify trends and probabilities, and evaluate possible outcomes is known as data collection.
2. Data collection is the process of collecting and evaluating information or data from multiple sources to find
answers to research problems, answer questions, evaluate outcomes, and forecast trends and probabilities.
3. Accurate data collection is necessary to make informed business decisions, ensure quality assurance, and
maintain research integrity.
4. During data collection, the researchers must identify the data types, the sources of data, and the methods
to be used.
Before an analyst begins collecting data, they must answer three questions first:
• What types of data do they plan to collect?
• What are the sources of that data?
• What methods and procedures will be used to collect, store, and process the information?
Additionally, we can break up data into qualitative and quantitative types. Qualitative data covers descriptions such as
color, size, quality, and appearance. Quantitative data, unsurprisingly, deals with numbers, such as statistics, poll numbers,
percentages, etc.
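As a quick illustration in pandas (a sketch with a made-up frame; the column names are purely illustrative), qualitative columns typically load as object dtype and quantitative columns as numeric dtypes:
import pandas as pd

# hypothetical frame: 'color' is qualitative, 'score' and 'percentage' are quantitative
df = pd.DataFrame({'color': ['red', 'blue'], 'score': [10, 12], 'percentage': [45.5, 54.5]})
print(df.select_dtypes(include='object').columns)   # qualitative (text/categorical) columns
print(df.select_dtypes(include='number').columns)   # quantitative (numeric) columns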
Data Collection Methods
1. Primary Data Collection:
Primary data collection involves the collection of original data directly from the source or
through direct interaction with the respondents. This method allows researchers to
obtain firsthand information specifically tailored to their research objectives. There are
various techniques for primary data collection, including:
a. Interviews: Interviews involve direct interaction between the researcher and the
respondent. They can be conducted in person, over the phone, or through video
conferencing. Interviews can be structured (with predefined questions), semi-structured
(allowing flexibility), or unstructured (more conversational).
b. Observations: Researchers observe and record behaviors, actions, or events in their natural
setting. This method is useful for gathering data on human behavior, interactions, or
phenomena without direct intervention.
c. Focus Groups: Focus groups bring together a small group of individuals who discuss
specific topics in a moderated setting. This method helps in understanding opinions,
perceptions, and experiences shared by the participants.
2. Secondary Data Collection:
Secondary data collection involves using existing data collected by someone else for a purpose different from the
original intent. Researchers analyze and interpret this data to extract relevant information. Secondary data can be
obtained from various sources, including:
a. Published Sources: Researchers refer to books, academic journals, magazines, newspapers, government reports,
and other published materials that contain relevant data.
b. Online Databases: Numerous online databases provide access to a wide range of secondary data, such as research
articles, statistical information, economic data, and social surveys.
c. Government and Institutional Records: Government agencies, research institutions, and organizations often maintain
databases or records that can be used for research purposes.
d. Publicly Available Data: Data shared by individuals, organizations, or communities on public platforms, websites, or
social media can be accessed and utilized for research.
e. Past Research Studies: Previous research studies and their findings can serve as valuable secondary data sources.
Researchers can review and analyze the data to gain insights or build upon existing knowledge.
What are Common Challenges in Data Collection?
There are some prevalent challenges faced while collecting data; let us explore a few of them to understand
them better and learn how to avoid them.
Inconsistent Data
When working with various data sources, it's conceivable that the same information will have discrepancies
between sources. The differences could be in formats, units, or occasionally spellings. The introduction of
inconsistent data might also occur during firm mergers or relocations. Inconsistencies in data have a
tendency to accumulate and reduce the value of data if they are not continually resolved. Organizations that
have heavily focused on data consistency do so because they only want reliable data to support their
analytics.
Data Downtime
Data is the driving force behind the decisions and operations of data-driven businesses. However, there may be brief
periods when their data is unreliable or not ready. Customer complaints and subpar analytical outcomes are only
two ways that this data unavailability can have a significant impact on businesses. A data engineer spends about 80%
of their time updating, maintaining, and guaranteeing the integrity of the data pipeline.
Schema modifications and migration problems are just two examples of the causes of data downtime.
Ambiguous Data
Even with thorough oversight, some errors can still occur in massive databases or data lakes. The issue becomes even
more overwhelming when data is streaming in at high speed. Spelling mistakes can go unnoticed, formatting problems can
occur, and column headings might be misleading. This ambiguous data can cause a number of problems for reporting and analytics.
Duplicate Data
Streaming data, local databases, and cloud data lakes are just a few of the sources of data that modern enterprises must
contend with. They might also have application and system silos. These sources are likely to duplicate and overlap each
other quite a bit. For instance, duplicate contact information has a substantial impact on customer experience. If certain
prospects are ignored while others are engaged repeatedly, marketing campaigns suffer.
Inaccurate Data
For highly regulated industries like healthcare, data accuracy is crucial. Given recent experience, it is more
important than ever to improve data quality for COVID-19 and future pandemics. Inaccurate information does not
provide you with a true picture of the situation and cannot be used to plan the best course of action. Personalized
customer experiences and marketing strategies underperform if your customer data is inaccurate.
Hidden Data
The majority of businesses only utilize a portion of their data, with the remainder sometimes being lost in data silos or
discarded in data graveyards. For instance, the customer service team might not receive client data from sales, missing
an opportunity to build more precise and comprehensive customer profiles. Hidden data also causes organizations to miss
out on opportunities to develop novel products, enhance services, and streamline processes.
What are the Key Steps in the Data Collection Process?
Gather Information
Once our plan is complete, we can put our data collection plan into action and begin gathering data.
import pandas as pd

df = pd.read_csv('diabetes.csv')   # load the dataset
print(df.head())                   # preview the first few rows
print(df.isnull().sum())           # count missing values per column
print(df.describe())               # summary statistics
Handling categorical features
We can take care of categorical features by converting them to integers. There are 2 common ways to do so.
Label Encoding
One Hot Encoding
With Label Encoding, we convert the categorical values into numerical labels.
With One Hot Encoding, we make a new column for each unique categorical value; the value in that column is 1 if that
value is present in the actual data frame row, and 0 otherwise.
We use the pandas built-in function get_dummies to convert categorical values in a dataframe to one-hot vectors.
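As a minimal sketch (using a made-up frame; the column name 'Gender' is purely illustrative), the two approaches look like this:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical example frame; the column names are illustrative
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female'], 'Age': [25, 32, 47]})

# Label Encoding: map each category to an integer label
le = LabelEncoder()
df['Gender_label'] = le.fit_transform(df['Gender'])

# One Hot Encoding: one new 0/1 column per unique category
df_onehot = pd.get_dummies(df, columns=['Gender'])
print(df_onehot)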
Convert the data frame to NumPy
Now that we’ve converted all the data to integers, it’s time to prepare the data for machine learning models.
X = df.drop('predictor_column', axis=1).values   # feature matrix: every column except the target
y = df['predictor_column'].values                # target vector
Split the dataset into train and test data
Now that we're ready with X and y, let's split the data set: we'll allocate 70 percent for training and 30 percent for
testing using scikit-learn's model_selection module.
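A minimal sketch of that split, assuming X and y are the arrays prepared above:
from sklearn.model_selection import train_test_split

# hold out 30% of the rows for testing, 70% for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)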
Feature Scaling
This is the final step of data preprocessing. Feature scaling puts all our data in the same range and on the same scale.
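A short sketch of one common approach (standardization with scikit-learn; X_train and X_test are assumed from the split above):
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training data only, then apply the same scaling to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)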
To work on the data, you can either load the CSV in Excel or in Pandas.
>>> df = pd.read_csv('train.csv')
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
There are 891 total rows, but Age shows only 714 non-null values (which means we're missing some data), Embarked is
missing two rows, and Cabin is missing a lot as well. Object data types are non-numeric, so we have to find a way to
encode them as numerical values.
We'll drop some of the columns that won't contribute much to our machine learning model, starting with Name,
Ticket and Cabin.
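A one-line sketch of that column drop (on the Titanic frame loaded above):
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)   # remove columns that add little predictive signal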
>>> df = df.dropna()
>>> df.info()
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
PassengerId 712 non-null int64
Survived 712 non-null int64
Pclass 712 non-null int64
Sex 712 non-null object
Age 712 non-null float64
SibSp 712 non-null int64
Parch 712 non-null int64
Fare 712 non-null float64
Embarked 712 non-null object
The Problem with Dropping Rows
After dropping rows with missing values, we find the data set is reduced to 712 rows from 891, which means we are
wasting data. Machine learning models need data to train and perform well. So, let’s preserve the data and make
use of it as much as we can.
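One common way to preserve rows (a sketch, assuming the Titanic frame from above) is to fill the missing values instead of dropping them:
# fill numeric gaps with the median and categorical gaps with the most frequent value
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])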
Putting the preprocessing steps together on a sample data set with one categorical feature column:
print(df.head())          # preview the data
print(df.isnull().sum())  # check for missing values
print(df.describe())      # summary statistics

x = df.iloc[:, :-1].values   # feature matrix: all columns except the target
y = df.iloc[:, 3].values     # target vector (the fourth column)
from sklearn.preprocessing import LabelEncoder

LE = LabelEncoder()
x[:, 0] = LE.fit_transform(x[:, 0])   # encode the categorical values in the first feature column
y = LE.fit_transform(y)               # encode the target labels as integers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# one-hot encode the first column and pass the remaining columns through unchanged
transform = ColumnTransformer([("onehot", OneHotEncoder(), [0])], remainder="passthrough")
x = transform.fit_transform(x)
from sklearn.model_selection import train_test_split

# hold out 20% of the rows for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
from sklearn.preprocessing import StandardScaler

SC = StandardScaler()
x_train[:, 3:5] = SC.fit_transform(x_train[:, 3:5])   # scale the numeric feature columns
x_test[:, 3:5] = SC.transform(x_test[:, 3:5])         # apply the same scaling to the test set
Data Cleaning
Python Implementation for Data Cleaning
import pandas as pd
import numpy as np
1. isnull()
2. notnull()
3. dropna()
4. fillna()
5. replace()
1. isnull(): returns True for every NaN value and False for every non-null value.
2. notnull(): returns True for every non-null value and False for every null value.
3. dropna(): drops rows with at least one null value; when a data frame is read, all rows containing any null values are
dropped.
4. fillna(): handles missing values by filling null entries with a specified value or method (a short sketch of these
methods follows this list).
5. replace(): replaces specified values in the data frame.
df.replace(to_replace="Boston", value="Warrior")              # replace a single value
df.replace(to_replace=["Boston", "Texas"], value="Warrior")   # replace multiple values with one value