IDS - Sem Ans Unit 1
process?
A) Retrieving data:
Retrieving data is an essential step in the data
science process as it provides the raw material needed to analyze and derive
insights. There are various ways to retrieve data, and the methods used
depend on the type of data and where it is stored.
Here are some common methods for retrieving data in data science:
➢ File import: Data can be retrieved from files in various formats, such as
CSV, Excel, JSON, or XML. This is a common method used to retrieve data that
is stored locally.
➢ Web scraping: Web scraping involves using scripts to extract data from
websites. This is a useful method for retrieving data that is not readily
available in a structured format.
➢ Big Data platforms: When dealing with large amounts of data, big data
platforms such as Hadoop, Spark, or NoSQL databases can be used to retrieve
data efficiently.
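The file import method above can be sketched as follows. This is a minimal illustration using only Python's standard library; the sample data and the helper names `load_csv` and `load_json` are assumptions made for the example, and in-memory text stands in for real files on disk.

```python
import csv
import io
import json

# Hypothetical sample data; in practice these would be files stored locally.
CSV_TEXT = "name,age\nAsha,31\nRavi,27\n"
JSON_TEXT = '[{"name": "Asha", "age": 31}, {"name": "Ravi", "age": 27}]'


def load_csv(source):
    """Read CSV text from a file-like object into a list of dicts."""
    return list(csv.DictReader(source))


def load_json(source):
    """Read JSON from a file-like object into Python objects."""
    return json.load(source)


# io.StringIO behaves like an open text file, so the same functions
# would also accept the result of open("data.csv", newline="").
csv_rows = load_csv(io.StringIO(CSV_TEXT))
json_rows = load_json(io.StringIO(JSON_TEXT))
```

Note that the CSV reader returns every value as a string, while JSON keeps numbers as numbers. Type differences like this are usually handled in the data cleansing step.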
2 b) Explain the role of big data ecosystems in Data Science. How do they
support large-scale data processing and analysis?
3 a) What are the common challenges faced during data cleansing, and how can
they be addressed?
A) Data Cleansing:
This step involves identifying and correcting or removing any errors,
inconsistencies, or missing values in the data. Some common techniques used
for data cleansing include removing duplicates, filling in missing values,
correcting spelling errors, and dealing with outliers.
➢ Steps to remove errors during data cleansing:
1. Identify Errors: Look for issues such as missing values, duplicates, and
incorrect data types.
2. Correct Errors:
3. Resolve Inconsistencies:
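A few of the cleansing techniques named above (removing duplicates, correcting data types, filling in missing values) can be sketched in plain Python. The records, the `clean` function, and the choice of the median as the fill value are all assumptions made for illustration; other fill strategies are equally valid.

```python
from statistics import median

# Hypothetical raw records: one duplicate id, one missing age,
# and ages stored as strings instead of numbers.
raw = [
    {"id": 1, "age": "34"},
    {"id": 2, "age": None},
    {"id": 1, "age": "34"},
    {"id": 3, "age": "29"},
]


def clean(records):
    """Return de-duplicated records with numeric, non-missing ages."""
    # Remove duplicates by id, keeping the first occurrence.
    seen = set()
    unique = []
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            unique.append(dict(rec))  # copy so the input is not modified

    # Correct the data type: convert age strings to integers.
    for rec in unique:
        if rec["age"] is not None:
            rec["age"] = int(rec["age"])

    # Fill in missing ages with the median of the known ages.
    known = [rec["age"] for rec in unique if rec["age"] is not None]
    fill_value = median(known)
    for rec in unique:
        if rec["age"] is None:
            rec["age"] = fill_value
    return unique


cleaned = clean(raw)
```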
3 b) List and briefly explain the facets of data that are crucial
in data science. How does each facet impact data analysis?
A) Facets of Data
• Big data and data science generate very large amounts of data.
This data comes in various types; the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in rows and columns. This helps
applications retrieve and process the data easily. A database management
system is used to store structured data.
• The term structured data refers to data that is identifiable because it
is organized in a structure. The most common form of structured data
or records is a database where specific information is stored based on a
methodology of columns and rows.
• Structured data is also searchable by data type within content.
Structured data is understood by computers and is also efficiently
organized for human readers.
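As a small illustration of structured data held in a database management system, the sketch below uses Python's built-in `sqlite3` module. The `customers` table, its columns, and the sample rows are hypothetical and chosen only for the example.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table definition fixes the columns and their types in advance,
# which is what makes the data "structured".
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ravi", "Delhi"), (3, "Meena", "Pune")],
)

# Because every row has the same columns, retrieval is a simple query.
pune_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY id", ("Pune",)
    )
]
conn.close()
```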
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows
and columns are not used for unstructured data, so it is difficult
to retrieve the required information. Unstructured data has no identifiable
structure.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions or sequence for
unstructured data.
5. Since there is no structural binding for
unstructured data, it is unpredictable in nature.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize
characters, words and sentences, then apply meaning and
understanding to that information. This helps machines to understand
language as humans do.
• Natural language processing is the driving force behind machine
intelligence in many modern real-world applications. The natural
language processing community has had success in entity recognition,
topic recognition, summarization, text completion and sentiment
analysis.
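The first step described above, recognizing sentences and words in raw text, can be sketched very roughly with regular expressions. This is a naive illustration and not how production NLP libraries work; the `sentences` and `words` helpers and the sample text are assumptions made for the example.

```python
import re
from collections import Counter


def sentences(text):
    """Split text into sentences at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [part.strip() for part in parts if part.strip()]


def words(text):
    """Return lowercase words, ignoring punctuation and digits."""
    return re.findall(r"[a-z']+", text.lower())


sample = "Data science is fun. Data cleaning takes time!"
sentence_list = sentences(sample)
word_counts = Counter(words(sample))
```

Word counts like these are a common starting point for tasks such as topic recognition, although real systems use far richer models of meaning.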
Streaming Data
Streaming data is data that is generated continuously by thousands of
data sources, which typically send in the data records simultaneously
and in small sizes (order of Kilobytes).
• Streaming data includes a wide variety of data such as log files
generated by customers using your mobile or web applications,
ecommerce purchases, in-game player activity, information from social
networks, financial trading floors or geospatial services and telemetry
from connected devices or instrumentation in data centers.
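Because streaming records arrive continuously, they are usually processed one at a time instead of being loaded all at once. The sketch below imitates this with Python generators; `sensor_stream` and `running_mean` are hypothetical names, and a short list stands in for a real, unbounded source.

```python
def sensor_stream(readings):
    """Simulate a data source that emits one small record at a time."""
    for value in readings:
        yield {"value": value}


def running_mean(stream):
    """Yield the mean of all values seen so far, updated per record."""
    count = 0
    total = 0.0
    for record in stream:
        count += 1
        total += record["value"]
        # Only the count and total are kept, never the full history.
        yield total / count


means = list(running_mean(sensor_stream([10, 20, 30])))
```

Keeping only a running count and total means memory use stays constant no matter how many records arrive.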
Model paper 2
Unit 1
2. Improved Decision-Making:
3. Increased Engagement:
2 b) Given a business problem, outline the process of defining goals and creating
a project charter in a Data Science project. Why is this step critical?
Here are some key considerations for defining research goals and
creating a project charter in data science:
❖ Identify the problem or question you want to answer: What is the business
problem or research question that you are trying to solve? It's important to
clearly define the problem or question at the outset of the project, so that
everyone involved is on the same page and working towards the same goal.
❖ Define the scope of the project: Once you have identified the problem or
question, you need to define the scope of the project. This includes specifying
the data sources you will be using, the variables you will be analyzing, and the
timeframe for the project.
❖ Determine the project objectives: What do you hope to achieve with the
project? What are your key performance indicators (KPIs)? This will help you
measure the success of the project and determine whether you have achieved
your goals.
❖ Identify the stakeholders: Who are the key stakeholders in the project? This
could include business leaders, data analysts, data scientists, and other team
members. It's important to identify all the stakeholders upfront so that everyone
is aware of their role in the project and can work together effectively.
3 b) Compare and contrast exploratory data analysis (EDA) with model building in
the Data Science process.
The main goal of EDA is to understand the data, rather than to test a
particular hypothesis. The process typically involves visualizing the data using
graphs, charts, and tables, as well as calculating summary statistics such as
mean, median, and standard deviation.
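The summary statistics mentioned above can be computed with Python's built-in `statistics` module. The values below are made up for illustration, and `stdev` computes the sample standard deviation.

```python
from statistics import mean, median, stdev

# Hypothetical measurements for a single numeric variable.
values = [12, 15, 11, 19, 13]

summary = {
    "mean": mean(values),      # arithmetic average
    "median": median(values),  # middle value once sorted
    "stdev": stdev(values),    # sample standard deviation
}
```

Here the mean (14) is higher than the median (13) because the value 19 pulls the average upward. Noticing this kind of skew is a typical EDA observation made before any model is built.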