IDS - Sem Ans Unit 1


2 a) Explain the process of retrieving data in the Data Science process?

A) Retrieving data:
Retrieving data is an essential step in the data
science process as it provides the raw material needed to analyze and derive
insights. There are various ways to retrieve data, and the methods used
depend on the type of data and where it is stored.

Here are some common methods for retrieving data in data science:

➢ File import: Data can be retrieved from files in various formats, such as
CSV, Excel, JSON, or XML. This is a common method used to retrieve data that
is stored locally.

➢ Web scraping: Web scraping involves using scripts to extract data from
websites. This is a useful method for retrieving data that is not readily
available in a structured format.

➢ APIs: Many applications and services provide APIs (Application Programming Interfaces) that allow data to be retrieved programmatically. APIs can be used to retrieve data from social media platforms, weather services, financial data providers, and many other sources.

➢ Databases: Data is often stored in databases, and SQL (Structured Query Language) can be used to retrieve data from databases. Non-relational databases such as MongoDB or Cassandra are also popular for storing and retrieving data.

➢ Big Data platforms: When dealing with large amounts of data, big data platforms such as Hadoop, Spark, or NoSQL databases can be used to retrieve data efficiently.
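Here is a minimal Python sketch of three of these retrieval methods side by side; pandas and requests are assumed to be installed, and the file name, URL and table name are placeholders rather than real sources.

```python
import sqlite3

import pandas as pd
import requests

# File import: read a locally stored CSV file into a DataFrame
# ("sales.csv" is a placeholder file name).
df_file = pd.read_csv("sales.csv")

# API: retrieve JSON data programmatically from a web service
# (the URL below is a placeholder, not a real endpoint).
response = requests.get("https://api.example.com/v1/weather", params={"city": "Hyderabad"})
records = response.json()

# Database: use SQL to pull rows from a relational database
# ("customers" is a hypothetical table in a local SQLite file).
conn = sqlite3.connect("company.db")
df_sql = pd.read_sql_query("SELECT id, name, city FROM customers", conn)
conn.close()

print(df_file.head(), len(records), df_sql.shape)
```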

2 b) Explain the role of big data ecosystems in Data Science. How do they
support large-scale data processing and analysis?

A) The big data ecosystem and data science:


❖ The big data ecosystem and data science are closely related, as the former
provides the infrastructure and tools that enable the latter.
❖ The big data ecosystem refers to the set of technologies, platforms, and
frameworks that are used to store, process, and analyze large volumes of
data.
❖ Some of the key components of the big data ecosystem include:
1. Storage: Big data storage systems such as Hadoop Distributed File
System (HDFS), Apache Cassandra, and Amazon S3 are designed to store and
manage large volumes of data across multiple nodes.
2. Processing: Big data processing frameworks such as Apache Spark, Apache
Flink, and Apache Storm are used to process and analyze large volumes of data
in parallel across distributed computing clusters.
3. Querying: Big data querying systems such as Apache Hive, Apache Pig, and
Apache Drill are used to extract and transform data stored in big data storage
systems.
4. Visualization: Big data visualization tools such as Tableau, D3.js, and
Apache Zeppelin are used to create interactive visualizations and dashboards
that enable data scientists and business analysts to explore and understand
data.
5. Machine learning: Big data machine learning platforms such as Apache
Mahout, TensorFlow, and Microsoft Azure Machine Learning are used to build and
deploy machine learning models at scale.
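As a hedged illustration of the processing and querying components, the PySpark sketch below reads a hypothetical events.csv file and aggregates it in parallel; the file and column names are invented for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a Spark session; on a cluster this work is distributed across nodes.
spark = SparkSession.builder.appName("big-data-demo").getOrCreate()

# Read a large dataset in parallel ("events.csv" and its columns are hypothetical).
events = spark.read.csv("events.csv", header=True, inferSchema=True)

# Process and query at scale: aggregate event counts and average duration per user.
summary = (events.groupBy("user_id")
                 .agg(F.count("*").alias("events"),
                      F.avg("duration").alias("avg_duration")))

summary.show(10)
spark.stop()
```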

3 a) What are the common challenges faced during data cleansing, and how can
they be addressed?

A) Data Cleansing:
This step involves identifying and correcting or removing any errors,
inconsistencies, or missing values in the data. Some common techniques used
for data cleansing include removing duplicates, filling in missing values,
correcting spelling errors, and dealing with outliers.
➢ Removing Errors in Data Cleaning:
1. Identify Errors: Look for issues such as missing values, duplicates, and
incorrect data types.

2. Correct Errors:

o Fix Missing Values: Impute or remove missing data.

o Remove Duplicates: Eliminate duplicate records.

o Correct Data Types: Convert data to appropriate formats.

➢ Inconsistencies in Data Cleaning:

1. Identify Inconsistencies: Look for data that does not conform to expected formats or rules, such as mismatched date formats or inconsistent categorical values.

2. Resolve Inconsistencies:

o Standardize Formats: Convert data to a consistent format (e.g., dates in YYYY-MM-DD).

o Normalize Values: Ensure uniformity in categorical data (e.g., “Male” and “male” should be standardized).
➢ Handling Missing Values in Data Cleaning:

1. Identify Missing Values: Detect missing entries using methods like null checks or visual inspection.

2. Handle Missing Values:

o Imputation: Replace missing values with statistical measures such as mean, median, or mode.

o Deletion: Remove rows or columns with excessive missing values if they are not essential.
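A short pandas sketch of these cleansing steps is shown below; the toy DataFrame and its columns are invented purely for illustration.

```python
import pandas as pd

# Toy data containing duplicates, missing values, inconsistent categories
# and a wrongly typed numeric column.
df = pd.DataFrame({
    "name":   ["Asha", "Ravi", "Ravi", "Meena", None],
    "gender": ["Female", "male", "male", "FEMALE", "Male"],
    "age":    ["23", "31", "31", None, "28"],
})

# Remove duplicates.
df = df.drop_duplicates()

# Correct data types: convert age from string to numeric.
df["age"] = pd.to_numeric(df["age"], errors="coerce")

# Handle missing values: impute age with the median, drop rows missing a name.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["name"])

# Resolve inconsistencies: standardize categorical values ("Female"/"FEMALE" -> "female").
df["gender"] = df["gender"].str.strip().str.lower()

print(df)
```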


3 b) List and briefly explain the facets of data that are crucial
in data science. How does each facet impact data analysis?
A) Facets of Data
• A very large amount of data is generated in big data and data science. This data comes in various types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images

Structured Data
• Structured data is arranged in a row and column format. This helps applications retrieve and process data easily. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it
is organized in a structure. The most common form of structured data
or records is a database where specific information is stored based on a
methodology of columns and rows.
• Structured data is also searchable by data type within content.
Structured data is understood by computers and is also efficiently
organized for human readers.
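For illustration, the small sketch below stores and retrieves structured data through a database management system (SQLite); the students table and its rows are hypothetical.

```python
import sqlite3

# Structured data: fixed columns and rows managed by a DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (roll_no INTEGER, name TEXT, marks REAL)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)",
                 [(1, "Asha", 87.5), (2, "Ravi", 91.0)])

# Because the structure is known, retrieval by column and condition is easy.
for row in conn.execute("SELECT name, marks FROM students WHERE marks > 88"):
    print(row)

conn.close()
```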

Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve required information. Unstructured data has no identifiable structure.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.

Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize
characters, words and sentences, then apply meaning and
understanding to that information. This helps machines to understand
language as humans do.
• Natural language processing is the driving force behind machine
intelligence in many modern real-world applications. The natural
language processing community has had success in entity recognition,
topic recognition, summarization, text completion and sentiment
analysis.
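The self-contained sketch below illustrates, in a very simplified way, the kind of processing natural language work builds on: tokenizing text, counting words, and scoring sentiment against a tiny hand-made lexicon. Real applications would use dedicated NLP libraries.

```python
from collections import Counter

text = "The product is great. The delivery was slow, but the support team was great."

# Tokenize: split the unstructured text into lowercase words.
words = [w.strip(".,!?").lower() for w in text.split()]

# Count word frequencies (a building block of topic and entity analysis).
freq = Counter(words)
print(freq.most_common(3))

# Naive sentiment: compare counts of hand-picked positive/negative words.
positive, negative = {"great", "good", "fast"}, {"slow", "bad", "poor"}
score = sum(w in positive for w in words) - sum(w in negative for w in words)
print("sentiment score:", score)   # > 0 means broadly positive
```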

Machine - Generated Data


• Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end-user is not considered to be machine-generated.
• Machine data contains a definitive record of all activity and behavior of our customers, users, transactions, applications, servers, networks, factory machinery and so on.
• It's configuration data, data from APIs and message queues, change
events, the output of diagnostic commands and call detail records,
sensor data from remote equipment and more.

Graph-based or Network Data


• Graphs are data structures to describe relationships and interactions
between entities in complex systems. In general, a graph contains a
collection of entities called nodes and another collection of interactions
between a pair of nodes called edges.
• Nodes represent entities, which can be of any object type that is
relevant to our problem domain. By connecting nodes with edges, we
will end up with a graph (network) of nodes.
• A graph database stores nodes and relationships instead of tables or
documents. Data is stored just like we might sketch ideas on a
whiteboard. Our data is stored without restricting it to a predefined
model, allowing a very flexible way of thinking about and using it.
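A minimal sketch of graph-based data using the networkx library (an assumed choice, not prescribed above); the people and relationships are invented.

```python
import networkx as nx

# Nodes are entities (people); edges are interactions between pairs of nodes.
G = nx.Graph()
G.add_edge("Asha", "Ravi", relation="follows")
G.add_edge("Ravi", "Meena", relation="follows")
G.add_edge("Asha", "Meena", relation="friends")

# Typical graph queries: who is connected to whom, and how central is each node?
print(list(G.neighbors("Ravi")))          # ['Asha', 'Meena']
print(nx.shortest_path(G, "Asha", "Meena"))
print(nx.degree_centrality(G))
```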

Audio, Image and Video


• Audio, image and video are data types that pose specific challenges
to a data scientist. Tasks that are trivial for humans, such as
recognizing objects in pictures, turn out to be challenging for
computers.
• The terms audio and video commonly refer to the time-based media storage formats for sound/music and moving picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use cases.
• It is important to remark that multimedia data is one of the most
important sources of information and knowledge; the integration,
transformation and indexing of multimedia data bring significant
challenges in data management and analysis. Many challenges have to
be addressed including big data, multidisciplinary nature of Data
Science and heterogeneity.

Streaming Data
• Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of kilobytes).
• Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, ecommerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services and telemetry from connected devices or instrumentation in data centers.
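The simplified sketch below simulates streaming data: small records arrive one at a time and are processed incrementally with a running aggregate. Production systems would typically use a streaming platform rather than a Python generator.

```python
import random
import time

def sensor_stream(n=5):
    """Simulate a stream of small telemetry records sent one at a time."""
    for _ in range(n):
        yield {"device": "sensor-1", "reading": round(random.uniform(20, 30), 2)}
        time.sleep(0.1)  # records arrive continuously, not as one batch

# Process the stream incrementally: keep a running average without storing it all.
count, total = 0, 0.0
for record in sensor_stream():
    count += 1
    total += record["reading"]
    print(f"record {count}: {record['reading']}, running average = {total / count:.2f}")
```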

Model paper 2

Unit 1

2 a) What are the benefits of presenting findings effectively in a Data Science project?

A) Presenting Findings and Building Applications
• The team delivers final reports, briefings, code and technical documents.
• In addition, the team may run a pilot project to implement the models in a production environment. The last stage of the data science process is where the user's soft skills will be most useful.
• This stage involves presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.

Benefits of Presenting Findings Effectively in a Data Science Project:

1. Enhanced Understanding:

o Clarity: Well-presented findings help stakeholders easily understand complex data insights and trends.

o Impact: Clear visualizations and summaries make it easier to grasp key takeaways, leading to better-informed decisions.

2. Improved Decision-Making:

o Actionable Insights: Effective presentation highlights actionable insights, allowing stakeholders to make data-driven decisions promptly.

o Persuasion: Convincing presentations can drive stakeholder buy-in and support for recommended actions or strategies.

3. Increased Engagement:

o Interest: Engaging visuals and narratives capture attention, making it more likely that stakeholders will engage with the findings.

o Communication: Clear communication facilitates better discussions and collaborative problem-solving among team members.

4. Credibility and Professionalism:

o Trust: Professional and well-organized presentations enhance the credibility of the findings and the data science team.

o Reputation: Effective presentation reflects well on the data science team, showcasing their ability to translate data into valuable insights.

2 b) Given a business problem, outline the process of defining goals and creating
a project charter in a Data Science project. Why is this step critical?

A) Defining research goals and creating a project charter:


Defining research goals and creating a project charter are important initial steps
in any data science project, as they set the stage for the entire project and help
ensure that it stays focused and on track.

Here are some key considerations for defining research goals and
creating a project charter in data science:

❖ Identify the problem or question you want to answer: What is the business
problem or research question that you are trying to solve? It's important to
clearly define the problem or question at the outset of the project, so that
everyone involved is on the same page and working towards the same goal.

❖ Define the scope of the project: Once you have identified the problem or
question, you need to define the scope of the project. This includes specifying
the data sources you will be using, the variables you will be analyzing, and the
timeframe for the project.

❖ Determine the project objectives: What do you hope to achieve with the
project? What are your key performance indicators (KPIs)? This will help you
measure the success of the project and determine whether you have achieved
your goals.

❖ Identify the stakeholders: Who are the key stakeholders in the project? This
could include business leaders, data analysts, data scientists, and other team
members. It's important to identify all the stakeholders upfront so that everyone
is aware of their role in the project and can work together effectively.

❖ Create a project charter: The project charter is a document that summarizes the key information about the project, including the problem or question, the scope of the project, the objectives, the stakeholders, and any constraints or risks. It's a critical document that helps ensure everyone involved in the project is on the same page and understands what is expected of them.

3 a) Describe the key steps involved in the Data Science process and explain the importance of each step.
A) The data science process:
The data science process typically involves the following steps:
1. Define the problem: The first step in the data science process is to define the
problem that you want to solve. This involves identifying the business or
research question that you want to answer and determining what data you need
to collect.
2. Collect and clean the data: Once you have identified the data that you need,
you will need to collect and clean the data to ensure that it is accurate and
complete. This involves checking for errors, missing values, and inconsistencies.
3. Explore and visualize the data: After you have collected and cleaned the data,
the next step is to explore and visualize the data. This involves creating
summary statistics, visualizations, and other descriptive analyses to better
understand the data.
4. Prepare the data: Once you have explored the data, you will need to prepare
the data for analysis. This involves transforming and manipulating the data,
creating new variables, and selecting relevant features.
5. Build the model: With the data prepared, the next step is to build a model that
can answer the business or research question that you identified in step one.
This involves selecting an appropriate algorithm, training the model, and
evaluating its performance.
6. Evaluate the model: Once you have built the model, you will need to evaluate
its performance to ensure that it is accurate and effective. This involves using
metrics such as accuracy, precision, recall, and F1 score to assess the model's
performance.
7. Deploy the model: After you have evaluated the model, the final step is to
deploy the model in a production environment. This involves integrating the
model into an application or workflow and ensuring that it can handle real-world
data and user inputs.
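A compact scikit-learn sketch of steps 5 and 6 (building and evaluating a model) on a built-in dataset is given below; the choice of logistic regression and the dataset are illustrative assumptions only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split

# Prepared data: features X and target y from a built-in binary dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the model: select an algorithm and train it.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Evaluate the model with accuracy, precision, recall and F1 score.
pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
print("recall   :", recall_score(y_test, pred))
print("f1 score :", f1_score(y_test, pred))
```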

3 b) Compare and contrast exploratory data analysis (EDA) with model building in
the Data Science process.

A) Exploratory data analysis:

Exploratory data analysis (EDA) is the process of analyzing and summarizing data sets in order to gain insights and identify patterns.

The main goal of EDA is to understand the data, rather than to test a
particular hypothesis. The process typically involves visualizing the data using
graphs, charts, and tables, as well as calculating summary statistics such as
mean, median, and standard deviation.

Some common techniques used in EDA include:


❖ Descriptive statistics: This involves calculating summary statistics such as
mean, median, mode, standard deviation, and range.
❖ Data visualization: This involves creating graphs, charts, and other visual
representations of the data, such as histograms, scatter plots, and box plots.
❖ Data transformation: This involves transforming the data to make it easier to
analyze, such as normalizing or standardizing the data, or log transforming
skewed data.
❖ Outlier detection: This involves identifying and analyzing data points that are
significantly different from the other data points.
❖ Correlation analysis: This involves examining the relationship between
different variables in the data set, such as calculating correlation coefficients or
creating correlation matrices.

Overall, EDA is an important step in any data analysis project, as it helps to identify any patterns, outliers, or other trends in the data that may be relevant to the analysis. It also helps to ensure that the data is clean, complete, and ready for further analysis. In contrast, model building comes later in the data science process: rather than openly exploring the data, it selects and trains an algorithm to answer the specific business or research question and evaluates it with metrics such as accuracy, precision, recall, and F1 score. EDA informs model building by revealing which features, transformations, and data issues matter.
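A brief pandas sketch of the EDA techniques listed above (descriptive statistics, transformation, outlier detection, correlation and visualization) on a small synthetic dataset:

```python
import numpy as np
import pandas as pd

# Small synthetic dataset for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=0.5, size=200),   # right-skewed
    "age":    rng.integers(18, 65, size=200),
})

# Descriptive statistics: mean, standard deviation, quartiles, range.
print(df.describe())

# Data transformation: log-transform the skewed income column.
df["log_income"] = np.log(df["income"])

# Outlier detection: flag values more than 3 standard deviations from the mean.
z = (df["income"] - df["income"].mean()) / df["income"].std()
print("outliers:", int((z.abs() > 3).sum()))

# Correlation analysis: correlation matrix between the numeric variables.
print(df.corr())

# Data visualization: histograms and a scatter plot (requires matplotlib).
df[["income", "log_income"]].hist(bins=30)
df.plot.scatter(x="age", y="income")
```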
