Data Science

Q.1 Discuss the Cross-Industry Standard Process for Data Mining (CRISP-DM).
The Cross-Industry Standard Process for Data Mining (CRISP-DM) is a widely
used framework for data mining and analytics projects. It provides a structured
approach to planning, executing, and delivering data mining projects. CRISP-DM
is a flexible and iterative framework, allowing for feedback loops and
revisitation of previous phases as needed. It helps ensure a structured approach
to data mining projects, increasing the likelihood of success and minimizing the
risk of project failure. The CRISP-DM framework consists of six phases:
Business Understanding: Define project objectives, identify business needs,
and determine the problem scope.
Data Understanding: Collect, explore, and analyze data to understand its
quality, relevance, and relationships.
Data Preparation: Clean, transform, and prepare data for modeling.
Modeling: Apply data mining techniques to develop models that address the
business objectives.
Evaluation: Assess the performance and quality of the developed models.
Deployment: Implement the models in the production environment and
monitor their performance.

Explain any five data processing tools in data science technology.


The next step involves processing tools to transform your data lakes into data
vaults and then into data warehouses. These tools are the workhorses of the
data science and engineering ecosystem.
Spark:- Apache Spark is an open source cluster computing framework. Spark
offers an interface for programming distributed clusters with implicit data
parallelism and fault-tolerance.
Spark Core:- Spark Core is the foundation of the overall Spark platform. It
provides distributed task dispatching, scheduling, and basic I/O functionalities.
Spark SQL:- Spark SQL is a component on top of Spark Core that presents
a data abstraction called DataFrames.
Spark Streaming:- Spark Streaming leverages Spark Core’s fast scheduling
capability to perform streaming analytics. Spark Streaming has built-in support
to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.
Cassandra:- Apache Cassandra is a large-scale distributed database
supporting multi–data center replication for availability, durability, and
performance.
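As a rough illustration (a minimal sketch, assuming a local PySpark installation; the table and column names below are hypothetical), Spark SQL's DataFrame abstraction can be used like this:

    # Minimal PySpark sketch: build a DataFrame and query it with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ToolsDemo").getOrCreate()

    rows = [("omkar", 120.0), ("asha", 75.5), ("ravi", 210.25)]
    df = spark.createDataFrame(rows, ["customer", "amount"])

    df.createOrReplaceTempView("sales")   # expose the DataFrame to Spark SQL
    spark.sql("SELECT customer, amount FROM sales WHERE amount > 100").show()

    spark.stop()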

Compare Schema-on-Write and Schema-on-Read storage methodologies.
Schema-on-Write (SoW) in data science refers to the traditional approach of
defining the schema (structure) of the data before writing it to a database or
storage system. This approach is commonly used in relational databases, where
the schema is defined beforehand to ensure data consistency and quality. In
SoW, the data is processed and transformed to fit the predefined schema,
which includes:
Data cleaning and preprocessing
Data transformation and formatting
Data quality checks
Data insertion into the database

Schema-on-Read (SoR) in data science is an approach where the schema
(structure) of the data is defined at query time, rather than before writing the
data to a database or storage system. This approach is commonly used in big
data, NoSQL databases, and data lakes, where data is often unstructured, semi-
structured, or constantly changing. In SoR, the data is stored in its raw form, and
the schema is defined dynamically when the data is queried. This approach
offers:
Flexibility and adaptability to changing data structures
Ability to handle unstructured and semi-structured data
Support for real-time data processing and analytics
Reduced data transformation and processing costs
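To make the contrast concrete, here is a minimal sketch in Python (using the standard sqlite3 and json modules; the table and field names are illustrative assumptions): Schema-on-Write fixes the table structure before any insert, while Schema-on-Read stores raw records and interprets their structure only when queried.

    # Contrast sketch: Schema-on-Write vs Schema-on-Read (illustrative names only).
    import json
    import sqlite3

    # Schema-on-Write: the structure is fixed before the data is written.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    con.execute("INSERT INTO customer (name, city) VALUES (?, ?)", ("Asha", "Pune"))

    # Schema-on-Read: raw records are stored as-is; structure is applied at query time.
    raw_records = ['{"name": "Ravi", "city": "Mumbai", "tags": ["vip"]}',
                   '{"name": "Meera", "purchases": 3}']       # record shapes may differ
    for rec in raw_records:
        doc = json.loads(rec)                                 # schema decided here
        print(doc.get("name"), doc.get("city", "unknown"))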

Q.3 What is a data swamp? Explain the steps to avoid a data swamp.
A data swamp is a situation where a data lake, a repository for storing raw,
unprocessed data, becomes unmanageable, disorganized, and difficult to
navigate, making it hard to extract valuable insights. This occurs when data is
ingested without a clear purpose, governance, or quality control. To avoid a
data swamp:
Define a clear purpose: Establish a specific use case or business objective for
the data lake.
Implement data governance: Set policies, standards, and roles for data
management and access.
Ensure data quality: Validate, cleanse, and transform data upon ingestion.
Organize and catalog data: Use metadata management and data cataloging
tools to label and categorize data.
Use data schema and standards: Apply consistent data structures and formats.
Monitor and audit data: Regularly check data quality, usage, and access.
Provide training and support: Educate users on data lake usage and best
practices.
Continuously refine and improve: Regularly review and optimize data lake
operations.

Q.2 Write a short note on the six supersteps for processing the data.
1. Retrieve: This superstep contains all the processing chains for retrieving
data from the raw data lake into a more structured format.
2. Assess: This superstep contains all the processing chains for quality
assurance and additional data enhancements.
3. Process: This superstep contains all the processing chains for building the
data vault.
4. Transform: This superstep contains all the processing chains for building the
data warehouse.
5. Organize: This superstep contains all the processing chains for building the
data marts.
6. Report: This superstep contains all the processing chains for building the
virtualization and reporting of the actionable knowledge.
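A minimal sketch of how these supersteps might be chained in practice (the function names and return values are hypothetical placeholders, not from any specific framework):

    # Illustrative pipeline skeleton for the six supersteps (hypothetical functions).
    def retrieve():   return ["raw record 1", "raw record 2"]   # pull from the data lake
    def assess(d):    return [r for r in d if r]                # basic quality checks
    def process(d):   return {"vault": d}                       # build the data vault
    def transform(v): return {"warehouse": v["vault"]}          # build the data warehouse
    def organize(w):  return {"mart": w["warehouse"]}           # build the data marts
    def report(m):    print("Report:", m["mart"])               # publish actionable knowledge

    report(organize(transform(process(assess(retrieve())))))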
Explain the operational management layer.
The operational management layer is the core store for the data science
ecosystem’s complete processing capability. The layer stores every processing
schedule and workflow for the all-inclusive ecosystem.
1. Processing-Stream Definition and Management:- The processing-
stream definitions are the building blocks of the data science ecosystem.
2. Parameters:- The parameters for the processing are stored in this section,
to ensure a single location for all the system parameters.
3. Scheduling :- The scheduling plan is stored in this section, to enable central
control and visibility of the complete scheduling plan for the system.
4. Monitoring:- The central monitoring process is in this section to ensure that
there is a single view of the complete system.
5. Communication:- All communication from the system is handled in this one
section, to ensure that the system can communicate any activities that are
happening.
6. Alerting:- The alerting section uses communications to inform the correct
person, at the correct time, about the correct status of the complete system.
Explain the utility layer.
The utility layer is used to store repeatable practical methods of data science.
Utilities are the common and verified workhorses of the data science
ecosystem. The utility layer is a central storehouse for keeping all of one's
solution utilities in one place. Having a central store for all utilities ensures
that you do not use out-of-date or duplicate algorithms in your solutions. The
most important benefit is that you can use stable algorithms across your
solutions.
Explain the fundamental steps of the data science process.
It consists of several structures, as follows:
• Data schemas and data formats: Functional data schemas and data
formats deploy onto the data lake’s raw data, to perform the required schema-
on-query via the functional layer.
• Data models: These form the basis for future processing to enhance the
processing capabilities of the data lake, by storing already processed data
sources for future use by other processes against the data lake.
• Processing algorithms: The functional processing is performed via a series
of well-designed algorithms across the processing chain.
• Provisioning of infrastructure: The functional infrastructure provision
enables the framework to add processing capability to the ecosystem, using
technology such as Apache Mesos, which enables the dynamic provisioning of
processing work cells.
Explain the Assess superstep. What are the different ways to handle
errors in the Assess superstep?
Assessing data quality is crucial in data science to ensure reliable insights and
decisions. Here's a simplified approach to handle invalid or erroneous data
values:
1. Accept the Error: If the error is minor (e.g., "West Street" instead of "West
St."), you can accept it and move on. However, this may affect certain data
science techniques.
2. Reject the Error: If the data is severely damaged, it's best to delete it to
maintain data integrity. This should be a last resort.
3. Correct the Error: This is the preferred option. Methodically correct errors,
such as spelling mistakes in names, addresses, and locations.
4. Create a Default Value: If no value is entered, a default value can be
assigned. This is commonly used in business systems.
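A minimal pandas sketch of these four options (the column names and replacement values are illustrative assumptions):

    # Illustrative pandas sketch of the four error-handling options (assumed columns).
    import pandas as pd

    df = pd.DataFrame({"name": ["Asha", None, "Ravi"],
                       "street": ["West Street", "Main St.", None]})

    accepted  = df.copy()                                   # 1. Accept: keep "West Street" as-is
    rejected  = df.dropna(subset=["name"])                  # 2. Reject: drop severely damaged rows
    corrected = df.replace({"street": {"West Street": "West St."}})   # 3. Correct known errors
    defaulted = df.fillna({"street": "Unknown"})            # 4. Default value for missing entries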
Q.4 Write a short note on Isolation Forest.
Outlier detection in high-dimensional data can be done efficiently using an
ensemble of random trees. The ensemble.IsolationForest tool in scikit-learn works by:
1. Randomly selecting a feature and a split value within its range.
2. Repeating this process recursively, creating a tree-like structure.
3. Measuring the path length from the root to the terminating node for each
sample.
4. Averaging this path length over multiple trees.
Normal data points have longer path lengths, while anomalies have shorter
ones. By combining the results from multiple trees, the tool identifies samples
with consistently shorter path lengths as outliers.
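A minimal scikit-learn sketch (the toy data and parameter values are illustrative, not prescriptive):

    # Minimal Isolation Forest sketch with scikit-learn (illustrative parameters).
    import numpy as np
    from sklearn.ensemble import IsolationForest

    rng = np.random.RandomState(42)
    X = np.vstack([rng.normal(0, 1, size=(200, 2)),     # normal points
                   rng.uniform(6, 8, size=(5, 2))])     # a few obvious outliers

    clf = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
    labels = clf.fit_predict(X)                          # -1 = outlier, 1 = inlier
    print("Detected outliers:", np.where(labels == -1)[0])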
What is feature engineering? Describe any feature extraction
technique.
Feature engineering is a crucial step in preparing data for analysis. Feature
engineering is like uncovering hidden treasures in your data lake. You use
techniques to extract valuable features (characteristics) from your data, making
it easier to analyze and gain insights.
To do this:
1. Identify important data characteristics (features) in your data lake.
2. Use techniques like extraction, transformation, and creation to prepare these
features.
3. Document each step in the data transformation matrix and data lineage, so
you can track how your data is transformed and ensure transparency.
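As one example of a feature extraction technique, the sketch below pulls date-based features out of a raw timestamp column with pandas (the column names and values are hypothetical):

    # Feature extraction sketch: derive date features from a raw timestamp (assumed columns).
    import pandas as pd

    sales = pd.DataFrame({"order_ts": ["2023-01-15 09:30", "2023-06-02 18:05"],
                          "amount": [120.0, 75.5]})
    sales["order_ts"] = pd.to_datetime(sales["order_ts"])

    sales["order_month"]   = sales["order_ts"].dt.month         # seasonal signal
    sales["order_weekday"] = sales["order_ts"].dt.dayofweek     # weekday vs weekend behaviour
    sales["is_evening"]    = (sales["order_ts"].dt.hour >= 17)  # created (derived) feature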
What is cross validation? Explain Leave-One-Out cross validation.
Cross-validation is a way to test how well a statistical model will perform on
new, unseen data. It helps ensure that the model is not too tailored to the
training data and can generalize well to real-world applications.
There are three main methods:
1. Validation Set Approach: Split data into two parts - training and
validation. Train the model on one part and test it on the other. This method can
be variable and may not perform well with smaller datasets.
2. Leave-One-Out Cross-Validation (LOOCV): Use one observation as the
validation set and the rest as the training set. Repeat this process for each
observation. This method is more accurate but can be computationally
expensive.
3. k-Fold Cross-Validation: Divide data into k groups (folds). Use one fold as
the validation set and the rest as the training set. Repeat this process k times.
This method is a good balance between accuracy and computational efficiency.
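A minimal LOOCV sketch with scikit-learn (the toy data and choice of model are assumptions for illustration):

    # Leave-One-Out cross-validation sketch with scikit-learn (toy data, illustrative model).
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    X = np.arange(10).reshape(-1, 1)
    y = 2.0 * X.ravel() + np.random.RandomState(0).normal(scale=0.5, size=10)

    scores = cross_val_score(LinearRegression(), X, y,
                             cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")  # one fold per observation
    print("Mean MSE over", len(scores), "folds:", -scores.mean())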
Explain the Time-Person-Object-Location-Event (T-P-O-L-E) design
principle in Data Vault.
The Time-Person-Object-Location-Event (T-P-O-L-E) design principle is a core
concept in Data Vault, a data modeling methodology. It represents the five
essential elements for capturing and storing data in a scalable and flexible way.
Here's a breakdown of each element:
Time (T): Represents the temporal aspect of data, including dates, times, and
timestamps.
Person (P): Refers to individuals, organizations, or entities involved in a
transaction or event.
Object (O): Represents the things or items being manipulated, such as
products, assets, or documents.
Location (L): Captures the physical or logical places where events occur or
data is stored.
Event (E): Describes the actions, transactions, or changes that occur, such as
purchases, updates, or movements.
By using the T-P-O-L-E framework, Data Vault ensures that data is structured in
a way that:
Supports granular tracking of changes and events
Enables flexible querying and analysis
Facilitates data integration and consolidation
Allows for scalable and adaptable data storage
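A minimal sketch of how the five T-P-O-L-E elements might be laid out as hub tables with a connecting link (using Python's sqlite3; the table and column names are hypothetical, not a prescribed Data Vault schema):

    # Illustrative sketch: one hub table per T-P-O-L-E element (hypothetical names/columns).
    import sqlite3

    con = sqlite3.connect(":memory:")
    for hub in ("time", "person", "object", "location", "event"):
        con.execute(f"CREATE TABLE hub_{hub} (id INTEGER PRIMARY KEY, business_key TEXT)")

    # A link table ties the five hubs together for a single recorded occurrence.
    con.execute("""CREATE TABLE link_tpole (
                       time_id INTEGER, person_id INTEGER, object_id INTEGER,
                       location_id INTEGER, event_id INTEGER)""")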
Explain in brief Linear Regression.
Linear regression is a statistical technique that helps us understand the
relationship between a dependent variable (outcome) and an independent
variable (input). It tries to fit a straight line through the data points to predict
the outcome. Linear regression, is a very simple approach for supervised
learning.
 In particular, linear regression is a useful tool for predicting a quantitative
response.
 Linear regression models are, by definition, linear. It works on the
underlying assumption that the relationship between variables is linear.
 Given a collection of n points, linear regression seeks to find the line
which best approximates or fits the points.
 The goal is to find the line that best approximates the relationship
between the dependent outcome variable (Y) and one or more
independent input variables(X).
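A minimal fitting sketch with scikit-learn (the toy data points are invented for illustration):

    # Simple linear regression sketch with scikit-learn (toy data for illustration).
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # independent input variable
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])             # dependent outcome variable

    model = LinearRegression().fit(X, y)
    print("slope:", model.coef_[0], "intercept:", model.intercept_)
    print("prediction for X=6:", model.predict([[6.0]])[0])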
Write a short note on Hierarchical Clustering.
Hierarchical clustering is a way to group similar data points into clusters in a
step-by-step process. It's like building a tower of clusters, where similar clusters
are merged at each level, until all data points are in one cluster. Here's a
simplified explanation of the steps:
1. Start with each data point as its own cluster (like individual bricks).
2. Find the two closest clusters (bricks) and merge them into one (like stacking
two bricks together).
3. Repeat step 2, finding the next closest clusters and merging them, until all
data points are in one cluster (like a complete tower).
4. The resulting diagram (dendrogram) shows the hierarchy of clusters and how
they're related.
This process helps identify distinct groups (clusters) in complex data, like the
example of budget-conscious shoppers vs. brand-loyal shoppers. By visualizing
the hierarchy, you can understand how the clusters are related and what drives
their behaviour.
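A minimal sketch with SciPy's hierarchical clustering (the toy 2-D points are assumptions for illustration):

    # Agglomerative (hierarchical) clustering sketch with SciPy (toy 2-D points).
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    points = np.array([[1, 1], [1.2, 0.9], [5, 5], [5.1, 4.8], [9, 1]])

    Z = linkage(points, method="ward")                 # merge the two closest clusters repeatedly
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut the dendrogram into 2 clusters
    print(labels)                                      # which shopper segment each point joins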
What is clustering? Explain the different clustering techniques
Clustering is a way to group similar data points into clusters without prior
knowledge of the groups. It's like sorting a pile of puzzle pieces into categories
without knowing the final picture. Think of it as:
Unsupervised learning: No prior training or guidance is given to the
algorithm.
Discovery: The algorithm finds hidden patterns and groups in the data.
Labeling: The algorithm assigns labels to the groups, creating clusters.
In customer data, clustering helps segment customers based on demographics
or purchasing behavior, creating distinct groups. Since there's no "right"
answer, the algorithm finds the best fit solution based on the data. Clustering is
also called unsupervised classification, as it categorizes data without prior
knowledge of the categories. It's a powerful tool for exploring and
understanding complex data.
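For instance, a minimal k-means sketch with scikit-learn for segmenting customers (the two features and the choice of three clusters are illustrative assumptions):

    # Unsupervised customer-segmentation sketch with k-means (illustrative features/k).
    import numpy as np
    from sklearn.cluster import KMeans

    # columns: [annual_spend, visits_per_month] for a handful of made-up customers
    customers = np.array([[200, 2], [220, 3], [900, 8], [950, 7], [60, 1], [75, 1]])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
    print(km.labels_)          # the discovered segment label for each customer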
