Unit 1 Data Science
1.1 Need, benefits and uses of data science and big data
Data science and big data play crucial roles in various industries, providing valuable
insights and enabling informed decision-making. Here are some of the key aspects of
their need, benefits, and uses:
**Need:**
2. **Complexity of Data:**
- Data comes in various formats, including structured, semi-structured, and
unstructured data. Extracting meaningful information from such diverse sources
requires advanced analytical techniques.
3. **Competitive Advantage:**
- Organizations that harness the power of data science and big data gain a
competitive edge by making more informed decisions and identifying opportunities
for innovation.
4. **Customer Expectations:**
- Businesses are under increasing pressure to understand customer behaviour,
preferences, and needs. Data science enables organizations to analyse customer data
to enhance the customer experience.
5. **Risk Management:**
- Big data analytics helps in identifying potential risks and predicting future trends,
enabling organizations to proactively mitigate risks and optimize strategies.
**Benefits:**
1. **Informed Decision-Making:**
- Data science provides insights that help organizations make data-driven
decisions, reducing reliance on intuition and improving accuracy.
2. **Improved Efficiency:**
- Big data technologies allow organizations to process and analyse large datasets
quickly, leading to more efficient operations and resource utilization.
3. **Personalization:**
- Businesses can use data science to analyse customer behaviour and preferences,
enabling personalized marketing, product recommendations, and services.
5. **Cost Reduction:**
- Predictive analytics and optimization techniques can help organizations identify
cost-saving opportunities and streamline operations.
**Uses:**
1. **Healthcare:**
- Predictive analytics can be used for disease diagnosis and treatment planning. Big
data helps in managing and analysing large volumes of patient data.
2. **Finance:**
- Fraud detection, risk management, and algorithmic trading are common
applications of data science in the financial sector.
3. **E-commerce:**
- Recommendation engines, personalized marketing, and inventory management
benefit from data science in the e-commerce industry.
4. **Manufacturing:**
- Predictive maintenance, supply chain optimization, and quality control are areas
where big data analytics is applied in manufacturing.
In summary, the integration of data science and big data technologies is essential for
organizations to stay competitive, improve decision-making, and unlock new
opportunities across various industries.
1.2 Overview of the data science process
Following a structured approach to data science helps you maximize your chances of
success in a data science project at the lowest cost.
It also makes it possible to work on a project as a team, with each member focusing on what they do best.
3. **Data Processing Tools:** Tools like Apache Hive, Apache Pig, and Apache Flink
facilitate data processing and analysis.
4. NoSQL Databases (Not only SQL):- Solutions like MongoDB, Cassandra, and
Couchbase are designed to handle unstructured and semi-structured data.
• Def: A NoSQL database provides a mechanism for the storage and retrieval of data that is
modelled in means other than the tabular relations used in relational databases.
• It is scalable (scalability is the ability to expand or contract the capacity of system
resources in order to support the changing usage of your application).
• Fast
• Types :
1. Column databases = Data is stored in columns, which allows algorithms to perform
much faster queries. Newer technologies use cell-wise storage. Table-like structures
are still important.
2. Document stores = Document stores no longer use tables, but store every observation
in a document. This allows for a much more flexible data schema.
3. Streaming data = Data is collected, transformed, and aggregated not in batches but in
real time. Although we have categorized it here as a database to help you in tool
selection, it's more a particular type of problem that drove the creation of technologies
such as Storm.
4. Key-value stores = Data isn't stored in a table; rather, you assign a key to every value,
such as org.marketing.sales.2015:2000. This scales well but places almost all of the
implementation effort on the developer (see the sketch after this list).
5. SQL on Hadoop = Batch queries on Hadoop are in a SQL-like language that uses the
map-reduce framework in the background.
6. NewSQL = This class combines the scalability of NoSQL databases with the
advantages of relational databases. These systems all have a SQL interface and a relational
data model.
7. Graph databases = Not every problem is best stored in a table. Some problems are
more naturally translated into graph theory and stored in graph databases. A classic
example of this is a social network.
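To make the key-value idea concrete, here is a minimal in-memory sketch in Python. A plain dictionary stands in for a real key-value store such as Redis; the key is taken from the example above, and the functions are illustrative only.

```python
# Minimal in-memory sketch of a key-value store; a real system (e.g. Redis)
# would add persistence, replication, and distribution.
store = {}

def put(key, value):
    """Store a value under a structured key chosen by the developer."""
    store[key] = value

def get(key):
    """Look up a value by its exact key."""
    return store.get(key)

# The key itself encodes all the structure, as in the example from the text.
put("org.marketing.sales.2015", 2000)
print(get("org.marketing.sales.2015"))   # -> 2000
```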
6) Scheduling tools:-
• Scheduling tools help you automate repetitive tasks and trigger jobs based on events
such as adding a new file to a folder.
• Some scheduling tools are specially developed for big data.
• You can use them, for instance, to start a MapReduce task whenever a new dataset is
available in a directory.
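As a rough illustration of such a trigger, the sketch below polls a directory and launches a hypothetical processing script whenever a new file appears. The directory name incoming and the script process_dataset.py are assumptions; production schedulers such as Oozie or Airflow handle this far more robustly.

```python
import os
import subprocess
import time

WATCH_DIR = "incoming"                 # assumed landing directory for new datasets
seen = set(os.listdir(WATCH_DIR))

while True:
    current = set(os.listdir(WATCH_DIR))
    for new_file in sorted(current - seen):
        # Trigger a (hypothetical) processing job for each newly arrived file.
        subprocess.run(["python", "process_dataset.py",
                        os.path.join(WATCH_DIR, new_file)])
    seen = current
    time.sleep(30)                     # poll every 30 seconds
```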
7) Benchmarking tools:-
• This class of tools was developed to optimize your big data installation by providing
standardized profiling suites.
• A benchmark is a standard or point of reference.
• A profiling suite is derived from a representative set of big data jobs.
• Using an optimized infrastructure can make a big cost difference.
8) System Deployment:-
• Deployment implies moving a product from a temporary or development state to a
permanent or desired state.
• Setting up a big data infrastructure isn’t an easy task and assisting engineers in
deploying new applications into the big data cluster is where system deployment tools
shine.
• They largely automate the installation and configuration of big data components.
• This isn’t a core task of a data scientist.
9) Service programming:-
• Suppose that you’ve made a world-class soccer prediction application on Hadoop, and
you want to allow others to use the predictions made by your application.
• However, you have no idea of the architecture or technology of everyone keen on
using your predictions.
• Service tools excel here by exposing big data applications to other applications as a
service. Data scientists sometimes need to expose their models through services.
• The best-known example is the REST service; REST stands for Representational State
Transfer. It’s often used to feed websites with data.
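A minimal sketch of such a REST service using Flask is shown below; the predict_match function is a placeholder standing in for the real model, and the route and payload fields are assumptions for illustration only.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def predict_match(home_team, away_team):
    # Placeholder for the real prediction model running on the cluster.
    return {"home": home_team, "away": away_team, "predicted_winner": home_team}

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    result = predict_match(payload["home_team"], payload["away_team"])
    return jsonify(result)

if __name__ == "__main__":
    app.run(port=5000)   # any HTTP client can now request predictions
```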
10) Security:-
• Do you want everybody to have access to all of your data? You probably need
fine-grained control over access to data, but you don’t want to manage this on an
application-by-application basis.
• Big data security tools allow you to have central and fine-grained control over access
to the data. Big data security has become a topic in its own right, and data scientists are
usually only confronted with it as data consumers.
Challenges of big data:
1. Volume:
- *Description:* The sheer volume of data generated on a daily basis is one of the
primary challenges in the big data world. Managing, storing, and processing massive
amounts of data can be a daunting task.
- *Solution:* Distributed storage and processing systems like Hadoop and Spark, along
with scalable cloud storage solutions, help address volume challenges.
2. Velocity:
- *Description:* Data is generated at an unprecedented speed, requiring real-time or
near-real-time processing to extract meaningful insights. Traditional databases and
processing systems may struggle with high-velocity data streams.
- *Solution:* Stream processing frameworks like Apache Kafka and technologies that
support real-time analytics are essential to handle high-velocity data.
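A minimal sketch of consuming such a stream with the kafka-python client; the broker address localhost:9092 and the topic name clickstream are assumptions made for illustration.

```python
import json

from kafka import KafkaConsumer   # pip install kafka-python

# Assumes a Kafka broker on localhost:9092 and a topic named "clickstream".
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Each event is processed as it arrives, rather than in a nightly batch.
    print(event)
```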
3. Variety:
- *Description:* Big data comes in various formats, including structured, semi-
structured, and unstructured data. Managing diverse data types and sources can be
complex.
- *Solution:* Data lakes and flexible storage solutions, such as NoSQL databases, are
used to store and process diverse data types.
4. Veracity:
- *Description:* Data quality and reliability are crucial. Big data often includes noisy,
incomplete, or inconsistent data, which can impact the accuracy of analytical results.
- *Solution:* Data cleansing and preprocessing techniques, along with quality
assurance measures, help improve data accuracy and reliability.
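A small illustration of typical cleansing steps with pandas, using made-up data; real pipelines involve far more extensive validation.

```python
import numpy as np
import pandas as pd

# Toy data showing typical veracity problems: duplicates and missing values.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "age": [34, 34, np.nan, 29],
    "city": ["Pune", "Pune", "Mumbai", None],
})

df = df.drop_duplicates()                          # remove exact duplicate records
df["age"] = df["age"].fillna(df["age"].median())   # impute missing ages
df = df.dropna(subset=["city"])                    # drop rows still missing key fields

print(df)
```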
5. Value:
- *Description:* Extracting meaningful insights and value from large datasets can be
challenging. Identifying relevant patterns and trends requires advanced analytics and
machine learning techniques.
- *Solution:* Employing data analytics, machine learning, and artificial intelligence (AI)
tools to analyze and derive actionable insights from big data.
7. Scalability:
- *Description:* As data volumes grow, systems need to scale seamlessly to handle
increased workloads. Scalability is crucial to ensure performance and responsiveness.
- *Solution:* Distributed computing frameworks, cloud services, and scalable storage
solutions support the scalability requirements of big data systems.
8. Cost Management:
- *Description:* Managing the costs associated with storing, processing, and analyzing
large volumes of data can be challenging. Cloud services and infrastructure costs need to
be optimized.
- *Solution:* Implementing cost-effective storage solutions, optimizing data processing
workflows, and leveraging cloud cost management tools are essential for cost control.
9. Complexity:
- *Description:* Big data ecosystems can be complex with various tools, technologies,
and components. Integrating and managing these components can be challenging.
- *Solution:* Adopting comprehensive data governance practices, using integrated
platforms, and employing skilled professionals can help manage the complexity of big
data environments.
Role of mathematics and statistics in data science:
1. Descriptive Statistics:
- *Role:* Descriptive statistics help summarize and describe essential features of a
dataset, such as mean, median, mode, variance, and standard deviation. These measures
provide an initial understanding of the data's central tendency, spread, and distribution.
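For example, Python's standard statistics module computes these measures directly; the marks below are made-up sample data.

```python
import statistics as st

marks = [62, 71, 71, 55, 80, 68, 90, 71, 66, 74]   # made-up sample data

print("mean    :", st.mean(marks))
print("median  :", st.median(marks))
print("mode    :", st.mode(marks))
print("variance:", st.variance(marks))   # sample variance
print("std dev :", st.stdev(marks))      # sample standard deviation
```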
2. Inferential Statistics:
- *Role:* Inferential statistics enable data scientists to make predictions or inferences
about a population based on a sample of data. Techniques like hypothesis testing and
confidence intervals help draw conclusions from data and assess the reliability of
predictions.
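A small sketch of a one-sample t-test and a confidence interval using scipy.stats; the sample values and the hypothesized mean of 5.0 are made up for illustration.

```python
import numpy as np
from scipy import stats

sample = np.array([4.9, 5.1, 5.3, 4.8, 5.0, 5.2, 5.4, 4.7])   # made-up sample

# One-sample t-test: is the population mean different from 5.0?
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)
print("t =", t_stat, ", p =", p_value)

# 95% confidence interval for the population mean.
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI:", ci)
```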
3. Probability:
- *Role:* Probability theory is foundational to statistics and plays a crucial role in
modeling uncertainty. Probability distributions, such as the normal distribution, are
used to model and understand the likelihood of different outcomes, which is essential
for making informed decisions.
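For instance, modelling a measurement as normally distributed lets you read off probabilities and percentiles directly; the mean and standard deviation below are assumed values.

```python
from scipy.stats import norm

mu, sigma = 170, 10   # assumed: heights ~ N(170 cm, 10 cm)

# Probability that a randomly chosen person is shorter than 180 cm.
print("P(X < 180) =", norm.cdf(180, loc=mu, scale=sigma))

# Height below which 95% of the population falls.
print("95th percentile =", norm.ppf(0.95, loc=mu, scale=sigma))
```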
4. Linear Algebra:
- *Role:* Linear algebra is integral to machine learning algorithms and data
manipulation. Concepts like matrices and vectors are used to represent and transform
data, especially in the context of algorithms like linear regression, principal component
analysis, and deep learning.
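As a small example, ordinary least squares (the core of linear regression) can be written entirely in matrix terms; the data below is made up, and in practice numpy.linalg.lstsq is preferred over explicitly inverting the matrix.

```python
import numpy as np

# Design matrix X (intercept column plus one feature) and target vector y.
X = np.array([[1, 1.0],
              [1, 2.0],
              [1, 3.0],
              [1, 4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

# Normal equation: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print("intercept, slope:", beta)
```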
5. Calculus:
- *Role:* Calculus is essential for understanding the rates of change and gradients in
mathematical models. Optimization algorithms, which are widely used in machine
learning for model training, rely on calculus principles, such as derivatives.
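A minimal sketch of gradient descent, the calculus-based optimization idea behind much of model training; the function f(w) = (w - 3)^2 and the learning rate are arbitrary choices for illustration.

```python
# Minimize f(w) = (w - 3)^2; its derivative is f'(w) = 2 * (w - 3).
def gradient(w):
    return 2 * (w - 3)

w = 0.0                 # starting point
learning_rate = 0.1
for _ in range(100):
    w -= learning_rate * gradient(w)   # step against the gradient

print(w)                # converges towards the minimum at w = 3
```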
6. Statistical Modeling:
- *Role:* Statistical models form the basis for understanding relationships within data.
Regression analysis, time series analysis, and other statistical modeling techniques help
identify patterns and relationships, making predictions and guiding decision-making.
9. A/B Testing:
- *Role:* A/B testing is a statistical technique used to compare two or more versions of
a product or process. It relies on statistical methods to determine if observed
differences are statistically significant and not due to chance.
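A small sketch of checking significance in an A/B test with a chi-squared test from scipy.stats; the conversion counts below are made up.

```python
from scipy.stats import chi2_contingency

# Made-up results: [converted, not converted] for versions A and B.
observed = [[120, 880],    # version A: 12.0% conversion
            [150, 850]]    # version B: 15.0% conversion

chi2, p_value, dof, expected = chi2_contingency(observed)
print("p-value:", p_value)

if p_value < 0.05:
    print("The difference is statistically significant at the 5% level.")
else:
    print("The difference could plausibly be due to chance.")
```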
In summary, mathematics and statistics are the backbone of data science, providing the
necessary tools and techniques for data exploration, analysis, and modelling. A strong
foundation in these subjects empowers data scientists to formulate hypotheses, build
models, validate results, and make informed decisions based on data-driven insights.