Extra - Data Science Unit II

Unit 2 – Data Wrangling, Data Cleaning and Preparation

S1 - Data Handling: Problems Faced When Handling Large Data

Problems Faced - Introduction
A large volume of data poses new challenges, such as overloaded
memory and algorithms that never stop running. It forces you to adapt
and expand your repertoire of techniques. But even when you can
perform your analysis, you should watch out for issues such as I/O
(input/output) and CPU starvation, as these might lead to performance
problems.

Improving your code and using effective data structures can help reduce
these issues. Moreover, exploring parallel processing or distributed
computing might enhance performance when working with extensive
datasets.
Handling Large Data - Problems
[Block diagram of the problems faced when handling large data; courtesy: https://www.anuupdates.org]

Three Problems with Respect to Large Data
1. Not Enough Memory:
When a dataset surpasses the available RAM, the computer might not
be able to handle all the data at once, causing errors.
2. Processes that Never End:
Large datasets can lead to extremely long processing times, making
it seem like the processes never terminate.
3. Bottlenecks:
Processing large datasets can strain the computer's resources.
Certain components, like the CPU, might become overloaded while
others remain idle. This is referred to as a bottleneck.

1. Not Enough Memory (RAM):
• Random Access Memory (RAM) acts as the computer's short-term
memory. When you work with a dataset, a portion of it is loaded
into RAM for faster processing.
• If the dataset surpasses the available RAM, the computer might
resort to using slower storage devices like hard disk drives (HDDs)
to swap data in and out of memory as needed. This process,
known as paging, significantly slows down operations because
HDDs have much slower read/write speeds compared to RAM.
• In severe cases, exceeding RAM capacity can lead to program
crashes or errors if the computer cannot allocate enough memory
to handle the data.

2. Processes that Never End
• Large datasets naturally take longer to process because the computer
needs to perform operations on each data point.
• This can include calculations, filtering, sorting, or any other
manipulation required for the given task.
• The processing time can become impractical for very large datasets,
making it seem like the computer is stuck in an infinite loop. This can
be frustrating and impede the workflow.

3. Bottlenecks (Resource Overload)
• When processing large datasets, the computer's central processing
unit (CPU) is typically the most stressed component. The CPU is
responsible for executing all the instructions required for data
manipulation.
• If the CPU becomes overloaded, it can create a bottleneck, where
other components like the graphics processing unit (GPU) or storage
might be underutilized while waiting for the CPU to complete its tasks.
This imbalance in resource usage hinders the overall processing speed.

These limitations can significantly impact the efficiency and feasibility of
working with large datasets on a single computer. In extreme cases, it might
become impossible to handle the data altogether due to memory constraints
or excessively long processing times.

Tips to Overcome the Problems
Working with massive datasets on a single computer can be challenging;
the following strategies can be employed to overcome these limitations:

1. Optimizing Memory Usage
2. Reducing Processing Time
3. Addressing Bottlenecks

1. Optimizing Memory Usage

• Data Partitioning: Divide your large dataset into smaller, manageable
chunks. Work on each chunk independently, reducing the overall
memory footprint at any given time. Libraries like Pandas in Python
offer functionalities for efficient data partitioning.
• Data Sampling: Instead of processing the entire dataset, consider
selecting a representative subset (sample) that captures the essential
characteristics of the whole data. This can be helpful for initial analysis
or testing purposes without overloading the system.
• Data Type Optimization: Analyze your data and convert variables to
appropriate data types that require less memory. For instance, storing
integers as 16-bit values instead of 32-bit can significantly reduce
memory usage. A minimal sketch of these ideas is shown after this list.
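
The sketch below uses Pandas; the file name large_data.csv and the columns
value and category are hypothetical placeholders, so adjust them to your own data.

```python
import pandas as pd

CSV_PATH = "large_data.csv"  # hypothetical file; replace with your dataset

# Data partitioning: read and process the file in chunks rather than all at once.
chunk_totals = []
for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    # Data type optimization: downcast numbers and use 'category' for
    # repetitive strings so each chunk occupies less memory.
    chunk["value"] = pd.to_numeric(chunk["value"], downcast="integer")
    chunk["category"] = chunk["category"].astype("category")
    chunk_totals.append(chunk["value"].sum())
print("Total over all chunks:", sum(chunk_totals))

# Data sampling: load roughly every 100th row for exploratory analysis.
sample = pd.read_csv(CSV_PATH, skiprows=lambda i: i > 0 and i % 100 != 0)
```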

2. Reducing Processing Time
1. Parallelization: Utilize the multiple cores available in most modern
processors. Break down large tasks into smaller subtasks and
distribute them across cores for simultaneous execution, speeding up
the overall process. Libraries like Dask in Python can facilitate
parallel processing; a minimal sketch using the standard library
appears after this list.
2. Code Optimization: Review and optimize your code to improve its
efficiency. Look for redundant operations or areas where algorithms
can be streamlined. Even small code improvements can lead to
significant performance gains when dealing with large datasets.
3. Utilize Specialized Libraries: Take advantage of libraries and
frameworks designed for handling big data. These tools often employ
efficient data structures and algorithms optimized for large-scale
processing, significantly improving performance over hand-rolled,
generic implementations.
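
The sketch below uses concurrent.futures from Python's standard library
(not a particular big-data framework) to split an illustrative CPU-bound
task across four processes; the work function is purely for demonstration.

```python
from concurrent.futures import ProcessPoolExecutor

def process_part(numbers):
    """Illustrative CPU-bound subtask: sum of squares over one slice."""
    return sum(x * x for x in numbers)

if __name__ == "__main__":
    data = list(range(1_000_000))
    parts = [data[i::4] for i in range(4)]           # split into 4 subtasks
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_results = list(pool.map(process_part, parts))
    print("Combined result:", sum(partial_results))  # merge the partial results
```
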
3. Addressing Bottlenecks
1. Upgrade Hardware: If feasible, consider upgrading your
computer's hardware, particularly RAM and CPU. Adding more
RAM directly increases the available memory for data
processing, while a more powerful CPU can handle large
datasets with greater efficiency.
2. Cloud Computing: For extremely large datasets that exceed the
capabilities of a single computer, consider utilizing cloud
computing platforms like Google Cloud Platform or Amazon
Web Services. These platforms offer virtual machines with
significantly larger memory and processing power, allowing you
to tackle tasks that wouldn't be possible on your local machine.

Unit 2 – Data Wrangling, Data Cleaning and Preparation
S2 - General Techniques for Handling Large Volumes of Data

Handling Large Data
The main obstacles encountered while dealing with enormous data
include algorithms that never finish, memory overflow errors, and
poor performance.

The solutions can be divided into three categories:
o using the correct algorithms,
o choosing the right data structures,
o using the right tools.

Block Diagram – Handling Large Data
[Block diagram of techniques for handling large data; courtesy: https://www.anuupdates.org]

Selecting the Appropriate Algorithm

1. Opting for the appropriate algorithm can resolve a greater
number of issues than simply upgrading technology.
2. An algorithm optimized for processing extensive data can
provide predictions without requiring the complete dataset to be
loaded into memory.
3. The method should ideally allow parallelized computations.
4. Here, three types of algorithms are explored: online algorithms,
block algorithms, and MapReduce algorithms.

A. Online Algorithms

Definition: These algorithms make decisions based on a limited and
sequential stream of data, without knowledge of future inputs.
Applications: They are commonly used in scenarios where data arrives
continuously and decisions need to be made in real time. Examples
include:
o Online scheduling algorithms for resource allocation in computer
systems
o Spam filtering algorithms that classify incoming emails as spam or not
spam as they arrive
o Online game-playing algorithms that make decisions based on the
current state of the game
A small sketch of an online computation is shown below.
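
As a minimal sketch of the idea, Welford's online algorithm below maintains a
running mean and variance one value at a time, so the full stream never has to
be held in memory; the data_stream generator is a stand-in for a real source.

```python
def data_stream():
    """Stand-in for a real-time data source (sensor readings, log events, ...)."""
    for x in [4.0, 7.0, 13.0, 16.0]:
        yield x

# Welford's online algorithm: update the statistics per observation.
count, mean, m2 = 0, 0.0, 0.0
for x in data_stream():
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)

variance = m2 / (count - 1) if count > 1 else 0.0
print(f"mean={mean:.2f}, sample variance={variance:.2f}")  # mean=10.00, variance=30.00
```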

B. Block Algorithms

Definition: These algorithms operate on fixed-size chunks of data,
also known as blocks. Each block is processed independently,
allowing for a degree of parallelization and improved efficiency
when dealing with large datasets.
Examples:
1. Sorting algorithms like merge sort or quicksort that divide the
data into sub-arrays for sorting
2. Image processing tasks where image data can be divided into
smaller blocks for individual filtering or manipulation
3. Scientific computing problems where large datasets are
processed in chunks to utilize parallel computing resources
A small sketch of block-wise processing is shown below.
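
The sketch assumes a hypothetical text file values.txt with one number per
line; each fixed-size block is summarised independently and the partial
results are then combined.

```python
BLOCK_SIZE = 10_000  # number of lines handled per block

def read_blocks(path, block_size):
    """Yield fixed-size blocks of values without loading the whole file."""
    with open(path) as f:
        block = []
        for line in f:
            block.append(float(line))
            if len(block) == block_size:
                yield block
                block = []
        if block:            # final, possibly smaller block
            yield block

# Each block is processed independently; the partial results are then merged.
total, count = 0.0, 0
for block in read_blocks("values.txt", BLOCK_SIZE):
    total += sum(block)
    count += len(block)
print("Overall mean:", total / count)
```
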
C. MapReduce Algorithms
Definition: This is a programming framework specifically designed for
processing large datasets in a distributed manner across multiple
computers. It involves two key phases:
• Map: This phase takes individual data elements as input and
processes them independently, generating intermediate key-value
pairs.
• Reduce: This phase aggregates the intermediate key-value pairs from
the "Map" phase based on the key, performing a specific operation on
the values for each unique key.
• Applications: MapReduce is widely used in big data analytics tasks,
where massive datasets need to be processed and analyzed. Examples
include:
C. MapReduce Algorithms Contd…
• Log analysis: analyzing large log files from web servers to identify
trends and patterns
• Sentiment analysis: analyzing large amounts of text data to
understand the overall sentiment
• Scientific data processing: analyzing large datasets from scientific
experiments
A toy word count illustrating the two phases is shown below.
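
The sketch expresses the two phases in plain Python purely for illustration;
real MapReduce frameworks (such as Hadoop) distribute the same pattern
across many machines.

```python
from collections import defaultdict

documents = [
    "big data needs big tools",
    "map reduce splits big jobs",
]

# Map phase: emit intermediate (key, value) pairs for every input element.
intermediate = []
for doc in documents:
    for word in doc.split():
        intermediate.append((word, 1))

# Reduce phase: group the pairs by key and aggregate their values.
counts = defaultdict(int)
for word, value in intermediate:
    counts[word] += value

print(dict(counts))  # {'big': 3, 'data': 1, 'needs': 1, ...}
```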

2. Choosing the right data structure
• Algorithms can make or break your program, but the way you store
your data is of equal importance. Data structures have different
storage requirements, but also influence the performance of
CRUD (create, read, update, and delete) and other operations on
the data set.
• There are many data structures to choose from; here we will
discuss sparse data, tree data, and hash data. These three
terms represent different approaches to storing and organizing
data, each with its own strengths and weaknesses.

Data Structure - Types
[Diagram of the data structure types discussed below; courtesy: https://www.anuupdates.org]

1. Sparse Data
Definition:
Sparse data refers to datasets where most of the values are empty or
zero. This often occurs when dealing with high-dimensional data where
most data points have values for only a few features out of many.
Examples:
• Customer purchase history: Most customers might not buy every
available product, resulting in many zeros in the purchase matrix.
• Text documents: Most words don't appear in every document, leading
to sparse word-document matrices.

1. Sparse Data Contd…
Challenges:
• Storing and processing sparse data using conventional methods
can be inefficient due to wasted space for empty values.
• Specialized techniques like sparse matrices or compressed
representations are needed to optimize storage and processing
(see the sketch at the end of this section).
Applications:
• Recommender systems: Analyzing sparse user-item interactions
to recommend relevant products or content.
• Natural language processing: Analyzing sparse word-document
relationships for tasks like topic modeling or text classification.
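
A minimal sketch with SciPy's sparse module; the tiny customer-by-product
matrix is made up for illustration.

```python
import numpy as np
from scipy import sparse

# A mostly-zero "customer x product" purchase matrix (illustrative data).
dense = np.array([
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 2],
])

# CSR format stores only the non-zero entries and their positions.
csr = sparse.csr_matrix(dense)
print(csr.nnz, "non-zero values out of", dense.size)            # 3 out of 15
print("Purchases per customer:", np.asarray(csr.sum(axis=1)).ravel())
```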

2. Tree Data
Definition: Tree data structures represent data in a hierarchical
manner, resembling an upside-down tree. Each node in the tree can
have child nodes, forming parent-child relationships.
Examples:
• File systems: Files and folders are organized in hierarchical
structures using tree data structures.
• Biological taxonomies: Classification of species into kingdoms,
phyla, classes, etc., can be represented as a tree.

2. Tree Data Contd…
Advantages
• Efficient for representing hierarchical relationships and
performing search operations based on specific criteria.
• Can be traversed in various ways (preorder, inorder, postorder) to
access data in different orders (see the sketch below).
Disadvantages
• May not be suitable for all types of data, particularly non-
hierarchical relationships.
• Inserting and deleting nodes can be expensive operations in
certain tree structures.
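
A minimal sketch of a binary tree node and an inorder traversal:

```python
class Node:
    """A minimal binary tree node."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def inorder(node):
    """Inorder traversal: left subtree, then the node, then the right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

#        2
#       / \
#      1   3
root = Node(2, Node(1), Node(3))
print(inorder(root))  # [1, 2, 3]
```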

3. Hash Data
Definition:
Hash data uses hash functions to map data elements (keys) to
unique fixed-size values (hashes). These hashes are used for quick
retrieval and identification of data within a larger dataset.
Examples:
• Hash tables: Used in dictionaries and associative arrays to quickly
access data based on key-value pairs.
• Username and password storage: Passwords are typically stored
as hashed values for security reasons.

3. Hash Data Contd…
Advantages:
• Extremely fast for data lookup operations using the hash key.
• Efficient for storing and retrieving data when quick access by a
unique identifier is necessary.
Disadvantages:
• Hash collisions can occur when different keys map to the same
hash value, requiring additional techniques to resolve conflicts.
• Not suitable for maintaining order or performing comparisons
between data elements.
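
A minimal sketch using Python's built-in dict, which is itself a hash table,
for the username/password example; real systems would use salted, slow
password hashes (e.g. bcrypt) rather than plain SHA-256.

```python
import hashlib

password_hashes = {}  # dict = hash table: average O(1) lookup by key

def store(username, password):
    # Illustration only: store a hash of the password, never the password itself.
    password_hashes[username] = hashlib.sha256(password.encode()).hexdigest()

def check(username, password):
    return password_hashes.get(username) == hashlib.sha256(password.encode()).hexdigest()

store("alice", "s3cret")
print(check("alice", "s3cret"))  # True
print(check("alice", "wrong"))   # False
```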

Selecting the Right Tools
[Diagram of Python tools for big data; courtesy: https://www.anuupdates.org]

1. NumPy – Python Libraries for Big Data
Purpose:
The foundation for scientific computing in Python, offering a
powerful multidimensional array object (ndarray) for efficient
numerical operations.
Strengths:
o Fast and efficient array operations (vectorized computations).
o Linear algebra capabilities (matrix operations, eigenvalue
decomposition, etc.).
o Integration with other libraries like Pandas and SciPy.
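
A minimal sketch of vectorized array operations and a small piece of linear algebra:

```python
import numpy as np

# Vectorized computation: one call works on the whole array at C speed,
# instead of looping over a million elements in Python.
x = np.arange(1_000_000, dtype=np.float64)
y = np.sqrt(x) + 2.0 * x

# Linear algebra on the same ndarray machinery.
a = np.array([[2.0, 1.0], [1.0, 3.0]])
print(y[:3], np.linalg.eigvals(a))
```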

2. Pandas
Purpose:
A high-performance, easy-to-use data analysis and manipulation
library built on top of NumPy.
Strengths:
o DataFrames (tabular data structures) for flexible and efficient
data handling.
o Time series functionality (date/time data manipulation).
o Grouping and aggregation operations.
o Data cleaning and wrangling capabilities.
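
A minimal sketch of a DataFrame with a little cleaning, grouping, and
aggregation (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi"],
    "sales": [120, 90, 200, None],
})

# Cleaning: fill the missing value, then group and aggregate.
df["sales"] = df["sales"].fillna(0)
print(df.groupby("city")["sales"].agg(["sum", "mean"]))
```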

3. Dask
Purpose:
A parallel processing framework built on NumPy and Pandas,
allowing you to scale computations across multiple cores or
machines.
Strengths:
• Scalable parallel execution of NumPy and Pandas operations on
large datasets.
• Fault tolerance and efficient handling of data distribution.
• Ability to use existing NumPy and Pandas code with minor
modifications.
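
A minimal sketch, assuming hypothetical CSV files transactions-*.csv with
customer_id and amount columns; Dask reads them lazily as many partitions
and only runs the computation when asked.

```python
import dask.dataframe as dd

# Lazily read many CSV files as one partitioned dataframe.
ddf = dd.read_csv("transactions-*.csv")

# Familiar Pandas-style operations, executed in parallel across partitions
# when .compute() is called.
result = ddf.groupby("customer_id")["amount"].sum().compute()
print(result.head())
```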

4. SciPy
Purpose:
A collection of algorithms and functions for scientific computing
and technical computing, built on top of NumPy and often relying
on NumPy arrays.
Strengths:
• Wide range of scientific functions (optimization, integration,
interpolation, etc.).
• Statistical analysis and modeling tools.
• Signal and image processing capabilities.
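
A minimal sketch touching two of these areas, optimization and statistics:

```python
import numpy as np
from scipy import optimize, stats

# Optimization: minimize a simple quadratic; the minimum is at x = 3.
res = optimize.minimize(lambda x: (x[0] - 3.0) ** 2 + 1.0, x0=[0.0])
print("Minimum near x =", res.x)

# Statistics: two-sample t-test on synthetic data.
a = np.random.normal(0.0, 1.0, size=100)
b = np.random.normal(0.5, 1.0, size=100)
t_stat, p_value = stats.ttest_ind(a, b)
print("t =", t_stat, "p =", p_value)
```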

5. Scikit-learn
Purpose:
A comprehensive and user-friendly machine learning library offering
a variety of algorithms and tools for classification, regression,
clustering, dimensionality reduction, and more.
Strengths:
• Extensive collection of well-tested machine learning algorithms.
• Easy-to-use API for building and evaluating models.
• Scalability and efficiency for working with large datasets.
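
A minimal sketch: fit a classifier on the bundled Iris dataset and evaluate
it on a held-out split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```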

Unit 2 – Data Wrangling, Data Cleaning and Preparation
S3 - General Programming Tips for Dealing with Large Datasets

Programming Tips – Block Diagram
[Block diagram of general programming tips; courtesy: https://www.anuupdates.org]

Programming Tips

1. Avoid Duplicating Existing Efforts
(Don't reinvent the wheel)

i. Avoid repetition:
Avoiding the repetition of others' work is likely superior to merely
"avoiding repeating yourself." Act in a way that adds significance and
worth; revisiting an issue that has previously been resolved is
inefficient. As a data scientist, there are two fundamental principles
that can enhance your productivity while working with enormous
datasets:
ii. Harness the potential of databases.
Most data scientists first choose to create their analytical base tables
within a database when dealing with huge data sets. This strategy is
effective for preparing straightforward features. Determine whether
user-defined functions and procedures can be utilized for advanced
modeling during this preparation. The last example in this chapter
demonstrates how to incorporate a database into your workflow.

1. Avoid Duplicating Existing Efforts Contd…
iii. Utilize optimized libraries.
Developing libraries such as Mahout, Weka, and other machine learning
toolkits demands effort and expertise. These products are highly optimized
and utilize best practices and cutting-edge technologies. Focus your
attention on accomplishing tasks rather than duplicating or reiterating the
work of others, unless it is for the purpose of understanding how things work.

2. Get the Most Out of Your Hardware
Over-utilization of resources can slow programs down and cause them to
fail. Shifting workload from overtaxed to underutilized resources can be
achieved using the following techniques:

1. Feeding the CPU compressed data: Shift work from the hard disk to the
CPU by reading compressed data, so the CPU is not left waiting on slow
disk I/O (CPU starvation). A minimal sketch is shown after this list.
2. Utilizing the GPU: Switch to the GPU for parallelizable computations due
to its higher throughput.
3. Using CUDA packages: Use CUDA packages like PyCUDA for
parallelization.
4. Using multiple threads: Parallelize computations on the CPU using
normal Python threads.
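
The sketch assumes a hypothetical gzip-compressed file large_data.csv.gz:
the disk transfers fewer bytes and the CPU spends some cycles decompressing,
which eases the I/O bottleneck.

```python
import pandas as pd

# Read a gzip-compressed CSV in chunks: less disk I/O, a bit more CPU work.
reader = pd.read_csv("large_data.csv.gz", compression="gzip", chunksize=100_000)
for chunk in reader:
    pass  # process each decompressed chunk here
```
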
3. Reduce the Computing Need
Utilize a profiler to identify and remediate slow parts of your code.

1. Use compiled code, especially when loops are involved, and functions
from packages optimized for numerical computations.
2. If a package is not available, compile the code yourself.
3. Use computational libraries like LAPACK, BLAS, Intel MKL, and ATLAS for
high performance.
4. Avoid pulling all the data into memory when working with data that
doesn't fit in memory.
5. Use generators to avoid intermediate data storage by returning data per
observation instead of in batches (see the sketch after this list).
6. Use as little data as possible if no large-scale algorithm is available.
7. Use math skills to simplify calculations.
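
A sketch of tip 5, assuming a hypothetical file values.txt with one number
per line: the generator yields one processed observation at a time, so no
intermediate list is ever built.

```python
def squared_values(path):
    """Yield one processed value at a time instead of building a full list."""
    with open(path) as f:
        for line in f:
            yield float(line) ** 2

total = sum(squared_values("values.txt"))  # only one observation in memory at a time
print(total)
```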

References
• The problems we face when handling large data – Acharya Nagarjuna
University Syllabus, Important Questions, Materials.
https://www.anuupdates.org/2024/03/the-problems-we-face-when-handling-large-data.html

