Extra - Data Science Unit II

Unit 2 – Data Wrangling, Data Cleaning and Preparation

S1 - Data Handling: Problems Faced When Handling Large Data

Problems Faced - Introduction
A large volume of data poses new challenges, such as overloaded
memory and algorithms that never stop running. It forces you to adapt
and expand your repertoire of techniques. But even when you can
perform your analysis, you should watch out for issues such as I/O
(input/output) and CPU starvation, as these might lead to performance
problems.

Improving your code and using effective data structures can help reduce
these issues. Moreover, exploring parallel processing or distributed
computing might enhance performance when working with extensive
datasets.
Handling Large Data - Problems
[Block diagram of the problems faced when handling large data; courtesy: https://www.anuupdates.org]

Three Problems with Respect to Large Data
1. Not Enough Memory:
When a dataset surpasses the available RAM, the computer might not
be able to handle all the data at once, causing errors.
2. Processes that Never End:
Large datasets can lead to extremely long processing times, making
it seem like the processes never terminate.
3. Bottlenecks:
Processing large datasets can strain the computer's resources.
Certain components, like the CPU, might become overloaded while
others remain idle. This is referred to as a bottleneck.

1. Not Enough Memory (RAM):
• Random Access Memory (RAM) acts as the computer's short-term
memory. When you work with a dataset, a portion of it is loaded
into RAM for faster processing.
• If the dataset surpasses the available RAM, the computer might
resort to using slower storage devices like hard disk drives (HDDs)
to swap data in and out of memory as needed. This process,
known as paging, significantly slows down operations because
HDDs have much slower read/write speeds compared to RAM.
• In severe cases, exceeding RAM capacity can lead to program
crashes or errors if the computer cannot allocate enough memory
to handle the data.

2. Processes that Never End
• Large datasets naturally take longer to process because the computer
needs to perform operations on each data point.
• This can include calculations, filtering, sorting, or any other
manipulation required for the given task.
• The processing time can become impractical for very large datasets,
making it seem like the computer is stuck in an infinite loop. This can
be frustrating and impede the workflow.

3. Bottlenecks (Resource Overload)
• When processing large datasets, the computer's central processing
unit (CPU) is typically the most stressed component. The CPU is
responsible for executing all the instructions required for data
manipulation.
• If the CPU becomes overloaded, it can create a bottleneck, where
other components like the graphics processing unit (GPU) or storage
might be underutilized while waiting for the CPU to complete its tasks.
This imbalance in resource usage hinders the overall processing speed.

These limitations can significantly impact the efficiency and feasibility of
working with large datasets on a single computer. In extreme cases, it might
become impossible to handle the data altogether due to memory constraints
or excessively long processing times.

Tips to Overcome the Problems
Working with massive datasets on a single computer can be challenging;
the following strategies can be employed to overcome these limitations:

1. Optimizing Memory Usage
2. Reducing Processing Time
3. Addressing Bottlenecks

1. Optimizing Memory Usage

• Data Partitioning: Divide your large dataset into smaller, manageable
chunks. Work on each chunk independently, reducing the overall
memory footprint at any given time. Libraries like Pandas in Python
offer functionalities for efficient data partitioning.
• Data Sampling: Instead of processing the entire dataset, consider
selecting a representative subset (sample) that captures the essential
characteristics of the whole data. This can be helpful for initial analysis
or testing purposes without overloading the system.
• Data Type Optimization: Analyze your data and convert variables to
appropriate data types that require less memory. For instance, storing
integers as 16-bit values instead of 32-bit can significantly reduce
memory usage. A minimal sketch of these ideas is shown after this list.
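
The sketch below uses Pandas; the file name large_data.csv and the columns
value and category are hypothetical placeholders, so adjust them to your own data.

```python
import pandas as pd

CSV_PATH = "large_data.csv"  # hypothetical file; replace with your dataset

# Data partitioning: read and process the file in chunks rather than all at once.
chunk_totals = []
for chunk in pd.read_csv(CSV_PATH, chunksize=100_000):
    # Data type optimization: downcast numbers and use 'category' for
    # repetitive strings so each chunk occupies less memory.
    chunk["value"] = pd.to_numeric(chunk["value"], downcast="integer")
    chunk["category"] = chunk["category"].astype("category")
    chunk_totals.append(chunk["value"].sum())
print("Total over all chunks:", sum(chunk_totals))

# Data sampling: load roughly every 100th row for exploratory analysis.
sample = pd.read_csv(CSV_PATH, skiprows=lambda i: i > 0 and i % 100 != 0)
```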

2. Reducing Processing Time
1. Parallelization: Utilize the multiple cores available in most modern
processors. Break down large tasks into smaller subtasks and
distribute them across cores for simultaneous execution, speeding up
the overall process. Libraries like Dask in Python can facilitate
parallel processing; a minimal sketch using the standard library
appears after this list.
2. Code Optimization: Review and optimize your code to improve its
efficiency. Look for redundant operations or areas where algorithms
can be streamlined. Even small code improvements can lead to
significant performance gains when dealing with large datasets.
3. Utilize Specialized Libraries: Take advantage of libraries and
frameworks designed for handling big data. These tools often employ
efficient data structures and algorithms optimized for large-scale
processing, significantly improving performance over hand-rolled,
generic implementations.
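
The sketch below uses concurrent.futures from Python's standard library
(not a particular big-data framework) to split an illustrative CPU-bound
task across four processes; the work function is purely for demonstration.

```python
from concurrent.futures import ProcessPoolExecutor

def process_part(numbers):
    """Illustrative CPU-bound subtask: sum of squares over one slice."""
    return sum(x * x for x in numbers)

if __name__ == "__main__":
    data = list(range(1_000_000))
    parts = [data[i::4] for i in range(4)]           # split into 4 subtasks
    with ProcessPoolExecutor(max_workers=4) as pool:
        partial_results = list(pool.map(process_part, parts))
    print("Combined result:", sum(partial_results))  # merge the partial results
```
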
3. Addressing Bottlenecks
1. Upgrade Hardware: If feasible, consider upgrading your
computer's hardware, particularly RAM and CPU. Adding more
RAM directly increases the available memory for data
processing, while a more powerful CPU can handle large
datasets with greater efficiency.
2. Cloud Computing: For extremely large datasets that exceed the
capabilities of a single computer, consider utilizing cloud
computing platforms like Google Cloud Platform or Amazon
Web Services. These platforms offer virtual machines with
significantly larger memory and processing power, allowing you
to tackle tasks that wouldn't be possible on your local machine.

Unit 2 – Data Wrangling, Data Cleaning and Preparation
S2 - General Techniques for Handling Large Volumes of Data

Handling Large Data
The main obstacles encountered while dealing with enormous data
include algorithms that never finish, memory overflow errors, and
poor performance.

The solutions can be divided into three categories:
o using the correct algorithms,
o choosing the right data structures,
o using the right tools.

Block Diagram – Handling Large Data
[Block diagram of techniques for handling large data; courtesy: https://www.anuupdates.org]

Selecting the Appropriate Algorithm

1. Opting for the appropriate algorithm can resolve a greater
number of issues than simply upgrading technology.
2. An algorithm optimized for processing extensive data can
provide predictions without requiring the complete dataset to be
loaded into memory.
3. The method should ideally allow parallelized computations.
4. Here, three types of algorithms are explored: online algorithms,
block algorithms, and MapReduce algorithms.

A. Online Algorithms

Definition: These algorithms make decisions based on a limited and
sequential stream of data, without knowledge of future inputs.
Applications: They are commonly used in scenarios where data arrives
continuously and decisions need to be made in real time. Examples
include:
o Online scheduling algorithms for resource allocation in computer
systems
o Spam filtering algorithms that classify incoming emails as spam or not
spam as they arrive
o Online game-playing algorithms that make decisions based on the
current state of the game
A small sketch of an online computation is shown below.
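
As a minimal sketch of the idea, Welford's online algorithm below maintains a
running mean and variance one value at a time, so the full stream never has to
be held in memory; the data_stream generator is a stand-in for a real source.

```python
def data_stream():
    """Stand-in for a real-time data source (sensor readings, log events, ...)."""
    for x in [4.0, 7.0, 13.0, 16.0]:
        yield x

# Welford's online algorithm: update the statistics per observation.
count, mean, m2 = 0, 0.0, 0.0
for x in data_stream():
    count += 1
    delta = x - mean
    mean += delta / count
    m2 += delta * (x - mean)

variance = m2 / (count - 1) if count > 1 else 0.0
print(f"mean={mean:.2f}, sample variance={variance:.2f}")  # mean=10.00, variance=30.00
```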

B. Block Algorithms

Definition: These algorithms operate on fixed-size chunks of data,
also known as blocks. Each block is processed independently,
allowing for a degree of parallelization and improved efficiency
when dealing with large datasets.
Examples:
1. Sorting algorithms like merge sort or quicksort that divide the
data into sub-arrays for sorting
2. Image processing tasks where image data can be divided into
smaller blocks for individual filtering or manipulation
3. Scientific computing problems where large datasets are
processed in chunks to utilize parallel computing resources
A small sketch of block-wise processing is shown below.
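
The sketch assumes a hypothetical text file values.txt with one number per
line; each fixed-size block is summarised independently and the partial
results are then combined.

```python
BLOCK_SIZE = 10_000  # number of lines handled per block

def read_blocks(path, block_size):
    """Yield fixed-size blocks of values without loading the whole file."""
    with open(path) as f:
        block = []
        for line in f:
            block.append(float(line))
            if len(block) == block_size:
                yield block
                block = []
        if block:            # final, possibly smaller block
            yield block

# Each block is processed independently; the partial results are then merged.
total, count = 0.0, 0
for block in read_blocks("values.txt", BLOCK_SIZE):
    total += sum(block)
    count += len(block)
print("Overall mean:", total / count)
```
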
C. MapReduce Algorithms
Definition: This is a programming framework specifically designed for
processing large datasets in a distributed manner across multiple
computers. It involves two key phases:
• Map: This phase takes individual data elements as input and
processes them independently, generating intermediate key-value
pairs.
• Reduce: This phase aggregates the intermediate key-value pairs from
the "Map" phase based on the key, performing a specific operation on
the values for each unique key.
• Applications: MapReduce is widely used in big data analytics tasks,
where massive datasets need to be processed and analyzed. Examples
include:
C. MapReduce Algorithms Contd…
• Log analysis: analyzing large log files from web servers to identify
trends and patterns
• Sentiment analysis: analyzing large amounts of text data to
understand the overall sentiment
• Scientific data processing: analyzing large datasets from scientific
experiments
A toy word count illustrating the two phases is shown below.
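
The sketch expresses the two phases in plain Python purely for illustration;
real MapReduce frameworks (such as Hadoop) distribute the same pattern
across many machines.

```python
from collections import defaultdict

documents = [
    "big data needs big tools",
    "map reduce splits big jobs",
]

# Map phase: emit intermediate (key, value) pairs for every input element.
intermediate = []
for doc in documents:
    for word in doc.split():
        intermediate.append((word, 1))

# Reduce phase: group the pairs by key and aggregate their values.
counts = defaultdict(int)
for word, value in intermediate:
    counts[word] += value

print(dict(counts))  # {'big': 3, 'data': 1, 'needs': 1, ...}
```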

2. Choosing the right data structure
• Algorithms can make or break your program, but the way you store
your data is of equal importance. Data structures have different
storage requirements, but also influence the performance of
CRUD (create, read, update, and delete) and other operations on
the data set.
• There are many data structures to choose from; here we will
discuss sparse data, tree data, and hash data. These three
terms represent different approaches to storing and organizing
data, each with its own strengths and weaknesses.

Data Structure - Types
[Diagram of the data structure types discussed below; courtesy: https://www.anuupdates.org]

1. Sparse Data
Definition:
Sparse data refers to datasets where most of the values are empty or
zero. This often occurs when dealing with high-dimensional data where
most data points have values for only a few features out of many.
Examples:
• Customer purchase history: Most customers might not buy every
available product, resulting in many zeros in the purchase matrix.
• Text documents: Most words don't appear in every document, leading
to sparse word-document matrices.

1. Sparse Data Contd…
Challenges:
• Storing and processing sparse data using conventional methods
can be inefficient due to wasted space for empty values.
• Specialized techniques like sparse matrices or compressed
representations are needed to optimize storage and processing
(see the sketch at the end of this section).
Applications:
• Recommender systems: Analyzing sparse user-item interactions
to recommend relevant products or content.
• Natural language processing: Analyzing sparse word-document
relationships for tasks like topic modeling or text classification.
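
A minimal sketch with SciPy's sparse module; the tiny customer-by-product
matrix is made up for illustration.

```python
import numpy as np
from scipy import sparse

# A mostly-zero "customer x product" purchase matrix (illustrative data).
dense = np.array([
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 2],
])

# CSR format stores only the non-zero entries and their positions.
csr = sparse.csr_matrix(dense)
print(csr.nnz, "non-zero values out of", dense.size)            # 3 out of 15
print("Purchases per customer:", np.asarray(csr.sum(axis=1)).ravel())
```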

2. Tree Data
Definition: Tree data structures represent data in a hierarchical
manner, resembling an upside-down tree. Each node in the tree can
have child nodes, forming parent-child relationships.
Examples:
• File systems: Files and folders are organized in hierarchical
structures using tree data structures.
• Biological taxonomies: Classification of species into kingdoms,
phyla, classes, etc., can be represented as a tree.

2. Tree Data Contd…
Advantages
• Efficient for representing hierarchical relationships and
performing search operations based on specific criteria.
• Can be traversed in various ways (preorder, inorder, postorder) to
access data in different orders (see the sketch below).
Disadvantages
• May not be suitable for all types of data, particularly non-
hierarchical relationships.
• Inserting and deleting nodes can be expensive operations in
certain tree structures.
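
A minimal sketch of a binary tree node and an inorder traversal:

```python
class Node:
    """A minimal binary tree node."""
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def inorder(node):
    """Inorder traversal: left subtree, then the node, then the right subtree."""
    if node is None:
        return []
    return inorder(node.left) + [node.value] + inorder(node.right)

#        2
#       / \
#      1   3
root = Node(2, Node(1), Node(3))
print(inorder(root))  # [1, 2, 3]
```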

3. Hash Data
Definition:
Hash data uses hash functions to map data elements (keys) to
unique fixed-size values (hashes). These hashes are used for quick
retrieval and identification of data within a larger dataset.
Examples:
• Hash tables: Used in dictionaries and associative arrays to quickly
access data based on key-value pairs.
• Username and password storage: Passwords are typically stored
as hashed values for security reasons.

3. Hash Data Contd…
Advantages:
• Extremely fast for data lookup operations using the hash key.
• Efficient for storing and retrieving data when quick access by a
unique identifier is necessary.
Disadvantages:
• Hash collisions can occur when different keys map to the same
hash value, requiring additional techniques to resolve conflicts.
• Not suitable for maintaining order or performing comparisons
between data elements.
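
A minimal sketch using Python's built-in dict, which is itself a hash table,
for the username/password example; real systems would use salted, slow
password hashes (e.g. bcrypt) rather than plain SHA-256.

```python
import hashlib

password_hashes = {}  # dict = hash table: average O(1) lookup by key

def store(username, password):
    # Illustration only: store a hash of the password, never the password itself.
    password_hashes[username] = hashlib.sha256(password.encode()).hexdigest()

def check(username, password):
    return password_hashes.get(username) == hashlib.sha256(password.encode()).hexdigest()

store("alice", "s3cret")
print(check("alice", "s3cret"))  # True
print(check("alice", "wrong"))   # False
```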

Selecting the Right Tools
[Diagram of Python tools for big data; courtesy: https://www.anuupdates.org]

1. NumPy – Python Libraries for Big Data
Purpose:
The foundation for scientific computing in Python, offering a
powerful multidimensional array object (ndarray) for efficient
numerical operations.
Strengths:
o Fast and efficient array operations (vectorized computations).
o Linear algebra capabilities (matrix operations, eigenvalue
decomposition, etc.).
o Integration with other libraries like Pandas and SciPy.
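
A minimal sketch of vectorized array operations and a small piece of linear algebra:

```python
import numpy as np

# Vectorized computation: one call works on the whole array at C speed,
# instead of looping over a million elements in Python.
x = np.arange(1_000_000, dtype=np.float64)
y = np.sqrt(x) + 2.0 * x

# Linear algebra on the same ndarray machinery.
a = np.array([[2.0, 1.0], [1.0, 3.0]])
print(y[:3], np.linalg.eigvals(a))
```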

2. Pandas
Purpose:
A high-performance, easy-to-use data analysis and manipulation
library built on top of NumPy.
Strengths:
o DataFrames (tabular data structures) for flexible and efficient
data handling.
o Time series functionality (date/time data manipulation).
o Grouping and aggregation operations.
o Data cleaning and wrangling capabilities.
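
A minimal sketch of a DataFrame with a little cleaning, grouping, and
aggregation (the data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi"],
    "sales": [120, 90, 200, None],
})

# Cleaning: fill the missing value, then group and aggregate.
df["sales"] = df["sales"].fillna(0)
print(df.groupby("city")["sales"].agg(["sum", "mean"]))
```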

3. Dask
Purpose:
A parallel processing framework built on NumPy and Pandas,
allowing you to scale computations across multiple cores or
machines.
Strengths:
• Scalable parallel execution of NumPy and Pandas operations on
large datasets.
• Fault tolerance and efficient handling of data distribution.
• Ability to use existing NumPy and Pandas code with minor
modifications.
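
A minimal sketch, assuming hypothetical CSV files transactions-*.csv with
customer_id and amount columns; Dask reads them lazily as many partitions
and only runs the computation when asked.

```python
import dask.dataframe as dd

# Lazily read many CSV files as one partitioned dataframe.
ddf = dd.read_csv("transactions-*.csv")

# Familiar Pandas-style operations, executed in parallel across partitions
# when .compute() is called.
result = ddf.groupby("customer_id")["amount"].sum().compute()
print(result.head())
```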

4. SciPy
Purpose:
A collection of algorithms and functions for scientific computing
and technical computing, built on top of NumPy and often relying
on NumPy arrays.
Strengths:
• Wide range of scientific functions (optimization, integration,
interpolation, etc.).
• Statistical analysis and modeling tools.
• Signal and image processing capabilities.
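
A minimal sketch touching two of these areas, optimization and statistics:

```python
import numpy as np
from scipy import optimize, stats

# Optimization: minimize a simple quadratic; the minimum is at x = 3.
res = optimize.minimize(lambda x: (x[0] - 3.0) ** 2 + 1.0, x0=[0.0])
print("Minimum near x =", res.x)

# Statistics: two-sample t-test on synthetic data.
a = np.random.normal(0.0, 1.0, size=100)
b = np.random.normal(0.5, 1.0, size=100)
t_stat, p_value = stats.ttest_ind(a, b)
print("t =", t_stat, "p =", p_value)
```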

5. Scikit-learn
Purpose:
A comprehensive and user-friendly machine learning library offering
a variety of algorithms and tools for classification, regression,
clustering, dimensionality reduction, and more.
Strengths:
• Extensive collection of well-tested machine learning algorithms.
• Easy-to-use API for building and evaluating models.
• Scalability and efficiency for working with large datasets.
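
A minimal sketch: fit a classifier on the bundled Iris dataset and evaluate
it on a held-out split.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```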

Unit 2 – Data Wrangling, Data Cleaning and Preparation
S3 - General Programming Tips for Dealing with Large Datasets

Programming Tips – Block Diagram
[Block diagram of general programming tips; courtesy: https://www.anuupdates.org]

Programming Tips

1. Avoid Duplicating Existing Efforts
(Don't reinvent the wheel)

i. Avoid repetition:
Avoiding the repetition of others' work is likely superior to merely
"avoiding repeating yourself." Act in a way that adds significance and
worth; revisiting an issue that has previously been resolved is
inefficient. As a data scientist, there are two fundamental principles
that can enhance your productivity while working with enormous
datasets:
ii. Harness the potential of databases.
Most data scientists first choose to create their analytical base tables
within a database when dealing with huge data sets. This strategy is
effective for preparing straightforward features. Determine whether
user-defined functions and procedures can be utilized for advanced
modeling during this preparation. The last example in this chapter
demonstrates how to incorporate a database into your workflow.

1. Avoid Duplicating Existing Efforts Contd…
iii. Utilize optimized libraries.
Developing libraries such as Mahout, Weka, and other machine learning
toolkits demands effort and expertise. These products are highly optimized
and utilize best practices and cutting-edge technologies. Focus your
attention on accomplishing tasks rather than duplicating or reiterating the
work of others, unless it is for the purpose of understanding how things work.

2. Get the Most Out of Your Hardware
Over-utilization of resources can slow programs down and cause them to
fail. Shifting workload from overtaxed to underutilized resources can be
achieved using the following techniques:

1. Feeding the CPU compressed data: Shift work from the hard disk to the
CPU by reading compressed data, so the CPU is not left waiting on slow
disk I/O (CPU starvation). A minimal sketch is shown after this list.
2. Utilizing the GPU: Switch to the GPU for parallelizable computations due
to its higher throughput.
3. Using CUDA packages: Use CUDA packages like PyCUDA for
parallelization.
4. Using multiple threads: Parallelize computations on the CPU using
normal Python threads.
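
The sketch assumes a hypothetical gzip-compressed file large_data.csv.gz:
the disk transfers fewer bytes and the CPU spends some cycles decompressing,
which eases the I/O bottleneck.

```python
import pandas as pd

# Read a gzip-compressed CSV in chunks: less disk I/O, a bit more CPU work.
reader = pd.read_csv("large_data.csv.gz", compression="gzip", chunksize=100_000)
for chunk in reader:
    pass  # process each decompressed chunk here
```
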
3. Reduce the Computing Need
Utilize a profiler to identify and remediate slow parts of your code.

1. Use compiled code, especially when loops are involved, and functions
from packages optimized for numerical computations.
2. If a package is not available, compile the code yourself.
3. Use computational libraries like LAPACK, BLAS, Intel MKL, and ATLAS for
high performance.
4. Avoid pulling all the data into memory when working with data that
doesn't fit in memory.
5. Use generators to avoid intermediate data storage by returning data per
observation instead of in batches (see the sketch after this list).
6. Use as little data as possible if no large-scale algorithm is available.
7. Use math skills to simplify calculations.
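
A sketch of tip 5, assuming a hypothetical file values.txt with one number
per line: the generator yields one processed observation at a time, so no
intermediate list is ever built.

```python
def squared_values(path):
    """Yield one processed value at a time instead of building a full list."""
    with open(path) as f:
        for line in f:
            yield float(line) ** 2

total = sum(squared_values("values.txt"))  # only one observation in memory at a time
print(total)
```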

References
• The problems we face when handling large data – Acharya Nagarjuna
University Syllabus, Important Questions, Materials.
https://www.anuupdates.org/2024/03/the-problems-we-face-when-handling-large-data.html

