
Analyzing the Impact of Python Libraries on Data Science

Master's thesis

A report submitted for the course Programming with Python (DLMDSPWP01)

Submitted by
Subodhini Balu Bhosale

42310022

Under the supervision of


Dr. Cosmina Croitoru

Master's in Computer Science

IU International University of Applied Sciences, Berlin

May 2024

Abstract

This assignment, titled "Analyzing the Impact of Python Libraries on Data Science," examines the pivotal role of Python libraries in shaping contemporary data science practice. Beginning with an overview of Python's rise as a premier language in the field, it underscores the significance of libraries in extending Python's capabilities for data manipulation, analysis, and visualization. The main body of the assignment covers three key Python libraries essential for data science: NumPy, Pandas, and Matplotlib. Each section provides a comprehensive examination of the library's functionality, its advantages, and its applications in real-world scenarios. The discussion of NumPy addresses its fundamental role in numerical computing, explaining NumPy arrays, their advantages over conventional Python lists, and NumPy functions for array manipulation, mathematical operations, and linear algebra. The analysis of Pandas underscores its indispensable role in data manipulation and analysis, introducing the Series and DataFrame structures and demonstrating Pandas functions for data cleaning, transformation, filtering, and aggregation. The exploration of Matplotlib highlights the crucial role of data visualization in data science, introducing Matplotlib's capabilities for crafting various types of plots and charts and showcasing its functionality for visualizing data distributions, trends, and relationships. Throughout the assignment, reasoned arguments are supported by theoretical underpinnings and practical illustrations. The conclusion summarizes the assignment's key insights, emphasizing the transformative impact of Python libraries in empowering data scientists to derive actionable insights from complex datasets and to steer data-driven decision-making across diverse domains.

Table of Contents
1. Introduction
1.1. Overview of Python for Data Science
1.2. Importance of Libraries in Extending Python's Capabilities
1.3. Brief Introduction to Key Libraries
1.3.1. NumPy (Numerical Python)
1.3.2. Pandas
1.3.3. Matplotlib
2. NumPy
2.1. Overview of NumPy and its Role in Numerical Computing
2.2. Explanation of NumPy Arrays and their Advantages over Traditional Python Lists
2.3. Demonstrating NumPy Functions for Array Manipulation, Mathematical Operations, and Linear Algebra
2.4. Broadcasting in NumPy
2.5. NumPy for Random Number Generation
2.6. NumPy for Data Cleaning and Preprocessing
3. Pandas
3.1. Introduction to Pandas for Data Manipulation and Analysis
3.2. Overview of Pandas Series and DataFrame Data Structures
3.3. Utilizing Pandas Functions for Data Cleaning, Transformation, Filtering, and Aggregation
3.4. Time Series Analysis with Pandas
3.5. Practical Examples and Use Cases
4. Matplotlib
4.1. Importance of Data Visualization in Data Science
4.2. Introduction to Matplotlib for Creating Various Types of Plots and Charts
4.3. Demonstrating Matplotlib's Functionalities for Visualizing Data Distributions, Trends, and Relationships
4.4. Advanced Customization Techniques
4.5. Creating Animated Plots
5. Conclusion
5.1. Key Points
5.1.1. NumPy
5.1.2. Pandas
5.1.3. Matplotlib
5.2. Summary of Key Arguments
5.3. Outcomes and Perspectives
5.4. Future Considerations
6. Appendix A: Source Code for the Practical Assignment
7. Appendix B: Git Commands

List of Figures
Including NumPy
Creating Arrays
Reshaping Array
NumPy Terminology
Broadcasting in NumPy
Random Number Generation
Data Cleaning
Including Pandas
Creating Pandas Series
Creating DataFrames
The apply() Function in Pandas
The groupby() Function
Time Series Analysis
Practical Examples
Difference Between NumPy and Pandas
Histogram
Box Plot
Scatter Plot
Pie Chart
Advanced Customization Techniques
Result of Advanced Customization Techniques
Animated Plots
Result of Animated Plots

1. Introduction:

The landscape of data science is currently experiencing a significant shift, largely driven by the widespread adoption of Python and its extensive range of libraries tailored for data analysis and manipulation. In this era of unprecedented data abundance, organizations across various sectors are increasingly relying on Python libraries to derive actionable insights from complex datasets. As the volume, velocity, and variety of data continue to grow, the need for robust analytical tools has never been more pressing.

This assignment aims to systematically explore the profound impact of Python libraries on data science, elucidating their significance and implications within the context of contemporary data-driven endeavors. Recent studies and scholarly discourse underscore the pivotal role of Python libraries in shaping the data science landscape. From small-scale startups to multinational corporations, Python has emerged as the language of choice for data professionals seeking to extract value from their data assets.

As organizations grapple with the challenges posed by burgeoning datasets and evolving analytical techniques, the relevance of Python libraries becomes increasingly pronounced. By examining the open questions surrounding the efficacy, limitations, and future directions of Python libraries in data science, we aim to shed light on their transformative potential and pave the way for informed decision-making.

The aim of this assignment is to analyze and evaluate the multifaceted impact of Python libraries on data science practices. By delineating the boundaries of our inquiry and defining key terms, we provide readers with a comprehensive understanding of the parameters within which our analysis operates. Our objective is not only to elucidate the capabilities and limitations of Python libraries but also to explore their broader implications for data science methodologies and workflows. Through rigorous inquiry and critical analysis, we endeavor to contribute to the ongoing discourse surrounding Python's role in shaping the future of data-driven decision-making.

1.1. Overview of Python for Data Science:

Python's popularity in data science can be attributed to its user-friendly design and broad applicability. Its clean and readable syntax lowers the barrier to entry, enabling individuals from diverse backgrounds to quickly grasp the fundamentals of programming. Moreover, Python's dynamic typing and high-level abstractions facilitate rapid prototyping and experimentation, fostering a culture of innovation within the data science community.

Beyond its syntactic elegance, Python's versatility extends to its ecosystem of libraries, which serve as the lifeblood of data science workflows. These libraries augment Python's core functionality, providing specialized tools for tasks ranging from data manipulation to machine learning. As a result, Python has become the language of choice for data professionals seeking to extract actionable insights from complex datasets.

1.2. Importance of Libraries in Extending Python's Capabilities:

While Python's core language features are robust, its true power lies in its extensive collection of third-party libraries. These libraries, developed and maintained by a vibrant community of contributors, extend Python's capabilities in myriad ways, empowering data scientists to tackle real-world challenges with confidence.

In the context of data science, libraries play a pivotal role in accelerating workflows and facilitating reproducible research. By abstracting complex operations into simple function calls, libraries such as NumPy, Pandas, and Matplotlib enable data scientists to focus on high-level analysis rather than low-level implementation details. This abstraction layer promotes code readability and maintainability, facilitating collaboration and knowledge sharing within interdisciplinary teams.

1.3. Brief Introduction to Key Libraries:


1.3.1. NumPy (Numerical Python): NumPy is the cornerstone of numerical computing in Python, providing support for multidimensional arrays and a wide range of mathematical functions. Its efficient array manipulation capabilities make it indispensable for tasks such as matrix operations, linear algebra, and statistical analysis.

1.3.2. Pandas: Pandas is a powerful data manipulation library that introduces two key data structures, Series and DataFrame, for tabular data analysis. Its rich set of functions for data cleaning, transformation, and aggregation streamlines the data preprocessing pipeline, making it a staple in the toolkit of every data scientist.

1.3.3. Matplotlib: Matplotlib is a versatile plotting library that enables the creation of a wide variety of charts, plots, and graphs. From simple line plots to complex heatmaps, Matplotlib offers a comprehensive suite of visualization tools for conveying insights from data. Its seamless integration with NumPy and Pandas makes it a preferred choice for visualizing data in Python.

In summary, Python libraries serve as force multipliers, empowering data scientists to tackle complex analytical challenges with efficiency and ease. In the subsequent sections, we will delve deeper into the functionalities, applications, and impact of these libraries in data science workflows, unraveling the intricacies of Python's role in shaping the future of data-driven decision-making.

2. NumPy:
2.1 Overview of NumPy and its Role in Numerical Computing:

NumPy, short for Numerical Python, stands as a cornerstone of numerical computing in the Python ecosystem. Developed to address the shortcomings of traditional Python lists in handling numerical data, NumPy provides a powerful framework for performing array-based computations efficiently. Its array-oriented computing capabilities make it indispensable for a wide range of scientific and engineering applications, including data analysis, machine learning, and simulations.

At the heart of NumPy lies its array object, ndarray, which enables efficient storage and manipulation of homogeneous data. Unlike Python lists, NumPy arrays are homogeneous, contiguous blocks of memory, allowing for vectorized operations and efficient memory management. This design choice not only enhances computational performance but also facilitates interoperability with other libraries written in low-level languages such as C and Fortran.

2.2 Explanation of NumPy Arrays and their Advantages over Traditional Python Lists:

NumPy arrays offer several advantages over traditional Python lists, making them the preferred data structure for numerical computing tasks. Firstly, NumPy arrays are homogeneous, meaning that all elements within an array must be of the same data type. This enforced homogeneity enables NumPy to leverage optimized, low-level routines for array manipulation and arithmetic operations, resulting in significant performance gains.

Moreover, NumPy arrays are stored in contiguous blocks of memory, allowing for efficient memory access and vectorized operations. This contiguous memory layout enables NumPy to perform array computations in a highly parallelized manner, leveraging the computational power of modern CPUs and GPUs.

[Figure: Including NumPy]

NumPy is imported as 'np' throughout this report for brevity, using the standard Python convention 'import numpy as np'.

Another key advantage of NumPy arrays is their support for multidimensional data. While Python lists must be nested to represent higher-dimensional data, NumPy arrays natively support any number of dimensions, making them suitable for representing structures such as matrices, tensors, and images. This multidimensional capability enables data scientists to work with data in its native form, without the need for cumbersome reshaping or transposing operations.

2.3. Demonstrating NumPy Functions for Array Manipulation, Mathematical Operations, and Linear Algebra:

NumPy provides a rich set of functions and methods for array manipulation, mathematical operations,
and linear algebra. These functions enable data scientists to perform a wide range of tasks, from basic
array manipulation to advanced numerical computations.

[Figure: Creating Arrays]

The example above shows two array manipulation steps:

a. Creating one-dimensional arrays
b. Horizontal stacking, which appends the elements in the same row
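
A minimal sketch of these two steps (the array values are illustrative assumptions, not the figure's original data):

import numpy as np

# Create two one-dimensional arrays
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Horizontal stacking appends the elements in the same row
stacked = np.hstack((a, b))
print(stacked)  # [1 2 3 4 5 6]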

NumPy provides functions for creating arrays of various shapes and sizes, initializing arrays with predefined values, and reshaping arrays to suit specific requirements. Additionally, NumPy offers a plethora of mathematical functions for performing element-wise operations such as addition, subtraction, multiplication, and division.

The example beside shows:

a. Creating an array
b. Reshaping the existing array and assigning the result a new name
c. Printing the new array

[Figure: Reshaping Array]
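
A hedged sketch of this reshaping workflow (the array contents are assumptions):

import numpy as np

arr = np.arange(12)           # One-dimensional array containing 0..11
reshaped = arr.reshape(3, 4)  # Reshape into 3 rows and 4 columns under a new name
print(reshaped)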

Furthermore, NumPy boasts a comprehensive suite of linear algebra functions for performing common operations such as matrix multiplication, matrix inversion, and eigenvalue decomposition. These functions enable data scientists to tackle complex mathematical problems with ease, making NumPy a versatile tool for numerical computing tasks.

[Figure: NumPy Terminology]
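
To make these routines concrete, a minimal sketch of matrix multiplication, inversion, and eigenvalue decomposition (the matrix values are arbitrary):

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[5.0, 6.0], [7.0, 8.0]])

product = A @ B                      # Matrix multiplication
inverse = np.linalg.inv(A)           # Matrix inversion
eigvals, eigvecs = np.linalg.eig(A)  # Eigenvalue decomposition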

2.4 Broadcasting in NumPy:

Broadcasting is a powerful mechanism that allows NumPy to perform arithmetic operations on arrays
of different shapes. It automatically expands the smaller array to match the shape of the larger array,
enabling element-wise operations without the need for explicit looping.

Example:

[Figure: Broadcasting in NumPy]
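
A minimal sketch of broadcasting, assuming a 2-D array and a 1-D array of compatible shape:

import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])  # Shape (2, 3)
row = np.array([10, 20, 30])               # Shape (3,)

# The 1-D array is broadcast across each row of the 2-D array
result = matrix + row
print(result)  # [[11 22 33] [14 25 36]]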

2.5 NumPy for Random Number Generation:

NumPy includes a robust suite of functions for generating random numbers, which are essential for
simulations, statistical modeling, and machine learning tasks.

Example:

[Figure: Random Number Generation]
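
A hedged sketch of NumPy's random number generation (the seed and distribution parameters are assumptions):

import numpy as np

rng = np.random.default_rng(seed=42)             # Seeded generator for reproducibility

uniform = rng.random(5)                          # Five samples from [0, 1)
normal = rng.normal(loc=0, scale=1, size=5)      # Five samples from a standard normal
integers = rng.integers(low=0, high=10, size=5)  # Five random integers in [0, 10)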

2.6 NumPy for Data Cleaning and Preprocessing:

NumPy is widely used in data preprocessing and cleaning tasks, such as handling missing values,
normalizing data, and transforming data types.

Example:

[Figure: Data Cleaning]
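
A minimal sketch of such preprocessing steps, assuming a small array with one missing value:

import numpy as np

data = np.array([1.0, 2.0, np.nan, 4.0])

# Handle missing values: replace NaN with the mean of the observed entries
mean_val = np.nanmean(data)
cleaned = np.where(np.isnan(data), mean_val, data)

# Normalize to the [0, 1] range (min-max scaling)
normalized = (cleaned - cleaned.min()) / (cleaned.max() - cleaned.min())

# Transform the data type
as_int = cleaned.astype(np.int64)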

In summary, NumPy serves as a powerful toolkit for numerical computing in Python, offering efficient array-based data structures and a wide range of functions for array manipulation, mathematical operations, and linear algebra. Its capabilities are essential for many scientific and engineering applications, providing the foundation for data analysis, machine learning, and more. With a strong understanding of NumPy, we are now ready to explore the Pandas library, which builds on NumPy to provide even more powerful data manipulation and analysis tools.

3. Pandas:

3.1 Introduction to Pandas for Data Manipulation and Analysis:

In the dynamic realm of data science, Pandas emerges as an indispensable tool, serving as the
bedrock for data manipulation and analysis in Python. Its widespread adoption and robust functionality
make it the cornerstone of countless data-driven projects, facilitating the exploration, transformation,
and analysis of diverse datasets. As we embark on this journey to explore Pandas comprehensively,
we delve into its multifaceted capabilities, illuminating its pivotal role in empowering data scientists,
analysts, and researchers to extract actionable insights from complex data.

[Figure: Including Pandas]

Pandas is imported as 'pd' using the standard Python convention 'import pandas as pd'.

3.2 Overview of Pandas Series and DataFrame Data Structures:

At the nucleus of Pandas lie two foundational data structures: Series and DataFrame. The Pandas Series represents a one-dimensional array-like object, equipped with labels or indices for efficient data access. It encapsulates a single column of data, enabling users to manipulate and analyze data with granularity and precision. The Pandas DataFrame, on the other hand, extends this functionality to a two-dimensional tabular structure, akin to a spreadsheet or database table. With rows and columns labeled for easy identification, DataFrames offer a structured framework for organizing, exploring, and visualizing data of varying dimensions and complexities.

[Figure: Creating Pandas Series]
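
For illustration, a minimal sketch of constructing a labeled Series (the values and index labels are assumptions):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])  # Label-based access returns 20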

The versatility of the Pandas Series and DataFrame data structures is underscored by their ability to accommodate heterogeneous data types seamlessly. Whether dealing with numerical measurements, categorical variables, or textual descriptions, Pandas provides a unified interface for handling diverse data formats. This inherent flexibility empowers users to perform a myriad of operations, from data aggregation and summarization to advanced statistical analysis and machine learning modeling.

[Figure: Creating DataFrames]
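
Similarly, a hedged sketch of building a DataFrame from a dictionary of columns (the column names and values are assumptions):

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol'],
    'age': [25, 30, 35],
    'city': ['Berlin', 'Munich', 'Hamburg'],
})
print(df.head())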

3.3 Utilizing Pandas Functions for Data Cleaning, Transformation, Filtering, and Aggregation:

Pandas empowers users with a vast array of functions and methods for data cleaning, transformation,
filtering, and aggregation, facilitating the construction of robust data pipelines. These functions serve
as building blocks for preprocessing raw data, ensuring its quality, integrity, and consistency before
analysis.

For instance, Pandas offers a suite of functions for handling missing data, including isnull(), dropna(), and fillna(), enabling users to address data incompleteness effectively. Moreover, Pandas facilitates data transformation through functions like map(), apply(), and groupby(), allowing users to apply custom functions to data elements, group data by specific criteria, and compute aggregate statistics with ease.

[Figure: The apply() Function in Pandas]

[Figure: The groupby() Function]
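
A minimal sketch combining the functions named above (the sample data are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'value': [1.0, np.nan, 3.0, 4.0],
})

print(df['value'].isnull())                         # Flag missing entries
df['value'] = df['value'].fillna(0)                 # Replace missing values
df['doubled'] = df['value'].apply(lambda v: v * 2)  # Apply a custom function element-wise
print(df.groupby('group')['value'].sum())           # Aggregate statistics per group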

Furthermore, Pandas provides robust support for data filtering and selection, offering intuitive indexing
and slicing mechanisms. Whether extracting subsets of data based on conditional criteria or selecting
specific columns for analysis, Pandas' expressive syntax streamlines the process of data extraction
and exploration.
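
A hedged sketch of conditional filtering and column selection (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35], 'city': ['Berlin', 'Munich', 'Hamburg']})

over_28 = df[df['age'] > 28]             # Boolean filtering on a condition
cities = df.loc[df['age'] > 28, 'city']  # Select one column for the matching rows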

3.4 Time Series Analysis with Pandas:

Pandas excels in handling time series data, providing tools for datetime manipulation, resampling, and
rolling windows.

Examples:

[Figure: Time Series Analysis]
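
A minimal sketch of these time series tools (the date range and values are assumptions):

import numpy as np
import pandas as pd

idx = pd.date_range('2024-01-01', periods=10, freq='D')  # Daily datetime index
ts = pd.Series(np.arange(10, dtype=float), index=idx)

weekly = ts.resample('W').mean()       # Downsample to weekly averages
rolling = ts.rolling(window=3).mean()  # Three-day rolling mean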

3.5 Practical Examples and Use Cases:

Providing practical examples and use cases can help illustrate how Pandas is used in real-world scenarios, such as data cleaning, financial analysis, and machine learning preprocessing.

Examples:

[Figure: Practical Examples]
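
For instance, a hedged sketch of a small preprocessing pipeline of the kind used before machine learning (the file name and column names are hypothetical):

import pandas as pd

df = pd.read_csv('sales.csv')                       # Hypothetical input file
df = df.dropna(subset=['revenue'])                  # Drop rows with missing revenue
df['revenue_eur'] = df['revenue'] * 0.92            # Illustrative currency conversion
monthly = df.groupby('month')['revenue_eur'].sum()  # Total revenue per month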

In summary, Pandas serves as a comprehensive and versatile library for data manipulation and analysis, providing robust data structures and a wealth of functions for cleaning, transforming, and visualizing data. Its integration with other libraries and powerful features make it an indispensable tool in the data science workflow. Next, we will explore Matplotlib, delving into how it further enhances our ability to analyze and interpret data.

[Figure: Difference Between NumPy and Pandas]

4. Matplotlib:
4.1 Importance of Data Visualization in Data Science:

Data visualization stands as an indispensable component of the data science toolkit, serving as a
bridge between raw data and actionable insights. In today's data-driven world, the ability to effectively
communicate complex patterns, trends, and relationships through visual representations is crucial for
driving informed decision-making and achieving organizational objectives. By harnessing the power of
data visualization, data scientists can uncover hidden patterns, identify outliers, and communicate
their findings to stakeholders in a clear and intuitive manner.

Data visualization plays a multifaceted role across various stages of the data analysis lifecycle. During
exploratory data analysis (EDA), visualizations serve as a lens through which analysts can gain initial
insights into the underlying structure of the data. From identifying data distributions and correlations to
detecting anomalies and outliers, visualizations provide a comprehensive overview of the dataset,
guiding subsequent analysis and hypothesis generation.

Moreover, in the model development phase, data visualization enables data scientists to evaluate model performance, assess the validity of assumptions, and identify areas for improvement. By visualizing model predictions against actual outcomes, analysts can gain insights into the model's predictive capabilities and identify instances where the model may be underperforming or overfitting the data.

Furthermore, in the presentation of findings and insights, data visualization plays a crucial role in conveying complex analytical results to non-technical stakeholders. Through visually compelling charts, graphs, and dashboards, data scientists can distill key insights from the data and communicate them in a digestible format, empowering stakeholders to make informed decisions and take appropriate actions.

Histograms are graphical representations of the distribution of data, in which data values are grouped into intervals called bins and plotted as bars. They provide insights into the frequency or density of data across different ranges.

[Figure: Histogram]
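
A minimal sketch of plotting a histogram with Matplotlib (the sample data are assumptions):

import matplotlib.pyplot as plt
import numpy as np

data = np.random.default_rng(0).normal(size=1000)  # Illustrative sample

plt.hist(data, bins=30, edgecolor='black')  # Group values into 30 bins
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of a Normally Distributed Sample')
plt.show()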

4.2 Introduction to Matplotlib for Creating Various Types of Plots and Charts:
Matplotlib, a cornerstone of the Python data visualization ecosystem, offers a comprehensive suite of tools for creating a wide range of plots and charts. With its intuitive interface and powerful customization options, Matplotlib empowers users to generate static, animated, and interactive visualizations tailored to their specific analytical needs.

At its core, Matplotlib provides a high-level interface for creating basic plots, such as line plots, scatter
plots, bar charts, histograms, and pie charts. These fundamental plot types serve as building blocks
for more complex visualizations, allowing users to explore relationships, distributions, and trends
within their datasets.

In addition to basic plot types, Matplotlib offers support for advanced visualizations, including 3D plots, geographic maps, and statistical plots. Through its integration with other Python libraries, such as NumPy and Pandas, Matplotlib enables seamless data integration and visualization, facilitating the exploration of multidimensional datasets and complex relationships.

4.3 Demonstrating Matplotlib's Functionalities for Visualizing Data Distributions, Trends, and Relationships:

Matplotlib's versatility shines through its ability to visualize data distributions, trends, and relationships across diverse domains. Whether analyzing financial data, social networks, or scientific measurements, Matplotlib provides a rich set of functionalities for exploring and interpreting complex datasets.

For instance, Matplotlib's plt.plot() function enables users to create line plots, ideal for visualizing
trends and patterns over time or across different variables. By plotting data points connected by lines,
users can identify temporal trends, cyclical patterns, and long-term relationships within the data.
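
A hedged sketch of a simple line plot (the data are assumptions):

import matplotlib.pyplot as plt

months = range(1, 13)
values = [3, 4, 6, 5, 7, 8, 9, 8, 7, 6, 5, 4]  # Illustrative monthly values

plt.plot(months, values, marker='o')
plt.xlabel('Month')
plt.ylabel('Value')
plt.title('Trend Over Time')
plt.show()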

A box plot, or box-and-whisker plot, is a graphical representation of the distribution of a dataset, displaying key summary statistics such as the median, quartiles, and outliers. It provides a visual summary of the data's central tendency, variability, and skewness.

[Figure: Box Plot]
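
A minimal sketch of a box plot, assuming three arbitrary groups of data:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
groups = [rng.normal(0, 1, 100), rng.normal(1, 2, 100), rng.normal(-1, 0.5, 100)]

plt.boxplot(groups)  # Median, quartiles, and outliers per group
plt.xticks([1, 2, 3], ['A', 'B', 'C'])
plt.ylabel('Value')
plt.show()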

Similarly, the plt.scatter() function facilitates the creation of scatter plots, allowing users to explore relationships between variables and detect patterns or clusters within the data. Scatter plots are particularly useful for identifying correlations, outliers, and nonlinear relationships, providing valuable insights into the underlying structure of the data.

Moreover, Matplotlib offers support for creating histograms (plt.hist()), bar charts (plt.bar()), and box plots (plt.boxplot()), among other types of plots, enabling users to analyze data distributions, compare categorical variables, and identify potential outliers or anomalies. These visualization techniques are instrumental in uncovering hidden patterns, assessing data quality, and deriving actionable insights from the data.

Scatter plots display the relationship between two variables, with each data point representing a pair of values. They visually illustrate patterns, correlations, or trends in the data, aiding in understanding the strength and direction of relationships between variables.

[Figure: Scatter Plot]
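
A hedged sketch of a scatter plot (the data are assumptions):

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
x = rng.random(50)
y = 2 * x + rng.normal(0, 0.1, 50)  # Roughly linear relationship with noise

plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Relationship Between Two Variables')
plt.show()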

Furthermore, Matplotlib's extensive customization options allow users to fine-tune the appearance and aesthetics of their plots, including colors, markers, line styles, axis labels, titles, and annotations. By customizing the visual elements of their plots, users can create visually appealing and informative visualizations that effectively convey key insights and findings to diverse audiences.

Pie charts visually represent proportions of a whole, which is useful for illustrating simple distributions. However, they can be misleading when comparing values or categories and are not suitable for complex datasets, which can lead to misinterpretation.

[Figure: Pie Chart]
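
A minimal sketch of a pie chart (the category names and shares are assumptions):

import matplotlib.pyplot as plt

shares = [45, 30, 15, 10]  # Illustrative proportions
labels = ['Product A', 'Product B', 'Product C', 'Other']

plt.pie(shares, labels=labels, autopct='%1.1f%%')
plt.title('Share of Total Sales')
plt.show()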

4.4 Advanced Customization Techniques

Discussing advanced customization techniques can help users create more polished, publication-quality plots.

Examples:

[Figure: Advanced Customization Techniques]
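
As one illustration, a hedged sketch of customization features commonly used for publication-quality figures (the data and styling choices are assumptions):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

fig, ax = plt.subplots(figsize=(8, 4), dpi=150)
ax.plot(x, np.sin(x), color='tab:blue', linestyle='--', linewidth=2, label='sin(x)')
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.set_title('Customized Plot')
ax.grid(True, alpha=0.3)
ax.annotate('peak', xy=(np.pi / 2, 1.0), xytext=(2.0, 0.8),
            arrowprops=dict(arrowstyle='->'))
ax.legend()
fig.tight_layout()
plt.show()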

Result:

[Figure: Result of Advanced Customization Techniques]

4.5 Creating Animated Plots:


Animating plots can be useful for illustrating changes over time or steps in an algorithm.

Example:

[Figure: Animated Plots]
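
A minimal sketch using matplotlib.animation.FuncAnimation, assuming a sine wave whose phase advances on every frame:

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.animation import FuncAnimation

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x))

def update(frame):
    # Shift the phase a little on each frame
    line.set_ydata(np.sin(x + 0.1 * frame))
    return (line,)

anim = FuncAnimation(fig, update, frames=100, interval=50, blit=True)
plt.show()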

Result:

[Figure: Result of Animated Plots]

In summary, Matplotlib stands as a cornerstone of data visualization in Python, offering a rich array of plotting functions and customization options to visualize data distributions, trends, and relationships effectively. Whether through simple plots or complex visualizations, Matplotlib empowers data scientists to communicate their insights clearly and effectively. By leveraging its extensive capabilities, data scientists can transform raw data into meaningful visual representations that drive informed decision-making and actionable outcomes.

5. Conclusion:

Throughout this assignment, we have delved into the significant impact of Python libraries, notably
NumPy, Pandas, and Matplotlib, on data science. These libraries are essential for data manipulation,
analysis, and visualization, enabling data scientists to extract actionable insights from complex
datasets and make informed decisions.

5.1. Key Points:


5.1.1 NumPy:
• Central to numerical computing with its array-based data structures and extensive mathematical functions.
• Efficiently handles large datasets, from basic arithmetic to advanced linear algebra computations.
5.1.2. Pandas:
• Revolutionizes data manipulation and analysis with its intuitive Series and DataFrame structures.
• Facilitates data cleaning, transformation, aggregation, and exploratory data analysis across various domains.
5.1.3. Matplotlib:
• Provides robust data visualization capabilities with a wide range of plotting functions.
• Enables effective communication of data distributions, trends, and relationships through visually appealing plots.

In drawing these arguments to a close, it is evident that Python libraries play a pivotal role in advancing data science. They offer robust frameworks for managing and analyzing data, thereby driving innovation and informed decision-making.

5.2. Summary of Key Arguments:


• NumPy's efficiency and computational power make it indispensable for numerical analysis.
• Pandas' data manipulation capabilities simplify complex data processes.
• Matplotlib's visualization tools enhance the clarity and impact of data presentations.
5.3. Outcomes and Perspectives:
• Python libraries have solidified their position as foundational tools in data science.
• They provide the necessary tools for transforming raw data into valuable insights.
• The continuous evolution of these libraries promises further advancements in data analysis techniques.
5.4. Future Considerations:
• As data continues to grow in volume and complexity, the development of more scalable and efficient tools for data manipulation, analysis, and visualization will be crucial.
• Further research could explore the integration of these libraries with emerging technologies such as machine learning and artificial intelligence to enhance data-driven decision-making processes.

In conclusion, by harnessing the power of NumPy, Pandas, Matplotlib, and other Python libraries, data scientists can effectively navigate the challenges of modern data analysis, fostering innovation and impact across diverse fields.


Appendix A: Source Code for the Practical Assignment


import unittest

import numpy as np
import pandas as pd
from sqlalchemy import create_engine, MetaData, Table, Column, Float
from sqlalchemy.orm import sessionmaker
from bokeh.plotting import figure, output_file, save, show
from bokeh.layouts import column


def load_data_from_csv(file_path):
    """
    Load data from a CSV file into a pandas DataFrame.

    Args:
        file_path (str): Path to the CSV file.

    Returns:
        pandas.DataFrame: DataFrame containing the loaded data.
    """
    print(f"Loading data from {file_path}")
    df = pd.read_csv(file_path)
    print(f"Loaded {len(df)} rows with columns: {list(df.columns)}")
    return df


def create_table_from_df(engine, table_name, df):
    """
    Create a database table based on the structure of a DataFrame.

    Args:
        engine: SQLAlchemy engine object.
        table_name (str): Name of the table to be created.
        df (pandas.DataFrame): DataFrame whose structure defines the table schema.
    """
    metadata = MetaData()
    # Create table definition based on DataFrame columns
    columns = [Column('x', Float)]
    for col in df.columns[1:]:
        columns.append(Column(col, Float))
    table = Table(table_name, metadata, *columns)
    metadata.create_all(engine)
    print(f"Table '{table_name}' created with columns: {list(df.columns)}")


def insert_data(session, engine, table_name, df):
    """
    Insert data from a DataFrame into a database table.

    Args:
        session: SQLAlchemy session object.
        engine: SQLAlchemy engine object.
        table_name (str): Name of the table to insert data into.
        df (pandas.DataFrame): DataFrame containing the data to be inserted.
    """
    metadata = MetaData()
    metadata.reflect(bind=engine)
    table = metadata.tables[table_name]
    for index, row in df.iterrows():
        ins = table.insert().values(row.to_dict())
        session.execute(ins)
    session.commit()
    print(f"Inserted {len(df)} rows into table '{table_name}'")


def load_table_to_df(engine, table_name):
    """
    Load data from a database table into a pandas DataFrame.

    Args:
        engine: SQLAlchemy engine object.
        table_name (str): Name of the table to load data from.

    Returns:
        pandas.DataFrame: DataFrame containing the loaded data.
    """
    query = f"SELECT * FROM {table_name}"
    df = pd.read_sql(query, engine)
    return df


def find_best_ideal_functions(training_df, ideal_df):
    """
    Find the best ideal functions for each training function.

    Args:
        training_df (pandas.DataFrame): DataFrame containing training data.
        ideal_df (pandas.DataFrame): DataFrame containing ideal functions.

    Returns:
        dict: A dictionary mapping each training function to its best
        corresponding ideal function.
    """
    best_ideal_funcs = {}
    for train_col in training_df.columns[1:]:  # Skip 'x' column
        min_ssr = float('inf')
        best_func = None
        for ideal_col in ideal_df.columns[1:]:  # Skip 'x' column
            ssr = np.sum((training_df[train_col] - ideal_df[ideal_col]) ** 2)
            if ssr < min_ssr:
                min_ssr = ssr
                best_func = ideal_col
        best_ideal_funcs[train_col] = best_func
    return best_ideal_funcs


def approximate_test_data(test_df, ideal_df, best_ideal_funcs):
    """
    Approximate test data using the best ideal functions identified.

    Args:
        test_df (pandas.DataFrame): DataFrame containing test data.
        ideal_df (pandas.DataFrame): DataFrame containing ideal functions.
        best_ideal_funcs (dict): Dictionary mapping each training function to
        its best ideal function.

    Returns:
        dict: A dictionary containing the residuals for each test data point.
    """
    residuals = {}
    for test_idx, test_row in test_df.iterrows():
        x_val = test_row['x']
        y_test = test_row['y']
        closest_ideal = None
        min_residual = float('inf')
        for ideal_col in best_ideal_funcs.values():
            y_ideal = ideal_df.loc[ideal_df['x'] == x_val, ideal_col].values[0]
            residual = np.abs(y_test - y_ideal)
            if residual < min_residual:
                min_residual = residual
                closest_ideal = ideal_col
        residuals[test_idx] = (x_val, y_test, closest_ideal, min_residual)
    return residuals


class TestFunctions(unittest.TestCase):

    def test_load_data_from_csv(self):
        """Test loading data from a CSV file."""
        test_df = load_data_from_csv('test.csv')
        self.assertEqual(len(test_df), 10)

    def test_find_best_ideal_functions(self):
        """Test finding the best ideal functions."""
        training_df = pd.DataFrame({'x': range(10), 'y1': np.random.rand(10),
                                    'y2': np.random.rand(10)})
        ideal_df = pd.DataFrame({'x': range(10), 'y1': np.random.rand(10),
                                 'y2': np.random.rand(10)})
        best_ideal_funcs = find_best_ideal_functions(training_df, ideal_df)
        self.assertEqual(len(best_ideal_funcs), len(training_df.columns) - 1)


def visualize_training_data(training_df):
    """
    Visualize training data using Bokeh.

    Args:
        training_df (pandas.DataFrame): DataFrame containing training data.
    """
    output_file("training_data.html")
    p = figure(title="Training Data", x_axis_label='x', y_axis_label='y',
               width=800, height=400)
    for col in training_df.columns[1:]:
        p.line(training_df['x'], training_df[col], legend_label=col)
    p.legend.click_policy = "hide"
    save(p)


def visualize_test_data(test_df, residuals, ideal_df):
    """
    Visualize test data and residuals using Bokeh.

    Args:
        test_df (pandas.DataFrame): DataFrame containing test data.
        residuals (dict): Dictionary containing residuals for each test data point.
        ideal_df (pandas.DataFrame): DataFrame containing ideal functions.
    """
    output_file("test_data.html")
    p = figure(title="Test Data", x_axis_label='x', y_axis_label='y',
               width=800, height=400)
    p.circle(test_df['x'], test_df['y'], legend_label='Test Data', color='blue')
    for idx, res in residuals.items():
        p.line([res[0], res[0]],
               [res[1], ideal_df.loc[ideal_df['x'] == res[0], res[2]].values[0]],
               legend_label=f'Test Data {idx}', color='red')
    p.legend.click_policy = "hide"
    save(p)


class DataVisualizer:

    @staticmethod
    def plot_data(training_df, ideal_df, test_df, residuals, best_ideal_funcs):
        """
        Plot training data, ideal functions, test data, and residuals using Bokeh.

        Args:
            training_df (pandas.DataFrame): DataFrame containing training data.
            ideal_df (pandas.DataFrame): DataFrame containing ideal functions.
            test_df (pandas.DataFrame): DataFrame containing test data.
            residuals (dict): Dictionary containing residuals for each test data point.
            best_ideal_funcs (dict): Dictionary mapping each training function to
            its best ideal function.
        """
        p = figure(title="Training Data vs Ideal Functions")
        for col in training_df.columns[1:]:
            p.line(training_df['x'], training_df[col],
                   legend_label=f"Training {col}", line_width=2)
        for col in best_ideal_funcs.values():
            p.line(ideal_df['x'], ideal_df[col], legend_label=f"Ideal {col}",
                   line_width=2, line_dash="dashed")
        p2 = figure(title="Test Data and Residuals", x_range=p.x_range,
                    y_range=p.y_range)
        p2.scatter(test_df['x'], test_df['y'], legend_label="Test Data", color="red")
        for idx, (x, y_test, closest_ideal, residual) in residuals.items():
            p2.line([x, x],
                    [y_test, ideal_df.loc[ideal_df['x'] == x, closest_ideal].values[0]],
                    line_width=1, color="black")
        show(column(p, p2))


def main():
    """
    Main function to orchestrate the data loading, analysis, and visualization process.
    """
    # Database setup
    engine = create_engine(r'sqlite:///C:/Users/ECS/Desktop/assignment/database.db')
    Session = sessionmaker(bind=engine)
    session = Session()

    # Load data from tables
    training_df = load_table_to_df(engine, 'training_data')
    ideal_df = load_table_to_df(engine, 'ideal_functions')
    test_df = load_table_to_df(engine, 'test_data')

    # Find the best ideal functions for the training data
    best_ideal_funcs = find_best_ideal_functions(training_df, ideal_df)
    print("Best ideal functions identified for each training function:")
    print(best_ideal_funcs)

    # Approximate the test data using the best ideal functions
    residuals = approximate_test_data(test_df, ideal_df, best_ideal_funcs)
    print("Residuals for test data approximation:")
    for idx, res in residuals.items():
        print(f"Test Data Index {idx}: x = {res[0]}, y_test = {res[1]}, "
              f"closest_ideal = {res[2]}, residual = {res[3]}")

    # Visualize training data
    visualize_training_data(training_df)

    # Visualize test data
    DataVisualizer.plot_data(training_df, ideal_df, test_df, residuals, best_ideal_funcs)


if __name__ == '__main__':
    main()

GitHub link to this project: https://github.com/Subu31/practical_project

Appendix B: Git Commands


1. Clone the Repository (a command sketch for steps 1-4 follows this list)

2. Create and Switch to a New Branch

3. Add and Commit Changes

4. Push Changes to Remote Branch

5. Create a Pull Request

After pushing the changes, go to the repository on GitHub and create a pull request from the develop
branch to the main branch. Provide a title and description, then submit it for review.

6. Merge the Pull Request

Once the pull request is reviewed and approved, it can be merged into the main branch via the GitHub
web interface.
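
For steps 1-4, a hedged sketch of the corresponding commands (the repository URL is taken from Appendix A; the branch name "develop" follows step 5 above and is otherwise an assumption):

# 1. Clone the repository
git clone https://github.com/Subu31/practical_project.git
cd practical_project

# 2. Create and switch to a new branch
git checkout -b develop

# 3. Add and commit changes
git add .
git commit -m "Add practical assignment code"

# 4. Push changes to the remote branch
git push -u origin develop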

