Python for Effect
Master Data
Visualization and
Analysis
Learn Data Pipelines, Machine
Learning, Advanced Statistical Analysis
and Visualization with Jupyter
Notebook
Tomasz Trebacz
Copyright © 2024 by Tomasz Trebacz
No portion of this book may be reproduced in any form without written permission from the
publisher or author, except as permitted by U.S. copyright law.
1st edition 2024
Contents
Introduction 1
Conclusion 144
Appendix 147
References 199
Introduction
Setting Up
Your Python
Environment
As you stand on the cusp of delving into the vast and intricate world of data science, imagine for a moment the power of a well-oiled machine. Just as a finely tuned engine propels a vehicle forward with precision and speed, a properly configured Python environment serves as the driving force behind your data analysis endeavors. Setting up this environment might seem like a mundane task, yet it is the foundational cornerstone upon which effective data analysis and visualization are built. The tools, configurations, and optimizations you choose today will shape your ability to navigate the complexities of data with agility and confidence. By meticulously configuring your Python environment, you ensure that your analytical processes run smoothly, free from the friction of technical hiccups or compatibility issues. In this chapter, we will unravel the intricacies of establishing a robust Python setup, exploring the merits of Anaconda, an indispensable platform for data science, renowned for its capability to manage packages and dependencies seamlessly. This journey is not merely about installation; it's about crafting a tailored, efficient workspace that empowers you to work at your best.
The Jupyter Notebook interface is organized as a series of cells that can contain code, text, or raw data. The menu bar, perched at the top, offers a plethora of features and shortcuts, facilitating everything from saving work to inserting cells and exporting notebooks in various formats. Below it, the toolbar provides quick access to essential functions like running code and adding new cells. Within each notebook, cells can be classified as code cells, where Python commands are executed; Markdown cells, which support rich text formatting for notes and documentation; and Raw cells, which store unformatted text that is not intended for execution. This trifecta of cell types empowers users to blend computation with contextual information, creating documents that are as informative as they are functional.
Elevating productivity within Jupyter Notebooks involves mastering a suite of commands and shortcuts that streamline your workflow. Creating and managing notebooks is straightforward: the File menu allows you to open new notebooks with a single click, while existing ones can be easily accessed and organized. Executing code within a cell requires nothing more than pressing Shift + Enter, a simple yet powerful command that triggers immediate feedback in the output cell. Navigation is further enhanced by keyboard shortcuts, such as Esc + A to insert a cell above or Esc + B to insert one below, which minimize reliance on the mouse and expedite the process of building and modifying notebooks. These efficiencies, though seemingly minor, accumulate to transform the way you interact with data, enabling a fluid and uninterrupted analytical process.
As data scientists often collaborate and iterate on projects, integrating Jupyter Notebooks with version control systems like Git becomes invaluable. By saving notebooks in Git repositories, you preserve a comprehensive history of changes, allowing you to revert to previous versions or track the evolution of your analyses. This integration is essential for reproducible, team-based work.
Python
Fundamentals
for Data Analysis
In the realm of data analysis, the adept use of data structures is paramount, providing a scaffold upon which complex algorithms and analyses are constructed. Python's core data structures—lists, tuples, and dictionaries—each serve distinct purposes, offering unique capabilities.
Python's comma-separated syntax allows for implicit tuple creation without parentheses. Accessing tuple elements mirrors list indexing, offering a consistent interface across data structures. The ability to use tuples as dictionary keys leverages their immutability, providing a method for indexing complex, multi-dimensional data in ways that enhance both readability and performance.
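To make this concrete, here is a small sketch, with invented country-and-year data, of implicit tuple creation and of tuples serving as dictionary keys for multi-dimensional lookups:

# Tuples as dictionary keys for multi-dimensional indexing (illustrative data)
population = {
    ("Canada", 2021): 38_250_000,
    ("Canada", 2022): 38_930_000,
    ("France", 2022): 67_970_000,
}
print(population[("Canada", 2022)])   # 38930000

coords = 48.8566, 2.3522              # implicit tuple creation, no parentheses needed
lat, lon = coords                     # unpacking mirrors index-based access
print(coords[0], lat, lon)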
Dictionaries elevate data management through their implementation of associative arrays, where each value is linked to a unique key, facilitating efficient data retrieval, insertion, and deletion. Keys, which must be immutable, can range from strings to numbers and tuples, while values remain unconstrained in type. This flexibility makes dictionaries indispensable in scenarios requiring structured data storage, such as when mapping identifiers to attributes. Manipulating dictionaries involves operations like adding, updating, or removing key-value pairs, each action requiring minimal computational overhead. Iterating over dictionaries can be performed through their keys, values, or both simultaneously via methods like .keys(), .values(), and .items(), respectively. This versatility extends to nested dictionaries, which allow for the hierarchical organization of data, enabling the representation of complex entities with multiple layers of attributes.
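A brief sketch of these operations, using an invented employee record, might look like this:

# Add, update, remove, and iterate over a dictionary; then a nested structure
employee = {"name": "Dana", "role": "Analyst"}
employee["team"] = "Data"            # add
employee["role"] = "Senior Analyst"  # update
removed = employee.pop("team")       # remove

for key in employee.keys():
    print(key)
for value in employee.values():
    print(value)
for key, value in employee.items():
    print(f"{key}: {value}")

# Nested dictionaries model hierarchical data
org = {"Data": {"lead": "Dana", "size": 4}, "Web": {"lead": "Lee", "size": 6}}
print(org["Data"]["lead"])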
Engaging with these data structures requires not only an understanding of their individual properties but also an appreciation for their collective potential in constructing sophisticated data models. Lists, with their dynamic capabilities, allow for the aggregation and transformation of data in ways that are both intuitive and powerful. Tuples provide a stable framework for grouping data, ensuring consistency across operations that rely on fixed datasets. Dictionaries offer an unparalleled level of flexibility in storing and retrieving data, facilitating the rapid access and manipulation of information in a manner that is both efficient and scalable. Together, these structures form the backbone of effective data handling in Python.
By defining a function, you create a reusable piece of code that can be called upon as needed, reducing redundancy and enhancing clarity. Functions are defined using the def keyword, followed by a unique name and a parameter list enclosed in parentheses. Parameters act as placeholders for the input values that the function will process, and once invoked, the function executes its code block, returning a value or performing an action. This modularity is further enhanced by the ability to specify default parameter values, which allows functions to operate flexibly under varying conditions. Lambda functions, succinct single-expression functions, offer a more concise syntax, making them ideal for simple operations or as arguments within higher-order functions, thereby streamlining code and improving readability.
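As a quick illustration of these ideas, consider this sketch with an invented greeting function and a lambda used inside sorted():

def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

print(greet("Ada"))                  # uses the default -> "Hello, Ada!"
print(greet("Ada", greeting="Hi"))   # overrides the default

scores = [("Ada", 92), ("Grace", 88), ("Alan", 95)]
top = sorted(scores, key=lambda pair: pair[1], reverse=True)  # lambda as a key function
print(top[0])  # ('Alan', 95)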
The concept of scope and variable lifetime is pivotal in understanding how functions and code blocks interact with variables. Scope dictates the visibility and accessibility of variables within different parts of a program. Local variables are confined to the function or block in which they are declared, vanishing once the execution leaves that context. In contrast, global variables persist throughout the program's execution, accessible from any location within the script. The global keyword provides a mechanism to modify global variables from within a local scope, allowing functions to alter variables defined outside their immediate context. Understanding these distinctions is vital, as they affect how data is stored and manipulated, influencing program behavior and design.
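The following short sketch, with illustrative names, shows a local variable disappearing after its call and the global keyword altering a module-level counter:

counter = 0  # a global variable

def increment():
    global counter          # opt in to modifying the global binding
    counter += 1

def local_demo():
    message = "only visible here"   # a local variable; gone after the call returns
    return message

increment()
increment()
print(counter)        # 2
print(local_demo())   # "only visible here"
# print(message)      # would raise NameError: the local has gone out of scope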
Recursion, a sophisticated control flow technique, involves a function calling itself, thus breaking down complex problems into simpler, more manageable sub-problems. Each recursive call progresses towards a base case, a condition that terminates the recursion, preventing infinite loops. This technique is particularly useful in tasks like factorial calculation, where a function calls itself with a decremented argument until reaching the base case of zero or one, at which point it returns a definitive value. Recursive solutions, while elegant, demand careful implementation to ensure that each call brings the function closer to the base case, thereby avoiding excessive resource consumption and potential stack overflow errors.
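A minimal factorial sketch shows the base case and the progression toward it:

def factorial(n):
    """Recursive factorial; each call moves one step closer to the base case."""
    if n in (0, 1):          # base case terminates the recursion
        return 1
    return n * factorial(n - 1)

print(factorial(5))  # 120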
To further explore these concepts, open your Spyder IDE and turn to an exercise that involves writing a Python program to read a file and process its contents. Begin by wrapping the file operations within a try-except block to handle potential FileNotFoundError or IOError exceptions. Use the finally block to ensure that the file is closed properly, thus preventing resource leaks. Introduce strategically placed print statements to trace the data flow, and employ the logging module to record both successful operations and exceptions; this will require adding import logging at the top of the program. This exercise will reinforce your understanding of exception handling and debugging techniques, providing practical experience that is invaluable in real-world programming scenarios.
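One way the exercise might come together is sketched below, using the logging configuration described in the next paragraph; the file name sample_text.txt and the exact messages are assumptions chosen to match the sample output in the appendix:

import logging

logging.basicConfig(
    filename="file_operations.log",
    level=logging.DEBUG,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

def read_file(path):
    file = None
    try:
        file = open(path, "r")
        logging.info("Opened file: %s", path)
        print("Reading file contents...")
        content = file.read()
        print("File content read successfully.")
        lines, words = content.splitlines(), content.split()
        logging.info("Processed file: %d lines, %d words", len(lines), len(words))
        print(f"Line count: {len(lines)}, Word count: {len(words)}")
    except (FileNotFoundError, IOError) as exc:
        logging.error("%s: %s", type(exc).__name__, exc)
        print("Error: The file was not found.")
    finally:
        if file is not None:
            file.close()
        logging.info("File closed: %s", path)
        print("File closed.")

read_file("sample_text.txt")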
To configure logging, start by importing the logging module into your code. Set up the logging configuration to log messages to a file named file_operations.log with the log level set to DEBUG to capture all messages. The log format should include the timestamp, log level, and message for clarity. Next, set up the read_file function with a try block that attempts to open the file in read mode, processes its contents, counts lines and words, and logs successful operations as they occur.
Exploring Popular
Python Libraries
Pandas standardizes date formats through its datetime conversion utilities, ensuring that the temporal dimension of your data is accurately represented and ready for analysis. Once your dates are standardized, the resample() function allows you to aggregate or disaggregate your data over specified time intervals, facilitating the analysis of trends and patterns over time. Whether you are examining daily sales data to identify seasonal trends or analyzing hourly temperature readings to monitor climate changes, these tools empower you to unravel the temporal dynamics inherent in your datasets.
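A minimal sketch of this workflow, assuming a CSV file named daily_sales.csv with date and sales columns, might look like this:

import pandas as pd

df = pd.read_csv("daily_sales.csv")           # hypothetical input file
df["date"] = pd.to_datetime(df["date"])       # standardize the date format
df = df.set_index("date")

weekly = df["sales"].resample("W").sum()      # aggregate daily sales into weekly totals
print(weekly.head())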
Performance optimization in Pandas is crucial for anyone managing large datasets, as computational speed can significantly impact productivity. The eval() and query() functions allow for faster execution by tapping into Pandas' internal evaluation engine, minimizing overhead from Python's standard evaluation mechanisms. These functions excel in complex filtering and arithmetic operations, letting you streamline workflows. Moreover, vectorized operations, which apply a function to entire arrays rather than iterating element by element, take advantage of Pandas' underlying C and NumPy implementations for substantial performance benefits. By integrating these optimization strategies, you can devote more time to extracting insights rather than wrestling with inefficiencies. For more examples and hands-on exercises on these advanced transformations, see my Python for Effect Masterclass on Udemy, where we explore real-world data manipulation scenarios in greater detail.
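The following sketch, on an invented revenue-and-cost DataFrame, illustrates eval(), query(), and a vectorized column calculation:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "revenue": np.random.uniform(1_000, 5_000, 100_000),
    "cost": np.random.uniform(500, 4_000, 100_000),
})

df = df.eval("margin = revenue - cost")          # arithmetic via the internal engine
profitable = df.query("margin > 1000")           # fast, readable filtering
df["margin_pct"] = df["margin"] / df["revenue"]  # vectorized: no Python-level loop
print(len(profitable), df["margin_pct"].mean())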
Matplotlib serves as a versatile canvas for the artist within every data scientist, offering a suite of plotting capabilities that transform raw data into compelling visual stories. At its core, Matplotlib provides fundamental plotting functions that are indispensable for data visualization, beginning with the creation of line plots using the plot() function. This tool allows you to depict trends over continuous data, capturing shifts and patterns that are often invisible in raw figures. The elegance of a line plot lies in its simplicity, yet the power it wields in revealing the narrative of data is profound. Customizing these plots to enhance their aesthetic appeal and clarity involves the judicious addition of labels and titles, elements that transform a simple graph into a communicative piece, guiding the reader's eye and emphasizing key insights. By ensuring that every axis is labeled with precision and every plot bears a descriptive title, you not only improve readability but also ensure that your audience can readily grasp the significance of the data being presented.
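A minimal example with illustrative monthly figures shows the plot() call together with the labels and title discussed above:

import matplotlib.pyplot as plt

months = range(1, 13)
sales = [12, 14, 13, 17, 20, 22, 25, 24, 21, 18, 16, 19]

plt.plot(months, sales, marker="o")
plt.title("Monthly Sales Trend")
plt.xlabel("Month")
plt.ylabel("Sales (thousands of units)")
plt.show()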
Enhancing the clarity and impact of plots requires a delicate balance between form and function, a balance achieved through thoughtful customization. Annotations and text are tools at your disposal, allowing you to highlight specific data points or trends directly within the plot. By strategically placing annotations, you can draw attention to anomalies or outliers, or simply provide context that enriches the viewer's understanding. Adjusting axis limits and scales is another technique to ensure that your plot communicates effectively, particularly when dealing with data that spans several orders of magnitude or benefits from a logarithmic scale.
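As a short sketch on invented data, annotate(), a logarithmic scale, and explicit axis limits can be combined like this:

import matplotlib.pyplot as plt

x = list(range(10))
y = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]

plt.plot(x, y, marker="o")
plt.annotate("growth accelerates", xy=(6, 64), xytext=(1, 300),
             arrowprops=dict(arrowstyle="->"))
plt.yscale("log")          # a log scale tames data spanning orders of magnitude
plt.xlim(0, 9)
plt.title("Annotated Plot with a Logarithmic Axis")
plt.show()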
Software
Engineering Best
Practices
Each change then passes through automated validation, ensuring that the code remains functional and stable across iterations.
Setting up a CI pipeline is an exercise in precision and foresight, requiring the careful configuration of tools that automate the myriad tasks associated with code integration and testing. GitHub Actions, a robust CI/CD platform, offers seamless integration with GitHub repositories, allowing developers to automate workflows directly within their existing version control systems. By configuring workflows in YAML files, developers can specify the events that trigger actions, such as code pushes or pull requests, and define the subsequent tasks that should be executed, like running tests or deploying applications. This declarative approach not only simplifies the setup process but also ensures that workflows are transparent and easily modifiable, fostering an environment where continuous improvement is both achievable and encouraged.
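A minimal workflow sketch, saved for example as .github/workflows/ci.yml (the file name, Python version, and test command are assumptions), could look like this:

name: CI
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest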
Travis CI, another popular CI tool, offers a complementary approach to automated testing, providing a platform that supports a wide array of programming languages and environments. Setting up Travis CI involves creating a .travis.yml file within the repository, where developers can define the build configuration, including the language, environment, and script to execute. This configuration file serves as a blueprint for the CI process, detailing the steps required to build, test, and deploy the application. By leveraging Travis CI's robust testing infrastructure, developers can ensure that their code is thoroughly vetted before being merged, minimizing the risk of defects and enhancing the overall quality of the project.
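For comparison, a minimal .travis.yml along the lines described above (the Python version and commands are illustrative) might read:

language: python
python:
  - "3.11"
install:
  - pip install -r requirements.txt
script:
  - pytest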
The integration of testing suites into CI workflows is a critical aspect of maintaining code quality and project momentum, as it allows for the continuous validation of code changes against predefined criteria. By incorporating unit tests into the CI pipeline, developers can catch regressions the moment they are introduced.
Handling Big
Data with Python
The Spark web UI lets you visualize the execution of tasks and identify stages that may be contributing to delays. This real-time feedback is invaluable for diagnosing performance issues, as it highlights areas where resource allocation may be suboptimal or where data shuffling is excessive. Complementing this, profiling tools such as Apache Spark's own instrumentation can be employed to detect resource bottlenecks, providing granular details on memory usage, CPU load, and disk I/O. By leveraging these tools, you gain a deeper understanding of the application's performance characteristics, enabling you to implement targeted optimizations that address specific inefficiencies. This proactive approach not only improves the overall efficiency of big data operations but also ensures that the infrastructure can scale effectively to meet future demands.
By leveraging a dataset with daily sales data for products over a year, Python can be used to accurately forecast retail sales trends. Consider the visualizations in the figure above: the first visualization, a line plot, showcases total daily sales throughout the year, uncovering patterns like peaks and dips that are crucial for understanding consumer behavior and effectively managing inventory. The second visualization, a bar plot, ranks products by their total annual sales, highlighting best-sellers and enabling retailers to focus on stocking the most popular items, thus optimizing inventory levels and minimizing waste.
Data Cleaning
and
Preprocessing
Techniques like forward and backward fill, where missing values are replaced with the preceding or following valid observation, respectively, are particularly useful in time series data, where continuity is critical.
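A small sketch on an invented daily series shows both directions of filling:

import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 14.0, np.nan],
              index=pd.date_range("2023-01-01", periods=5, freq="D"))

print(s.ffill())   # gaps take the preceding valid observation
print(s.bfill())   # gaps take the following valid observation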
Beyond missing data, invalid data occurrences present another layer of complexity, requiring vigilant detection and rectification to ensure the dataset's accuracy and reliability. Detecting outliers and anomalies, which often manifest as extreme values or deviations from expected patterns, is essential for identifying data points that do not conform to the expected distribution. Such anomalies may result from data entry errors, measurement inaccuracies, or genuine variance, each demanding a tailored response. Statistical methods, such as Z-score or interquartile range analysis, offer robust means of identifying these outliers, enabling their isolation for further investigation. Validating data types and formats is equally important, ensuring that each variable is stored in an appropriate format that accurately reflects its nature and intended use. This validation process often involves cross-checking data types against expected formats, rectifying discrepancies, and transforming data as necessary to maintain consistency and coherence.
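The following sketch, using an invented amount column, applies the interquartile-range rule, a Z-score check, and a type-validation step:

import pandas as pd

df = pd.DataFrame({"amount": [100, 102, 98, 105, 99, 101, 950]})

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(outliers)                      # flags the 950 entry for investigation

z = (df["amount"] - df["amount"].mean()) / df["amount"].std()
print(df[z.abs() > 3])               # Z-score variant

df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # validate/coerce the dtype
print(df.dtypes)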
In the realm of data science, where datasets burgeon with both size and complexity, the manual cleaning of data is not only time-consuming but also fraught with potential for error. Automation in data cleaning stands as a beacon of efficiency, transforming what was once a laborious task into an orchestrated process that not only saves time but ensures consistency across analyses. Reproducibility is paramount in any scientific endeavor, and automated cleaning scripts provide an immutable record of each transformation applied to a dataset, allowing others to replicate results with fidelity. This is especially advantageous when dealing with large datasets, where the sheer volume of data can obscure manual oversight and exacerbate human error. By automating repetitive tasks, you liberate valuable cognitive resources, allowing for a greater focus on the interpretive and strategic aspects of data analysis rather than the mundane mechanics of data preparation.
Writing Python scripts to automate these tasks is an exercise in efficiency and foresight, where you craft functions that encapsulate common cleaning operations, rendering them reusable and adaptable to diverse datasets. Consider a function that standardizes date formats, another that removes duplicates, and yet another that encodes categorical variables; each of these functions can be fine-tuned to accommodate the nuances of different datasets while maintaining a consistent methodological approach. Furthermore, the scheduling of scripts for regular data updates ensures that datasets remain current and reflective of the latest information, a necessity in dynamic environments where data is constantly evolving. By integrating these scripts into a larger data pipeline, you facilitate a continuous flow of clean, validated data ready for analysis at any moment.
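A sketch of such reusable helpers, with assumed column names like order_date and region, might be chained into a small pipeline:

import pandas as pd

def standardize_dates(df, column):
    df[column] = pd.to_datetime(df[column], errors="coerce")
    return df

def remove_duplicates(df):
    return df.drop_duplicates()

def encode_categories(df, column):
    df[column] = df[column].astype("category").cat.codes
    return df

def clean_pipeline(df):
    return (df.pipe(standardize_dates, "order_date")
              .pipe(remove_duplicates)
              .pipe(encode_categories, "region"))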
Real-World Data
Applications
The journey of data analysis begins with the crucial task of extracting and cleaning sales data, an endeavor that sets the stage for meaningful insights. Imagine sifting through a sea of CSV and Excel files, each a repository of customer interactions, purchase histories, and revenue streams. The first step is sourcing this data, ensuring it is both comprehensive and relevant, which often involves integrating multiple datasets from disparate sources. Once gathered, the raw data requires meticulous cleaning to resolve inconsistencies, such as duplicate entries or missing values, which can skew analysis and lead to erroneous conclusions. Leveraging Python's powerful libraries like Pandas and NumPy, you can systematically apply functions such as drop_duplicates() and fillna() to cleanse the dataset, ensuring it is accurate, consistent, and ready for analysis.
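A condensed sketch of this extraction-and-cleaning step, assuming illustrative file and column names, could look like this:

import pandas as pd

sales = pd.concat([
    pd.read_csv("sales_q1.csv"),
    pd.read_excel("sales_q2.xlsx"),
], ignore_index=True)

sales = sales.drop_duplicates()
sales["revenue"] = sales["revenue"].fillna(0)
sales["order_date"] = pd.to_datetime(sales["order_date"], errors="coerce")
print(sales.info())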
Once armed with clean sales data, the next phase involves segmenting and summarizing this information to distill actionable insights. Here, Python's capabilities shine, offering tools to group data by various dimensions, such as regions or product categories. By employing techniques like grouping and aggregation, you can calculate key metrics such as total sales and average order value, metrics that serve as benchmarks for performance evaluation. For instance, using Pandas' groupby() function, you can easily segment data to uncover trends across different geographic areas or product lines, providing a nuanced understanding of which sectors drive revenue. This segmentation not only highlights areas of strength but also identifies underperforming segments, guiding strategic decisions aimed at optimizing resources and maximizing profitability.
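On a cleaned table with region and revenue columns (invented here), the segmentation step reduces to a single grouped aggregation:

import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "North", "West"],
    "revenue": [1200.0, 800.0, 450.0, 1500.0, 300.0],
})

summary = (sales.groupby("region")
                .agg(total_sales=("revenue", "sum"),
                     average_order_value=("revenue", "mean"))
                .sort_values("total_sales", ascending=False))
print(summary)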
Detecting sales trends and patterns is akin to uncovering the stories that data tells over time. By visualizing sales data monthly or quarterly, you gain temporal insights that static reports cannot provide. Employing libraries like Matplotlib and Seaborn, you can craft visuals that reveal trends and seasonal patterns, offering clarity and foresight into the ebbs and flows of market demand. These visualizations act as a bridge between raw numbers and strategic understanding.
In an era where social media pervades every facet of our lives, data generated from these platforms represents a rich tapestry of public sentiment, trends, and interactions that can be meticulously analyzed.
Feeding these streams into regularly refreshed pipelines can ensure that your dashboards reflect the most current information, providing immediate insights into evolving trends and sentiment shifts. This real-time capability is crucial in fields where rapid response to public opinion or market trends is necessary, such as in crisis management or digital marketing. As you design these dashboards, consider the end-user experience, ensuring that the interface is intuitive, the visuals are clear, and the insights are readily actionable.
Advanced Data
Visualization
Techniques
The dashboard presents several coordinated views of the data. The scatter plot visualizes the relationship between Volume and Stock Price, with bubble sizes representing the Market Cap. Hover functionality is integrated to provide additional insights, such as displaying the date, volume, and stock price when the user hovers over each point. This allows users to explore correlations dynamically, seeing how changes in volume might correlate with fluctuations in stock prices.
The line chart provides a time-series view of the Stock Price over time, showcasing its temporal pattern. The chart includes an interactive zoom function through a range slider, enabling users to focus on specific periods for deeper analysis. This functionality is particularly useful for financial data, where identifying patterns or anomalies within particular time frames is critical for decision-making. By adjusting the slider, users can zoom in and explore different time intervals, observing how stock prices evolve and potentially identifying peaks, dips, or trends.
A key feature of the dashboard is the range slider for date filtering. The slider controls the data range displayed in both the scatter plot and the line chart. As users adjust the slider, both visualizations update dynamically to reflect the selected date range. This interactive element enhances the dashboard's utility, allowing users to explore different time frames and how the financial metrics change over these periods without the need for manual reloading or data adjustments.
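As a much smaller sketch of the pattern described here, a Dash app can wire a range slider to a chart roughly as follows (the data, component IDs, and layout are illustrative; older Dash versions use app.run_server instead of app.run):

from dash import Dash, dcc, html, Input, Output
import plotly.express as px
import pandas as pd
import numpy as np

dates = pd.date_range("2023-01-01", periods=200, freq="D")
df = pd.DataFrame({
    "Date": dates,
    "StockPrice": 100 + np.random.randn(200).cumsum(),
})

app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(id="price-chart"),
    dcc.RangeSlider(id="date-slider", min=0, max=len(df) - 1,
                    value=[0, len(df) - 1], allowCross=False),
])

@app.callback(Output("price-chart", "figure"), Input("date-slider", "value"))
def update_chart(selected_range):
    start, end = selected_range
    window = df.iloc[start:end + 1]           # filter to the slider's date range
    return px.line(window, x="Date", y="StockPrice", title="Stock Price Over Time")

if __name__ == "__main__":
    app.run(debug=True)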
In summary, the script demonstrates how to build an interactive dashboard that combines scatter plots, time-series analysis, and dynamic filtering using Plotly Dash. It effectively illustrates how users can engage with the data to uncover insights about financial behavior, such as correlations between stock price and trading volume or patterns in stock price changes over time. This interactivity is crucial in turning a static chart into an exploratory tool.
Line charts excel at tracing values over time, with the continuous line elegantly guiding the viewer's eye across the temporal plane. By incorporating multiple lines, you can conduct comparative analyses, juxtaposing different data series to reveal correlations or divergences. Area charts, on the other hand, facilitate the visualization of cumulative data, with the shaded areas under the curves providing a visual indication of volume or intensity. This technique is particularly useful when conveying the accumulation of data points over time, such as total sales or resource consumption. The choice between line and area charts hinges on the nature of the data and the story you wish to convey, each offering a unique perspective on temporal trends.
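A side-by-side sketch of the two chart types, on the same invented series, makes the contrast visible:

import matplotlib.pyplot as plt

months = range(1, 13)
sales = [3, 4, 4, 5, 7, 8, 9, 9, 8, 6, 5, 4]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(months, sales)
ax1.set_title("Line Chart: Trend")
ax2.fill_between(months, sales, alpha=0.4)   # shaded area conveys volume
ax2.plot(months, sales)
ax2.set_title("Area Chart: Accumulated Volume")
plt.show()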
The application of moving averages and smoothing techniques enhances the clarity of time series visualizations, mitigating the noise that often obscures underlying trends. Rolling averages, a simple yet powerful technique, involve averaging data points over a specified window, thereby smoothing out short-term fluctuations and highlighting longer-term trends. This method is particularly effective for datasets plagued by volatility, offering a clearer view of the overarching patterns. Exponential smoothing, a more sophisticated approach, assigns exponentially decreasing weights to past observations, allowing the visualization to adapt dynamically to changes in the data. This technique is invaluable for data that exhibit rapid shifts or trends, enabling a more responsive and insightful portrayal of temporal dynamics.
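In Pandas, both techniques are one-liners; the window and span below are illustrative choices:

import numpy as np
import pandas as pd

idx = pd.date_range("2023-01-01", periods=120, freq="D")
series = pd.Series(np.sin(np.linspace(0, 8, 120)) * 10 + np.random.normal(0, 2, 120), index=idx)

rolling_7d = series.rolling(window=7).mean()         # fixed 7-day window
smoothed = series.ewm(span=7, adjust=False).mean()   # exponentially weighted smoothing

print(rolling_7d.tail(3))
print(smoothed.tail(3))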
Through the strategic use of time series visualization techniques, you gain the ability to transform complex temporal datasets into intuitive and informative visual narratives. Whether leveraging Matplotlib's capabilities for detailed static plots or embracing Plotly's interactivity, the tools at your disposal empower you to craft visuals that resonate with your audience, conveying temporal insights with clarity and precision. As you explore these techniques, consider how each choice shapes the story your data tells.
Introduction to
Machine
Learning
Preparing your data begins with splitting it into training and test sets. This division is vital, as it allows you to train your model on one subset while using the other to evaluate its predictive accuracy. Scikit-Learn offers the convenient train_test_split function to perform this task, enabling you to test your model on unseen data. This process helps prevent overfitting and enhances the model's ability to generalize to new data, improving its overall performance and reliability.
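A minimal sketch of the split, on synthetic arrays, looks like this:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = 3 * X.ravel() + np.random.normal(0, 5, 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)   # (80, 1) (20, 1)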
With your data adequately prepared, implementing a linear regression model becomes a structured endeavor in Scikit-Learn. The LinearRegression class serves as the primary tool for this purpose. It facilitates the creation of a model that fits a linear equation to the observed data, thereby modeling the relationship between dependent and independent variables. The process begins by instantiating the class and using the fit method to train the model with your dataset's features and labels. Once trained, the model's efficacy can be visualized through the plotting of the regression line, which represents the best fit through the data points, and the residuals, which are the discrepancies between observed and predicted values. These visualizations are crucial, offering a window into the model's accuracy and helping you identify any patterns in the residuals that could indicate potential issues, such as heteroscedasticity or non-linearity.
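The following sketch, on synthetic data, fits the model and produces both of the plots described above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 50).reshape(-1, 1)
y = 2.5 * X.ravel() + rng.normal(0, 2, 50)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)
residuals = y - predictions

plt.scatter(X, y, label="observed")
plt.plot(X, predictions, color="red", label="regression line")
plt.legend()
plt.show()

plt.scatter(predictions, residuals)
plt.axhline(0, color="black", linewidth=1)
plt.title("Residuals vs. Fitted Values")
plt.show()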
As you advance in regression analysis, you'll encounter situations where linear regression's assumptions are violated, or when additional model complexity is necessary. Techniques like Ridge regression address these limitations by adding a penalty term to the loss function, which reduces the impact of multicollinearity, a common issue when predictors are highly correlated. This form of regularization helps prevent overfitting, particularly in models with a large number of features. Alternatively, Polynomial regression allows the model to capture non-linear relationships by introducing polynomial terms of the predictors.
Statistical Analysis
and Techniques
By transforming raw data into actionable information, the script highlights how these essential techniques in data analysis provide a solid foundation for making informed decisions. Whether used for budget planning or forecasting, these methods ensure that estimates are aligned with historical patterns and the true behavior of the data, offering a reliable basis for strategic planning.
The Python script found in Appendix 10.1.1 explores and visualizes a synthetic dataset using descriptive statistics to uncover key insights. It begins by generating a dataset containing 1,000 entries for three columns: Revenue, Expenses, and Profit, each following a normal distribution with specified means and standard deviations to simulate realistic financial data. This dataset is stored in a Pandas DataFrame called df. The script then uses the describe() method to calculate summary statistics such as the mean, standard deviation, minimum, maximum, and quartiles for each column. It also calculates the mean, median, and mode specifically for the Revenue column to demonstrate measures of central tendency, and it computes the standard deviation and variance to illustrate the variability of the data.
To visualize these statistics, the script employs Seaborn and Matplotlib. It creates a box plot for the three variables, which highlights the spread, central tendency, and any potential outliers by showing the interquartile range. Additionally, a histogram with a kernel density estimate (KDE) overlay is generated for the Revenue column, providing insights into the shape of the revenue distribution and indicating whether it is normal or skewed. Finally, the script presents observations based on these descriptive statistics and visualizations, noting how closely the mean aligns with the median (suggesting a symmetric distribution) and interpreting the implications of the standard deviation. It explains how the visualizations aid in understanding data patterns and the spread of values, showing how such insights could inform subsequent analysis.
Such methods are equally relevant when dealing with data that involve ordered categories rather than continuous values.
Identifying causal relationships requires more than statistical measures; it necessitates a thoughtful approach to experimental design and analysis. Designing experiments, such as randomized controlled trials, provides a gold standard for establishing causality, as they allow for the manipulation of independent variables while controlling for confounding factors. However, in many fields, such experiments are impractical or unethical, necessitating alternative methods. Observational studies, though limited by potential biases, can yield insights when carefully designed and analyzed. Natural experiments, which exploit external factors as instruments, can also offer compelling evidence of causality by mimicking the conditions of a randomized trial in a natural setting.
Confounding variables pose a significant challenge in causal inference, as they can obscure or distort the true relationship between variables. Identifying and controlling for these confounders is paramount to isolating causal effects. Statistical controls, such as multivariate regression, allow for the inclusion of potential confounders in the analysis, helping to adjust for their influence and clarify the direct relationship between the variables of interest. This approach can be enhanced by techniques like propensity score matching, which pairs observations with similar values of confounding variables, thereby balancing the distribution of confounders across groups and approximating the conditions of a randomized experiment. By meticulously accounting for confounding variables, we can approach a more accurate understanding of causation, distinguishing genuine causal effects from mere statistical associations.
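A short statsmodels sketch on synthetic data shows how including the confounder as a control clarifies the treatment effect (variable names and coefficients are invented):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
confounder = rng.normal(size=500)
treatment = 0.6 * confounder + rng.normal(size=500)
outcome = 2.0 * treatment + 1.5 * confounder + rng.normal(size=500)

df = pd.DataFrame({"treatment": treatment, "confounder": confounder, "outcome": outcome})
X = sm.add_constant(df[["treatment", "confounder"]])  # include the confounder as a control
model = sm.OLS(df["outcome"], X).fit()
print(model.params)   # the adjusted treatment coefficient sits near its true value of 2.0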
In the pursuit of accurate data interpretation, the distinction between correlation and causation is not just academic; it is pivotal for ensuring that analyses lead to valid insights and sound decisions. Understanding the limitations and potential pitfalls inherent in these concepts is essential for anyone engaged in the analysis of data, regardless of field or application.
Integrating
Python with
Other Tools
Manual data entry is a monotonous exercise prone to error and ennui. Python can alleviate this burden through sophisticated scripting that populates spreadsheets with data pulled from various sources, whether databases, text files, or APIs. By employing libraries such as openpyxl, you can programmatically read, write, and manipulate Excel files, thus transforming data entry from a manual chore into a streamlined process. This not only reduces the potential for error but also frees up time for more analytical pursuits, enabling you to focus on deriving insights rather than inputting numbers.
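A minimal openpyxl sketch, with an invented workbook and sheet, shows the programmatic population described above:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Sales"
ws.append(["Region", "Revenue"])          # header row
for row in [("East", 1200), ("West", 950), ("North", 1430)]:
    ws.append(row)
ws["D1"] = "=SUM(B2:B4)"                  # formulas can be written as strings
wb.save("sales_report.xlsx")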
Streamlining financial reporting workflows is another area where Python's capabilities shine brightly. Financial professionals often grapple with the complexity of consolidating data from disparate sources into coherent reports, a task fraught with the potential for discrepancies and inconsistencies. Python's pandas library, renowned for its robust data manipulation capabilities, can serve as an intermediary, extracting data from Excel sheets, transforming it as needed, and then reintegrating it into Excel for presentation. This automation facilitates the rapid generation of reports that are both accurate and up-to-date, ensuring that financial insights are always grounded in the most current data. By automating these workflows, you not only enhance the reliability of your reports but also increase their frequency and timeliness, providing stakeholders with the information they need to make informed decisions.
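A compact sketch of such a round trip, assuming a workbook named monthly_figures.xlsx with department and amount columns, might be:

import pandas as pd

raw = pd.read_excel("monthly_figures.xlsx", sheet_name="Raw")
summary = (raw.groupby("department")["amount"]
              .sum()
              .reset_index()
              .rename(columns={"amount": "total"}))

with pd.ExcelWriter("financial_report.xlsx") as writer:
    raw.to_excel(writer, sheet_name="Raw Data", index=False)
    summary.to_excel(writer, sheet_name="Summary", index=False)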
Creating dynamic Excel reports with Python elevates the static spreadsheet into a living document that reflects the latest data and insights. Utilizing Python scripts, you can automate the creation of pivot tables, an essential tool for summarizing and analyzing large datasets. This automation allows for the rapid reconfiguration of data views, enabling you to explore different dimensions and uncover hidden trends with ease. Additionally, Python can generate charts and graphs that refresh alongside the underlying data.
In the digital age, the vast expanse of the internet is teeming with data, a rich tapestry of information just waiting to be explored and extracted. This is where web scraping emerges as a powerful method, enabling you to collect and analyze data from websites with unparalleled precision. It is a transformative tool for those who wish to gather competitive pricing data from e-commerce platforms, offering insights into market dynamics and informing strategic pricing decisions. Similarly, monitoring social media for brand mentions provides a real-time window into public sentiment, allowing businesses to stay attuned to their audience's perceptions and reactions. Web scraping, therefore, becomes an invaluable asset in your analytical arsenal, offering a means to access and utilize data that would otherwise remain elusive.
To harness the power of web scraping, one must first become acquainted with the basics, particularly through the use of BeautifulSoup. This Python library excels in parsing HTML and XML documents, transforming them into navigable parse trees. With BeautifulSoup, you can effortlessly traverse the complex structures of web pages, extracting pertinent information with ease. It allows you to locate elements by their tags, attributes, or even text content, a flexibility that is crucial when dealing with diverse and unpredictable web page layouts. Whether your goal is to extract text, images, or hyperlinks, BeautifulSoup provides the tools necessary to dissect and collect data with precision and efficiency. This meticulous parsing is the foundation upon which more advanced scraping techniques can be built, enabling you to transform raw web content into structured, actionable data.
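A short BeautifulSoup sketch, against an invented page structure and URL, illustrates locating elements by tag and class (always check a site's terms before scraping):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products").text
soup = BeautifulSoup(html, "html.parser")

for item in soup.find_all("div", class_="product"):
    name = item.find("h2").get_text(strip=True)
    price = item.find("span", class_="price").get_text(strip=True)
    link = item.find("a")["href"]
    print(name, price, link)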
While BeautifulSoup is adept at handling static content, the modern web is replete with dynamic pages that require a more interactive approach. This is where Selenium steps into the spotlight, a library designed to automate web browser interactions. Selenium allows you to simulate human actions, such as clicking buttons or filling out forms, and is indispensable when dealing with JavaScript-rendered content that cannot be accessed through traditional scraping methods. By automating these interactions, Selenium enables you to access data that would otherwise remain hidden behind user actions, expanding the scope of your web scraping capabilities. Whether you are navigating through multi-page forms or extracting data from dynamically loaded elements, Selenium empowers you to interact with web pages as if you were a human user, bridging the gap between static and dynamic content.
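A minimal Selenium sketch, with an illustrative URL and element IDs, automates a search interaction (a matching browser driver must be installed):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/search")
    driver.find_element(By.ID, "query").send_keys("python")
    driver.find_element(By.ID, "submit").click()
    for result in driver.find_elements(By.CLASS_NAME, "result-title"):
        print(result.text)
finally:
    driver.quit()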
Yet, as you delve into the world of web scraping, it is imperative to remain cognizant of the ethical considerations and legal implications associated with this practice. Websites often have terms of service that explicitly prohibit or restrict automated data extraction, and it is your responsibility to respect these boundaries. Ignoring such guidelines can carry legal and reputational consequences.
measures. This setup ensures that the report adheres to scientific standards and is visually organized for clarity and impact.
Following the template setup, a Python script automates the process of filling in the template with the generated data. The script reads the LaTeX template and replaces placeholders with actual content, such as the file paths for the visualizations and the textual summary statistics. Using subprocess and pdflatex, the script compiles the filled LaTeX file into a PDF report. This compilation step ensures that the final output is a professional, ready-to-distribute document that integrates both visual and textual data seamlessly.
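A simplified sketch of the fill-and-compile step is shown below; the placeholder tokens and file names are assumptions, and pdflatex must be available on the system:

import subprocess
from pathlib import Path

template = Path("report_template.tex").read_text()
filled = (template
          .replace("<<SUMMARY>>", "Mean revenue: 1,042.7; std dev: 251.3")
          .replace("<<FIGURE_PATH>>", "figures/revenue_hist.png"))
Path("report.tex").write_text(filled)

# Compile the filled template into report.pdf
subprocess.run(["pdflatex", "-interaction=nonstopmode", "report.tex"], check=True)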
The final stage involves automating the distribution of the generated PDF report via email. Using Python's smtplib library, the script connects to an SMTP server and sends the report as an email attachment to a predefined list of recipients. The email content is formatted to include a brief message explaining the attached report. The SMTP configuration and recipient details are customizable to fit specific organizational needs, ensuring flexibility and security when sending out the report. By leveraging automation for this entire process, the script ensures efficiency, consistency, and accuracy, significantly reducing manual effort and making it easy to generate and distribute professional reports regularly.
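A sketch of the emailing step using smtplib and the email module follows; the server, credentials, and addresses are placeholders:

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Weekly analysis report"
msg["From"] = "reports@example.com"
msg["To"] = "team@example.com"
msg.set_content("Please find the latest report attached.")

with open("report.pdf", "rb") as f:
    msg.add_attachment(f.read(), maintype="application",
                       subtype="pdf", filename="report.pdf")

with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("reports@example.com", "app-password")
    server.send_message(msg)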
This end-to-end solution effectively integrates data analysis, visualization, document generation, and email distribution into a cohesive workflow. It is particularly beneficial for scientific and data-driven environments where regular reporting is necessary, as it streamlines the creation of detailed reports while maintaining a high level of quality and professionalism. The approach ensures that insights are accurately captured, presented, and communicated, making it a powerful tool for businesses, researchers, and organizations seeking to enhance their data analysis and reporting capabilities.
R provides mature routines for advanced statistical modeling and time series analysis, all of which can be applied to datasets curated in Python. This allows for a more nuanced and detailed approach to analysis, unlocking insights that might be less accessible using Python alone. Additionally, for visualizing complex data patterns, R's ggplot2 library offers unparalleled customization and precision, transforming raw data into compelling, easily interpretable visual narratives. By incorporating these visualizations into Python workflows, you ensure that your insights are not only accurate but also visually impactful and informative.
In the rapidly evolving field of data science, the ability to integrate and leverage specialized tools is crucial for staying ahead. Combining Python and R exemplifies the power of utilizing multiple, complementary tools to achieve greater analytical depth and scope. Embracing this integration allows you to tackle complex analytical challenges with confidence, knowing you have the best resources at your disposal. This chapter explores the synergy between these two languages, providing the essential knowledge needed to unlock their combined potential for advanced data analysis.
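One common way to realize this integration is the rpy2 bridge; the sketch below, which assumes rpy2 and R are installed, hands a Pandas DataFrame to R and fits a linear model there:

import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri

pandas2ri.activate()  # enable pandas <-> R data.frame conversion

df = pd.DataFrame({"x": range(10), "y": [2 * i + 1 for i in range(10)]})
ro.globalenv["df"] = pandas2ri.py2rpy(df)   # hand the DataFrame to R

# Fit a linear model in R and print its summary back in Python
print(ro.r("summary(lm(y ~ x, data = df))"))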
Chapter Twelve
Community and
Continued
Learning
Track your progress and adjust your plan as needed, ensuring that your community engagement remains a dynamic and rewarding aspect of your professional development.
As you immerse yourself in these vibrant communities, remember that the relationships you build and the knowledge you gain are invaluable assets in your journey as a Python developer. Engage with enthusiasm, contribute with integrity, and embrace the collaborative spirit that defines these spaces. Through active participation, you will not only advance your skills but also enrich the Python community as a whole, becoming an integral part of its ongoing evolution and success.
The platforms where you can find open-source projects are numerous, yet some stand out for their accessibility and breadth of options. GitHub is perhaps the most prominent, a vast repository of projects across countless domains, where you can browse repositories that align with your interests and skill levels. It provides an interface that not only facilitates code sharing but also encourages community interaction through issues, pull requests, and code reviews. For those seeking a more curated experience, project aggregators like Open Source Friday highlight projects that are particularly welcoming to new contributors, often tagging issues as "good first issue" to help novices find manageable ways to begin contributing. These platforms serve as gateways to the open-source world, offering you the chance to engage with projects that resonate with your passions and expertise while providing the scaffolding needed to begin contributing effectively.
Once you identify a project you wish to contribute to, the process of making contributions involves several key steps, sketched in command-line form after this paragraph. Forking a repository creates a personal copy on your GitHub account, which you can then clone to your local machine for development work. This step ensures that you have a stable environment to experiment with changes without affecting the original project. After implementing your changes, the next step is to submit a pull request, a formal proposal that outlines your modifications, accompanied by well-documented code and an explanation of the changes. This is where you showcase not only your technical skills but also your ability to communicate effectively with the project maintainers and the broader community. Engaging in code reviews and discussions that follow a pull request submission is a crucial part of the process, as it allows you to receive feedback, iterate on your contributions, and refine your code based on community input. This iterative cycle of review and refinement strengthens both the project and your own craft.
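In command-line terms, the fork-and-pull-request cycle typically looks something like this (repository and branch names are illustrative):

git clone https://github.com/<your-username>/<project>.git
cd <project>
git checkout -b improve-docs        # create a feature branch for your change
# ...edit files, then stage and commit...
git add .
git commit -m "Clarify installation instructions"
git push origin improve-docs        # push the branch to your fork
# Finally, open a pull request from the pushed branch on GitHub.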
Staying engaged in this way ensures that your skills remain sharp and relevant, allowing you to stay competitive in an ever-evolving tech landscape that favors the adaptable and the informed.
The conduit for tracking these developments is found in a variety of resources, each offering unique insights into the language's trajectory. Python Enhancement Proposals (PEPs) serve as the official channel for proposing and discussing new features and changes to Python's core. These documents, accessible to all, provide a transparent view into the decision-making processes that shape the language, offering you a glimpse into its future directions. In tandem, the official Python blog and release notes are indispensable for staying updated on the latest releases, bug fixes, and improvements. These resources not only inform you of changes but also provide context and rationale, helping you understand the implications for your work. By regularly consulting these documents, you ensure that you are well-prepared to integrate new features into your workflow, optimizing your code and processes.
To continuously expand your toolkit with the latest libraries, it is crucial to stay curious and open to exploration. The Python Package Index (PyPI) is a treasure trove of libraries, ranging from essential utilities to niche tools, each offering potential enhancements to your projects. By browsing PyPI, you can discover new packages that address specific needs or introduce novel functionalities, allowing you to refine your processes and tackle complex challenges with greater efficiency. Additionally, following influential Python developers and blogs keeps you informed about emerging trends and best practices. These thought leaders often share insights into innovative libraries and tools, providing you with practical examples of their applications and benefits. Engaging with this content not only broadens your perspective but also keeps your skills aligned with the direction of the ecosystem.
Conclusion
Example Usage:
Enter an integer: 5
Enter a floating-point number: 3.2
Enter a string: Hello!
Output:
The sum of 5 and 3.2 is 8.20, and you said: "Hello!"
This script takes the user's inputs, performs an addition operation on the numeric values, and then combines these with the string input in a descriptive message.
})
# Demonstrating operations on the list, tuple, and dictionary
# 1. Aggregating and transforming data using Lists
print("Books published after 1950:")
for book in books:
    if book[2] > 1950:
        print(f"- {book[0]} by {book[1]} ({book[2]})")
# 2. Ensuring consistency using Tuples
# Since Tuples are immutable, attempting to modify book1 directly will raise an error:
# book1[0] = "New Title"  # Uncommenting this line will cause a TypeError
# 3. Storing and retrieving information efficiently using Dictionaries
print("\nLibrary Catalog by Genre:")
for genre, genre_books in library_catalog.items():
    print(f"\nGenre: {genre}")
    for book_info in genre_books:
        print(f" - {book_info['title']} by {book_info['author']} ({book_info['year']})")
# Additional transformation: Count the number of books in each genre
genre_counts = {genre: len(genre_books) for genre, genre_books in library_catalog.items()}
print("\nNumber of books per genre:")
for genre, count in genre_counts.items():
    print(f"{genre}: {count} book(s)")
Output Example:
Output Example
If sample_text.txt is present:
Reading file contents...
File content read successfully.
Line count: 4, Word count: 22
File closed.
If the file is missing:
Error: The file was not found.
File closed.
Log File:
file_operations.log
The log file captures detailed information about the operations performed and any errors encountered:
2024-10-17 12:34:56,789 - INFO - Opened file: sample_text.txt
2024-10-17 12:34:56,790 - INFO - File content read successfully.
2024-10-17 12:34:56,790 - INFO - Processed file: 4 lines, 22 words
2024-10-17 12:34:56,791 - INFO - File closed: sample_text.txt
Or, in the case of a missing file:
2024-10-17 12:34:56,789 - ERROR - FileNotFoundError: [Errno 2] No such file or directory: 'sample_text.txt'
2024-10-17 12:34:56,791 - INFO - File closed: sample_text.txt
This setup effectively demonstrates the use of Python's file handling with structured error handling, logging, and resource management.
Set the Date column as the primary time index. Next, transform the dataset using the melt() function to convert it from a wide format (where each variable has its own column) to a long format. This results in a single column indicating the variable type (Temperature, Humidity, WindSpeed) and another column for their values, making the data suitable for plotting and analysis in tidy form. Then, apply multi-indexing to the DataFrame, organizing it hierarchically with Date and Variable as the two levels of the index. This structure facilitates grouped operations and resampling based on these levels.
Resample the dataset on a weekly basis using the resample('W') function, and calculate the mean values for each week to observe trends over different time intervals. Group the data by Variable before resampling to ensure that each variable's data is processed independently. For efficient calculations, use the eval() function to compute a HeatIndex that combines Temperature, Humidity, and WindSpeed. The eval() function allows direct reference to column names, optimizing performance and making the code cleaner. Additionally, perform another calculation, the ComfortIndex, using vectorized operations for maximum efficiency. This index evaluates comfort based on temperature, humidity, and wind speed.
Python Code
import pandas as pd
import numpy as np
# Generate a time-dependent dataset
date_range = pd.date_range(start='2023-01-01', periods=100, freq='D')
data = {
    'Date': date_range,
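    # --- The rest of this listing was cut off at the page break. ---
    # A hedged sketch of the remaining steps described above (column values are illustrative):
    'Temperature': np.random.uniform(15, 30, 100),
    'Humidity': np.random.uniform(40, 80, 100),
    'WindSpeed': np.random.uniform(0, 20, 100),
}
df = pd.DataFrame(data)

# Melt to long (tidy) format, then build a (Date, Variable) MultiIndex
melted = df.melt(id_vars='Date', var_name='Variable', value_name='Value')
melted = melted.set_index(['Date', 'Variable'])

# Weekly resampling of the mean, grouped per variable
weekly = melted.groupby([pd.Grouper(level='Date', freq='W'), 'Variable'])['Value'].mean()

# eval() for a HeatIndex, vectorized arithmetic for a ComfortIndex (formulas are illustrative)
df = df.eval('HeatIndex = Temperature + 0.1 * Humidity')
df['ComfortIndex'] = df['Temperature'] - 0.05 * df['Humidity'] - 0.2 * df['WindSpeed']
print(df.head())
print(weekly.head())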
Sample Output
Original DataFrame:
        Date  Temperature   Humidity  WindSpeed
0 2023-01-01    21.560963  70.037812  14.609826
1 2023-01-02    18.472673  46.264667  13.238215
2 2023-01-03    28.709601  75.413642   5.312936
3 2023-01-04    29.778739  64.707294  13.198032
4 2023-01-05    22.755032  54.867598   6.025848
Melted DataFrame:
Python Code
plt.xlabel('Categories')
plt.ylabel('Values')
plt.show()
# 3. Horizontal Bar Plot - Alternative view for the same data
plt.figure(figsize=(8, 6))
plt.barh(categories, values, color='salmon')
plt.title('Horizontal Bar Plot: Comparison of Categories')
plt.xlabel('Values')
plt.ylabel('Categories')
plt.show()
# 4. Histogram - Distribution of continuous data
plt.figure(figsize=(8, 6))
plt.hist(data, bins=30, color='purple', edgecolor='black', alpha=0.7)
plt.title('Histogram: Distribution of Data')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# 5. Scatter Plot - Bivariate analysis
plt.figure(figsize=(8, 6))
plt.scatter(x, y, color='green', alpha=0.6, edgecolor='black')
plt.title('Scatter Plot: Relationship between X and Y')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.show()
Hadoop and Spark require Java to run, so make sure Java is installed. Download and install the latest version of the Java Development Kit (JDK) from the official website, or use the OpenJDK.
On Linux/Mac, you can install it using package managers like:
sudo apt update
sudo apt install openjdk-11-jdk
On Windows, download the installer and follow the instructions.
Verify the installation by running:
java -version
2. Install Hadoop
Download Hadoop from the Apache Hadoop website. Choose the binary download and extract it to a directory of your choice.
Configuration:
Set the Hadoop environment variables in your .bashrc (Linux/Mac) or environment variables (Windows):
export HADOOP_HOME=/path/to/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Edit the core-site.xml file (found in HADOOP_HOME/etc/hadoop/) to set the default file system:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Edit hdfs-site.xml (found in the same directory) to configure HDFS:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
3. Install PySpark
To run a PySpark script, create a .py file and include your PySpark code. You can execute the script using:
spark-submit my_pyspark_script.py
Troubleshooting Tips
Conclusion
1. Simulated Dataset:
This chart uses a line plot to display the daily total sales over the year. It helps identify trends such as peaks or dips in sales, which are crucial for understanding purchasing patterns and planning inventory.
Required Libraries
Python Script
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from pandas.plotting import register_matplotlib_converters
# Set up Seaborn theme for visualization
sns.set_theme(style="whitegrid")
# Simulate a retail dataset
np.random.seed(0)
dates = pd.date_range(start='2023-01-01', periods=365, freq='D')
product_ids = [f'P{str(i).zfill(3)}' for i in range(1, 11)]  # 10 different products
# Generate sales data for each product and each day
data = []
for date in dates:
    for product_id in product_ids:
        sales = np.random.poisson(lam=20)  # Simulating daily sales using Poisson distribution
        data.append([date, product_id, sales])
# Create a DataFrame
df = pd.DataFrame(data, columns=['Date', 'ProductID', 'Sales'])
# Aggregate sales data
daily_sales = df.groupby('Date')['Sales'].sum().reset_index()
# Chart 1: Purchase Patterns - Total Sales Over Time
plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='Sales', data=daily_sales)
plt.title('Daily Sales Trends Over the Year', fontsize=16)
plt.xlabel('Date', fontsize=14)
plt.ylabel('Total Sales', fontsize=14)
plt.show()
# Chart 2: Most Purchased Products
product_sales = df.groupby('ProductID')['Sales'].sum().reset_index().sort_values(by='Sales', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='Sales', y='ProductID', data=product_sales, palette='Blues_d')
plt.title('Most Purchased Products', fontsize=16)
plt.xlabel('Total Sales', fontsize=14)
plt.ylabel('Product ID', fontsize=14)
plt.show()
# Predictive Analytics - Forecasting Future Sales Using Linear Regression
# Prepare the data for modeling
daily_sales['DayOfYear'] = daily_sales['Date'].dt.dayofyear  # Extract day of the year for feature
X = daily_sales[['DayOfYear']]
y = daily_sales['Sales']
# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)
# Predict future sales for the next 30 days
future_days = pd.DataFrame({'DayOfYear': np.arange(366, 396)})  # Days 366 to 395 for the next month
future_sales = model.predict(future_days)
Python Script
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set Seaborn theme for visualizations
sns.set_theme(style="whitegrid")
# Generate a dataset with intentional missing values
np.random.seed(0)  # Seed for reproducibility
# Create a DataFrame with 100 rows and 5 columns
data = {
    'ProductID': [f'P{str(i).zfill(3)}' for i in range(1, 101)],
    'Price': np.random.choice([np.nan, 10, 15, 20, 25], 100, p=[0.1, 0.3, 0.3, 0.2, 0.1]),
    'Quantity': np.random.choice([np.nan, 1, 5, 10], 100, p=[0.2, 0.5, 0.2, 0.1]),
    'Discount': np.random.choice([np.nan, 0, 5, 10], 100, p=[0.3, 0.4, 0.2, 0.1]),
    'Revenue': np.random.normal(1000, 250, 100)
}
df = pd.DataFrame(data)
# Display the first few rows of the dataset
print("Initial DataFrame with Missing Values:")
print(df.head(), "\n")
# Detect missing values using isnull() and notnull()
missing_values_count = df.isnull().sum()
print("Missing Values Count Per Column:")
print(missing_values_count, "\n")
The script above turns raw data into a visual and numerical map of missingness, enabling informed decisions on how to address and manage gaps in the dataset. This approach is essential for ensuring data quality before further analysis or modeling.
Sample Output
Discount     33.0
Revenue       0.0
dtype: float64
Total missing values in the dataset: 61
Number of Filled Values Per Column:
ProductID    100
Price         88
Quantity      84
Discount      67
Revenue      100
dtype: int64
The heatmap above provides a graphical representation of where missing data exists in the dataset, highlighting gaps across columns. It allows us to quickly identify patterns, such as columns that frequently have missing values (e.g., 'Price' and 'Discount'), or whether missingness is more prevalent in specific parts of the dataset. Such insights are crucial for understanding biases or issues in data collection, enabling targeted strategies like imputing missing values or investigating the reasons behind these gaps.
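The code that draws the heatmap is not reproduced in this excerpt. As a minimal sketch of how such a missingness heatmap is typically built with Seaborn (the styling here is illustrative), one could plot the boolean mask returned by isnull():

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=False, yticklabels=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.xlabel('Column')
plt.show()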
Python Script
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set up Seaborn theme for visualizations
sns.set_theme(style="whitegrid")
# Generate a dataset for the sales analysis case study
np.random.seed(42)
# Simulate data for 1 year (365 days) for 5 products
date_range = pd.date_range(start='2023-01-01', periods=365, freq='D')
products = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']
seasonal_effects = np.sin(np.linspace(0, 2 * np.pi, 365))  # Simulate seasonal trends
# Create a sales dataset
data = []
for product in products:
    base_sales = np.random.randint(50, 100)  # Base sales level for each product
    sales = base_sales + (seasonal_effects * base_sales * np.random.uniform(0.1, 0.3)) + np.random.normal(0, 10, 365)
    for i, date in enumerate(date_range):
        data.append([date, product, max(int(sales[i]), 0)])  # Ensure sales are non-negative
# Create a DataFrame
df = pd.DataFrame(data, columns=['Date','Product', 'Sales'])
# Display the first few rows of the dataset
print("Sales Data Sample:")
print(df.head(), "\n")
Prerequisites
Before proceeding, make sure you have the following libraries installed:
pip install requests pandas matplotlib statsmodels
import requests
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from datetime import datetime
# Set up API parameters (example uses the OpenWeatherMap API)
API_KEY = 'your_api_key_here'  # Replace with your API key
CITY = 'San Francisco'
BASE_URL = 'http://api.openweathermap.org/data/2.5/onecall/timemachine'
LAT = '37.7749'    # Latitude for San Francisco
LON = '-122.4194'  # Longitude for San Francisco
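The request loop that builds the DataFrame is not reproduced in this excerpt. The following is a rough sketch only: the exact fields returned by the timemachine endpoint may differ from what is assumed here, so treat the parsing (the 'current' block and its 'temp' field) as an assumption for illustration.

# Fetch daily temperature history for the past 30 days (illustrative sketch)
records = []
for days_back in range(30, 0, -1):
    timestamp = int((datetime.now() - pd.Timedelta(days=days_back)).timestamp())
    params = {'lat': LAT, 'lon': LON, 'dt': timestamp, 'appid': API_KEY, 'units': 'metric'}
    response = requests.get(BASE_URL, params=params)
    payload = response.json()
    # Assumption: the response carries a 'current' block with a 'temp' field
    records.append({'Date': datetime.fromtimestamp(timestamp).date(),
                    'Temperature': payload['current']['temp']})
df = pd.DataFrame(records)
df['Date'] = pd.to_datetime(df['Date'])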
df.set_index('Date', inplace=True)
# Display the first few rows of the data
print("Weather Data Sample:")
print(df.head(), "\n")
# Plot the time series data
plt.figure(figsize=(10, 6))
plt.plot(df.index, df['Temperature'], marker='o')
plt.title('Temperature Over the Past 30 Days')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.show()
# Decompose the time series to identify trend, seasonal components, and residuals
decomposition = sm.tsa.seasonal_decompose(df['Temperature'], model='additive', period=7)
# Plot the decomposed components
plt.figure(figsize=(12, 8))
plt.subplot(411)
plt.plot(decomposition.observed,label='Observed')
plt.legend(loc='upper left')
plt.subplot(412)
plt.plot(decomposition.trend, label='Trend')
plt.legend(loc='upper left')
plt.subplot(413)
plt.plot(decomposition.seasonal,label='Seasonal')
plt.legend(loc='upper left')
plt.subplot(414)
plt.plot(decomposition.resid, label='Residual')
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()
# Analysis
print("\nAnalysis:")
print("The time series analysisreveals different components:")
print("- The 'Trend' componentshows the long-term direction of tem-
perature changes.")
print("- The 'Seasonal' componentidentifies repeating patterns over a
weekly cycle.")
print("- The 'Residual' componentshows random fluctuations that
are not explained by the trend or seasonality.")
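For reference, the additive model used by seasonal_decompose expresses each observation as the sum of these parts: y_t = T_t + S_t + R_t, where T_t is the trend, S_t the seasonal component (here with period 7, i.e., a weekly cycle), and R_t the residual.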
Python Script
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score  # Needed for the R^2 scores printed below
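The data-generation and model-fitting steps are not reproduced in this excerpt; the plotting code below expects X_test, y_test, and the four prediction arrays. A minimal sketch that produces them (the synthetic data and hyperparameters are illustrative assumptions, not the original's) might look like this:

# Synthetic one-feature regression problem (illustrative)
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 200)).reshape(-1, 1)
y = 2.5 * X.ravel() + 5 * np.sin(X.ravel()) + np.random.normal(0, 2, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Linear regression
y_pred_linear = LinearRegression().fit(X_train, y_train).predict(X_test)

# Ridge regression
y_pred_ridge = Ridge(alpha=1.0).fit(X_train, y_train).predict(X_test)

# Lasso regression
y_pred_lasso = Lasso(alpha=0.1).fit(X_train, y_train).predict(X_test)

# Polynomial regression (degree 3) via polynomial features + linear model
poly = PolynomialFeatures(degree=3)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
y_pred_poly = LinearRegression().fit(X_train_poly, y_train).predict(X_test_poly)
# Note: sorting X_test (and the predictions) by feature value gives smoother line plots.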
plt.figure(figsize=(12, 8))
plt.scatter(X_test, y_test, color='blue',label='Test Data')
# Plotting predictions from each model
plt.plot(X_test, y_pred_linear, color='green', label='Linear Regression')
plt.plot(X_test, y_pred_ridge, color='red', label='Ridge Regression')
plt.scatter(X_test, y_pred_poly, color='orange', label='Polynomial Regression (Degree 3)', alpha=0.6)
plt.plot(X_test, y_pred_lasso, color='purple', label='Lasso Regression')
plt.title('Comparison of Regression Techniques')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
# Model performance metrics
print("Linear Regression R^2Score:", r2_score(y_test, y_pred_lin-
ear))
print("Ridge Regression R^2Score:", r2_score(y_test, y_pred_ridge))
print("Polynomial Regression R^2Score:", r2_score(y_test,
y_pred_poly))
print("Lasso Regression R^2Score:", r2_score(y_test, y_pred_lasso))
Prerequisites
To run the code, make sure you have the required libraries (dash, pandas, and plotly) installed:
pip install dash pandas plotly
After running the script, open your browser at http://127.0.0.1:8050/ to interact with the dashboard. This interactive setup allows users to explore financial data visually and uncover insights through dynamic filtering and zoom functionality.
Python Script
import dash
from dash import dcc, html
from dash.dependencies import Input,Output
import pandas as pd
import plotly.express as px
# Initialize the Dash app
app = dash.Dash(__name__)
# Sample financial dataset (creating synthetic data for demonstration purposes)
# In a real-world scenario, you could use an API like Yahoo Finance or read from a CSV file.
dates = pd.date_range(start='2022-01-01',periods=100)
data = {
    'Date': dates,
    'Stock Price': 100 + (pd.Series(range(100)) * 0.5) + (pd.Series(range(100)).apply(lambda x: 5 * (x % 5 == 0))),
    'Volume': (pd.Series(range(100)) * 1000) + (pd.Series(range(100)).apply(lambda x: 5000 * (x % 10 == 0))),
    'Market Cap': (pd.Series(range(100)) * 2000) + (pd.Series(range(100)).apply(lambda x: 10000 * (x % 3 == 0))),
}
df = pd.DataFrame(data)
# App layout
app.layout = html.Div([
    html.H1("Financial Dashboard", style={'text-align': 'center'}),

    # Scatter plot for correlations (e.g., between Volume and Stock Price)
    dcc.Graph(id='scatter-plot'),

    # Line chart for temporal patterns
    dcc.Graph(id='line-chart'),

    # Slider for filtering date range
    html.Div([
        dcc.RangeSlider(
            id='date-slider',
            min=0,
            max=len(df) - 1,
            value=[0, len(df) - 1],
            marks={i: str(date.date()) for i, date in enumerate(df['Date']) if i % 10 == 0},
            step=1
        )
    ], style={'margin': '40px'})
])
# Callback for updating the scatterplot based on date range
@app.callback(
Output('scatter-plot', 'figure'),
[Input('date-slider', 'value')]
)
def update_scatter(date_range):
    filtered_df = df.iloc[date_range[0]:date_range[1] + 1]  # +1 so the slider's upper endpoint is included
    fig = px.scatter(
        filtered_df,
        x='Volume',
        y='Stock Price',
        size='Market Cap',
        hover_data={'Date': filtered_df['Date'], 'Volume': filtered_df['Volume'], 'Stock Price': filtered_df['Stock Price']},
        title="Volume vs. Stock Price Correlation"
    )
    return fig
# Callback for updating the line chartbased on date range
@app.callback(
Output('line-chart', 'figure'),
[Input('date-slider', 'value')]
)
def update_line_chart(date_range):
    filtered_df = df.iloc[date_range[0]:date_range[1] + 1]  # +1 so the slider's upper endpoint is included
    fig = px.line(
        filtered_df,
        x='Date',
        y='Stock Price',
        title="Stock Price Over Time"
    )
    fig.update_xaxes(rangeslider_visible=True)
    return fig
# Run the app
if __name__ == '__main__':
    app.run_server(debug=True)
Python Script
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set the theme for seaborn visualizations
sns.set_theme(style="whitegrid")
# Generate a synthetic dataset for demonstration purposes
np.random.seed(42)
# Creating a dataset with 3 columns representing different types of financial data
data = {
    'Revenue': np.random.normal(50000, 15000, 1000),   # Normally distributed revenue
    'Expenses': np.random.normal(30000, 8000, 1000),   # Normally distributed expenses
    'Profit': np.random.normal(20000, 5000, 1000)      # Normally distributed profit
}
# Create a DataFrame
df = pd.DataFrame(data)
# Display the first few rows of the dataset
print("Dataset Sample:")
print(df.head(), "\n")
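The visualization that follows this setup is not reproduced in this excerpt. Since the script is built around three normally distributed financial columns, one plausible continuation, offered here purely as an illustrative sketch, is a set of distribution plots:

# Illustrative continuation: distribution plots for each financial column
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, column in zip(axes, ['Revenue', 'Expenses', 'Profit']):
    sns.histplot(df[column], kde=True, ax=ax, color='steelblue')
    ax.set_title(f'Distribution of {column}')
plt.tight_layout()
plt.show()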
Python Script
import pandas as pd
import openpyxl
import numpy as np
import matplotlib.pyplot as plt
consolidated_df['Month'] = pd.to_datetime(consolidated_df['Date']).dt.to_period('M')
monthly_sales = consolidated_df.groupby('Month')['Sales'].sum()
# Plotting the monthly sales
plt.figure(figsize=(10, 6))
monthly_sales.plot(kind='bar', color='skyblue')
plt.title('Total Monthly Sales')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Function to write the report back to an Excel file
def write_report_to_excel(consolidated_df, pivot_table):
    # The ExcelWriter context manager saves the file on exit, so no explicit save() call is needed
    with pd.ExcelWriter('consolidated_sales_report.xlsx', engine='openpyxl') as writer:
        consolidated_df.to_excel(writer, sheet_name='Consolidated Data', index=False)
        pivot_table.to_excel(writer, sheet_name='Pivot Table')
# Assuming each sales_data_X.xlsx file contains columns: ['Date', 'Region', 'Product', 'Sales']
# For demonstration, I will create the dummy Excel files with random data
def create_dummy_excel_files(file_paths):
    for file_path in file_paths:
        # Generating random sales data
        data = {
            'Date': pd.date_range(start='2023-01-01', periods=30),
            'Region': ['North', 'South', 'East', 'West'] * 7 + ['North', 'South'],
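            # The remainder of this helper is not reproduced in this excerpt.
            # A plausible completion, given the columns named above (the exact
            # values below are illustrative assumptions, not the original's):
            'Product': np.random.choice(['Product A', 'Product B', 'Product C'], 30),
            'Sales': np.random.randint(100, 1000, 30)
        }
        pd.DataFrame(data).to_excel(file_path, index=False)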
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset (assuming it's in CSV format)
df = pd.read_csv('scientific_data.csv')
# Generate visualizations
# 1. Line plot for temperature over time
plt.figure(figsize=(10, 6))
sns.lineplot(x='Date', y='Temperature', data=df)
plt.title('Temperature Over Time')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.savefig('temperature_plot.png')
plt.close()
# 2. Line plot for humidity over time
plt.figure(figsize=(10, 6))
sns.lineplot(x='Date', y='Humidity', data=df)
plt.title('Humidity Over Time')
plt.xlabel('Date')
plt.ylabel('Humidity (%)')
plt.savefig('humidity_plot.png')
plt.close()
# 3. Summary statistics
summary_stats = df.describe()
summary_stats.to_csv('summary_stats.csv')
This script reads the summary statistics and replaces the placeholder in the LaTeX template. It then compiles the filled LaTeX document into a PDF (scientific_report.pdf).
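That template-filling script is not reproduced in this excerpt. A minimal sketch of the idea (the template filename and placeholder token below are assumptions for illustration, not the original's) might look like this:

import subprocess
import pandas as pd

# Read the previously exported summary statistics
summary_stats = pd.read_csv('summary_stats.csv', index_col=0)

# Replace a placeholder in the LaTeX template with a LaTeX-formatted table
# ('report_template.tex' and '%%SUMMARY_TABLE%%' are illustrative assumptions)
with open('report_template.tex', 'r') as template_file:
    template = template_file.read()
filled = template.replace('%%SUMMARY_TABLE%%', summary_stats.to_latex())
with open('scientific_report.tex', 'w') as report_file:
    report_file.write(filled)

# Compile the filled LaTeX document into scientific_report.pdf
subprocess.run(['pdflatex', 'scientific_report.tex'], check=True)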
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

# SMTP configuration (the opening lines of this script fall on an earlier page;
# the imports above and the server settings below are reconstructed placeholders)
smtp_server = 'smtp.example.com'
smtp_port = 587
smtp_user = '[email protected]'
smtp_password = 'your_password'
# List of recipients
recipients = ['[email protected]','[email protected]']
# Create email message
msg = MIMEMultipart()
msg['From'] = smtp_user
msg['Subject'] = 'Automated ScientificReport'
body = 'Please find attached the scientific report generated automatically.'
msg.attach(MIMEText(body, 'plain'))
# Attach the PDF report
with open('scientific_report.pdf', 'rb') as file:
    report_attachment = MIMEApplication(file.read(), _subtype='pdf')
    report_attachment.add_header('Content-Disposition', 'attachment', filename='scientific_report.pdf')
msg.attach(report_attachment)
# Connect to SMTP server and send the email
with smtplib.SMTP(smtp_server, smtp_port) as server:
    server.starttls()
    server.login(smtp_user, smtp_password)
    for recipient in recipients:
        del msg['To']  # Remove any previous 'To' header so it is not duplicated across iterations
        msg['To'] = recipient
        server.sendmail(smtp_user, recipient, msg.as_string())
This script configures the SMTP server and sends the generated PDF report to the listed recipients. Make sure to handle SMTP credentials securely and customize the email configuration for your setup.
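For example, one common way to keep credentials out of the source file (a short sketch; the environment variable names are arbitrary) is to read them from environment variables:

import os

# Read SMTP credentials from environment variables instead of hard-coding them
smtp_user = os.environ['SMTP_USER']
smtp_password = os.environ['SMTP_PASSWORD']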