
A SUMMER INTERNSHIP REPORT

On

Data Science and Machine Learning using PYTHON


(Financial Analysis using PYTHON)

Submitted in partial fulfillment


of

B. Tech.

in

ELECTRONICS AND COMMUNICATION ENGINEERING

SUBMITTED BY:

Manoj Samanta
(2200270310107)

Vth Sem - IIIrd Year


Section- ECE-2

SUBMITTED TO:

Dr. Dushyant Chauhan


(Associate Professor)
ECE Department

Ajay Kumar Garg Engineering College, Ghaziabad


27th Km Milestone, Delhi-Meerut Expressway, P.O. Adhyatmik Nagar, Ghaziabad-201009
Dr. A. P. J. Abdul Kalam Technical University, Lucknow
December 2024

Acknowledgement

I want to express my sincere gratitude and thanks to Prof. (Dr.) Neelesh Kumar Gupta (HoD, ECE Department), Ajay Kumar Garg Engineering College, Ghaziabad, for granting me permission for my industrial training in the field of “Data Science and Machine Learning using PYTHON”.

I express my sincere thanks to Asso. Prof. (Dr.) Dushyant Chauhan for his cooperative attitude and consistent guidance, which enabled me to complete my training successfully.

Finally, I extend my thankful regards and gratitude to the team members and technicians of “Training Resource Company/ Organization” and Ajay Kumar Garg Engineering College, Ghaziabad, for their valuable help, support and guidance.

Manoj Samanta
2200270310107
Vth Sem - IIIrd Year
Section- ECE-2

TABLE OF CONTENTS

Chapter 1. Introduction to PYTHON
    1.1 Introduction
    1.2 Features of PYTHON

Chapter 2. Data Science using PYTHON
    2.1 Introduction
    2.2 Data Science Libraries in PYTHON

Chapter 3. Machine Learning using PYTHON
    3.1 Introduction
    3.2 Machine Learning Libraries in PYTHON

Chapter 4. Project Description
    4.1 Introduction to the Financial Analysis Project
    4.2 Code Overview

References

CHAPTER- 1
INTRODUCTION TO PYTHON

1.1 Introduction
Python is a high-level, general-purpose programming language. Its design philosophy
emphasizes code readability with the use of significant indentation.

Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming. It is often described as a "batteries included" language due to its comprehensive standard library.

Guido van Rossum began working on Python in the late 1980s as a successor to
the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was
released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-
compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of
Python 2.

Python consistently ranks as one of the most popular programming languages, and has gained
widespread use in the machine learning community.

1.2 Features of PYTHON

Here's a brief overview of Python's key features:

1. Easy to Learn and Read

Python has a clean, straightforward syntax that resembles natural language, making it excellent for
beginners and experienced programmers alike. Its code is highly readable and emphasizes clarity.

2. Interpreted Language

Python is an interpreted language, which means the code is executed line by line. This allows for
easier debugging and eliminates the need for compilation, making the development process faster.

3. Dynamically Typed

Unlike languages that require explicit type declarations, Python allows variables to change types
dynamically. You can assign an integer to a variable and later assign a string without any additional
declarations.
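For instance, the same variable name can be rebound to values of different types:

x = 42            # x holds an integer
x = "hello"       # the same name now holds a string
x = [1, 2, 3]     # ...and now a list, with no type declarations needed
print(type(x))    # <class 'list'>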

4. Versatile and Multi-Paradigm

Python supports multiple programming paradigms, including:

- Object-Oriented Programming

- Functional Programming

- Procedural Programming

- Aspect-Oriented Programming

5. Extensive Standard Library

Python comes with a comprehensive standard library that provides modules and packages for many
common programming tasks, reducing the need to write code from scratch.

6. Cross-Platform Compatibility

Python code can run on multiple platforms (Windows, macOS, Linux) with minimal or no
modifications, making it highly portable.

7. Strong Community and Ecosystem

Python has a large, active community and an extensive ecosystem of third-party libraries and
frameworks for web development, data science, machine learning, scientific computing, and more.

8. High-Level Language

Python abstracts many complex programming details, allowing developers to focus on solving
problems rather than managing low-level implementation details.

9. Automatic Memory Management

Python features automatic memory management through garbage collection, which helps
developers avoid manual memory allocation and de-allocation.

10. Extensible and Embeddable

You can write Python modules in other languages like C or C++, and Python can be embedded into
applications as a scripting interface.

These features make Python a powerful, flexible, and popular programming language across various
domains, from web development to artificial intelligence.

CHAPTER- 2
DATA SCIENCE USING PYTHON

2.1 Introduction
Data science draws on a core set of analytical techniques; the most widely used are summarized below.

1. Descriptive Statistics: Descriptive statistics summarize and describe the main characteristics of a dataset. Techniques include calculating the mean, median, mode, and standard deviation, and creating visualizations that reveal patterns in the data. These methods provide initial insights into the dataset's basic properties and distribution.

2. Data Pre-processing: This crucial technique involves preparing raw data for analysis by:

- Handling missing values
- Removing duplicates
- Normalizing or scaling data
- Encoding categorical variables
- Dealing with outliers

These steps ensure data quality and prepare the data for more advanced analysis techniques, as the sketch below illustrates.
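As a minimal illustration (the dataset below is made up for the example), a few of these steps in Pandas:

import pandas as pd

# A small, hypothetical dataset with the usual problems
df = pd.DataFrame({
    "age":  [25, None, 47, 25, 200],   # a missing value and an outlier
    "city": ["Delhi", "Pune", "Delhi", "Delhi", "Pune"],
})

df = df.drop_duplicates()                              # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())       # handle missing values
df = df[df["age"] < 120]                               # deal with an obvious outlier
df["city"] = df["city"].astype("category").cat.codes   # encode the categorical column
print(df)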

3. Regression Analysis: Regression techniques predict continuous numerical outcomes by establishing relationships between variables. Major types include:

- Linear Regression
- Logistic Regression
- Polynomial Regression
- Multiple Regression

These methods are fundamental for understanding how different variables interact and for predicting numerical outcomes; a minimal sketch follows.
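As a minimal sketch (synthetic data, illustrative only), fitting a linear regression with scikit-learn:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y is roughly 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 2 + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # close to [3.] and 2
print(model.predict([[5.0]]))          # roughly 17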

4. Classification Techniques: Classification algorithms predict categorical outcomes by assigning data points to predefined classes. Key methods include:

- Decision Trees
- Random Forest
- Support Vector Machines (SVM)
- K-Nearest Neighbors
- Naive Bayes

These techniques are essential for tasks like spam detection, image recognition, and customer segmentation; a short example follows.
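A short example (a decision tree on scikit-learn's built-in iris dataset; the depth and split ratio are arbitrary choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))   # held-out accuracy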

5. Clustering: Clustering techniques group similar data points together without predefined labels. Popular methods include:

- K-Means Clustering
- Hierarchical Clustering
- DBSCAN
- Gaussian Mixture Models

These techniques help identify natural groupings within complex datasets, as the sketch below shows.
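A K-Means sketch on synthetic data (the number of clusters is known here only because we generated the data):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groupings
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # one centre per discovered cluster
print(km.labels_[:10])       # cluster assignments for the first 10 points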

6. Dimensionality Reduction: These techniques reduce the number of variables under consideration by deriving a smaller set of principal variables. Important methods include:

- Principal Component Analysis (PCA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Linear Discriminant Analysis

These help simplify data visualization and reduce computational complexity; a PCA sketch follows.
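A PCA sketch, reducing the four iris features to two principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)   # 150 samples, 4 features each
pca = PCA(n_components=2)           # project onto the top 2 principal axes
X2 = pca.fit_transform(X)

print(X2.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)  # variance captured by each component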

7. Time Series Analysis: Specialized techniques for analyzing data points collected over time, including:

- ARIMA modeling
- Exponential smoothing
- Seasonal decomposition
- Forecasting methods

These are crucial for analyzing trends in financial, economic, and scientific data; a small forecasting sketch follows.
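A small forecasting sketch with statsmodels (the synthetic series and the ARIMA order are illustrative choices):

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic monthly series: a linear trend plus noise
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
series = pd.Series(np.arange(48) + np.random.normal(0, 2, 48), index=idx)

model = ARIMA(series, order=(1, 1, 1)).fit()
print(model.forecast(steps=6))   # forecast six months ahead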

8. Machine Learning Techniques: Advanced algorithms that enable systems to learn and improve from experience:

- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
- Deep Learning Neural Networks

These techniques power complex predictive models and artificial intelligence applications.

9. Ensemble Methods: Techniques that combine multiple machine learning models to improve predictive performance:

- Bagging
- Boosting
- Stacking

These methods often provide more robust and accurate predictions than individual models.

10. Feature Engineering: The process of creating new features or transforming existing ones to improve model performance:

- Feature selection
- Feature extraction
- Creating interaction terms
- Polynomial features

Each of these techniques plays a critical role in extracting insights, making predictions, and
understanding complex datasets across various domains like business, science, healthcare, and
technology.

The choice of technique depends on the specific problem, data characteristics, and desired outcomes.
Successful data science often involves combining multiple techniques and iterative refinement of
analytical approaches.

2.2 Data Science Libraries in PYTHON

1. NumPy: NumPy is the fundamental package for scientific computing in Python. It provides:

- An N-dimensional array object (ndarray)
- Tools for mathematical and numerical operations
- Linear algebra functions
- Fourier transforms
- Random number capabilities
- Efficient storage and manipulation of large, multi-dimensional arrays

NumPy serves as the foundation for most scientific computing and data science libraries in Python; a short example follows.
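A short example of these capabilities:

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # a 2x2 ndarray
print(a.mean(), a.T)                      # aggregate statistics and transpose
print(a @ np.linalg.inv(a))               # linear algebra: approximately the identity
print(np.random.default_rng(0).normal(size=3))   # random number generation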

2. Pandas: Pandas is the premier data manipulation library in Python. Key features include:

- DataFrame and Series data structures
- Powerful data loading capabilities (CSV, Excel, databases)
- Data cleaning and preprocessing tools
- Advanced indexing and selection
- Time series functionality
- Data aggregation and grouping
- Handling of missing data

Pandas makes data analysis and manipulation efficient and intuitive, as the brief example below shows.
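A brief example (made-up marks data) of cleaning and grouping:

import pandas as pd

df = pd.DataFrame({
    "dept":  ["ECE", "CSE", "ECE", "CSE"],
    "marks": [78, 85, 91, None],
})

df["marks"] = df["marks"].fillna(df["marks"].mean())   # handle missing data
print(df.groupby("dept")["marks"].mean())              # aggregation and grouping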

3. Seaborn: Seaborn is a statistical data visualization library built on top of Matplotlib. Its highlights are:

- Statistical graphics and data visualization
- Attractive default plotting styles
- Easy creation of complex statistical plots
- Integration with Pandas DataFrames
- Advanced plot types such as violin plots, box plots, and regression plots

Seaborn simplifies the process of creating informative and attractive statistical graphics; a one-plot example follows.
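A one-plot example using Seaborn's built-in sample dataset (downloading it assumes an internet connection):

import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")   # built-in sample dataset
sns.violinplot(data=tips, x="day", y="total_bill")
plt.show()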

4. Matplotlib: Matplotlib is the most widely used plotting library in Python. It provides:

- Comprehensive 2D plotting capabilities
- Multiple plot types (line, scatter, bar, histogram)
- Highly customizable visualization options
- Publication-quality figure generation
- Object-oriented and MATLAB-like plotting interfaces
- Support for various output formats

Matplotlib forms the foundation for most Python visualization libraries; a minimal example follows.
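A minimal example using the object-oriented interface:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots(figsize=(6, 3))   # object-oriented interface
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
fig.savefig("sine.png", dpi=150)         # publication-quality output file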

5. Plotly: Plotly is an interactive visualization library that offers:

- Web-based, interactive plots
- A wide range of chart types
- Easy creation of dashboards
- Support for 3D visualizations
- The ability to create both static and dynamic graphics

Plotly is excellent for creating interactive and shareable visualizations; a short example follows.
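A short example with Plotly Express (the gapminder sample dataset ships with Plotly):

import plotly.express as px

df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp", size="pop",
                 color="continent", log_x=True, hover_name="country")
fig.show()   # opens an interactive, web-based plot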

CHAPTER- 3
MACHINE LEARNING USING PYTHON
3.1 Introduction
The machine learning algorithms most commonly used in practice fall into the following families.

1. Supervised Learning Algorithms

a) Regression Algorithms

- Linear Regression
  - Predicts continuous numerical outcomes
  - Establishes a linear relationship between variables
  - Simple and interpretable
  - Best for linear relationships
- Logistic Regression
  - Used for binary classification problems
  - Predicts the probability of an outcome
  - Works well with linearly separable classes
  - Provides a probabilistic interpretation
- Polynomial Regression
  - Handles non-linear relationships
  - Introduces polynomial terms to linear regression
  - Captures more complex data patterns

b) Classification Algorithms

- Decision Trees
  - Create a tree-like model of decisions
  - Split data based on feature conditions
  - Highly interpretable
  - Can handle both numerical and categorical data
- Random Forest
  - Ensemble of multiple decision trees
  - Reduces overfitting
  - Handles complex datasets
  - Provides feature importance
- Support Vector Machines (SVM)
  - Create an optimal decision boundary
  - Work well in high-dimensional spaces
  - Effective for both linear and non-linear classification
  - Handle complex classification tasks
- K-Nearest Neighbors (KNN)
  - Classifies based on the nearest data points
  - Non-parametric algorithm
  - Simple and intuitive
  - Requires careful feature scaling

A worked example of this supervised workflow follows below.
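Here is a minimal, illustrative sketch (scikit-learn's built-in breast cancer dataset; the split ratio and k = 5 are arbitrary choices), including the feature scaling that KNN requires:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# KNN is distance-based, so features are standardised first
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # test-set accuracy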

2. Unsupervised Learning Algorithms

a) Clustering Algorithms

- K-Means Clustering
  - Divides data into K clusters
  - Minimizes within-cluster variance
  - Simple and efficient
  - Works best with spherical clusters
- Hierarchical Clustering
  - Creates a nested cluster hierarchy
  - Can be agglomerative or divisive
  - Doesn't require a predefined number of clusters
  - Can be visualized through a dendrogram
- DBSCAN
  - Density-based spatial clustering
  - Handles arbitrarily shaped clusters
  - Robust to noise and outliers
  - Doesn't require pre-specifying the number of clusters

b) Dimensionality Reduction

- Principal Component Analysis (PCA)
  - Reduces feature dimensions
  - Preserves maximum variance
  - Eliminates multicollinearity
  - Improves model performance
- t-SNE
  - Non-linear dimensionality reduction
  - Excellent for visualization
  - Preserves local data structures
  - Works well with high-dimensional data

3. Advanced Machine Learning Algorithms

a) Ensemble Methods

- Gradient Boosting
  - Builds sequential weak learners
  - Corrects previous models' errors
  - High predictive performance
  - Examples: XGBoost, LightGBM
- Bagging
  - Reduces variance through bootstrap aggregation
  - Creates multiple model versions
  - Improves stability and accuracy

A gradient-boosting sketch is shown below.
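As a brief illustration of boosting (using scikit-learn's GradientBoostingClassifier on synthetic data, rather than XGBoost or LightGBM):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Sequential weak learners, each correcting its predecessors' errors
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
print(cross_val_score(gb, X, y, cv=5).mean())   # mean 5-fold accuracy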

b) Advanced Techniques

- Neural Networks
  - Mimic the structure of the human brain
  - Learn complex non-linear relationships
  - Handle high-dimensional data
  - Form the basis for deep learning
- Random Forest
  - Combines multiple decision trees
  - Reduces overfitting
  - Handles complex datasets
  - Provides feature importance

4. Probabilistic Algorithms

- Naive Bayes
  - Probabilistic classification
  - Based on Bayes' theorem
  - Works well with high-dimensional data
  - Fast and simple

5. Reinforcement Learning Algorithms

- Q-Learning
  - Learns an optimal action-selection strategy
  - Used in decision-making scenarios
  - Balances exploration and exploitation
  - Common in game theory and robotics

6. Deep Learning Algorithms

- Convolutional Neural Networks (CNN)
  - Specialized for image processing
  - Automatic feature extraction
  - Used in computer vision
  - Handle spatial hierarchies
- Recurrent Neural Networks (RNN)
  - Process sequential data
  - Maintain an internal memory
  - Used in time series and natural language processing
  - Handle variable-length inputs

Key Considerations for Algorithm Selection:

- Dataset size and characteristics
- Problem type (classification, regression)
- Computational resources
- Interpretability requirements
- Desired model complexity

Each algorithm has strengths and weaknesses, and the best choice depends on specific use cases, data
characteristics, and desired outcomes.

3.2 Machine Learning Libraries in PYTHON

1. Scikit-Learn: Scikit-Learn is the primary machine learning library in Python. Features include:

- Comprehensive machine learning algorithms
- Classification and regression techniques
- Clustering methods
- Model selection and evaluation tools
- Preprocessing and feature engineering
- Cross-validation techniques
- Dimensionality reduction

Scikit-Learn provides a consistent interface for machine learning tasks, as the short example below illustrates.
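A small sketch of that consistent interface (the estimator, parameter grid, and dataset are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The same fit/predict interface works across estimators;
# GridSearchCV combines model selection with cross-validation
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)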

2. TensorFlow: TensorFlow is an open-source machine learning platform developed by Google. It offers:

- Deep learning and neural network development
- Flexible numerical computation
- GPU and TPU acceleration
- Production-ready machine learning models
- Keras high-level API integration
- Support for both research and production environments
- An extensive ecosystem for machine learning and artificial intelligence

A minimal Keras example follows.
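A minimal Keras sketch (a toy one-neuron model fitted to y = 2x - 1; purely illustrative):

import numpy as np
import tensorflow as tf

# Toy data for the line y = 2x - 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2 * X - 1

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(X, y, epochs=200, verbose=0)
print(model.predict(np.array([[10.0]])))   # close to 19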

3. PySpark: PySpark is the Python API for Apache Spark, a distributed computing framework. Key capabilities include:

- Large-scale data processing
- Distributed computing
- Machine learning at scale (MLlib)
- Streaming data processing
- SQL and DataFrame operations
- Integration with big data technologies
- Parallel computing across clusters

PySpark is essential for big data processing and analytics on large distributed systems; a small example follows.
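A small illustrative sketch (assuming a local Spark installation; the toy prices are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.createDataFrame(
    [("BAC", 47.08), ("C", 492.90), ("BAC", 46.58)],
    ["ticker", "close"],
)
df.groupBy("ticker").agg(F.avg("close").alias("avg_close")).show()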

CHAPTER- 4
PROJECT DESCRIPTION
4.1 Introduction to the Financial Analysis Project
This comprehensive financial analysis project leverages Python's powerful data science libraries
to evaluate investment portfolio performance. Using Pandas, the project imports and preprocesses
historical stock price data from multiple sources, cleaning and transforming financial datasets
with advanced data manipulation techniques.

NumPy enables complex numerical computations, calculating key financial metrics like returns,
volatility, and risk-adjusted performance. The project implements sophisticated statistical
calculations to assess portfolio efficiency and compare different investment strategies.

Matplotlib and Seaborn are used to create clear and informative visualizations. These
libraries generate comprehensive charts, including:

 Time series price movements

 Portfolio allocation pie charts

 Comparative performance line graphs

The analysis provides investors with deep insights into portfolio composition, historical
performance, and potential future trends. Advanced statistical techniques and machine learning
algorithms can be integrated to enhance predictive capabilities, offering data-driven investment
recommendations.

Ultimately, the project demonstrates how Python's data science ecosystem can transform raw
financial data into actionable investment intelligence.

4.2 Code Overview

Importing modules

from pandas_datareader import data
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

We need to get data using pandas-datareader. We will get stock information for the following banks:

* Bank of America
* CitiGroup
* Goldman Sachs
* JPMorgan Chase
* Morgan Stanley
* Wells Fargo

Setting the start and end dates

start = datetime.datetime(2006, 1, 1)
end = datetime.datetime(2016, 1, 1)

Gathering data about the banks and converting it into dataframes (note: the 'google' data source used here has since been discontinued in pandas-datareader; a source such as 'stooq' can be substituted)

BAC = data.DataReader("BAC", 'google', start, end)
C = data.DataReader("C", 'google', start, end)
GS = data.DataReader("GS", 'google', start, end)
JPM = data.DataReader("JPM", 'google', start, end)
MS = data.DataReader("MS", 'google', start, end)
WFC = data.DataReader("WFC", 'google', start, end)

Alternatively, creating a single dataframe by fetching all the tickers in one call

df = data.DataReader(['BAC', 'C', 'GS', 'JPM', 'MS', 'WFC'],'google', start, end)

Creating a list of Bank Tickers

tickers = ['BAC', 'C', 'GS', 'JPM', 'MS', 'WFC']

Concatenating the individual bank dataframes, using the tickers as top-level column names

bank_stocks = pd.concat([BAC, C, GS, JPM, MS, WFC],axis=1,keys=tickers)


bank_stocks.columns.names = ['Bank Ticker','Stock Info']
bank_stocks.head()

Now we can perform some Exploratory Data Analysis on this dataframe

Checking the highest close price for each of the bank stocks

bank_stocks.xs(key='Close',axis=1,level='Stock Info').max()

Creating a dataframe representing the returns of each bank stock

returns = pd.DataFrame()
for tick in tickers:
    returns[tick + ' Return'] = bank_stocks[tick]['Close'].pct_change()
returns.head()

Creating a pairplot of the returns dataframe using seaborn

import seaborn as sns
sns.pairplot(returns[1:])

Finding the date of the worst single-day return for each bank

returns.idxmin()

Finding the date of the best single-day return for each bank

returns.idxmax()

Calculating the standard deviation of the returns (a simple measure of volatility)

returns.std()

Creating a distribution plot using seaborn of the 2015 returns for Morgan Stanley

sns.displot(returns.loc['2015-01-01':'2015-12-31']['MS Return'], color='green', bins=100)

Creating a distribution plot using seaborn of the 2008 returns for CitiGroup

sns.displot(returns.loc['2008-01-01':'2008-12-31']['C Return'], color='red', bins=100)

Creating a line plot showing the Close price for each bank over the entire time index, using a for loop

for tick in tickers:
    bank_stocks[tick]['Close'].plot(figsize=(12,4), label=tick)
plt.legend()

Creating a line plot showing the Close price for each bank over the entire time index, using the cross-section (.xs) method

bank_stocks.xs(key='Close',axis=1,level='Stock Info').plot()

We can implement many other kinds of financial analysis on this data in a similar fashion and derive the insights we need from it.
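As one closing illustration (a sketch that reuses the returns dataframe and matplotlib import from the cells above), the cumulative growth of each bank stock over the whole period can be compared directly:

# Cumulative growth of 1 unit invested in each bank at the start of the period
cumulative = (1 + returns.dropna()).cumprod()
cumulative.plot(figsize=(12,4), title='Cumulative Returns, 2006-2016')
plt.legend()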

REFERENCES
[1] Python 3.13.1 Documentation – docs.python.org/3/

[2] NumPy Documentation – numpy.org/doc/

[3] Matplotlib 3.9.3 Documentation – matplotlib.org/stable/

[4] Seaborn Tutorial – seaborn.pydata.org
