
APPENDICES

APPENDIX 1

SYSTEM ENVIRONMENT

VIRTUAL-ENV & PYTHON


To set up a machine learning workflow for data analytics in Python, creating a virtual
environment is essential. A virtual environment isolates your project's dependencies,
ensuring that libraries and packages do not conflict with those of other projects or with
the system Python installation.

ENV FOR MACHINE LEARNING SETUP:

1. Curated Environments:
o Curated environments are preconfigured setups provided by platforms such as Azure Machine Learning. They come with a collection of essential Python packages and settings optimized for various machine learning frameworks, allowing for quick deployment and ease of use (a short sketch follows after this list).

2. User-Managed Environments:
o In user-managed environments, you have full control over the setup. You are responsible for installing all necessary packages and dependencies required for your machine learning projects. This setup can be customized to fit specific project needs and can include Bring Your Own Container (BYOC) options.

3. System-Managed Environments:
o System-managed environments use tools like Conda to manage the Python environment automatically. This approach simplifies dependency management by creating a new Conda environment from a specified requirements file, ensuring that all necessary libraries are installed and configured correctly.

4. Local Development Environments:
o Local development environments can be set up on personal computers or remote virtual machines (VMs). This involves creating Python virtual environments (using tools like venv or Anaconda) to isolate project dependencies, making it easier to manage different projects without conflicts.

5. Cloud-Based Environments:
o Cloud-based environments, such as those provided by Azure or AWS, offer scalable resources for machine learning tasks. They allow you to leverage powerful compute instances and storage while benefiting from integrated tools for data management and model deployment.
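
The environment types above mirror the terminology used by Azure Machine Learning. As a concrete illustration, the sketch below shows how each type can be declared with the azureml-core (v1) Python SDK; the workspace config file, the curated environment name, and the environment.yml path are placeholder assumptions rather than settings from this project.

# A minimal sketch, assuming the azureml-core (v1) SDK and a workspace config.json;
# the curated environment name and the conda spec path are placeholders.
from azureml.core import Environment, Workspace

ws = Workspace.from_config()

# 1. Curated environment: preconfigured by Azure ML and retrieved by name
curated_env = Environment.get(workspace=ws, name="AzureML-Minimal")

# 2. User-managed environment: you take full responsibility for dependencies
user_env = Environment("my-user-managed-env")
user_env.python.user_managed_dependencies = True

# 3. System-managed environment: built automatically from a Conda specification
system_env = Environment.from_conda_specification(
    name="my-conda-env", file_path="environment.yml"
)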

PYTHON ENVIRONMENTS AVAILABLE IN THE MARKET:

1. Virtual Environment (Virtual-Env)

➢ A virtual environment is an isolated workspace that allows you to manage dependencies for different Python projects separately. By creating a virtual environment, you can install packages without affecting the global Python installation or other projects. This is particularly useful when working on multiple projects that require different versions of libraries. You can create a virtual environment using the venv module or virtualenv and activate it to begin using it; a minimal sketch follows after this list.

2. Pipenv
➢ Pipenv is a dependency management tool that combines the functionality of pip and virtualenv. It automatically creates and manages a virtual environment for your projects and adds or removes packages in your Pipfile as needed. Pipenv simplifies dependency management and ensures that all package versions are compatible with one another, making it easier to maintain project environments.

3. Jupyter Notebook
➢ Jupyter Notebook is an interactive computing environment that allows you to create and share documents containing live code, equations, visualizations, and narrative text. It supports various programming languages, including Python, and is widely used for data analysis, visualization, and machine learning tasks. Jupyter Notebooks can be run inside a virtual environment to ensure that all required libraries are available for your analysis without conflicts with other projects.
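
As noted in option 1 above, the sketch below creates and populates an isolated environment using only the Python standard library. The environment name ml-env, the package list, and the POSIX pip path (on Windows it would be ml-env\Scripts\pip) are illustrative assumptions.

# A minimal sketch using only the standard library; names and packages are illustrative.
import subprocess
import venv

# Create ./ml-env with its own interpreter and a private pip
venv.create("ml-env", with_pip=True)

# Install project dependencies into the isolated environment (POSIX path shown)
subprocess.check_call(["ml-env/bin/pip", "install", "pandas", "scikit-learn"])

# For interactive work, activate the environment from a shell:  source ml-env/bin/activate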

VISUAL STUDIO CODE IN PRODUCTION ENVIRONMENTS

1. Integrated Development Environment (IDE) Capabilities:
VS Code offers a streamlined interface with support for debugging, task running, and version control, making it suitable for developing complex machine learning applications.

2. Support for Multiple Environments:
Users can easily create and manage virtual environments (using venv or conda) directly within VS Code. This allows for package isolation, ensuring that dependencies do not conflict across different projects.

3. Extensions and Customization:
The extensive marketplace of extensions enables users to tailor the editor to their specific needs, including support for various programming languages, tools for data visualization, and integration with cloud services such as Azure Machine Learning.

4. Interactive Notebooks:
VS Code supports Jupyter Notebooks, allowing data scientists to create and run interactive documents that combine code execution with rich text elements such as equations and visualizations. This is particularly useful for exploratory data analysis and model development.

5. Version Control Integration:
Built-in Git support allows seamless version control management, enabling teams to collaborate effectively on machine learning projects while tracking changes to code and configurations.

6. Remote Development:
VS Code provides features for remote development, allowing users to connect to remote servers or containers. This is beneficial for running heavy computations or accessing large datasets without relying on local resources.

7. Environment Management:
The Python extension in VS Code facilitates easy selection and switching between different Python interpreters and environments, streamlining the workflow when working on multiple projects.

8. Deployment Support:
VS Code can be integrated with deployment tools and services, enabling smooth transitions from development to production environments. This includes support for Docker containers and cloud deployments.

LARGE LANGUAGE MODELS:

LLaMA (Large Language Model Meta AI):


LLaMA, developed by Meta AI, is a family of state-of-the-art large language
models (LLMs) designed to facilitate advanced natural language processing
tasks. The LLaMA models range from 7 billion to 65 billion parameters,
allowing for efficient performance across various applications while minimizing
the computational resources required compared to other large models. The
LLaMA architecture is based on the transformer model, which processes input
text by predicting the next word in a sequence, enabling it to generate coherent
and contextually relevant text. The training of LLaMA involved over 1.4 trillion
tokens sourced from diverse datasets, including Common Crawl, Wikipedia, and
books, ensuring a broad understanding of language. This extensive training
allows LLaMA to excel in tasks such as text generation, question answering, and
even code generation. Notably, LLaMA 2 introduced enhancements in
architecture and training methodologies, improving its multilingual capabilities
and efficiency.
One of the distinguishing features of LLaMA is its commitment to open science;
the models are made available for research and commercial use without
significant barriers. This democratization of access enables researchers and
developers to explore innovative applications and fine-tune the models for
specific tasks. Additionally, LLaMA includes provisions for responsible AI use,
addressing potential biases and ethical concerns associated with large language
models.
The latest versions of LLaMA also incorporate vision capabilities, allowing the
models to process and understand visual data alongside textual input. This
multimodal functionality expands the potential applications of LLaMA in fields
such as healthcare, customer service, and creative content generation.
Overall, LLaMA represents a significant advancement in the field of AI language
models, providing powerful tools for developers and researchers while promoting
responsible usage and accessibility in artificial intelligence.

Key Features of LLaMA:


• Parameter Range: Models range from 7B to 65B parameters.
• Transformer Architecture: Utilizes a proven architecture for effective natural
language processing.
• Diverse Training Data: Trained on over 1.4 trillion tokens from various
sources.
• Multimodal Capabilities: Some versions can process both text and visual data.
• Open Access: Available for research and commercial use with a focus on
responsible AI practices.

Applications:
• Text Generation: Producing coherent narratives or responses based on prompts.
• Question Answering: Providing accurate answers to user queries.
• Code Generation: Assisting developers in writing code snippets or solving
programming problems.
• Visual Reasoning: Understanding and interpreting images alongside text inputs.
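
As an illustration of the text-generation use case above, the sketch below loads a LLaMA-family checkpoint through the Hugging Face transformers pipeline. It assumes transformers (with a PyTorch backend) is installed and that access to the gated meta-llama/Llama-2-7b-chat-hf checkpoint has been granted; the model id and prompt are illustrative, not part of this project's configuration.

# A minimal sketch, assuming `transformers` is installed and the gated LLaMA 2
# checkpoint below is accessible from your Hugging Face account.
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

result = generator(
    "Summarise the risks of credit card fraud in two sentences:",
    max_new_tokens=64,   # limit the length of the generated continuation
    do_sample=False,     # greedy decoding keeps the sketch deterministic
)
print(result[0]["generated_text"])
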
Explanation of TypeScript, API-Based CRUD Operations, Clerk-Based
Authentication, and Neon-Based PostgreSQL:

TypeScript
TypeScript is a superset of JavaScript that introduces static typing to the
language. This feature helps developers catch errors at compile time rather than at
runtime, making it easier to maintain large codebases. TypeScript compiles down to
plain JavaScript, allowing it to run in any environment that supports JavaScript. It
supports advanced features such as interfaces, enums, and generics, which enhance code
organization and reusability. In the context of building APIs, TypeScript provides strong
typing for request and response objects, improving the reliability of data handling and
reducing runtime errors.

API-Based CRUD Operations:


CRUD (Create, Read, Update, Delete) operations are fundamental to any application
that interacts with a database. In a typical API-based setup using TypeScript and
Node.js, you would implement these operations with the Express framework. The
process generally involves the following steps (a minimal sketch follows the list):

1. Creating an API: Set up an Express server and define routes for each CRUD
operation.
2. Connecting to a Database: Use an ORM like Mongoose (for MongoDB) or
Sequelize (for SQL databases) to manage data interactions.
3. Implementing Controllers: Write controller functions that handle incoming
requests, perform the necessary database operations, and send responses back
to the client.
4. Testing Endpoints: Use tools like Postman to test the API endpoints for
functionality and correctness.
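
Because the rest of this appendix uses Python, the sketch below shows the same four-step CRUD pattern with FastAPI and an in-memory dictionary standing in for the database layer; the Express/TypeScript version described above follows the same route structure. The Transaction model and route names are illustrative assumptions.

# A minimal CRUD sketch in Python with FastAPI; the in-memory dict replaces a real
# database/ORM layer, and the model fields are illustrative.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
transactions = {}  # placeholder data store keyed by transaction id

class Transaction(BaseModel):
    amount: float
    description: str

@app.post("/transactions/{tx_id}")      # Create
def create_tx(tx_id: int, tx: Transaction):
    transactions[tx_id] = tx.dict()
    return transactions[tx_id]

@app.get("/transactions/{tx_id}")       # Read
def read_tx(tx_id: int):
    if tx_id not in transactions:
        raise HTTPException(status_code=404, detail="Not found")
    return transactions[tx_id]

@app.put("/transactions/{tx_id}")       # Update
def update_tx(tx_id: int, tx: Transaction):
    if tx_id not in transactions:
        raise HTTPException(status_code=404, detail="Not found")
    transactions[tx_id] = tx.dict()
    return transactions[tx_id]

@app.delete("/transactions/{tx_id}")    # Delete
def delete_tx(tx_id: int):
    transactions.pop(tx_id, None)
    return {"deleted": tx_id}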

Clerk-Based Authentication:
Clerk is an authentication service designed for modern web applications. It simplifies
user management by providing features such as user sign-up, sign-in, and session
management out of the box. Integrating Clerk into your application involves:
1. Setting Up Clerk: Create an account on Clerk's platform and configure your
application settings.
2. Integrating with Frontend: Use Clerk's SDK to add authentication
components directly into your frontend application.
3. Backend Integration: Protect your API routes by verifying user sessions using
Clerk's middleware in your Node.js application.
4. Managing Users: Utilize Clerk's dashboard to manage user accounts, roles,
and permissions effectively.
This setup enhances security by offloading authentication concerns to a dedicated
service while providing a seamless user experience.

Neon-Based PostgreSQL:
Neon is a serverless Postgres database designed for modern applications that require
scalability and flexibility. It offers features such as:
1. Serverless Architecture: Automatically scales resources based on demand,
allowing applications to handle varying workloads without manual
intervention.
2. Instant Snapshots: Provides point-in-time recovery options through instant
snapshots of your database state.
3. Seamless Integration: Works well with existing PostgreSQL clients and
libraries, making migration straightforward for developers familiar with
Postgres.
4. Cost Efficiency: Users only pay for the resources they consume, making it a
cost-effective solution for startups and growing applications.
Using Neon as your database solution allows developers to focus on building features
without worrying about infrastructure management.
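
Because Neon exposes the standard PostgreSQL wire protocol, existing Postgres client libraries connect to it unchanged. The sketch below uses the Python psycopg2 driver; the connection string is a placeholder for the DSN shown in Neon's project dashboard, not a real endpoint.

# A minimal connectivity sketch, assuming psycopg2 is installed; the DSN below is a
# placeholder, Neon supplies the real one (with sslmode=require) per project.
import psycopg2

conn = psycopg2.connect(
    "postgresql://user:password@ep-example-123456.us-east-2.aws.neon.tech/neondb"
    "?sslmode=require"
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version();")  # simple round trip to verify the connection
    print(cur.fetchone()[0])
conn.close()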

Conclusion
Combining TypeScript with API-based CRUD operations, Clerk-based authentication,
and Neon-based PostgreSQL creates a robust framework for developing modern web
applications. This stack leverages the strengths of each component (TypeScript's type
safety, Clerk's user management capabilities, and Neon's scalable database architecture)
to deliver high-performance applications that are easy to maintain and secure.
APPENDIX 2

SOURCE CODE

# import all the required libraries and dependencies for dataframe

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

import warnings

warnings.filterwarnings("ignore")

from datetime import datetime, timedelta

# import all libraries and dependencies for data visualization

pd.options.display.float_format='{:.4f}'.format

plt.rcParams['figure.figsize'] = [8,8]

pd.set_option('display.max_columns', 500)

pd.set_option('display.max_colwidth', None)  # -1 is deprecated; None shows full column contents

sns.set(style='darkgrid')

import matplotlib.ticker as plticker

%matplotlib inline

# import all the required libraries and dependencies for machine learning

from sklearn.model_selection import train_test_split


from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import KFold

from sklearn.metrics import roc_auc_score

from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import RandomForestClassifier

from sklearn.ensemble import AdaBoostClassifier

from sklearn import metrics

import statsmodels.api as sm

import pickle

import gc

from sklearn import svm

from xgboost import XGBClassifier

import xgboost as xgb

from sklearn.metrics import accuracy_score

from sklearn.metrics import (roc_curve, roc_auc_score, precision_recall_curve,
                             average_precision_score)

from sklearn.metrics import classification_report

from sklearn.metrics import confusion_matrix

path = '../input/credit-card-fraud/creditcard.csv'


# Reading the Credit Card file on which analysis needs to be done


df_card = pd.read_csv(path)

df_card.head()

# Calculating the Missing Value% in the DF

df_null = df_card.isnull().mean()*100

df_null.sort_values(ascending=False).head()

Step 3: Data Visualization

plt.figure(figsize=(13,7))

plt.subplot(121)

plt.title('Fraudulent BarPlot', fontweight='bold',fontsize=14)

ax = df_card['Class'].value_counts().plot(kind='bar')

total = float(len(df_card))

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height + 3,
            '{:1.5f}'.format(height / total),
            ha="center")

classes=df_card['Class'].value_counts()

normal_share=classes[0]/df_card['Class'].count()*100

fraud_share=classes[1]/df_card['Class'].count()*100

print(normal_share)

print(fraud_share)
# Separating the fraud and non-fraud transactions for plotting
df_fraud = df_card[df_card['Class'] == 1]
df_nonfraud = df_card[df_card['Class'] == 0]

# Scatter plot between Time and Amount
fig = plt.figure(figsize=(8, 8))

plt.scatter(df_nonfraud.Amount, df_nonfraud.Time.values/(60*60), alpha=0.5, label='Non Fraud')

plt.scatter(df_fraud.Amount, df_fraud.Time.values/(60*60), alpha=1, label='Fraud')

plt.xlabel('Amount')

plt.ylabel('Time (hours)')

plt.title('Scatter plot between Amount and Time')

plt.legend(loc='upper right')

plt.show()

# Plot of high value transactions($200-$2000)

bins = np.linspace(200, 2000, 100)

plt.hist(df_nonfraud.Amount, bins, alpha=1, density=True, label='Non-Fraud')

plt.hist(df_fraud.Amount, bins, alpha=1, density=True, label='Fraud')

plt.legend(loc='upper right')

plt.title("Amount by percentage of transactions (transactions \$200-$2000)")

plt.xlabel("Transaction amount (USD)")

plt.ylabel("Percentage of transactions (%)")

plt.show()

from sklearn.model_selection import StratifiedShuffleSplit


# Splitting the data into Train and Test set

# Feature matrix and target (the 'Class' column is the fraud label)
X = df_card.drop('Class', axis=1)
y = df_card['Class']

kfold = 4

sss = StratifiedShuffleSplit(n_splits=kfold, test_size=0.3, random_state=9487)

for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

from sklearn import preprocessing

from sklearn.preprocessing import PowerTransformer

pt = preprocessing.PowerTransformer(copy=False)

PWTR_X = pt.fit_transform(X)

kfold = 4

sss = StratifiedShuffleSplit(n_splits=kfold, test_size=0.3, random_state=9487)

for train_index, test_index in sss.split(PWTR_X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = PWTR_X[train_index], PWTR_X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

from sklearn.linear_model import LogisticRegression

# Fit a baseline logistic regression model to the train data
model_lr = LogisticRegression(max_iter=1000)
model_lr.fit(X_train, y_train)

# Hyperparameter search with GridSearchCV (the grid below is illustrative; it is
# reconstructed so that the tuned-model references further down can run)
model_lrh = GridSearchCV(LogisticRegression(max_iter=1000),
                         param_grid={'C': [0.01, 0.1, 1, 10]},
                         scoring='roc_auc',
                         cv=KFold(n_splits=4, shuffle=True, random_state=42))
model_lrh.fit(X_train, y_train)

print("Logistic Regression with PCA Best AUC : ", model_lrh.best_score_)

print("Logistic Regression with PCA Best hyperparameters: ", model_lrh.best_params_)

# Refitting the tuned model and predicting on test data
model_lrh_tuned = model_lrh.best_estimator_
model_lrh_tuned.fit(X_train, y_train)

y_predicted = model_lrh_tuned.predict(X_test)


# Evaluation Metrics

print('Classification report:\n', classification_report(y_test, y_predicted))

print('Confusion matrix:\n',confusion_matrix(y_true = y_test, y_pred = y_predicted))

print("Logistic Regression Accuracy: ",accuracy_score(y_test,y_predicted))

print('ROC AUC : ', roc_auc_score(y_test, y_predicted))

# Initializing Random Forest and creating the model

from sklearn.ensemble import RandomForestClassifier

model_rfc = RandomForestClassifier(n_jobs=-1,
                                   random_state=2018,
                                   criterion='gini',
                                   n_estimators=100,
                                   verbose=False)


# Fitting the model on Train data and Predicting on Test data

model_rfc.fit(X_train,y_train)

y_predicted = model_rfc.predict(X_test)

# Evaluation Metrics

print('Classification report:\n', classification_report(y_test, y_predicted))

print('Confusion matrix:\n',confusion_matrix(y_true = y_test, y_pred = y_predicted))

print("Logistic Regression Accuracy: ",accuracy_score(y_test,y_predicted))

print('ROC AUC : ', roc_auc_score(y_test, y_predicted))


APPENDIX 3

SCREENSHOTS

Fig A3.1 : LLM setup page.

Fig A3.2 : Data summarisation studio.


Fig A3.3 : Machine Learning Environment : Anaconda.

Fig A3.4 : CLERK for user authentication.


Fig A3.5 : CANVA as presentation partner.

Fig A3.6 : NEON as PostgreSQL provider.

