CHAPTER-1
INTRODUCTION
Evoastra Ventures actively supports firms in the digital economy by providing a
comprehensive suite of data and insights solutions. In a rapidly evolving digital
landscape, businesses need to stay ahead of the curve to thrive. We empower
enterprises to cement their competitive advantage and succeed in a digital-first world
through our technology-driven solutions.
Leading businesses partner with us to leverage our expertise in gathering data from
various sources, translating it into meaningful information, insights, or content, and
using that information to enhance customer experience. Our expertise spans areas
such as data analysis, insights, and technology, and we pride ourselves on our ability
to understand and serve each client’s unique needs. Whether you are a Fortune 500
company or a high-potential startup, we can help you leverage data to tackle
disruption, understand the evolving customer landscape, and accelerate business growth.
With over two years in the business, we have helped companies of all sizes succeed.
Our team of experts in data analysis, insights, and technology is dedicated to
understanding and serving each client’s unique needs. Join leading businesses and
startups who have partnered with Evoastra Ventures to thrive in the digital economy.
Our team of experts combines industry knowledge with technical expertise to provide
innovative solutions that drive growth and efficiency. Whether you're a startup
looking to gain a competitive edge or an established enterprise seeking to optimize
your operations, we have the skills and experience to help you succeed.
DATA SCIENCE
Data Science is a multi-disciplinary subject that uses mathematics, statistics,
and computer science to study and evaluate data. The key objective of Data
Science is to extract valuable information for use in strategic decision making,
product development, trend analysis, and forecasting.
Data Science concepts and processes are mostly derived from data
engineering, statistics, programming, social engineering, data warehousing,
machine learning, and natural language processing. The key techniques in use
are data mining, big data analysis, data extraction and data retrieval.
Data science is the field of study that combines domain expertise,
programming skills, and knowledge of mathematics and statistics to extract
meaningful insights from data. Data science practitioners apply machine
learning algorithms to numbers, text, images, video, audio, and more to
produce artificial intelligence (AI) systems to perform tasks that ordinarily
require human intelligence. In turn, these systems generate insights which
analysts and business users can translate into tangible business value.
COMPANY PROFILE
Evoastra Ventures Inc.
VISION:
Evoastra Ventures continues to evolve, exploring new technologies and
practices that align with its mission of sustainability and innovation. As
consumer awareness of environmental issues grows, Evoastra is well-
positioned to lead in the sustainable fashion movement. In summary, Evoastra
Ventures stands as a model of how businesses can combine ethical practices
with stylish design. With a strong commitment to sustainability and community
engagement, Evoastra is shaping the future of the textile industry and inspiring
positive change worldwide.
MISSION:
To democratize access to quality education and empower individuals to achieve their
career goals through comprehensive, hands-on learning experiences.
OBJECTIVE
Evoastra Ventures is a company that aims to empower businesses to thrive in
the digital age through data and technology.
Their objective is to create a better, smarter future for all by unlocking the
potential of AI and data-driven decision-making.
CHAPTER-2
The data science process begins with setting a research goal and drawing up a project charter. The second phase is data retrieval. You want to have data available for analysis, so this step includes finding suitable data and getting access to it from the data owner. The result is data in its raw form, which probably needs polishing and transformation before it becomes usable.
The third step is data preparation. Now that you have the raw data, it's time to prepare it. This includes transforming the data from its raw form into data that's directly usable in your models. To achieve this, you'll detect and correct different kinds of errors in the
data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and
modeling.
The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and deviations
based on visual and descriptive techniques. The insights you gain from this phase
will enable you to start modeling.
Finally, we get to the sexiest part: model building (often referred to as “data
modeling” throughout this book). It is now that you attempt to gain the insights or
make the predictions stated in your project charter. Now is the time to bring out
the heavy guns, but remember research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If
you’ve done this phase right, you’re almost done.
The last step of the data science model is presenting your results and automating
the analysis, if needed. One goal of a project is to change a process and/or make
better
decisions. You may still need to convince the business that your findings will
indeed change the business process as expected. This is where you can shine in
your influencer role. The importance of this step is more apparent in projects on a
strategic and tactical level. Certain projects require you to perform the business
process over and over again, so automating the project will save time.
Applications
Example: face recognition.
"What happened?": answered by reporting.
INTRODUCTION TO PYTHON
Python is a high-level, general-purpose, and very popular programming language. Python (currently Python 3) is used in web development, machine learning applications, and other cutting-edge areas of the software industry. It is well suited to beginners, as well as to programmers experienced in other languages such as C++ and Java.
PYTHON OPERATORS
Arithmetic operators:
Arithmetic operators are used to perform mathematical operations like addition,
subtraction, multiplication and division.
OPERATOR   DESCRIPTION                                                  SYNTAX
+          Addition: adds two operands                                  x + y
-          Subtraction: subtracts two operands                          x - y
*          Multiplication: multiplies two operands                      x * y
/          Division (float): divides the first operand by the second    x / y
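For example (the operand values are illustrative):

x, y = 7, 2
print(x + y)   # 9    addition
print(x - y)   # 5    subtraction
print(x * y)   # 14   multiplication
print(x / y)   # 3.5  float division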
Relational Operators:
Relational operators compare values. They return either True or False according to the condition.
OPERATOR   DESCRIPTION                                                                          SYNTAX
>          Greater than: True if the left operand is greater than the right                     x > y
<          Less than: True if the left operand is less than the right                           x < y
==         Equal to: True if both operands are equal                                            x == y
!=         Not equal to: True if the operands are not equal                                     x != y
>=         Greater than or equal to: True if the left operand is greater than or equal to the right   x >= y
<=         Less than or equal to: True if the left operand is less than or equal to the right   x <= y
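A short illustration (the values are arbitrary):

x, y = 5, 10
print(x > y)    # False
print(x < y)    # True
print(x == y)   # False
print(x != y)   # True
print(x >= 5)   # True
print(x <= 4)   # False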
Logical operators:
Logical operators perform logical AND, logical OR, and logical NOT operations. In Python these operators are written in lowercase: and, or, not.
OPERATOR   DESCRIPTION                                     SYNTAX
and        Logical AND: True if both operands are true     x and y
or         Logical OR: True if either operand is true      x or y
not        Logical NOT: True if the operand is false       not x
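A brief example (the operand values are illustrative):

a, b = True, False
print(a and b)  # False: both operands must be true
print(a or b)   # True: at least one operand is true
print(not a)    # False: negation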
Assignment operators:
Assignment operators are used to assign values to the variables.
OPERATOR   DESCRIPTION                                                                                SYNTAX
=          Assign the value of the right-side expression to the left-side operand                     x = y + z
+=         Add AND: add the right operand to the left operand, then assign to the left operand        a += b (a = a + b)
-=         Subtract AND: subtract the right operand from the left operand, then assign to the left operand   a -= b (a = a - b)
*=         Multiply AND: multiply the right operand with the left operand, then assign to the left operand   a *= b (a = a * b)
/=         Divide AND: divide the left operand by the right operand, then assign to the left operand  a /= b (a = a / b)
%=         Modulus AND: take the modulus of the two operands and assign the result to the left operand   a %= b (a = a % b)
//=        Floor-divide AND: floor-divide the left operand by the right operand and assign to the left operand   a //= b (a = a // b)
**=        Exponent AND: raise the left operand to the power of the right operand and assign to the left operand   a **= b (a = a ** b)
&=         Perform bitwise AND on the operands and assign to the left operand                         a &= b (a = a & b)
|=         Perform bitwise OR on the operands and assign to the left operand                          a |= b (a = a | b)
^=         Perform bitwise XOR on the operands and assign to the left operand                         a ^= b (a = a ^ b)
>>=        Perform bitwise right shift on the operands and assign to the left operand                 a >>= b (a = a >> b)
<<=        Perform bitwise left shift on the operands and assign to the left operand                  a <<= b (a = a << b)
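A small sketch chaining several assignment operators (the starting value is arbitrary):

a = 10
a += 3    # a = a + 3  -> 13
a -= 1    # a = a - 1  -> 12
a *= 2    # a = a * 2  -> 24
a //= 5   # a = a // 5 -> 4
a **= 2   # a = a ** 2 -> 16
a %= 5    # a = a % 5  -> 1
print(a)  # 1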
Special operators:
There are some special types of operators in Python:
Identity operators:
is and is not are the identity operators; both are used to check whether two values are located in the same part of memory. Two variables that are equal are not necessarily identical.
Membership operators:
in and not in are the membership operators; used to test whether a value or variable is
in a sequence.
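A brief illustration of identity and membership operators (the lists are illustrative):

a = [1, 2, 3]
b = a              # b refers to the same object as a
c = [1, 2, 3]      # a different object with equal contents
print(a is b)      # True: same object in memory
print(a is c)      # False: equal values but not identical
print(a == c)      # True: values are equal
print(2 in a)      # True: membership test
print(5 not in a)  # True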
Operator Precedence:
Operator precedence is used in an expression containing more than one operator with different precedences to determine which operation to perform first.
Operator Associativity:
If an expression contains two or more operators with the same precedence, operator associativity is used to determine the order of evaluation. It can be either left to right or right to left.
OPERATOR   DESCRIPTION                        ASSOCIATIVITY
()         Parentheses                        left-to-right
**         Exponent                           right-to-left
*, /, %    Multiplication/division/modulus    left-to-right
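A few expressions illustrating precedence and associativity:

print(2 + 3 * 4)     # 14: * has higher precedence than +
print((2 + 3) * 4)   # 20: parentheses are evaluated first
print(2 ** 3 ** 2)   # 512: ** is right-to-left, i.e. 2 ** (3 ** 2)
print(100 / 10 * 2)  # 20.0: / and * share precedence, evaluated left-to-right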
c. Multiple Assignment:
• You can assign values to multiple Python variables in one statement
• You can assign the same value to multiple Python variables.
d. Deleting Variables:
• You can also delete Python variables using the keyword ‘del’.
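For example (the variable names are arbitrary):

x, y, z = 1, 2, 3   # assign values to multiple variables in one statement
a = b = c = 10      # assign the same value to multiple variables
print(x, y, z, a, b, c)
del x               # delete a variable with the del keyword
# print(x) would now raise a NameError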
DATA TYPES
Python Numbers:
Python has the following numeric data types.
a. int
int stands for integer. This Python data type holds signed integers. We can use the type() function to find which class a value belongs to.
b. Float
This Python data type holds floating-point real values. An int can only store a number like 3, but a float can store 3.25 if you want.
c. Long
This Python data type holds a long integer of unlimited length. This construct exists only in Python 2.x; in Python 3.x the int type itself has unlimited precision.
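A quick illustration of the numeric types and the type() function:

x = 3            # int
y = 3.25         # float
print(type(x))   # <class 'int'>
print(type(y))   # <class 'float'>
print(2 ** 100)  # ints in Python 3 have unlimited precision, so no long is needed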
Python Strings:
A string is a sequence of characters. Python does not have a char data type, unlike C++ or Java. You can delimit a string using single quotes or double quotes.
c. String Formatters:
String formatters allow us to print characters and values at once. You can use the % operator.
d. String Concatenation:
You can concatenate (join) strings using the + operator. However, you cannot concatenate values of different types.
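A minimal example (the strings are illustrative):

name = "Evoastra"
print("Welcome to %s" % name)   # string formatter with the % operator
greeting = "Hello, " + name     # concatenation with the + operator
print(greeting)
# "Age: " + 25 would raise a TypeError; convert the value first:
print("Age: " + str(25))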
Python Lists:
A list is a collection of values. Remember, it may contain values of different types. To define a list, put values separated by commas inside square brackets. You don't need to declare a type for a list either.
a. Slicing a List
You can slice a list the way you'd slice a string, with the slicing operator. Indexing of a list begins at 0, as for a string. Python does not have a built-in array type; lists are used instead.
b. Length of a List
Python provides the built-in len() function to calculate the length of a list.
e. Multidimensional Lists
A list may have more than one dimension, i.e., its elements may themselves be lists.
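A small sketch of indexing, slicing, length, and a two-dimensional list:

days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
print(days[0])       # 'Mon': indexing begins at 0
print(days[1:3])     # ['Tue', 'Wed']: slicing
print(len(days))     # 5: built-in length function
grid = [[1, 2], [3, 4]]   # a list of lists, i.e. two dimensions
print(grid[1][0])    # 3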
Python Tuples:
A tuple is like a list. You declare it using parentheses instead.
b. Tuple is Immutable
Python tuple is immutable. Once declared, you can’t change its size or elements.
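For example:

point = (3, 4)
print(point[0])    # 3: indexing works as for a list
print(len(point))  # 2
# point[0] = 5 would raise a TypeError, because tuples are immutable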
Dictionaries:
A dictionary holds key-value pairs. Declare it in curly braces, with pairs separated by commas. Separate keys and values with a colon (:). The type() function works with dictionaries too.
a. Accessing a Value
To access a value, you mention the key in square brackets.
b. Reassigning Elements
You can reassign a value to a key.
c. List of Keys
Use the keys() method to get the keys in the dictionary.
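A brief illustration (the keys and values are arbitrary):

marks = {'math': 91, 'science': 87}
print(marks['math'])       # access a value via its key
marks['math'] = 95         # reassign a value to a key
print(list(marks.keys()))  # ['math', 'science']
print(type(marks))         # <class 'dict'>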
d. Bool:
A Boolean value can be True or False.
Sets:
A set can have a list of values. Define it using curly braces. It returns only one
instance of any value present more than once. However, a set is unordered, so it
doesn’t support indexing. Also, it is mutable. You can change its elements or add
more. Use the add() and remove() methods to do so. H. Type Conversion: Since
Python is dynamically-typed, you may want to convert a value into another type.
Python supports a list of functions for the same.
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
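A small sketch of sets and these conversion functions (the values are illustrative):

s = {1, 2, 2, 3}
print(s)            # {1, 2, 3}: only one instance of each value is kept
s.add(4)            # add an element
s.remove(1)         # remove an element
print(int('42'))    # 42: str -> int
print(float(3))     # 3.0: int -> float
print(list('abc'))  # ['a', 'b', 'c']: str -> list
print(str(2.5))     # '2.5': float -> str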
CONDITIONAL STATEMENTS IN PYTHON
If statements
The if statement is one of the most commonly used conditional statements in programming languages. It decides whether certain statements need to be executed or not. It checks a given condition; if the condition is true, the block of code inside the if block is executed. The if condition evaluates a Boolean expression and executes the block of code only when the Boolean expression is TRUE.
Syntax:
if (Boolean expression):
    Block of code  # set of statements to execute if the condition is true
If-else statements
The statement itself tells that if a given condition is true, the statements inside the if block are executed; if the condition is false, the else block is executed. The else block executes only when the condition is false; this is the block where you perform some actions when the condition is not true. The if-else statement evaluates the Boolean expression and executes the block of code inside the if block when the condition is TRUE, and the block of code inside the else block when the condition is FALSE.
Syntax:
if (Boolean expression):
    Block of code  # set of statements to execute if the condition is true
else:
    Block of code  # set of statements to execute if the condition is false
elif statements
In Python, we have one more conditional statement, called elif. The elif statement is used to check multiple conditions, and it runs only if the preceding if condition is false. It is similar to an if-else statement; the only difference is that else does not check a condition, whereas elif does. Elif statements are similar to if-else statements, but they can evaluate multiple conditions.
Syntax:
if (condition):
    # set of statements to execute if the condition is true
elif (condition):
    # set of statements to execute when the if condition is false and the elif condition is true
else:
    # set of statements to execute when both the if and elif conditions are false
Nested if Syntax:
if (condition):
    # statements to execute if the condition is true
    if (condition):
        # statements to execute if the nested condition is true
    # end of nested if
# end of if
Syntax:
if (condition):
    # set of statements to execute if the condition is true
elif (condition):
    # executed when the if condition is false and the first elif condition is true
elif (condition):
    # executed when the if and first elif conditions are false and the second elif condition is true
elif (condition):
    # executed when the if, first elif, and second elif conditions are false and the third elif condition is true
else:
    # executed when all the if and elif conditions are false
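A runnable example of the if-elif-else ladder (the marks value is illustrative):

marks = 72
if marks >= 90:
    print("Grade A")
elif marks >= 75:
    print("Grade B")
elif marks >= 60:
    print("Grade C")   # this branch runs for marks = 72
else:
    print("Fail")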
while loop:
Repeats a statement or group of statements while a given condition is TRUE. It tests
the condition before executing the loop body.
Syntax:
while expression:
    statement(s)
for loop:
Executes a sequence of statements multiple times and abbreviates the code that
manages the loop variable.
Syntax:
for iterating_var in sequence:
    statement(s)
nested loops:
You can use one or more loops inside any other while or for loop.
a. break statement:
Terminates the loop statement and transfers execution to the statement immediately
following the loop.
b. continue statement
Causes the loop to skip the remainder of its body and immediately retest its condition
prior to reiterating.
c. pass statement:
The pass statement in Python is used when a statement is required syntactically but you do not want any command or code to execute.
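A short sketch of break, continue, and pass inside loops:

for i in range(1, 6):
    if i == 3:
        continue    # skip the rest of the body for i = 3
    if i == 5:
        break       # terminate the loop when i reaches 5
    print(i)        # prints 1, 2, 4

while False:
    pass            # placeholder where a statement is required syntactically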
FUNCTIONS IN PYTHON
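A minimal sketch of defining and calling a Python function (the function shown is illustrative):

def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name

print(greet("Evoastra"))   # Hello, Evoastra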
DATA STRUCTURES:
DESCRIPTIVE STATISTICS
Mode
The mode is the number that occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of new values.
Code
import pandas as pd

data = pd.read_csv("Mode.csv")        # read data from a CSV file
data.head()                           # show the first five rows
mode_data = data['Subject'].mode()    # mode of the Subject column
print(mode_data)
Mean
The mean in statistics, often referred to as the average, is calculated by summing all
values in a dataset and dividing by the number of values. It provides a central value
that represents the data's overall tendency.
code
import pandas as pd

data = pd.read_csv("mean.csv")              # read data from a CSV file
data.head()                                 # show the first five rows
mean_data = data['Overallmarks'].mean()     # mean of the Overallmarks column
print(mean_data)
Median
The median in statistics is the middle value of a dataset when the values are arranged
in ascending or descending order. If there is an even number of observations, the
median is the average of the two middle values.
code
import pandas as pd

data = pd.read_csv("data.csv")                  # read data from a CSV file
data.head()                                     # show the first five rows
median_data = data['Overallmarks'].median()     # median of the Overallmarks column
print(median_data)
Types of variables
Outliers
Any value that falls outside the range of the rest of the data is termed an outlier, e.g., 9700 instead of 97.
Reasons of Outliers
• Typos during collection, e.g., adding an extra zero by mistake.
Histograms
Histograms depict the underlying frequency of a set of discrete or continuous data that
are measured on an interval scale.
code
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline                          # show plots inline (Jupyter only)
histogram = pd.read_csv("histogram.csv")    # read data from a CSV file
plt.hist(x='Overall Marks', data=histogram)
plt.show()
Inferential Statistics
Inferential statistics allows us to make inferences about the population from sample data.
Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question,
collecting data, and then examining what the data tells us about how to proceed. The
hypothesis to be tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha.
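As a hedged illustration, a one-sample t-test with SciPy (the file name, column, and hypothesized mean are assumptions):

import pandas as pd
from scipy import stats

data = pd.read_csv("data.csv")   # hypothetical file
# H0: the population mean of Overallmarks is 60
t_stat, p_value = stats.ttest_1samp(data['Overallmarks'], 60)
if p_value < 0.05:
    print("Reject H0")           # evidence against the null hypothesis
else:
    print("Fail to reject H0")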
Correlation
Correlation determines the relationship between two variables. It is denoted by r, and its value ranges from -1 to +1; a value of 0 means no linear relation.
Syntax
import pandas as pd
import numpy as np

data = pd.read_csv("data.csv")   # read data from a CSV file
data.corr()                      # pairwise correlation matrix of the numeric columns
PREDICTIVE MODELLING
Using past data and attributes, we predict future outcomes.
TYPES
1. Supervised Learning
Supervised learning is a type of algorithm that uses a known data set (called the training data set) to make predictions. The training data set includes input data and response values.
2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labelled. Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.
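A hedged sketch of both settings with scikit-learn (the toy data is illustrative, not from the project):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])                    # known responses -> supervised
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))                   # predict for an unseen input

km = KMeans(n_clusters=2, n_init=10).fit(X)   # no labels -> unsupervised
print(km.labels_)                             # groups found from similarity alone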
Problem Definition
Identify the right problem statement, ideally formulate the problem mathematically.
Hypothesis Generation
List down all possible variables that might influence the problem objective. These variables should be free from personal bias and preference. The quality of the model is directly proportional to the quality of the hypotheses.
Data Extraction/Collection
Collect data from different sources and combine them for exploration and model building. While looking at the data, we might come across new hypotheses.
Variable Treatment
It is the process of identifying whether a variable is independent or dependent, and continuous or categorical.
Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense out of that summary to discover insights, anomalies, etc.
Bivariate Analysis
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each other.
• It helps in prediction and detecting anomalies.
Types of Missing Values
1. MCAR (Missing completely at random): the missing values have no relation either to the variable in which they occur or to the other variables in the data set.
2. MAR (Missing at random): the missing values have no relation to the variable in which they occur, but are related to other variables in the data set.
3. MNAR (Missing not at random): the missing values are related to the variable in which they occur.
Identifying
Syntax:
1. describe()
2. isnull()
The output will be True or False.
1. Imputation
Continuous: impute with the mean, the median, or a regression model.
Categorical: impute with the mode or a classification model.
2. Deletion
Row-wise or column-wise deletion. But it leads to loss of data.
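A hedged pandas sketch of identifying and treating missing values (the file and column names are assumptions):

import pandas as pd

data = pd.read_csv("data.csv")   # hypothetical file
print(data.isnull().sum())       # count missing values per column

# Imputation: mean for a continuous column, mode for a categorical column
data['Overallmarks'] = data['Overallmarks'].fillna(data['Overallmarks'].mean())
data['Subject'] = data['Subject'].fillna(data['Subject'].mode()[0])

# Deletion: drops every row that still has a missing value (loses data)
data_dropped = data.dropna()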
Outlier Treatment
Reasons of Outliers
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population
Types of Outlier
Univariate
Analyzing only one variable for outliers.
Eg: in a box plot of height and weight, weight alone will be analyzed for outliers.
Bivariate
Analyzing both variables together for outliers.
Eg: in a scatter plot of height and weight, both will be analyzed.
IDENTIFYING OUTLIER
Graphical Method
Box Plot:
Scatter Plot:
Formula Method
Using the box plot rule, a value is an outlier if:
value < Q1 - 1.5 * IQR  or  value > Q3 + 1.5 * IQR
where IQR = Q3 - Q1,
Q3 = value of the 3rd quartile,
Q1 = value of the 1st quartile.
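A hedged sketch of this rule in pandas (the column name is an assumption):

import pandas as pd

data = pd.read_csv("data.csv")             # hypothetical file
q1 = data['Overallmarks'].quantile(0.25)   # 1st quartile
q3 = data['Overallmarks'].quantile(0.75)   # 3rd quartile
iqr = q3 - q1
outliers = data[(data['Overallmarks'] < q1 - 1.5 * iqr) |
                (data['Overallmarks'] > q3 + 1.5 * iqr)]
print(outliers)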
Treating Outlier
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treating them separately
Variable Transformation
It is the process by which:
1. We replace a variable with some function of that variable, e.g., replacing a variable x with its log.
2. We change the distribution of a variable or its relationship with others.
It is used to:
• Change the scale of a variable
• Transform non-linear relationships into linear relationships
• Create symmetric distributions from skewed distributions
Common methods of variable transformation are the logarithm, square root, cube root, binning, etc.
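A short sketch of these transformations with NumPy (the values are illustrative):

import numpy as np

x = np.array([1, 10, 100, 1000])           # right-skewed values
print(np.log(x))                           # log transform compresses large values
print(np.sqrt(x))                          # square root transform
print(np.cbrt(x))                          # cube root transform
print(np.digitize(x, bins=[5, 50, 500]))   # binning into intervals: [0 1 2 3]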
MODEL BUILDING
It is a process to create a mathematical model for estimating/predicting the future based on past data.
Eg:
A retailer wants to know the default behaviour of its credit card customers. They want to predict the probability of default for each customer over the next three months.
• The probability of default would lie between 0 and 1.
• Assume every customer has a 10% default rate.
Then the probability of default for each customer in the next 3 months = 0.1.
The model moves this probability towards one of the extremes based on attributes from past information.
A customer with a volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over the last years has a low chance of default (closer to 0).
Steps in Model Building
1. Algorithm Selection
2. Training Model
3. Prediction / Scoring
Algorithm Selection
Training Model
It is a process to learn the relationship/correlation between independent and dependent variables. We use the dependent variable of the training data set to learn this relationship.
Dataset
• Train
Past data (with a known dependent variable), used to train the model.
• Test
Future data (with an unknown dependent variable), used to score.
Prediction / Scoring
It is the process to estimate/predict the dependent variable of the test data set by applying the model's rules. We apply what the model learned during training to the test data set for prediction/estimation.
Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function
to model a binary dependent variable, although many more complex extensions exist.
Its cost function is the binary cross-entropy: C = -( y log(ŷ) + (1 - y) log(1 - ŷ) ), where y is the true label and ŷ is the predicted probability.
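A hedged sketch of the sigmoid at the heart of logistic regression (the weight and bias are illustrative, not fitted):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real value into (0, 1)

w, b = 0.8, -2.0                       # illustrative weight and bias
x = np.array([1.0, 2.0, 3.0, 4.0])
p = sigmoid(w * x + b)                 # predicted probabilities
print(p)
print((p >= 0.5).astype(int))          # classify with a 0.5 threshold -> [0 0 1 1]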
PROBLEM STATEMENT:
The objective of this project is to extract and analyze details of used Audi vehicles
listed on the Cars24 website, with a specific focus on the Mumbai market. This
analysis will help provide valuable insights into the current state of the used Audi car
market in this region, including trends in pricing, fuel types, age of vehicles, and other
key attributes.
PROCESS OVERVIEW:
URL creation
Developed URLs to efficiently navigate the Cars24 website and access car details.
Data extraction
Utilized web scraping techniques to extract key vehicle details from the website.
Data cleaning
Cleaned the extracted data to ensure consistency and accuracy for subsequent analysis.
Data analysis
Analyzed the cleaned data to identify trends, patterns, and insights.
Conclusion
Summarized the future scope of the project and its implementation.
CHALLENGES FACED
Lack of Data:
There was no data to be found for used Audi cars in any of the given locations, which resulted in us using different parameters.
Inconsistent listings
The car details were not consistently loading across all listings, making it challenging to extract the required data.
Lack of Data:
Due to the lack of data for used Audi cars in the given locations, we had to adjust our
parameters and approach. We explored alternative data sources and developed
creative solutions to gather the necessary information for our analysis. This required
additional research and innovation to ensure we could still provide valuable insights
for the client.
Inconsistent listings
To address the challenge of inconsistent data loading across car listings on the Cars24 website, we leveraged the power of Selenium, a popular web automation tool. By programmatically scrolling down the page and introducing strategic wait times, we were able to ensure that the necessary car details consistently loaded, allowing us to successfully extract the required information for our analysis.
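A hedged sketch of that scrolling approach (the URL and wait time are assumptions, not the project's exact script):

import time
from selenium import webdriver

driver = webdriver.Chrome()              # assumes a local chromedriver
driver.get("https://www.cars24.com/")    # hypothetical listings URL
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)                        # strategic wait for listings to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:        # no new content appeared
        break
    last_height = new_height
html = driver.page_source                # fully loaded page, ready for parsing
driver.quit()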
URL CREATION
DATA EXTRACTION
DATA CLEANING
DATA ANALYSIS
FUTURE SCOPE
Expansion to Other Brands and Locations:
Extend the data extraction and analysis to include other car brands and multiple cities
across India. Develop a scalable framework that can handle various car brands and
locations with minimal adjustments.
CONCLUSION
The project successfully achieved its objectives by leveraging data analysis and
insights to address the defined business problem. Key findings demonstrated
significant trends and patterns that informed strategic recommendations, ultimately
enhancing decision-making and operational efficiency. The collaborative efforts of
the team ensured that stakeholder needs were met, and the implementation of
proposed solutions is expected to drive positive outcomes. Moving forward,
continuous monitoring and feedback will be essential to adapt and refine strategies as
needed, ensuring sustained success and alignment with organizational goals.
CHAPTER-6
MAJOR PROJECT
Title - Mice Protein
PROBLEM STATEMENT
The goal of this project is to analyze protein expression levels in the cerebral cortex of
mice to classify them into different categories based on their genotype, behavior, and
treatment.
PROJECT STRUCTURE
Missing Values
Unbalanced Labels
Data Distribution
Outlier Detection
Multicollinearity
Tree vs Probabilistic Models
Supervised Learning
Final Predictions
Identifying Key Proteins
Deep Learning
LIBRARIES
DESCRIPTIVE STATISTICS
Data information:
Null Values:
Unbalanced Labels:
Unbalanced distribution:
Outlier analysis:
Balanced labels:
Balanced distribution
Removed outliers:
Feature correlations:
Principal components:
MACHINE LEARNING
XGBoost
In this algorithm, decision trees are created sequentially. Weights play an important role in XGBoost: weights are assigned to all the independent variables, which are then fed into a decision tree that predicts results. The weights of variables predicted wrongly by the tree are increased, and these variables are then fed to the second decision tree. These individual classifiers/predictors are then ensembled to give a stronger and more precise model.
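A hedged usage sketch with the xgboost library (the toy data is illustrative, not the project's):

import numpy as np
from xgboost import XGBClassifier

X = np.random.rand(100, 4)                  # toy feature matrix
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # toy binary labels
model = XGBClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)                             # trees are added sequentially
print(model.predict(X[:5]))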
PROBABILISTIC MODELS
Logistic Regression
Logistic regression is used for binary classification, where we use the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.
Naïve Bayes
This model predicts the probability of an instance belonging to a class with a given set of feature values. It is a probabilistic classifier. It is called naïve because it assumes that one feature in the model is independent of the existence of another feature; in other words, each feature contributes to the predictions with no relation to the others. It uses Bayes' theorem in the algorithm for training and prediction.
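A hedged sketch of both probabilistic models with scikit-learn (toy data, not the mice-protein features):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(100, 3)        # toy feature matrix
y = (X[:, 0] > 0.5).astype(int)   # toy binary labels
print(LogisticRegression().fit(X, y).predict_proba(X[:2]))  # sigmoid-based probabilities
print(GaussianNB().fit(X, y).predict_proba(X[:2]))          # Bayes'-theorem-based probabilities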
CLUSTERING
Confusion Matrix:
Absolute Contributions
Model Architecture
BIOLOGICAL INTEGRATION
To interpret the results in the context of biological knowledge, we can use several biochemical databases, such as UniProt, to get the function and amino acid sequence of the proteins that are dominantly involved in Down syndrome and associative learning in mice. We can also search for similar proteins or drugs to try and validate the different reactions of the mice based on this.
CONCLUSION
This project has successfully met its objectives by delivering valuable insights and
solutions to the identified challenges. Through thorough research, analysis, and
collaboration, we have developed actionable recommendations that align with our
goals. The outcomes not only demonstrate the effectiveness of our approach but also
provide a foundation for future initiatives. As we move forward, continuous
evaluation and adaptation will be essential to ensure sustained impact and
improvement.