
CHAPTER-1

INTRODUCTION
Evoastra Ventures actively supports firms in the digital economy by providing a
comprehensive suite of data and insights solutions. In a rapidly evolving digital
landscape, businesses need to stay ahead of the curve to thrive. We empower
enterprises to cement their competitive advantage and succeed in a digital-first world
through our technology-driven solutions.

Leading businesses partner with us to leverage our expertise in gathering data from
various sources, translating it into meaningful information, insights, or content, and
using that information to enhance customer experience. Our expertise spans areas
such as data analysis, insights, and technology, and we pride ourselves on our ability
to understand and serve each client’s unique needs. Whether you are a Fortune 500
company or a high-potential startup, we can help you leverage data to tackle
disruption, understand the evolving customer landscape, and accelerate business growth.

With over two years in the business, we have helped companies of all sizes succeed.
Our team of experts in data analysis, insights, and technology is dedicated to
understanding and serving each client’s unique needs. Join leading businesses and
startups who have partnered with Evoastra Ventures to thrive in the digital economy.

At Evoastra Ventures, we understand that every business is unique, which is why we
take a personalized approach to each client engagement. We begin by gaining a deep
understanding of your business objectives, challenges, and industry landscape. This
allows us to tailor our solutions to meet your specific needs and deliver maximum
value.

Our team of experts combines industry knowledge with technical expertise to provide
innovative solutions that drive growth and efficiency. Whether you're a startup
looking to gain a competitive edge or an established enterprise seeking to optimize
your operations, we have the skills and experience to help you succeed.

DATA SCIENCE
 Data Science is a multi-disciplinary subject that uses mathematics, statistics,
and computer science to study and evaluate data. The key objective of Data
Science is to extract valuable information for use in strategic decision making,
product development, trend analysis, and forecasting.
 Data Science concepts and processes are mostly derived from data
engineering, statistics, programming, social engineering, data warehousing,
machine learning, and natural language processing. The key techniques in use
are data mining, big data analysis, data extraction and data retrieval.
 Data science is the field of study that combines domain expertise,
programming skills, and knowledge of mathematics and statistics to extract
meaningful insights from data. Data science practitioners apply machine
learning algorithms to numbers, text, images, video, audio, and more to
produce artificial intelligence (AI) systems to perform tasks that ordinarily
require human intelligence. In turn, these systems generate insights which
analysts and business users can translate into tangible business value.

COMPANY PROFILE
Evoastra Ventures Inc.

VISION:
Evoastra Ventures continues to evolve, exploring new technologies and
practices that align with its mission of sustainability and innovation. As
consumer awareness of environmental issues grows, Evoastra is well-
positioned to lead in the sustainable fashion movement. In summary, Evoastra
Ventures stands as a model of how businesses can combine ethical practices
with stylish design. With a strong commitment to sustainability and community
engagement, Evoastra is shaping the future of the textile industry and inspiring
positive change worldwide.

MISSION:
To democratize access to quality education and empower individuals to achieve their
career goals through comprehensive, hands-on learning experiences.

OBJECTIVE
 Evoastra Ventures is a company that aims to empower businesses to thrive in
the digital age through data and technology.
 Their objective is to create a better, smarter future for all by unlocking the
potential of AI and data-driven decision-making.
CHAPTER-2

INTRODUCTION TO DATA SCIENCE


The field of deriving insights from data using scientific techniques is called data
science. Data science is a dynamic and multidisciplinary field that merges statistics,
mathematics, and computer science to extract meaningful insights from data. By
leveraging advanced analytical techniques and algorithms, data scientists analyze vast
amounts of structured and unstructured data to uncover patterns, trends, and
correlations. This process not only involves collecting and cleaning data but also
applying machine learning models and statistical methods to make predictions and
inform decision-making. As organizations increasingly rely on data-driven strategies,
data science has become essential across various industries, including finance,
healthcare, marketing, and technology, enabling businesses to optimize operations,
enhance customer experiences, and drive innovation.

DATA SCIENCE PROCESS:


 The first step of this process is setting a research goal. The main purpose here is
making sure all the stakeholders understand the what, how, and why of the
project.

 The second phase is data retrieval. You want to have data available for analysis,
so this step includes finding suitable data and getting access to the data from the
data owner. The result is data in its raw form, which probably needs polishing
and transformation before it becomes usable.
 Now that you have the raw data, it’s time to prepare it. This includes
transforming the data from a raw form into data that’s directly usable in your
models. To achieve this, you’ll detect and correct different kinds of errors in the
data, combine data from different data sources, and transform it. If you have
successfully completed this step, you can progress to data visualization and
modeling.

 The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and deviations
based on visual and descriptive techniques. The insights you gain from this phase
will enable you to start modeling.

 Finally, we arrive at model building (often referred to as "data modeling"). It is
now that you attempt to gain the insights or make the predictions stated in your
project charter. Remember that research has taught us that often (but not always) a
combination of simple models tends to outperform one complicated model. If
you've done this phase right, you're almost done.

 The last step of the data science process is presenting your results and automating
the analysis, if needed. One goal of a project is to change a process and/or enable
better decisions. You may still need to convince the business that your findings will
indeed change the business process as expected. This is where you can shine in
your influencer role. The importance of this step is more apparent in projects on a
strategic and tactical level. Certain projects require you to perform the business
process over and over again, so automating the project will save time.

Applications

Amazon Go – No checkout lines

Computer Vision – the advancement in recognizing an image by a computer. It
involves processing large sets of image data from multiple objects of the same
category. Example: face recognition.

Spectrum of Business Analysis


The spectrum of business analysis in data science focuses on leveraging data-driven
insights to inform business decisions and strategies. Here’s an overview of its key
components:
 Business Problem Identification: Understanding the specific challenges or
opportunities that the business faces. This involves collaborating with
stakeholders to define objectives and scope.
 Data Collection and Integration: Gathering relevant data from various sources,
including internal databases, external APIs, and third-party services. This stage
often involves data wrangling to ensure data quality and compatibility.
 Exploratory Data Analysis (EDA): Conducting preliminary analyses to explore
data distributions, identify patterns, and detect anomalies. This helps in
understanding the data’s characteristics and informs further analysis.
 Statistical Analysis: Applying statistical methods to derive insights, validate
assumptions, and test hypotheses. This includes techniques such as regression
analysis, hypothesis testing, and significance testing.
 Predictive Modeling: Developing and deploying machine learning models to
predict future outcomes or behaviors based on historical data. This can involve
classification, regression, and clustering techniques.
 Data Visualization: Creating visual representations of data and analysis results
to communicate insights effectively to stakeholders. Tools like dashboards and
charts help in making complex data more accessible.
 Insight Generation: Translating analytical results into actionable business
insights. This involves interpreting data in the context of business goals and
providing recommendations.
 Performance Monitoring: Establishing metrics and KPIs to track the
effectiveness of implemented solutions and strategies. This continuous
monitoring helps in evaluating success and identifying areas for further
improvement.
 Feedback Loop and Iteration: Incorporating feedback from stakeholders and
performance metrics to refine models, processes, and strategies. This iterative
approach ensures adaptability and responsiveness to changing business needs.
 Change Management: Facilitating the adoption of data-driven solutions within
the organization, including training users and ensuring alignment with business
practices.
This spectrum highlights how data science and business analysis intersect, ultimately
driving informed decision-making and enhancing organizational performance.
The spectrum of analytics (figure): as complexity increases, so does the value added
to the organization.

• What happened? – Reporting
• Why did it happen? – Diagnostic Analysis
• What's happening now? – Dashboards
• What is likely to happen? – Predictive Analysis
• What can happen, given the data collected and used? – Big Data

CHAPTER-3
PYTHON FOR DATA SCIENCE

INTRODUCTION TO PYTHON
Python is a high-level, general-purpose and very popular programming language.
Python (latest Python 3) is used in web development, Machine Learning
applications, and other cutting-edge technology in the software industry. Python is
well suited for beginners, and also for experienced programmers coming from other
programming languages such as C++ and Java.

PYTHON OPERATORS

Arithmetic operators:
Arithmetic operators are used to perform mathematical operations like addition,
subtraction, multiplication and division.
OPERATOR  DESCRIPTION                                          SYNTAX
+         Addition: adds two operands                          X + Y
-         Subtraction: subtracts two operands                  X - Y
*         Multiplication: multiplies two operands              X * Y
/         Division (float): divides the first operand          X / Y
          by the second
//        Division (floor): divides the first operand          X // Y
          by the second
%         Modulus: returns the remainder when the first        X % Y
          operand is divided by the second
**        Power: returns first raised to power second          X ** Y
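A quick sanity check of the arithmetic operators from the table above, with sample operands:

```python
x, y = 7, 2

print(x + y)    # 9    addition
print(x - y)    # 5    subtraction
print(x * y)    # 14   multiplication
print(x / y)    # 3.5  true (float) division
print(x // y)   # 3    floor division
print(x % y)    # 1    modulus (remainder)
print(x ** y)   # 49   power
```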

Relational Operators:
Relational operators compare values. They return either True or False according to
the condition.
OPERATOR  DESCRIPTION                                          SYNTAX
>         Greater than: True if left operand is greater        X > Y
          than the right
<         Less than: True if left operand is less than         X < Y
          the right
==        Equal to: True if both operands are equal            X == Y
!=        Not equal to: True if operands are not equal         X != Y
>=        Greater than or equal to: True if left operand       X >= Y
          is greater than or equal to the right
<=        Less than or equal to: True if left operand is       X <= Y
          less than or equal to the right

Logical operators:
Logical operators perform Logical AND, Logical OR and Logical NOT operations.
In Python they are written as the lowercase keywords and, or and not.
OPERATOR  DESCRIPTION                                          SYNTAX
and       Logical AND: True if both operands are true          X and Y
or        Logical OR: True if either of the operands is true   X or Y
not       Logical NOT: True if operand is false                not X
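The relational and logical operators can be combined; a short demonstration with sample values:

```python
x, y = 10, 20

# Relational operators return booleans.
print(x < y)    # True
print(x == y)   # False
print(x != y)   # True

# Logical operators are lowercase keywords in Python.
print(x < y and y < 30)   # True: both conditions hold
print(x > y or y > 15)    # True: the second condition holds
print(not x < y)          # False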

Bit wise operators:


Bitwise operators act on bits and perform bit-by-bit operations.
OPERATOR  DESCRIPTION                                          SYNTAX
&         Bitwise AND                                          X & Y
|         Bitwise OR                                           X | Y
~         Bitwise NOT                                          ~X
^         Bitwise XOR                                          X ^ Y
>>        Bitwise right shift                                  X >> Y
<<        Bitwise left shift                                   X << Y
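The bitwise operators are easiest to follow on small numbers whose binary forms are short:

```python
x, y = 5, 3   # binary: 101 and 011

print(x & y)   # 1   AND:  001
print(x | y)   # 7   OR:   111
print(x ^ y)   # 6   XOR:  110
print(~x)      # -6  NOT: equals -(x + 1) in two's complement
print(x >> 1)  # 2   right shift: 101 -> 10
print(x << 1)  # 10  left shift:  101 -> 1010
```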

Assignment operators:
Assignment operators are used to assign values to the variables.
OPERATOR  DESCRIPTION                                          SYNTAX
=         Assign value of right-side expression to             x = y + z
          left-side operand
+=        Add right operand to left operand and                a += b   (a = a + b)
          assign to left operand
-=        Subtract right operand from left operand             a -= b   (a = a - b)
          and assign to left operand
*=        Multiply left operand by right operand               a *= b   (a = a * b)
          and assign to left operand
/=        Divide left operand by right operand and             a /= b   (a = a / b)
          assign to left operand
%=        Take modulus of left and right operands              a %= b   (a = a % b)
          and assign result to left operand
//=       Floor-divide left operand by right operand           a //= b  (a = a // b)
          and assign the value (floor) to left operand
**=       Raise left operand to the power of right             a **= b  (a = a ** b)
          operand and assign to left operand
&=        Bitwise AND of operands, assigned to left            a &= b   (a = a & b)
|=        Bitwise OR of operands, assigned to left             a |= b   (a = a | b)
^=        Bitwise XOR of operands, assigned to left            a ^= b   (a = a ^ b)
>>=       Bitwise right shift of operands, assigned to left    a >>= b  (a = a >> b)
<<=       Bitwise left shift of operands, assigned to left     a <<= b  (a = a << b)
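The augmented assignment operators above update a variable in place; tracing one variable through a chain of them makes the pattern clear:

```python
a = 10
a += 3    # a = a + 3   -> 13
a -= 1    # a = a - 1   -> 12
a *= 2    # a = a * 2   -> 24
a //= 5   # a = a // 5  -> 4
a **= 2   # a = a ** 2  -> 16
a %= 7    # a = a % 7   -> 2
print(a)  # 2
```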
Special operators:
There are some special type of operators like
Identity operators:
is and is not are the identity operators; both are used to check whether two values
are located in the same part of memory. Two variables that are equal are not
necessarily identical.

is - True if the operands are identical


is not - True if the operands are not identical

Membership operators:
in and not in are the membership operators; used to test whether a value or variable is
in a sequence.

in - True if value is found in the sequence


not in - True if value is not found in the sequence
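The distinction between equality and identity, and the membership tests, can be seen with two lists that hold the same values:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # equal contents, but a separate object in memory
c = a           # another name for the same object

print(a == b)   # True:  equal values
print(a is b)   # False: different objects
print(a is c)   # True:  same object

print(2 in a)       # True:  value found in the sequence
print(5 not in a)   # True:  value not found in the sequence
```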

Precedence and Associativity of Operators:


Operator precedence and associativity determine the priorities of operators.

Operator Precedence:
This is used in an expression with more than one operator with different precedence to
determine which operation to perform first.

Operator Associativity:
If an expression contains two or more operators with the same precedence, operator
associativity is used to determine the order of evaluation. It can be either left-to-right
or right-to-left.
OPERATOR      DESCRIPTION                                      ASSOCIATIVITY
()            Parentheses                                      left-to-right
**            Exponent                                         right-to-left
*  /  %       Multiplication / division / modulus              left-to-right
+  -          Addition / subtraction                           left-to-right
<<  >>        Bitwise shift left, bitwise shift right          left-to-right
<  <=  >  >=  Relational less than / less than or equal to,    left-to-right
              greater than / greater than or equal to
==  !=        Relational is equal to / is not equal to         left-to-right
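Two small expressions show precedence and associativity in action:

```python
# * binds tighter than +, so this is 2 + (3 * 4), not (2 + 3) * 4.
print(2 + 3 * 4)     # 14

# ** is right-associative: 2 ** (3 ** 2) = 2 ** 9.
print(2 ** 3 ** 2)   # 512

# Parentheses override precedence.
print((2 + 3) * 4)   # 20
```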

VARIABLES AND DATA TYPES

a. Python Variables Naming Rules:


There are certain rules for what you can name a variable (called an identifier).
• Python variable names can only begin with a letter (A-Z/a-z) or an underscore (_).
• The rest of the identifier may contain letters (A-Z/a-z), underscores (_), and
digits (0-9).
• Python is case-sensitive, and so are Python identifiers. Name and name are two
different identifiers.

b. Assigning and Reassigning Python Variables:


• To assign a value to a Python variable, you don't need to declare its type.
• You name it according to the rules stated in section 2a, and type the value after the
equal sign (=).
• You can't put the identifier on the right-hand side of the equal sign.
• Nor can you assign a Python variable to a keyword.

c. Multiple Assignment:
• You can assign values to multiple Python variables in one statement
• You can assign the same value to multiple Python variables.

d. Deleting Variables:
• You can also delete Python variables using the keyword ‘del’.
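The rules in sections 2b–2d can be exercised in a few lines:

```python
# Assignment: no type declaration needed.
count = 10

# Multiple assignment in one statement.
x, y, z = 1, 2, 3

# The same value assigned to several variables.
a = b = 0

# Reassignment may even change the type.
count = "ten"

# Deleting a variable with del: using it afterwards raises NameError.
del count
try:
    print(count)
except NameError:
    print("count no longer exists")
```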
DATA TYPES

 Python Numbers:
Python 3 has three numeric data types: int, float, and complex.

a. int
int stands for integer. This Python data type holds signed integers of unlimited
length. We can use the type() function to find which class a value belongs to.

b. float
This Python data type holds floating-point real values. An int can only store the
number 3, but a float can store 3.25 if you want.

c. complex
This Python data type holds complex numbers such as 3+4j. (Python 2 had a
separate long type for integers of unlimited length, but this construct does not
exist in Python 3.x, where it has been merged into int.)

 Python Strings:
A string is a sequence of characters. Python does not have a char data type, unlike
C++ or Java. You can delimit a string using single quotes or double quotes.

a. Spanning a String Across Lines:


To span a string across multiple lines, you can use triple quotes.

b. Displaying Part of a String:


You can display a character from a string using its index in the string. Remember,
indexing starts with 0

c. String Formatters:
String formatters allow us to print characters and values at once. You can use the %
operator.

d. String Concatenation:
You can concatenate(join) strings using + operator. However, you cannot concatenate
values of different types.
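The string operations from sections a–d above, collected into one runnable snippet:

```python
s = "Data Science"

# Indexing starts at 0; slicing selects a part of the string.
print(s[0])      # 'D'
print(s[0:4])    # 'Data'

# Triple quotes span a string across multiple lines.
para = """line one
line two"""

# %-formatting prints characters and values at once.
print("score: %d%%" % 95)   # score: 95%

# Concatenation with +; mixed types must be converted first.
print("Data" + " " + "Science")   # Data Science
print("mark: " + str(97))         # mark: 97
```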

 Python Lists:
A list is a collection of values. Remember, it may contain different types of values.
To define a list, you put values separated by commas in square brackets. You don't
need to declare a type for a list either.

a. Slicing a List
You can slice a list the way you'd slice a string, with the slicing operator. Indexing
of a list begins with 0, like for a string. Python does not have a built-in array type.

b. Length of a List
Python provides the built-in len() function to calculate the length of a list.

c. Reassigning Elements of a List


A list is mutable. This means that you can reassign elements later on.

d.Iterating on the List


To iterate over the list we can use the for loop. By iterating, we can access each
element one by one, which is very helpful when we need to perform an operation
on each element of the list.

e. Multidimensional Lists
A list may have more than one dimension; that is, its elements may themselves be
lists.
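The list operations from sections a–e, demonstrated on a small sample list:

```python
nums = [10, 20, 30, 40, 50]

print(nums[1:3])   # [20, 30]  slicing, indexing from 0
print(len(nums))   # 5         length via the built-in len()

nums[0] = 99       # lists are mutable: reassign an element
print(nums[0])     # 99

total = 0
for n in nums:     # iterating over each element in turn
    total += n
print(total)       # 99 + 20 + 30 + 40 + 50 = 239

grid = [[1, 2], [3, 4]]   # a two-dimensional (nested) list
print(grid[1][0])         # 3
```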

 Python Tuples:
A tuple is like a list. You declare it using parentheses instead.

a. Accessing and Slicing a Tuple


You access a tuple the same way as you’d access a list. The same goes for slicing it.

b. Tuple is Immutable
Python tuple is immutable. Once declared, you can’t change its size or elements.

 Dictionaries:
A dictionary holds key-value pairs. Declare it in curly braces, with pairs separated
by commas. Separate keys and values with a colon (:). The type() function works
with dictionaries too.

a. Accessing a Value
To access a value, you mention the key in square brackets.

b. Reassigning Elements
You can reassign a value to a key

c. List of Keys
Use the keys() function to get a list of keys in the dictionary

 Booleans:
A Boolean value can be True or False.

 Sets:
A set can hold a collection of values. Define it using curly braces. It keeps only
one instance of any value present more than once. However, a set is unordered, so
it doesn't support indexing. It is mutable: you can change its elements or add more,
using the add() and remove() methods.

 Type Conversion:
Since Python is dynamically typed, you may want to convert a value into another
type. Python supports a set of built-in functions for this:
a. int()
b. float()
c. bool()
d. set()
e. list()
f. tuple()
g. str()
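Tuples, dictionaries, sets, and the conversion functions above can be tried out together:

```python
# Tuples: like lists, but immutable.
point = (3, 4)
print(point[0])             # 3

# Dictionaries: key-value pairs; access and reassign via the key.
marks = {"math": 91, "physics": 84}
print(marks["math"])        # 91
marks["physics"] = 88       # reassign a value
print(list(marks.keys()))   # ['math', 'physics']

# Sets: unordered, duplicates collapsed to one instance.
s = {1, 2, 2, 3}
print(len(s))               # 3
s.add(4)
s.remove(1)

# Type conversion functions.
print(int("42") + 1)        # 43
print(float(3))             # 3.0
print(list("abc"))          # ['a', 'b', 'c']
```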
CONDITIONAL STATEMENTS IN PYTHON

 If statements
The if statement is one of the most commonly used conditional statements in most
programming languages. It decides whether certain statements need to be executed
or not. An if statement checks a given condition; if the condition is true, the set of
code present inside the if block is executed. The if condition evaluates a Boolean
expression and executes the block of code only when the Boolean expression is
TRUE.

Syntax:
if (Boolean expression):
    Block of code  # Set of statements to execute if the condition is true

 If-else statements
The statement itself tells that if a given condition is true, then the statements present
inside the if block are executed, and if the condition is false, the else block is
executed. The else block runs only when the condition becomes false; this is the
block where you perform some action when the condition is not true. The if-else
statement evaluates the Boolean expression and executes the block of code inside
the if block if the condition becomes TRUE, and the block of code inside the else
block if the condition becomes FALSE.

Syntax:
if (Boolean expression):
    Block of code  # Set of statements to execute if condition is true
else:
    Block of code  # Set of statements to execute if condition is false

 elif statements
In Python we have one more conditional statement, the elif statement. The elif
statement is used to check further conditions when the preceding if condition is
false. It is similar to an if-else statement; the only difference is that in else we do
not check a condition, whereas in elif we do. Elif statements let you evaluate
multiple conditions.

Syntax:
if (condition):
    # Set of statements to execute if condition is true
elif (condition):
    # Set of statements to be executed when the if condition is
    # false and the elif condition is true
else:
    # Set of statements to be executed when both if and elif
    # conditions are false

 Nested if-else statements


Nested if-else statements mean that an if statement or if-else statement is present
inside another if or if-else block. Python provides this feature as well, and it helps
us check multiple conditions in a given program: an if statement present inside
another if statement, which in turn is inside another if statement, and so on.

Nested if Syntax:
if (condition):
    # Statements to execute if condition is true
    if (condition):
        # Statements to execute if condition is true
    # end of nested if
# end of if

Nested if-else Syntax:
if (condition):
    # Statements to execute if condition is true
    if (condition):
        # Statements to execute if condition is true
    else:
        # Statements to execute if condition is false
else:
    # Statements to execute if condition is false
 elif Ladder
We have seen elif statements, but what is an elif ladder? As the name suggests, it
is a program that contains a ladder of elif statements, i.e. elif statements structured
in the form of a ladder. This statement is used to test multiple expressions.

Syntax:
if (condition):
    # Set of statements to execute if condition is true
elif (condition):
    # Set of statements to be executed when the if condition is
    # false and the first elif condition is true
elif (condition):
    # Set of statements to be executed when the if and first elif
    # conditions are false and the second elif condition is true
elif (condition):
    # Set of statements to be executed when the if, first elif and
    # second elif conditions are false and the third elif condition is true
else:
    # Set of statements to be executed when all if and elif conditions are false
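The conditional statements above can be combined into one small, concrete example (the grading thresholds here are made up for illustration):

```python
def grade(marks):
    # An elif ladder: the first true condition wins.
    if marks >= 90:
        return "A"
    elif marks >= 75:
        return "B"
    elif marks >= 50:
        return "C"
    else:
        return "F"

print(grade(95))   # A
print(grade(80))   # B
print(grade(60))   # C
print(grade(30))   # F
```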

LOOPING CONSTRUCTS IN PYTHON

 while loop:
Repeats a statement or group of statements while a given condition is TRUE. It tests
the condition before executing the loop body.

Syntax:
while expression:
    statement(s)

 for loop:
Executes a sequence of statements multiple times and abbreviates the code that
manages the loop variable.

Syntax:
for iterating_var in sequence:
    statement(s)
 nested loops:
You can use one or more loops inside another while or for loop. (Python has no
do..while loop.)

Syntax of nested for loop:
for iterating_var in sequence:
    for iterating_var in sequence:
        statement(s)
    statement(s)

Syntax of nested while loop:
while expression:
    while expression:
        statement(s)
    statement(s)

LOOP CONTROL STATEMENTS

a. break statement:
Terminates the loop statement and transfers execution to the statement immediately
following the loop.

b. continue statement
Causes the loop to skip the remainder of its body and immediately retest its condition
prior to reiterating.

c. pass statement:
The pass statement in Python is used when a statement is required syntactically but
you do not want any command or code to execute
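All three loop control statements can be traced through a single loop:

```python
found = []
for n in range(10):
    if n == 7:
        break      # terminate the loop entirely at 7
    if n % 2 == 0:
        continue   # skip the rest of the body for even numbers
    if n == 5:
        pass       # syntactically required placeholder; does nothing
    found.append(n)

print(found)   # [1, 3, 5]
```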

FUNCTIONS IN PYTHON

A. Built-in Functions or predefined functions:

These are the functions which are already defined by Python.
For example: id(), type(), print(), etc.

B. User-Defined Functions:
These are functions that are defined by users for simplicity and to avoid repetition
of code. They are created using the def keyword.
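Both kinds of function in one snippet; the greet function and its parameters are hypothetical, written only to illustrate def:

```python
# Built-in functions are ready to use.
print(type(3.5))      # <class 'float'>
print(len("hello"))   # 5

# A user-defined function created with the def keyword.
def greet(name, excited=False):
    """Return a greeting for the given name."""
    message = "Hello, " + name
    return message + "!" if excited else message

print(greet("Evoastra"))                 # Hello, Evoastra
print(greet("Evoastra", excited=True))   # Hello, Evoastra!
```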

DATA STRUCTURES:

Two types of Data structures:


LISTS: A list is an ordered data structure with elements separated by comma and
enclosed within square brackets.

DICTIONARY: A dictionary is an unordered data structure with elements separated


by comma and stored as key: value pair, enclosed with curly braces {}.
CHAPTER – 4
STATISTICS FOR DATA SCIENCE

DESCRIPTIVE STATISTICS

Mode
It is a number which occurs most frequently in the data series.
It is robust and is not generally affected much by addition of couple of new values.

Code
import pandas as pd
data = pd.read_csv("Mode.csv")          # reads data from csv file
data.head()                             # print first five lines
mode_data = data['Subject'].mode()      # take the mode of the Subject column
print(mode_data)

Mean
The mean in statistics, often referred to as the average, is calculated by summing all
values in a dataset and dividing by the number of values. It provides a central value
that represents the data's overall tendency.

code
import pandas as pd
data = pd.read_csv("mean.csv")             # reads data from csv file
data.head()                                # print first five lines
mean_data = data['Overallmarks'].mean()    # take the mean of the Overallmarks column
print(mean_data)

Median
The median in statistics is the middle value of a dataset when the values are arranged
in ascending or descending order. If there is an even number of observations, the
median is the average of the two middle values.
code
import pandas as pd
data = pd.read_csv("data.csv")                 # reads data from csv file
data.head()                                    # print first five lines
median_data = data['Overallmarks'].median()    # take the median of the Overallmarks column
print(median_data)

Types of variables

•Continuous – takes continuous numeric values. Eg- marks

•Categorical – has discrete values. Eg- gender

•Ordinal – ordered categorical variable. Eg- teacher feedback

•Nominal – unordered categorical variable. Eg- gender

Outliers
Any value which falls outside the range of the data is termed an outlier. Eg- 9700
instead of 97.

Reasons of Outliers
•Typos - During collection. Eg-adding extra zero by mistake.

•Measurement Error - Outliers in the data due to the measurement instrument being
faulty.

•Intentional Error - Errors which are induced intentionally. Eg- claiming a smaller
amount of alcohol consumed than actual.
•Legit Outlier - These are values which are not actually errors but in data due to
legitimate reasons. Eg - a CEO’s salary might actually be high as compared to other
employees.

Interquartile Range (IQR)

The difference between the third quartile (Q3) and the first quartile (Q1). It is
robust to outliers.

Histograms
Histograms depict the underlying frequency of a set of discrete or continuous data that
are measured on an interval scale.

code
import pandas as pd
histogram = pd.read_csv("histogram.csv")
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(x='Overall Marks', data=histogram)
plt.show()

Inferential Statistics
Inferential statistics allows us to make inferences about the population from sample
data.

Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question,
collecting data, and then examining what the data tells us about how to proceed.
The hypothesis to be tested is called the null hypothesis and is given the symbol
H0. We test the null hypothesis against an alternative hypothesis, which is given
the symbol Ha.

Decision made               Null Hypothesis is True    Null Hypothesis is False
Reject Null Hypothesis      Type I Error               Correct Decision
Do not reject Null          Correct Decision           Type II Error
Hypothesis
T Tests
Used when we have only a sample, not the population statistics.
We use the sample standard deviation to estimate the population standard deviation.
The t test is more prone to errors, because we only have samples.
Z Score
The distance, in number of standard deviations, of an observed value from the
mean is the standard score or z score.

+Z – value is above the mean.

-Z – value is below the mean.

The distribution, once converted to z-scores, always has the same shape as the
original distribution.
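The descriptive measures above can also be computed with Python's standard statistics module, avoiding any file dependency; the data values here are made up for illustration:

```python
import statistics

# Hypothetical marks, chosen so the arithmetic is easy to check by hand.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # 5    sum 40 over 8 values
mode = statistics.mode(data)       # 4    most frequent value
median = statistics.median(data)   # 4.5  average of the two middle values
sd = statistics.pstdev(data)       # 2.0  population standard deviation

# z-score: distance from the mean in standard deviations.
z = (9 - mean) / sd
print(mean, mode, median, sd, z)   # 5 4 4.5 2.0 2.0
```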

Chi Squared Test


To test categorical variables.

Correlation
Determines the relationship between two variables.
It is denoted by r. The value ranges from -1 to +1; a value of 0 means no linear
relation.

Syntax
import pandas as pd
import numpy as np
data=pd.read_csv("data.csv")
data.corr()
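Under the hood, pandas' corr() computes the Pearson coefficient; a minimal hand-rolled version (a sketch with made-up data, not the pandas implementation) shows the formula r = cov(x, y) / (sx * sy):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # ~ 1.0  perfect positive relation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # ~ -1.0 perfect negative relation
```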

PREDICTIVE MODELLING
Making use of past data and attributes, we predict the future. Eg- from the horror
movies a user watched in the past, predict which unwatched horror movies they
will like.

Predicting stock price movement:

1. Analyzing past stock prices.

2. Analyzing similar stocks.

3. Predicting the required future stock price.

TYPES

1. Supervised Learning
Supervised learning is a type of algorithm that uses a known data set (called the
training data set) to make predictions. The training data set includes input data and
response values.

•Regression - the target takes continuous values. Eg- marks

•Classification - the target takes discrete class labels. Eg- cancer prediction is
either 0 or 1.

2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled. Here the task of the machine is to group unsorted information
according to similarities, patterns and differences, without any prior training on the
data.

•Clustering: A clustering problem is where you want to discover the inherent


groupings in the data, such as grouping customers by purchasing behaviour.
•Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also tend to
buy Y.
Stages of Predictive Modelling
1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Development/Implementation

Problem Definition
Identify the right problem statement, ideally formulate the problem mathematically.

Hypothesis Generation
List down all possible variables, which might influence problem objective. These
variables should be free from personal bias and preferences. Quality of model is
directly proportional to quality of hypothesis.

Data Extraction/Collection
Collect data from different sources and combine them for exploration and model
building. While looking at the data we might come across new hypotheses.

Data Exploration and Transformation


Data exploration involves examining and transforming the collected data to understand
its structure and quality before modelling.

Steps of Data Exploration


• Reading the data Eg- From csv file
• Variable identification
• Univariate Analysis
• Bivariate Analysis
• Missing value treatment
• Outlier treatment
• Variable Transformation

Variable Identification
It is the process of identifying whether a variable is

1. Independent or dependent variable


2. Continuous or categorical variable

Why do we perform variable identification?


1. Techniques like supervised learning require identification of dependent variable.
2. Different data processing techniques for categorical and continuous data.

Categorical variable - Stored as object.


Continuous variable - Stored as int or float.

Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense out of that summary to discover insights, anomalies, etc.

Bivariate Analysis
• When two variables are studied together for their empirical relationship.
• When you want to see whether the two variables are associated with each other.
• It helps in prediction and detecting anomalies.

Missing Value Treatment

Reasons for missing values


1. Non-response – Eg - when you collect data on people’s income and many choose
not to answer.
2. Error in data collection. Eg - faulty data.

3. Error in data reading.

Types
1. MCAR (Missing completely at random): The missingness has no relation either to
the variable in which the missing values exist or to the other variables in the data set.
2. MAR (Missing at random): The missingness has no relation to the variable itself
but is related to other variables in the data set.
3. MNAR (Missing not at random): The missingness is related to the variable in
which the missing values exist.
Identifying missing values
Syntax:
1. describe()

2. isnull()
The output will be True or False.
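A small sketch of both calls on a hypothetical data frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

# Hypothetical data frame with missing entries
df = pd.DataFrame({"age": [25, np.nan, 32],
                   "city": ["A", "B", None]})

print(df.describe())       # summary statistics; the count row reveals missing values
print(df.isnull())         # element-wise True/False mask
print(df.isnull().sum())   # number of missing values per column
```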

Different methods to deal with missing values

1. Imputation
Continuous - impute with the help of the mean, median or a regression model.
Categorical - impute with the mode or a classification model.

2. Deletion
Row-wise or column-wise deletion, but it leads to loss of data.

Outlier Treatment
Reasons for Outliers
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population

Types of Outlier
Univariate
Analyzing only one variable for outliers.
Eg – in a box plot of weight, only weight is analyzed for outliers.
Bivariate
Analyzing two variables together for outliers.
Eg – in a scatter plot of height and weight, both are analyzed.

IDENTIFYING OUTLIERS
Graphical Method
Box plot, scatter plot.

Formula Method
Using the box plot, an observation is an outlier if it is
< Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR
where IQR = Q3 - Q1,
Q3 = value of the 3rd quartile,
Q1 = value of the 1st quartile.
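The box-plot rule above translates directly into pandas (the weights below are invented, with one obvious outlier):

```python
import pandas as pd

weights = pd.Series([55, 60, 62, 65, 68, 70, 72, 150])

q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1

# Box-plot fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = weights[(weights < lower) | (weights > upper)]

print(outliers.tolist())
```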

Treating Outlier
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treating them separately

Variable Transformation
It is the process by which:
1. We replace a variable with some function of that variable. Eg – replacing a variable
x with its log.
2. We change the distribution or relationship of a variable with others.
Used to –
 Change the scale of a variable
 Transform non-linear relationships into linear relationships
 Create a symmetric distribution from a skewed distribution.
Common methods of Variable Transformation – Logarithm, Square root, Cube root,
Binning, etc.
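A short sketch of these common transformations on a made-up right-skewed variable:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 10, 100, 1000])   # hypothetical right-skewed values

log_x = np.log(x)                   # compresses the scale, reduces right skew
sqrt_x = np.sqrt(x)
cbrt_x = np.cbrt(x)
binned = pd.cut(x, bins=2, labels=["low", "high"])   # binning into intervals

print(log_x.round(2).tolist())
```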

MODEL BUILDING
It is the process of creating a mathematical model for estimating / predicting the
future based on past data.

Eg-
A retailer wants to know the default behaviour of its credit card customers. They want
to predict the probability of default for each customer over the next three months.
• The probability of default would lie between 0 and 1.
• Assume every customer has a 10% default rate.
Probability of default for each customer in the next 3 months = 0.1
The model moves this probability towards one of the extremes based on attributes of
past information.
A customer with a volatile income is more likely to default (probability closer to 1).
A customer with a healthy credit history for the last few years has a low chance of
default (probability closer to 0).
Steps in Model Building
1. Algorithm Selection
2. Training Model
3. Prediction / Scoring

Algorithm Selection
Example - predict whether a customer will buy a product or not.


Algorithms
• Logistic Regression
• Decision Tree
• Random Forest

Training Model
It is the process of learning the relationship / correlation between the independent and
dependent variables. We use the dependent variable of the train data set to train the model.

Dataset
• Train
Past data (known dependent variable). Used to train model.
• Test
Future data (unknown dependent variable) Used to score.
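The train/test separation above is typically done with scikit-learn's `train_test_split`; a minimal sketch on dummy arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # dummy feature matrix (10 rows)
y = np.arange(10)                  # dummy dependent variable

# Hold out 30% of the rows for scoring; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))
```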

Prediction / Scoring
It is the process of estimating / predicting the dependent variable of the test data set
by applying the rules the model has learned. We apply what was learned in training to
the test data set for prediction / estimation.

Algorithm of Machine Learning


Linear Regression
Linear regression is a statistical approach for modelling the relationship between a
dependent variable and a given set of independent variables.
It is assumed that the two variables are linearly related. Hence, we try to find a linear
function that predicts the response value (y) as accurately as possible as a function of
the feature or independent variable (x).
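A minimal sketch with scikit-learn, fitting noise-free points on the line y = 2x + 1 so the learned coefficients are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4]])   # independent variable
y = np.array([3, 5, 7, 9])           # dependent variable: y = 2x + 1

model = LinearRegression().fit(X, y)

# With noise-free data the fit recovers the true slope and intercept
print(model.coef_[0], model.intercept_)
print(model.predict(np.array([[5]])))
```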

Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function
to model a binary dependent variable, although many more complex extensions exist.
Its cost function is the cross-entropy
C = -[y log(p) + (1 - y) log(1 - p)]
where p is the predicted probability and y is the true label.
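A minimal sketch with scikit-learn on a made-up, clearly separable one-feature problem:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: small x -> class 0, large x -> class 1
X = np.array([[0], [1], [2], [3], [10], [11], [12], [13]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

# predict_proba returns the sigmoid output, a probability between 0 and 1
print(clf.predict_proba(np.array([[1], [12]]))[:, 1].round(3))
print(clf.predict(np.array([[1], [12]])))
```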

K-Means Clustering (Unsupervised learning)


K-means clustering is a type of unsupervised learning, which is used when you have
unlabelled data (i.e., data without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the number of groups represented by the
variable K. The algorithm works iteratively to assign each data point to one of K
groups based on the features that are provided. Data points are clustered based on
feature similarity.
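A minimal sketch on two made-up, well-separated blobs, where K = 2 should recover the groups:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],    # blob 1
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])   # blob 2

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Each point gets the label of its nearest cluster centre
print(km.labels_)
```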
CHAPTER-5
MINI PROJECT

PROBLEM STATEMENT:

The objective of this project is to extract and analyze details of used Audi vehicles
listed on the Cars24 website, with a specific focus on the Mumbai market. This
analysis will help provide valuable insights into the current state of the used Audi car
market in this region, including trends in pricing, fuel types, age of vehicles, and other
key attributes.

PROCESS OVERVIEW:
URL creation
Developed URLs to efficiently navigate the Cars24 website and access car
details.
Data extraction
Utilized web scraping techniques to extract key vehicle details from the website.
Data cleaning
Cleaned the extracted data to ensure consistency and accuracy for subsequent
analysis
Data analysis
Analyzed the cleaned data to identify trends, patterns and insights.
Conclusion
Summarizing the future scope of the project and its implementation.
CHALLENGES FACED
Lack of Data:
There was no data to be found for used Audi cars in any of the given locations,
which required us to use different parameters.

Inconsistent listings
The car details were not consistently loading across all listings, making it
challenging to extract the required data.

Lack of Data:
Due to the lack of data for used Audi cars in the given locations, we had to adjust our
parameters and approach. We explored alternative data sources and developed
creative solutions to gather the necessary information for our analysis. This required
additional research and innovation to ensure we could still provide valuable insights
for the client.

Inconsistent listings
To address the challenge of inconsistent data loading across car listings on the Cars24
website, we leveraged the power of Selenium, a popular web automation tool. By
programmatically scrolling down the page and introducing strategic wait times, we were
able to ensure that the necessary car details consistently loaded, allowing us to
successfully extract the required information for our analysis.
URL CREATION
4.

DATA EXTRACTION
DATA CLEANING

DATA ANALYSIS
FUTURE SCOPE
Expansion to Other Brands and Locations :
Extend the data extraction and analysis to include other car brands and multiple cities
across India. Develop a scalable framework that can handle various car brands and
locations with minimal adjustments.

Enhanced Data Analysis


Incorporate advanced data analysis techniques, such as machine learning, to predict
car prices and trends. Analyze customer reviews and ratings to provide a
comprehensive view of each car's performance and user satisfaction.

Real-time Data Extraction


Implement real-time data extraction to keep the dataset current and relevant. Use APIs
to develop a user-friendly dashboard where users can easily access and interact with
the analyzed data

CONCLUSION
The project successfully achieved its objectives by leveraging data analysis and
insights to address the defined business problem. Key findings demonstrated
significant trends and patterns that informed strategic recommendations, ultimately
enhancing decision-making and operational efficiency. The collaborative efforts of
the team ensured that stakeholder needs were met, and the implementation of
proposed solutions is expected to drive positive outcomes. Moving forward,
continuous monitoring and feedback will be essential to adapt and refine strategies as
needed, ensuring sustained success and alignment with organizational goals.
CHAPTER-6
MAJOR PROJECT
Title - Mice Protein

PROBLEM STATEMENT
The goal of this project is to analyze protein expression levels in the cerebral cortex of
mice to classify them into different categories based on their genotype, behavior, and
treatment.

PROJECT STRUCTURE
 Missing Values
 Unbalanced Labels
 Data Distribution
 Outlier Detection
 Multicollinearity
 Tree vs Probabilistic Models
 Supervised Learning
 Final Predictions
 Identifying Key Proteins
 Deep Learning

LIBRARIES

DESCRIPTIVE STATISTICS

Data information:
Null Values:

Unbalanced labels:
Unbalanced distribution:

Outlier analysis:
Balanced labels:

Balanced distribution

Removed outliers:
Feature correlations:
Principal components:
MACHINE LEARNING

Tree Based Models:


Decision Tree
A decision tree is a flowchart-like structure used to make decisions or predictions. It
consists of nodes representing decisions or tests on attributes, branches representing
the outcomes of these decisions, and leaf nodes representing final outcomes or
predictions. Each internal node corresponds to a test on an attribute, each branch
corresponds to the result of the test, and each leaf node corresponds to a class label or
a continuous value.
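A minimal sketch with scikit-learn on invented [age, income] features (the data and any implied thresholds are illustrative only):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: [age, income] -> buys product (0/1)
X = [[25, 30], [30, 40], [45, 80], [50, 90]]
y = [0, 0, 1, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each prediction walks from the root test down to a leaf label
print(tree.predict([[28, 35], [48, 85]]))
```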

XGBoost
In this algorithm, decision trees are created sequentially. Weights play an
important role in XGBoost. Weights are assigned to all the independent variables,
which are then fed into the decision tree that predicts results. The weights of
variables predicted wrongly by the tree are increased, and these variables are then fed
to the second decision tree. These individual classifiers/predictors are then ensembled
to give a stronger and more precise model.

PROBABILISTIC MODELS
Logistic Regression
Logistic regression is used for binary classification, where we use the sigmoid
function, which takes the independent variables as input and produces a probability
value between 0 and 1. It is referred to as regression because it is an extension of
linear regression, but it is mainly used for classification problems.

Naïve Bayes
This model predicts the probability of an instance belonging to a class with a given set
of feature values. It is a probabilistic classifier, called naive because it assumes that
each feature in the model is independent of the existence of any other feature. In other
words, each feature contributes to the prediction with no relation to the others. It uses
Bayes' theorem for training and prediction.
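A minimal sketch with scikit-learn's GaussianNB on made-up one-feature data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical feature values for two classes
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

# GaussianNB models each feature per class as a normal distribution
# and applies Bayes' theorem to score the class probabilities
nb = GaussianNB().fit(X, y)
print(nb.predict(np.array([[1.1], [5.1]])))
```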
CLUSTERING
Confusion Matrix:

Absolute Contributions
Model Architecture

BIOLOGICAL INTEGRATION
To interpret the results in the context of biological knowledge, we can use several
biochemical databases such as UniProt to get the function and amino acid sequence of
proteins that are dominantly involved in Down syndrome and associative learning in
mice. We can also search for similar proteins or drugs to try and validate different
reactions of the mice on this basis.

CONCLUSION
This project has successfully met its objectives by delivering valuable insights and
solutions to the identified challenges. Through thorough research, analysis, and
collaboration, we have developed actionable recommendations that align with our
goals. The outcomes not only demonstrate the effectiveness of our approach but also
provide a foundation for future initiatives. As we move forward, continuous
evaluation and adaptation will be essential to ensure sustained impact and
improvement.
