Visualization with the Help of Pandas
1. Line Plot
In [ ]: import pandas as pd

# Create a DataFrame
data = {'Year': [2010, 2011, 2012, 2013, 2014],
        'Sales': [100, 120, 150, 200, 180]}
df = pd.DataFrame(data)

# Create a line plot
df.plot(x='Year', y='Sales', kind='line', title='Sales by Year')
2. Bar Plot:
In [ ]: import pandas as pd

# Create a DataFrame
data = {'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
        'Population': [8398748, 3980408, 2716000, 2326006]}
df = pd.DataFrame(data)

# Create a bar plot
df.plot(x='City', y='Population', kind='bar', title='City Populations')
3. Histogram:
In [ ]: import pandas as pd
# Create a DataFrame
data = {'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80]}
df = pd.DataFrame(data)
# Create a histogram
df.plot(y='Age', kind='hist', bins=5, title='Age Distribution')
4. Scatter Plot:
In [ ]: import pandas as pd

# Create a DataFrame
data = {'X': [1, 2, 3, 4, 5],
        'Y': [2, 4, 1, 3, 5]}
df = pd.DataFrame(data)

# Create a scatter plot
df.plot(x='X', y='Y', kind='scatter', title='Scatter Plot')
5. Pie Chart:
In [ ]: import pandas as pd

# Create a DataFrame
data = {'Category': ['A', 'B', 'C', 'D'],
        'Value': [30, 40, 20, 10]}
df = pd.DataFrame(data)

# Create a pie chart (slice labels come from the index)
df.set_index('Category').plot(y='Value', kind='pie', title='Value by Category')
The plot function provides various customization options for labels, titles, colors, and more,
and it can handle different plot types by specifying the kind parameter. While Pandas' plot
function is useful for quick and simple visualizations, more complex and customized
visualizations may require using libraries like Matplotlib or Seaborn in combination with
Pandas.
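For instance, a minimal sketch of these customization options (the DataFrame and column names here are invented for illustration; the keyword arguments are passed through to Matplotlib):
In [ ]: import pandas as pd

df_demo = pd.DataFrame({'Year': [2010, 2011, 2012], 'Sales': [100, 120, 150]})
# kind selects the plot type; title, color, and marker are common customizations
df_demo.plot(x='Year', y='Sales', kind='line', title='Sales by Year', color='green', marker='o')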
Seaborn is a statistical visualization library built on top of Matplotlib. Its key features include:
1. Statistical Plots: Seaborn simplifies the process of creating statistical plots by providing
functions for common statistical visualizations such as scatter plots, line plots, bar plots,
box plots, violin plots, and more.
2. Themes and Color Palettes: Seaborn allows you to easily customize the look of your
visualizations by providing different themes and color palettes. This makes it simple to
create professional-looking plots without having to manually tweak every detail.
3. Matrix Plots: Seaborn provides functions for visualizing matrices of data, such as
heatmaps. These can be useful for exploring relationships and patterns in large datasets.
4. Time Series Plots: Seaborn supports the visualization of time series data, allowing you
to create informative plots for temporal analysis.
While Seaborn is built on top of Matplotlib, it abstracts away much of the complexity and
provides a more concise and visually appealing syntax for creating statistical visualizations. It
is a popular choice among data scientists and analysts for quickly generating exploratory
data visualizations.
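As a quick taste of that concise syntax, a two-line sketch using Seaborn's built-in tips example dataset (loading it requires an internet connection the first time):
In [ ]: import seaborn as sns

sns.set_theme()                   # apply Seaborn's default styling
tips = sns.load_dataset('tips')   # built-in example dataset
sns.boxplot(data=tips, x='day', y='total_bill')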
Seaborn offers several other plot types for various visualization needs:
1. Categorical Plots:
violinplot : Combines aspects of a box plot and a kernel density plot, providing
insights into the distribution of a variable for different categories.
rugplot : Adds small vertical lines (rug) to a plot to indicate the distribution of
data points along the x or y-axis.
2. Regression Plots:
regplot : Creates a scatter plot with a linear regression line fitted to the data.
tsplot : Formerly used for time series data; it has been deprecated and removed
in recent Seaborn releases in favor of the more flexible lineplot .
3. Facet Grids:
FacetGrid : Although not a specific plot type, FacetGrid is a powerful tool that
allows you to create a grid of subplots based on the values of one or more
categorical variables. It can be used with various plot types to create a matrix of
visualizations.
4. Relational Plots:
relplot : Combines aspects of scatter plots and line plots to visualize the
relationship between two variables across different levels of a third variable. It's a
flexible function that can create scatter plots, line plots, or other types of relational
plots.
scatterplot : Creates a simple scatter plot to show the relationship between two
variables.
lineplot : Generates line plots to depict the trend between two variables.
jointplot : Combines scatter plots for two variables along with histograms for
each variable.
5. Matrix Plots:
heatmap : Displays a matrix where the values are represented by colors. Heatmaps
are often used to visualize correlation matrices or other two-dimensional datasets.
These functions provide a range of tools for exploring relationships in data, whether you're
interested in visualizing the distribution of individual variables, the relationship between two
variables, or patterns within a matrix of data.
These are just a selection of Seaborn's capabilities. The library is designed to make it easy to
create a wide range of statistical graphics for data exploration and presentation. The choice
of which plot to use depends on the nature of your data and the specific insights you want
to extract.
In Seaborn, the concepts of "axis-level" and "figure-level" functions refer to the level at
which the functions operate and the structure of the resulting plots.
1. Axis-Level Functions:
Operate at the level of a single subplot or axis.
Produce a single plot by default.
Examples include functions like sns.scatterplot() , sns.lineplot() ,
sns.boxplot() , etc.
Accept the ax parameter to specify the Axes where the plot will be drawn. If
not specified, the current Axes is used.
2. Figure-Level Functions:
Operate at the level of the entire figure, potentially creating multiple
subplots.
Produce a FacetGrid or a similar object that can be used to create a grid of
subplots.
Examples include functions like sns.relplot() , sns.catplot() ,
sns.pairplot() , etc.
Return a FacetGrid object, allowing for easy creation of subplots based on
additional categorical variables.
The choice between axis-level and figure-level functions depends on your specific needs.
Axis-level functions are often more straightforward for simple plots, while figure-level
functions are powerful for creating complex visualizations with multiple subplots or facets
based on categorical variables.
Keep in mind that figure-level functions return objects like FacetGrid, and you can customize
the resulting plots further using the methods and attributes of these objects. The
documentation for each function provides details on how to use and customize the output.
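To make the distinction concrete, here is a minimal sketch (the dataset and column names are just examples):
In [ ]: import matplotlib.pyplot as plt
import seaborn as sns

tips = sns.load_dataset('tips')

# axis-level: draws onto a single Axes that you control
fig, ax = plt.subplots()
sns.scatterplot(data=tips, x='total_bill', y='tip', ax=ax)

# figure-level: creates its own figure and returns a FacetGrid
g = sns.relplot(data=tips, x='total_bill', y='tip', col='sex')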
1. Relational Plots
Relational plots are used to see the statistical relationship between two or more variables
(bivariate analysis).
In [ ]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px   # used below for the gapminder and iris datasets
In [ ]: tips = sns.load_dataset('tips')
tips
In [ ]: # line plot
gap = px.data.gapminder()
temp_df = gap[gap['country'] == 'India']
temp_df
In [ ]: # using relplot
sns.relplot(data=temp_df, x='year', y='lifeExp', kind='line')
In [ ]: temp_df = gap[gap['country'].isin(['India','Brazil','Germany'])]
temp_df
In [ ]: # facet plot -> figure-level function -> works with relplot
# (it will not work with the axes-level scatterplot and lineplot)
sns.relplot(data=tips, x='total_bill', y='tip', kind='scatter', col='sex')
For univariate data, the displot will show a histogram of the data, along with a line
representing the kernel density estimate (KDE). The KDE is a smoother version of the
histogram that can be used to better visualize the underlying distribution of the data.
For bivariate data, displot can show a two-dimensional histogram or a contour plot
representing the joint distribution of the two variables. Contours are a way of visualizing the
relationship between two variables and can be used to identify clusters of data points or to
see how the variables are related.
displot is a figure-level function that wraps three axes-level functions:
histplot
kdeplot
rugplot
In [ ]: tips = sns.load_dataset('tips')
tips
In [ ]: # displot
sns.displot(data=tips, x='total_bill', kind='hist')
In [ ]: # behaves like a countplot when x is categorical
sns.displot(data=tips, x='day', kind='hist')
In [ ]: # hue parameter
sns.displot(data=tips, x='tip', kind='hist',hue='sex')
In [ ]: titanic = sns.load_dataset('titanic')
In [ ]: titanic.head()
In [ ]: # faceting using the col and row parameters (figure-level only; it does not work with the axes-level histplot)
# a representative call (the original was not preserved):
sns.displot(data=titanic, x='age', kind='hist', col='sex')
kdeplot
Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian
kernel, producing a continuous density estimate.
In [ ]: sns.kdeplot(data=tips,x='total_bill')
In [ ]: sns.displot(data=tips,x='total_bill',kind='kde')
Rugplot
Plot marginal distributions by drawing ticks along the x and y axes.
This function is intended to complement other plots by showing the location of individual
observations in an unobtrusive way.
In [ ]: sns.kdeplot(data=tips,x='total_bill')
sns.rugplot(data=tips,x='total_bill')
Bivariate histogram
A bivariate histogram bins the data within rectangles that tile the plot and then shows the
count of observations within each rectangle with the fill color.
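For example, a bivariate histogram of the tips data loaded above might look like this (a sketch, not from the original notebook):
In [ ]: sns.displot(data=tips, x='total_bill', y='tip', kind='hist')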
In [ ]: # Bivariate Kdeplot
# a bivariate KDE plot smooths the (x, y) observations with a 2D Gaussian
sns.kdeplot(data=tips, x='total_bill', y='tip')
Matrix Plot
In Seaborn, both heatmap and clustermap functions are used for visualizing matrices,
but they serve slightly different purposes.
1. Heatmap:
The heatmap function is used to plot rectangular data as a color-encoded matrix.
It is essentially a 2D representation of the data where each cell is colored based on
its value.
Heatmaps are useful for visualizing relationships and patterns in the data, making
them suitable for tasks such as correlation matrices or any other situation where
you want to visualize the magnitude of a phenomenon.
2. Clustermap:
The clustermap function, on the other hand, not only visualizes the matrix but
also performs hierarchical clustering on both rows and columns to reorder them
based on similarity.
It is useful when you want to identify patterns not only in the individual values of
the matrix but also in the relationships between rows and columns.
In [ ]: gap = px.data.gapminder()
In [ ]: # Heatmap with annotations (annot)
temp_df = gap[gap['continent'] == 'Europe'].pivot(index='country', columns='year', values='lifeExp')
plt.figure(figsize=(15, 15))
sns.heatmap(temp_df, annot=True)
In [ ]: # linewidth draws gaps between cells
temp_df = gap[gap['continent'] == 'Europe'].pivot(index='country', columns='year', values='lifeExp')
plt.figure(figsize=(15, 15))
sns.heatmap(temp_df, annot=True, linewidth=0.5)
In [ ]: # cmap changes the color palette
temp_df = gap[gap['continent'] == 'Europe'].pivot(index='country', columns='year', values='lifeExp')
plt.figure(figsize=(15, 15))
sns.heatmap(temp_df, annot=True, linewidth=0.5, cmap='summer')
In [ ]: # Clustermap
iris = px.data.iris()
iris
In [ ]: sns.clustermap(iris.iloc[:, [0, 1, 2, 3]])   # the four numeric measurement columns
In [ ]: s = 'Hello World'

def reverse_the_string(s):
    return s[::-1]

reverse_the_string(s)
Out[ ]: 'dlroW olleH'
1. List
A mutable data structure.
Can store heterogeneous data types.
Uses an index-based data structure.
Useful for storing multiple values of the same data type.
2. Dictionary
A key-value data structure.
A mutable data structure.
Useful for storing student data, customer details, etc.
Other data structures in Python include: arrays, strings, queues, stacks, trees, linked lists, and graphs.
In [ ]: def is_palindrome(word):
    return word == word[::-1]
In [ ]: l = [1, 2, 2, 4, 4, 5, 6]

def remove_duplicates(l):
    # note: set() does not guarantee the original element order
    return list(set(l))

remove_duplicates(l)
Out[ ]: [1, 2, 4, 5, 6]
In Python, a lambda function is a small, anonymous function defined using the lambda keyword. Lambda functions are also sometimes referred
to as anonymous functions or lambda expressions. The primary purpose of lambda functions is to create small, one-time-use functions without
formally defining a full function using the def keyword.
In [ ]: add = lambda x, y: x + y
print(add(2, 3))
In this example, the lambda function takes two arguments ( x and y ) and returns their sum. It is equivalent to the following regular function:
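In [ ]: def add(x, y):
    return x + y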
Lambda functions are often used for short, simple operations, especially in situations where you need to pass a function as an argument to
another function
2.How do you create an empty set in Python, and what is the key
difference between an empty set and an empty dictionary?
s = {}
print(type(s))
<class 'dict'>
But {} gives type dict , so to create an empty set we use a different syntax:
s = set()
print(type(s))
<class 'set'>
Use Cases:
Empty Set:
1. Suitable when there is a need for a collection with distinct and unordered elements.
2. Ideal for tasks involving set operations such as union, intersection, and difference.
Empty Dictionary:
1. Used when data needs to be stored and retrieved based on specific keys.
2. Essential for scenarios where a mapping between keys and values is necessary.
In summary, an empty set is a collection of unique and unordered elements created using
the set() constructor, while an empty dictionary is a data structure used for storing key-value
pairs, created using the {} notation or the dict() constructor. Understanding the intended
purpose and structure of each is crucial in choosing the appropriate data structure for a
given task.
3.If you have an empty set, how would you add elements using a built-in
function in Python?
In [ ]: s = set()
s.add(1)
s.add(3.14)
s.add('hello')
print(s)   # element order in a set is arbitrary
{'hello', 1, 3.14}
4.If you have a tuple and a set, which one is faster in terms of
performance, and why?
The performance difference between a tuple and a set in Python is influenced by their
inherent characteristics and use cases.
1. Access Time:
Tuple:
Tuples are generally faster for element access since they are indexed and
ordered.
Accessing elements in a tuple has a constant time complexity O(1).
Set:
Sets do not support indexing, so element access by position is not possible, but
membership tests ( x in s ) run in O(1) on average thanks to hashing.
2. Mutability:
Sets are mutable and designed for efficient membership tests and unique
element storage.
Suitable for scenarios where the collection of distinct and unordered elements
needs to be dynamically modified.
3. Use Cases:
Tuple:
Prefer tuples when the order of elements matters, and the data should remain
constant.
Useful in scenarios like representing coordinates (x, y, z) or holding a fixed set
of values.
Set:
Opt for sets when dealing with collections of unique elements and the order is
not significant.
Useful for tasks involving set operations like union, intersection, and difference.
4. Memory Overhead:
Tuple:
Tuples typically have lower memory overhead compared to sets, as they store
only the elements.
Set:
Sets use additional memory for hash tables and other internal structures, which
can lead to higher memory consumption.
In conclusion, the choice between a tuple and a set depends on the specific requirements of
the task. If fast element access, order preservation, and immutability are crucial, a tuple is
preferred. On the other hand, if the focus is on efficient membership tests, unique element
storage, and dynamic modification, a set is more suitable. Understanding the trade-offs and
characteristics of each data structure is key to making an informed decision based on the
specific needs of the application.
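A rough micro-benchmark illustrating the membership-test difference (the sizes and values are illustrative, and absolute timings vary by machine):
In [ ]: import timeit

data = list(range(100_000))
tup = tuple(data)
st = set(data)

# O(n) scan through the tuple vs. O(1) average hash lookup in the set
print(timeit.timeit(lambda: 99_999 in tup, number=100))
print(timeit.timeit(lambda: 99_999 in st, number=100))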
In [ ]: set1 = {1, 2, 3, 4, 5}
set2 = {3, 4, 5, 6, 7}
intersection_set = set1.intersection(set2)
print(intersection_set)
{3, 4, 5}
1.What is the difference between a tuple and a list, and when should you use each?
Answer: A tuple is an immutable sequence, and its elements cannot be changed after
creation, while a list is mutable. Tuples are created using parentheses, and lists use square
brackets. Use a tuple when you want to represent a fixed collection of items, and a list when
you need a dynamic and mutable sequence.
2.How can you reverse a tuple?
Answer: Tuples are immutable, so you cannot reverse them in place. However, you can
create a new tuple with reversed elements using slicing
In [ ]: original_tuple = (1, 2, 3, 4)
reversed_tuple = original_tuple[::-1]
reversed_tuple
Out[ ]: (4, 3, 2, 1)
3.How can you sort a list of tuples based on the second element of each
tuple?
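One common approach is to use sorted() with a key function (the data here is illustrative):
In [ ]: pairs = [(1, 'b'), (3, 'a'), (2, 'c')]
sorted_pairs = sorted(pairs, key=lambda t: t[1])
print(sorted_pairs)   # [(3, 'a'), (1, 'b'), (2, 'c')]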
4.How do you find the length of a tuple?
In [ ]: # Solution 1:
def tuple_length(tup):
    return len(tup)

# Solution 2:
tup = (1, 2, 3, 4, 5)
length = len(tup)
print(length)
In [ ]: # Solution:
def first_last_elements(tup):
    return tup[0], tup[-1]
In [ ]: l = [1, 2, 3, 4, 5]

# solution 1
def reverse_list(l):
    return l[::-1]
reverse_list(l)

# solution 2 (renamed so it does not shadow the function above)
reversed_l = l[::-1]
print(reversed_l)
[5, 4, 3, 2, 1]
Answer: append() adds a single element to the end of a list, while extend() appends each element of another iterable:
In [ ]: list1 = [1, 2, 3]
list2 = [4, 5, 6]
list1.append(4)
print(list1)
list1.extend(list2)
print(list1)
[1, 2, 3, 4]
[1, 2, 3, 4, 4, 5, 6]
Answer: The pop() method removes and returns the item at the specified index (default is
the last item).
In [ ]: my_list = [1, 2, 3, 4, 5]
popped_item = my_list.pop(2) # Removes and returns the item at index 2 (3)
my_list
Out[ ]: [1, 2, 4, 5]
In [ ]: # given lists
l1 = [1, 2, 3, 4, 5]
l2 = [3, 4, 5, 6, 7]

# the original definition was lost; a straightforward implementation:
def find_common_element(a, b):
    return [x for x in a if x in b]

find_common_element(l1, l2)
Out[ ]: [3, 4, 5]
In [ ]: # function
def merge_two_dictionaries(dict1, dict2):
    merged = dict1.copy()
    merged.update(dict2)
    return merged
In [ ]: # list
numbers = [1, 2, 3, 4, 4, 5, 5, 5, 6]

# the original definition was lost; a plausible implementation returning the mean, median, and mode:
import statistics
def central_measures(nums):
    return {'mean': statistics.mean(nums),
            'median': statistics.median(nums),
            'mode': statistics.mode(nums)}

print(central_measures(numbers))
In [ ]: def is_palindrome(s):
    return s == s[::-1]

is_palindrome('madam')   # example call; any palindrome returns True
Out[ ]: True
In [ ]: def factorial_iterative(n):
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

number = 5
print(factorial_iterative(number))
120
There are many reasons why OOP is preferred, but the most important among them
are:
OOP helps users understand the software easily, even if they don't know the
actual implementation.
With OOP, the readability, understandability, and maintainability of the code increase
multifold.
Even very large software systems can be written and managed easily using OOP.
2. What is a class?
A class can be understood as a template or a blueprint, which contains some values, known
as member data or attributes, and some set of rules, known as behaviors or methods. So
when an object is created, it automatically takes the data and functions that are defined in
the class. Therefore the class is basically a template or blueprint for objects. Also one can
create as many objects as they want based on a class.
For example, first, a car’s template is created. Then multiple units of car are created based on
that template.
In [ ]: class Dog:
    # Class attribute
    species = "Canis familiaris"

    # Constructor (added so that self.name used below is defined)
    def __init__(self, name):
        self.name = name

    # Instance method
    def bark(self):
        print(f"{self.name} says Woof!")
3. What is an object?
An object refers to the instance of the class, which contains the instance of the members and
behaviors defined in the class template. In the real world, an object is an actual entity to
which a user interacts, whereas class is just the blueprint for that object. So the objects
consume space and have some characteristic behavior.
4. What is encapsulation?
One can visualize encapsulation as the method of putting everything that is required to do
the job inside a capsule and presenting that capsule to the user. It means that through
encapsulation, all the necessary data and methods are bound together and all the unnecessary
details are hidden from the ordinary user. So encapsulation is the process of binding data
members and methods of a program together to do a specific job, without revealing
unnecessary details.
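A small sketch of encapsulation in Python (the class and attribute names are invented for illustration; the double underscore triggers name mangling, Python's closest analogue to private members):
In [ ]: class BankAccount:
    def __init__(self, balance):
        self.__balance = balance   # "private" via name mangling

    def deposit(self, amount):
        if amount > 0:
            self.__balance += amount

    def get_balance(self):
        return self.__balance

acc = BankAccount(100)
acc.deposit(50)
print(acc.get_balance())   # 150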
5.What is inheritance?
Inheritance is a key concept in object-oriented programming (OOP) that allows a new class
(subclass or derived class) to inherit attributes and behaviors from an existing class (base
class or superclass). The new class can extend or override the functionalities of the existing
class, promoting code reuse and the creation of a hierarchy of classes.
In [ ]: class Animal:
    # base class: Dog and Cat inherit __init__ and override speak (per the description below)
    def __init__(self, name):
        self.name = name

    def speak(self):
        pass

class Dog(Animal):
    def speak(self):
        return f"{self.name} says Woof!"

class Cat(Animal):
    def speak(self):
        return f"{self.name} says Meow!"
In this example, the Animal class is the superclass, and the Dog and Cat classes are
subclasses. Both Dog and Cat inherit the __init__ method (constructor) and the
speak method from the Animal class. However, each subclass provides its own
implementation of the speak method, allowing them to exhibit different behaviors.
Inheritance is a powerful mechanism in OOP that helps in organizing and structuring code in
a hierarchical manner, making it easier to manage and extend software systems.
Pandas is widely used in the field of data analysis for several reasons:
1. Ease of Use: Pandas provides a simple and intuitive syntax for data manipulation. Its
data structures are designed to be easy to use and interact with.
2. Data Cleaning and Transformation: Pandas makes it easy to clean and transform data.
It provides functions for handling missing data, reshaping data, merging and joining
datasets, and performing various data transformations.
3. Data Exploration: Pandas allows data analysts to explore and understand their datasets
quickly. Descriptive statistics, data summarization, and various methods for slicing and
dicing data are readily available.
4. Data Input/Output: Pandas supports reading and writing data in various formats,
including CSV, Excel, SQL databases, and more. This makes it easy to work with data
from different sources.
5. Integration with Other Libraries: Pandas integrates well with other popular data
science and machine learning libraries in Python, such as NumPy, Matplotlib, and Scikit-
learn. This allows for a seamless workflow when performing more complex analyses.
6. Time Series Analysis: Pandas provides excellent support for time series data, including
tools for date range generation, frequency conversion, and resampling.
7. Community and Documentation: Pandas has a large and active community, which
means there is extensive documentation and a wealth of online resources, tutorials, and
forums available for users to seek help and guidance.
8. Open Source: Being an open-source project, Pandas allows users to contribute to its
development and improvement. This collaborative nature has helped Pandas evolve and
stay relevant in the rapidly changing landscape of data analysis and data science.
In summary, Pandas is popular in data analysis because it simplifies the process of working
with structured data, provides powerful tools for data manipulation, and has become a
de facto standard in the Python data ecosystem.
A DataFrame is Pandas' primary data structure. Its key characteristics include:
1. Two-Dimensional Structure: Data is organized into rows and columns, like a table or
spreadsheet.
2. Labeled Axes: Both rows and columns of a DataFrame are labeled. This means that
each row and each column has a unique label or index associated with it, allowing for
easy access and manipulation of data.
3. Flexible Size: DataFrames can grow and shrink in size. You can add or remove rows and
columns as needed.
4. Heterogeneous Data Types: Different columns in a DataFrame can have different data
types. For example, one column might contain integers, while another column contains
strings.
5. Missing Data Handling: DataFrames can handle missing data gracefully. Pandas
provides methods for detecting, removing, or filling missing values.
In [ ]: import pandas as pd

# sample data (reconstructed to match the Name/Age/City description below)
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Chicago', 'Houston']}
df = pd.DataFrame(data)
df
In this example, each column represents a different attribute (Name, Age, City), and each
row represents a different individual. The DataFrame provides a convenient way to work with
this tabular data in a structured and labeled format.
In summary, if you want to select data based on the labels of rows and columns, you use
loc . If you prefer to select data based on the integer positions of rows and columns, you
use iloc . The choice between them depends on whether you are working with labeled or
integer-based indexing.
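A quick illustration of the difference (the DataFrame here is made up for the example):
In [ ]: df_demo = pd.DataFrame({'Age': [25, 30, 35]}, index=['x', 'y', 'z'])
print(df_demo.loc['y'])    # label-based selection
print(df_demo.iloc[1])     # position-based selection (the same row)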
Assuming you have a DataFrame named df, and you want to filter rows based on a
condition, let's say a condition on the 'Age' column:
In [ ]: import pandas as pd

# Condition for filtering (e.g., selecting rows where Age is greater than 25)
condition = df['Age'] > 25

# Apply the boolean mask to keep only the matching rows
filtered_df = df[condition]
filtered_df
Step 1: Import Pandas
In [ ]: import pandas as pd
Step 2: Create Two DataFrames Assuming you have two DataFrames, df1 and df2, with a
common key column:
In [ ]: # Example DataFrames
data1 = {'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]}
data2 = {'Key': ['A', 'B', 'D'], 'Value2': ['X', 'Y', 'Z']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Step 3: Choose the Type of Join Decide on the type of join you want to perform. The
common types are:
Inner Join (how='inner'): Keeps only the rows with keys present in both DataFrames.
Outer Join (how='outer'): Keeps all rows from both DataFrames and fills in missing
values with NaN where there is no match.
Left Join (how='left'): Keeps all rows from the left DataFrame and fills in missing values
with NaN where there is no match in the right DataFrame.
Right Join (how='right'): Keeps all rows from the right DataFrame and fills in missing
values with NaN where there is no match in the left DataFrame.
In [ ]: # Step 4: Perform the merge
merged_df = pd.merge(df1, df2, on='Key', how='inner')
print(merged_df)
  Key  Value1 Value2
0   A       1      X
1   B       2      Y
In this example, we're performing an inner join based on the 'Key' column. The resulting
DataFrame (merged_df) will have columns from both original DataFrames, and rows where
the 'Key' values match in both DataFrames.
The groupby function in Pandas is used for grouping data based on some criteria, and it is
a powerful and flexible tool for data analysis and manipulation. The primary purpose of
groupby is to split the data into groups based on some criteria, apply a function to each
group independently, and then combine the results back into a DataFrame.
Purpose of groupby :
1. Data Splitting:
groupby is used to split the data into groups based on one or more criteria, such
as a column's values or a combination of columns.
2. Operations on Groups:
After splitting the data into groups, you can perform operations on each group
independently. This might include aggregations, transformations, filtering, or other
custom operations.
3. Aggregation:
One of the most common use cases for groupby is to perform aggregation
operations on each group, such as calculating the mean, sum, count, minimum,
maximum, etc.
4. Data Transformation:
groupby allows you to apply transformations to the groups and create new
features or modify existing ones.
5. Filtering:
You can use groupby in combination with filtering operations to select specific
groups based on certain conditions.
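A minimal sketch of this split-apply-combine pattern (the column names are invented for illustration):
In [ ]: import pandas as pd

df_g = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                     'score': [10, 20, 30, 40]})
# split by 'team', apply mean() to each group, combine the results
print(df_g.groupby('team')['score'].mean())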
Series:
1. 1-Dimensional Data Structure:
A Series is essentially a one-dimensional labeled array that can hold any data type,
such as integers, floats, strings, or even Python objects.
2. Homogeneous Data:
All elements in a Series must be of the same data type. It is a homogeneous data
structure.
3. Labeled Index:
Each element in a Series has a label (index), which can be customized or can be the
default integer index. This index facilitates easy and efficient data retrieval.
4. Similar to a Column in a DataFrame:
A Series can be thought of as a single column of a DataFrame.
In [ ]: import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s
Out[ ]: a    1
b    2
c    3
d    4
dtype: int64
DataFrame:
1. 2-Dimensional Data Structure:
A DataFrame is a two-dimensional tabular data structure where you can store data
of different data types. It consists of rows and columns.
2. Heterogeneous Data:
Different columns in a DataFrame can hold different data types.
3. Labeled Index:
Similar to a Series, a DataFrame also has a labeled index for rows, and additionally,
it has labeled columns. The column names can be customized.
4. Collection of Series:
Each column of a DataFrame is a Series. You can create a DataFrame from a
dictionary, a list of dictionaries, a NumPy array, or another DataFrame.
In [ ]: import pandas as pd

# a small DataFrame with mixed column types (illustrative; the original cell was empty)
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df
In summary, while both Series and DataFrames have labeled indices and support various
operations, a Series is a one-dimensional array, and a DataFrame is a two-dimensional table.
Series are often used to represent a single column, while DataFrames are used to represent a
collection of columns with potentially different data types.
Purpose of apply :
1. Element-wise Operations:
You can use apply to apply custom functions that you define to each element or
row/column of your data.
2. Aggregation:
When used with DataFrames, apply can be used for column-wise or row-wise
aggregation.
3. Function Composition:
apply combines well with lambda expressions and named functions, letting you
compose more complex transformations concisely.
Examples:
In [ ]: #### Example 1: Element-wise Operation on Series
import pandas as pd

# Creating a Series
s = pd.Series([1, 2, 3, 4])

# Squaring each element
s.apply(lambda x: x ** 2)
Out[ ]: 0     1
1     4
2     9
3    16
dtype: int64
In [ ]: #### Example 2: Column-wise Aggregation on a DataFrame
import pandas as pd

# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Summing each column
df.apply(lambda col: col.sum())
Out[ ]: A     6
B    15
dtype: int64
In [ ]: #### Example 3: Row-wise Operation with axis=1
# Creating a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Summing each row
df.apply(lambda row: row.sum(), axis=1)
Out[ ]: 0    5
1    7
2    9
dtype: int64
In these examples, the apply function is used to perform operations on each element in a
Series or each row/column in a DataFrame. The flexibility of apply makes it a versatile tool
for various data manipulation tasks in Pandas.
1. Rows and Columns:
In a pivot table, you specify which columns of the original DataFrame should
become the new index (rows) and which columns should become the new columns.
2. Values:
You specify which columns of the original DataFrame should be used as values in
the new table. These values are then aggregated based on the specified rows and
columns.
3. Aggregation Function:
You specify how the grouped values should be combined (e.g., mean, sum, count);
the default is mean.
Example:
In [ ]: import pandas as pd

# sample data (reconstructed to match the Date/City/Temperature description below)
data = {'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
        'City': ['New York', 'Chicago', 'New York', 'Chicago'],
        'Temperature': [30, 25, 32, 27]}
df = pd.DataFrame(data)
print(df)
print('-' * 60)

# Creating a pivot table to show average temperature for each city on each date
pivot_table = pd.pivot_table(df, values='Temperature', index='Date', columns='City', aggfunc='mean')
print(pivot_table)
In this example, the pivot table calculates the average temperature for each city on each
date. The resulting table will have dates as rows, cities as columns, and the average
temperature as values.
Pivot tables are particularly useful for exploring and summarizing data with multiple
dimensions, making complex data analysis tasks more accessible and efficient. They allow
you to quickly gain insights into your data's patterns and relationships.
NumPy is a powerful numerical computing library in Python. It provides support for large,
multi-dimensional arrays and matrices, along with mathematical functions to operate on
these arrays. NumPy is a fundamental package for scientific computing in Python, and many
other libraries, such as pandas and scikit-learn, are built on top of it.
Here are some key differences between NumPy arrays and Python lists:
1. Homogeneity:
NumPy arrays are homogeneous, meaning that all elements of an array must be of
the same data type. This allows for more efficient storage and computation.
Python lists can contain elements of different data types.
2. Size:
NumPy arrays are more compact in memory compared to Python lists. This is
because NumPy arrays are implemented in C and allow for more efficient storage
of data.
Python lists have more overhead and are generally less memory-efficient.
3. Performance:
NumPy operations are implemented in C, which makes them much faster than
equivalent operations on Python lists, especially for large datasets.
NumPy provides a wide range of efficient functions for array operations, such as
element-wise operations, linear algebra, and statistical operations.
4. Functionality:
NumPy arrays support vectorized operations, which means that operations can be
performed on entire arrays without the need for explicit loops. This leads to more
concise and readable code.
In Python lists, you often need to use explicit loops for element-wise operations.
In [ ]: import numpy as np

# "+" concatenates Python lists but adds NumPy arrays element-wise
print('Sum_Python_list -', [1, 2, 3] + [4, 5, 6])
print('Numpy_list_sum -', np.array([1, 2, 3]) + np.array([4, 5, 6]))
Sum_Python_list - [1, 2, 3, 4, 5, 6]
Numpy_list_sum - [5 7 9]
Broadcasting follows these rules:
1. If the arrays do not have the same rank (number of dimensions), pad the smaller shape
with ones on its left side.
2. Compare the shapes element-wise: if the sizes of a dimension are different but one of
them is 1, the arrays are compatible for broadcasting; if the sizes in a dimension are
neither equal nor one, broadcasting is not possible and a ValueError is raised.
3. After broadcasting, each array behaves as if its shape were the element-wise
maximum of both shapes.
In [ ]: import numpy as np

# scalar broadcast (inputs reconstructed from the outputs below)
np.array([1, 2, 3]) + 5
Out[ ]: array([6, 7, 8])
In [ ]: # 1-D array broadcast across each row of a 2-D array
np.array([[1, 2, 3], [4, 5, 6]]) + np.array([4, 5, 6])
Out[ ]: array([[ 5,  7,  9],
       [ 8, 10, 12]])
In [ ]: # column vector + row vector: both stretched to a 3x3 result
np.array([[4], [5], [6]]) + np.array([1, 2, 3])
Out[ ]: array([[5, 6, 7],
       [6, 7, 8],
       [7, 8, 9]])
In these examples, the smaller array or scalar is broadcasted to match the shape of the larger
array, and the element-wise operation is then performed. This broadcasting mechanism
allows for more concise and readable code when working with arrays of different shapes,
making NumPy operations more flexible.
In [ ]: import numpy as np

array1 = np.array([1, 2, 3])
array2 = np.array([4, 5, 6])
print(array1 + array2)
[5 7 9]
In this example, array1 + array2 performs element-wise addition, resulting in a new NumPy
array [5, 7, 9]. NumPy takes care of broadcasting if the arrays have different shapes but are
still compatible according to the broadcasting rules.
You can perform element-wise addition, subtraction, multiplication, and division using the
standard arithmetic operators (+, -, *, /). NumPy will apply these operations element-wise to
the corresponding elements of the arrays.
In [ ]: import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Element-wise operations
print("Addition:", a + b)
print("Subtraction:", a - b)
print("Multiplication:", a * b)
print("Division:", a / b)
Addition: [5 7 9]
Subtraction: [-3 -3 -3]
Multiplication: [ 4 10 18]
Division: [0.25 0.4  0.5 ]
In [ ]: import numpy as np

arr = np.array([1, 2, 3, 4, 5])
mean_value = np.mean(arr)
print("Mean:", mean_value)
Mean: 3.0
In this example, np.mean(arr) will calculate the mean of the elements [1, 2, 3, 4, 5], which is
(1 + 2 + 3 + 4 + 5) / 5 = 3.0.
In [ ]: import numpy as np

original_array = np.array([1, 2, 3, 4, 5, 6])
reshaped_array = original_array.reshape(2, 3)

print("Original array:")
print(original_array)
print("\nReshaped array:")
print(reshaped_array)
Original array:
[1 2 3 4 5 6]
Reshaped array:
[[1 2 3]
[4 5 6]]
In this example, original_array is reshaped into a 2x3 matrix using reshape(2, 3).
Here are some key functions and concepts related to NumPy's random module:
In [ ]: import numpy as np
1. Random Integers:
You can generate random integers within a specified range.
In [ ]: np.random.randint(1, 10, size=5)
Out[ ]: array([1, 5, 6, 8, 7])   # example output; values vary unless a seed is set
2. Random Sampling:
Functions like choice allow you to sample from a given array.
3. Permutations:
NumPy can be used to generate random permutations.
In [ ]: np.random.permutation([1, 2, 3, 4, 5])
Out[ ]: array([3, 4, 1, 5, 2])   # example output; varies per run
4. Distributions:
NumPy's random module supports various probability distributions such as normal,
binomial, exponential, etc.
These functions provide flexibility in generating random data according to different needs,
and they are crucial tools for various statistical and numerical simulations. The ability to set a
seed is especially useful when reproducibility is important, allowing you to recreate the same
set of random numbers for debugging or sharing code with others.
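A short sketch combining a seed with a few of these functions (the seed value and distribution parameters are arbitrary):
In [ ]: import numpy as np

np.random.seed(42)   # fix the seed so the results below are reproducible
print(np.random.randint(1, 10, size=5))           # random integers in [1, 10)
print(np.random.choice([10, 20, 30], size=2))     # random sampling from an array
print(np.random.normal(loc=0, scale=1, size=3))   # draws from a normal distribution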
In [ ]: import numpy as np

# the original array was not preserved; any array with max 8 and min 1 reproduces this
arr = np.array([3, 1, 8, 5, 2])
print("Maximum Value:", np.max(arr))
print("Minimum Value:", np.min(arr))
Maximum Value: 8
Minimum Value: 1
Consider we have a DataFrame df, we can either convert the whole Pandas DataFrame df to
NumPy array or even select a subset of Pandas DataFrame to NumPy array by using the
to_numpy() method as shown in the example below:
In [ ]: import pandas as pd
import numpy as np

# Pandas DataFrame
df = pd.DataFrame(data={'A': [3, 2, 1], 'B': [6, 5, 4], 'C': [9, 8, 7]},
                  index=['i', 'j', 'k'])
print("Pandas DataFrame: ")
print(df)

print("Pandas DataFrame to NumPy array: ")
print(df.to_numpy())

print("Convert B and C columns of Pandas DataFrame to NumPy Array: ")
print(df[['B', 'C']].to_numpy())
Pandas DataFrame:
A B C
i 3 6 9
j 2 5 8
k 1 4 7
Pandas DataFrame to NumPy array:
[[3 6 9]
[2 5 8]
[1 4 7]]
Convert B and C columns of Pandas DataFrame to NumPy Array:
[[6 9]
[5 8]
[4 7]]
In [ ]: import numpy as np
a = np.array([1,2,3])
b = np.array([4,5,6])
# vstack arrays
c = np.vstack((a,b))
print("After vstack: \n",c)
# hstack arrays
d = np.hstack((a,b))
print("After hstack: \n",d)
In [ ]: import numpy as np
arr = np.arange(10, 60)
print(arr)
[10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
58 59]
Matplotlib is a comprehensive data visualization library in Python used for creating static,
animated, and interactive visualizations in a wide range of formats. It was originally
developed by John D. Hunter in 2003 and has since become a go-to library for researchers,
analysts, and scientists for creating high-quality plots and charts.
1. Wide Range of Plot Types:
Matplotlib supports various types of plots, including line plots, scatter plots, bar
plots, histograms, pie charts, 3D plots, and more.
2. Customization and Styling:
Almost every element of a figure (colors, markers, line styles, fonts, and axes) can
be customized.
Basic Usage:
To use Matplotlib for data visualization, you need to import the matplotlib.pyplot
module. Here's a simple example:
In [ ]: import matplotlib.pyplot as plt

# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Line plot
plt.plot(x, y)
plt.title('Simple Line Plot')
plt.show()
This example creates a basic line plot using Matplotlib. You can customize various aspects of
the plot, such as colors, markers, and line styles, to suit your preferences.
In [ ]: # Scatter plot
plt.scatter(x, y, label='Data Points', color='red', marker='o')
plt.legend()
plt.show()
2. Bar Plot:
In [ ]: # Bar plot
categories = ['Category A', 'Category B', 'Category C']
values = [4, 7, 2]
plt.bar(categories, values, color='green')
plt.show()
3. Histogram:
In [ ]: # Histogram
data = [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5]
plt.hist(data, bins=5, color='blue', edgecolor='black')
plt.show()
These are just a few examples, and Matplotlib provides a wide range of options and
customization possibilities. Whether you need simple visualizations or complex, publication-
quality graphics, Matplotlib is a powerful and flexible tool for data visualization in Python.
# Sample data used in the figure/subplot examples below
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.figure() :
The plt.figure() function is used to create a new figure or a new plotting window. A
figure is the top-level container that holds all elements of a plot, including one or more axes
(subplots), titles, legends, etc. When you create a figure using plt.figure() , you can
customize the overall properties of the entire plot.
plt.subplot() :
The plt.subplot() function is used to create and return a single subplot within a grid of
subplots. It is a convenience function for creating a figure and adding a subplot to it in a
single call. The function takes three arguments: the number of rows, the number of columns,
and the index of the subplot you want to create.
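A minimal reconstruction of the example the next paragraph refers to (using the sample data above):
In [ ]: fig = plt.figure()
ax = plt.subplot(111)   # 1 row, 1 column, first subplot
ax.plot(x, y)
plt.show()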
In this example, plt.subplot(111) creates a single subplot within a 1x1 grid. The 111
argument indicates that the subplot should be created in the first (and only) position in the
grid.
Key Differences:
plt.figure() creates the top-level container:
It is used to create a new figure, and you can then add one or more subplots (axes)
to this figure.
Useful when you want to customize properties of the entire plot, such as size, title,
or background color.
plt.subplot() creates a single subplot:
It adds one Axes at a given position within a grid on the current figure.
plt.legend() :
A legend identifies which data corresponds to which part of the plot. This is particularly
useful when multiple datasets or multiple plot elements are present on the same axes.
Purpose of plt.legend() :
1. Identifying Plot Elements:
The primary purpose of the legend is to label different plot elements, such as lines,
markers, or other graphical elements, with descriptive text. This helps the viewer
understand the meaning of each element in the plot.
2. Adding Context to the Plot:
The legend provides additional context to the plot by associating labels with
specific data series or plot components. This is crucial when presenting complex
visualizations.
3. Improving Readability:
In cases where multiple datasets or plot elements overlap, a legend can improve
the readability of the plot by clearly indicating which element belongs to which
dataset.
In [ ]: # Sample data
x = [1, 2, 3, 4, 5]
y1 = [2, 4, 6, 8, 10]
y2 = [1, 2, 1, 2, 1]

# Plotting with labels
plt.plot(x, y1, label='Line 1')
plt.plot(x, y2, label='Line 2')

# Adding legend
plt.legend()
plt.show()
In this example, plt.legend() is used to add a legend to the plot. The label argument
in the plot function is used to assign labels to each line, and these labels are then
displayed in the legend.
Scatter Plot:
Representation:
A scatter plot represents individual data points as markers (dots) on the plot. Each
point corresponds to a single data entry.
Use Case:
Scatter plots are useful when you want to visualize the distribution of individual
data points, examine relationships between two variables, or identify patterns such
as clusters or outliers.
In [ ]: x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 1, 5]
plt.scatter(x, y)
plt.show()
Customization:
Scatter plots offer customization options for marker styles, sizes, and colors to
enhance the visualization.
Line Plot:
Representation:
A line plot represents data points by connecting them with straight lines. It is used
to visualize trends, patterns, or the overall shape of the data.
Use Case:
Line plots are commonly used to show the relationship between two continuous
variables, particularly when there's a sequential order or a sense of continuity in the
data.
In [ ]: x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 1, 5]
plt.plot(x, y)
plt.show()
Customization:
Line plots provide customization options for line styles, colors, and markers,
allowing you to emphasize specific aspects of the data.
Key Differences:
1. Representation:
Scatter plots show individual data points, while line plots connect data points with
lines.
2. Use Case:
Scatter plots are suitable for examining distributions and relationships between two
variables. Line plots are commonly used to show trends and patterns in sequential
or continuous data.
3. Data Type:
Scatter plots work well with discrete or continuous data. Line plots are most
effective when the data is sequential or continuous.
4. Visual Emphasis:
Scatter plots emphasize the distribution of individual data points. Line plots
emphasize the overall trend or pattern in the data.
In [ ]: def celsius_to_fahrenheit(celsius):
    fahrenheit = (celsius * 9/5) + 32
    return fahrenheit

celsius_input = 37   # illustrative value; the original input was not preserved
print('celsius input:', celsius_input)
print('fahrenheit output:', celsius_to_fahrenheit(celsius_input))