Data Analysis and Visualization Using Python Libraries and Streamlit - RTF Pre Read Materials
Data Analysis
Data Analysis is a process of inspecting, cleaning, transforming, and modeling data
with the goal of discovering useful information, suggesting conclusions, and
supporting decision-making.
Steps for Data Analysis, Data Manipulation and Data Visualization:
Transform Raw Data into the Desired Format
Clean the Transformed Data (Steps 1 and 2 are together called Data Pre-processing)
Prepare a Model
Analyse Trends and Make Decisions
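The four steps above can be sketched on a tiny, made-up data set. The column names and values are illustrative assumptions, and the "model" here is only a per-city mean, standing in for whatever model a real analysis would use.

```python
import pandas as pd

# Hypothetical raw data: temperatures stored as text, with one missing reading
raw = pd.DataFrame({
    "city": ["Pune", "Pune", "Delhi", "Delhi"],
    "temp_c": ["31.5", "29.0", None, "35.5"],
})

# Step 1: transform the raw data into the desired format (text -> numbers)
data = raw.assign(temp_c=pd.to_numeric(raw["temp_c"], errors="coerce"))

# Step 2: clean the transformed data (drop rows with missing readings)
data = data.dropna(subset=["temp_c"])

# Step 3: prepare a (very simple) model: the mean temperature per city
model = data.groupby("city")["temp_c"].mean()

# Step 4: analyse the result and make a decision
hottest_city = model.idxmax()
print(model)
print("Hottest city on average:", hottest_city)
```

Each step produces a new object rather than modifying `raw`, which makes it easy to inspect the data between steps.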
NumPy
[[0. 0.]
[0. 0.]]]
In [10]:
# Flatten the 3D array to get back the linear array
arr4 = arr3d.ravel()
print(arr4)
[0. 0. 0. 0. 0. 0. 0. 0.]
Indexing of NumPy Arrays
In [11]:
# NumPy array indexing is identical to Python's indexing scheme
arr5 = np.arange(2, 20)
element = arr5[6]
print(element)
8
In [12]:
# Python's concept of list slicing is extended to NumPy.
# The slice object is constructed by providing start, stop, and step parameters to slice()
arr6 = np.arange(20)
arr_slice = slice(1, 10, 2) # Start, stop & step
element2 = arr6[6]
print(arr6[arr_slice])
[1 3 5 7 9]
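As a minimal sketch of how the slice object relates to the colon notation used in the following cells, the example below shows that both select the same elements and that a basic slice is a view onto the original array. The variable names are illustrative.

```python
import numpy as np

arr = np.arange(20)

# A slice object and the colon notation select the same elements
with_object = arr[slice(1, 10, 2)]
with_colons = arr[1:10:2]
print(with_object)                               # [1 3 5 7 9]
print(np.array_equal(with_object, with_colons))  # True

# Basic slicing returns a view, so writing through the slice changes the original
with_colons[0] = 100
print(arr[1])                                    # 100
```

If an independent copy is needed, call `.copy()` on the slice before modifying it.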
In [13]:
# Slicing items beginning with a specified index
arr7 = np.arange(20)
print(arr7[2:])
[ 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19]
In [14]:
# Slicing items until a specified index
print(arr7[:15])
[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
In [15]:
# Extracting specific rows and columns using Slicing
d = np.array([[1,2,3], [3,4,5], [4,5,6]])
print(d[0:2, 0:2]) # Slice the first two rows and the first two columns
[[1 2]
[3 4]]
NumPy Array Attributes
In [16]:
print(d.shape) # Returns a tuple consisting of array dimensions
print(d.ndim) # Attribute returns the number of array dimensions
print(d.itemsize) # Returns the length of each element of array in bytes
(3, 3)
2
8
In [17]:
# Creates an uninitialized array of the specified shape and dtype;
# the values printed below are whatever happened to be in memory
y = np.empty([3,2], dtype = int)
print(y)
[[140468392404648 140468392404648]
[ 0 0]
[ 0 0]]
In [18]:
# Returns a new array of specified size, filled with zeros
z = np.zeros(5) # np.zeros(shape, dtype = float)
print(z)
[0. 0. 0. 0. 0.]
Reading & Writing from Files
In [19]:
# NumPy provides the option of importing data from files directly into an ndarray using the loadtxt function
# The savetxt function can be used to write data from an array into a text file
#import os
#print(os.listdir('../input'))
arr_txt = np.loadtxt('../input/data_file.txt')
np.savetxt('newfilex.txt', arr_txt)
In [20]:
# NumPy arrays can be dumped into CSV files using the savetxt function and the comma delimiter
# The genfromtxt function can be used to read data from a CSV file into a NumPy array
arr_csv = np.genfromtxt('../input/Hurricanes.csv', delimiter = ',')
np.savetxt('newfilex.csv', arr_csv, delimiter = ',')
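The cells above depend on input files that may not be available locally. The following self-contained sketch writes a small array to a temporary CSV file and reads it back; the file name and the array values are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.5, 2.0, 3.25],
                 [4.0, 5.5, 6.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "example.csv")  # hypothetical file name

    # Write the array as comma-separated text, then read it back
    np.savetxt(csv_path, data, delimiter=",")
    loaded = np.loadtxt(csv_path, delimiter=",")

    # genfromtxt reads the same file and can also cope with missing values
    loaded_gen = np.genfromtxt(csv_path, delimiter=",")

print(np.array_equal(data, loaded))
print(np.array_equal(data, loaded_gen))
```

Note that the same delimiter has to be passed when writing and when reading; otherwise loadtxt cannot split the rows into columns.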
Pandas
Pandas is an open-source Python library providing efficient, easy-to-use data
structures and data analysis tools. The name Pandas is derived from "Panel Data",
an econometrics term for multidimensional data sets. Pandas is well suited for many
different kinds of data:
Tabular data with heterogeneously-typed columns.
Ordered and unordered time series data.
Arbitrary matrix data with row and column labels.
Any other form of observational/statistical data sets. The data need not be
labeled at all to be placed into a pandas data structure.
Pandas has historically provided three data structures, all of which are built on
top of the NumPy array and all of which are value-mutable:
Series (1D) - labeled, homogeneous array of immutable size
DataFrames (2D) - labeled, heterogeneously typed, size-mutable tabular data
structures
Panels (3D) - labeled, size-mutable array (deprecated, and removed in pandas 0.25)
In [21]:
import pandas as pd
Series
A Series is a one-dimensional array structure that stores homogeneous data, i.e.,
data of a single type.
The elements of a Series are value-mutable, while its size is immutable.
A Series can be created using the following constructor -
pandas.Series(data, index, dtype, copy)
data - can take multiple forms such as ndarray, list, constant, Series, dict etc.
index - labels must be hashable and have the same length as data (they need not be
unique). Defaults to np.arange(n) if no index is passed.
dtype - data type of the Series; if none is given, it is inferred automatically.
copy - whether to copy the input data; False by default in the pandas version used here.
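A minimal sketch of the Series constructor arguments described above (data, index, dtype and copy). The labels and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

values = np.array([10, 20, 30])

# index supplies the labels, dtype forces the element type,
# and copy=True makes the Series independent of the source array
s = pd.Series(values, index=["a", "b", "c"], dtype="float64", copy=True)

s["a"] = 99.0        # individual values can be changed in place
print(s)
print(values)        # the source array is unchanged because the data was copied

# Without an explicit index, the labels default to 0, 1, 2, ...
default = pd.Series(values)
print(list(default.index))   # [0, 1, 2]
```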
Creating a Series
In [22]:
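# Creating an empty Series
# The dtype is passed explicitly because the default dtype of an empty Series
# differs between pandas versions
series = pd.Series(dtype = 'float64') # The Series() constructor creates a new Series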
print(series)
Series([], dtype: float64)
In [23]:
# Creating a series from an ndarray
# Note that indexes are assigned automatically if not specified
arr = np.array([10,20,30,40,50])
series1 = pd.Series(arr)
print(series1)
0 10
1 20
2 30
3 40
4 50
dtype: int64
In [24]:
# Creating a series from a Python dict
# Note that the keys of the dictionary are used to assign indexes during conversion
data = {'a':10, 'b':20, 'c':30}
series2 = pd.Series(data)
print(series2)
a 10
b 20
c 30
dtype: int64
In [25]:
# Retrieving a part of the series using slicing
print(series1[1:4])
1 20
2 30
3 40
dtype: int64
DataFrames
A DataFrame is a 2D data structure in which data is aligned in a tabular fashion,
consisting of rows and columns.
A DataFrame can be created using the following constructor -
pandas.DataFrame(data, index, columns, dtype, copy)
data - can take multiple forms such as ndarray, list, constant, Series, dict etc.
index - row labels of the DataFrame; defaults to np.arange(n) if no index is passed
columns - column labels of the DataFrame; defaults to np.arange(n) if no labels are passed
dtype - data type to force on the columns; inferred if not given
copy - whether to copy the input data; False by default in the pandas version used here
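The constructor arguments can be tried out with a small sketch like the one below. It also uses the `columns` argument, which the DataFrame constructor accepts for naming the columns; the row and column labels are illustrative assumptions.

```python
import numpy as np
import pandas as pd

matrix = np.array([[1, 2], [3, 4], [5, 6]])

# index labels the rows, columns labels the columns, dtype sets the element type
df = pd.DataFrame(matrix, index=["r1", "r2", "r3"], columns=["x", "y"], dtype="float64")
print(df)
print(df.shape)              # (3, 2)

# Without index/columns, both default to 0, 1, 2, ...
default_df = pd.DataFrame(matrix)
print(list(default_df.index), list(default_df.columns))
```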
Creating a DataFrame
In [26]:
# Converting a list into a DataFrame
list1 = [10, 20, 30, 40]
table = pd.DataFrame(list1)
print(table)
0
0 10
1 20
2 30
3 40
In [27]:
# Creating a DataFrame from a list of dictionaries
data = [{'a':1, 'b':2}, {'a':2, 'b':4, 'c':8}]
table1 = pd.DataFrame(data)
print(table1)
# NaN (not a number) is stored in areas where no data is provided
a b c
0 1 2 NaN
1 2 4 8.0
In [28]:
# Creating a DataFrame from a list of dictionaries and accompanying row indices
table2 = pd.DataFrame(data, index = ['first', 'second'])
# Dict keys become column labels
print(table2)
a b c
first 1 2 NaN
second 2 4 8.0
In [29]:
# Converting a dictionary of series into a DataFrame
data1 = {'one':pd.Series([1,2,3], index = ['a', 'b', 'c']),
'two':pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])}
table3 = pd.DataFrame(data1)
print(table3)
# the resultant index is the union of all the series indexes passed
one two
a 1.0 1
b 2.0 2
c 3.0 3
d NaN 4
DataFrame - Addition & Deletion of Columns
In [30]:
# A new column can be added to a DataFrame when the data is passed as a Series
table3['three'] = pd.Series([10,20,30], index = ['a', 'b', 'c'])
print(table3)
one two three
a 1.0 1 10.0
b 2.0 2 20.0
c 3.0 3 30.0
d NaN 4 NaN
In [31]:
# DataFrame columns can be deleted using the del statement
del table3['one']
print(table3)
two three
a 1 10.0
b 2 20.0
c 3 30.0
d 4 NaN
In [32]:
# DataFrame columns can also be removed using the pop() method, which returns the removed column
table3.pop('two')
print(table3)
three
a 10.0
b 20.0
c 30.0
d NaN
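The two deletion approaches above differ in what they return. The sketch below contrasts them, and adds `drop(columns=...)` as a third option that is not used in the original cells; the column names and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2], "two": [3, 4], "three": [5, 6], "four": [7, 8]})

del df["one"]                          # removes the column in place, returns nothing
removed = df.pop("two")                # removes the column in place and returns it as a Series
dropped = df.drop(columns=["three"])   # returns a new DataFrame; df itself is unchanged

print(list(df.columns))        # ['three', 'four']
print(removed.tolist())        # [3, 4]
print(list(dropped.columns))   # ['four']
```

`pop()` is convenient when the removed column is still needed, for example to use it as a target variable.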
DataFrame - Addition & Deletion of Rows
In [33]:
# DataFrame rows can be selected by passing the row label to the loc indexer
print(table3.loc['c'])
three 30.0
Name: c, dtype: float64
In [34]:
# Row selection can also be done by integer position using the iloc indexer
print(table3.iloc[2])
three 30.0
Name: c, dtype: float64
In [35]:
# pd.concat() can be used to add more rows to the DataFrame
# (DataFrame.append() was removed in pandas 2.0)
data2 = {'one':pd.Series([1,2,3], index = ['a', 'b', 'c']),
'two':pd.Series([1,2,3,4], index = ['a', 'b', 'c', 'd'])}
table5 = pd.DataFrame(data2)
table5['three'] = pd.Series([10,20,30], index = ['a', 'b', 'c'])
row = pd.DataFrame([[11,13],[17,19]], columns = ['two', 'three'])
table6 = pd.concat([table5, row], sort = True) # sort = True orders the columns alphabetically
print(table6)
one three two
a 1.0 10.0 1
b 2.0 20.0 2
c 3.0 30.0 3
d NaN NaN 4
0 NaN 13.0 11
1 NaN 19.0 17
In [36]:
# The drop() function can be used to drop rows whose labels are provided
table7 = table6.drop('a')
print(table7)
one three two
b 2.0 20.0 2
c 3.0 30.0 3
d NaN NaN 4
0 NaN 13.0 11
1 NaN 19.0 17
Importing & Exporting Data
In [37]:
# Data can be loaded into DataFrames from input data stored in the CSV format using the read_csv() function
table_csv = pd.read_csv('../input/Cars2015.csv')
In [38]:
# Data present in DataFrames can be written to a CSV file using the to_csv() function
# If the file doesn't exist, it is created; an existing file of the same name is overwritten
table_csv.to_csv('newcars2015.csv')
In [39]:
# Data can be loaded into DataFrames from input data stored in the Excel sheet format using read_excel()
sheet = pd.read_excel('cars2015.xlsx')
In [40]:
# Data present in DataFrames can be written to a spreadsheet file using to_excel()
# If the file doesn't exist, it is created; an existing file of the same name is overwritten
sheet.to_excel('newcars2015.xlsx')
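Because the cars files used above may not be available, here is a self-contained CSV round trip as a sketch. The small DataFrame is a hypothetical stand-in for the cars data set, and the file is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cars data set used above
cars = pd.DataFrame({"model": ["A", "B"], "mpg": [30.5, 24.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cars.csv")

    # index=False keeps the row labels out of the file
    cars.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
print(list(restored.columns))
```

Without `index=False`, to_csv() also writes the row labels, and reading the file back produces an extra column named "Unnamed: 0".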
Matplotlib
In [44]:
# We can use NumPy to specify the values for both axes with greater precision
x = np.arange(0, 5, 0.01)
plt.plot(x, [x1**2 for x1 in x]) # vertical co-ordinates of the points plotted: y = x^2
plt.show()
Multiline Plots
In [45]:
# Multiple functions can be drawn on the same plot
x = range(5)
plt.plot(x, [x1 for x1 in x])
plt.plot(x, [x1*x1 for x1 in x])
plt.plot(x, [x1*x1*x1 for x1 in x])
plt.show()
In [46]:
# Different colours are used for different lines
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*x1 for x1 in x],
x, [x1*x1*x1 for x1 in x])
plt.show()
Grids
In [47]:
# The grid() function adds a grid to the plot
# Passing True turns the grid on and passing False turns it off
# grid appears in the background of the plot
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.show()
In [49]:
# The range shown on each axis can also be set using xlim() and ylim()
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlim(-1, 5)
plt.ylim(-1, 10)
plt.show()
Adding Labels
In [50]:
# Labels can be added to the axes of the plot
x = range(5)
plt.plot(x, [x1 for x1 in x],
x, [x1*2 for x1 in x],
x, [x1*4 for x1 in x])
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.show()
Adding a Legend
In [52]:
# Legends explain the meaning of each line in the graph
x = np.arange(5)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.show()
Adding Markers
In [53]:
x = [1, 2, 3, 4, 5, 6]
y = [11, 22, 33, 44, 55, 66]
plt.plot(x, y, 'bo')
for i in range(len(x)):
    x_cord = x[i]
    y_cord = y[i]
    plt.text(x_cord, y_cord, (x_cord, y_cord), fontsize = 10)
plt.show()
Saving Plots
In [54]:
# Plots can be saved using savefig()
x = np.arange(5)
plt.plot(x, x, label = 'linear')
plt.plot(x, x*x, label = 'square')
plt.plot(x, x*x*x, label = 'cube')
plt.grid(True)
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title("Polynomial Graph")
plt.legend()
plt.savefig('plot.png') # Saves an image named 'plot.png' in the current directory
plt.show()
Plot Types
Matplotlib provides many types of plot formats for visualising information
Scatter Plot
Histogram
Bar Graph
Pie Chart
Histogram
In [55]:
# Histograms display the distribution of a variable over a range of frequencies or values
y = np.random.randn(100, 100) # 100x100 array of a Gaussian distribution
plt.hist(y) # Function to plot the histogram takes the dataset as the parameter
plt.show()
In [56]:
# A histogram groups values into non-overlapping intervals called bins
# The default number of bins is 10; here 100 bins are requested
y = np.random.randn(1000)
plt.hist(y, 100)
plt.show()
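To see what the number of bins changes without drawing anything, the counts can be computed directly with np.histogram, which follows the same binning idea as plt.hist. The seed and the sample size below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded so the run is repeatable
y = rng.standard_normal(1000)

# Each call returns the per-bin counts and the bin edges
counts_default, edges_default = np.histogram(y)        # 10 bins by default
counts_fine, edges_fine = np.histogram(y, bins=100)

print(len(counts_default), len(edges_default))   # 10 11
print(len(counts_fine), len(edges_fine))         # 100 101
print(counts_default.sum(), counts_fine.sum())   # 1000 1000
```

There is always one more edge than there are bins, and every sample falls into exactly one bin, so the counts add up to the sample size.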
Bar Chart
In [57]:
# Bar charts are used to visually compare two or more values using rectangular bars
# Default width of each bar is 0.8 units
# [1,2,3] Mid-point of the lower face of every bar
# [1,4,9] Heights of the successive bars in the plot
plt.bar([1,2,3], [1,4,9])
plt.show()
In [58]:
dictionary = {'A':25, 'B':70, 'C':55, 'D':90}
for i, key in enumerate(dictionary):
    plt.bar(i, dictionary[key]) # Each key-value pair is plotted as a separate bar at x-position i
plt.show()
In [59]:
dictionary = {'A':25, 'B':70, 'C':55, 'D':90}
for i, key in enumerate(dictionary):
    plt.bar(i, dictionary[key])
plt.xticks(np.arange(len(dictionary)), list(dictionary.keys())) # Adds the keys as labels on the x-axis
plt.show()
Pie Chart
In [60]:
plt.figure(figsize = (3,3)) # Size of the plot in inches
x = [40, 20, 5] # Proportions of the sectors
labels = ['Bikes', 'Cars', 'Buses']
plt.pie(x, labels = labels)
plt.show()
Scatter Plot
In [61]:
# Scatter plots display values for two sets of data, visualised as a collection of points
# Two uniformly distributed samples are plotted (np.random.rand draws from [0, 1))
x = np.random.rand(1000)
y = np.random.rand(1000)
plt.scatter(x, y)
plt.show()
Styling
In [62]:
# Matplotlib lets you choose custom colours for plots
y = np.arange(1, 3)
plt.plot(y, 'y') # Specifying line colours
plt.plot(y+5, 'm')
plt.plot(y+10, 'c')
plt.show()
Color code:
b = Blue
c = Cyan
g = Green
k = Black
m = Magenta
r = Red
w = White
y = Yellow
In [63]:
# Matplotlib allows different line styles for plots
y = np.arange(1, 100)
plt.plot(y, '--', y*5, '-.', y*10, ':')
plt.show()
# - Solid line
# -- Dashed line
# -. Dash-Dot line
# : Dotted Line
In [64]:
# Matplotlib provides customization options for markers
y = np.arange(1, 3, 0.2)
plt.plot(y, '*',
y+0.5, 'o',
y+1, 'D',
y+2, '^',
y+3, 's') # Specifying marker styles
plt.show()
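The colour codes, line styles and markers shown in this section can be combined in a single format string. The sketch below assumes a non-interactive backend ("Agg") so that it runs without a display; the data values are illustrative.

```python
import matplotlib
matplotlib.use("Agg")            # non-interactive backend, no window is opened
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1, 6)

fig, ax = plt.subplots()
# A format string combines a colour code, a marker and a line style
line_a, = ax.plot(x, x, "r--")       # red dashed line, no markers
line_b, = ax.plot(x, x * 2, "go-")   # green solid line with circle markers
line_c, = ax.plot(x, x * 3, "k^:")   # black dotted line with triangle markers

# The parsed style can be read back from each line object
print(line_a.get_linestyle(), line_b.get_marker(), line_c.get_linestyle())
plt.close(fig)
```

The same styles can also be given as keyword arguments (color, marker, linestyle), which is easier to read when many options are set.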
Streamlit
Install Streamlit
There are multiple ways to set up your development environment and install
Streamlit. Read below to understand these options. Developing locally with
Python installed on your own computer is the most common scenario.
Prerequisites
As with any programming tool, in order to install Streamlit you first need to make
sure your computer is properly set up. More specifically, you’ll need:
Python
A Python environment manager (recommended)
For this guide, we'll be using venv, which comes with Python.
A Python package manager
For this guide, we'll be using pip, which comes with Python.
Only on macOS: Xcode command line tools
Download Xcode command line tools using these instructions in order to let the
package manager install some of Streamlit's dependencies.
A code editor
Our favorite editor is VS Code, which is also what we use in all our tutorials.
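cd myproject
In your terminal, type:
python -m venv .venv
Activate the environment:
# Windows PowerShell
.venv\Scripts\Activate.ps1
Install Streamlit and test the installation:
pip install streamlit
streamlit hello
If this doesn't work, use the long-form command:
python -m streamlit hello
Create a file named app.py with the following code:
import streamlit as st
st.write("Hello world")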
Any time you want to use your new environment, you first need to go to your
project folder (where the .venv directory lives) and run the command to activate
it:
# Windows command prompt
.venv\Scripts\activate.bat
# Windows PowerShell
.venv\Scripts\Activate.ps1
When you're done using this environment, return to your normal shell by typing:
deactivate
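Whether an environment is currently active can also be checked from inside Python. The following is a minimal sketch using only the standard library; the helper name is hypothetical, and the check applies to venv-style environments (it is not expected to detect a conda environment).

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Virtual environment active:", in_virtual_environment())
```

Running this before and after activating the environment should show the interpreter path switching to the one inside the .venv directory.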
Install Streamlit using Anaconda Distribution
This page walks you through installing Streamlit locally using Anaconda
Distribution. At the end, you'll build a simple "Hello world" app and run it. You can
read more about Getting started with Anaconda Distribution in Anaconda's docs.
If you prefer to manage your Python environments via command line, check out
how to Install Streamlit using command line.
Prerequisites
A code editor
Anaconda Distribution includes Python and basically everything you need to get
started. The only thing left for you to choose is a code editor.
Our favorite editor is VS Code, which is also what we use in all our tutorials.
But don't worry! In this guide we'll teach you how to install and use an
environment manager (Anaconda).
Click "Create."
A terminal will open with your environment activated. Your environment's name
will appear in parentheses at the beginning of your terminal's prompt to show
that it's activated.
streamlit hello
If this doesn't work, use the long-form command:
python -m streamlit hello
Create a file named app.py with the following code:
import streamlit as st
st.write("Hello World")
Click your Python interpreter in the lower-right corner, then choose your
streamlitenv environment from the drop-down.
Right-click app.py in your file navigation and click "Open in integrated terminal".
A terminal will open with your environment activated. Confirm this by looking for
"(streamlitenv)" at the beginning of your next prompt. If it is not there, manually
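activate your environment with the command:
conda activate streamlitenv
Run your Streamlit app:
streamlit run app.py
In app.py, change st.write to st.title and save the file:
import streamlit as st
st.title("Hello World")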
In your browser, click "Always rerun" to instantly rerun your app whenever you
save a change to your file.
Your app will update! Keep making changes and you will see your changes as soon
as you save your file.
When you're done, you can stop your app with Ctrl+C in your terminal or just by
closing your terminal.