
Unit-IV

Data Science-Applications
Applications of Data Science
Data science has a wide range of applications across various industries and fields. It involves extracting
valuable insights and knowledge from data to make informed decisions, solve complex problems, and drive
improvements. Here are some common applications of data science:

1. Business and Marketing:

- Customer Segmentation: Analysing customer data to identify different segments for targeted marketing
strategies.

- Market Basket Analysis: Understanding associations between products purchased together to optimize
product placement and promotions.

- Churn Prediction: Predicting which customers are likely to churn (stop using a product or service) and
taking measures to retain them.

- Recommendation Systems: Recommending products, services, or content to users based on their


preferences and behaviours.

2. Healthcare and Medicine:

- Disease Prediction: Using patient data to predict the likelihood of certain diseases or conditions.

- Drug Discovery: Analysing molecular data to discover new drugs and treatments.

- Healthcare Fraud Detection: Identifying fraudulent activities in healthcare claims to prevent financial
losses.

- Personalized Medicine: Tailoring medical treatments and interventions based on individual patient data.

3. Finance and Banking:

- Credit Scoring: Assessing the credit worthiness of individuals or businesses based on their financial
data.

- Algorithmic Trading: Using data analysis and algorithms to make automated trading decisions in
financial markets.

- Risk Management: Analysing data to assess and mitigate financial risks.

- Fraud Detection: Identifying fraudulent transactions and activities in real-time.

4. Transportation and Logistics:

- Route Optimization: Optimizing delivery routes to minimize costs and delivery time.

- Demand Forecasting: Predicting transportation demand to allocate resources effectively.

- Vehicle Maintenance: Using sensor data to predict and prevent breakdowns through predictive
maintenance.
5. Manufacturing and Supply Chain:

- Quality Control: Analysing production data to ensure product quality and identify defects.

- Inventory Management: Optimizing inventory levels to balance supply and demand.

- Supply Chain Optimization: Analysing supply chain data to minimize costs and improve efficiency.

6. Energy and Utilities:

- Smart Grid Management: Using data to manage energy distribution and consumption in real-time.

- Energy Consumption Analysis: Analysing consumption patterns to identify opportunities for energy
efficiency.

- Renewable Energy Forecasting: Predicting renewable energy generation to optimize energy grid
operations.

7. Social Sciences and Government:

- Crime Prediction and Prevention: Analysing crime data to predict crime hotspots and allocate resources
accordingly.

- Public Health Analysis: Tracking and responding to public health trends and epidemics.

- Policy Making: Using data to inform evidence-based policy decisions.

8. Entertainment and Media:

- Content Recommendation: Recommending movies, music, articles, and other content to users based on
their preferences.

- Audience Insights: Analysing audience data to understand viewer preferences and tailor content.

- Social Media Analysis: Analysing social media data to gauge public sentiment and trends.

9. Environmental Science:

- Climate Modelling: Using data to model and predict climate patterns and changes.

- Natural Disaster Response: Using data to anticipate and respond to natural disasters like hurricanes,
earthquakes, and floods.

- Wildlife Conservation: Analysing wildlife data to track and protect endangered species.

10. Education and Learning:

- Personalized Learning: Tailoring educational content to individual student needs and learning styles.

- Performance Analytics: Analysing student performance data to improve teaching methods and
curriculum.

11. Airline Route Planning

- Predicting flight delays helps the airline industry plan better and grow.
- It also helps determine whether a flight should fly non-stop to its destination or make a stopover en route, for example on a long route such as Delhi to the United States of America.

12. Augmented Reality

Do you realise there is a fascinating relationship between data science and virtual reality? A virtual reality
headset incorporates computing expertise, algorithms, and data to create the best viewing experience
possible.

The popular game Pokemon GO is a small step in that direction: players can wander about and see Pokemon
superimposed on walls, streets, and other real-world surfaces. The makers of this game chose the locations of
the Pokemon and gyms using data from Ingress, the previous app from the same company.

13. Speech recognition

Have you ever needed the help of a virtual speech assistant like Google Assistant, Alexa, or Siri?

Well, voice recognition technology is operating behind the scenes, attempting to interpret and evaluate
your words and deliver useful results. Image recognition can also be seen on social media
platforms such as Facebook, Instagram, and Twitter: when you upload a picture of yourself with someone
on your friends list, these applications recognise that person and suggest a tag.

14. Gaming

Video and computer games are now being created with the help of data science and that has taken the
gaming experience to the next level.

These are just a few examples of the diverse applications of data science. With the growth of data collection
and technological advancements, the potential applications of data science continue to expand into new
domains.

Technologies for Visualisation


What Are Data Visualization Tools and Technologies?

Some of the best data visualization tools include Google Charts, Tableau, Grafana, Chartist, FusionCharts,
Datawrapper, Infogram, and ChartBlocks. These tools support a variety of visual styles, are simple and
easy to use, and are capable of handling large volumes of data.

Data is becoming more important every day, and in any organisation it is crucial for making key decisions.
For the same reason, data visualisation is grabbing people's attention. Modern data visualisation tools and
advanced software are on the market. A data visualisation tool is software used to visualise data. The
features of each tool vary, but at their most basic they allow you to input a dataset and graphically alter it.
Most, but not all, come with pre-built templates for creating simple visualisations.

What Do the Best Data Visualization Tools Have in Common?

All of the data visualisation technologies available on the market share certain features. The first is
simplicity of use: you will most likely encounter two types of software, those that make it easy to visualise
data and those that make it difficult. Some include good documentation and tutorials and are constructed in
user-friendly ways; others, regardless of their other qualities, fall short in these areas, excluding them from
any list of "best" tools. The one thing you should ensure is that the software can handle large amounts of
data, and many kinds of data, in a single display.

The better software can also generate a variety of chart, graph, and map types. Naturally, some tools on the
market present the facts in a somewhat different manner, and some data visualisation tools specialise in a
single style of chart or map and excel at it; those tools are also among the "best" available. Finally, there are
cost considerations. While a higher price tag does not automatically disqualify a tool, it must be justified in
terms of better support, features, and overall value.

1. Tableau

One of the most widely used data visualization tools, Tableau, offers interactive visualization solutions to
more than 57,000 companies.

Providing integration for advanced databases, including Teradata, SAP, MySQL, Amazon AWS, and
Hadoop, Tableau efficiently creates visualizations and graphics from large, constantly evolving datasets
used for artificial intelligence, machine learning, and Big Data applications.

The Pros of Tableau:

1. Excellent visualization capabilities
2. Easy to use
3. Top class performance
4. Supports connectivity with diverse data sources
5. Mobile Responsive
6. Has an informative community

The Cons of Tableau:

1. The pricing is a bit on the higher side
2. Auto-refresh and report scheduling options are not available

2. Dundas BI

Dundas BI offers highly-customizable data visualizations with interactive scorecards, maps, gauges, and
charts, optimizing the creation of ad-hoc, multi-page reports. By providing users full control over visual
elements, Dundas BI simplifies the complex operation of cleansing, inspecting, transforming, and modelling
big datasets.

The Pros of Dundas BI:

1. Exceptional flexibility
2. A large variety of data sources and charts
3. Wide range of in-built features for extracting, displaying, and modifying data

The Cons of Dundas BI:

1. No option for predictive analytics
2. 3D charts not supported

3. Jupyter

A web-based application, Jupyter is one of the top-rated data visualization tools; it enables users to create
and share documents containing visualizations, equations, narrative text, and live code. Jupyter is ideal for
data cleansing and transformation, statistical modelling, numerical simulation, interactive computing, and
machine learning.

The Pros of Jupyter:

1. Rapid prototyping
2. Visually appealing results
3. Facilitates easy sharing of data insights

The Cons of Jupyter:

1. Tough to collaborate
2. At times code reviewing becomes complicated

4. Zoho Reports

Zoho Reports, also known as Zoho Analytics, is a comprehensive data visualization tool that integrates
Business Intelligence and online reporting services, which allow quick creation and sharing of extensive
reports in minutes. The high-grade visualization tool also supports the import of Big Data from major
databases and applications.

The Pros of Zoho Reports:

1. Effortless report creation and modification
2. Includes useful functionalities such as email scheduling and report sharing
3. Plenty of room for data
4. Prompt customer support.

The Cons of Zoho Reports:

1. User training needs to be improved
2. The dashboard becomes confusing when there are large volumes of data

5. Google Charts

One of the major players in the data visualization market space, Google Charts, coded with SVG and
HTML5, is famed for its capability to produce graphical and pictorial data visualizations. Google Charts
offers zoom functionality, and it provides users with unmatched cross-platform compatibility with iOS,
Android, and even the earlier versions of the Internet Explorer browser.

The Pros of Google Charts:

1. User-friendly platform
2. Easy to integrate data
3. Visually attractive data graphs
4. Compatibility with Google products.

The Cons of Google Charts:

1. The export feature needs fine-tuning
2. Inadequate demos on tools
3. Lacks customization abilities
4. Network connectivity required for visualization

6. Visual.ly

Visual.ly is one of the best-known data visualization tools on the market, renowned for its impressive
distribution network that illustrates project outcomes. Employing a dedicated creative team for data
visualization services, Visual.ly streamlines the process of importing data and outsourcing visualization
work, even to third parties.

The Pros of Visual.ly:

1. Top-class output quality
2. Easy to produce superb graphics
3. Several link opportunities

The Cons of Visual.ly:

1. Few embedding options
2. Showcases one point, not multiple points
3. Limited scope

7. RAW

RAW, better known as RawGraphs, works with delimited data such as TSV and CSV files. It serves as a
link between spreadsheets and data visualization. Featuring a range of conventional and non-conventional
layouts, RawGraphs provides robust data security even though it is a web-based application.

The Pros of RAW:

1. Simple interface
2. Super-fast visual feedback
3. Offers a high-level platform for arranging, keeping, and reading user data
4. Easy-to-use mapping feature
5. Superb readability for visual graphics
6. Excellent scalability option

The Cons of RAW:

1. Non-availability of log scales
2. Not very intuitive for new users

8. IBM Watson

Named after IBM founder Thomas J. Watson, this high-caliber data visualization tool uses analytical
components and artificial intelligence to detect insights and patterns from both unstructured and structured
data. Leveraging NLP (Natural Language Processing), IBM Watson's intelligent, self-service visualization
tool guides users through the entire insight discovery operation.

The Pros of IBM Watson:

1. NLP capabilities
2. Offers accessibility from multiple devices
3. Predictive analytics
4. Self-service dashboards

The Cons of IBM Watson:

1. Customer support needs improvement
2. High-cost maintenance

9. Sisense

Regarded as one of the most agile data visualization tools, Sisense gives users access to instant data
analytics anywhere, at any time. The best-in-class visualization tool can identify key data patterns and
summarize statistics to help decision-makers make data-driven decisions.

The Pros of Sisense:

1. Ideal for mission-critical projects involving massive datasets
2. Reliable interface
3. High-class customer support
4. Quick upgrades
5. Flexibility of seamless customization

The Cons of Sisense:

1. Developing and maintaining analytic cubes can be challenging
2. Does not support time formats
3. Limited visualization versions

10. Plotly

An open-source data visualization tool, Plotly offers full integration with analytics-centric programming
languages like MATLAB, Python, and R, which enables complex visualizations. Widely used for creating,
modifying, disseminating, and sharing interactive graphical data collaboratively, Plotly supports both on-
premise installation and cloud deployment.

The Pros of Plotly:

1. Allows online editing of charts
2. High-quality image export
3. Highly interactive interface
4. Server hosting facilitates easy sharing

The Cons of Plotly:

1. Speed is a concern at times
2. Free version has multiple limitations
3. Various screen-flashings create confusion and distraction

Bokeh (Python)
Bokeh is a popular open-source Python library for creating interactive visualizations for web browsers.
It's designed to help users create rich and interactive visualizations, allowing them to explore data and gain
insights in a web-based environment. Bokeh supports a wide range of plot types and customization options,
making it a versatile tool for data visualization.

Key features of Bokeh include:

1. Interactive Visualizations: Bokeh allows you to create interactive plots that respond to user
interactions such as zooming, panning, and hovering over data points. This interactivity can be useful for
exploring large datasets and revealing hidden patterns.

2. Multiple Output Options: Bokeh can generate visualizations in various formats, including HTML
files, standalone web applications, and embedded plots within Jupyter notebooks.

3. Wide Range of Plot Types: Bokeh supports a variety of plot types, including line charts, bar charts,
scatter plots, histograms, heatmaps, and more.

4. Customization: You can customize visual elements like colours, markers, and labels to create
aesthetically pleasing and informative visualizations.

5. Layouts and Widgets: Bokeh provides tools to arrange multiple plots and interactive widgets on a
single web page, allowing for complex and informative dashboards.

6. High-Performance Rendering: Bokeh is designed for efficient rendering of large datasets, making it
suitable for visualizing big data.

7. Integration with Python Ecosystem: Bokeh can be integrated with other popular data science
libraries in Python, such as NumPy, Pandas, and Scikit-learn.

To get started with Bokeh, you need to install the library using a package manager like pip.

pip install bokeh

Here's a simple example of creating a basic scatter plot using Bokeh:

from bokeh.plotting import figure, output_file, show


# Prepare the data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]
# Create a scatter plot
p = figure(title="Simple Scatter Plot")
p.scatter(x, y, size=10, color="blue")
# Specify the output file (HTML)
output_file("scatter_plot.html")
# Show the plot in a browser
show(p)

This code creates a scatter plot with interactive features like zoom and pan. The plot is saved as an
HTML file and can be opened in a web browser.
In Bokeh, there are primarily two types of visual interfaces that you can use to create interactive data
visualizations:

1. Bokeh Plotting Interface (High-Level Interface):

The Bokeh Plotting Interface provides a high-level, simplified way to create a wide range of plots and
visualizations with minimal code. It is inspired by the popular Matplotlib library and is designed to be user-
friendly. Key components of the Bokeh Plotting Interface include the figure function for creating plots,
various glyph functions (e.g., circle, line, bar) for adding visual elements to plots, and functions for
customizing plot attributes.

2. Bokeh Models and Bokeh Server (Low-Level Interface):

The Bokeh Models and Bokeh Server approach offers a lower-level, more programmatic way to create
complex and highly interactive data visualizations and web applications. In this approach, you create Bokeh
"models" that represent visual elements and widgets individually. You can then arrange and customize these
models to create sophisticated interactive dashboards and applications. The Bokeh Server allows you to
build real-time, interactive web applications by defining callbacks that respond to user interactions and
dynamically update the visualizations. This interface is suitable for projects that require fine-grained control
over the behaviour of individual visual elements and interactive components.
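
For instance, here is a minimal sketch of a Bokeh Server application (the file name app.py, the sine-wave
data, and the slider range are illustrative assumptions); saved as app.py, it would be launched with:
bokeh serve app.py

from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Slider
from bokeh.plotting import figure
import numpy as np

# Static x values and an initial sine curve held in a data source
x = np.linspace(0, 10, 200)
source = ColumnDataSource(data=dict(x=x, y=np.sin(x)))

plot = figure(title="Interactive Sine Wave")
plot.line('x', 'y', source=source)

freq = Slider(start=0.1, end=5, value=1, step=0.1, title="Frequency")

def update(attr, old, new):
    # Recompute y whenever the slider moves; the browser updates live
    source.data = dict(x=x, y=np.sin(freq.value * x))

freq.on_change('value', update)
curdoc().add_root(column(freq, plot))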

Various Visualization Technologies


Bokeh offers a variety of plot types that you can use to create interactive and informative visualizations.
Here are some of the most commonly used plot types available in Bokeh:

1. Scatter Plot: A scatter plot displays individual data points as dots on a 2D plane. It's useful for showing
the relationship between two variables.

2. Line Plot: A line plot shows data points connected by lines, often used to visualize trends or changes over
time.

3. Bar Plot: Bar plots display data using rectangular bars, where the height of each bar represents the value
of a specific category or variable.
4. Histogram: Histograms depict the distribution of data by grouping it into intervals, or bins, and
displaying the frequency of data points in each bin.

5. Area Plot: Area plots display data as areas under a line, which is particularly useful for visualizing
cumulative quantities or stacked data.

6. Step Plot: A step plot is similar to a line plot, but it uses horizontal and vertical steps to connect data
points, often used for stepwise or discrete data.

7. Pie Plot: A pie plot shows data as slices of a circular pie, with each slice representing a proportion of the
whole.

8. Box Plot: Box plots visualize the distribution of data using quartiles, medians, and potential outliers.

9. Candlestick Plot: Often used in financial analysis, candlestick plots represent the opening, closing, high,
and low prices of a stock or other financial instrument over a period of time.

10. Heatmap: Heatmaps display data in a grid, where the colour intensity represents the value of each cell.
They're often used to show correlations or patterns in tabular data.

11. Network Plot: Network plots visualize relationships between entities using nodes (vertices) and edges.

12. Hexbin Plot: Hexbin plots are useful for visualizing the distribution of data in a scatter plot by
aggregating points into hexagonal bins.

13. Polar Plot: Polar plots display data in a circular format, useful for visualizing periodic patterns or
directional data.

14. Contour Plot: Contour plots show 3D data on a 2D plane using contour lines to represent different levels
of a third variable.

15. Quiver Plot: Quiver plots display vector fields using arrows to indicate direction and magnitude.

These are just a few examples of the plot types you can create using Bokeh. Bokeh provides a wide range of
customization options, interactivity features, and tools that allow you to create complex and engaging
visualizations tailored to your data and communication goals.

Some more methods to visualize data

 Column Chart: It is also called a vertical bar chart where each category is represented by a
rectangle. The height of the rectangle is proportional to the values that are plotted.
 Bar Graph: It has rectangular bars in which the lengths are proportional to the values which are
represented.
 Stacked Bar Graph: It is a bar style graph that has various components stacked together so that apart
from the bar, the components can also be compared to each other.
 Stacked Column Chart: It is similar to a stacked bar graph; however, the components are stacked
vertically in columns rather than horizontally.
 Area Chart: It combines features of the line chart and bar chart to show how the numeric values of one or
more groups change over the progression of a second variable, typically time.
 Dual Axis Chart: It combines a column chart and a line chart with two y-axes to compare two variables
with different scales (see the sketch after this list).
 Line Graph: The data points are connected through a straight line; therefore, creating a
representation of the changing trend.
 Mekko Chart: It can be called a two-dimensional stacked chart with varying column widths.
 Pie Chart: It is a chart where various components of a data set are presented in the form of a pie
which represents their proportion in the entire data set.
 Waterfall Chart: With the help of this chart, the increasing effect of sequentially introduced positive
or negative values can be understood.
 Bubble Chart: It is a multi-variable graph that is a hybrid of Scatter Plot and a Proportional Area
Chart.
 Scatter Plot Chart: It is also called a scatter chart or scatter graph. Dots are used to denote values for
two different numeric variables.
 Bullet Graph: It is a variation of a bar graph, used as a compact replacement for dashboard gauges and
meters.
 Funnel Chart: The chart shows the flow of users through the stages of a business or sales process.
 Heat Map: It is a technique of data visualization that shows the level of instances as colour in two
dimensions.

Bokeh Library (Python)


Bokeh is a Python library that provides interactive and browser-based data visualization capabilities. It
allows you to create interactive plots, charts, and dashboards that can be displayed in web browsers. To get
started with Bokeh for data visualization, follow these steps:

1. Installation:

First, make sure you have Python installed on your system. You can install Bokeh using pip:

pip install bokeh

2. Importing Bokeh:

Import the necessary modules from the Bokeh library:

from bokeh.plotting import figure, output_file, show

3. Creating a Basic Plot:

Create a simple scatter plot using Bokeh. In this example, we'll create a scatter plot of random data points:

import numpy as np
# Generate some random data
x = np.random.random(100)
y = np.random.random(100)
# Create a figure object
plot = figure(title="Random Scatter Plot")
# Add a scatter glyph to the figure
plot.scatter(x, y)
# Specify the output file
output_file("scatter_plot.html")
# Show the plot in a browser
show(plot)

Run this script, and it will generate an HTML file named "scatter_plot.html" containing the interactive
scatter plot.
4. Customizing Plots:

Bokeh provides various options to customize the appearance of your plots. You can set attributes like
titles, axis labels, colours, markers, and more:

from bokeh.plotting import figure, output_file, show


import numpy as np
# Generate some random data
x = np.random.random(100)
y = np.random.random(100)
# Create a figure object with a custom title and axis labels
plot = figure(title="Customized Scatter Plot", x_axis_label="X-axis",
y_axis_label="Y-axis")
plot.scatter(x, y, size=10, color="red", marker="circle")

# Specify the output file


output_file("scatter_plot.html")
# Show the plot in a browser
show(plot)
5. Adding Interactivity:

One of Bokeh's strengths is its ability to create interactive plots. You can add interactivity by including
tools like pan, zoom, hover, and more:

from bokeh.models import HoverTool, PanTool, BoxZoomTool, ResetTool


# Create a HoverTool instance; "@x" and "@y" refer to columns in the
# data source that Bokeh created automatically for the scatter glyph
hover = HoverTool(tooltips=[("X", "@x"), ("Y", "@y")])
# Add the hover tool along with pan, box zoom, and reset tools
plot.add_tools(hover, PanTool(), BoxZoomTool(), ResetTool())
show(plot)

In this example, the HoverTool displays tooltips showing the values of the x and y coordinates when
hovering over data points.

[Figures: screenshots of hovering, panning, and zooming interactions]

6. Layouts, Widgets, and Dashboards:

Bokeh includes a wide range of interactive widgets like sliders, buttons, dropdowns, and text inputs that
can be easily integrated into your visualizations to enable user interactions. You can also create complex
layouts and dashboards by combining multiple plots and widgets. Also, we can arrange plots in rows,
columns, grids, and more:

from bokeh.layouts import row, column


# Create multiple plots
plot1 = figure(title="Plot 1")
plot2 = figure(title="Plot 2")
# Create a layout with plots in a row
layout = row(plot1, plot2)
show(layout)
7. Integration with Python Ecosystem:

To illustrate the integration of the Bokeh library with NumPy, Pandas, and scikit-learn, we can create a
simple example where we generate random data using NumPy, perform some data manipulation with
Pandas, and then create an interactive scatter plot using Bokeh. In this example, we'll generate synthetic
data, apply a simple linear regression using scikit-learn, and visualize the data and regression line with
Bokeh. Ensure you have these libraries installed before running the code.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
output_notebook() # For Jupyter Notebook integration
# Generate synthetic data
np.random.seed(0)
X = np.random.rand(100) * 10
y = 2 * X + 1 + np.random.randn(100) * 2
# Create a Pandas DataFrame
data = pd.DataFrame({'X': X, 'y': y})
# Perform linear regression using scikit-learn
model = LinearRegression()
model.fit(data[['X']], data['y'])
y_pred = model.predict(data[['X']])
# Create a Bokeh scatter plot
p = figure(title="Scatter Plot with Linear Regression Line")
p.scatter('X', 'y', source=ColumnDataSource(data), size=8,
          color="navy", legend_label="Data")
# Pass the raw arrays for the line; the predicted points are collinear,
# so the segments trace the fitted regression line
p.line(data['X'], y_pred, line_width=2, color="red",
       legend_label="Linear Regression Line")
p.xaxis.axis_label = 'X'
p.yaxis.axis_label = 'y'
p.legend.title = 'Legend'
# Show the plot
show(p)

8. Wide range of plots:

Here are some of the most commonly used plot types available in Bokeh:

Scatter Plot: A scatter plot displays individual data points as dots on a 2D plane. It's useful for showing the
relationship between two variables.

from bokeh.plotting import figure, show


from bokeh.io import output_notebook
import numpy as np
# Sample data
x = np.array([1, 2, 3, 4, 5])
y = np.array([10, 25, 15, 30, 20])
# Create a Bokeh figure
p = figure(title="Scatter Plot Example", x_axis_label="X-axis",
y_axis_label="Y-axis")
# Add scatter points to the figure
p.scatter(x, y, size=10, color="navy", alpha=0.5)
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)
Line Plot: A line plot shows data points connected by lines, often used to visualize trends or changes over
time.

from bokeh.plotting import figure, show


from bokeh.io import output_notebook
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
# Create a Bokeh figure
p = figure(title="Line Plot Example", x_axis_label="X-axis",
y_axis_label="Y-axis")
# Add a line glyph, with markers at the data points
p.line(x, y, line_width=2)
p.scatter(x, y, size=8)
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)

Bar Plot: Bar plots display data using rectangular bars, where the height of each bar represents the value of
a specific category or variable.

from bokeh.plotting import figure, show


from bokeh.io import output_notebook
import numpy as np
# Sample data
categories = ['Category A', 'Category B', 'Category C', 'Category D',
'Category E']
values = [10, 25, 15, 30, 20]
# Create a Bokeh figure
p = figure(x_range=categories, title="Vertical Bar Plot Example",
x_axis_label="Categories", y_axis_label="Values")
# Add vertical bars to the figure
p.vbar(x=categories, top=values, width=0.5, color="blue")
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)

Histogram: Histograms depict the distribution of data by grouping it into intervals, or bins, and displaying
the frequency of data points in each bin.

from bokeh.plotting import figure, show


from bokeh.io import output_notebook
import numpy as np
# Sample data
data = np.random.randn(1000) # Replace with your data
# Create a Bokeh figure
p = figure(title="Histogram Example", x_axis_label="Value",
y_axis_label="Frequency")
# Compute histogram data
hist, edges = np.histogram(data, bins=20)
# Add histogram bars to the figure
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
fill_color="blue", line_color="white")
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)
Area Plot: Area plots display data as areas under a line, which is particularly useful for visualizing
cumulative quantities or stacked data.

from bokeh.plotting import figure, show


from bokeh.io import output_notebook
import numpy as np
# Sample data
x = np.array([1, 2, 3, 4, 5])
y1 = np.array([10, 25, 15, 30, 20])
y2 = np.array([5, 15, 10, 20, 10])
y3 = np.array([2, 5, 3, 6, 4])
# Create a Bokeh figure
p = figure(title="Stacked Area Plot Example", x_axis_label="X-axis",
y_axis_label="Y-axis")
# Create stacked area plot iteratively
p.varea(x=x, y1=0, y2=y1, fill_color="blue", legend_label="Y1")
p.varea(x=x, y1=y1, y2=y1 + y2, fill_color="green", legend_label="Y2")
p.varea(x=x, y1=y1 + y2, y2=y1 + y2 + y3, fill_color="orange",
legend_label="Y3")
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)
Step Plot: A step plot is similar to a line plot, but it uses horizontal and vertical steps to connect data points,
often used for stepwise or discrete data.

from bokeh.plotting import figure, show


from bokeh.io import output_notebook
# Sample data
x = [1, 2, 3, 4, 5]
y = [10, 25, 15, 30, 20]
# Create a Bokeh figure
p = figure(title="Step Plot Example", x_axis_label="X-axis",
y_axis_label="Y-axis")
# Add step glyph to the figure
p.step(x, y, mode="before")
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)

Pie Plot: A pie plot shows data as slices of a circular pie, with each slice representing a proportion of the
whole. Bokeh doesn't have native support for pie charts, and creating them using wedge glyphs can be
complex. If you're looking for a simpler way to create a pie chart, you might consider using a different
visualization library like Matplotlib or Plotly.
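
For completeness, here is a minimal Matplotlib sketch of a pie chart (the labels and proportions are
illustrative):

import matplotlib.pyplot as plt
# Sample data
labels = ['Category A', 'Category B', 'Category C', 'Category D']
sizes = [35, 25, 25, 15]
# Draw the pie with a percentage label on each slice
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.title('Pie Plot Example')
plt.show()
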
Box Plot: Box plots visualize the distribution of data using quartiles, medians, and potential outliers.

from bokeh.plotting import figure, show


from bokeh.io import output_notebook
from bokeh.models import ColumnDataSource
import numpy as np
# Sample data
data = np.random.randn(100, 4) # 100 samples with 4 features
# Create a Bokeh figure
p = figure(title="Box Plot Example", x_axis_label="Features",
y_axis_label="Values")
# Create a ColumnDataSource from the data
source = ColumnDataSource(data=dict(x=[1, 2, 3, 4],
q1=np.percentile(data, 25, axis=0),
q3=np.percentile(data, 75, axis=0),
lower_bound=np.percentile(data, 25, axis=0) - 1.5 *
(np.percentile(data, 75, axis=0) - np.percentile(data, 25,
axis=0)),
upper_bound=np.percentile(data, 75, axis=0) + 1.5 *
(np.percentile(data, 75, axis=0) - np.percentile(data, 25,
axis=0))
))
# Create box glyphs
p.segment(x0='x', y0='lower_bound', x1='x', y1='upper_bound',
line_color="black", source=source)
p.vbar(x='x', top='q3', bottom='q1', width=0.5, fill_color="blue",
line_color="black", source=source)
# Set x-axis labels
p.xaxis.ticker = [1, 2, 3, 4]
p.xaxis.major_label_overrides = {1: 'Feature 1', 2: 'Feature 2', 3:
'Feature 3', 4: 'Feature 4'}
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)
Candlestick Plot: Often used in financial analysis, candlestick plots represent the opening, closing, high, and
low prices of a stock or other financial instrument over a period of time.

from bokeh.plotting import figure, show, output_file


from bokeh.models import ColumnDataSource
from datetime import datetime
# Sample data for candlestick plot
dates = ['2023-08-01', '2023-08-02', '2023-08-03', '2023-08-04']
opens = [100, 105, 110, 112]
closes = [102, 109, 112, 108]
highs = [105, 110, 115, 116]
lows = [98, 102, 108, 107]
# Convert dates to datetime format
date_format = '%Y-%m-%d'
formatted_dates = [datetime.strptime(date, date_format) for date in
dates]
# Create a ColumnDataSource
source = ColumnDataSource(data=dict(dates=formatted_dates,
opens=opens, closes=closes, highs=highs, lows=lows))
# Create a Bokeh figure
p = figure(x_axis_type="datetime", width=800, title="Candlestick Chart")
# Draw the candlestick bars
p.segment(x0='dates', y0='highs', x1='dates', y1='lows',
source=source, line_color="black")
p.vbar(x='dates', top='opens', bottom='closes', width=0.4,
source=source, fill_color="green", line_color="black")
# Customize aesthetics
p.xaxis.major_label_orientation = 45
# Specify the output file (HTML)
output_file("candlestick_plot.html")
# Show the plot
show(p)

Heatmap: Heatmaps display data in a grid, where the colour intensity represents the value of each cell.
They're often used to show correlations or patterns in tabular data.

from bokeh.plotting import figure, show


from bokeh.models import ColorBar, ColumnDataSource
from bokeh.transform import linear_cmap
from bokeh.io import output_notebook
import numpy as np
# Sample data
data = np.random.rand(10, 10) # Replace with your data
# Create a Bokeh figure
p = figure(title="Heatmap Example")
# Flatten the data for plotting
x = np.repeat(np.arange(data.shape[1]), data.shape[0])
y = np.tile(np.arange(data.shape[0]), data.shape[1])
values = data.flatten()
# Create a color mapper
color_mapper = linear_cmap(field_name='value', palette='Viridis256',
low=min(values), high=max(values))
# Create a data source
source = ColumnDataSource(data=dict(x=x, y=y, value=values))
# Add rect glyphs to the figure
p.rect(x='x', y='y', width=1, height=1, source=source,
line_color=None, fill_color=color_mapper)
# Add color bar
color_bar = ColorBar(color_mapper=color_mapper['transform'], width=8,
location=(0, 0))
p.add_layout(color_bar, 'right')
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)

Network Plot: Network plots visualize relationships between entities using nodes (vertices) and edges.

from bokeh.plotting import figure, show


from bokeh.io import output_notebook
import networkx as nx
import numpy as np
# Create a NetworkX graph
nx_graph = nx.Graph()
nx_graph.add_edges_from([(0, 1), (1, 2), (2, 0)])
# Create a Bokeh figure
p = figure(title="Network Plot Example", x_range=(-1.5, 1.5),
y_range=(-1.5, 1.5), toolbar_location=None)
# Extract node positions using NetworkX layout
pos = nx.spring_layout(nx_graph, scale=1, center=(0, 0))
# Plot nodes using circles (size is in screen pixels)
node_size = 20
for node, (x, y) in pos.items():
    p.circle(x, y, size=node_size, color="blue", line_color="black")
# Plot edges using segments
edge_coords = []
for edge in nx_graph.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_coords.append((x0, y0, x1, y1))
p.segment(x0=[x0 for x0, y0, x1, y1 in edge_coords],
          y0=[y0 for x0, y0, x1, y1 in edge_coords],
          x1=[x1 for x0, y0, x1, y1 in edge_coords],
          y1=[y1 for x0, y0, x1, y1 in edge_coords],
          line_color="gray", line_alpha=0.5)
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)

Hexbin Plot: Hexbin plots are useful for visualizing the distribution of data in a scatter plot by aggregating
points into hexagonal bins.
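
Here is a minimal Bokeh sketch of a hexbin plot, using randomly generated points as illustrative data:

from bokeh.plotting import figure, show
from bokeh.io import output_notebook
import numpy as np
# Sample data: normally distributed points
n = 5000
x = np.random.standard_normal(n)
y = np.random.standard_normal(n)
# Matching the aspect ratio keeps the hexagons regular
p = figure(title="Hexbin Plot Example", match_aspect=True,
           background_fill_color="#440154")
p.grid.visible = False
# Aggregate the points into hexagonal bins of the given size
p.hexbin(x, y, size=0.3)
# Display the plot (in a Jupyter Notebook)
output_notebook()
show(p)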

Polar Plot: Polar plots display data in a circular format, useful for visualizing periodic patterns or
directional data.
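
Bokeh has no built-in polar axis, so here is a minimal Matplotlib sketch (the rose-curve data is illustrative):

import numpy as np
import matplotlib.pyplot as plt
# Sample data: a three-petal rose curve
theta = np.linspace(0, 2 * np.pi, 200)
r = 1 + np.sin(3 * theta)
# Create a polar subplot and plot radius against angle
ax = plt.subplot(projection='polar')
ax.plot(theta, r)
ax.set_title('Polar Plot Example')
plt.show()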

Contour Plot: Contour plots show 3D data on a 2D plane using contour lines to represent different levels of
a third variable.

import numpy as np
import matplotlib.pyplot as plt
# Sample data
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = np.exp(-(X**2 + Y**2))
# Create a contour plot using Matplotlib
plt.contourf(X, Y, Z, levels=10, cmap='viridis')
plt.colorbar()
# Set labels and title
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Contour Plot Example')
# Show the plot
plt.show()

Quiver Plot: Quiver plots display vector fields using arrows to indicate direction and magnitude.
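
Here is a minimal Matplotlib sketch of a quiver plot (the circular vector field is illustrative):

import numpy as np
import matplotlib.pyplot as plt
# Sample data: a rotational vector field on a grid
x, y = np.meshgrid(np.linspace(-2, 2, 10), np.linspace(-2, 2, 10))
u, v = -y, x  # vector components at each grid point
# Draw arrows indicating direction and magnitude
plt.quiver(x, y, u, v)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Quiver Plot Example')
plt.show()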

Waterfall Chart

from bokeh.plotting import figure, show


from bokeh.models import ColumnDataSource
from bokeh.io import output_notebook
# Sample data
categories = ['Start', 'A', 'B', 'C', 'End']
values = [0, 50, -30, 20, 0]
# Each bar runs from the running total before the step to the running
# total after it; increases are green, decreases are red
starts, ends, colors = [], [], []
total = 0
for val in values:
    starts.append(total)
    total += val
    ends.append(total)
    colors.append("green" if val >= 0 else "red")
# Create a ColumnDataSource
source = ColumnDataSource(data={
    'categories': categories,
    'starts': starts,
    'ends': ends,
    'colors': colors,
})
# Create a Bokeh figure
p = figure(x_range=categories, title="Waterfall Chart")
# Plot the waterfall bars
p.vbar(x='categories', bottom='starts', top='ends', width=0.6,
       source=source, fill_color='colors', line_color="black")
# Customize plot
p.xaxis.axis_label = "Categories"
p.yaxis.axis_label = "Value"
p.yaxis.formatter.use_scientific = False
# Show the plot
output_notebook()
show(p)

Example: Here's an example of how you could create a simple sales data visualization using Bokeh in
Python. In this example, I'll use a scatter plot to display sample sales data and label each data point
with additional information using tooltips.

from bokeh.plotting import figure, output_file, show


from bokeh.models import HoverTool, ColumnDataSource
# Sample sales data
products = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']
sales = [1500, 2200, 1800, 2500, 1700]
profits = [300, 400, 350, 500, 320]
# Create a ColumnDataSource to link data to the plot
source = ColumnDataSource(data={'products': products, 'sales': sales,
'profits': profits})
# Create the figure
p = figure(x_range=products, title='Sample Sales Data', height=400)
# Create the scatter plot
p.circle(x='products', y='sales', size=15, source=source, color='blue',
legend_label='Sales')
p.square(x='products', y='profits', size=15, source=source,
color='green', legend_label='Profits')
# Add hover tool with tooltips
hover = HoverTool()
hover.tooltips = [
('Product', '@products'),
('Sales', '@sales'),
('Profits', '@profits'),
]
p.add_tools(hover)
# Customize plot aesthetics
p.xaxis.major_label_orientation = 1
p.legend.title = 'Data Type'
p.legend.label_text_font_size = '10pt'
# Specify the output file (HTML)
output_file("sales_data_plot.html")
# Show the plot
show(p)
Recent Trends In Various Data Collection and Analysis Techniques
In today’s market, data is driving organizations in countless ways. Data Science, Big Data Analytics, and
Artificial Intelligence are the key trends in today’s accelerating market. As more organizations adopt
data-driven models to streamline their business processes, the data analytics industry is seeing enormous
growth. From fuelling fact-based decision-making to adopting data-driven models to expanding
data-focused product offerings, organizations are leaning more and more towards data analytics.

These progressing data analytics trends can help organizations deal with many changes and uncertainties.
So, let’s take a look at a few of these Data Analytics trends that are becoming an inherent part of the
industry.

Trend 1: Smarter and Scalable Artificial Intelligence

COVID-19 has changed the business landscape in myriad ways, and historical data is no longer as relevant.
So, in place of traditional AI techniques, smarter and more scalable Artificial Intelligence and Machine
Learning techniques that can work with small data sets are arriving in the market. "Smarter and Scalable
Artificial Intelligence" refers to the development and deployment of AI systems that are not only intelligent
and capable of performing complex tasks but also scalable enough to handle increasing data volumes, user
demands, and computational resources efficiently.

Trend 2: Agile and Composed Data & Analytics

"Agile and Composed Data & Analytics" refers to an approach in data analytics and data management that
emphasize flexibility, collaboration, and customer-centricity, aiming to deliver high-quality software in a
more responsive and adaptable manner. This approach combines principles from Agile methodology with a
composed (or modular) architecture to create a dynamic and responsive data and analytics ecosystem. This
methodology is more efficient than the traditional software development methodologies like linear and
sequential methods.

Trend 3: Hybrid Cloud Solutions and Cloud Computing

Public clouds are cost-effective but do not provide high security, whereas a private cloud is secure but more
expensive. Hence, a hybrid cloud balances a public cloud and a private cloud, where cost and security are
weighed to offer more agility. This is often achieved with the help of artificial intelligence and machine
learning. Hybrid clouds are bringing change to organizations by offering a centralized database, data
security, scalability of data, and much more, at a lower cost.

Trend 4: Data Fabric

A data fabric is a modern data management framework that enables organizations to efficiently and
seamlessly manage, integrate, access, and analyse data across multiple data sources and environments. It
provides a unified and flexible way to handle data, making it accessible and useful for various data-driven
processes and applications.

Trend 5: Edge Computing For Faster Analysis

Edge computing is a distributed computing paradigm that brings data processing and analysis closer to the
data source, typically at or near the edge of the network where data is generated. This approach offers
several advantages, including faster analysis, reduced latency, improved data privacy, and more efficient
bandwidth usage.

Reduced Latency: By processing data at or near the source of data generation, edge computing significantly
reduces the time it takes for data to travel to a central data centre or cloud server for analysis. This low
latency is crucial for applications that require real-time or near-real-time responses, such as autonomous
vehicles, industrial automation, and augmented reality.

Trend 6: Augmented Analytics

Augmented analytics is an emerging approach to analytics and business intelligence (BI) that uses artificial
intelligence (AI) and machine learning (ML) technologies to automate and enhance data preparation,
analysis, and visualization. The goal of augmented analytics is to make data analysis more accessible to a
broader range of users, including business professionals who may not have a background in data science or
statistics.

Trend 7: The Death of Predefined Dashboards

Earlier, businesses were restricted to predefined static dashboards, and manual data exploration was limited
to data analysts or citizen data scientists. But it seems dashboards have outlived their utility due to their lack
of interactivity and user-friendliness. Questions are being raised about the utility and ROI of dashboards,
leading organizations and business users to look for solutions that will enable them to explore data on their
own and reduce maintenance costs.

It seems predefined dashboards will slowly be replaced by modern automated and dynamic BI tools that
present insights customized to a user’s needs, delivered at their point of consumption.

Business Intelligence (BI) tools are software applications and platforms designed to collect, analyze, and
present business data to help organizations make informed decisions. These tools enable users to turn raw
data into actionable insights, reports, dashboards, and visualizations.

Trend 8: XOps

"XOps" is a term that has emerged as an extension of the "DevOps" (Development and Operations) concept.
While DevOps primarily focuses on bridging the gap between software development and IT operations,
XOps broadens this idea to include other departments or teams in an organization. The "X" in XOps can
represent various functions or teams, depending on the context, like AIOps, MlOps, ModelOps, DataOs, etc.

XOps has become a crucial part of business transformation processes with the adoption of Artificial
Intelligence and Data Analytics across organizations. XOps started with DevOps, a combination of
development and operations whose goal is to improve business operations, efficiencies, and customer
experiences by using the best practices of DevOps. It aims to ensure reliability, re-usability, and
repeatability, and also to reduce the duplication of technology and processes. Overall, the
primary aim of XOps is to enable economies of scale and help organizations drive business value by
delivering a flexible design and agile orchestration in affiliation with other software disciplines.

Trend 9: Engineered Decision Intelligence

Decision intelligence is gaining a lot of attention in today’s market. It covers a wide range of decision-
making techniques and enables organizations to more quickly gain the insights needed to drive business
actions. It also includes conventional analytics, AI, and complex adaptive system applications. When
combined with composability and a common data fabric, engineered decision intelligence has great
potential to help organizations rethink how they optimize decision-making. In other words, engineered
decision intelligence is not made to replace humans; rather, it can help augment decisions taken by humans.

Trend 10: Data Visualization

With evolving market trends and business intelligence, data visualization has rapidly captured the market.
Data visualization is often described as the last mile of the analytics process, and it assists enterprises in
perceiving vast chunks of complex data. It has made it easier for companies to make decisions through
visually interactive means. It shapes the methodology of analysts by allowing data to be observed and
presented in the form of patterns, charts, graphs, etc. Since the human brain interprets and remembers
visuals better, it is a great way to predict future trends for a firm.

Application Development Methods Used In Data Science


Data Science is a field of study that focuses on using scientific methods and algorithms to extract knowledge
from data. The Data Scientist role may differ depending on the project. Some associate this position with
application analytics, others with vaguely defined AI, and the truth lies somewhere in between. Ultimately,
the Data Scientist's primary goal is to improve the quality of application development and bring value to a
product.

The role of a Data Scientist

A Data Scientist is generally required to have knowledge of data analysis, data transformations, and machine
learning. However, different positions are related to this role, such as:

 Data Analyst,
 Data Engineer,
 Machine Learning Engineer,
 MLOps, or
 DataOps.

Data Scientists may often be perceived as the full-stack developers of the machine learning world. Therefore,
many companies prefer hiring Data Scientists with particular skills spanning the roles mentioned above to fit
project requirements. In small teams, Data Scientists are responsible for designing architecture, building data
processing pipelines, preparing application analytics, developing machine learning solutions, deploying
these to the production environment, and monitoring results.

Transforming data into value

The primary purpose of a Data Scientist's work is to solve problems that involve reducing costs, increasing
revenue, and improving user experience. This can be achieved either by maintaining and investigating
application analytics or by introducing AI systems in a project.

Application development analytics usually include components addressing the following questions:

User demographics

 Where do the users come from?
 What age are they?
 What devices and systems are they using?

User activity
 How many active users does the application have?
 What time does the application suffer from increased traffic?
 What does the cohort analysis look like?
 What is the users' engagement time?

User paths

 Which application features are frequently used?
 Where do bottlenecks occur in application flows?

Additional application KPIs

 What is the overall user engagement?
 What is the application revenue?

A/B test results

 What are the results of A/B testing?
 How can the results change considering different user segments?

Crashes and Errors

 How many users are affected?
 Is there any pattern in the segment of affected users?

Analytics can give plenty of information to the development team and the client. Therefore, application
development can be accelerated through task prioritization, feature validation, and detection of hidden issues.

Although analytics is an important part of application development, Data Scientists are also responsible for
delivering machine learning solutions. Machine learning is a branch of science that focuses on the automatic
extraction of insights in order to build a knowledge model that can perform a certain task.

Data science is a multidisciplinary field that involves extracting insights and knowledge from data using
various techniques and tools. When it comes to application development in data science, several methods are
commonly used to create and deploy data-driven applications. Here are some of the key methods:

1. Custom Software Development: Building custom data science applications tailored to specific business
needs is a common approach. This involves developing software using programming languages like Python
or R, and frameworks like Flask or Django for web applications. Custom applications can include data
visualization dashboards, predictive models, recommendation systems, and more.

2. Data Visualization Tools: Data visualization tools like Tableau, Power BI, and Plotly are widely used to
create interactive and visually appealing dashboards that help users understand complex datasets. These
tools allow data scientists to create informative charts, graphs, and interactive plots without extensive
coding.

3. Web Scraping and APIs: Data scientists often develop applications that gather data from websites or
online sources using web scraping techniques. APIs (Application Programming Interfaces) provide
structured access to data from various platforms, and data scientists can develop applications that fetch and
analyze data from APIs.
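
As a minimal sketch of fetching data from an API with the requests library (the endpoint URL and
parameters below are hypothetical):

import requests
# Hypothetical REST endpoint; replace with a real API URL
url = "https://api.example.com/v1/records"
response = requests.get(url, params={"limit": 10}, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors
records = response.json()  # parse the JSON payload into Python objects
print(len(records), "records fetched")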

4. Jupyter Notebooks: Jupyter notebooks are interactive documents that combine code, visualizations, and
narrative text. Data scientists use Jupyter notebooks to develop and share code-driven analyses, create
prototypes of data models, and communicate findings to stakeholders.
5. Machine Learning Models as APIs: Trained machine learning models can be deployed as APIs, allowing
other applications to send data to the model and receive predictions in return. This approach is commonly
used for recommendation systems, sentiment analysis, fraud detection, and more.
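
For example, a trained scikit-learn model could be wrapped in a small Flask API along the following lines (a
minimal sketch; the model file model.pkl and the /predict route are illustrative assumptions):

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical pre-trained model

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)

A client application can then POST feature vectors to /predict and receive predictions back as JSON.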

6. Containerization and Microservices: Data science applications can be containerized using technologies
like Docker and deployed as microservices. This approach provides flexibility, scalability, and easier
management of applications and dependencies.

7. Serverless Computing: Serverless platforms like AWS Lambda, Azure Functions, and Google Cloud
Functions enable data scientists to deploy code in response to events without managing the underlying
infrastructure. This approach can be useful for building lightweight data processing pipelines and APIs.
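
A Python function deployed to AWS Lambda follows the standard handler shape; the event fields used
below are illustrative:

import json

def lambda_handler(event, context):
    # Read input values from the triggering event and compute a summary
    values = event.get("values", [])
    mean = sum(values) / len(values) if values else None
    return {
        "statusCode": 200,
        "body": json.dumps({"mean": mean}),
    }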

8. Streaming Data Applications: In scenarios where real-time data analysis is required, data scientists can
develop streaming data applications using technologies like Apache Kafka, Apache Flink, and AWS
Kinesis. These applications process and analyse data as it's generated in real time.
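
As a minimal sketch of a streaming consumer (assuming the kafka-python client, a broker on
localhost:9092, and a hypothetical sensor-readings topic):

from kafka import KafkaConsumer
import json

# Subscribe to the topic and decode each message from JSON
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    reading = message.value
    # Per-event analysis happens here, e.g. a simple threshold alert
    if reading.get("temperature", 0) > 75:
        print("High temperature alert:", reading)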

9. Automated ML Platforms: Automated Machine Learning (AutoML) platforms like Google AutoML,
H2O.ai, and DataRobot provide tools for automating the process of model selection, training, and
deployment. These platforms are useful for data scientists with varying levels of expertise.

10. Integration with Business Applications: Data science applications can be integrated into existing
business software or workflows. For example, predictive models can be integrated into customer
relationship management (CRM) systems or enterprise resource planning (ERP) systems to make data-
driven decisions.

11. Collaborative Platforms: Data science teams often use collaborative platforms like Git, GitHub, and
GitLab for version control and collaboration on code and projects. These platforms help manage the
development and deployment of data science applications in a team environment.

12. Continuous Integration and Continuous Deployment (CI/CD): CI/CD pipelines automate the testing,
integration, and deployment of code changes. Data science teams can use CI/CD practices to ensure that
updates to data science applications are deployed smoothly and reliably.

These are just a few of the application development methods used in data science. The choice of method
depends on factors such as the specific use case, the technical requirements, the available tools and
technologies, and the expertise of the data science team.

Data Scientist during the application development life cycle

There are two approaches to hiring a Data Scientist. Preparing an application MVP may be financially
difficult for a client, and during initial development there is an obvious need for developers rather than Data
Scientists. In this scenario, a Data Scientist is usually hired once the application is publicly available. The
gathered data can then be utilized for further application development, and the application might acquire
some AI-centred features.

On the other hand, a Data Scientist's knowledge and experience may be beneficial from the start of the
development cycle. Although introducing new machine learning solutions may not be crucial for a new
application, applying these solutions later requires proper data collection from the beginning. That means the
Data Scientist should be included in the work related to designing databases and data flows. This way, it will
be easier to develop machine learning solutions in the future.
