CS442_DSA_Practical_File
Practical 1
Date: 30/06/2023
Aim:
To perform data pre-processing of the IBM Telco Churn dataset from
https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset.
Load data
Find missing values
Clean data
Find correlations between attributes.
Remove redundant attributes.
Normalize data
Visualize the data
Use numpy, pandas, and matplotlib.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (file name assumed; use the file downloaded from Kaggle)
df = pd.read_csv("telco_customer_churn.csv")

# Handle missing values (e.g., replace with mean or median, or drop them)
# For this example, let's drop rows with missing values
df.dropna(inplace=True)
# Normalize data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['tenure', 'MonthlyCharges']] = scaler.fit_transform(df[['tenure', 'MonthlyCharges']])
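The correlation, redundancy-removal, and visualization steps listed in the aim appear in the output screenshots; a minimal sketch of those steps (column names assumed from the Telco churn dataset) might look like:

# Correlation between numeric attributes (a sketch; column names assumed)
corr = df.select_dtypes(include="number").corr()
print(corr)

# Drop an attribute that is redundant with another (highly correlated)
# 'TotalCharges' is assumed here purely for illustration
df = df.drop(columns=["TotalCharges"], errors="ignore")

# Visualize the distribution of monthly charges
plt.hist(df["MonthlyCharges"], bins=30)
plt.xlabel("Monthly Charges")
plt.ylabel("Number of customers")
plt.title("Distribution of Monthly Charges")
plt.show()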
Output Screenshot:
Conclusion/Summary:
In this practical we performed data pre-processing of the IBM Telco Churn dataset from
https://www.kaggle.com/datasets/yeanzc/telco-customer-churn-ibm-dataset.
Practical 2
Date: 08/07/2023
Aim: To install the Hadoop framework, configure it, and set up a single-node cluster. Use web-based
tools to monitor your Hadoop setup.
Code:
THEORY:
Hadoop:
Hadoop is an open-source framework for the distributed storage (HDFS) and parallel processing (MapReduce on YARN) of very large datasets across clusters of commodity machines.
Conclusion/Summary:
In this practical, we installed the Hadoop framework, configured a single-node cluster, and monitored the setup using Hadoop's web-based interfaces.
Practical 3
Date: 14/07/2023
Aim: To implement file management tasks in Hadoop HDFS and perform Hadoop commands.
Code:
Commands :-
1.) ls :-
This command is used to list all the files and directories present in the Hadoop file system (e.g., hdfs dfs -ls /).
2.) mkdir :-
This command is used to create a new directory in HDFS (e.g., hdfs dfs -mkdir /newdir).
3.) touchz :-
This command creates an empty (zero-length) file in HDFS (e.g., hdfs dfs -touchz /newdir/empty.txt).
6.) copyToLocal :-
This command copies a file from HDFS to the local file system (e.g., hdfs dfs -copyToLocal /newdir/file.txt ./).
7.) rmdir :-
This command removes an empty directory from HDFS (e.g., hdfs dfs -rmdir /newdir).
Practical 4
Date: 21/07/2023
Aim: To implement the map, reduce, filter, and lambda functions in Python.
Code:
1. Map Function:
The map() function is used to apply a given function to each item of an iterable (e.g., a list) and
returns a new iterable with the results.
# Example: Doubling each element of a list using map
numbers = [1, 2, 3, 4, 5]
doubled = list(map(lambda x: x * 2, numbers))
print(doubled)
2. Reduce Function:
The reduce() function is used to apply a given function cumulatively to the items of an iterable,
reducing it to a single accumulated result.
# Example: Summing up all elements of a list using reduce
from functools import reduce
numbers = [1, 2, 3, 4, 5]
total = reduce(lambda x, y: x + y, numbers)
print(total)
3. Filter Function:
The filter() function is used to filter elements of an iterable based on a given function's condition.
# Example: Filtering even numbers from a list using filter
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9]
evens = list(filter(lambda x: x % 2 == 0, numbers))
print(evens) # Output: [2, 4, 6, 8]
4. Lambda Function:
Lambda functions are anonymous functions defined using the lambda keyword. They are often used
for short, simple operations.
# Example: Using a lambda function to square a number
square = lambda x: x ** 2
result = square(5)
print(result) # Output: 25
Conclusion/Summary: In this practical we learnt, through basic examples, how to use the map,
reduce, filter, and lambda functions in Python.
Practical 5
Date: 22/07/2023
Aim: To implement a word count application using the MapReduce programming model.
Code:
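The mapper is not reproduced in this listing. A minimal sketch of a streaming-style mapper (saved as a standalone script, here assumed to be named mapper.py) that reads lines from standard input and emits tab-separated word/count pairs is:

#!/usr/bin/env python3
# mapper.py -- emit "word<TAB>1" for every word read from standard input (sketch)
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")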
# Reducer function: per-key reduce logic (in the streaming job this is wrapped in a
# standalone reducer.py script that reads tab-separated word/count pairs from stdin)
def reducer(word, counts):
    total = sum(map(int, counts))
    print(f"{word}\t{total}")
# Use Hadoop Streaming to perform the word count
# (the streaming jar path depends on the Hadoop installation; mapper.py and
#  reducer.py are the mapper and reducer saved as standalone scripts)
!hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -inputformat org.apache.hadoop.mapred.TextInputFormat \
    -outputformat org.apache.hadoop.mapred.TextOutputFormat \
    -input /content/drive/My\ Drive/sample.txt \
    -output /content/drive/My\ Drive/wordcount_output \
    -mapper "python3 mapper.py" \
    -reducer "python3 reducer.py" \
    -file mapper.py -file reducer.py
# View the word count output (Hadoop writes results to part files in the output directory)
with open("/content/drive/My Drive/wordcount_output/part-00000") as result_file:
    wordcount_result = result_file.read()
print(wordcount_result)
Output Screenshot:
Conclusion/Summary:
In this practical we implemented a word count application using the MapReduce programming
model.
Practical 6
Date: 28/07/2023
Aim:
To design and implement MapReduce algorithms to take a very large file of integers and produce
as output:
a) The largest integer
b) The average of all the integers.
c) The count of the number of distinct integers in the input.
Code:
//Reducer.java (driver class)
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Reducer {
    public static void main(String args[]) throws IllegalStateException,
            IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Practical 4");
        job.setJarByClass(Reducer.class);
        job.setMapperClass(BDA_Mapper.class);
        job.setReducerClass(BDA_Reducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

//BDA_Mapper.java
import java.util.*;
import java.io.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class BDA_Mapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private TreeMap<String, Integer> tmap;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap = new TreeMap<String, Integer>();
    }

    // Count occurrences of each integer locally (an in-mapper combiner)
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (tmap.containsKey(value.toString().trim())) {
            int count = tmap.get(value.toString().trim());
            tmap.put(value.toString().trim(), count + 1);
        } else {
            tmap.put(value.toString().trim(), 1);
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // Emit each distinct integer with its local count
        for (Map.Entry<String, Integer> entry : tmap.entrySet()) {
            String number = entry.getKey();
            int count = entry.getValue();
            context.write(new Text(number), new LongWritable(count));
        }
    }
}

//BDA_Reducer.java
import java.util.*;
import java.io.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class BDA_Reducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private TreeMap<String, Long> tmap2;
    private int max = Integer.MIN_VALUE, unique = 0, cnt = 0;
    private long sum = 0;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap2 = new TreeMap<String, Long>();
    }

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        String number = key.toString();
        long count = 0;
        for (LongWritable val : values) {
            count += val.get();
            sum += ((int) val.get()) * Integer.parseInt(number.trim());
        }
        tmap2.put(number, count);
        cnt += count;
        if (max < Integer.parseInt(number.trim()))
            max = Integer.parseInt(number.trim());
        unique++;
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        // Emit each distinct integer with its total count, then the summary statistics
        for (Map.Entry<String, Long> entry : tmap2.entrySet()) {
            Long count = entry.getValue();
            String name = entry.getKey();
            context.write(new Text(name), new LongWritable(count));
        }
        context.write(new Text("MAX NUMBER = "), new LongWritable(max));
        context.write(new Text("AVERAGE = "), new LongWritable(sum / cnt));
        context.write(new Text("Total Unique Numbers = "), new LongWritable(unique));
    }
}
OUTPUT:
Right-click MyProject, select Build Path -> Configure Build Path, click Add External JARs..., add the Hadoop jars from their download location, and then click Apply and Close.
Now export the project as a JAR file: right-click MyProject, choose Export..., go to Java -> JAR file, click Next, choose your export destination, click Next, choose the Main Class by clicking Browse, and then click Finish -> OK.
Conclusion/Summary:
In this practical, we learnt about the MapReduce paradigm in detail and executed simple operations (maximum, average, and distinct count) on a large file of integers using a MapReduce job written in Java.
Practical 7
Date: 04/08/2023
Aim: To implement basic functions and commands in R programming. Use RStudio to build a word
cloud and other data visualizations in R, which are easier to understand than a data table.
Code:
install.packages("tm")
install.packages("SnowballC")
install.packages("wordcloud")
install.packages("RColorBrewer")
library("wordcloud")
library("RColorBrewer")
# To choose the text file text =
readLines(file.choose()) # VectorSource() function #
creates a corpus of
# character v ectors docs =
Corpus(VectorSource(text)) # Text transformation
toSpace = content_transformer(
function (x, pattern)
gsub(pattern, " ", x))
docs1 = tm_map(docs, toSpace, "/") docs1 =
tm_map(docs, toSpace, "@") docs1 =
tm_map(docs, toSpace, "#")
strwrap(docs1)
# Cleaning the Text docs1 = tm_map(docs1, content_transformer(tolower))
docs1 = tm_map(docs1, removeNumbers) docs1 = tm_map(docs1,
stripWhitespace) # Build a term-document matrix dtm =
TermDocumentMatrix(docs) m = as.matrix(dtm) v
= sort(rowSums(m), decreasing = TRUE)
d = data.frame(word = names(v), freq =
Page 20 | 60
CS442: DATA SCIENCE AND ANALYTICS ID: 21DCS029
v) head(d, 10)
# Generate the Word cloud wordcloud(words =
d$word, freq = d$freq, min.freq = 1, max.words = 200, random.order = FALSE, rot.per
= 0.35, colors = brewer.pal(8, "Dark2"))
Output Screenshot:
Conclusion/Summary:
In this practical, we implemented basic R functions and generated a word cloud showing the most frequent words in a text file.
Practical 8
Date: 11/08/2023
Aim: To implement supervised learning algorithms (linear regression and logistic regression) using
R.
Code:
Install Required Packages:
To install R packages, use the following code cell in your Colab notebook:
# Install and load necessary libraries
install.packages("tidyverse")
library(tidyverse)
install.packages("glmnet")
library(glmnet)
Load and Preprocess Data
You'll need a dataset for your regression tasks. You can upload a dataset directly to
Google Colab or load it from an online source. Here, I'll provide an example using a built-
in R dataset.
After fitting the models, you can evaluate them using appropriate metrics and visualize the
results.
# Fit a linear regression model on mtcars (predictors assumed for illustration)
model_lm <- lm(mpg ~ wt + hp, data = mtcars)
# Evaluate the linear regression model
linear_predictions <- predict(model_lm, newdata = mtcars)
linear_mse <- mean((mtcars$mpg - linear_predictions)^2)
linear_mse
Output Screenshot:
Conclusion/Summary:
In this practical we implemented supervised learning algorithms (linear regression and logistic
regression) using R.
Practical 9
Date: 18/08/2023
Aim: To implement unsupervised learning algorithms (k-means) using R.
Code:
1. Install Required Packages:
To install R packages, use the following code cell in your Colab notebook:
install.packages("ggplot2")
2. Generate or Load Data:
You can generate synthetic data or upload your own dataset. Here's an example of generating
random data:
set.seed(123) # For reproducibility
data <- data.frame(
x = rnorm(100, mean = 0, sd = 1),
y = rnorm(100, mean = 0, sd = 1)
)
3. Perform K-Means Clustering:
Use the k-means clustering algorithm as shown in the previous response:
k <- 3 # Number of clusters
set.seed(123) # For reproducibility
kmeans_result <- kmeans(data, centers = k)
4. View the Cluster Assignments:
Check the cluster assignments for each data point:
cluster_assignments <- kmeans_result$cluster
5. Visualize the Clusters:
Create a scatter plot to visualize the clusters (one possible call, colouring points by their assigned cluster, is shown below):
library(ggplot2)
ggplot(data, aes(x = x, y = y, colour = factor(kmeans_result$cluster))) + geom_point()
Output Screenshot:
Conclusion/Summary:
In this practical we implemented unsupervised learning algorithms (k-means) using R.
Practical 10
Date: 25/08/2023
Aim: To install and implement basic database operations in MongoDB. To implement basic CRUD
operations (create, read, update, delete) in MongoDB.
Code:
THEORY:
MongoDB:
MongoDB is an open-source, document-oriented NoSQL database that stores records as flexible, JSON-like documents in collections rather than as rows in tables.
PRACTICAL:
We can check all the records of a collection using find() and pretty() to format the output.
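The CRUD operations in this practical were run in the mongo shell (see the screenshots). For reference, a minimal sketch of the same create, read, update, and delete steps using the PyMongo driver is shown below; the connection string, database, collection, and field names are assumptions:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # connection string assumed
db = client["college"]                              # database name assumed
students = db["students"]                           # collection name assumed

# Create: insert a document
students.insert_one({"name": "Asha", "roll_no": 1, "branch": "CSE"})

# Read: fetch a document
print(students.find_one({"roll_no": 1}))

# Update: modify a field of a matching document
students.update_one({"roll_no": 1}, {"$set": {"branch": "IT"}})

# Delete: remove the document
students.delete_one({"roll_no": 1})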
Conclusion/Summary:
In this practical, we learnt about MongoDB and Cassandra and performed basic CRUD operations in both databases.
Practical 11
Date: 01/09/2023
Aim: To design a dashboard using Google Data Studio.
Code:
Once you have signed in and accepted any Terms of Service notices, we are now ready to create our
blank report. Simply click on “+” for the “Blank” report template. There are other pre-built
templates you can check out, however, for this tutorial, we are going to create a new blank report.
Next, find the Google Analytics connector and select Google Analytics. You will notice there are
many other connectors you can choose from. These are all the different data sources that can be
pulled into a report. For now, let’s stick to our current dashboard.
We now need to authorize our connection to Google Analytics. You will need to select “Allow
Access” on the authorization popup that follows.
Our next step is to rename this connector to make it easier to find. In this example, “Demo Google
Analytics” is being used.
Now we must choose our Google Analytics account, then the property followed by the view. This
should be your main view that you use when viewing your data in Google Analytics.
To finish this step, click the connect button to connect to this data source.
The final stage of adding the data source is to click "Add to Report". At this stage, you are able to
modify fields and add calculated fields if needed. This is more advanced and will be covered in a
later article. For now, all the fields are set up perfectly fine for our dashboard.
This is the panel where you can manage your data sources for each report/dashboard. You will see
our first data source we added. To add a second, just click on “Add A Data Source”.
This will take you back to the connector gallery. Simply find the “Search Console” connector and
select it. You may need to authorize once again.
As we did before, rename this data source to something you can identify easily later. (See #1)
Next, select your site and for this dashboard we are going to focus on “Site Impression” table.
Pie Charts
Scorecards
Tables
By default, the report is pulling the last 28 days, but what if you wanted to view a different date range?
Google Data Studio has an element known as a "Date Range" filter. This is added the same way as all
other elements. Just select it from the menu bar and drag out a date range filter at an appropriate
size.
Without labels, the dashboard could be confusing to anyone using it. Also, we may want to add some branding or images to
help others make better sense of our newly created dashboard.
Text Headers
Google Data Studio makes adding text headers very simple with their “Text” element. Simply drag
out a text box and add your text. Under text properties, update the size, color, and font-weight to
achieve your desired look.
Add text
Adding Images
There are several reasons as to why you may want to add images to your dashboard. The most
common is to add your company’s branding.
Select the image element from the menu bar, drag a box onto your dashboard, then click the "select a
file" button on your "Data" tab. Browse your computer and find the image file you want to add.
Add image
Our added image
Step 7: Styling
Our dashboard has all the data we want to see, however, it doesn’t look amazing. With Google Data
Studio, it is very simple to make style edits.
Global Styles
When you don’t have any elements selected, you should see the Layout and Theme sidebar. In this
sidebar, we have the ability to control the overall layout and theme for our entire dashboard and all
elements.
You can control your Primary and Secondary text color and font under the theme tab.
In addition, you can also pick your chart palette. The chart palette is the order of colors that will be used
in all charts. For example, by default we have blue, red, yellow, green, etc. Notice that in the pie chart
on our dashboard, these same colors are used in descending order. If you have brand colors, I
would suggest changing these colors to your brand's colors. Make sure the colors are different
enough that you can tell which color represents which data.
Element Styles
Each element in Google Data Studio has its own style that can be adjusted. It will automatically
pull from the global styles; however, you can fine-tune and even override the global styles on each
element.
Most elements also contain additional parts that you can style. For example, a table can set the
colors for alternating rows and adjust its headers. Pie charts allow you to change where the legend is
positioned and how it is aligned. Each element has its own styles. Take some time to explore each
element.
Output Screenshot:
Conclusion/Summary:
In this practical we learnt to design a dashboard using Google Data Studio.
Practical 12
Date: 15/09/2023
Aim: To install, deploy, and configure Apache Spark. To select fields from a dataset using
Spark SQL.
Code:
THEORY:
Spark:
For distributed storage, Spark can interface with a wide variety of systems, including HDFS, Cassandra,
OpenStack Swift, Amazon S3, Kudu, and the Lustre file system, or a custom solution can be implemented.
Spark also supports a pseudo-distributed local mode, usually used only for
development or testing purposes, where distributed storage is not required and the local
file system can be used instead; in such a scenario, Spark is run on a single machine
with one executor per CPU core.
PRACTICAL:
You can download Spark from the official downloads page.
Then extract the archive using the tar command:
tar xvf spark-*
Before starting a master server, you need to configure environment variables. There are a few
Spark home paths you need to add to the user profile.
Use the echo command to add these three lines to .profile:
Now that you have completed configuring your environment for Spark, you can start a master
server.
In the terminal, type:
start-master.sh
In this single-server, standalone setup, we will start one slave (worker) server along with the master
server.
start-slave.sh spark://master:port
The master in the command can be an IP address or hostname.
start-slave.sh spark://ubuntu1:7077
Now that a worker is up and running, if you reload the Spark Master's Web UI, you should see it
in the list.
After you finish the configuration and start the master and slave server, test if the Spark shell
works.
Load the shell by entering:
spark-shell
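The field-selection queries from the aim were run in the Spark shell (see the screenshots). For reference, an equivalent selection with Spark SQL, sketched in PySpark with an assumed file name and assumed column names, is:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SelectFields").getOrCreate()

# Load a CSV file into a DataFrame (file name assumed)
df = spark.read.csv("telco_customer_churn.csv", header=True, inferSchema=True)

# Select fields with the DataFrame API
df.select("customerID", "tenure", "MonthlyCharges").show(5)

# Or register a temporary view and select the same fields with Spark SQL
df.createOrReplaceTempView("churn")
spark.sql("SELECT customerID, tenure, MonthlyCharges FROM churn").show(5)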
Conclusion/Summary:
In this practical, we learnt about Spark, installed and configured it, and explored the Spark shell.
Practical 13
Date: 22/09/2023
Aim: To implement logistic regression on the IBM churn dataset with Apache Spark.
Code:
Set up Google Colab with Apache Spark
!pip install pyspark
from pyspark.sql import SparkSession
Data Preprocessing
Perform data preprocessing as needed. This may include handling missing values, encoding
categorical variables, and assembling feature vectors.
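The preprocessing and model-fitting code appears in the output screenshots; a minimal end-to-end sketch with Spark ML is given below, with the file name, column names, and feature choice assumed for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("ChurnLogisticRegression").getOrCreate()

# Load the churn dataset (file name assumed) and drop rows with missing values
df = spark.read.csv("telco_customer_churn.csv", header=True, inferSchema=True)
df = df.na.drop(subset=["tenure", "MonthlyCharges", "Churn"])

# Encode the label and assemble numeric features into a single vector
df = StringIndexer(inputCol="Churn", outputCol="label").fit(df).transform(df)
assembler = VectorAssembler(inputCols=["tenure", "MonthlyCharges"], outputCol="features")
df = assembler.transform(df).select("features", "label")

# Train/test split, fit logistic regression, and evaluate with area under the ROC curve
train, test = df.randomSplit([0.8, 0.2], seed=42)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
predictions = model.transform(test)
auc = BinaryClassificationEvaluator(labelCol="label").evaluate(predictions)
print("Test AUC:", auc)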
Output Screenshot:
Conclusion/Summary:
In this practical we implemented logistic regression on the IBM churn dataset with Apache Spark.
Practical 14
Date: 23/09/2023
Aim: To perform Graph Path and Connectivity analytics and implement basic queries after loading
data using Neo4j.
Code:
THEORY:
Neo4j:
Neo4j is a native graph database that stores data as nodes and relationships and is queried with the Cypher query language, which makes it well suited to path and connectivity analytics.
PRACTICAL:
We will use the Neo4j Sandbox (Movies dataset) to run different queries, as shown in the screenshots below.
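The Cypher queries themselves were run in the sandbox browser. For reference, a minimal sketch of a path/connectivity query against the Movies dataset using the official Neo4j Python driver is shown below; the connection URL, credentials, and person names are assumptions:

from neo4j import GraphDatabase

# Connection details assumed; the sandbox shows its own Bolt URL and credentials
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Shortest path (connectivity) between two people in the Movies graph
query = """
MATCH p = shortestPath((a:Person {name: $a})-[*..6]-(b:Person {name: $b}))
RETURN [n IN nodes(p) | coalesce(n.name, n.title)] AS path
"""

with driver.session() as session:
    result = session.run(query, a="Keanu Reeves", b="Tom Hanks")
    for record in result:
        print(record["path"])

driver.close()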
Conclusion/Summary:
In this practical, we learnt about Neo4j and ran different types of queries on the Movies dataset in the Neo4j Sandbox.
Practical 15
Date: 29/09/2023
Aim:
To perform a case study of the following platforms for solving a big data analytics problem of your
choice: (1) Amazon Web Services, (2) Microsoft Azure, (3) Google App Engine.
Code:
IMPLEMENTATION:
1. Amazon Web Services:
OLX Autos moved its infrastructure to AWS, offloading operational tasks such as server shutdown. We found that even though we had just one incident a year, it was always easier to offload that operational control.
➔ Achieving Efficiency:
● After the migration, OLX Autos' infrastructure teams are now well prepared to
meet crucial internal product delivery deadlines.
● The organization originally used the Puppet configuration tool to manage its OpenShift
platform.
● Engineers had to maintain Puppet expertise and spent three to four days a
month applying and tracking big changes to the OLX Autos infrastructure.
● Through the migration to AWS, the organization removed Puppet from its architecture and
switched container management to AWS.
● By operating on Amazon EKS, the OLX Autos website benefits from increased
performance and scalability, and engineers may redirect their time to higher value-
added activities.
OLX Autos has also offloaded the vital task of handling Secure Sockets Layer
(SSL)/Transport Layer Security (TLS) certificates to AWS Certificate Manager.
Previously, teams had to manually buy and install new certificates each year, but with
AWS, new certificates are deployed with a few quick API calls.
"It's a relief that AWS is now going to take care of that," says Tomar. "We don't
have to spend a single minute in testing and upgrading certificates that may have an
effect on crucial deadlines for our company."
2. Microsoft Azure:
Health data analytics company accelerates critical data insights with SQL Server 2019
Big Data Clusters.
● Vital insights into overall health care patterns, such as COVID-19 test outcomes,
demand consistency and quick delivery times. That's why businesses like the
Telstra Health affiliate Dr. Foster are so important to transforming care.
● Headquartered in London, the healthcare analytics organization responded to the
need to make large databases accessible to various teams, to deliver analytics to its
healthcare clients more quickly with Microsoft SQL Server 2019 Big Data Clusters,
and to secure confidential data.
● These benefits translate into improved customer service, increased competitive
edge, and improved health insurance for all.
We’re building next-generation services that cut through complexity to surface actionable
insights to customers and pinpoint potential areas of concern. We needed a data analytics
platform to accelerate that goal and when I learned about SQL Server 2019 Big Data
Clusters, it aligned with everything we were trying to achieve.
George Bayliffe: Head of Data
Dr Foster
● Digitized patient records are replacing dog-eared paper files, driving the rapid growth
of the health care analytics field. Leading UK health analytics company Dr Foster
helps hospitals understand the factors influencing the quality of care. To deliver value,
Dr Foster must aggregate and analyze vast amounts of data.
➔ Responding to a pandemic:
● When COVID-19 forced lockdowns all over the world, the re-engineered Dr Foster
landscape eased the sudden transition for its employees.
● The company continues to meet via Microsoft Teams, using the whiteboarding
functionality to collaborate. Dr Foster is beginning to process 60,000 test results from
parent company Telstra Health in Australia each morning. The solution allows the
company to react as the market requires, says Bayliffe. "Because of the SQL
Server 2019 Big Data Clusters solution and the architecture we've put into place, we
can react to the market," he says.
We’re cloud-ready with our hybrid model, and we can transition to Microsoft Azure when
the need arises.
George Bayliffe: Head of Data
Dr Foster
"Google Cloud Platform services like Cloud Pub/Sub, Cloud Dataflow, and Big Query allow
us to minimize the effort and resources needed to build data pipelines for airline customers
and focus instead on the quality, volume, and velocity of data."
● Airlines, online travel agents, and travel eCommerce companies face challenges in
capturing, processing and utilizing data. The business turned to the cloud to deliver its
platform. "The cloud gave us an opportunity to work with large volumes of data while
becoming more agile," says Travlytix.
➔ Low-latency services:
Travlytix also utilizes several Google Cloud zones to offer ongoing, low-latency
support to clients around the globe.
In addition, services such as Cloud Key Management Service, which helps
companies control cryptographic keys and secure confidential data on Google
Cloud, facilitate compliance with global security regulations.
These support Travlytix's own data encryption during collection and processing to
protect it from intrusion.
Conclusion/Summary:
In this practical, we performed case studies of big data analytics problems solved on Amazon Web
Services, Microsoft Azure, and Google App Engine.