Alireza Manashty (Director of the Data Science Laboratory) at the University of
Regina, Canada. Through this, we could learn how to reframe a business problem as
a data science problem, and how to use big data and data science to solve an
actual business problem by leveraging the data science lifecycle to automate some
decisions with machine learning models.
Big data is the large volume of rapidly generated data that inundates a business
on a day-to-day basis and is hard to manage with conventional IT infrastructure.
There is no fixed threshold that defines how much data counts as big data; the
bar changes over time. About 80% of big data is unstructured (refer to Appendix
4), but a data scientist can use either structured or unstructured data to
accomplish a data science task and solve a business problem. In the 21st
century, many multinational corporations (MNCs) and other businesses use big data
analytics to keep track of primary transactions and to run more efficiently by
making better decisions. Some examples of daily data volumes: Facebook (0.5
petabytes); Walmart (40 petabytes); Google (24 petabytes), where 1 petabyte (PB)
= 1,000 terabytes (TB) = 10^6 gigabytes (GB). Big data also creates challenges
for a business's technological innovation. For example, a company that plans to
generate and store 40 PB in its systems needs 4,000 drives (at $200 each, which
implies about 10 TB per drive), costing $0.8 million plus tax, along with a
budget for other accessories. It must also take into account the hardware supply
lead time, the physical space needed to house the drives, and the transfer time:
at 100 MB/s, writing 40 PB to the drives would take roughly 12.7 years.
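The storage arithmetic above can be checked with a short calculation. This is a minimal sketch, assuming decimal (SI) units and a capacity of 10 TB per drive; the drive capacity is not stated in the text and is inferred from 40 PB spread over 4,000 drives. The function name `storage_estimate` is purely illustrative.

```python
# Back-of-the-envelope storage estimate using decimal (SI) units.
PB = 10**15  # bytes in one petabyte
TB = 10**12  # bytes in one terabyte
MB = 10**6   # bytes in one megabyte


def storage_estimate(total_bytes, drive_bytes, drive_price_usd, rate_bytes_per_s):
    """Return (drives needed, hardware cost in USD, transfer time in years)."""
    drives = -(-total_bytes // drive_bytes)  # ceiling division on integers
    cost = drives * drive_price_usd
    seconds = total_bytes / rate_bytes_per_s
    years = seconds / (365 * 24 * 3600)
    return drives, cost, years


# Assumption: 10 TB per drive, implied by 40 PB / 4,000 drives.
drives, cost, years = storage_estimate(40 * PB, 10 * TB, 200, 100 * MB)
print(f"{drives} drives, ${cost:,} before tax, {years:.1f} years at 100 MB/s")
# prints: 4000 drives, $800,000 before tax, 12.7 years at 100 MB/s
```

The transfer time assumes a single 100 MB/s stream running without interruption; writing to many drives in parallel would shorten it proportionally.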
1) Discovery
2) Data Preparation
Establish the analytic sandbox and perform ETLT (extract, transform, load, and
transform) on the data, followed by data exploration and conditioning (removing
outliers and missing data), then summarize and visualize the data. First,
examine the data by understanding what each data code means, and then proceed
to visualization. Visualization is vital before analysis because data is easier
to read in visual form, and it applies to any domain, making the data easier to
study and to communicate to other people (refer to Appendix 7A & 7B). In the
analysis stage, we can explore the relationship between two variables and
summarize the findings.
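As an illustration of the conditioning and summarizing steps, the sketch below uses only the Python standard library. It assumes missing values are encoded as `None` and treats points outside the 1.5 x IQR fences as outliers; both are common conventions chosen for this example, not something the report prescribes. The sample values and the function names `condition`, `summarize`, and `pearson` are hypothetical.

```python
import statistics


def condition(values, k=1.5):
    """Drop missing entries (None), then drop points outside the IQR fences."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if low <= v <= high]


def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation, a simple measure of how two variables relate."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Hypothetical raw column with two missing values and one extreme value.
raw = [12, 15, None, 14, 13, 250, 16, None, 15, 14]
clean = condition(raw)
summary = summarize(clean)
print(clean)    # the None entries and the value 250 are gone
print(summary)
```

In practice a data frame library would usually handle these steps; the plain-Python version is meant only to make the logic of each step visible.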
3) Model Planning
After analyzing the data, select a model suited to the specific business
problem. There are various categories of techniques to choose from in model
selection (refer to Appendix 8).
4) Model Building
Build training and test datasets: 80% of the labeled data is used for training,
while the remaining 20% is held out for testing only. The model never sees the
test data during training, and the test labels are hidden from it. After setting
up the model, we train the selected model, evaluate the fitted model, and adjust
it accordingly to get accurate results.
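The 80/20 split and the train-then-evaluate loop can be sketched as follows. This is a toy example using only the standard library: the data set, the single-feature threshold model, and the function names are all hypothetical, chosen to keep the example self-contained rather than to represent a model the report uses. The held-out labels are used only to score predictions, never to fit the model.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle labeled rows and hold out a fraction for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_threshold(train):
    """Toy model: midpoint between the mean feature value of each class."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, rows):
    """Share of rows where the thresholded prediction matches the label."""
    hits = sum((x > threshold) == bool(y) for x, y in rows)
    return hits / len(rows)


# Hypothetical labeled data: (feature, label), label is 1 when feature >= 50.
data = [(x, int(x >= 50)) for x in range(100)]
train, test = train_test_split(data)

threshold = fit_threshold(train)        # fit on the 80% training split only
test_accuracy = accuracy(threshold, test)  # score on the unseen 20%
print(len(train), len(test), test_accuracy)
```

If the test accuracy is much lower than the training accuracy, that is a signal to adjust the model (or revisit data preparation) before moving on to communicating results.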
5) Communicate Results
The data scientist will prepare different types of presentations for further use.
The target audience will be management, analysts, and responses. The aim is to
present the findings, predictions, and recommendations that solve the business
problem.
6) Operationalize