
Introduction to Data Science

Unit 1
Data
Data can be considered a collection of raw facts. Data may consist of numbers, text, figures, images, videos, etc. Since computers are used for processing and analysis, whenever we talk about data we assume it is in digital form. For example, the data of a class of students may include each student's ID, name, gender, date of birth, mobile number, father's name, and address; similarly, the data related to a particular disease may contain values of various parameters such as patients' age, weight, blood pressure, sugar level, and other disease-specific information.
Computers store data as binary values using patterns of two digits: 1 and 0. A bit is the smallest unit of data and represents a single binary value. A byte is eight binary digits long, i.e. equal to 8 bits. Storage and memory are measured in megabytes, gigabytes, and beyond. The following are the units of memory used for data storage:
Table 1: Data measurements and their sizes

Data Measurement    Size
Bit                 Single binary digit (0 or 1)
Byte                8 bits
Kilobyte (KB)       1,024 bytes
Megabyte (MB)       1,024 kilobytes
Gigabyte (GB)       1,024 megabytes
Terabyte (TB)       1,024 gigabytes
Petabyte (PB)       1,024 terabytes
Exabyte (EB)        1,024 petabytes
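
Each unit in Table 1 is 1,024 times the previous one. As a quick illustration, the following minimal Python sketch (the function name and the sample value are our own, for illustration only) converts a raw byte count into the largest sensible unit from the table:

    # A minimal sketch: convert a byte count into the units of Table 1.
    def human_readable(num_bytes: int) -> str:
        """Return the byte count expressed in the largest whole unit."""
        units = ["Bytes", "KB", "MB", "GB", "TB", "PB", "EB"]
        size = float(num_bytes)
        for unit in units:
            if size < 1024 or unit == units[-1]:
                return f"{size:.2f} {unit}"
            size /= 1024  # each step up the table is a factor of 1,024

    print(human_readable(5 * 1024**3))  # prints "5.00 GB"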

Structured data
Data that follows a precise and consistent structure or organization is known as structured data, and this arrangement facilitates easy searching and retrieval. The structure is frequently shown as rows and columns, much like in a spreadsheet or table. In structured data systems, every column has a designated data type, and every row holds a record or instance of the data. Because structured data is easily accessed, queried, and analyzed using a variety of tools and techniques, it is extremely valuable. This makes it the ideal format for machine learning and artificial intelligence applications, as well as for data-driven applications like analytics and business intelligence.
For example, consider data containing the personal and educational information of all students of a certain department in a university. A small part of such data is shown here:

Sr. No.  Student ID  Univ. Roll No.  Name  Date of Birth  Course       Semester
1        200001      2400001         XYZ   30/12/2002     BCA          2
2        200002      2400002         UVW   21/05/2003     BCA          3
3        200003      2400003         TYR   13/04/2001     B.Sc. (IT)   4
4        200004      2400004         ABC   28/09/2002     BCA          2
5        200005      2400005         DEF   18/07/2003     B.Sc. (CS)   3

Structured data has a well-defined structure, follows a consistent order, and can be easily accessed and used by a person or a computer program. Structured data is usually stored in well-defined schemas such as databases. It is generally tabular, with columns and rows that clearly define its attributes.

Characteristics of Structured Data:


Following are some characteristics of structured data:
1. Structured data follows a data model and has a recognizable structure.
2. Structured data is stored in the form of rows and columns.
3. Structured data is well organized, in the sense that the definition, format, and meaning of the data are explicitly known.
4. Similar entities in structured data are grouped together to form relations or classes.
5. Accessing information from structured data is easy.
6. Mathematical analysis can be performed easily on structured data.
7. Structured data can be stored in relational databases and managed using Structured Query Language (SQL), as the sketch below illustrates.
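
The following minimal sketch uses Python's built-in sqlite3 module to store and query a few rows of the student table above; the table and column names mirror that example and are otherwise our own illustrative choices:

    # A minimal sketch: structured data stored and queried with SQL (sqlite3).
    import sqlite3

    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute(
        """CREATE TABLE students (
               student_id    INTEGER PRIMARY KEY,
               univ_roll_no  INTEGER,
               name          TEXT,
               date_of_birth TEXT,
               course        TEXT,
               semester      INTEGER
           )"""
    )
    conn.executemany(
        "INSERT INTO students VALUES (?, ?, ?, ?, ?, ?)",
        [
            (200001, 2400001, "XYZ", "30/12/2002", "BCA", 2),
            (200003, 2400003, "TYR", "13/04/2001", "B.Sc. (IT)", 4),
        ],
    )

    # Every column has a designated data type and every row is a record,
    # so retrieval is a simple, declarative query:
    for row in conn.execute("SELECT name, course FROM students WHERE semester > 2"):
        print(row)  # ('TYR', 'B.Sc. (IT)')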

Advantages:
Easy to understand and use: Structured data has a well-defined schema or data model, making it
easy to understand and use. This allows for easy data retrieval, analysis, and reporting.
Consistency: The well-defined structure of structured data ensures consistency and accuracy in the
data, making it easier to compare and analyze data across different sources.
Efficient storage and retrieval: Structured data is typically stored in relational databases, which are
designed to efficiently store and retrieve large amounts of data. This makes it easy to access and
process data quickly.
Enhanced data security: Structured data can be more easily secured than unstructured or semi-
structured data, as access to the data can be controlled through database security protocols.
Clear data lineage: Structured data typically has a clear lineage or history, making it easy to track
changes and ensure data quality.
Disadvantages:
Inflexibility: Structured data can be inflexible in terms of accommodating new types of data, as
any changes to the schema or data model require significant changes to the database.
Limited complexity: Structured data is often limited in terms of the complexity of relationships
between data entities. This can make it difficult to model complex real-world scenarios.
Limited context: Structured data often lacks the additional context and information that
unstructured or semi-structured data can provide, making it more difficult to understand the
meaning and significance of the data.
Expensive: Structured data requires the use of relational databases and related technologies, which
can be expensive to implement and maintain.
Data quality: The structured nature of the data can sometimes lead to missing or incomplete data,
or data that does not fit cleanly into the defined schema, leading to data quality issues.

Unstructured Data

Unstructured data refers to data that does not adhere to a fixed format. Unlike structured data, which is neatly organized in rows and columns, unstructured data is format-free, making it less straightforward to analyze and process. Unstructured data can also be classified as qualitative data; it cannot be processed and evaluated using standard data tools and techniques. Such data lacks a predetermined structure and does not adhere to any predefined framework. Since it lacks a specified data schema, it is best managed in non-relational databases. Some common examples of unstructured data are text files, video files, reports, tweets, emails, and images.

Characteristics of Unstructured Data:


Following are some characteristics of unstructured data:
1. Unstructured data neither conforms to a data model nor has any structure.
2. Unstructured data cannot be stored in the form of rows and columns as in databases.
3. Unstructured data does not follow any rules.
4. Unstructured data lacks any particular format or sequence.
5. Unstructured data has no easily identifiable structure.
6. Unstructured data cannot be used by computer programs easily because it lacks an identifiable structure.

Data Quality
Data quality refers to the accuracy, validity, completeness, and consistency of data, ensuring that the data used for analysis, reporting, and decision-making is reliable and trustworthy.
Data Accuracy
Data is considered accurate if it describes the real world, that is, if real-world entities are correctly represented in our data. Anomaly detection, or outlier analysis, helps identify unexpected values or events in a data set.
For example, a value of 118 in the 'Age' column of data about graduate students is inaccurate.
Data Validity
Data validity refers to whether the data follows defined formats, values, and rules. This involves ensuring that data conforms to a particular format and adheres to predefined rules, patterns, or sets of values. For example, a date must follow a recognized format, such as DD/MM/YYYY.
Data Completeness
Data completeness refers to the availability of all required data. It involves ensuring that no essential data is missing. If a dataset is incomplete, it can lead to misinformed decisions.
Data Consistency
Data consistency refers to the quality of data being reliable and in a consistent format across
various databases, systems, and applications. It ensures that data remains the same and aligns with
the established rules and standards throughout its lifecycle, regardless of the platform or location
it’s accessed from.
Data Uniqueness
Uniqueness implies that all data entities are represented only once in the dataset. This involves
managing and eliminating duplicate data entries to ensure that each piece of information is only
recorded once, preventing confusion or miscalculations.
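
These quality dimensions translate directly into automated checks. The following minimal Python sketch flags violations of accuracy, validity, completeness, and uniqueness; the sample records, the valid age range, and the DD/MM/YYYY date format are illustrative assumptions:

    # A minimal sketch of automated data-quality checks.
    from datetime import datetime

    records = [
        {"student_id": 200001, "age": 21, "date_of_birth": "30/12/2002"},
        {"student_id": 200001, "age": 118, "date_of_birth": "2002-12-30"},  # problem row
        {"student_id": 200004, "age": None, "date_of_birth": "28/09/2002"},
    ]

    seen_ids = set()
    for rec in records:
        sid = rec["student_id"]
        # Accuracy: an age of 118 cannot describe a real graduate student.
        if rec["age"] is not None and not (16 <= rec["age"] <= 80):
            print(sid, "fails accuracy: age =", rec["age"])
        # Validity: dates must follow the agreed DD/MM/YYYY format.
        try:
            datetime.strptime(rec["date_of_birth"], "%d/%m/%Y")
        except ValueError:
            print(sid, "fails validity: date =", rec["date_of_birth"])
        # Completeness: no essential field may be missing.
        if rec["age"] is None:
            print(sid, "fails completeness: age is missing")
        # Uniqueness: each entity should be recorded only once.
        if sid in seen_ids:
            print(sid, "fails uniqueness: duplicate entry")
        seen_ids.add(sid)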
Factors affecting Data Quality:
a) Human error
b) System errors
c) Sampling errors
d) Incomplete data
e) Bias

Importance of data quality


The importance of data quality can be described as follows:
1. Decision making: Poor-quality data can lead to inaccurate decisions and poor strategic planning for organizations.
2. Operational efficiency: Poor-quality data wastes time on rectification and can delay decision-making.
3. Integration and analysis: Poor-quality data makes integration and analysis complex, leading to incomplete insights and inaccurate conclusions.
4. Legal issues: Poor-quality data can lead to legal issues such as non-compliance with data protection laws, lawsuits, and damage to reputation.
5. Trust: People affected by decisions made from poor-quality data may lose trust.
Data Science
Data science is a cross-disciplinary field of study that utilizes the concepts and methods of mathematics, statistics, and computer science. Data science is a very broad term that mainly covers data collection, analysis, and prediction. It is a field of science that extracts information from both organized and complex data using various scientific techniques and algorithms. This information is then used to generate knowledge, make predictions, and develop data-driven solutions. It makes use of large amounts of data to draw insightful conclusions using computation and statistics, helping organizations make better decisions, predict future outcomes, discover patterns in data, and much more.
The term “data science” combines two key elements: “data” and “science.”
1. Data: It refers to the raw information that is collected, stored, and processed. In today’s
digital age, enormous amounts of data are generated from various sources such as sensors,
social media, transactions, and more. This data can come in structured formats (e.g.,
databases) or unstructured formats (e.g., text, images, videos).
2. Science: It refers to the systematic study and investigation of phenomena using scientific
methods and principles. Science involves forming hypotheses, conducting experiments,
analyzing data, and drawing conclusions based on evidence.
Data Science refers to the scientific study of data. Data Science involves applying scientific
methods, statistical techniques, computational tools, and domain expertise to explore, analyze, and
extract insights from data.

Information
Information is the result of processing data, usually by computer. Processing puts data in context and gives it meaning: information is data that has meaning. Data on its own has no meaning; it only becomes information when it is interpreted. Data consists of raw facts and figures, and when that data is processed and arranged according to context, it becomes a meaningful output, i.e. information. In IT, symbols, characters, images, and numbers are data: these are the inputs an IT system processes in order to produce a meaningful interpretation. Information can be about facts, things, concepts, or anything relevant to the topic concerned, and it may provide answers to questions like who, which, when, why, what, and how. If we put information into an equation, it would look like this:
Data + Meaning = Information
Knowledge
Knowledge is a skill or theoretical understanding of a subject. If we look at information as a sentence (answering a question), then knowledge would be a book: many related sentences, in ordered paragraphs, structured into pages and chapters.
While data is physical evidence and information is a subjective statement regarding this evidence, knowledge is a philosophical concept; the philosophical study of knowledge is called epistemology. Knowledge is so abstract that we cannot store it directly (we store only data). The AI field that tries to encode knowledge in a structured, storable format is called knowledge representation.
We can summarize the differences among data, information, and knowledge as follows:

Data                                         Information                                          Knowledge
Is objective                                 Should be objective                                  Is subjective
Has no meaning                               Has a meaning                                        Has meaning for a specific purpose
Is unprocessed                               Is processed                                         Is processed and understood
Is quantifiable; there can be data overload  Is quantifiable; there can be information overload   Is not quantifiable; there is no knowledge overload

Big Data
Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.
The amount and availability of data is growing rapidly, spurred on by digital technology
advancements, such as connectivity, mobility, the Internet of Things (IoT), and artificial
intelligence (AI). As data continues to expand and proliferate, new big data tools are emerging to
help companies collect, process, and analyze data at the speed needed to gain the most value from
it.
Big data describes large and diverse datasets that are huge in volume and also rapidly grow in size
over time. Big data is used in machine learning, predictive modeling, and other advanced analytics
to solve business problems and make informed decisions.
Definitions of big data may vary slightly, but big data is always described in terms of volume, velocity, and variety. These big data characteristics are often referred to as the "3 Vs of big data" and were first defined by Gartner in 2001.
Volume
As its name suggests, the most common characteristic associated with big data is its high volume.
This describes the enormous amount of data that is available for collection and produced from a
variety of sources and devices on a continuous basis.
Velocity
Big data velocity refers to the speed at which data is generated. Today, data is often produced in
real time or near real time, and therefore, it must also be processed, accessed, and analyzed at the
same rate to have any meaningful impact.
Variety
Data is heterogeneous, meaning it can come from many different sources and can be structured,
unstructured, or semi-structured. More traditional structured data (such as data in spreadsheets or
relational databases) is now supplemented by unstructured text, images, audio, video files, or semi-
structured formats like sensor data that can’t be organized in a fixed data schema.
In addition to these three original Vs, three others are often mentioned in relation to harnessing the power of big data: veracity, variability, and value.
 Veracity: Big data can be messy, noisy, and error-prone, which makes it difficult to control
the quality and accuracy of the data. Large datasets can be unwieldy and confusing, while
smaller datasets could present an incomplete picture. The higher the veracity of the data,
the more trustworthy it is.
 Variability: The meaning of collected data is constantly changing, which can lead to
inconsistency over time. These shifts include not only changes in context and interpretation
but also data collection methods based on the information that companies want to capture
and analyze.
 Value: It’s essential to determine the business value of the data you collect. Big data must
contain the right data and then be effectively analyzed in order to yield insights that can
help drive decision-making.
Here are some big data examples:
 Consumer behavior and shopping habits tracking data
 Monitoring online payment patterns
 Medical data such as research reports, clinical notes, and lab results
 Image data from cameras and sensors, as well as GPS

Data Science and Decision making


Predictive Analytics
Predicting future outcomes is enormously valuable when formulating a business strategy. Today,
modern data science has tools, such as machine learning and statistical modeling, that have refined
the process to be far more accurate and reliable. Statistical modeling creates mathematical
constructs to represent a set of data. These models capture relationships among data, allowing
predictions to be made based on input variables. The following are some of the types of predictive
analytics models businesses use:
Forecast Models
A forecast model predicts future events based on past data. Forecast models can help businesses in some of the following ways (a minimal sketch follows the list):
 Use resources more effectively
 Predict future sales
 Effectively time the launch of new services
 Estimate recurring costs
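As a minimal sketch of the idea, with made-up sales figures: forecast the next period as the average of the most recent periods. Real forecast models are more sophisticated, but the principle of predicting from past data is the same:

    # A minimal sketch of a forecast model: a simple moving average.
    def moving_average_forecast(history, window=3):
        """Forecast the next value as the mean of the last `window` values."""
        recent = history[-window:]
        return sum(recent) / len(recent)

    monthly_sales = [120, 135, 128, 142, 150, 147]
    print(moving_average_forecast(monthly_sales))  # (142 + 150 + 147) / 3 ≈ 146.33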
Classification Models
Classification models are a type of supervised machine learning model that categorizes data and
describes relationships within a given dataset. One of the best uses for classification models is
answering binary questions.
Businesses frequently use classification models in fraud detection and in evaluating credit risk. For example, they ask, "Is this transaction standard for this customer?" and "Is this applicant likely to default on their loan?"
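A minimal sketch of such a binary classifier is shown below. It assumes the scikit-learn library is available; the two features (transaction amount and distance from the customer's hometown) and the labels are synthetic, chosen only to illustrate the idea:

    # A minimal sketch of a classification model for a binary question:
    # "is this transaction standard for this customer?" (0 = standard, 1 = fraud).
    from sklearn.linear_model import LogisticRegression

    # Features: [transaction amount in $, distance from hometown in km]
    X = [[20, 1], [35, 3], [50, 5], [30, 2], [900, 4000], [750, 3500]]
    y = [0, 0, 0, 0, 1, 1]

    model = LogisticRegression().fit(X, y)
    print(model.predict([[40, 2], [800, 3800]]))  # expected: [0 1]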
Outlier Models
Outlier analysis focuses on finding anomalous entries within a dataset, either alone or as part of a category. An outlier model can provide more context than classification models, which are best for answering binary questions.
For instance, when used in fraud detection, an outlier model can factor in variables such as
location, time and transaction amount. It can use that data to determine that a $100 transaction in
a customer’s hometown is less likely to be fraudulent than a $100 transaction in a foreign country.
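A minimal sketch of this idea using z-scores over transaction amounts is shown below; the figures are illustrative, and a production model would also factor in location and time, as described above:

    # A minimal sketch of an outlier model: flag entries far from the mean.
    import statistics

    amounts = [95, 102, 98, 110, 105, 2500]  # one anomalous transaction
    mean = statistics.mean(amounts)
    stdev = statistics.stdev(amounts)

    for amount in amounts:
        z = (amount - mean) / stdev  # distance from the mean, in std deviations
        if abs(z) > 2:  # a common rule of thumb for anomalies
            print(f"anomalous entry: {amount} (z = {z:.1f})")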
Business Intelligence
Business intelligence uses data analytics to answer questions that help an organization reach its goals. The primary purpose of business intelligence is to provide a comprehensive view of a company's collective data and present it so that stakeholders can leverage it to enhance performance.
Business Intelligence Methods
Business intelligence methods encompass all the technologies, applications, strategies and
practices used to collect, analyze and present business information. The data science process
includes the following steps:
 Understanding and framing the business problem
 Collecting relevant data
 Cleaning data to remove irrelevant, duplicate or erroneous data
 Exploring the data for trends and insights
 Building and deploying models
 Communicating your results with actionable steps
Data Visualization
Most non-technical people have difficulty understanding mathematical formulas and statistical
models in their raw form. Data visualization is a graphical representation of data that makes it
easier to understand patterns, trends and outliers in data.
Charts, graphs, maps and other data visualization methods allow data scientists to easily
communicate their results to business stakeholders.
Data visualization can distill complex concepts into a shareable, interactive format. It brings data to life to tell a story and provide business insights at a glance. Far from being a casual afterthought, data visualization is one of the most essential skills for any data science professional (a minimal sketch follows the list below).
Effective data visualization incorporates the following elements:
 It’s targeted toward the audience and how they will understand and interpret the data
 It creates a framework that clearly establishes what data is being communicated
 It tells a compelling story that helps the viewer gain insight from the data
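As a minimal sketch, the following Python snippet (assuming the matplotlib library is installed; the sales figures are illustrative) turns a table of numbers into a trend a stakeholder can read at a glance:

    # A minimal sketch of data visualization with matplotlib.
    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    sales = [120, 135, 128, 142, 150, 147]

    plt.plot(months, sales, marker="o")
    plt.title("Monthly Sales")       # frame what data is being communicated
    plt.xlabel("Month")
    plt.ylabel("Sales (units)")
    plt.show()                       # the upward trend is visible at a glance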
The Role of Data Science in Business Decision Making
The role of data science in business decision-making has been on a steady rise, given the vast
amounts of data that modern businesses generate. This data, often referred to as 'big data,'
comprises an amalgamation of customer data, operational data and market data. Data scientists
play a pivotal role in extracting, processing and analyzing this data, transforming it into valuable
business insights.
Data scientists employ advanced data analytics techniques, including machine learning algorithms,
to dissect large and complex datasets. Their aim is to identify patterns, trends and correlations that
can influence business decisions. By doing so, data science becomes a cornerstone of strategic
planning, fostering a culture of data-driven decision-making across organizations.
Data-driven decision-making allows businesses to utilize data science findings to make more
informed business decisions. These decisions can range from customer segmentation strategies to
optimizing manufacturing processes, all supported by the critical data collected and analyzed by
data scientists.
Harnessing Data Science in Business Strategy
The integration of data science into business strategy provides businesses with a competitive edge.
Data scientists analyze customer data to glean insights into customer behavior, preferences and
trends. This information is invaluable for shaping business strategy, from marketing campaigns to
product development and customer service initiatives.
Data science also facilitates the prediction of future trends. By analyzing data, machine learning
algorithms can project potential future scenarios, allowing businesses to make proactive decisions
rather than merely reacting to changes in the market.
Data Collection and Analysis in Modern Business
Data collection and analysis have become central to modern business operations. Businesses
routinely gather data through various channels, including customer interactions, social media and
industry research. This data is then cleaned, processed and analyzed to produce actionable insights.
Data analytics, therefore, becomes a fundamental aspect of business analytics. By analyzing data,
businesses can discover critical patterns and trends, enabling them to make data-driven decisions
that align with their objectives. Data scientists often utilize various types of data in their analyses,
including both qualitative and quantitative data.
Through this robust process, businesses can improve their data literacy, ensuring they are well-
equipped to make informed decisions and shape their business strategies effectively.
Using Predictive Analysis for Business Growth
For businesses seeking growth and stability, the power of predictive analysis cannot be overstated.
By harnessing past data and performance metrics, predictive analytics allow for an evidence-based
approach to business decision-making. This methodology, which relies on data-driven evidence,
provides a robust platform for planning and shaping future strategies, covering areas such as sales,
marketing and even project management.
Incorporating predictive analysis allows businesses to anticipate the dynamics of their target
market, adjusting their strategies based on emerging trends and customer behavior. Moreover, it
offers insights into possible challenges in production processes, facilitating proactive problem-
solving measures.
Incorporating Artificial Intelligence in Data Science
The union of artificial intelligence (AI) and data science has ushered in a new era in the business
world. AI's ability to analyze raw data rapidly and accurately translates into valuable insights that
can significantly impact business outcomes. Such insights play a crucial role in unveiling customer
behaviors, identifying operational efficiencies and spotting emerging trends—all vital components
in achieving business goals.
Furthermore, AI simplifies the entire process of data analysis by automating tasks from data
collection to insight generation. Such automation ensures speed, accuracy and efficiency, thereby
improving the accuracy of the results.
The Significance of Analytical Tools and Technical Skills in Business
The growing complexity of data necessitates the use of advanced analytical tools and the development of technical skills in the modern business world. These tools, when wielded by
skilled data scientists, facilitate the efficient and effective parsing of complex datasets, generating
actionable insights for business decision-making.
The relevance of analytical tools extends beyond the realm of data scientists, with departments
across the entire organization harnessing their power. From sales and marketing to HR and
operations, each sector can use these tools to make decisions rooted in data, ultimately aligning
more closely with the overarching business objectives.
To conclude, the significance of data science in helping companies develop successful strategies
is evident. It provides the tools and insights required to make informed decisions, anticipate future
outcomes and steer the course of the business. As such, it is an invaluable asset for any business
seeking growth in today's data-driven world.
