Introduction to Data Science
Unit 1
Data
Data can be considered a collection of information. Data may consist of numbers, text, figures,
images, videos, etc. Since computers are used for processing and analysis, whenever we talk
about data we assume it is in digital form. For example, the data for a class of students may
include each student's ID, name, gender, date of birth, mobile number, father's name, address,
etc. Similarly, data related to a particular disease may contain information on various
parameters such as age, weight, blood pressure, sugar level, and other disease-specific details
of patients.
Computers store data as binary values using patterns of two digits: 1 and 0. A bit is the
smallest unit of data and represents a single binary value. A byte is eight binary digits long,
i.e. equal to 8 bits. Storage and memory are commonly measured in megabytes and gigabytes. The
units of memory used for data storage are as follows:
Table 1: Data measurements and their sizes
Data Measurement | Size
Bit | Single binary digit, 0 or 1
Byte | 8 bits
Kilobyte (KB) | 1,024 bytes
Megabyte (MB) | 1,024 kilobytes
Gigabyte (GB) | 1,024 megabytes
Terabyte (TB) | 1,024 gigabytes
Petabyte (PB) | 1,024 terabytes
Exabyte (EB) | 1,024 petabytes
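The 1,024-based multipliers in Table 1 can be applied in code. The sketch below converts a raw byte count into a human-readable unit; the function name and unit labels are illustrative, not from the text.

```python
# Convert a raw byte count into a human-readable string using the
# 1,024-based multipliers from Table 1 (illustrative sketch).

UNITS = ["Bytes", "KB", "MB", "GB", "TB", "PB", "EB"]

def human_readable(num_bytes):
    """Return a string like '10.0 MB' for a raw byte count."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop once the value fits in this unit (or we run out of units).
        if size < 1024 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024

print(human_readable(10_485_760))  # 10.0 MB
```

For example, 10,485,760 bytes divided by 1,024 twice gives 10 MB, matching the table's definitions.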
Structured data
Data that follows a precise and consistent structure or organization is known as structured data,
and it facilitates easy searching and retrieval. This arrangement is frequently shown as rows and
columns, much like in a spreadsheet or table. In structured data systems, every column has a
designated data type, and every row has a record or instances of data. Because structured data is
easily accessed, queried, and analyzed using a variety of tools and techniques, it is extremely
valuable. This makes it the perfect format for machine learning and artificial intelligence
applications, as well as for data-driven applications like analytics and business intelligence.
For example, consider data containing the personal and educational information of all students
in a certain university department. A small part of such data is shown here:
Sr. No. | Student ID | Univ. Roll No. | Name | Date of Birth | Course | Semester
1 | 200001 | 2400001 | XYZ | 30/12/2002 | BCA | 2
2 | 200002 | 2400002 | UVW | 21/05/2003 | BCA | 3
3 | 200003 | 2400003 | TYR | 13/04/2001 | B.Sc. (IT) | 4
4 | 200004 | 2400004 | ABC | 28/09/2002 | BCA | 2
5 | 200005 | 2400005 | DEF | 18/07/2003 | B.Sc. (CS) | 3
Structured data has a well-defined structure, follows a consistent order, and can be easily
accessed and used by a person or a computer program. It is usually stored in well-defined
schemas such as databases, and it is generally tabular, with columns and rows that clearly
define its attributes.
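Because structured data follows a fixed schema, it can be loaded and queried programmatically. The sketch below stores a few of the student rows from the table above as CSV and filters them; the lowercase field names are assumptions for the example.

```python
# A minimal sketch of structured data: student records with a fixed
# schema (field names adapted from the table above), loaded and queried.

import csv
import io

csv_text = """student_id,univ_roll_no,name,date_of_birth,course,semester
200001,2400001,XYZ,30/12/2002,BCA,2
200002,2400002,UVW,21/05/2003,BCA,3
200003,2400003,TYR,13/04/2001,B.Sc. (IT),4
"""

# DictReader maps each row onto the column names in the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Because every row follows the same schema, querying is straightforward:
bca_students = [r["name"] for r in rows if r["course"] == "BCA"]
print(bca_students)  # ['XYZ', 'UVW']
```

The same consistency is what lets relational databases index and query structured data efficiently.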
Advantages:
Easy to understand and use: Structured data has a well-defined schema or data model, making it
easy to understand and use. This allows for easy data retrieval, analysis, and reporting.
Consistency: The well-defined structure of structured data ensures consistency and accuracy in the
data, making it easier to compare and analyze data across different sources.
Efficient storage and retrieval: Structured data is typically stored in relational databases, which are
designed to efficiently store and retrieve large amounts of data. This makes it easy to access and
process data quickly.
Enhanced data security: Structured data can be more easily secured than unstructured or semi-
structured data, as access to the data can be controlled through database security protocols.
Clear data lineage: Structured data typically has a clear lineage or history, making it easy to track
changes and ensure data quality.
Disadvantages:
Inflexibility: Structured data can be inflexible in terms of accommodating new types of data, as
any changes to the schema or data model require significant changes to the database.
Limited complexity: Structured data is often limited in terms of the complexity of relationships
between data entities. This can make it difficult to model complex real-world scenarios.
Limited context: Structured data often lacks the additional context and information that
unstructured or semi-structured data can provide, making it more difficult to understand the
meaning and significance of the data.
Expensive: Structured data requires the use of relational databases and related technologies, which
can be expensive to implement and maintain.
Data quality: The structured nature of the data can sometimes lead to missing or incomplete data,
or data that does not fit cleanly into the defined schema, leading to data quality issues.
Unstructured Data
Unstructured data refers to data that does not adhere to a fixed format. Unlike structured
data, which is neatly organized in rows and columns, unstructured data is format-free, making
it less straightforward to analyze and process. Unstructured data can also be classified as
qualitative data; it cannot be processed and evaluated using standard data tools and
techniques. Such data lacks a predetermined structure and does not conform to any predefined
framework. Since it lacks a specified data schema, it is best managed in non-relational
databases. Some common examples of unstructured data are text files, video files, reports,
tweets, emails, and images.
Data Quality
Data quality refers to the accuracy, validity, completeness, and consistency of data, ensuring
that the data used for analysis, reporting, and decision-making is reliable and trustworthy.
Data Accuracy
Data is considered accurate if it correctly describes the real world, that is, if the entities
in our data correspond to real-world entities. Anomaly detection or outlier analysis helps
identify unexpected values or events in a dataset.
For example, a value of 118 in the 'Age' column of graduate-student data is inaccurate.
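A simple form of the outlier analysis mentioned above is a range check. The sketch below flags ages outside a plausible interval for graduate students; the bounds 16 and 60 are assumptions chosen for illustration.

```python
# Range-based outlier check for an 'Age' column (hypothetical bounds).

ages = [22, 24, 23, 118, 25, 21]

def find_outliers(values, low=16, high=60):
    """Return the values that fall outside the plausible range."""
    return [v for v in values if not (low <= v <= high)]

print(find_outliers(ages))  # [118]
```

The inaccurate value 118 from the example is flagged, while the plausible ages pass.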
Data Validity
Data validity refers to whether the data follows defined formats, values, and rules. This
involves ensuring data conforms to a particular format and adheres to predefined rules,
patterns, or sets of values. For example, a date must follow a recognized format such as
DD/MM/YYYY.
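A validity check of this kind can be sketched with Python's standard library. The example below tests whether a string is a real date in the DD/MM/YYYY format used in the student table; the format choice is an assumption for illustration.

```python
# Validity check sketch: does a string parse as a real date in
# DD/MM/YYYY form (the format used in the student records above)?

from datetime import datetime

def is_valid_date(text):
    """Return True if text is a genuine date in DD/MM/YYYY format."""
    try:
        datetime.strptime(text, "%d/%m/%Y")
        return True
    except ValueError:
        return False

print(is_valid_date("30/12/2002"))  # True
print(is_valid_date("2002-12-30"))  # False: wrong format
print(is_valid_date("31/02/2002"))  # False: no such date
```

Note that the check rejects not only malformed strings but also well-formed strings that do not denote real dates.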
Data Completeness
Data completeness refers to availability of all required data. It involves ensuring no essential data
is missing. If a dataset is incomplete, it can lead to misinformed decisions.
Data Consistency
Data consistency refers to the quality of data being reliable and in a consistent format across
various databases, systems, and applications. It ensures that data remains the same and aligns with
the established rules and standards throughout its lifecycle, regardless of the platform or location
it’s accessed from.
Data Uniqueness
Uniqueness implies that all data entities are represented only once in the dataset. This involves
managing and eliminating duplicate data entries to ensure that each piece of information is only
recorded once, preventing confusion or miscalculations.
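Duplicate elimination as described above can be sketched as a two-step process: count occurrences of a key field, then keep only the first record for each key. The `student_id` key and the sample records are assumptions for the example.

```python
# Uniqueness check sketch: find and remove duplicate records, keyed on
# an identifier field (student_id here, chosen for illustration).

from collections import Counter

records = [
    {"student_id": 200001, "name": "XYZ"},
    {"student_id": 200002, "name": "UVW"},
    {"student_id": 200001, "name": "XYZ"},  # duplicate entry
]

# Step 1: count how often each id appears.
counts = Counter(r["student_id"] for r in records)
duplicates = [sid for sid, n in counts.items() if n > 1]
print(duplicates)  # [200001]

# Step 2: keep only the first occurrence of each id.
seen, unique_records = set(), []
for r in records:
    if r["student_id"] not in seen:
        seen.add(r["student_id"])
        unique_records.append(r)
print(len(unique_records))  # 2
```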
Factors affecting Data Quality:
a) Human error
b) System errors
c) Sampling errors
d) Incomplete data
e) Bias
Information
Information is the result of processing data, usually by computer, which places the data in
context and gives it meaning. Data on its own has no meaning; it becomes information only when
it is interpreted. Data consists of raw facts and figures; when that data is processed and
arranged according to context, it yields meaningful output. In IT, symbols, characters, images,
or numbers are data: they are the inputs a system processes in order to produce a meaningful
interpretation. In other words, data in a meaningful form becomes information.
Information can be about facts, things, concepts, or anything relevant to the topic concerned. It
may provide answers to questions like who, which, when, why, what, and how. If we put
Information into an equation it would look like this:
Data + Meaning = Information
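The equation above can be illustrated with a tiny sketch: the raw numbers below mean nothing on their own, but attaching context (that they are daily temperatures, an assumption made up for this example) and processing them yields information.

```python
# 'Data + Meaning = Information' sketch: raw numbers gain significance
# only when context is attached and they are processed.

raw_data = [31, 33, 29, 35, 32]             # data: raw facts, no meaning
meaning = "daily maximum temperature (C)"   # context attached to the data

average = sum(raw_data) / len(raw_data)     # processing
information = f"The average {meaning} this week was {average:.1f}"
print(information)
```

The same five numbers could equally be exam scores or speeds; only the attached meaning and the processing turn them into information.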
Knowledge
Knowledge is a skill or theoretical understanding of a subject. If we look at information as a
sentence (answering a question), then knowledge would probably be a book: many related
sentences, in ordered paragraphs, structured in pages and chapters.
While data is physical evidence and information is a subjective statement regarding this evidence,
knowledge is a philosophical concept. The philosophical study of knowledge is called
epistemology. Knowledge is so abstract that we do not have the ability to store it (we store
only data). The AI field that tries to encode knowledge in a structured format that can be
stored is called knowledge representation.
We can summarize the difference among data, information and knowledge as follows:
Data | Information | Knowledge
Is objective | Should be objective | Is subjective
Has no meaning | Has a meaning | Has meaning for a specific purpose
Is unprocessed | Is processed | Is processed and understood
Is quantifiable; there can be data overload | Is quantifiable; there can be information overload | Is not quantifiable; there is no knowledge overload
Big Data
Big data refers to extremely large and diverse collections of structured, unstructured, and
semi-structured data that continue to grow exponentially over time. These datasets are so huge
and complex in volume, velocity, and variety that traditional data management systems cannot
store, process, and analyze them.
The amount and availability of data is growing rapidly, spurred on by digital technology
advancements, such as connectivity, mobility, the Internet of Things (IoT), and artificial
intelligence (AI). As data continues to expand and proliferate, new big data tools are emerging to
help companies collect, process, and analyze data at the speed needed to gain the most value from
it.
Big data describes large and diverse datasets that are huge in volume and also rapidly grow in size
over time. Big data is used in machine learning, predictive modeling, and other advanced analytics
to solve business problems and make informed decisions.
Big data definitions may vary slightly, but it will always be described in terms of volume, velocity,
and variety. These big data characteristics are often referred to as the “3 Vs of big data” and were
first defined by Gartner in 2001.
Volume
As its name suggests, the most common characteristic associated with big data is its high volume.
This describes the enormous amount of data that is available for collection and produced from a
variety of sources and devices on a continuous basis.
Velocity
Big data velocity refers to the speed at which data is generated. Today, data is often produced in
real time or near real time, and therefore, it must also be processed, accessed, and analyzed at the
same rate to have any meaningful impact.
Variety
Data is heterogeneous, meaning it can come from many different sources and can be structured,
unstructured, or semi-structured. More traditional structured data (such as data in spreadsheets or
relational databases) is now supplemented by unstructured text, images, audio, video files, or semi-
structured formats like sensor data that can’t be organized in a fixed data schema.
In addition to these three original Vs, three others are often mentioned in relation to
harnessing the power of big data: veracity, variability, and value.
Veracity: Big data can be messy, noisy, and error-prone, which makes it difficult to control
the quality and accuracy of the data. Large datasets can be unwieldy and confusing, while
smaller datasets could present an incomplete picture. The higher the veracity of the data,
the more trustworthy it is.
Variability: The meaning of collected data is constantly changing, which can lead to
inconsistency over time. These shifts include not only changes in context and interpretation
but also data collection methods based on the information that companies want to capture
and analyze.
Value: It’s essential to determine the business value of the data you collect. Big data must
contain the right data and then be effectively analyzed in order to yield insights that can
help drive decision-making.
Here are some big data examples:
Consumer behavior and shopping habits tracking data
Monitoring online payment patterns
Medical data such as research reports, clinical notes, and lab results
Image data from cameras and sensors, as well as GPS