DATA SCIENCE USING R
VIII SEMESTER DS-427T
Department of Computer Science and Engineering, BVCOE New Delhi
Subject: Data Science Using R, Instructor: Ms RACHNA NARULA, 8/20/2024

Introduction
⚫ R is an open-source language to which developers and programmers from all around the world contribute.
⚫ Due to its platform independence, diversity of packages, and robust graphical features, it has become a primary tool for the analytics industry.
⚫ R has become the lingua franca of Data Science and statistics. It is one of the most popular analytics tools.
⚫ The number of R users is estimated at about 2 million.
⚫ Part of the GNU project.
⚫ Written primarily in C and Fortran.
⚫ Available for various operating systems: Unix/Linux, Windows, Mac.
⚫ Can be downloaded and installed from http://cran.r-project.org/

History of R
Popularity of R
Some application areas
Different Job Roles
Some of the positions that are available for R programmers are as follows:
⚫ Data Scientist: A Data Scientist is expected to extract data, transform it into a structured format, perform analysis, and forecast future outcomes.
⚫ Business Analyst: A Business Analyst develops technical solutions to business problems. They are required to seek solutions, advance the efforts of the company, and fulfill the requirements of the business.
⚫ Data Analyst: A Data Analyst is responsible for extracting and analyzing data. This task requires extensive use of R’s statistical libraries to deliver accurate results so that companies can make careful data-driven decisions.
⚫ Data Visualization Expert: R is well known for its visualization libraries. For this reason, data visualization experts in R programming are in demand in industry.
⚫ Quantitative Analyst: Quantitative Analysts work in the financial and banking industries. These industries deal with all types of data, and R is well suited to their various data problems.
What R does and does not

R does:
o data handling and storage: numeric, textual
o matrix algebra
o hash tables and regular expressions
o high-level data analytic and statistical functions
o classes (“OO”)
o graphics
o programming language: loops, branching, subroutines

R does not:
o is not a database, but connects to DBMSs
o has no graphical user interfaces, but connects to Java, TclTk
o language interpreter can be very slow, but allows to call own C/C++ code
o no spreadsheet view of data, but connects to Excel/MsOffice
o no professional / commercial support
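Several of the capabilities listed above can be seen in a few lines of base R. The following is only a minimal sketch: the matrix, the contact strings, and the helper name `sum_even` are made up for illustration, and an environment is used as one common way to get hash-table behavior.

```r
# Matrix algebra: solve the linear system A %*% x = b.
A <- matrix(c(2, 1, 1, 3), nrow = 2, byrow = TRUE)
b <- c(3, 5)
x <- solve(A, b)

# Regular expressions: keep only strings shaped like an e-mail address.
contacts <- c("asha@example.com", "not an address", "ravi@example.org")
emails <- grep("^[^@ ]+@[^@ ]+$", contacts, value = TRUE)

# Hash table: an environment maps string keys to values.
capitals <- new.env(hash = TRUE)
capitals[["India"]] <- "New Delhi"
capitals[["France"]] <- "Paris"
capital_of_india <- capitals[["India"]]

# Loops, branching, and a subroutine: sum the even numbers in a vector.
sum_even <- function(values) {
  total <- 0
  for (v in values) {
    if (v %% 2 == 0) total <- total + v
  }
  total
}
even_total <- sum_even(1:10)
```

In everyday R code the loop would usually be replaced by a vectorized expression such as `sum(values[values %% 2 == 0])`; the explicit loop is shown only because the list mentions loops and branching.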
Source: dataflair.org
When to choose between R and Python?
• The choice between R and Python also depends on what you are trying to accomplish with your code.
• If you are trying to analyze a dataset and present the findings in a research paper, then R is probably a better choice.
• But if you are writing a data analysis program that runs in a distributed system and interacts with lots of other components, it would be preferable to work with Python.
Structured data and Unstructured data
Structured Data
⚫ Data that is to the point, factual, and highly organized is referred to as structured data. It is quantitative in nature; that is, it contains measurable values such as numbers, dates, and times.
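In R, structured data is most often held in a data frame, where every row has the same typed columns. The sketch below uses a hypothetical `orders` table (the column names and values are invented for illustration) to show how numbers and dates can be queried directly once the structure is fixed.

```r
# Hypothetical sales records: every row has the same typed columns.
orders <- data.frame(
  order_id   = c(101L, 102L, 103L),
  order_date = as.Date(c("2024-08-01", "2024-08-03", "2024-08-07")),
  amount     = c(250.0, 99.5, 410.25),
  stringsAsFactors = FALSE
)

# Because the structure is known in advance, questions map directly to columns.
total_amount <- sum(orders$amount)
latest_order <- max(orders$order_date)
large_orders <- orders[orders$amount > 200, "order_id"]

# Each column carries a single, well-defined type.
column_types <- sapply(orders, function(col) class(col)[1])
```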
Unstructured Data
⚫ Unstructured files, log files, audio files, and image files all count as unstructured data. Some organizations have a great deal of data available but do not know how to derive value from it because the data is raw.
⚫ Unstructured data is data that lacks any predefined model or format. It requires a lot of storage space, and it is hard to secure. It cannot be represented in a data model or schema, which is why managing, analyzing, or searching unstructured data is hard. It resides in various formats such as text, images, audio, and video files. It is qualitative in nature and is sometimes stored in a non-relational (NoSQL) database.
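To make the difficulty concrete, the sketch below takes a few hypothetical raw log lines and imposes a structure on them with a regular expression. The log layout (date, time, level, message) is an assumption made for this example; real logs vary, and lines that do not fit the assumed layout are simply set aside.

```r
# Hypothetical raw log lines: free text with no predefined columns.
log_lines <- c(
  "2024-08-20 10:15:02 ERROR disk full on /dev/sda1",
  "2024-08-20 10:15:09 INFO backup finished",
  "garbled line without a timestamp",
  "2024-08-20 10:16:41 ERROR connection refused"
)

# Assumed layout: date, time, level, then a free-text message.
pattern <- "^([0-9]{4}-[0-9]{2}-[0-9]{2}) ([0-9:]{8}) ([A-Z]+) (.*)$"
parts <- regmatches(log_lines, regexec(pattern, log_lines))

# Keep only lines that matched (full match plus four captured groups).
parsed <- parts[lengths(parts) == 5]
unparsed <- sum(lengths(parts) == 0)

# Helper to pull the i-th element out of every parsed line.
field <- function(i) vapply(parsed, function(p) p[i], character(1))

log_table <- data.frame(
  date    = as.Date(field(2)),
  time    = field(3),
  level   = field(4),
  message = field(5),
  stringsAsFactors = FALSE
)

error_count <- sum(log_table$level == "ERROR")
```

Only after this extraction step can the log be filtered or counted like structured data.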
Qualitative and Quantitative Data
⚫ Statistics is a subject that deals with the collection, analysis, and representation of data. The analytical results derived from statistical methods are used in fields such as geology, psychology, and forecasting.
⚫ Quantitative data is numerical, countable, and measurable, providing information on how many, how much, or how often. Qualitative data, however, is descriptive, interpretative, and language-based, helping us understand the reasons, processes, or contexts behind certain behaviors.
Qualitative Data
⚫ Data collected on the basis of categorical variables is qualitative data. Qualitative data is more descriptive and conceptual in nature. It describes observations by type, collection, or category.
⚫ The data collection is based on what type of quality is being observed. Qualitative data is categorized into different groups based on characteristics. The data obtained from this kind of analysis or research is used for theorization, for understanding perceptions, and for developing hypotheses. These data are collected from texts, documents, transcripts, audio and video recordings, etc.
Examples of Qualitative Data
⚫ Textual responses from open-ended survey questions
⚫ Observational notes or fieldwork observations
⚫ Interview transcripts
⚫ Photographs or videos
⚫ Personal narratives or case studies
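As a rough illustration of how qualitative data is handled in R, the sketch below uses a hypothetical survey with one categorical answer and one open-ended comment per respondent (all values are invented). Such values are grouped and counted rather than averaged.

```r
# Hypothetical survey: one categorical answer and one free-text answer per person.
survey <- data.frame(
  preferred_tool = c("R", "Python", "R", "Excel", "R"),
  comment = c("Great plots", "Easy to deploy", "Good for statistics",
              "Familiar", "Many packages"),
  stringsAsFactors = FALSE
)

# Qualitative values are grouped and counted rather than averaged.
tool_counts <- table(survey$preferred_tool)
most_common <- names(tool_counts)[which.max(tool_counts)]

# Free text can be described (for example by word count),
# but the text itself has no numeric meaning.
comment_words <- lengths(strsplit(survey$comment, " +"))
```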
Quantitative Data
⚫ Data collected on the basis of numerical variables is quantitative data. Quantitative data is more objective and conclusive in nature. It measures values and is expressed in numbers. Data collection is based on “how much” of a quantity there is. Because the data in quantitative analysis is expressed in numbers, it can be counted or measured. The data is extracted from experiments, surveys, market reports, matrices, etc. Some examples of quantitative data are:
⚫ Age, Height, Weight, etc.
⚫ Temperature
⚫ Income
⚫ Number of siblings
⚫ GPA
Levels of Measurement
⚫ Levels of measurement, also called scales of measurement, tell you how precisely variables are recorded. In scientific research, a variable is anything that can take on different values across your data set (e.g., height or test scores).
⚫ There are 4 levels of measurement:
⚫ Nominal: the data can only be categorized
⚫ Ordinal: the data can be categorized and ranked
⚫ Interval: the data can be categorized, ranked, and evenly spaced
⚫ Ratio: the data can be categorized, ranked, evenly spaced, and has a natural zero.
⚫ Depending on the level of measurement of the variable, what you can do to analyze your data may be limited. There is a hierarchy in the complexity and precision of the level of measurement, from low (nominal) to high (ratio).
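The four levels can be sketched in R as follows. The variables and values are hypothetical; the point is which summary is meaningful at each level: a mode for nominal data, a median for ordinal data, differences for interval data, and ratios for ratio data.

```r
# Nominal: categories with no order (hypothetical blood groups).
blood_group <- factor(c("A", "O", "B", "O", "AB"))

# Ordinal: categories with a rank order but no fixed spacing.
satisfaction <- factor(c("low", "high", "medium", "high", "low"),
                       levels = c("low", "medium", "high"), ordered = TRUE)

# Interval: evenly spaced, but zero is arbitrary (temperature in Celsius).
temp_celsius <- c(10, 20, 30)

# Ratio: evenly spaced with a natural zero (income).
income <- c(20000, 40000, 80000)

# Nominal data supports counting, so the mode is meaningful.
mode_blood <- names(which.max(table(blood_group)))

# Ordinal data supports ranking, so comparisons and the median are meaningful.
is_ranked <- satisfaction[1] < satisfaction[2]
median_sat <- levels(satisfaction)[median(as.integer(satisfaction))]

# Interval data supports differences.
mean_diff <- mean(diff(temp_celsius))

# Ratio data supports statements such as "four times as much".
income_ratio <- income[3] / income[1]
```

Note that 30 degrees Celsius is not "three times as hot" as 10 degrees Celsius, because the zero point of the Celsius scale is arbitrary; that kind of statement is only valid at the ratio level.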
The Five steps of Data Science
⚫ Data Science is a detailed study of the flow of information from the colossal amounts of data present in an organization’s repository. It involves obtaining meaningful insights from raw and unstructured data which is processed through analytical, programming, and business skills. The five essential steps to perform data science are as follows:
⚫ 1. Asking an interesting question
⚫ 2. Obtaining the data
⚫ 3. Exploring the data
⚫ 4. Modeling the data
⚫ 5. Communicating and visualizing the results
Ask an interesting question
⚫ This is probably my favorite step. As an entrepreneur, I ask myself (and others) interesting questions every day. I would treat this step as you would treat a brainstorming session. Start writing down questions regardless of whether or not you think the data to answer these questions even exists.
Obtain the data
⚫ Once you have selected the question you want to focus on, it is time to scour the world for the data that might be able to answer that question. As mentioned before, the data can come from a variety of sources; so, this step can be very creative!
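Whatever the source, the data usually ends up being read into a data frame. The sketch below is one possible illustration using base R's `read.csv`; the CSV content and column names are hypothetical, and the inline text stands in for a downloaded file or URL.

```r
# Hypothetical CSV content; in practice this would come from a file or URL.
csv_text <- paste(
  "student,hours_studied,score",
  "s01,2,55",
  "s02,4,70",
  "s03,6,88",
  sep = "\n"
)

# Read directly from text.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The same function reads from a file path; a temporary file is used here.
tmp <- tempfile(fileext = ".csv")
write.csv(raw, tmp, row.names = FALSE)
from_file <- read.csv(tmp, stringsAsFactors = FALSE)
```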
Explore the data
⚫ Once this step is completed, the analyst generally has spent several hours learning about the domain, using code or other tools to manipulate and explore the data, and has a very good sense of what the data might be trying to tell them.
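A first pass at exploration in R might look like the sketch below. It uses `mtcars`, a small dataset that ships with R, purely as a stand-in; the particular checks chosen here (dimensions, missing values, a summary, a group mean, and a correlation) are one reasonable starting set, not a fixed recipe.

```r
# mtcars ships with R and serves here only as a stand-in dataset.
data(mtcars)

# How big is the data, and is anything missing?
dims <- dim(mtcars)
missing_per_column <- colSums(is.na(mtcars))

# Distribution of one variable of interest (miles per gallon).
mpg_summary <- summary(mtcars$mpg)

# How does it vary across groups (number of cylinders)?
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Is it related to another numeric variable (weight)?
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)
```

Printing `mpg_summary` and `mpg_by_cyl` at the console, or calling `str(mtcars)`, gives a quick feel for the data before any modeling.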
Model the data
⚫ This step involves the use of statistical and machine learning models. In this step, we are not only fitting and choosing models, but we are also implementing mathematical validation metrics in order to quantify the models and their effectiveness.
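As a minimal sketch of fitting, validating, and choosing between models, the example below fits two linear regressions on the built-in `mtcars` data and compares them by root mean squared error (RMSE) on held-out rows. The dataset, the two formulas, the split size, and the choice of RMSE as the metric are all assumptions made for illustration.

```r
set.seed(42)  # make the random split reproducible
data(mtcars)

# Hold out 8 of the 32 rows for validation.
test_rows <- sample(nrow(mtcars), size = 8)
train <- mtcars[-test_rows, ]
test  <- mtcars[test_rows, ]

# Two candidate models for fuel economy.
fit_small <- lm(mpg ~ wt, data = train)
fit_large <- lm(mpg ~ wt + hp, data = train)

# Validation metric: root mean squared error on unseen data.
rmse <- function(model, newdata) {
  predicted <- predict(model, newdata = newdata)
  sqrt(mean((newdata$mpg - predicted)^2))
}

scores <- c(small = rmse(fit_small, test), large = rmse(fit_large, test))
best_model <- names(scores)[which.min(scores)]
```

With so few rows, the result depends heavily on the random split; repeating the split or using cross-validation would give a steadier comparison.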
Communicate and visualize the results
⚫ This is arguably the most important step. While it might seem obvious and simple, the ability to present your results in a digestible format is much more difficult than it seems. We will look at different examples of cases when results were communicated poorly and when they were displayed very well.
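One way to turn a result into something digestible is a labeled chart plus a one-sentence takeaway. The sketch below does this with base R graphics on the built-in `mtcars` data; the dataset, the chart type, and the wording of the headline are illustrative choices, and the chart is written to a temporary PDF so the code runs without a display.

```r
data(mtcars)

# The result to communicate: average fuel economy per cylinder count.
mpg_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Write the chart to a temporary PDF file.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 6, height = 4)
barplot(mpg_by_cyl,
        main = "Average fuel economy by number of cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon",
        col = "steelblue")
invisible(dev.off())

# A plain-language takeaway to accompany the chart.
groups <- names(mpg_by_cyl)
headline <- sprintf(
  "%s-cylinder cars average %.1f mpg; %s-cylinder cars average %.1f mpg.",
  groups[1], mpg_by_cyl[[1]],
  groups[length(groups)], mpg_by_cyl[[length(groups)]]
)
```

A clear title, labeled axes, and a single-sentence summary often do more for the audience than additional statistical detail.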