Big Data
UNIT II
• To combine the computing and storage capacities of many systems, a new model of programming was to be explored.
• The advent of Local Area Networks and other networking technologies provided a way of combining the computing and storage capacities of the systems on a network.
• The MapReduce model was inspired by the map and reduce operations commonly used in existing programming languages.
• The MapReduce model had a huge impact on Google's ability to handle huge amounts of data in a reasonable time.
• MapReduce was the pioneering attempt at processing big data, and later technologies such as Hadoop still include software utilities that use the MapReduce model.
• MapReduce-based programs work in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
• Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
• The input to each phase is key-value pairs.
• In addition, every programmer needs to specify two functions: a map function and a reduce function.
Understanding MapReduce in Hadoop

What is MapReduce?
● MapReduce is a Hadoop framework used for writing applications that can process vast amounts of data on large clusters.
● It allows data to be stored in a distributed form and simplifies the processing of enormous volumes of data and large-scale computing.
● MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.
● There are two primary tasks in MapReduce: Map and Reduce, and we perform the former before the latter.
● As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
● The first is the map job: the input dataset is split into chunks, and each block of data is read and processed to produce key-value pairs as intermediate outputs. The map tasks process these chunks in parallel.
● The output of a mapper or map job (key-value pairs) is the input to the reducer.
● The reducer receives the key-value pairs from multiple map jobs.
● The reducer then aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output of the framework.
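The two functions that every programmer supplies can be sketched as follows. This is a minimal, framework-free Python illustration of the idea for the word-count case; the names map_func and reduce_func are illustrative assumptions, not Hadoop's actual API (which, in Java, uses Mapper.map() and Reducer.reduce()).

```python
# A minimal, framework-free sketch of the two user-supplied MapReduce functions
# for word count. Names and signatures are illustrative, not Hadoop's API.

from typing import Iterator, List, Tuple

def map_func(record: str) -> Iterator[Tuple[str, int]]:
    """Map: take one input record (a line of text) and emit
    intermediate <word, 1> key-value pairs."""
    for word in record.split():
        yield (word.lower(), 1)

def reduce_func(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: take one intermediate key with all of its values and
    aggregate them into a single <word, total> output pair."""
    return (key, sum(values))

# Tiny demonstration of the two functions in isolation.
pairs = list(map_func("Big Data big data"))   # [('big', 1), ('data', 1), ('big', 1), ('data', 1)]
print(reduce_func("big", [1, 1]))             # ('big', 2)
```

In a real Hadoop job these two functions are all the programmer writes; the framework takes care of splitting the input, running the maps in parallel and grouping the intermediate pairs before the reduce step.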
4. Handling of errors/faults:
• MapReduce engines usually provide a high level of fault tolerance and robustness in handling errors.
• These engines must be robust because errors and faults are likely: there is a high chance of failure among the clustered nodes on which different parts of the program are running.
• Therefore, the engine must have the capability of recognising a fault and rectifying it.
• Moreover, the engine design involves the ability to find out which tasks are incomplete and eventually assign them to different nodes.

5. Scale-out architecture:
• MapReduce engines are built in such a way that they can accommodate more machines as and when required.
• This possibility of introducing more computing resources to the architecture makes the MapReduce programming model more suitable for the higher computational demands of Big Data.
HADOOP VS MAPREDUCE
How it works: Stages of MapReduce
• The data goes through the following phases of the MapReduce algorithm:
• Input splits: The input to a MapReduce program is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map.
• Mapping: This is the very first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce suitable key-value pairs. In our word-count example, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list in the form <word, frequency>, i.e. the key is the word and the frequency is its value.
• Shuffling: This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase output. In our example, the same words are clubbed together along with their respective frequencies.
• Reducing: In this phase, the output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value; in short, it summarises the complete dataset. In our example, this phase calculates the total occurrences of each word.

Working of MapReduce:
• Applications to handle data are designed by software professionals on the basis of algorithms, which are stepwise processes to solve a problem / achieve a goal.
• The MapReduce model also works on an algorithm to execute the above stages.
• This algorithm can be depicted as follows:
1. Take a large dataset or set of records.
2. Perform iteration over the data.
3. Extract some interesting patterns to prepare an output list by using the map function.
4. Arrange/sort the output list properly to enable optimisation for further processing.
5. Compute the set of results by using the reduce function.
6. Provide the final output.
• The working of the MapReduce approach is shown in the figure below; a runnable sketch of the word-count example follows as well.
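To make the four stages concrete, here is a small, self-contained Python simulation of input splits, mapping, shuffling and reducing for the word-count example. It is only a single-machine sketch of the model, not Hadoop itself; the sample text and the split size of three words are made-up illustrative values.

```python
# A self-contained simulation of the four MapReduce stages described above
# (input splits, mapping, shuffling, reducing) for word count.

from collections import defaultdict

text = "deer bear river car car river deer car bear"

# 1. Input splits: divide the input into fixed-size chunks (here, 3 words each).
words = text.split()
splits = [words[i:i + 3] for i in range(0, len(words), 3)]

# 2. Mapping: each split independently emits <word, 1> pairs.
mapped = [(word, 1) for split in splits for word in split]

# 3. Shuffling: group all values belonging to the same key together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# 4. Reducing: aggregate the grouped values into one <word, total> pair each.
result = {word: sum(counts) for word, counts in grouped.items()}

print(result)   # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In a real cluster, stage 2 runs in parallel on the nodes holding each split, and stage 3 moves the intermediate pairs across the network so that each reducer sees all values for its keys.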
Case Examples of MapReduce
MapReduce is used to process various types of data obtained from various sectors. Some of the fields that benefit from the use of MapReduce are:

1. Web page visits: Suppose a researcher wants to know the number of times the website of a particular newspaper was accessed. The map task would be to read the logs of the web page requests and make a complete list. The map output may look similar to the following:
<emailURL, 1>
<newspaperURL, 1>
<socialmediaURL, 1>
<sportsnewsURL, 1>
<newspaperURL, 1>
<emailURL, 1>
<newspaperURL, 1>
The reduce function would find the results for the newspaper URL and add them. The output of the preceding step is:
<newspaperURL, 3>
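A minimal Python sketch of this case, using the illustrative URL values from the list above (collections.Counter stands in for the shuffle-and-reduce step):

```python
# The "map" step turns each log entry into a <URL, 1> pair and the
# "reduce" step adds the pairs up per URL. The log entries are the
# illustrative values from the slide, not real data.

from collections import Counter

access_log = [
    "emailURL", "newspaperURL", "socialmediaURL",
    "sportsnewsURL", "newspaperURL", "emailURL", "newspaperURL",
]

mapped = [(url, 1) for url in access_log]        # map output: <URL, 1>
totals = Counter(url for url, _ in mapped)       # shuffle + reduce in one step

print(totals["newspaperURL"])                    # 3, matching <newspaperURL, 3>
```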
2. Word frequency: A researcher wishes to read articles about floods, but he does not want those articles in which flood is discussed only as a minor topic. Therefore he decides that an article basically dealing with earthquakes and floods should have the term 'tectonic plate' in it more than 10 times.
● The map function will count the number of times the specified term occurred in each document and provide the result as <document, frequency>.
● The reduce function will then select only the results that have a frequency of more than 10.

3. Word count: Suppose the researcher wishes to determine the number of times celebrities talk about the present bestseller book. The data to be analysed comprises the written blogs, posts and tweets of the celebrities.
The map function will make a list of all words. This list will be in the form of the following key-value pairs (where the key is the word and the value is 1 for every appearance of the word).
The output of the map function:
<global warming, 1>
<food, 1>
<global warming, 1>
<bestseller, 1>
<afghanistan, 1>
<bestseller, 1>
The preceding output will be converted into the following form by the reduce function:
<global warming, 2>
<food, 1>
<bestseller, 2>
<afghanistan, 1>
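A hedged Python sketch of the word-frequency case (example 2 above). The document names, document contents and helper names are illustrative assumptions; the point is the <document, frequency> map output followed by a reduce step that keeps only documents above the threshold.

```python
# Map: emit <document, frequency> for the chosen term in each document.
# Reduce: keep only documents where the frequency exceeds the threshold.

SEARCH_TERM = "tectonic plate"
THRESHOLD = 10

documents = {
    "article_1.txt": "tectonic plate " * 12 + "flood damage report",
    "article_2.txt": "flood in the valley, tectonic plate mentioned once",
}

# Map phase: one <document, frequency> pair per document.
mapped = [(doc, text.lower().count(SEARCH_TERM)) for doc, text in documents.items()]

# Reduce phase: select only documents whose frequency is greater than 10.
selected = [(doc, freq) for doc, freq in mapped if freq > THRESHOLD]

print(selected)   # [('article_1.txt', 12)]
```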
Parts of Big Data Architecture / Big Data Stack:
● As it deals with huge volumes of a variety of data, Big Data analysis requires the use of the best technologies at every stage, be it collecting data, cleaning it, sorting and organising it, integrating it or analysing it.
● Thus, the technologies associated with Big Data analysis are a bit complex in nature, and so to understand them we create a model template / architecture, commonly known as the Big Data Architecture, before designing the systems.
● The configuration of this model varies depending on the specific needs of the organisation.
● However, the basic layers and components remain more or less the same.
● The model should give a complete view of all the required elements.
● Although initially creating a model, or even viewing it, may seem time-consuming, it can save a significant amount of time, effort and rework during subsequent stages of implementation.
● Availability: The infrastructure setup must be available at all times to ensure a nearly 100% uptime guarantee / service. It is obvious that businesses cannot wait in case of a service interruption / failure; therefore an alternative to the main system must also be maintained.
● Scalability: The Big Data systems must be scalable enough to accommodate varying storage and computing requirements.
● Flexibility: A flexible infrastructure facilitates adding more resources to the setup and promotes failure recovery. It should be noted that flexible infrastructure is also costly; however, costs can be controlled with the use of cloud services, where you pay only for what you actually use.
● Cost: You must select the infrastructure that you can afford. This includes all the hardware, storage and networking requirements.
● While creating this model, we must take into consideration all the hardware, infrastructure software, operational software, management software, Application Programming Interfaces (APIs) and software development tools.
● In short, we can say that the architecture of the Big Data environment must fulfil all the principles of Big Data implementation described above and be able to perform the following functions:
✔ Capture data from different sources.
✔ Clean and integrate data of different types and formats.
✔ Sort and organise data.
✔ Analyse data.
✔ Identify relationships / patterns in data.
✔ Derive conclusions.
Big Data Architecture / Big Data Stack: Layers of the Big Data Handling Technologies Architecture
● The above figure shows a sample illustration of the Big Data Architecture, comprising the following layers and components:
1. Data Sources layer
2. Ingestion layer
3. Storage layer
6. Security layer
7. Monitoring layer
8. Analytics layer
9. Visualisation layer
1. Data Sources layer:
● Organisations generate huge amounts of data on a daily basis.
● The basic function of the Data Sources layer is to absorb and integrate the data coming from various sources, at varying velocities and in different formats.
● Before this data is considered for the Big Data Stack, we have to differentiate between the noise and the relevant information.
● Example: take the telecom industry and identify its sources of data.
● This layer uses HDFS, which lies on top of the Hadoop infrastructure layer.

Security layer:
● Big Data projects are full of security issues because of the use of a distributed architecture, a simple programming model and an open framework of services.
● Therefore, the following security checks must be considered while designing a Big Data Stack:
1. It must authenticate nodes by the use of protocols.
3. It must subscribe to a key management service for trusted keys and certificates.
4. It must maintain logs of the communication that occurs between nodes and trace any anomalies across layers.
5. It must ensure safe communication between nodes by using Secure Sockets Layer (SSL).
Monitoring layer:
● This layer consists of a number of monitoring systems.
● These systems remain automatically aware of all the configurations and functions of different operating systems and hardware.
● They provide machine/node communication through a high-level protocol such as XML (Extensible Markup Language).
● Some examples of tools for monitoring Big Data stacks are Ganglia and Nagios.

Analytics engine:
● The role of an analytics engine is to analyse huge amounts of unstructured data. This type of analysis is related to text analytics and statistical analytics.
● Some examples of the different types of unstructured data that are available as large datasets include the following:
❖ Documents containing textual patterns.
❖ Text and symbols generated by customers or users on social media forums, such as Yammer, Twitter and Facebook.
❖ Machine-generated data, such as Radio Frequency Identification (RFID) feeds and weather data.
Virtualisation & Big Data
● A virtualised environment (virtual machine) can execute or perform the same functions as a physical machine.
● In the Big Data environment, you can virtualise almost every element, such as servers, storage, applications, data, networks, processors, etc.

Types/Approaches of Virtualisation:

Server virtualisation:
● Servers are the lifeblood of any network; they provide the shared resources that the network users need.

Network virtualisation:
● A local area network, or LAN, is a kind of wired network that can usually only reach within the domain of a single building.
● A wide area network, or WAN, is another kind of wired network, but the computers and devices connected to the network can stretch over half a mile in some cases.
● Conversely, a virtual network does not follow the conventional rules of networking, because it is not wired at all; instead, specialised internet technology is used to access it. The centres that are connected are linked through software and wireless technology, which allows the reach of the network to be expanded as far as it needs to be for peak efficiency.
● While implementing network virtualisation, you do not need to rely on the physical network for managing traffic between connections.
● You can create as many virtual networks as you need from a single physical implementation.
● In the Big Data environment, network virtualisation helps in defining different networks with different sets of performance and capacities to manage the large distributed data required for Big Data analysis.
Virtualisation (contd.):
● Processor virtualisation optimises the power of the processor and maximises its performance.
● Memory virtualisation separates memory from the servers.
● Big Data analysis needs systems with high processing power (CPU) and memory (RAM) for performing complex computations; these computations can take a lot of time if CPU and memory resources are not sufficient.
● Data virtualisation is used to create a platform that can provide dynamic linked data services.
● On the other hand, storage virtualisation combines physical storage resources so that they can be shared in a more effective way.
● Relational database systems use a model that organises data into tables of rows (also called records or tuples) and columns (also called attributes or fields).
● The columns for a transaction table might be Transaction Date, Customer ID, Transaction Amount, Payment Method, etc.
● These tables can be linked or related using keys. Each row in a table is identified by a unique key, called a primary key.
● This primary key can be added to another table, where it becomes a foreign key. The primary/foreign key relationship forms the basis of the way relational databases work.
● Returning to our example, if we have a table representing product orders, one of the columns might contain customer information. Here, we can import a primary key that links to the row with the information for a specific customer.
● The tables can then be related based on the common Customer ID field. You can, therefore, query the tables to produce valuable reports, such as a consolidated customer statement. A minimal sketch of this relationship is shown below.
● An RDBMS consists of several tables, and the relationships between those tables help in classifying the information contained in them.
● Each table in an RDBMS has a pre-set schema. These schemas are linked using the values in specific columns of each table (primary key / foreign key).
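As an illustration of the primary/foreign key relationship and the consolidated-statement query described above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (customers, transactions, customer_id, and so on) are illustrative assumptions, not part of the original example.

```python
# Primary/foreign key sketch with sqlite3 (standard library).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Each row in 'customers' is identified by a primary key.
conn.execute("""CREATE TABLE customers (
                    customer_id INTEGER PRIMARY KEY,
                    name        TEXT NOT NULL)""")

# The same key appears in 'transactions' as a foreign key.
conn.execute("""CREATE TABLE transactions (
                    txn_date    TEXT,
                    amount      REAL,
                    customer_id INTEGER REFERENCES customers(customer_id))""")

conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)",
                 [("2024-01-05", 120.0, 1), ("2024-02-11", 80.5, 1)])

# Relating the tables on the common customer_id field yields a
# consolidated customer statement.
rows = conn.execute("""SELECT c.name, SUM(t.amount)
                       FROM customers c JOIN transactions t
                         ON c.customer_id = t.customer_id
                       GROUP BY c.customer_id""").fetchall()
print(rows)   # [('Alice', 200.5)]
```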
● The data to be stored or transacted in an RDBMS needs to adhere to the ACID standards.

ACID:
● ACID is a concept that refers to the four properties of a transaction in a database system: Atomicity, Consistency, Isolation and Durability.
● These properties ensure the accuracy and integrity of the data in the database, ensuring that the data does not become corrupt as a result of some failure and guaranteeing the validity of the data even when errors or failures occur.
● Atomicity: Ensures full completion of a database operation. A transaction must be an atomic unit of work, which means that either all of the modified data are persisted or none of them are. The transaction should either execute completely or fail completely: if one part of the transaction fails, the whole transaction fails. This provides reliability, because if there is a failure in the middle of a transaction, none of the changes in that transaction will be committed (see the sketch below).
● Consistency: Ensures that data abides by the schema (table) standards, such as correct data type entry, constraints and keys.
● Isolation: Refers to the encapsulation of information, i.e. makes only the necessary information visible.
● Durability: Ensures that transactions stay valid even after a power failure or errors.
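A small sketch of the atomicity property, again using sqlite3. The accounts table and the simulated failure are illustrative assumptions; the point is that either every statement in the transaction is committed, or none of them is.

```python
# Atomicity sketch: a failure in the middle of a transaction rolls back
# every change made inside it.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100.0), ("B", 50.0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'A'")
        # Simulated failure before B is credited: the debit above is undone.
        raise RuntimeError("power failure")
except RuntimeError:
    pass

# None of the transaction's changes were committed.
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('A', 100.0), ('B', 50.0)]
```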
RDBMS and Big Data:
● Like other databases, the main purpose of an RDBMS is to provide a solution for storing and retrieving information in a more convenient and efficient manner.
● The most common way of fetching data from these tables is by using Structured Query Language (SQL).
● As you know, data is stored in tables in the form of rows and columns; the size of the file increases as new data / records are added, resulting in an increase in the size of the database.
● One of the biggest difficulties with RDBMSs is that they are not yet near the demand levels of Big Data, and the volume of data handled today is rising at a faster rate. For example: Facebook stores 1.5 petabytes of photos, Google processes 20 PB each day, and every minute over 168 million emails are sent and received and 11 million searches are made on Google.
● Big Data solutions are designed for storing and managing enormous amounts of data using a simple file structure and format and a highly distributed storage mechanism.
● Big Data primarily comprises semi-structured data, such as social media sentiment analysis and text mining data, while RDBMSs are more suitable for structured data such as weblogs and financial data.

Differences between RDBMS and Big Data (Hadoop) systems:
• RDBMS: Traditional row-column based databases, basically used for data storage, manipulation and retrieval. | Big Data Hadoop: Open-source software used for storing data and running applications or processes concurrently.
• RDBMS: Mostly structured data is processed. | Big Data Hadoop: Both structured and unstructured data are processed.
• RDBMS: Less scalable than Hadoop. | Big Data Hadoop: Highly scalable.
• RDBMS: The data schema is static. | Big Data Hadoop: The data schema is dynamic.
• RDBMS: Cost is applicable, as the software is licensed. | Big Data Hadoop: Free of cost, as it is open-source software.
RDBMS and Big Data link:
● Big Data solutions provide a way to avoid storage limitations and reduce the cost of processing and storage for immense data volumes.
● Nowadays, systems based on RDBMS are also able to store huge amounts of data using advanced technology and developed software and hardware. Example: the Analytics Platform System (APS) from Microsoft.
● In fact, relational database systems and Big Data batch processing solutions are seen as complementary mechanisms rather than competitive mechanisms.
● Batch processing solutions of Big Data are very unlikely ever to replace RDBMSs.
● In most cases, they balance and enhance capabilities for managing data and generating business intelligence.
● The results / output of Big Data systems can still be stored in an RDBMS, as shown in the next diagram.

Conclusion:
● In a data-tsunami kind of environment, where data inflow is beyond the usual conventions and rationales, Big Data systems act as a dam to contain the water (here, data) and then utilise the RDBMS cleverly to make channels that distribute the water specifically to hydroelectric stations, irrigation canals and other places where it is most required.
● Thus, Big Data systems happen to be non-relational when it comes to storing and handling incoming data, and then abide by conventional RDBMS mechanisms to disseminate the results in meaningful formats.
CAP Theorem:
● The CAP Theorem is also called Brewer's Theorem.
● It states that any distributed data store can only provide two of the following three guarantees:
❑ Consistency: the same data is visible by all the nodes.
❑ Availability: every request is answered, whether it succeeds or fails.
❑ Partition tolerance: despite network failures, the system continues to operate.
● The CAP Theorem is useful for decision making in the design of database servers / systems.

CAP Theorem: how to understand it?
● In the theorem, partition tolerance is a must. The assumption is that the system operates on a distributed data store, so the system, by nature, operates with network partitions.
● Network failures will happen, so to offer any kind of reliable service, partition tolerance is necessary: the P of CAP.
● Consistency in CAP is different from consistency in ACID. Consistency in CAP means having the most up-to-date information.

Technical background of a query:
● The moment in question is the user query. We assume that a user makes a query to a database, and the networked database is to return a value.
● That leaves a decision between the other two, C and A. When a network failure happens, one can choose to guarantee consistency or availability:
❖ High consistency comes at the cost of lower availability.
● Thus, we sacrifice availability to ensure that the data returned by the query is consistent.
(Figure: example with Alice from London and Ramesh from Hyderabad.)
Peer-to-peer networks:
● Peers make a portion of their resources, such as processing power, disk storage or network bandwidth, directly available to other network participants, without the need for central coordination by servers or stable hosts.
● Peers are both suppliers and consumers of resources, in contrast to the traditional client-server model, in which the consumption and supply of resources are divided.
● A drawback: data recovery or backup is very difficult; each computer should have its own backup system.
Polyglot persistence:
● A lot of corporations still use relational databases for some data, but the increasing persistence requirements of dynamic applications are evolving from predominantly relational to a mixture of data sources.
● Polyglot applications are the ones that make use of several core database technologies.
● Such databases are often used to solve a complex problem by breaking it into fragments and applying a different database model to each.
● Then the results of the different sets are aggregated into a data storage and analysis solution. It means picking the right non-relational database for the right application.
● For example, Disney, in addition to an RDBMS, also uses Cassandra and MongoDB; Netflix uses Cassandra, HBase and SimpleDB.
Summarise: Data Warehouse
• A group of methods and software.
• Incorporated or used in big organisations; provides a dashboard-based interface.
• Data is collected from functional systems that are heterogeneous.
• Data sources and types differ.
• Data is synchronised into a centralised database.
• Analytical visualisation can be done.
• Increases performance because of optimised storage.
• Also enhances analytical abilities.

Big Data Handling Technology / Solution:
● Big Data technology is a medium to store and operate on huge amounts of heterogeneous data, holding the data in low-cost storage devices.
● It is designed for keeping data in a raw or unstructured format while processing is in progress.
● It is preferred because there is a lot of data that would otherwise have to be handled manually and relationally.
● If this data is put to use, it can provide much valuable information, leading to superior decision making.

Illustration through a case:
● Consider the case of ABC company.
● It has to analyse the data of 100,000 employees across the world.
● Assessing the performance of each employee manually is a huge task for the administrative department before rewarding bonuses and increasing salaries based on each employee's awards list / contribution to the company.
● The company sets up a data warehouse in which information related to each employee is stored, and which provides useful reports and results.
● Can you have a Big Data solution and no data warehouse, or vice versa? Yes.
● Can you have both? Yes.
● Thus there is hardly any correlation between a Big Data technology and a data warehouse.
● It is therefore a misunderstood conviction that once a Big Data solution is implemented, the existing relational data warehouse becomes redundant and is not required anymore.
● Organisations that use data warehousing technology will continue to do so, and those that use both are future-proof against further technological advancements.
● Big Data systems are normally used to understand strategic issues, for example inventory maintenance or target-based individual performance reports.
● Data warehousing is used for reports and visualisations for management purposes at pan-company level.
● Data warehousing is a proven concept and thus will continue to provide crucial database support to many enterprises.
Integrating Big Data in Traditional Data Warehouses:
● Organisations are beginning to realise that they have an inevitable business requirement of combining traditional data warehouses (based on structured formats) with less structured Big Data systems.
● The main challenges confronting the physical architecture of the integration between the two include data availability, loading, storage, performance, data volume, scalability, varying query demands against the data, and the operational costs of maintaining the environment.
● To cope with the issues that might hamper the overall implementation and integration process, the following are the associated issues and challenges.

1. Data Availability:
● Data availability is a well-known challenge for any system related to transforming and processing data for use by end users, and Big Data is no different.
● Hadoop is beneficial in mitigating this risk and makes data available for analysis immediately upon acquisition.
● The challenge here, however, is to sort and load data that is unstructured and in varied formats.
● Also, context-sensitive data involving several different domains may require another level of availability check.

2. Pattern Study:
● Pattern study is nothing but the centralisation and localisation of data according to demand, especially in the case of big documents, images or videos.
● Sqoop, Flume, etc. come in handy in this scenario.
● For example: in Amazon, results are combined based on end-user location (i.e. destination pin code), so as to return only meaningful contextual knowledge rather than imparting the entire data to the user.
● Trending topics in news channels / e-papers are also an example of pattern study (keywords, the popularity of links as per the hits they receive, etc. are conjoined to know the pattern).

3. Data Incorporation and Integration:
● The data incorporation process for Big Data systems becomes a bit complex when file formats are heterogeneous.
● Continuous data processing on a platform can create a conflict for resources over a given period of time, often leading to deadlocks.

4. Data Volumes and Exploration:
● Data exploration and mining is an activity associated with Big Data systems, and it yields large datasets as processing output.
● These datasets need to be preserved in the system through occasional optimisation of intermediary datasets; negligence in this respect can be the reason for a potential performance drain over a period of time.
● Traffic spikes and volatile surges in data volumes can easily dislocate the functional systems of the firm. Throughout the data cycle (Acquisition → Transformation → Processing → Results), we need to take care of this.
5. Compliance and Localised Legal Requirements:
● Various compliance standards, such as Safe Harbor and PCI regulations, can have some impact on data security and storage.
● For example, transactional data may need to be stored online as required by courts of law. Thus, to meet such requirements, Big Data infrastructure can be used.
● Large volumes of data must be carefully handled to ensure that all standards relevant to the data are complied with and security measures are carried out.

6. Storage Performance:
● Processors, memories, core disks, etc. are the traditional methods of storage, and they have proven to be beneficial and successful in the working of organisations.
● Distributed storage is a new storage technology that competes against the above.
● The exchange of data and the persistence of data across different storage layers need to be taken care of while handling Big Data projects.

Changing Deployment Models in the Big Data Era:
● Data management deployment models have been shifting altogether to different levels ever since the inception of Big Data systems alongside the data warehouse.
● The following are the necessities to be taken care of while handling Big Data systems with a data warehouse:
1. Scalability and speed: the platform should support parallel processing, optimised storage and dynamic query optimisation.
2. Agility and elasticity: agile means that the platform should be flexible and respond rapidly in case of changing trends; elasticity means that the platform models can be expanded and contracted as per the demands of the user.
3. Affordability and manageability: one must solve issues such as flexible pricing, licensed software, customisation and cloud-based techniques for managing and controlling data.
4. Appliance model / commodity hardware: create clusters.
Conceptualizing Data Analysis as a Process
● The “Problem” with Data Analysis
● Data Analysis as a Linear Process
● Data Analysis as a Cycle
● Above all, an effective data analysis process is functional, i.e., it is useful and adds value to organizational services and individual practices.
● Therefore, a preliminary step in the data analysis process is to select and train a team to carry out the process.
● More specifically, these standards are the basis for the first step in the data analysis process: forming one or more specific questions to be examined.
● Some questions are close-ended and therefore relatively straightforward, e.g., "Did our program meet the 10% mandate for serving children with disabilities last year?"
● Other questions are highly open-ended, such as: "How could we do a better job of parent involvement?"
● In the first case, there are only two possible answers to the question: "Yes" or "No." In the second case, a possible answer to the question could include many relevant pieces of information.
● Finally, by formulating specific questions at the beginning of the process, programs are also in a position to develop skills in evaluating their data analysis process in the future.
● We urge programs to develop a specific planning process for data collection (no matter how brief) in order to avoid the common pitfalls of the collection process, which include having:
• Too little data to answer the question;
• More data than is necessary to answer the question; and/or
• Data that is not relevant to answering the question.
● In order to successfully manage the data collection process, programs need a plan that addresses the following:
✔ What types of data are most appropriate to answer the questions?
✔ How much data is necessary?
✔ Who will do the collection?
✔ When and where will the data be collected?
✔ How will the data be compiled and later stored?
● By creating a data collection plan, programs can proceed to the next step of the overall process.

Process Component #4. Data Analysis: What Are Our Results?
● Once data have been collected, the next step is to look at the data and identify what is going on; in other words, to analyze the data. Here, we refer to "data analysis" in a narrower sense: as a set of procedures or methods that can be applied to data that has been collected in order to obtain one or more sets of results.
● Because there are different types of data, the analysis of data can proceed on different levels.
● The wording of the questions, in combination with the actual data collected, has an influence on which procedure(s) can be used, and to what effect.
● In addition, once a particular round of data analysis is completed, a program can
then step back and reflect upon the contents of the data collection plan and identify
“lessons learned” to inform the next round.
● Once data have been analyzed and an interpretation has been developed, programs face the next tasks of deciding how to write, report, and/or disseminate the findings.
● First, good writing is structured to provide information in a logical sequence. In turn, good writers are strategic: they use a variety of strategies to structure their writing.
● One strategy is to have the purpose of the written work clearly and explicitly laid out. This helps to frame the presentation and development of the structure of the writing.
● Second, good writing takes its audience into account. Therefore, good writers often specify who their audience is in order to shape their writing.
● A final thought is to look upon the writing/reporting tasks as opportunities to tell the story of the data you have collected, analyzed, and interpreted.

Evaluating the data analysis process:
● Purpose: Was the data analysis process consistent with federal standards and other relevant regulations?
● Questions: Were the questions worded in a way that was consistent with federal standards, other regulations, and organizational purposes? Were the questions effective in guiding the collection and analysis of data?
● Data Collection: How well did the data collection plan work? Was there enough time allotted to obtain the necessary information? Were data sources used that were not effective? Do additional data sources exist that were not utilized? Did the team collect too little data or too much?
● Data Analysis Procedures or Methods: Which procedures or methods were chosen? Did these conform to the purposes and questions? Were there additional procedures or methods that could be used in the future?
● Interpretation/Identification of Findings: How well did the interpretation process work? What information was used to provide a context for the interpretation of the results? Was additional relevant information not utilized for interpretation? Did team members disagree over the interpretation of the data, or was there consensus?
● Writing, Reporting, and Dissemination: How well did the writing tell the story of the data? Did the intended audience find the presentation of information effective?
Thank You