IBM InfoSphere BigInsights and Streams
• BigInsights is an analytics platform that enables companies to turn complex, Internet-scale information sets into insights.
• It consists of a packaged Apache Hadoop distribution with a greatly simplified installation process, and associated tools for application development, data movement, and cluster management.
• Other open source technologies in BigInsights are:
  ◦ Pig
    ▪ A platform that provides a high-level language for expressing programs that analyze large datasets.
    ▪ Pig has a compiler that translates Pig programs into sequences of MapReduce jobs that the Hadoop framework executes (see the sketch following this list).
  ◦ Hive
    ▪ A data-warehousing solution built on top of the Hadoop environment.
    ▪ It brings familiar relational-database concepts, such as tables, columns, and partitions, and a subset of SQL (HiveQL), to the unstructured world of Hadoop.
    ▪ Hive queries are compiled into MapReduce jobs executed using Hadoop.
  ◦ Jaql
    ▪ An IBM-developed query language designed for JavaScript Object Notation (JSON) that provides a SQL-like interface.
  ◦ HBase
    ▪ A column-oriented NoSQL data-storage environment designed to support large, sparsely populated tables in Hadoop.
  ◦ Flume
    ▪ A distributed, reliable, and available service for efficiently moving large amounts of data as it is produced.
    ▪ Flume is well suited to gathering logs from multiple systems and inserting them into the Hadoop Distributed File System (HDFS) as they are generated.
  ◦ Avro
    ▪ A data-serialization technology that uses JSON for defining data types and protocols, and serializes data in a compact binary format.
  ◦ Lucene
    ▪ A search-engine library that provides high-performance, full-featured text search.
  ◦ ZooKeeper
    ▪ A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
  ◦ Oozie
    ▪ A workflow scheduler system for managing and orchestrating the execution of Apache Hadoop jobs.
• In addition, the BigInsights distribution includes the following IBM-specific technologies:
  ◦ BigSheets
    ▪ A browser-based, spreadsheet-like interface that enables business users to gather and analyze data easily.
    ▪ Users can work with several common data formats, such as CSV and TSV (tab-separated values).
  ◦ Text analytics
    ▪ A pre-built library of text annotators.
  ◦ Adaptive MapReduce
    ▪ An IBM Research solution for speeding up the execution of small MapReduce jobs by changing how MapReduce tasks are handled.
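To make the relationship between high-level languages such as Pig and Hive and the underlying MapReduce model concrete, the sketch below simulates the map, shuffle, and reduce phases of a word-count job in plain Python. The function names and the in-memory shuffle are illustrative assumptions, not BigInsights or Hadoop APIs; in a real cluster the job would be compiled by Pig or Hive and executed by the Hadoop framework over HDFS blocks.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle/sort: group intermediate values by key (Hadoop does this between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

if __name__ == "__main__":
    sample = [
        "Hadoop stores data in HDFS",
        "Pig and Hive compile queries into MapReduce jobs",
    ]
    print(reduce_phase(shuffle(map_phase(sample))))
```

The same three-phase structure is what a HiveQL `GROUP BY` or a Pig `GROUP ... FOREACH` statement is translated into before Hadoop runs it.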
Stream Computing
• Stream computing is a new paradigm necessitated by new data-generating scenarios, such as the ubiquity of mobile devices, location services, and pervasive sensors.
• With static data computation, questions are asked of data at rest.
• With streaming data computation, continuously arriving data is evaluated against static (standing) questions, as the sketch below illustrates.
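A minimal Python sketch of the streaming model: a standing (static) query, here a simple threshold alert, is applied to each tuple as it arrives rather than to a stored data set. The sensor readings and the threshold value are illustrative assumptions, not part of any IBM product API.

```python
import random

def sensor_stream(n=10):
    """Simulated unbounded data source: readings arrive one at a time."""
    for i in range(n):
        yield {"sensor": "s1", "reading": random.uniform(15.0, 35.0), "seq": i}

THRESHOLD = 30.0  # the 'static question' evaluated against every tuple in motion

def standing_query(stream):
    """Continuously evaluate each tuple as it arrives; never wait for a complete data set."""
    for tup in stream:
        if tup["reading"] > THRESHOLD:
            print("alert:", tup)

if __name__ == "__main__":
    standing_query(sensor_stream())
```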
The InfoSphere platform
• InfoSphere is a comprehensive information-integration platform that includes data warehousing and analytics, information integration, master data management, life-cycle management, and data security and privacy.
• The InfoSphere Streams platform:
  ◦ supports real-time processing of streaming data,
  ◦ enables the results of continuous queries to be updated over time, and
  ◦ can detect insights within data streams that are still in motion.
• The main design goals of InfoSphere Streams are to:
  ◦ Respond quickly to events and changing business conditions and requirements.
  ◦ Support continuous analysis of data at rates that are orders of magnitude greater than existing systems.
  ◦ Adapt rapidly to changing data forms and types.
  ◦ Manage high availability, heterogeneity, and distribution for the new stream paradigm.
  ◦ Provide security and information confidentiality for shared information.
• InfoSphere Streams:
  ◦ Provides a programming model and IDE for defining data sources.
  ◦ Provides software analytic modules, called operators, that are fused into processing execution units.
  ◦ Provides infrastructure to support the composition of scalable stream-processing applications from these components.
• The main platform components are:
  ◦ Runtime environment: includes platform services and a scheduler for deploying and monitoring Streams applications across a single host or a set of integrated hosts.
  ◦ Programming model: Streams applications are written in the Streams Processing Language (SPL), a declarative language. In this model, an application is represented as a graph that consists of operators and the streams that connect them (a sketch of this composition model follows this list).
  ◦ Monitoring tools and administrative interfaces: Streams applications process data at speeds much higher than the normal collection of operating-system monitoring utilities can efficiently handle, so InfoSphere Streams provides tools suited to this environment.
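The composition model, operators connected by streams into a flow graph, can be imitated in a few lines of Python. The generator functions below stand in for source, functor-like, and sink operators; the names and the chaining are hypothetical illustrations of the graph idea, not the SPL runtime or its operator library.

```python
def source():
    """Source operator: produce a stream of tuples."""
    for i in range(20):
        yield {"id": i, "value": i * 3 % 7}

def filter_op(stream, predicate):
    """Functor-like operator: forward only the tuples that match the predicate."""
    for tup in stream:
        if predicate(tup):
            yield tup

def sink(stream):
    """Sink operator: consume the stream, e.g. write results out."""
    for tup in stream:
        print("result:", tup)

if __name__ == "__main__":
    # Compose the flow graph: source -> filter -> sink
    sink(filter_op(source(), lambda t: t["value"] > 3))
```

In a deployed Streams application, each of these stages would be an operator that the runtime can fuse into processing execution units and distribute across hosts.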
Streams Processing Language
• SPL, the programming language for InfoSphere Streams, is a distributed data-flow composition language.
• It is an extensible and full-featured language, like C++ or Java.
• The basic building blocks of SPL programs are:
  ◦ Stream: an infinite sequence of structured tuples. It can be consumed by operators on a tuple-by-tuple basis or through the definition of a window.
  ◦ Tuple: a structured list of attributes and their types. Each tuple on a stream has the form dictated by its stream type.
  ◦ Stream type: specifies the name and data type of each attribute in the tuple.
  ◦ Window: a finite, sequential group of tuples. It can be based on count, time, attribute value, or punctuation marks (a windowing sketch follows this list).
  ◦ Operator: the fundamental building block of SPL; operators process data from streams and can produce new streams.
  ◦ Processing element (PE): the fundamental execution unit. A PE can encapsulate a single operator or many fused operators.
  ◦ Job: a Streams application deployed for execution. It consists of one or more PEs.
• In addition to a set of PEs, the SPL compiler generates an Application Description Language (ADL) file that describes the structure of the application. The ADL file includes details about each PE, such as which binary file to load and execute, scheduling restrictions, stream formats, and an internal operator data-flow graph.
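Windows are what turn an infinite stream into finite groups of tuples that an aggregating operator can work on. The sketch below implements a count-based tumbling window in Python and averages each window; the window size and attribute names are illustrative assumptions, and in SPL the same idea would be expressed with a window clause on an aggregating operator rather than hand-written code.

```python
def readings():
    """Unbounded-looking stream of numeric tuples (finite here so the example terminates)."""
    for i in range(17):
        yield {"seq": i, "value": float(i)}

def tumbling_window(stream, size=5):
    """Count-based tumbling window: emit full groups of `size` tuples, then start over."""
    window = []
    for tup in stream:
        window.append(tup)
        if len(window) == size:
            yield window
            window = []  # tumble: begin a fresh, empty window

def average_op(windows):
    """Aggregating operator: compute the mean value per window."""
    for window in windows:
        values = [t["value"] for t in window]
        yield {"count": len(values), "avg": sum(values) / len(values)}

if __name__ == "__main__":
    for result in average_op(tumbling_window(readings())):
        print(result)
```

A count-based window like this is one of the four window kinds listed above; time-, attribute-, and punctuation-based windows differ only in the condition that closes the group.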