Blog Series: IT Perspectives by Jaqui Lynch

In my last two blogs, “What’s Wrong with the Internet of Things?” and “What To Do With All That Data…”, I talked about Big Data, analytics and the Internet of Things (IoT) and how they all work together. As we learned, there are two main types of data that are of interest in a Big Data environment: data at rest and data in motion (or streams). Data at rest is addressed by IBM BigInsights and data in motion is addressed by InfoSphere Streams.

IBM describes BigInsights as follows:

“BigInsights is a software platform for discovering, analyzing, and visualizing data from disparate sources. You use this software to help process and analyze the volume, variety, and velocity of data that continually enters your organization every day.”

There are three product editions for BigInsights from IBM:

  • BigInsights for Apache Hadoop
    • Industry standard Hadoop offering along with BigInsights
  • BigInsights on Cloud
    • Provides Hadoop as a service on IBM’s SoftLayer global cloud infrastructure
  • Open Platform with Apache Hadoop
    • Provides the Hadoop platform for big data projects

This article focuses on BigInsights for Apache Hadoop and data at rest.

Data at rest

Data at rest refers to the huge volumes of data that are collected over time from multiple sources. The data can be structured or semi-structured and may or may not have schemas. The key is that the volume of data is too large for traditional tools and that the data needs to be processed in place. Hadoop is now the key “front-end” of many data warehouse solutions because it is well suited to data archiving and can perform analytics on the archived data. This means that only the specific result sets (and not the full, large set of raw data) need to be moved to the data warehouse for further analysis. Additionally, Hadoop can collect data from multiple, disparate systems and can easily handle data in many different forms.
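The idea of analyzing raw data where it lives and moving only the result set can be illustrated with a small sketch. This is plain Python rather than anything Hadoop specific, and the record layout, the `summarize_by_store` function and the sample values are all hypothetical, chosen only to show how a handful of summary rows can stand in for a much larger set of raw records.

```python
from collections import defaultdict

# Hypothetical raw records, standing in for data archived in Hadoop.
RAW_EVENTS = [
    {"store": "north", "amount": 12.50},
    {"store": "south", "amount": 7.25},
    {"store": "north", "amount": 30.00},
    {"store": "south", "amount": 2.75},
    {"store": "east", "amount": 19.00},
]

def summarize_by_store(events):
    """Reduce raw events to one summary row per store."""
    totals = defaultdict(lambda: {"orders": 0, "revenue": 0.0})
    for event in events:
        row = totals[event["store"]]
        row["orders"] += 1
        row["revenue"] += event["amount"]
    # Only these summary rows would be shipped on to the data warehouse.
    return [
        {"store": store, "orders": row["orders"], "revenue": row["revenue"]}
        for store, row in sorted(totals.items())
    ]

RESULT_SET = summarize_by_store(RAW_EVENTS)
```

Here five raw records collapse to three summary rows. In a real environment the gap between the raw data and the result set would be many orders of magnitude larger.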

Hadoop

Hadoop addresses the need to access huge amounts of data rapidly. It is a new way to store massive volumes of information and to perform analytics on a set of data that goes beyond traditional structured data. BigInsights is a secure, resilient and high-performance Hadoop distribution that is based on open standards and includes the range of rich tools that Hadoop users expect. By default, the underlying filesystem for Hadoop is HDFS (Hadoop Distributed File System). Because Hadoop accesses storage through its filesystem APIs, it is also very common to see GPFS (General Parallel File System) used as a replacement for HDFS. GPFS has now been renamed Spectrum Scale, so you may find it listed under either name.

BigInsights

BigInsights is a 100% standard Hadoop implementation. It offers a comprehensive suite of tools that vastly simplifies the implementation and use of Hadoop and helps deal with data at rest. BigInsights consists of open source Hadoop plus multiple tools including:

  • Big SQL is a very rich form of SQL designed for use with Hadoop. It provides for HDFS caching and for high availability.
  • BigSheets is a spreadsheet tool for business users to use with Hadoop. It is integrated with Big SQL and also provides geospatial support, data discovery and visualization. Geospatial support lets users analyze customer data spatially so they can determine patterns based on location.
  • Big R is a library of functions that provide end-to-end integration with the R language and BigInsights. Big R can be used for comprehensive data analysis on the BigInsights server, hiding some of the complexity of manually writing MapReduce jobs. It uses the open source R language.
  • SystemML is used for machine learning via Big R
  • Apache Spark provides a flexible, high-performance big data processing framework
  • Apache Ambari provides the Hadoop cluster administration GUI

The key modules or editions for BigInsights are:

  • IBM Open Platform with Apache Hadoop
  • Analyst Module
  • Data Scientist Module
  • Enterprise Management Module
  • Quick Start Edition

As a side note: Throughout a Hadoop environment you will hear the term MapReduce. MapReduce is a fault-tolerant programming model for processing record-oriented data. Mappers on different data nodes read data from HDFS and process it concurrently. The mappers' output is partitioned and sorted and then written to the local filesystem (not HDFS). Reducers on a subset of the data nodes read this shuffle data and aggregate the results by key. Hadoop uses MapReduce to divide up work and assign it to the nodes or machines in the Hadoop cluster.
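The map, shuffle and reduce steps described above can be sketched as a single-process word count. This is only an illustration of the programming model in plain Python, not Hadoop code: the function names and the sample input are made up for the example, and nothing here is distributed across nodes or fault tolerant.

```python
from collections import defaultdict

def map_phase(record):
    """Mapper: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle_phase(pairs):
    """Shuffle and sort: group all values by key, in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reducer: aggregate all values seen for one key."""
    return key, sum(values)

def run_job(splits):
    # Each split stands in for a block of data read by one mapper.
    mapped = []
    for split in splits:
        for record in split:
            mapped.extend(map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle_phase(mapped))

# Hypothetical input: two splits, as if stored on two data nodes.
SPLITS = [
    ["data at rest", "data in motion"],
    ["big data at rest"],
]
WORD_COUNTS = run_job(SPLITS)
```

In a real cluster each split would be processed by a mapper running close to where the data is stored, and the grouped data would be pulled across the network by the reducers.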

The Open Platform with Apache Hadoop provides support for cluster management and services such as Hive, HBase, Oozie, Flume and HDFS. The Analyst Module provides Big SQL and BigSheets, and the Data Scientist Module provides everything in the Analyst Module plus text analytics and Big R. The Enterprise Management Module includes GPFS and Platform Symphony. All three modules come combined in BigInsights for Apache Hadoop, along with a license to use some additional software to get even more value out of Hadoop.

The Analyst Module is used to find and visualize data across all sources, not just Hadoop. This is where you will see the power of Big SQL, BigSheets, Apache Spark, Apache Ambari and Hadoop itself. In particular, Apache Spark provides significant performance benefits because it offers an in-memory alternative to MapReduce. Spark SQL provides the APIs that allow SQL queries to be embedded in Scala, Python or Java programs in Spark. Spark also provides MLlib, an optimized library of machine learning functions, and the GraphX API for graphs and graph-parallel computation. The Analyst Module includes IBM Big SQL, BigSheets and BigInsights Home services. Home services is the main BigInsights interface and is used to launch BigSheets, text analytics and Big SQL.

The Data Scientist Module is designed to accelerate the work of data science teams. It provides advanced analytics for extracting insights from Hadoop, using distributed R to perform statistical analysis and business web tooling to perform text extraction. It also uses R optimized for Hadoop, which can take advantage of machine learning algorithms. The text extraction examines text on a sentence-by-sentence basis and provides the inputs to analytical tools such as SPSS, which can then analyze the patterns within a document and across documents, looking for context and relationships.
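To give a rough feel for sentence-by-sentence text extraction, here is a minimal sketch in plain Python. It does not use the BigInsights text analytics tooling: the term dictionary, the function names and the sample document are hypothetical, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical dictionary of terms to extract; not part of any BigInsights API.
TERMS = {"refund", "delay", "invoice"}

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_terms(text, terms=TERMS):
    """Return one record per sentence that mentions at least one term."""
    records = []
    for index, sentence in enumerate(split_sentences(text)):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        found = sorted(words & terms)
        if found:
            records.append({"sentence": index, "terms": found})
    return records

DOCUMENT = "The invoice was late. I want a refund for the delay! Thanks."
EXTRACTED = extract_terms(DOCUMENT)
```

Records like these, one per sentence, are the kind of structured input that a downstream analytical tool could use to look for patterns within and across documents.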

The Enterprise Management Module helps administrators manage, monitor and secure their Hadoop distribution and includes Spectrum Scale FPO (formerly GPFS FPO) and Platform Symphony. Platform Symphony offers advanced ownership, time window support and functions related to ensuring fair share at the various levels. The tools provided by the Enterprise Management Module are used to allocate resources, monitor multiple clusters and optimize workflows.

The Quick Start Edition includes Big SQL, Big R, BigSheets, text analytics, connectors and the Hadoop core.

What’s new?

The most recent version of BigInsights (v4) offers some significant enhancements over previous versions. Nearly all of the open source filesets have been updated, including Spark to 1.5.1 and Hadoop to the latest v2.7.1. Some of the new features include:

  • Apache Kafka 0.8.2.1 – a high throughput distributed messaging system
  • Teradata connector for Hadoop 1.4 (Command line edition) for Apache Hadoop 2.7 is now supported
  • New Spark version packages SparkR, an R binding for Spark based on Spark’s new DataFrame API
  • Knox now includes both LDAP and PAM support
  • New algorithms for Big R
  • Updates to BigSheets and Big SQL
  • Integration of the Enterprise Management module with Apache Ambari
  • Enhancements to text analytics including CSV format downloads

BigInsights on Cloud

Finally, we should take a quick look at BigInsights on Cloud, which is IBM’s implementation of Hadoop as a service on IBM’s SoftLayer global cloud infrastructure. It offers the security and performance of an on-premises implementation combined with managed services to help reduce costs and complexity.

BigInsights on Cloud provides the following features and benefits:

  • Managed operations provide 24 x 7 monitoring
  • IBM Open Platform with current and stable Apache Hadoop components
  • Dedicated bare metal nodes for enhanced performance, data privacy and security
  • High value in-Hadoop analytics features, including Big SQL, BigSheets, Text Analytics, Big R and Machine Learning

Summary

BigInsights is designed to make it far simpler to discover, analyze and visualize significant volumes of data from disparate sources. It is also designed to help data science teams use advanced analytics to extract valuable insights far more rapidly than has been possible in the past. The business can use BigInsights to look at all kinds of data, building relationships and context that can be gathered and analyzed very rapidly. This allows the business to focus on what matters for improving the business rather than on enormous amounts of data that no one is sure what to do with.


 

