Definition of Big data

        Big data is a buzzword, or catch-phrase, used to describe a volume of structured and unstructured data so large that it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big, it moves too fast, or it exceeds current processing capacity. Despite these problems, big data has the potential to help companies improve operations and make faster, more intelligent decisions.

        Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, and information privacy. The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can mean greater operational efficiency, cost reduction and reduced risk.

What is Big data - really?

        There's nothing new about the notion of big data, which has been around since at least 2001. In a nutshell, Big Data is your data. It's the information owned by your company, obtained and processed through new techniques to produce value in the best way possible.

Ask any Big Data expert to define the subject and they'll quite likely start talking about "The three V's" - "volume, velocity and variety," concepts originally coined by Doug Laney in 2001 to refer to the challenge of data management. In short, it's a lot of data produced very quickly in many different forms. This could involve customer transactional histories, production databases, web traffic logs, online videos, social media interactions, and so forth.

        An August 2013 blog post by Mark van Rijmenam titled "Why The 3V's Are Not Sufficient To Describe Big Data" added "veracity, variability, visualization, and value" to the definition, broadening the realm even further. Van Rijmenam stated "90% of all data ever created, was created in the past two years. From now on, the amount of data in the world will double every two years."
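
To get a feel for what "double every two years" means, here is a minimal sketch of the arithmetic. The function name and the starting volume of 1.0 are illustrative assumptions, and the two-year doubling period is taken from the quote above rather than from any measurement.

```python
def projected_volume(current_volume: float, years: float,
                     doubling_period_years: float = 2.0) -> float:
    """Project a data volume forward, assuming it doubles every fixed period.

    The two-year default comes from the quoted claim; it is an assumption,
    not a measured growth rate.
    """
    return current_volume * 2 ** (years / doubling_period_years)


# Starting from 1 unit of data (any unit), ten years means five doublings.
print(projected_volume(1.0, years=10))  # 32.0
```

If the claim holds, a data store sized for today's volume would need roughly 32 times the capacity a decade later.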

Beyond Volume, Variety and Velocity is the issue of Big Data Veracity (3V)

        We have all heard of the 3Vs of big data, which are Volume, Variety and Velocity. Yet Inderpal Bhandari, Chief Data Officer at Express Scripts, noted in his presentation at the Big Data Innovation Summit in Boston that there are additional Vs that IT, business and data scientists need to be concerned with, most notably big data Veracity. Other big data V’s getting attention at the summit are validity and volatility. Here is an overview of the 6V’s of big data.


        Big data implies enormous volumes of data. Data used to be created by employees. Now that data is generated by machines, networks and human interaction on systems like social media, the volume of data to be analyzed is massive. Yet Inderpal states that the volume of data is not as much of a problem as other V’s like veracity.


        Variety refers to the many sources and types of data, both structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. This variety of unstructured data creates problems for storing, mining and analyzing data. Jeff Veis, VP Solutions at HP Autonomy, presented how HP is helping organizations deal with big challenges including data variety.


        Big Data Velocity deals with the pace at which data flows in from sources like business processes, machines, networks and human interaction with things like social media sites, mobile devices, etc. The flow of data is massive and continuous. If you are able to handle the velocity, this real-time data can help researchers and businesses make valuable decisions that provide strategic competitive advantages and ROI. Inderpal suggests that sampling data can help deal with issues like volume and velocity.
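
The presentation summary does not say which sampling method was meant. As one possible illustration, here is a minimal sketch of reservoir sampling, a standard way to keep a fixed-size uniform random sample from a stream whose length is not known in advance. The function name, the fake event stream and the sample size are all assumptions made for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item survives only if it lands in [0, k).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of 100,000 events, processed one at a time.
events = (f"event-{n}" for n in range(100_000))
sample = reservoir_sample(events, k=10)
print(len(sample))  # 10
```

Memory use depends only on the sample size, not on how much data flows past, which is why this kind of sampling suits high-volume, high-velocity feeds.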


        Big Data Veracity refers to the biases, noise and abnormality in data. Is the data that is being stored and mined meaningful to the problem being analyzed? Inderpal feels veracity in data analysis is the biggest challenge when compared to things like volume and velocity. In scoping out your big data strategy, you need to have your team and partners work to keep your data clean and put processes in place to keep ‘dirty data’ from accumulating in your systems.


        Closely related to big data veracity is the issue of validity: is the data correct and accurate for the intended use? Clearly, valid data is key to making the right decisions. Phil Francisco, VP of Product Management at IBM, spoke about IBM’s big data strategy and the tools it offers to help with data veracity and validity.


        Big data volatility refers to how long data is valid and how long it should be stored. In this world of real-time data you need to determine at what point data is no longer relevant to the current analysis.
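
One simple way to act on volatility is a retention window: anything older than a chosen age is dropped from the current analysis. The sketch below assumes each record carries a `recorded_at` timestamp. The field name, the sensor readings and the one-hour window are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def still_relevant(records: List[Dict], now: datetime,
                   max_age: timedelta) -> List[Dict]:
    """Return only the records recorded within max_age of now."""
    return [r for r in records if now - r["recorded_at"] <= max_age]


# Hypothetical sensor readings with fixed timestamps.
now = datetime(2015, 1, 1, 12, 0)
readings = [
    {"sensor": "door", "recorded_at": datetime(2015, 1, 1, 11, 30)},
    {"sensor": "thermostat", "recorded_at": datetime(2015, 1, 1, 9, 0)},
]

fresh = still_relevant(readings, now, max_age=timedelta(hours=1))
print([r["sensor"] for r in fresh])  # ['door']
```

The right window depends on the analysis: an hour may suit real-time alerting, while trend reports may need months of history.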

        Big data clearly deals with issues beyond volume, variety and velocity to other concerns like veracity, validity and volatility.  

Big data applications: Real-World strategies for managing


        Even when an organization is compelled to become more data-driven, it often doesn’t know how to move from a use-your-gut mentality to a data-first one.

        The easiest way? Take shortcuts by refusing to reinvent the wheel and following the trails blazed by early adopters. Here are 5 cool Big Data apps, along with the use cases (and end users) that are helping to change the meaning of “business as usual.”

1. Big Data application: Roambi

        How this Big Data app works: One thing often overlooked in the rush towards data-driven decision making is mobility. Increasingly mobile workforces need more ways to manipulate data from a smartphone than just basic business tools, which are so often stripped down for mobile. Mobile workers need the ability to access and analyze the same business data they use in the office in order to make smart, on-the-go decisions.
     Roambi contends that it was founded to solve this very problem. Roambi’s goal is to reinvent the mobile business app to improve the productivity and decision-making of on-the-go employees. Roambi re-designs the way people interact with, share, and present data from a completely mobile perspective.
        Use case of note: The Phoenix Suns.  In addition to their goal of consistently performing at an elite level on the court, the Phoenix Suns are making big strides off the court through the use of analytics, which they use to help drive strategy for both business and basketball decisions.
        Although some in the NBA consider the Suns a small business in terms of the infrastructure and processes in place, in the past three years the organization has invested significant resources not only in organizing the data it accumulates, but also in guaranteeing the accuracy of that data and ensuring that it is being used by all decision makers across the organization.
        Whether it’s an off-site meeting or a long road trip, as is the nature with any professional sports team, a majority of their work is done away from the office. The organization’s ownership was looking for a way to make their critical business data available wherever their decision makers were located.
        As the Suns began taking steps to become more mobile, there was a healthy amount of skepticism that a mobile solution could be found that was both valuable and, more importantly, easy enough for end users (most of whom don’t have a very technical background) in the organization to adopt.
        That changed when the Suns adopted Roambi. The Suns started using Roambi analytics with their front office, organizing and visualizing key player scouting information all in one place, as well as making this information available in real time.
        After the success of the initial rollout, the Suns decided to expand their use of Roambi to their back office. On the business side, the Suns optimized their operations by providing KPIs across sales and marketing, reporting on everything from ticket sales to game summary reports to in-stadium promotions to customer buying behavior to inventory, all via mobile devices, so executives were all working off the same set of numbers and were able to make critical business decisions at a moment’s notice.

2. Big Data application: Esri ArcGIS

        How this Big Data app works: Esri ArcGIS, as the name implies, is a Geographic Information System (GIS) that makes it easy to create data-driven maps and visualizations.
Use case of note (in this case it is more of a partnership): In mid-July at the Esri User Conference, the company radically updated its Urban Observatory. Developed in partnership with Richard Saul Wurman and Radical Media and originally launched last year, the Urban Observatory helps cities use the common language of maps to understand patterns in diverse datasets.
“Our world has always had Big Data surrounding us that, until recently, has remained untapped for any real understanding,” said Wurman. “We are several iterations into developing a common language for mapping urbanization. It will allow cities to understand not only the major threads of their performance, land use, and contents comparatively but [also,] eventually, the nuance of change and action.”
        I attended the Esri UC last week and spent plenty of time playing with (and before that standing in line to get access to) the Urban Observatory exhibit, an interactive exhibit that makes it easy to compare and contrast data from cities worldwide, all on a touch screen.
At least half of the world's population is currently living in urbanized areas. The Global Health Observatory (GHO) projects that by 2050, 7 out of 10 people will live in a city. This year, nearly 60 cities are part of the Urban Observatory.
        Participation in Urban Observatory is open to every city around the globe. Any city that has data its officials would like to share is eligible to be included. In February 2015, Urban Observatory will go on permanent display in the Smithsonian Institution.

3. Big Data Application: Cloudera Enterprise

        How this Big Data app works: Not long after nailing down one of the (if not the) biggest funding rounds in history, Cloudera is now making inroads into the Internet of Things market with its app, locking down a deal with a major home automation company in mid-July. Oh, and I almost forgot: that close partnership with/funding from Intel is something you just can’t ignore either.
        Use case of note: Cloudera has a ton of customers, but Wells Fargo and home automation company Vivint are two to pay attention to. Wells Fargo has used Cloudera Enterprise to build an enterprise data hub.
        Vivint is the use case that really caught my attention, though, since it ties two of the hottest, most promising tech trends of the moment together. Vivint is using Cloudera Enterprise to glean insights from the data generated by intelligent devices and systems embedded with sensors in and around homes. “[With Cloudera, we can now] look across many data streams simultaneously for behaviors, geo-location, and actionable events in order to better understand and enrich our customers’ lives. This platform has differentiated our business and given us a tremendous competitive advantage,” said Brandon Bunker, senior director, Customer Analytics and Insights.
        Vivint says that it has acquired more than 800,000 customers using a variety of third-party smart-enabled devices (roughly 20 to 30 sensors per home). Many of those devices come in the form of thermostats, smart appliances, video cameras, window and door sensors, and smoke and carbon monoxide detectors. Without a central internal repository to gather and analyze the data generated from each sensor, Vivint was previously limited in its ability to innovate and to add higher intelligence to its security offerings.
        For example, knowing when a home is occupied or vacant is important to security, but when tied into the HVAC system (which tends to be the largest contributor to a home’s energy bill and carbon emissions), you can add a layer of energy cost savings by cooling or heating a home based on occupancy. Similarly, by adding geo-location into the equation, you can begin to adjust a home’s temperature based on how close the owner is to arriving, for instance, when the owner has a connected vehicle. Studies have shown that consumers could see 20 to 30 percent energy savings by turning off HVAC systems when residents are away or sleeping.
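
To make the occupancy and geo-location idea concrete, here is a minimal sketch of the kind of rule such a system might apply in heating season. It is not the company's actual logic: the `HomeState` type, the function name, the temperatures and the 20-minute pre-heat threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HomeState:
    occupied: bool                       # e.g. derived from door and motion sensors
    owner_minutes_away: Optional[float]  # e.g. from a connected vehicle; None if unknown


def target_temperature_c(state: HomeState,
                         comfort_c: float = 21.0,
                         setback_c: float = 16.0,
                         preheat_minutes: float = 20.0) -> float:
    """Choose a heating setpoint from occupancy and the owner's proximity."""
    if state.occupied:
        return comfort_c
    # Home is empty: start heating only when the owner is close to arriving.
    if (state.owner_minutes_away is not None
            and state.owner_minutes_away <= preheat_minutes):
        return comfort_c
    # Otherwise save energy with a lower setback temperature.
    return setback_c


print(target_temperature_c(HomeState(occupied=False, owner_minutes_away=45.0)))  # 16.0
print(target_temperature_c(HomeState(occupied=False, owner_minutes_away=10.0)))  # 21.0
```

A real system would also handle cooling, schedules and sensor errors, but the core decision combines the same two signals the paragraph describes.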

4. Big Data application: Zaloni Bedrock

        How this Big Data app works: Many businesses know they want to implement a Hadoop data lake, but don’t know how to do so in a cost-effective, scalable way. Moreover, simply putting data into Hadoop does not make it ready for analytics. To use common analytics toolsets, you must know where data is, how it’s structured (or not) and where it came from.
        You may also need to prepare it by filtering or joining datasets together, or masking out parts that are sensitive in nature. This typically takes a significant amount of time and effort and can be highly error prone. If you’ve done a poor job ingesting, organizing, and preparing data for analytics, the results of your analytics will be equally poor. Flawed analytics can lead to flawed business decisions, and making better business decisions was the whole point of the data lake in the first place.
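
As a rough picture of what filtering, joining and masking look like when done by hand, here is a minimal sketch in plain Python. It does not use Hadoop or any Zaloni API. The records, field names and the choice of a truncated SHA-256 hash as the masking step are assumptions made for this example, and a truncated hash is a stand-in rather than a strong anonymization method.

```python
import hashlib
from typing import Dict, List

# Hypothetical raw datasets landed in a data lake.
customers = [
    {"customer_id": 1, "email": "ana@example.com", "region": "west"},
    {"customer_id": 2, "email": "bo@example.com", "region": "east"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 40.0},
    {"order_id": 11, "customer_id": 2, "amount": -5.0},  # invalid amount
    {"order_id": 12, "customer_id": 3, "amount": 15.0},  # unknown customer
]


def mask(value: str) -> str:
    """Replace a sensitive value with a short, non-reversible token."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def prepare(orders: List[Dict], customers: List[Dict]) -> List[Dict]:
    """Filter bad orders, join them to customers, and mask the email."""
    by_id = {c["customer_id"]: c for c in customers}
    prepared = []
    for order in orders:
        if order["amount"] <= 0:
            continue  # filter: drop orders with invalid amounts
        customer = by_id.get(order["customer_id"])
        if customer is None:
            continue  # join: drop orders with no matching customer
        prepared.append({
            "order_id": order["order_id"],
            "amount": order["amount"],
            "region": customer["region"],
            "email_token": mask(customer["email"]),  # mask: no raw email kept
        })
    return prepared


prepared = prepare(orders, customers)
print([row["order_id"] for row in prepared])  # [10]
```

Even in this toy version, two of the three orders are silently dropped, which shows how easily hand-written preparation steps can skew the analytics built on top of them.
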
        With Zaloni Bedrock, the process is automated. According to Zaloni, you set it up once and you’re done. It doesn’t matter how much data you are adding to the lake, since there is no technical limit.
        Zaloni argues that without a product like Bedrock to help you along, 60 percent or more of the time and effort you spend to build an analytics system using a Hadoop data lake will be spent on data management and data preparation alone.
        Use case of note: UnitedHealth Group’s Optum division, an IT and tech-enabled health services business, uses Bedrock as part of its data platform to manage services like data ingest and workflow execution. Bedrock enables Optum to monitor multiple data sources, capture and store schema and operational metadata, and provide features like data catalog search for end users.

5. Big Data application: Tamr

        How this Big Data app works: Tamr is a data-connection and machine-learning platform designed to make enterprise data as easy to find, explore, and use as Google. According to Tamr, due to the cost and complexity of connecting and preparing the vast, untapped reserves of data sources available for analysis, most organizations use less than 10 percent of the relevant data available to them.
        It’s just too manual, too inefficient and too expensive to connect and ready the massive variety of internal and external data for analytics and other applications critical for business growth. Tamr argues that if the industry is going to be successful at helping customers manage the growth and variety of data that lies ahead – from internal sources, external public and private sources, Internet of Things feeds, etc. – a complete overhaul of traditional methods of information integration and quality management will be required.
        Use case of note: Multinational media and information company Thomson Reuters faced challenges maintaining critical, accurate data. It had outgrown its manual curation processes and looked to Tamr to provide a better solution for continuously connecting and enriching its core enterprise information assets (data on millions of organizations with more than 5.4 million records pulled from internal and external data sources).
        Using Tamr, one project that Thomson Reuters estimated would take six months was completed in only two weeks, requiring just forty hours of manual review time, a 12x improvement over the previous process. The number of records requiring manual review shrank from 30 percent to 5 percent, and the number of identified matches across data sources increased by 80 percent, all while achieving Thomson Reuters’ 95-percent precision benchmark.
        Tamr says that the disambiguation rate (or the rate of resolving conflicts) rose from 70 percent to 95 percent. Furthermore, the knowledge Tamr gleaned from its machine learning activities means that future data integration will take even less time per source.
