Over the past half decade, the big data flame has spread like wildfire throughout the enterprise, and the IT department has not been immune. The promise of data-driven initiatives capable of transforming IT from a support function to a profit center has sparked enormous interest.
After all, datacenter scale, complexity, and dynamism have rapidly outstripped the ability of siloed, infrastructure-focused IT operations management to keep pace. IT big-data analytics has emerged as the new IT operations-management approach of choice, promising to make IT smarter and leaner. Nearly all next-generation operational intelligence products incorporate data analytics to some degree. However, as many enterprises are learning the hard way, big data doesn’t always result in success.
While the Four Vs of big data – volume, velocity, variety, and veracity – are intended to serve as pillars upon which to construct big data efforts, there’s a fifth V that needs to be included, and that’s value. Every big data initiative should begin with the question “What value do I want to derive from this effort?” How a group or organization answers that question should deeply inform the means by which that end is achieved. To date, however, value has very much been the silent V.
So how should organizations go about deriving the greatest value from their data? Three key areas deserve close attention:
* Understand data gravity. The term “data gravity” was coined by Dave McCrory, the CTO of Basho Technologies, and refers to the pull that data exerts on related services and applications. According to McCrory, data exerts this gravitational pull in two key ways. First, without data, applications and services are virtually useless. For this reason, application and service providers naturally gravitate toward data, and the bigger the data set, the more applications and services it will attract.
Second, the bigger the data set, the harder it is to move. Generally, it’s more efficient and cost-effective to perform processing near where the data resides. We’ve seen large companies use cloud-based services for IT operations data. If the data itself originates in the same cloud, this approach is fine. Even data generated on-premises can be stored and analyzed in the cloud if it’s small enough. For large amounts of data generated outside the cloud, however, problems arise. For example, one organization had to purchase dedicated bandwidth just to upload the telemetry. Even then, there was at times so much data that the local forwarders fell behind, and hours passed before the data became available. In cases such as this one, it’s important to understand data gravity and process the data near where it’s generated.
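To make the cost of moving data concrete, here is a rough back-of-the-envelope sketch in Python. The function names, the 80 percent link-efficiency default, and all of the figures are illustrative assumptions, not numbers from the organization described above.

```python
def upload_hours(data_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to upload data_gb (decimal gigabytes)
    over a link rated at link_mbps megabits per second.

    efficiency is an assumed fraction of the rated bandwidth that is
    actually usable after protocol overhead and contention.
    """
    if data_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("invalid data size, link speed, or efficiency")
    total_bits = data_gb * 8 * 1000**3            # gigabytes -> bits
    usable_bits_per_second = link_mbps * 1_000_000 * efficiency
    return total_bits / usable_bits_per_second / 3600


def falls_behind(daily_gb: float, link_mbps: float, efficiency: float = 0.8) -> bool:
    """True when one day's data takes more than a day to upload,
    meaning the backlog grows without bound."""
    return upload_hours(daily_gb, link_mbps, efficiency) > 24.0
```

Under these assumptions, a site producing 2,000 GB of telemetry a day over a 100 Mbps link can never catch up, which is the situation where processing the data where it is generated becomes the practical choice.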
* Be aware of the signal-to-noise ratio. The phrase “garbage in, garbage out” looms large in big data. Some data sources are poor quality and have a low signal-to-noise ratio. Application logs are a great example of this problem. Many applications throw exceptions and log errors as part of normal operation. Enabling verbose logging can provide good information, but it comes with a huge amount of irrelevant noise.
Another example of this problem is threat detection systems, which have been in the news in association with high-profile data breaches. Threat detection systems generate thousands of alerts every day, far more than IT and security teams can actually investigate. The low signal-to-noise ratio of these systems means that alerts are often ignored altogether, and actual threats are missed amidst the chaos.
Finding the signal in all that noise can be hard, and when time is of the essence, cutting through the noise can become mission-critical. If you’re sifting through garbage, the chances of finding what you need in time drop dramatically.
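One simple way to raise the signal-to-noise ratio of application logs is to collapse lines that differ only in their numbers and then surface the patterns that occur rarely. The Python sketch below illustrates that idea. The function names and the sample log lines are hypothetical, and treating rarity as a proxy for relevance is an assumption that will not hold for every data source.

```python
import re
from collections import Counter

_NUMBER = re.compile(r"\d+")


def normalize(line: str) -> str:
    """Replace every run of digits with a placeholder so that lines
    differing only in IDs, durations, or counts share one pattern."""
    return _NUMBER.sub("<n>", line.strip())


def rare_messages(lines, max_count: int = 1):
    """Return the normalized patterns seen at most max_count times,
    in order of first appearance."""
    counts = Counter(normalize(line) for line in lines)
    return [pattern for pattern, seen in counts.items() if seen <= max_count]
```

Given three "timeout after N ms" lines and one "disk failure on volume 7" line, only the disk failure pattern is returned. A real deployment would need a richer normalizer and a threshold tuned to the source.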
* Consider the motion of your data. Is the data you’re trying to analyze at rest or in flight? The answer to this question has a huge impact on how you process, view, and analyze the data, as well as the value you can derive from it.
Most big data is at rest and analyzed post hoc in batch processes that rely on indexing and parallel processing using techniques based on sharding or MapReduce. At its core, this approach is all about volume and variety, and enterprises are leveraging multiple frameworks and data stores – such as Hadoop, MongoDB and Cassandra – for a variety of structured and unstructured data. While multiple data sources provide context and insight, this approach is always going to be retrospective.
Recently, greater attention has been paid to data-in-flight as the need for greater agility and adaptability drives demand for higher-velocity analysis. Imagine that you’re the CIO of a major retailer. It’s Cyber Monday and your website’s page load times are averaging more than 30 seconds. Post-hoc analysis isn’t going to save your company’s Cyber Monday. Being able to tell the CEO that it won’t happen again next year will be cold comfort.
If you’re that CIO, you need insight into what’s going wrong and guidance about how to fix it immediately. For these situations, high velocity data-in-flight is of paramount importance, giving IT the ability to see how systems are behaving in the moment, compare that behavior to established baselines, and drill down to find the root cause of a problem.
While data-in-flight can provide incredible value, analysis of this data requires a fundamentally different approach based on stream processing and summary metrics. In many cases, the data volume is such that it must be processed in-flight. In other cases, the information is most valuable in real time and loses value as it ages. For example, wire data is too voluminous to be stored but is extremely valuable for answering questions about what is happening in the IT environment in real time.
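As a minimal sketch of what stream processing with summary metrics can look like, the Python class below keeps a running mean and standard deviation using Welford's online algorithm, so no raw samples need to be stored, and it flags values that stray far from that baseline. The class name, the three-standard-deviation threshold, and the idea of feeding it page load times are illustrative assumptions, not a description of any particular product.

```python
import math


class RunningStats:
    """Constant-memory summary of a stream of numeric measurements."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new measurement into the summary (Welford's update)."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def stddev(self) -> float:
        """Sample standard deviation, or 0.0 with fewer than two values."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def is_anomalous(self, value: float, threshold: float = 3.0) -> bool:
        """True when value lies more than threshold standard deviations
        from the baseline mean. Never flags before a baseline exists."""
        spread = self.stddev
        if self.count < 2 or spread == 0.0:
            return False
        return abs(value - self.mean) > threshold * spread
```

In use, each incoming measurement would be checked with `is_anomalous` before being passed to `update`, so that an outlier is compared against the baseline it departs from rather than one it has already shifted.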
In the end, it’s important to remember that no single dataset or analytics framework can be all things to all people; most tools that offer a single pane of glass wind up serving nothing more than a single glass of pain. By leveraging multiple datasets as well as analytics and visualization products optimized for particular data types and goals, IT teams can achieve a complete, correlated, cross-tier view of the environment, enabling them to eliminate waste, create greater efficiency, and maximize scarce resources. This approach spells not only value for IT, but also value for the business as a whole.
Rothstein is the CEO and co-founder of ExtraHop, the global leader in real-time wire data analytics for IT intelligence and business operations.