
Big Data Analytics Using Splunk

Big Data Analytics Using Splunk is a hands-on book showing how to process and derive business value from big data in real time. Examples in the book draw from social media sources such as Twitter (tweets) and Foursquare (check-ins). You also learn to draw from machine data, enabling you to analyze,
Contents at a Glance

About the Authors
About the Technical Reviewer
Acknowledgments
Chapter 1: Big Data and Splunk
Chapter 2: Getting Data into Splunk
Chapter 3: Processing and Analyzing the Data
Chapter 4: Visualizing the Results
Chapter 5: Defining Alerts
Chapter 6: Web Site Monitoring
Chapter 7: Using Log Files To Create Advanced Analytics
Chapter 8: The Airline On-Time Performance Project
Chapter 9: Getting the Flight Data into Splunk
Chapter 10: Analyzing Airlines, Airports, Flights, and Delays
Chapter 11: Analyzing a Specific Flight Over the Years
Chapter 12: Analyzing Tweets
Chapter 13: Analyzing Foursquare Check-Ins
Chapter 14: Sentiment Analysis
Chapter 15: Remote Data Collection
Chapter 16: Scaling and High Availability
Appendix A: The Performance of Splunk
Appendix B: Useful Splunk Apps
Index

CHAPTER 1
Big Data and Splunk

In this introductory chapter we will discuss what big data is and different ways (including Splunk) to process big data.

What Is Big Data?

Big data is, admittedly, an overhyped buzzword used by software and hardware companies to boost their sales. Behind the hype, however, there is a real and extremely important technology trend with impressive business potential.
Although big data is often associated with social media, we will show that it is about much more than that. Before we venture into definitions, however, let's have a look at some facts about big data.

Back in 2001, Doug Laney from Meta Group (an IT research company acquired by Gartner in 2005) wrote a research paper in which he stated that e-commerce had exploded data management along three dimensions: volume, velocity, and variety. These are called the three Vs of big data and, as you would expect, a number of vendors have added more Vs to their own definitions.

Volume is the first thought that comes with big data: the big part. Some experts consider petabytes the starting point of big data. As we generate more and more data, we are sure this starting point will keep growing. However, volume in itself is not a perfect criterion of big data, as we feel that the other two Vs have a more direct impact.

Velocity refers to the speed at which the data is being generated or the frequency with which it is delivered. Think of the stream of data coming from the sensors in the highways in the Los Angeles area, or the video cameras in some airports that scan and process faces in a crowd. There is also the click stream data of popular e-commerce web sites.

Variety is about all the different data and file types that are available. Just think about the music files in the iTunes store (about 28 million songs and over 30 billion downloads), or the movies in Netflix (over 75,000), the articles in the New York Times web site (more than 13 million starting in 1851), tweets (over 500 million every day), Foursquare check-ins with geolocation data (over five million every day), and then you have all the different log files produced by any system that has a computer embedded.

When you combine these three Vs, you will start to get a more complete picture of what big data is all about.

Another characteristic usually associated with big data is that the data is unstructured.
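An e-mail message is the example most often given of unstructured data. As a minimal sketch of how much structure such a message carries, the snippet below parses one with Python's standard email package. The message text, the addresses, and the helper name extract_fields are made up for illustration; only the header names (From, To, Subject, Date) come from the message format itself.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message: four structured header fields, a blank line, then free text.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Lunch on Friday?
Date: Fri, 15 Mar 2013 09:30:00 -0700

Hi Bob, are you free for lunch on Friday? Free text goes here.
"""


def extract_fields(raw):
    """Split a raw message into its structured headers and free-text body."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        # The Date header follows a fixed grammar, so it parses into a datetime.
        "date": parsedate_to_datetime(msg["Date"]),
        # Only the body is free text with no predefined model.
        "body": msg.get_payload().strip(),
    }


fields = extract_fields(RAW_MESSAGE)
print(fields["subject"], fields["date"].isoformat())
```

Everything except the body can be queried like a field in a record, which is why a term such as multistructured may describe this kind of data better than unstructured.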
We are of the opinion that there is no such thing as unstructured data. We think the confusion stems from a common belief that if data cannot conform to a predefined format, model, or schema, then it is considered unstructured.

An e-mail message is typically used as an example of unstructured data; whereas the body of the e-mail could be considered unstructured, it is part of a well-defined structure that follows the specifications of RFC 2822, and contains a set of fields that include From, To, Subject, and Date. This is the same for Twitter messages, in which the body of the message, or tweet, can be considered unstructured even though it is part of a well-defined structure.

In general, free text can be considered unstructured, because, as we mentioned earlier, it does not necessarily conform to a predefined model. Depending on what is to be done with the text, there are many techniques to process it, most of which do not require predefined formats.

Relational databases impose the need for predefined data models with clearly defined fields that live in tables, which can have relations between them. We call this Early Structure Binding, in which you have to know in advance what questions are to be asked of the data, so that you can design the schema or structure and then work with the data to answer them.

As big data tends to be associated with social media feeds that are seen as text-heavy, it is easy to understand why people associate the term unstructured with big data. From our perspective, multistructured is probably a more accurate description, as big data can contain a variety of formats (the third V of the three Vs).

It would be unfair to insist that big data is limited to so-called unstructured data. Structured data can also be considered big data, especially the data that languishes in secondary storage hoping to make it some day to the data warehouse to be analyzed and expose all the golden nuggets it contains.
The main reason this kind of data is usually ignored is its sheer volume, which typically exceeds the capacity of data warehouses based on relational databases.

At this point, we can introduce the definition that Gartner, an Information Technology (IT) consultancy, proposed in 2012: "Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." We like this definition, because it focuses not only on the actual data but also on the way that big data is processed. Later in this book, we will get into more detail on this.

We also like to categorize big data, as we feel that this enhances understanding. From our perspective, big data can be broken down into two broad categories: human-generated digital footprints and machine data. As our interactions on the Internet keep growing, our digital footprint keeps increasing. Even though we interact on a daily basis with digital systems, most people do not realize how much information even trivial clicks or interactions leave behind. We must confess that before we started to read Internet statistics, the only large numbers we were familiar with were the McDonald's slogan "Billions and Billions Served" and the occasional exposure to U.S. politicians talking about budgets or deficits in the order of trillions. Just to give you an idea, we present a few Internet statistics that show the size of our digital footprint. We are well aware that they are obsolete as we write them, but here they are anyway:

By February 2013, Facebook had more than one billion users, of which 618 million were active on a daily basis.
They shared 2.5 billion items and "liked" another 2.7 billion every day, generating more than 500 terabytes of new data on a daily basis.

In March 2013, LinkedIn, which is a business-oriented social networking site, had more than 200 million members, growing at the rate of two new members every second, which generated 5.7 billion professionally oriented searches in 2012.

Photos are a hot subject, as most people have a mobile phone that includes a camera. The numbers are mind-boggling. Instagram users upload 40 million photos a day, like 8,500 of them every second, and create about 1,000 comments per second. On Facebook, photos are uploaded at the rate of 300 million per day, which is about seven petabytes worth of data a month. By January 2013, Facebook was storing 240 billion photos.

Twitter has 500 million users, growing at the rate of 150,000 every day, with over 200 million of the users being active. In October 2012, they had 500 million tweets a day.

Foursquare celebrated three billion check-ins in January 2013, with about five million check-ins a day from over 25 million users that had created 30 million tips.

On the blog front, WordPress, a popular blogging platform, reported in March 2013 almost 40 million new posts and 42 million comments per month, with more than 388 million people viewing more than 3.6 billion pages per month. Tumblr, another popular blogging platform, also reported, in March 2013, a total of almost 100 million blogs that contain more than 44 billion posts. A typical day at Tumblr at the time had 74 million blog posts.

Pandora, a personalized Internet radio, reported that in 2012 their users listened to 13 billion hours of music, that is, roughly 1.5 million years worth of music.

In similar fashion, Netflix announced their users had viewed one billion hours of videos in July 2012, which translated to about 30 percent of the Internet traffic in the United States.
As if that is not enough, in March 2013, YouTube reported more than four billion hours watched per month and 72 hours of video uploaded every minute.

In March 2013, there were almost 145 million Internet domains, of which about 108 million used the famous ".com" top level domain. This is a very active space; on March 21, there were 167,698 new and 128,866 deleted domains, for a net growth of 38,832 new domains.

In the more mundane e-mail world, Bob Al-Greene at Mashable reported in November 2012 that there are over 144 billion e-mail messages sent every day, with about 61 percent of them from businesses. The lead e-mail provider is Gmail, with 425 million active users.

Reviewing these statistics, there is no doubt that the human-generated digital footprint is huge. You can quickly identify the three Vs; to give you an idea of how big data can have an impact on the economy, we share the announcement Yelp, a user-based review site, made in January 2013, when they had 100 million unique visitors and over one million reviews:

"A survey of business owners on Yelp reported that, on average, customers across all categories surveyed spend $101.59 in their first visit. That's everything from hiring a roofer to buying a new mattress and even your morning cup of joe. If each of those 100 million unique visitors spent $100 at a local business in January, Yelp would have influenced over $10 billion in local commerce."

We will not bore you by sharing statistics based on every minute or every second of the day in the life of the Internet.
However, a couple of examples of big data in action that you might relate to can consolidate the notion: the recommendations you get when you are visiting the Amazon web site, or considering a movie on Netflix, are based on big data analytics, the same way that Walmart uses it to identify customer preferences on a regional basis and stock their stores accordingly.

By now you must have a pretty good idea of the amount of data our digital footprint creates and the impact that it has on the economy and society in general. Social media is just one component of big data.

The second category of big data is machine data. There is a very large number of firewalls, load balancers, routers, switches, and computers that support our digital footprint. All of these systems generate log files, ranging from security and audit log files to web site log files that describe what a visitor has done, including the infamous abandoned shopping carts.

It is almost impossible to find out how many servers are needed to support our digital footprint, as all companies are extremely secretive on the subject. Many experts have tried to calculate this number for the most visible companies, such as Google, Facebook, and Amazon, based on power usage, which (according to a Power Usage Effectiveness indicator that some of these companies are willing to share) can provide some insight as to the number of servers they have in their data centers. Based on this, James Hamilton, in a blog post of August 2012, published server estimates conjecturing that Facebook had 180,900 servers and Google had over one million servers. Other experts state that Amazon had about 500,000 servers in March 2012.
In September 2012, the New York Times ran a provocative article that claimed that there are tens of thousands of data centers in the United States, which consume roughly 2 percent of all electricity used in the country, of which 90 percent or more goes to waste, as the servers are not really being used.

We can only guess that the number of active servers around the world is in the millions. When you add to this all the other typical data center infrastructure components, such as firewalls, load balancers, routers, switches, and many others, which also generate log files, you can see that there is a lot of machine data generated in the form of log files by the infrastructure that supports our digital footprint.

What is interesting is that not long ago most of these log files that contain machine data were largely ignored. These log files are a gold mine of useful data, as they contain important insights for IT and the business because they are a definitive record of customer activity and behavior as well as product and service usage. This gives companies end-to-end transaction visibility, which can be used to improve customer service and ensure system security, and also helps to meet compliance mandates. What's more, the log files help you find problems that have occurred and can assist you in predicting when similar problems can happen in the future.

In addition to the machine data that we have described so far, there are also sensors that capture data on a real-time basis. Most industrial equipment has built-in sensors that produce a large amount of data. For example, a blade in a gas turbine used to generate electricity creates 520 gigabytes a day, and there are 20 blades in one of these turbines. An airplane on a transatlantic flight produces several terabytes of data, which can be used to streamline maintenance operations, improve safety, and (most important to an airline's bottom line) decrease fuel consumption.
Another interesting example comes from the Nissan Leaf, an all-electric car. It has a system called CARWINGS, which not only offers the traditional telematics service and a smartphone app to control all aspects of the car but wirelessly transmits vehicle statistics to a central server. Each Leaf owner can track their driving efficiency and compare their energy economy with that of other Leaf drivers. We don't know the details of the information that Nissan is collecting from the Leaf models and what they do with it, but we can definitely see the three Vs in action in this example.

In general, sensor-based data falls into the industrial big data category, although lately the "Internet of Things" has become a more popular term to describe a hyperconnected world of things with sensors, where there are over 300 million connected devices that range from electrical meters to vending machines. We will not be covering this category of big data in this book, but the methodology and techniques described here can easily be applied to industrial big data analytics.

Alternate Data Processing Techniques

Big data is not only about the data, it is also about alternative data processing techniques that can better handle the three Vs as they increase their values. The traditional relational database is well known for the following characteristics:

- Transactional support for the ACID properties:
  - Atomicity: Where all changes are done as if they are a single operation.
  - Consistency: At the end of any transaction, the system is in a valid state.
  - Isolation: The actions to create the results appear to have been done sequentially, one at a time.
  - Durability: All the changes made to the system are permanent.
- The response times are usually in the subsecond range, while handling thousands of interactive users.
- The data size is in the order of terabytes.
- Typically uses the SQL-92 standard as the main programming language.

In general, relational databases cannot handle the three Vs well.
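The atomicity and consistency properties listed above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the accounts table, the names alice and bob, and the transfer and balances helpers are hypothetical and do not come from the book. The credit is deliberately applied before the debit so that a failed debit shows the earlier change being undone.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-nothing transaction."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back every change if an exception escapes it.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint when src lacks the funds.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical schema: the CHECK constraint defines what a valid state is.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
with conn:
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )

ok = transfer(conn, "alice", "bob", 30)        # both updates are applied
after_ok = balances(conn)
failed = transfer(conn, "alice", "bob", 500)   # credit is rolled back with the debit
after_failed = balances(conn)
print(ok, after_ok, failed, after_failed)
```

The second transfer leaves the balances exactly as they were, even though the credit to bob had already been executed inside the transaction. Guarantees of this kind are what the alternate approaches discussed in this section give up in exchange for scale.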
Because of this, many different approaches have been created to tackle the inherent problems that the three Vs present. These approaches sacrifice one or more of the ACID properties, and sometimes all of them, in exchange for ways to handle scalability for big volumes, velocity, or variety. Some of these alternate approaches will also forgo fast response times or the ability to handle a high number of simultaneous users in favor of addressing one or more of the three Vs.

Some people group these alternate data processing approaches under the name NoSQL and categorize them according to the way they store the data, such as key-value stores and document stores, where the definition of document varies according to the product. Depending on who you talk to, there may be more categories.

The open source Hadoop software framework is probably the one that has the biggest name recognition in the big data world, but it is by no means alone. As a framework it includes a number of components designed to solve the issues associated with distributed data storage, retrieval, and analysis of big data. It does this by offering two basic functionalities designed to work on a cluster of commodity servers:

- A distributed file system called HDFS that not only stores data but also replicates it so that it is always available.
- A distributed processing system for parallelizable problems called MapReduce, which is a two-step approach. In the first step, or Map, a problem is broken down into many small ones and sent to servers for processing.
In the second step, or Reduce, the results of the Map step are combined to create the final results of the original problem.

Some of the other components of Hadoop, generally referred to as the Hadoop ecosystem, include:

- Hive, which is a higher level of abstraction of the basic functionalities offered by Hadoop. Hive is a data warehouse system in which the user can specify instructions using the SQL-92 standard and these get converted to MapReduce tasks.
- Pig, which is another high-level abstraction of Hadoop that has a similar functionality to Hive, but it uses a programming language called Pig Latin, which is more oriented to data flows.
- HBase, which is another component of the Hadoop ecosystem and implements Google's Bigtable data store. Bigtable is a distributed, persistent, multidimensional sorted map. Elements in the map are an uninterpreted array of bytes, which are indexed by a row key, a column key, and a timestamp.

There are other components in the Hadoop ecosystem, but we will not delve into them. We must tell you that in addition to the official Apache project, Hadoop solutions are offered by companies such as Cloudera and Hortonworks, which offer open source implementations with commercial add-ons mainly focused on cluster management. MapR is a company that offers a commercial implementation of Hadoop, for which it claims higher performance.

Other popular products in the big data world include:

- Cassandra, an Apache open source project, is a key-value store that offers linear scalability and fault tolerance on commodity hardware.
- DynamoDB, an Amazon Web Services offering, is very similar to Cassandra.
- MongoDB, an open source project, is a document database that provides high performance, fault tolerance, and easy scalability.
- CouchDB, another open source document database that is distributed and fault tolerant.

In addition to these products, there are many companies offering their own solutions that deal in different ways with the three Vs.

What Is Splunk?
Technically speaking, Splunk is a time-series indexer, but to simplify things we will just say that it is a product that takes care of the three Vs very well. Whereas most of the products that we described earlier had their origins in processing human-generated digital footprints, Splunk started as a product designed to process machine data. Because of these humble beginnings, Splunk is not always considered a player in big data. But that should not prevent you from using it to analyze big data belonging in the digital footprint category, because, as this book shows, Splunk does a great job of it.

Splunk has three main functionalities:

- Data collection, which can be done for static data or by monitoring changes and additions to files or complete directories on a real-time basis. Data can also be collected from network ports or directly from programs or scripts. Additionally, Splunk can connect with relational databases to collect, insert, or update data.
- Data indexing, in which the collected data is broken down into events, roughly equivalent to database records, or simply lines of data. Then the data is processed and a high performance index is updated, which points to the stored data.
- Search and analysis. Using the Splunk Processing Language, you are able to search for data and manipulate it to obtain the desired results, whether in the form of reports or alerts. The results can be presented as individual events, tables, or charts.

Each one of these functionalities can scale independently; for example, the data collection component can scale to handle hundreds of thousands of servers. The data indexing functionality can scale to a large number of servers, which can be configured as distributed peers, and, if necessary, with a high availability option to transparently handle fault tolerance. The search heads, as the servers dedicated to the search and analysis functionality are known, can also scale to as many as needed.
Additionally, each of these functionalities can be arranged in such a way that they can be optimized to accommodate geographical locations, time zones, data centers, or any other requirements. Splunk is so flexible regarding scalability that you can start with a single instance of the product running on your laptop and grow from there.

You can interact with Splunk by using SplunkWeb, the browser-based user interface, or directly using the command line interface (CLI). Splunk is flexible in that it can run on Windows or just about any variation of Unix.

Splunk is also a platform that can be used to develop applications to handle big data analytics. It has a powerful set of APIs that can be used with Python, Java, JavaScript, Ruby, PHP, and C#. The development of apps on top of Splunk is beyond the scope of this book; however, we do describe how to use some of the popular apps that are freely available. We will leave it at that, as all the rest of the book is about Splunk.

About This Book

We have a couple of objectives with this book. The first one is to provide you with enough knowledge to become a data wrangler so that you can extract wisdom from data. The second objective is that you learn how to use Splunk, a simple yet extremely powerful tool that will allow you to "click for gold" in the data you analyze.

The book has been designed so that you become exposed to big data from digital footprints and machine data. It starts by presenting simple concepts and progressively introducing slightly more difficult approaches. It is meant to be a hands-on guide for big data analytic projects that involve machine data, social media, and mining existing data warehouses.
We do this through real projects, which review in detail how to collect data, load it into Splunk, process and analyze it, and visualize the results so that they can be easily consumed by the intended audience.

We have broken the book into four parts:

- Splunk's Basic Operation, in which we introduce basic data collection, processing, analysis, and visualization of results. We use machine data in this part of the book to introduce you to the basic commands of the Splunk Processing Language. The last chapter in this part presents a way to create advanced analytics using log files.
- The airline on-time performance project. Once you are familiar with the basic concepts and commands of Splunk, we take you through the motions of a typical big data analytics project. We present you with a simple methodology, which we then apply to the project at hand, the analysis of airline performance data over the last 26 years. The data of this project falls under the category of mining an existing data warehouse. Using this project, we go over collecting data that is available in CSV format as well as picking it up directly from a relational database. In both cases, there are some special considerations regarding the timestamp that is available in this data set, and we go into detail on how to handle them. This interesting project allows us to introduce some new Splunk commands and other features of commands that were presented in the first part of the book.

Uploaded: 2017-10-28. Size: 16.86 MB