As of mid-2018, Wikipedia defined the term Big Data as follows: "Big Data refers to structured and unstructured data of huge volume and significant variety, processed efficiently by horizontally scalable software tools that emerged in the late 2000s as an alternative to traditional database management systems and Business Intelligence class solutions."
As we can see, this definition contains vague terms such as "huge", "significant", "efficiently" and "alternative". Even the name itself is quite subjective. For example, is 4 terabytes (the capacity of a modern external laptop hard drive) already big data or not yet? Wikipedia adds the following to this definition: "in a broad sense, 'big data' is spoken of as a socioeconomic phenomenon associated with the emergence of technological capabilities to analyze huge amounts of data - in some problem areas, the entire world's data - and the resulting transformational consequences."
IBS analysts have estimated "the volume of data in the whole world" as follows:
2003 - 5 exabytes of data (1 EB = 1 billion gigabytes)
2008 - 0.18 zettabytes (1 ZB = 1,024 exabytes)
2015 - over 6.5 zettabytes
2020 - 40-44 zettabytes (forecast)
2025 - this volume is expected to grow a further 10 times.
The report also notes that most of this data will be generated not by ordinary consumers, but by enterprises (recall the Industrial Internet of Things).
It is also possible to use a simpler definition, quite in line with the well-established opinion of journalists and marketers.
"Big data is a set of technologies that are designed to perform three operations:
-Process larger data volumes than in "standard" scenarios
-Work with rapidly arriving data in very large volumes. That is, there is not just a lot of data, but ever more of it all the time.
-Work with structured and weakly structured data in parallel and in different aspects."
These "skills" are believed to reveal hidden patterns that elude limited human perception. This provides unprecedented opportunities to optimize many areas of our lives: public administration, medicine, telecommunications, finance, transport, manufacturing, and so on. It is not surprising that journalists and marketers used the phrase Big Data so often that many experts now consider the term discredited and suggest abandoning it.
Moreover, in October 2015, Gartner excluded Big Data from its list of popular trends. The company's analysts explained the decision by the fact that the term "big data" covers a large number of technologies that are already actively used in enterprises; they partly belong to other popular areas and trends and have become everyday working tools.
In any case, the term Big Data is still widely used, as this very article confirms.
Three "V" and three principles for handling big data
The defining characteristics of big data are, in addition to physical volume, other features that emphasize the complexity of processing and analyzing such data. The set of "VVV" features (volume, velocity, variety - physical volume, the speed of data growth and the need to process it quickly, and the ability to simultaneously process data of different types) was formulated by Meta Group in 2001 to indicate the equal importance of data management across all three aspects.
Interpretations with four Vs (adding veracity - reliability), five Vs (viability and value), and seven Vs (variability and visualization) appeared later. IDC, for example, interprets the fourth V as value, emphasizing the economic feasibility of processing large volumes of data under appropriate conditions.
Based on the above definitions, the basic principles of working with large data are as follows:
-Horizontal scalability. This is the basic principle of big data processing. As mentioned earlier, more and more data is processed every day. Correspondingly, it must be possible to increase the number of computing nodes across which the data is distributed, and processing must continue without performance degradation.
-Fault tolerance. This principle follows from the previous one. Since a cluster may contain many computing nodes (sometimes tens of thousands) and their number will only grow, the probability of machine failure also increases. Methods of working with big data must take such situations into account and provide preventive measures.
-Data locality. Since data is distributed across a large number of computing nodes, if it is physically stored on one server and processed on another, the cost of data transfer may become unreasonably high. Therefore, it is preferable to process data on the same machine on which it is stored.
These principles differ from those of traditional, centralized, vertical storage models for well-structured data. Accordingly, new approaches and technologies are being developed for handling big data.
Technologies and trends in working with Big Data
Initially, the set of approaches and technologies included means of massively parallel processing of loosely structured data, such as NoSQL DBMSs, MapReduce algorithms, and the tools of the Hadoop project. Later, other solutions providing similar capabilities for processing very large data arrays, as well as some hardware, also came to be classified as big data technologies:
-MapReduce - a model of distributed parallel computing on computer clusters, introduced by Google. Under this model, an application is divided into a large number of identical elementary tasks executed on the cluster nodes and then reduced to the final result.
-NoSQL (from Not Only SQL) - a general term for various non-relational databases and storage systems; it does not refer to any single technology or product. Conventional relational databases handle fast, uniform queries well, but on the complex, flexibly constructed queries typical of big data, the load exceeds reasonable limits and the DBMS becomes ineffective.
-Hadoop - a freely distributed set of utilities, libraries and a framework for developing and running distributed programs on clusters of hundreds or thousands of nodes. It is considered one of the foundational big data technologies.
-R - a programming language for statistical data processing and graphics. It is widely used for data analysis and has become a de facto standard for statistical software.
-Hardware solutions. Teradata, EMC and other corporations offer hardware-software appliances designed for processing big data. These appliances are supplied as ready-to-install telecommunication cabinets containing a server cluster and control software for massively parallel processing. Hardware for in-memory analytical processing is sometimes also placed in this category, in particular SAP's HANA appliances and Oracle's Exalytics appliance, even though such processing is not initially massively parallel and the RAM of a single node is limited to a few terabytes.
The consulting company McKinsey, besides the technologies NoSQL, MapReduce, Hadoop and R considered by most analysts, also includes Business Intelligence technologies and relational database management systems with SQL support in the context of applicability to big data processing.
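To make the MapReduce model from the list above concrete, here is a minimal in-memory sketch of the classic word-count example in Python. It only simulates the map, shuffle and reduce phases on one machine; a real framework such as Hadoop distributes these phases across cluster nodes.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Each mapper emits (key, value) pairs: one per word occurrence.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reducer folds one key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

def word_count(documents):
    pairs = chain.from_iterable(map_phase(d) for d in documents)
    return reduce_phase(shuffle(pairs))

counts = word_count(["big data", "big clusters process big data"])
print(counts["big"])   # 3
```

The key property illustrated here is that both `map_phase` and `reduce_phase` operate on independent pieces of data, which is exactly what allows the model to scale horizontally across many nodes.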
Methods and techniques of big data analysis
The international consulting company McKinsey, which specializes in strategic management problems, identifies 11 methods and techniques of analysis applicable to big data.
- Data Mining class methods - a set of methods for discovering previously unknown, non-trivial, practically useful knowledge in data for decision-making. These include, in particular, association rule learning, classification (categorization), cluster analysis, regression analysis, detection and analysis of deviations, and others.
- Crowdsourcing - classification and enrichment of data by a wide, undefined circle of people performing this work without entering into an employment relationship.
- Data fusion and integration - a set of techniques to integrate heterogeneous data from a variety of sources for in-depth analysis (e.g. digital signal processing, natural language processing including tonal analysis, etc.).
- Machine learning, including supervised and unsupervised learning - the use of models based on statistical analysis or machine learning to produce complex forecasts from baseline models.
- Artificial neural networks, network analysis, optimization, including genetic algorithms (heuristic search algorithms used to solve optimization and modeling problems through random selection, combination and variation of candidate parameters, using mechanisms analogous to natural selection).
- Pattern Recognition
- Predictive analytics
- Simulation modeling - a method for building models that describe processes as they would actually unfold. Simulation modeling can be considered a kind of experimental testing.
- Spatial analysis - a class of methods that use topological, geometric and geographic information extracted from data.
- Statistical analysis - time series analysis, A/B testing (split testing, a marketing research method in which a control group of elements is compared with a set of test groups in which one or more indicators have been changed, in order to determine which changes improve the target indicator).
- Visualization of analytical data - presenting information as figures and diagrams, using interactive features and animation, both for conveying results and as input for further analysis. A very important step in big data analysis, allowing the most important findings to be presented in the most convenient form.
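The counting step behind association rule learning, one of the Data Mining methods listed above, can be sketched in a few lines. This is a simplified illustration, not a full Apriori implementation; the shopping baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs that appear in at least min_support transactions."""
    counts = Counter()
    for basket in transactions:
        # Count each unordered pair of distinct items once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "milk"},
]
print(frequent_pairs(baskets, min_support=3))   # {('bread', 'milk'): 3}
```

A real association rule miner would then turn such frequent itemsets into rules (e.g. "bread implies milk") ranked by confidence, but the support-counting shown here is the core of the technique.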
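The A/B testing mentioned under statistical analysis usually comes down to comparing two conversion rates. Below is a minimal sketch of a two-proportion z-test using only the standard library; the visitor and conversion counts are made up for illustration.

```python
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for the difference
    between two observed conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF (via erf).
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Variant A: 120 conversions out of 2400 visitors; variant B: 150 out of 2400.
z, p = two_proportion_z(conv_a=120, n_a=2400, conv_b=150, n_b=2400)
print(round(z, 2), round(p, 4))
```

With these hypothetical numbers the difference between a 5% and a 6.25% conversion rate is not quite significant at the conventional 0.05 level, which is exactly the kind of question an A/B test is meant to settle.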