Center for Strategic Assessment and forecasts

Autonomous non-profit organization

Big data: How it is changing our view of the world
Publication date: 09-10-2013

Everyone knows that the Internet has changed how businesses and governments operate and how people live. But another, less visible technological trend is just as transformative: the use of "big data". From a very large body of information we can learn things that cannot be discovered from small portions of it.

The "explosion of data" is transforming not only how information is processed but also how we come to know the world. A worldview built on the analysis of causation is being challenged by the advantages of correlation. Big data is bringing major changes to management practice and is affecting the nature of politics. Possessing data and knowing how to put it to work helps to foresee the future and offers a new key to power. But as we discover and use the technology's capabilities, we must remember its limitations and its dark side.

Everyone knows that the Internet has changed how businesses and governments operate and how people live. But a new, less noticeable technological trend is bringing equally radical change: the use of "big data". Today we have considerably more information than ever before, and it is being put to increasingly unconventional uses. Big data is not the same thing as the Internet, although the World Wide Web greatly simplifies the collection and exchange of information. Nor is big data just a means of communication: the point is that a huge body of information yields knowledge that is not available from small portions of it.

In the third century BC, the Library of Alexandria was believed to hold the sum of human knowledge. Today the world has accumulated so much information that each living person's share is 320 times the volume that historians believe was kept in Alexandria's entire collection. The total is estimated at 1,200 exabytes (an exabyte is a quadrillion kilobytes). If all of it were put on CD-ROMs and the discs were stacked in five piles, each pile would reach to the moon.

This explosion of data is a relatively new phenomenon. As recently as 2000, only a quarter of all the world's stored information was digital. The rest was kept on paper, film and other analog media. But because the volume of digital data grows so fast, doubling roughly every three years, the situation has been reversed: today less than 2% of all stored information is non-digital.

At such a gigantic scale it is tempting to view big data purely in terms of size. But that would be misleading. Big data also makes it possible to turn into data many things that have never been measured quantitatively before; let us call this "datafication". For example, the location of an object on the Earth's surface was datafied first with the invention of longitude and latitude and, relatively recently, with GPS systems. Words become data when computers mine old books spanning many centuries. Even friendships and sympathies are datafied via Facebook "likes".

Data of this kind is being put to incredible new uses with the help of inexpensive computer memory, powerful processors, smart algorithms, clever software and mathematics that borrows from basic statistics. Instead of trying to teach a computer to drive a car or to translate from one language to another, something artificial intelligence experts struggled with unsuccessfully for decades, the new approach is to feed a sufficiently large amount of data into the computer so that it can infer the probability that, say, a traffic light is green rather than red, or that in a certain context the French word lumière is a better match for "light" than léger.

Using information in this way requires three profound changes in our approach. The first is to collect and use a great deal of data rather than settle for small amounts or samples, as statisticians have done for more than 100 years. The second is to give up our preference for pristine, verified data in favor of natural messiness: in a growing number of situations some inaccuracy is acceptable, because a large flow of data of varying quality is more useful and less costly than a limited amount of very accurate information. Third, in many cases we will have to abandon the search for causes and accept correlations instead. Rather than trying to understand precisely why an engine breaks down or why a drug's side effect disappears, researchers can collect and analyze large amounts of information about such events and everything connected with them, looking for patterns that help to predict when they will occur. That answers the question "what?" rather than "why?", but often that is enough.

The Internet changed how people communicate. Big data is different: it is transforming how society processes information, and over time it may change our view of the world. As we draw on ever larger bodies of information, we will probably discover that many aspects of life are probabilistic rather than certain.

Approaching "N = ALL"

Throughout most of history, people have worked with relatively small amounts of data, because the tools for collecting, organizing, storing and analyzing it were poor. People cut the information down to a minimum to make it easier to examine. The genius of modern statistics, which first came to the fore in the late nineteenth century, is that it enabled society to understand complex realities even when little data was available. Today the technological conditions have turned 179 degrees. There still is, and always will be, a limit on how much data we can process, but that limit is far less restrictive than it used to be and will become even less so over time.

In the past, people obtained information by sampling. When collecting data was expensive and processing it was time-consuming, there was no other way. Modern sampling rests on the fact that, within a certain margin of error, conclusions about a whole population can be drawn from a small group of its members, provided they are selected at random. For example, exit polls on election night predict the voting results from a random survey of a few hundred voters. This works well for straightforward questions, but it fails when we want to examine specific subgroups. What if a pollster wants to know which candidate unmarried women under 30 will vote for? What about unmarried, university-educated American women of Asian descent under 30? The random sample becomes useless, because it may contain only a couple of people who match these characteristics, too few for a reliable estimate of how that whole social group will vote. But if we collect all the data, that is, when "n = all" in the language of statisticians, the problem disappears.
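The subgroup problem can be made concrete with a little arithmetic. The sketch below is only an illustration: the sample size of 500 and the subgroup share of 0.4% are assumed figures, not numbers from the article, and the function names are hypothetical. It uses the standard normal-approximation margin of error for a proportion, which is itself unreliable for very small groups; that unreliability is precisely the point.

```python
import math

def expected_subgroup_size(sample_size: int, subgroup_share: float) -> float:
    """Expected number of subgroup members in a simple random sample."""
    return sample_size * subgroup_share

def margin_of_error(n: float, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion estimated from n respondents."""
    if n <= 0:
        return float("inf")
    return z * math.sqrt(p * (1 - p) / n)

# Assumed figures, chosen only for illustration.
SAMPLE_SIZE = 500        # "a few hundred" respondents
SUBGROUP_SHARE = 0.004   # assume 0.4% of voters belong to the narrow subgroup

whole_sample_moe = margin_of_error(SAMPLE_SIZE)
subgroup_n = expected_subgroup_size(SAMPLE_SIZE, SUBGROUP_SHARE)
subgroup_moe = margin_of_error(subgroup_n)

print(f"whole sample: n={SAMPLE_SIZE}, margin of error ~ {whole_sample_moe:.1%}")
print(f"subgroup: expected n={subgroup_n:.0f}, margin of error ~ {subgroup_moe:.1%}")
```

Under these assumptions the full sample supports an estimate within a few percentage points, while the subgroup is expected to contain only about two respondents, so its margin of error is too wide to say anything useful.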

This example reveals another disadvantage of using only part of the information rather than all of it. In the past, when people relied on a limited amount of data, they often had to decide at the outset what to collect and how it would be used. Today, when we gather everything, we do not need to know the purpose in advance. Of course, it is not always possible to capture everything, but every year it becomes more realistic to aim for comprehensive data about a particular phenomenon. Big data is a matter not simply of creating larger samples but of using as much of the available information on a subject as possible. We still need statistics, but we no longer need to rely on small samples.

However, there is a trade-off. When the scale grows many times over, we may have to give up unambiguous, carefully curated data and put up with some messiness. This runs counter to how people have tried to work for centuries. Yet the obsession with neatness and accuracy is an artifact of an era in which information was scarce. When data was collected literally bit by bit, researchers had to be sure that their figures were as accurate as possible. Access to a far larger volume means that we can tolerate some inaccuracy (provided that the data set is not completely wrong) in return for the deeper insight that a huge body of data provides.

Consider translation from one language to another. It might seem that computers should be good translators, since they can store large amounts of information and retrieve it quickly. But if you simply substitute words from an English-French dictionary, the translation will be dreadful. Language is complex. A breakthrough came in the 1990s, when IBM delved into statistical machine translation. It fed transcripts of parliamentary hearings in French and English into a computer and programmed it to infer which word in one language best corresponds to a word in the other. Translation became a giant problem of probability and mathematics. But after the initial breakthrough the process stalled, and no further progress followed.
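The idea of inferring word correspondences from parallel text can be illustrated with a toy example. The sketch below is a simplified word-alignment model in the style of what is usually called IBM Model 1, trained by expectation-maximization and without the usual NULL word. It is not a reconstruction of IBM's actual system; the three-sentence corpus and the function names are invented for illustration.

```python
from collections import defaultdict

def train_word_translation(pairs, iterations=20):
    """Estimate t(french | english) from sentence pairs by expectation-maximization."""
    french_vocab = {f for _, fr in pairs for f in fr}
    uniform = 1.0 / len(french_vocab)
    t = defaultdict(lambda: uniform)  # start with no preference for any pairing
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for en, fr in pairs:
            for f in fr:
                # Share each French word among the English words of its sentence,
                # in proportion to the current translation probabilities.
                norm = sum(t[(f, e)] for e in en)
                for e in en:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # Re-estimate the probabilities from the fractional counts.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

def best_translation(t, english_word):
    """Return the French word with the highest estimated probability."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)

# Invented toy corpus: (English sentence, French sentence) as word lists.
CORPUS = [
    ("the house".split(), "la maison".split()),
    ("the flower".split(), "la fleur".split()),
    ("a flower".split(), "une fleur".split()),
]

model = train_word_translation(CORPUS)
for word in ("the", "house", "flower", "a"):
    print(word, "->", best_translation(model, word))
```

No dictionary is supplied anywhere. The pairing of "flower" with "fleur" emerges only because the two words keep appearing in the same sentence pairs, which is the sense in which translation becomes a problem of probability.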

Then Google took up the task. Instead of using a relatively small number of high-quality translations, the search giant drew on a much larger body of data, but from the less orderly Internet: data "in the wild", so to speak. Google took translations from corporate websites, documents in every language of the European Union, even translations from its giant book-scanning project. It analyzed not millions but billions of pages. The result is quite good, better than IBM's, and covers 65 languages. Large amounts of messy data beat small amounts of cleaner data.

From causality to correlation

These two shifts in our approach (from part of the data to all of it, and from orderly data to messy data) give rise to a third change: from causation to correlation. This is a move away from constant attempts to understand the underlying causes of how the world works toward learning about associations between conditions and events and putting that knowledge to use.

Of course, it is desirable to know the causes of things. The problem is that they are often extremely difficult to establish, and in many cases, when we think we have identified the causes, it is nothing more than an illusion. Behavioral economics has shown that people tend to see causes where none exist. We therefore need to be constantly on guard so that our biases do not lead us astray. Sometimes it is enough to let the data speak for itself.

Take UPS, the package delivery company. It places sensors on certain vehicle parts to detect the overheating or vibration that in the past has been associated with failures of those parts. In this way, the company can predict a breakdown before it happens and replace the part when convenient, not on the side of the road. The data does not reveal the exact relationship between heat, vibration and failure. It does not tell UPS why a particular part is failing. But it is enough to show what to do in the near term, and it helps to pinpoint a fault in a particular part or vehicle.

A similar approach is used to deal with "breakdowns" in the human body. Canadian researchers are developing a big-data method for identifying infections in premature babies before visible symptoms appear. By converting 16 vital signs, including heart rate, blood pressure, respiration and blood oxygen level, into an information flow of more than a thousand data points per second, they have found correlations between very minor changes and much more serious problems. Eventually this technique will allow doctors to act earlier and save lives. Over time, the recorded observations may also explain what causes such problems. But when a newborn's health is under threat, simply knowing what is likely to happen is more important than understanding exactly why.

Medicine offers another good example of why, with big data, spotting correlations can be extremely valuable even when the underlying causes remain unclear. In February 2009, Google caused a stir in the medical community. Researchers at the company published an article in the journal Nature describing how seasonal influenza outbreaks can be tracked using nothing more than Google's archived search records. The search engine handles more than a billion queries a day in the United States alone and keeps every one of them. The company took the 50 million terms that appeared most often in search queries between 2003 and 2008 and compared them with influenza data from the Centers for Disease Control and Prevention (CDC). The idea was to discover whether the frequency of searches for certain terms in a particular geographic area correlated with the CDC's data on influenza outbreaks there. The CDC tracks actual visits to clinics across the country, but the information it publishes is a week or two late, which is an eternity during a pandemic. Google, by contrast, works almost in real time.

Google never claimed to know which queries would be the best indicators. It ran all the terms through an algorithm that ranked how well they correlated with flu outbreaks. The system then tried combining the terms to see whether that improved the model. After nearly half a billion calculations against the available data, Google identified 45 terms, words and phrases such as "headache" and "runny nose", that correlated strongly with the CDC's data on flu outbreaks. All 45 terms were connected with influenza in some way. But with a billion search queries a day, no person could have guessed which terms would work best and tested only those.
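The ranking step described above can be sketched in a few lines. This is not Google's actual algorithm, which the article does not specify. It is a minimal illustration that scores each search term by its Pearson correlation with a series of reported flu cases. All the weekly figures and the "football scores" term are invented, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)

def rank_terms(term_counts, target):
    """Sort search terms by how strongly their weekly counts track the target series."""
    scores = {term: pearson(series, target) for term, series in term_counts.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented weekly figures, for illustration only.
flu_cases = [10, 12, 30, 55, 40, 15]
term_counts = {
    "runny nose": [100, 130, 290, 560, 410, 160],
    "headache": [200, 210, 300, 420, 380, 230],
    "football scores": [500, 480, 510, 490, 505, 495],
}

ranking = rank_terms(term_counts, flu_cases)
for term, score in ranking:
    print(f"{term}: r = {score:.2f}")
```

The ranking needs no prior judgment about which terms ought to matter; unrelated terms simply fall to the bottom. As the article notes, the real system went further and tested combinations of terms against the model.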

Moreover, the data was imperfect. Because nobody had originally intended to use this information in this way, misspelled terms and incomplete phrases were common. However, the sheer size of the data set more than compensated for its messiness. The result, of course, is simply a correlation. It says nothing about why someone searched for a particular term: because the person was ill, because someone was sneezing in the next apartment, or because of anxiety after reading the newspaper. Google's system does not know, and it does not care. In fact, in December of last year Google appears to have overestimated the number of influenza cases in the United States. This is a reminder that predictions are only probabilities and are not always correct, especially when they are based on ever-changing Internet search queries that are susceptible to outside influences such as media reports. And yet big data can indicate the general direction in which a situation is developing, and Google's system did exactly that.

Back-end operations

Many technologists believe that the history of big data should be traced back to the digital revolution of the 1980s, when advances in microprocessors and computer memory made it possible to analyze and store ever more information. But that is only the surface of the matter. Computers and the Internet certainly help big data by reducing the cost of collecting, storing, processing and sharing information. At its core, however, big data is only the latest step in humanity's attempt to understand and quantify the world around us. To see why, it helps to take a quick look behind us.

Evaluating the way people sit is the art and science of Shigeomi Koshimizu, a professor at the Advanced Institute of Industrial Technology in Tokyo. Few people would think that sitting posture is important information, but it is. When a person sits, the contours of the body, its posture and its weight distribution can all be quantified and tabulated. Using sensors placed at 360 different points in a car seat, Koshimizu and a team of engineers record the pressure exerted by the driver's backside, rating each point on a scale from 0 to 256. The result is a digital code that is unique to each person. In a trial, the system was able to distinguish one person from another with 98% accuracy.
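One way to picture such a "digital code" is as a vector of 360 pressure readings that is compared against stored profiles. The sketch below is an assumption about how such matching could work, not a description of Koshimizu's method: it uses plain Euclidean distance to the nearest enrolled profile, with randomly generated readings, a hypothetical acceptance threshold and hypothetical driver names.

```python
import math
import random

NUM_SENSORS = 360   # number of measurement points given in the article
MAX_READING = 256   # upper end of the article's 0 to 256 scale

def distance(a, b):
    """Euclidean distance between two pressure profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(reading, enrolled, threshold):
    """Return the closest enrolled driver, or None if no profile is close enough."""
    name, dist = min(
        ((n, distance(reading, p)) for n, p in enrolled.items()),
        key=lambda pair: pair[1],
    )
    return name if dist <= threshold else None

rng = random.Random(0)  # fixed seed so the illustration is repeatable

def random_profile():
    """Invented stand-in for a real seat measurement."""
    return [rng.randint(0, MAX_READING) for _ in range(NUM_SENSORS)]

def noisy(profile, jitter=5):
    """Simulate the same person sitting down again with small variations."""
    return [min(MAX_READING, max(0, v + rng.randint(-jitter, jitter))) for v in profile]

enrolled = {"owner": random_profile(), "spouse": random_profile()}
THRESHOLD = 200.0  # hypothetical acceptance threshold

owner_again = noisy(enrolled["owner"])
stranger = random_profile()

print(identify(owner_again, enrolled, THRESHOLD))
print(identify(stranger, enrolled, THRESHOLD))
```

A real system would need a threshold tuned on measured data and would have to cope with clothing, seat adjustment and posture changes; the sketch shows only that posture, once recorded as numbers, can be compared like any other data.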

This is not the whim of mad scientists. Koshimizu plans to use the technology to create a new generation of anti-theft systems. A car equipped with such a system could recognize that a stranger is at the wheel and demand a password before the engine will start. Turning sitting positions into data creates a useful service and a potentially profitable business. The benefits extend far beyond preventing car theft. The aggregated data may help to reveal the relationship between a driver's posture and road safety, for example by recording shifts in posture just before an accident. The system might also sense a driver's reactions slowing because of fatigue and sound an alarm or automatically apply the brakes.

Koshimizu took something that had never been studied as data, and that nobody even imagined could carry information, and transformed it into a digital, quantitative format. There is not yet a good term for this kind of transformation, but "datafication" seems an appropriate word. Datafication is not the same as digitization, in which analog content (books, films, photographs) is converted into digital information, a sequence of ones and zeros that a computer can read. Datafication is a much broader activity in which any aspect of life is transformed into data. Google's augmented-reality glasses datafy the gaze. Twitter datafies stray thoughts, and LinkedIn datafies professional networks.

When we datafy something, we can change its purpose and turn the information into new kinds of value. For example, in 2012 IBM received a US patent for "securing premises using surface-based computing technology", a technical way of describing a touch-sensitive floor surface, something like a giant smartphone display. Datafying the floor opens up a variety of possibilities. The floor could identify the objects on it, and turn on lights or open doors when a person enters the room.

Moreover, such a floor could identify people by their weight or by the way they stand or walk. It could detect when someone has fallen and cannot get up, which is important for the elderly. Retailers could use the technology to track the flow of customers through their stores. Once data of this kind is stored and analyzed, it will help us learn about things and events that we could never have known before, because we had no easy and cheap way to measure them.

Big data and the "Big Apple"

The possibilities of big data extend far beyond medicine and consumer goods: it is fundamentally changing how governments work and is influencing the nature of politics. Whether the aim is to accelerate economic growth, provide public services or wage war, the advantage will go to those who can put big data to work. Today the most exciting work is being done at the municipal level, where it is easier to access data and to experiment with the information. The mayor of New York (the "Big Apple"), Michael Bloomberg, who made his fortune in data processing, has led an effort to move the city's municipal services to big data in order to improve public services and reduce costs. One example is the city's fire-prevention strategy.

Illegally subdivided buildings are at much greater risk of fire than others. The city receives 25,000 complaints about overcrowded buildings every year, but it has only 200 inspectors to respond to them. A small team of analysts in the mayor's office reckoned that big data could help to resolve this imbalance between needs and resources. The team created a database of all 900,000 buildings in the city and supplemented it with information collected by 19 municipal agencies: records of property liens for unpaid taxes, irregularities in utility use, service cuts, missed utility payments, the frequency of ambulance calls, local crime rates, complaints about rodents and so on. They then compared this database with records of fires over the previous five years, ranked by severity, hoping to detect correlations. Not surprisingly, the type of building and its year of construction were among the factors that predicted fire risk. Less expected was the finding that the risk of severe fires was lower in buildings that had obtained permits for exterior brickwork.

All this allowed city hall staff to develop a system that helps to determine which complaints about overcrowded buildings need an immediate response. None of the recorded characteristics of these buildings directly caused fires; rather, they correlated with an increased or reduced risk of fire. That knowledge proved extremely valuable: in the past, building inspectors issued orders to vacate the premises in 13% of their visits; after the switch to the new method, the figure rose to 70%, a huge gain in efficiency.

Of course, insurance companies have long used similar methods to assess fire risk, but they rely on a limited set of factors, usually ones that intuitively seem related to fires. Unlike the insurers, New York's city hall used a big-data approach that allowed it to examine many more variables, including some that at first glance had nothing to do with fire risk. The model used by the city is cheaper and more efficient. Most important, forecasts based on big data are also more accurate.

Big data is also increasing the transparency of democratic governance. A whole movement has grown up around the idea of "open data", calling for governments to go beyond the freedom-of-information laws that are common in developed democracies. Its supporters urge governments to give the public broad access to their unclassified data. The United States has been a pioneer, creating a dedicated website (Data.gov), and many other countries have followed its example.

As they turn to big data, governments must also protect citizens against unhealthy market dominance. Companies such as Google, Amazon and Facebook, as well as lesser-known "data brokers" like Acxiom and Experian, have amassed huge amounts of information about everyone and everything. Antitrust laws are designed to protect citizens against the monopolization of markets for goods and services, such as software or media, because the size of those markets is easy to estimate. But how should antitrust principles be applied to big data, a market that defies definition and constantly changes shape? Even more worrying is privacy, since larger amounts of data will almost certainly lead to more disclosures of personal information, and current legislation and technology seem unable to prevent it.

Attempts to regulate big data could lead to friction between states. European countries are already investigating Google over alleged violations of antitrust law and infringements of privacy. This is reminiscent of the antitrust campaign that the European Commission began against Microsoft ten years ago. Facebook may also become a target in various countries, because the company holds so much information about individual citizens and their private lives. Diplomats will have to cross swords over whether information flows fall under free-trade rules. In the future, when China censors Internet searches, it may face complaints not only about restricting freedom of speech but also about unlawfully restraining trade.

Big data or Big Brother?

States need to protect their citizens and their markets from the problems associated with big data. But people may also come up against another dark side of big data: the risk that it will turn into "Big Brother". In all countries, and especially in undemocratic ones, big data exacerbates the asymmetry of power between the state and ordinary people.

That asymmetry could become so great that it leads to a dictatorship of big data. This possibility is vividly imagined in science-fiction films such as Minority Report. The film, released in 2002, is set in a dystopian near future. Tom Cruise's character heads a police crime-prevention unit that works with clairvoyants, who help to identify people who are about to commit a crime. The plot shows the system's obvious potential for error and, worse, its denial of free will.

Although the idea of identifying potential offenders before they have committed a crime seems fantastic, big data has allowed authorities to take it seriously. In 2007, the US Department of Homeland Security launched a research project called FAST (Future Attribute Screening Technology), intended to identify potential terrorists. Like the polygraph, the technology relies on various physiological indicators, from the direction of a person's gaze to heart rate and gestures. Police forces in many cities, including Los Angeles, Memphis, Richmond and Santa Cruz, have adopted "predictive policing" software, which analyzes data on past crimes to identify where and when future ones may be committed.

For now, these systems do not identify specific individuals as suspects, but they are developing in that direction. Perhaps they will go on to pick out the young people most likely to shoplift. There may be compelling reasons for becoming so specific, particularly when it comes to preventing negative social outcomes that are not crimes. For example, if social workers could predict with 95 percent accuracy which teenage girls would become pregnant or which high school students were likely to drop out, would they not try to intervene in time to prevent those outcomes? It sounds good. After all, prevention is better than punishment. But even an intervention meant to provide real help rather than censure or reprimand may be perceived as a punishment: at the very least, the person risks being stigmatized in the eyes of others as socially unreliable. In that case, the state's actions would amount to punishment before any reprehensible act has been committed, and to an encroachment on free will.

Another cause for concern is governments' excessive trust in data. In his 1999 book "Seeing Like a State", the anthropologist James Scott documents how the state, in its insatiable passion for quantification and data collection, sometimes turns people's lives into a nightmare. Officials use maps to decide how to reorganize settlements without knowing anything about the people who live there. On the basis of tables of harvest data, bureaucrats who know nothing about farming decide to collectivize agriculture. They take the imperfect, organic ways in which people have interacted over the ages and bend them to their own needs, sometimes merely for the sake of computational convenience.

This misplaced habit of relying on data can backfire. Institutions often succumb to the magic of numbers and give them more meaning than they deserve. Recall one of the lessons of the Vietnam War. US Secretary of Defense Robert McNamara was obsessed with using statistics to measure military success. He and his colleagues focused on enemy losses. The number of enemy soldiers killed became the defining metric: the figures were published in the press, and commanders were guided by them. For supporters of the war, the statistics were evidence of success; for critics, they were proof of its immorality. Yet the statistics did not reveal the complex reality of the conflict. The numbers were often inaccurate and of little use for assessing the real situation. Information can certainly improve lives, but the analysis of statistical data needs a larger measure of common sense.

The human factor

Big data will inevitably change the way we live, work and think. A worldview built on the analysis of causation is being challenged by the advantages of correlation. The possession of knowledge, which once meant an understanding of the past, now helps to foresee the future. The challenges posed by big data will not be easy to answer. Most likely, they are simply the next step in the endless debate about how best to understand the world.

Still, big data will become an integral part of the solution to many pressing problems. Tackling climate change requires analyzing pollution data in order to make informed decisions about where to focus efforts and how to mitigate the problem at least a little. Sensors placed around the world, including those embedded in smartphones, give climate scientists a comprehensive picture that lets them model global warming more accurately. Meanwhile, improving the quality of healthcare and reducing its cost, especially for the poor, will require automating tasks that are currently done by people but that a computer is quite capable of doing, such as examining cells for cancer or detecting infections before the first symptoms appear.

Ultimately, big data marks the moment when the information society finally begins to live up to its promising name. Information takes center stage. The digital bits that have been collected find new applications and generate new kinds of value. But this requires new thinking and will challenge existing institutions and the established order of public life. What role remains for people, for their intuition and for their ability to go against the facts, in a world where more and more decisions are made on the basis of data analysis? If everyone appeals to big data and uses its tools, then perhaps the main distinguishing feature of human beings will be their unpredictability: the ability to follow instinct, take risks and cope with unforeseen circumstances and errors. If so, a place must be set aside for the human: space reserved for intuition, common sense and serendipity. It is important to ensure that these valuable human qualities are not supplanted by computer algorithms.

The notion of social progress is also affected by these changes. Big data lets us experiment faster and explore a wider range of problems. These advantages should produce more innovation. But sometimes the spark of invention lies in what the data does not say. It is something that no amount of available information can confirm, because it does not yet exist. If Henry Ford had asked big-data algorithms what his customers wanted, the answer would have been "a faster horse" (to rephrase his famous saying). In a world of big data, it is the most human qualities that need to be developed and encouraged: creative thinking, intuition, intellectual ambition and ingenuity. They are what drives progress.

Big data is a resource and a tool, designed to inform rather than to explain. It leads toward an understanding of many phenomena, but it can also prompt mistaken conclusions; everything depends on how it is used. And however bright and dazzling the power of big data may seem, its seductive glitter must not blind us to its inherent imperfections. In adopting and using this technology, we must not forget its limitations.


