2.5 quintillion bytes of data are generated every day!
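- A quintillion is one followed by 18 zeros
That is enough to fill about 78 million 32 GB iPads every day. (The often-quoted figure of 57.5 billion 32 GB iPads corresponds to the 1.8 zettabytes in the IDC study.)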
According to IDC’s Digital Universe Study:
- The world’s “digital universe” is in the process of generating 1.8 Zettabytes of information
- With continuing exponential growth this projects to 40 Zettabytes in 2020
- To put this into perspective, 1 ZB is 1,000 EB, or a billion TB.
- Five exabytes (1 EB is 10^18 bytes, or a billion gigabytes) of data would contain all words ever spoken by human beings on earth
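The unit relationships above are easy to check with a few lines of arithmetic. The following is a minimal sketch, assuming decimal (SI) units throughout; the `UNITS` table and the `convert` helper are illustrative names, not part of any library.

```python
# Decimal (SI) storage units, expressed in bytes.
UNITS = {
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}

def convert(value, from_unit, to_unit):
    """Convert a quantity from one decimal storage unit to another."""
    return value * UNITS[from_unit] / UNITS[to_unit]

daily_bytes = 2.5 * 10**18        # 2.5 quintillion bytes generated per day
ipad_bytes = 32 * UNITS["GB"]     # capacity of one 32 GB iPad

ipads_per_day = daily_bytes / ipad_bytes
eb_per_zb = convert(1, "ZB", "EB")
tb_per_zb = convert(1, "ZB", "TB")
gb_per_eb = convert(1, "EB", "GB")

print(f"32 GB iPads filled per day: {ipads_per_day:,.0f}")
print(f"Exabytes in one zettabyte:  {eb_per_zb:,.0f}")
print(f"Terabytes in one zettabyte: {tb_per_zb:,.0f}")
print(f"Gigabytes in one exabyte:   {gb_per_eb:,.0f}")
```

Note that vendors and operating systems sometimes use binary units (powers of 1024) instead, which would shift these figures by a few percent.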
How is Big Data generated?
- Google processes 20 PB a day (2008)
- Wayback Machine has 3 PB + 100 TB/month (3/2009)
- Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
- eBay has 6.5 PB of user data + 50 TB/day (5/2009)
- CERN’s Large Hadron Collider (LHC) generates 15 PB a year
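One way to read the "stored volume + growth rate" figures above is to ask how long each store would take to double. This is a rough sketch only: it assumes the quoted growth rates stay constant (real growth was probably faster), uses decimal units, and spreads the Wayback Machine's monthly figure over an assumed 30-day month. The `days_to_double` helper is a hypothetical name.

```python
TB_PER_PB = 1000  # decimal units

# (stored petabytes, growth in terabytes per day), as quoted in the list above.
# The Wayback Machine's 100 TB/month is divided by an assumed 30-day month.
sources = {
    "Wayback Machine": (3.0, 100 / 30),
    "Facebook": (2.5, 15),
    "eBay": (6.5, 50),
}

def days_to_double(stored_pb, tb_per_day):
    """Days until the stored volume doubles at a constant daily growth rate."""
    return stored_pb * TB_PER_PB / tb_per_day

for name, (stored_pb, tb_per_day) in sources.items():
    days = days_to_double(stored_pb, tb_per_day)
    print(f"{name}: about {days:.0f} days to double")
```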
“Big data is high volume, high velocity, and high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation” (Gartner 2012)
So what is Big Data?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Big Data: a definition
Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, curation, storage, search, sharing, analysis, and visualisation. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to “spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.” (Wikipedia)
The four dimensions of Big Data
- Volume: Large volumes of data
- Velocity: Quickly moving data
- Variety: structured, unstructured, images, etc.
- Veracity: Trust and integrity are a challenge
What about data types?
A. Structured Data
Typically stored in databases or spreadsheets and managed in accordance with a standardised storage format and ontology, e.g. names, place names.
B. Unstructured Data
Text, audio, imagery, video, e.g. student email, chat rooms, questionnaire responses, lecture videos (audio and video)
Different data types lend themselves to different analytical techniques.
What about data analysis?
A. Structured data analysis
Descriptive statistics – sums, means, std devs, basic plotting (graphs, charts, histograms).
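As a small illustration of descriptive statistics on structured data, the sketch below uses Python's standard `statistics` module. The `scores` list is made-up example data, and the text histogram stands in for a proper plotting library.

```python
import statistics
from collections import Counter

# Hypothetical structured data: ten exam scores.
scores = [52, 67, 67, 71, 74, 78, 81, 85, 90, 95]

total = sum(scores)
mean = statistics.mean(scores)
std_dev = statistics.stdev(scores)  # sample standard deviation

print(f"sum = {total}, mean = {mean}, std dev = {std_dev:.2f}")

# Basic text histogram: count scores in bands of ten (50-59, 60-69, ...).
bins = Counter((score // 10) * 10 for score in scores)
for lower in sorted(bins):
    print(f"{lower}-{lower + 9}: {'#' * bins[lower]}")
```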
B. Unstructured data analysis
Text: document clustering, topic detection, entity extraction (people, places, locations, dates, times, etc.)
Audio: speaker identification, language identification, speech to text
Video analysis: face recognition, object recognition, target tracking
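To give a flavour of entity extraction from text, here is a deliberately simple sketch that pulls dates and times out of a sentence with regular expressions. It assumes ISO-style dates (YYYY-MM-DD) and 24-hour times (HH:MM); the patterns, the `extract_entities` helper and the sample message are all illustrative. Extracting people and places generally needs trained language models rather than patterns like these.

```python
import re

# Assumed formats: ISO-style dates and 24-hour clock times.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TIME_RE = re.compile(r"\b(?:[01]\d|2[0-3]):[0-5]\d\b")

def extract_entities(text):
    """Return the date and time strings found in a piece of free text."""
    return {
        "dates": DATE_RE.findall(text),
        "times": TIME_RE.findall(text),
    }

# Hypothetical unstructured input, e.g. a line from a student email.
message = "The lecture moves from 2013-03-04 at 09:30 to 2013-03-06 at 14:00."
entities = extract_entities(message)
print(entities)
```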
And then there is data visualisation:
tools that enable people to see meaningful patterns in data
Utilising massive data to discover and explain is not as easy as you might think…
- Poor and sparse samples, surrogates, bias…
- As the number of dimensions increases, it becomes increasingly difficult to add in any data point without giving rise to some kind of statistically significant ‘pattern’ or ‘cluster’
- And parametric distributions become unreliable
- It is very difficult to discover useful things that are unknown by experts
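The point about dimensions and spurious patterns can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not a formal test: it generates a random target and 1,000 purely random features for 20 samples, then reports the strongest absolute Pearson correlation found among the first 5, 50 and 1,000 features. Nothing here is related to anything else, yet the best apparent "pattern" can only get stronger as more dimensions are searched. The sample sizes and the `pearson` helper are arbitrary choices made for the example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

random.seed(0)  # fixed seed so the run is repeatable
n_samples = 20

# A random target and 1,000 random features: no real relationship exists.
target = [random.gauss(0, 1) for _ in range(n_samples)]
features = [
    [random.gauss(0, 1) for _ in range(n_samples)]
    for _ in range(1000)
]
abs_r = [abs(pearson(feature, target)) for feature in features]

# Strongest apparent correlation when searching the first d features.
strongest = {d: max(abs_r[:d]) for d in (5, 50, 1000)}
for d, r in strongest.items():
    print(f"{d:5d} random features: strongest |r| = {r:.2f}")
```

Because each larger set of features contains the smaller one, the strongest correlation never decreases as dimensions are added, which is exactly why a "significant" finding in very wide data needs independent confirmation.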
We need to capture the meaning of data, not just the data itself
Some data facts
Over 5 billion mobile devices are in use worldwide. Each call, text and instant message is logged as data. Mobile devices, particularly smartphones and tablets, also make it easier to use social media and other data-generating applications, and they collect and transmit location data.
Billions of internet transactions – purchases, funds transfers, share trades – every day, including countless automated transactions. Each creates a number of data points collected by retailers, banks, credit card issuers, credit agencies and others.
Millions of networked electronic devices – including servers and other IT hardware, smart energy meters and temperature and other sensors – all create semi-structured log data that record every action.
Hundreds of millions of social networking users and resources. Each Facebook update, Tweet, blog post and comment creates multiple new data points, both structured and unstructured.
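The semi-structured log data mentioned above typically mixes a few fixed fields with a variable set of key=value pairs. The sketch below shows one way such a line might be turned into a structured record; the log format, the device name and the `parse_log_line` helper are all hypothetical, chosen only to illustrate the idea.

```python
def parse_log_line(line):
    """Split a log line into fixed fields plus any key=value pairs."""
    timestamp, device, event, *pairs = line.split()
    record = {"timestamp": timestamp, "device": device, "event": event}
    for pair in pairs:
        key, _, value = pair.partition("=")
        record[key] = value  # values are kept as strings in this sketch
    return record

# Hypothetical line from a smart energy meter.
line = "2013-03-04T09:30:00 meter-17 reading kwh=1.25 temp=21.5"
record = parse_log_line(line)
print(record)
```

Once log lines are in this form, they can be loaded into a database or spreadsheet and analysed like any other structured data.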