Big data is an elusive concept.
It represents an amount of digital information,
which is uncomfortable to store,
transport,
or analyze.
Big data is so voluminous
that it overwhelms the technologies of the day
and challenges us to create the next generation
of data storage tools and techniques.
So, big data isn't new.
In fact, physicists at CERN have been wrangling
with the challenge of their ever-expanding big data for decades.
Fifty years ago, CERN's data could be stored
in a single computer.
OK, so it wasn't your usual computer,
this was a mainframe computer
that filled an entire building.
To analyze the data,
physicists from around the world traveled to CERN
to connect to the enormous machine.
In the 1970s, our ever-growing big data
was distributed across different sets of computers,
which mushroomed at CERN.
Each set was joined together
in dedicated, homegrown networks.
But physicists collaborated without regard
for the boundaries between sets,
and hence needed to access data on all of these.
So, we bridged the independent networks together
in our own CERNET.
In the 1980s, islands of similar networks
speaking different dialects
sprang up all over Europe and the States,
making remote access possible but torturous.
To make it easy for our physicists across the world
to access the ever-expanding big data
stored at CERN without traveling,
the networks needed to speak
the same language.
We adopted the fledgling internetworking standard from the States,
followed by the rest of Europe,
and we established the principal link at CERN
between Europe and the States in 1989,
and the truly global internet took off!
Physicists could easily then access
the terabytes of big data
remotely from around the world,
generate results,
and write papers in their home institutes.
Then, they wanted to share their findings
with all their colleagues.
To make this information sharing easy,
we created the web in the early 1990s.
Physicists no longer needed to know
where the information was stored
in order to find it and access it on the web,
an idea which caught on across the world
and has transformed the way we communicate
in our daily lives.
During the early 2000s,
the continued growth of our big data
outstripped our capability to analyze it at CERN,
despite having buildings full of computers.
We had to start distributing the petabytes of data
to our collaborating partners
in order to employ local computing and storage
at hundreds of different institutes.
In order to orchestrate these interconnected resources
with their diverse technologies,
we developed a computing grid,
enabling the seamless sharing
of computing resources around the globe.
This relies on trust relationships and mutual exchange.
But this grid model could not be transferred
beyond our community so easily,
since elsewhere not everyone has resources to share,
nor could companies be expected
to have the same level of trust.
Instead, an alternative, more business-like approach
for accessing on-demand resources
has been flourishing recently,
called cloud computing,
which other communities are now exploiting
to analyze their big data.
It might seem paradoxical for a place like CERN,
a lab focused on the study
of the unimaginably small building blocks of matter,
to be the source of something as big as big data.
But the way we study the fundamental particles,
as well as the forces by which they interact,
involves creating them fleetingly,
colliding protons in our accelerators
and capturing a trace of them
as they zoom off near light speed.
To see those traces,
our detector, with 150 million sensors,
acts like a really massive 3-D camera,
taking a picture of each collision event -
that's up to 14 million times per second.
That makes a lot of data.
But if big data has been around for so long,
why do we suddenly keep hearing about it now?
Well, as the old metaphor explains,
the whole is greater than the sum of its parts,
and it is no longer just science that is exploiting this.
The fact that we can derive more knowledge
by joining related information together
and spotting correlations
can inform and enrich numerous aspects of everyday life,
either in real time,
such as traffic or financial conditions,
in short-term evolutions,
such as medical or meteorological,
or in predictive situations,
such as business, crime, or disease trends.
Virtually every field is turning to gathering big data,
with mobile sensor networks spanning the globe,
cameras on the ground and in the air,
archives storing information published on the web,
and loggers capturing the activities
of Internet citizens the world over.
The challenge is on to invent new tools and techniques
to mine these vast stores,
to inform decision making,
to improve medical diagnosis,
and otherwise to answer needs and desires
of tomorrow's society in ways that are unimagined today.