Google continues to be the most popular search engine, but it is not resting on its laurels. Improvements in the Google search algorithm and more powerful modes of image search may offer important new capabilities for Chemistry. Science applications of Big Data are now being used for genomic sequencing, climate science, electron microscopy, oceanography and physics. Chemical applications of Big Data have been slower to appear, but these techniques are now proving valuable for chemical literature searches and Quantitative Structure–Activity Relationship (QSAR) models. In addition, the special software that is being used to analyze Big Data is beginning to appear in traditional chemical searches, like Chemical Abstracts. danah boyd (sic) argues that Big Data has emerged as a system of knowledge that is already changing the objects of knowledge that we work with every day.
Recent Developments in Search Engines and Applications to Chemical Search
Over the years, one of the recurring themes of this column has been the question, “What is the best search engine for chemists?” For some time Google has been the most popular engine, and this continues to be the case. Despite the millions of dollars that Microsoft has spent to advertise Bing, Google’s market share has only dropped from 65% to 64.8%. Bing has gained some market share, but most of the increase has been at the expense of the third-place engine, Yahoo. It is estimated that the efforts to overtake Google are costing Microsoft almost a billion dollars a quarter (http://www.dailytech.com/Bing+Loses+Nearly+1B+per+Quarter+for+Microsoft/article22817.htm). Microsoft’s latest marketing ploy is the “Bing It On Challenge” (http://www.bingiton.com/), which asks people which search results they would choose, Bing or Google, in a “side-by-side, blind-test comparison.” Microsoft claims that people choose Bing by a two-to-one ratio. I tried it with chemical terms and picked Google four out of five times, with the other result being a draw. Similarly, a professor from Yale Law School recruited people to make the comparison and found that Google was the preferred engine unless the search terms were selected by Microsoft (http://www.slate.com/blogs/future_tense/2013/10/02/bing_vs_google_professor_fact_checks_bing_it_on_challenge_microsoft_loses.html). Microsoft claims that the professor’s experiment was flawed, but identifies no specific problems.
Google continues to improve its search engine. In my last column I talked about the Knowledge Graph, the sidebar on some search results that helps to disambiguate searches when more than one possible term corresponds to the query (http://www.ccce.divched.org/P1Fall2012CCCENL). Google announced last month that it had quietly rolled out a new search algorithm, called “Hummingbird” (http://www.infoworld.com/t/search-engines/googles-new-hummingbird-algorithm-about-queries-not-just-seo-227732). Google constantly makes small improvements to the search algorithm, but Hummingbird is billed as a more significant change.
One new feature is the comparison tool. For example, if you search for “butter vs. olive oil,” the first result is a convenient table comparing the two. Apparently this feature is still being rolled out, because a chemical topic like “benzene vs. cyclohexane” does not yet display a similar table (although the search does return sites that make this comparison). For this kind of comparison, the Wolfram Alpha search engine still provides much better results for chemists (http://www.wolframalpha.com/).
Many users pose search queries as natural language questions, but in the past search engines have responded by parsing search phrases one word at a time. The new Hummingbird algorithm is an artificial intelligence program tuned to parsing entire questions, not individual words. For many searchers, especially those who do not construct sophisticated query strings, this should give far better results. Hummingbird is also supposed to interpret a search in the context of the previous search. For example, if you search for “pictures of Great Danes” and then for “common health problems,” Google says it will understand that you're looking for common health problems associated with Great Danes (http://www.entrepreneur.com/article/228620#ixzz2gNwHUfee).
Google has announced a new and more sophisticated way to search for images (http://www.google.com/insidesearch/features/images/searchbyimage.html). There was not enough time to explore this facility in any detail, but to get an idea of how it might work, do the following:
This may be a useful capability if you are looking for suggestions of compounds that have a structure similar to that of a known compound.
The fact that Google is still popular does not tell the whole story. Because chemists search for unusual terms, the crucial question that determines the best engine is the size of the search engine index. Remember that search engines do not directly search the WWW. Instead, automated programs called netbots move around the web and collect information, which is then stored in an index. When a chemist does a search, he or she is really searching that index. The larger the index, the more likely it is that it will contain the scientific information that the chemist is searching for.
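The idea that a query runs against a stored index, and not against the live web, can be illustrated with a minimal sketch of an inverted index. This is only an illustration under simple assumptions: the page URLs and text are made up, the function names `build_index` and `search` are hypothetical, and a real search engine index is far more elaborate.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the pages for the first word, then intersect with the rest.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical "crawled" pages; the URLs and text are invented for illustration.
pages = {
    "example.org/a": "benzene is an aromatic hydrocarbon",
    "example.org/b": "cyclohexane is a saturated hydrocarbon",
    "example.org/c": "olive oil and butter compared",
}

index = build_index(pages)
print(sorted(search(index, "aromatic hydrocarbon")))
print(sorted(search(index, "sitagliptin")))
```

The second query returns nothing because the term never made it into the index, which is why index size matters so much for unusual chemical terms.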
Maurice de Kunder estimates that as of September 2013, the indexed Web maintained by Google contains at least 3.84 billion pages (http://www.worldwidewebsize.com/). This is also probably a good indication of the size of the accessible web. The total size of the WWW is difficult to guess because much of it is dark, that is, it isn’t easily accessible by search engine netbots. Tammy Everts estimated that in 2012, the average web page was over a megabyte in size (http://www.webperformancetoday.com/2012/05/24/average-web-page-size-1-mb/). This suggests that the accessible web is on the order of a few petabytes (3.84 billion pages at roughly one megabyte each comes to nearly 4 petabytes). De Kunder estimates that Google uses an index about four times as large as Bing's, which is another reason chemists should expect better results from Google.
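The size estimate follows from multiplying the two figures quoted above. The short calculation below is a back-of-the-envelope sketch that assumes decimal (SI) units and treats 1 megabyte per page as a round number; the result is an order-of-magnitude estimate only.

```python
PAGES_INDEXED = 3.84e9     # de Kunder's estimate for Google's index, September 2013
BYTES_PER_PAGE = 1e6       # roughly 1 megabyte per page (Everts, 2012), rounded
BYTES_PER_PETABYTE = 1e15  # decimal (SI) petabyte, an assumption of this sketch


def accessible_web_petabytes(pages, bytes_per_page):
    """Estimate the total size of the indexed pages in petabytes."""
    return pages * bytes_per_page / BYTES_PER_PETABYTE


size_pb = accessible_web_petabytes(PAGES_INDEXED, BYTES_PER_PAGE)
print(f"Estimated accessible web: {size_pb:.2f} PB")
```

Because the average page is "over a megabyte," the true figure would be somewhat larger, but the product still lands in the low petabyte range.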
Looking at de Kunder’s data makes it clear that search engines must deal with huge amounts of data. The following comparisons may help to clarify the size of a petabyte (http://groups.yahoo.com/neo/groups/SUNNETmedia/conversations/topics/816). One petabyte of data would fill 250,000 DVDs or 20 million four-drawer file cabinets. This seems like an impossibly large amount of data, but Google currently processes about 20 petabytes of data per day and Facebook currently stores about 100 petabytes of data. The entire written works of mankind from the beginning of recorded history in all languages are estimated to represent about 50 petabytes.
Even a petabyte is small compared with the vast amounts of information in circulation. According to researchers at UC San Diego, Americans consumed about 3.6 zettabytes of information in 2008, where a zettabyte is 1 × 10²¹ bytes of data. David Weinberger suggests another way to visualize these very large units (ref. 1). A digital version of War and Peace would be about 2 megabytes, or 1296 pages in a traditional book, so one zettabyte would equal 5 × 10¹⁴ copies of War and Peace. If those copies were stacked, it would take a photon of light travelling at 186,000 miles per second 2.9 days to go from the top to the bottom of the stack.
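The War and Peace comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (2 megabytes per copy, 186,000 miles per second, 2.9 days); the per-copy thickness is not given in the text, so it is worked out backward from the 2.9-day figure and should be read as an implied value, not a measured one.

```python
ZETTABYTE_BYTES = 1e21              # 1 zettabyte in bytes
WAR_AND_PEACE_BYTES = 2e6           # about 2 megabytes per digital copy
LIGHT_SPEED_MILES_PER_S = 186_000   # speed of light as quoted in the text
SECONDS_PER_DAY = 86_400
INCHES_PER_MILE = 63_360

# Number of copies of War and Peace that fit in one zettabyte.
copies = ZETTABYTE_BYTES / WAR_AND_PEACE_BYTES
print(f"Copies per zettabyte: {copies:.1e}")  # 5.0e+14

# Distance light covers in 2.9 days, i.e. the height of the stack.
stack_miles = LIGHT_SPEED_MILES_PER_S * SECONDS_PER_DAY * 2.9

# Thickness per printed copy implied by that stack height (an inferred value).
implied_thickness_inches = stack_miles / copies * INCHES_PER_MILE
print(f"Stack height: {stack_miles:.2e} miles")
print(f"Implied thickness per copy: {implied_thickness_inches:.1f} inches")
```

The same division applied to the 3.6 zettabytes consumed in 2008 gives 3.6 times as many copies.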
Part of the reason for this incredible accumulation of data is that computing resources are becoming much cheaper and more powerful. In 1980 a terabyte of disk storage cost $14 million; now it costs about $30. Not too long ago, only a few individuals or companies had access to supercomputers; now Amazon or Google will “rent” a cloud-based supercomputer cluster for only a few hundred dollars an hour. Social networks are also contributing massive amounts of data. Twitter generates more than 7 terabytes (TB) a day and Facebook more than 10 TB, and some enterprises already store data in the petabyte range. (A terabyte is about 1 × 10¹² bytes.) This leads to the question of how search engines deal with these huge quantities of data.
The problem is not just the amount of data. IBM describes the current information environment in terms of the three V’s: volume, variety, and velocity (ref. 2). The sheer volume of stored data is exploding. IBM predicts that there will be 35 zettabytes stored by 2020, and the amount of data available is growing by about 60% every year. The velocity of data depends not just on the speed at which the data is flowing but also on the pace at which it must be collected, analyzed, and retrieved. Finally, this data comes in a bewildering variety of structured and unstructured formats. Whereas traditional data warehouses use relational databases (tables of rows and columns, much like an Excel spreadsheet), search engines must also handle unstructured data in non-relational databases, sometimes called NoSQL. Unstructured data is often text-based and may include large amounts of metadata. The combination of the three V’s is often described as “Big Data.”
Search engine providers long ago realized that new computational methods are needed to analyze Big Data. In order to power its searches, Google developed a strategy called MapReduce. A search is divided into many small components, each of which is assigned to one of the available processors for execution (Google has over a million servers), and then the results from each processor are recombined. The most popular software to accomplish this is a program called Hadoop, first developed in 2005 and now maintained by the Apache Software Foundation. Doug Cutting, who was one of the developers, named the program after his son's toy elephant. There are currently six different freeware versions of Hadoop available. Hadoop is designed to collect data, even if it does not fit nicely into tables, distribute a query across a large number of separate computer processors, and then combine the results into a single answer set in order to deliver results in almost real time.
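The divide, process, and recombine pattern described above can be sketched in a few lines. This is a toy illustration and not Google's or Hadoop's actual code: the documents are made up, the function names are hypothetical, and a small thread pool stands in for the many separate servers a real cluster would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(chunk):
    """Map step: count term occurrences within one chunk of documents."""
    counts = Counter()
    for doc in chunk:
        counts.update(doc.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def map_reduce(documents, n_workers=3):
    """Split the work, run the map step on each piece, then combine the results."""
    # Deal the documents out into one chunk per worker.
    chunks = [documents[i::n_workers] for i in range(n_workers)]
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())


# Hypothetical documents, invented for illustration.
documents = [
    "benzene ring aromatic",
    "cyclohexane ring saturated",
    "benzene and cyclohexane compared",
    "aromatic ring systems",
]

totals = map_reduce(documents)
print(totals["ring"], totals["benzene"])
```

Each worker sees only its own chunk, so the map step can run anywhere; only the small partial counts have to be brought together in the reduce step.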
Are Big Data procedures relevant for Chemists?
Although chemical applications don’t usually involve data sets that are as big as those described above, Big Data tools can be useful even for moderately large data sets. For example, some chemical literature searches and Quantitative Structure–Activity Relationship (QSAR) models can benefit from the application of Hadoop and similar techniques. The Royal Society of Chemistry has digitized all of the articles from its journals going back to 1841 and plans to make this database available (http://www.slideshare.net/AntonyWilliams/digitally-enabling-the-rsc-archive?from_search=1). The goal is to extract reactions, spectra, data, and as many small molecules as possible. It is hoped that processing this information with Big Data tools will promote the discovery of new relationships that had previously been overlooked.
Chemical search services, like Chemical Abstracts, are now beginning to use Hadoop to work with datasets that are too large for conventional methods. Anthony Trippe writes that those who work with chemical patents have had to deal with Big Data even before the name became popular (http://www.patinformatics.com/blog/first-look-new-stn-big-data-creates-chemistry-without-limits/). He points out that, “The universe of available patent documents, worldwide, is well over 80 million, and in the CAS world of chemistry, the running count of known organic, and inorganic molecules currently stands at over 73 million substances, not including an additional nearly 65 million sequences. These types of numbers, as well as the interconnectedness of the data, certainly allow patent, and chemical information to qualify as sources of Big Data.”
In July of this year, FIZ Karlsruhe and Chemical Abstracts Service (CAS) announced the launch of Version One of a new STN platform. Trippe reports that this new version of STN is powered by Hadoop. Until recently, data analysts sometimes found it difficult to run a broad structure search for patents dealing with compounds of interest, but Trippe says the new platform “. . . puts the entire collection of chemical, and patent data at their fingertips, and allows them to manipulate it at will.”
To demonstrate the power of the new STN powered by Hadoop, Trippe describes how a pharmaceutical company could search for compounds that might function like Januvia, a member of a new class of anti-diabetic agents that inhibit dipeptidyl peptidase-IV (DPP4). This requires a search for all compounds that are structurally similar to sitagliptin (the free base of Januvia) and have been studied in conjunction with DPP4. Sitagliptin has two ring systems: a phenyl ring and a fused 6,5 system with four nitrogen atoms. Any compound that involves this basic skeleton would be of interest. This search involved more than 10 million substances, and all of the references related to these items. A query of this magnitude would have been very difficult to run on the old version of STN. It runs in less than 30 seconds on the new STN, and produces almost 2.5 million structures. Trippe concludes by writing, “Combining the breadth, and depth of the collections available, with the deep indexing that has been created by the database producers, generates a powerful combination that opens the door to exploring chemistry, and patents in a way that has never existed before.”
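The source does not describe how STN scores structural similarity, and the search above is a skeleton (substructure) search. As a loosely related illustration, the sketch below shows a different and widely used cheminformatics measure, the Tanimoto coefficient, applied to sets of structural features. Everything in it is an assumption for illustration: the feature labels are hand-made, the compound names are hypothetical, and real systems compute fingerprints from the actual structures.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: shared features / all features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_compounds(query, library, threshold=0.5):
    """Return (name, score) pairs at or above the threshold, best match first."""
    scored = [(name, tanimoto(query, features)) for name, features in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical, hand-made feature sets; these are NOT real chemical fingerprints.
query = {"phenyl", "fused_6_5_ring", "four_ring_N", "amide", "primary_amine"}
library = {
    "compound_A": {"phenyl", "fused_6_5_ring", "four_ring_N", "amide"},
    "compound_B": {"phenyl", "amide", "carboxylic_acid"},
    "compound_C": {"cyclohexane", "ether"},
}

for name, score in similar_compounds(query, library):
    print(f"{name}: {score:.2f}")
```

Scoring each library entry is independent of the others, which is one reason this kind of search spreads well across many processors.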
Search engines have become such a ubiquitous part of everyday life that it is easy to assume that there are few changes occurring. This is far from the truth. Google is leading the way towards new methods for search, and when these become fully functional they may be very useful to chemists. Beyond this, the techniques that are being developed for search are opening new avenues for science. In particular, the application of ideas from Big Data is already playing a significant role in Chemistry, and may become even more important in the future.
Big Data is important not just because of size but also because of how it connects data, people, and information structures. It enables us to see patterns that weren’t visible before. Major companies, like Google, Amazon, and Netflix, are already using large data sets to offer new services to their customers, and the National Security Agency is also using big data for less user-friendly purposes. danah boyd (sic) summarizes the potential of Big Data by writing, “Just as Ford changed the way we make cars – and then transformed work itself – Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community” (http://softwarestudies.com/cultural_analytics/Six_Provocations_for_Big_Data.pdf). boyd adds, “And finally, how will the harvesting of Big Data change the meaning of learning and what new possibilities and limitations may come from these systems of knowing?”
1. Weinberger, D., Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room. 2012, New York, NY: Basic Books.
2. Zikopoulos, P. C., et al., Understanding Big Data. 2012, New York, NY: McGraw-Hill.