Recent Developments in Search Engines and Applications to Chemical Search

Harry E. Pence, SUNY Oneonta, Oneonta, NY, 13861, penche@oneonta.edu

October 23, 2013 - October 25, 2013
Abstract: 

Google continues to be the most popular search engine, but it is not resting on its laurels.  Improvements in the Google search algorithm and more powerful modes of image search may offer important new capabilities for Chemistry.  Science applications of Big Data are now being used for genomic sequencing, climate science, electron microscopy, oceanography and physics.  Chemical applications of Big Data have been slower to appear, but these techniques are now proving valuable for chemical literature searches and Quantitative Structure–Activity Relationship (QSAR) models.  In addition, the special software that is being used to analyze Big Data is beginning to appear in traditional chemical searches, like Chemical Abstracts.  danah boyd (sic) argues that Big Data has emerged as a system of knowledge that is already changing the objects of knowledge that we work with every day.



Over the years, one of the recurring themes of this column has been the question, “What is the best search engine for chemists?”  For some time Google has been the most popular engine, and this continues to be the case.  Despite the millions of dollars that Microsoft has spent to advertise Bing, Google’s market share has only dropped from 65% to 64.8%.  Bing has gained some market share, but most of the increase has been at the expense of the third place engine, Yahoo.  It is estimated that the efforts to overtake Google are costing Microsoft almost a billion dollars a quarter (http://www.dailytech.com/Bing+Loses+Nearly+1B+per+Quarter+for+Microsoft/article22817.htm).  Microsoft’s latest marketing ploy is the “Bing It On Challenge” (http://www.bingiton.com/) that asks people which search results they would choose, Bing or Google, in a “side-by-side, blind-test comparison.”  Microsoft claims that people choose Bing by a two-to-one ratio.  I tried it with chemical terms and picked Google four out of five times, with the other result being a draw.  Similarly, a professor from Yale Law School recruited people to make the comparison and found that Google was the preferred engine unless the search terms were selected by Microsoft (http://www.slate.com/blogs/future_tense/2013/10/02/bing_vs_google_professor_fact_checks_bing_it_on_challenge_microsoft_loses.html).  Microsoft claims that the professor’s experiment was flawed, but identifies no specific problems.

Google continues to improve its search engine.  In my last column I talked about Knowledge Graph, the sidebar on some search results that helps to disambiguate searches when more than one possible term corresponds to the query (http://www.ccce.divched.org/P1Fall2012CCCENL).  Google announced last month that it had quietly rolled out a new search algorithm, called "Hummingbird" (http://www.infoworld.com/t/search-engines/googles-new-hummingbird-algorithm-about-queries-not-just-seo-227732).  Google constantly makes small improvements to the search algorithm, but Hummingbird is billed as a more significant change.

One new feature is the comparison tool.  For example, if you search for “butter vs. olive oil,” the first result is a convenient table comparing the two.  Apparently, this is still being implemented because a chemical topic, like “benzene vs. cyclohexane,” does not display a similar table (although it does return sites that will make this comparison).  For this kind of comparison, the Wolfram Alpha search engine still provides much better results for chemists (http://www.wolframalpha.com/).

Many users pose search queries as natural language questions, but in the past search engines have responded by parsing search phrases one word at a time.  The new Hummingbird algorithm is an artificial intelligence program tuned to parsing entire questions, not individual words.  For many searchers, especially those who do not construct sophisticated query strings, this should give far better results.  Hummingbird is also supposed to see a search in the context of a previous search.  For example, if you search for "pictures of Great Danes" and then "common health problems," Google says it will understand that you're looking for common health problems associated with Great Danes (http://www.entrepreneur.com/article/228620#ixzz2gNwHUfee).

 Google has announced a new and more sophisticated way to search for images (http://www.google.com/insidesearch/features/images/searchbyimage.html).  There was not enough time to explore this facility in any detail, but to get an idea of how it might work, do the following:

  • Go to the Google search page, type in biphenyl, and click the image search on the toolbar.
  • Click on a simple image for biphenyl in the search results.
  • This should bring up a gray box with a large version of the image you originally clicked on.
  • Now click on “search by image” under the title line on the right in the gray box.
  • Now click on the line “search visually similar images” several lines down in the results, and you should see a whole page of compounds that have structures that are similar to biphenyl.

This may be a useful capability if you are looking for suggestions of compounds that have a structure similar to a known compound.

 The fact that Google is still popular does not tell the whole story.  Because chemists search for unusual terms, the crucial question that determines the best engine is the size of the search engine index.  Remember that search engines do not directly search the WWW.  Instead, automated programs called netbots move around the web and collect information, which is then stored in an index.  When a chemist does a search, he or she is really searching that index.  The larger the index, the more likely it is that it will contain the scientific information that the chemist is searching for. 

Maurice de Kunder estimates that as of September, 2013, the indexed Web maintained by Google contains at least 3.84 billion pages (http://www.worldwidewebsize.com/).  This is also probably a good indication of the size of the accessible web.  The total size of the WWW is difficult to guess because much of it is dark, that is, it isn’t easily accessible by search engine netbots.  Tammy Everts estimated that in 2012, the average web page was over a Megabyte in size (http://www.webperformancetoday.com/2012/05/24/average-web-page-size-1-mb/).  This suggests that the accessible web is on the order of a few petabytes.  De Kunder estimates that Google uses an index about four times as large as Bing, which is another reason chemists should expect better results from Google.
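As a rough check on this estimate, the arithmetic can be written out.  The short Python sketch below simply multiplies the two figures quoted above (3.84 billion pages and about one megabyte per page, taken here as exactly 10^6 bytes, which is an assumption).  The function name is illustrative only.

```python
def accessible_web_petabytes(pages, bytes_per_page):
    """Estimate a total size in petabytes (1 PB = 10**15 bytes)."""
    return pages * bytes_per_page / 1e15


# Figures quoted in the text: 3.84 billion indexed pages, roughly 1 MB per page.
estimate = accessible_web_petabytes(3.84e9, 1e6)
print(f"{estimate:.2f} PB")
```

With these inputs the product comes to a little under 4 petabytes, so the petabyte scale is the right order of magnitude even though both the page count and the page size are only estimates.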

Looking at de Kunder’s data makes it clear that search engines must deal with huge amounts of data.  The following comparisons may help to clarify the size of a petabyte (http://groups.yahoo.com/neo/groups/SUNNETmedia/conversations/topics/816).  One petabyte of data would fill 250,000 DVDs or 20 million four-drawer file cabinets.  This seems like an impossibly large amount of data, but Google currently processes about 20 petabytes of data per day and Facebook currently stores about 100 petabytes of data.  The entire written works of mankind from the beginning of recorded history in all languages is estimated to represent about 50 petabytes.

Even a petabyte is small compared with the vast amounts of information in circulation.  According to researchers at UC-San Diego, Americans consumed about 3.6 zettabytes of information in 2008, where a zettabyte is 1 x 10^21 bytes of data.  David Weinberger suggests another way to visualize these very large units [1].  A digital version of War and Peace would be about 2 megabytes or 1296 pages in a traditional book, so one zettabyte would equal 5 x 10^14 copies of War and Peace.  If those copies were printed and stacked, it would take a photon of light travelling at 186,000 miles per second 2.9 days to go from the top to the bottom of the stack.
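The figures in this comparison can be checked with a few lines of Python.  The sketch below uses only the numbers quoted above, plus two standard conversions (86,400 seconds per day and 63,360 inches per mile).  The thickness per copy is not given in the source; it is derived here purely as a consistency check.

```python
ZETTABYTE = 1e21             # bytes
WAR_AND_PEACE = 2e6          # bytes, about 2 megabytes per digital copy
LIGHT_SPEED = 186_000        # miles per second
SECONDS_PER_DAY = 86_400
INCHES_PER_MILE = 63_360

# Number of copies of War and Peace that fit in one zettabyte.
copies = ZETTABYTE / WAR_AND_PEACE

# Distance light travels in 2.9 days, i.e. the quoted height of the stack.
stack_miles = LIGHT_SPEED * 2.9 * SECONDS_PER_DAY

# Thickness per printed copy implied by those two figures.
inches_per_copy = stack_miles * INCHES_PER_MILE / copies

print(f"copies: {copies:.1e}")
print(f"stack height: {stack_miles:.2e} miles")
print(f"implied thickness per copy: {inches_per_copy:.1f} inches")
```

The copy count reproduces the 5 x 10^14 figure, and the stack height works out to tens of billions of miles, which implies a printed copy a little under six inches thick.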

Part of the reason for this incredible accumulation of data is that computing resources are becoming much cheaper and more powerful.  In 1980 a terabyte of disk storage cost $14 million; now it costs about $30.  Not too long ago, only a few individuals or companies had access to supercomputers; now Amazon or Google will “rent” a cloud-based supercomputer cluster for only a few hundred dollars an hour.  Social networks are also contributing massive amounts of data.  Twitter generates more than 7 Terabytes (TB) a day; Facebook more than 10 TBs; and some enterprises already store data in the petabyte range.  (A Terabyte is about 1 x 10^12 bytes.)  This leads to the question of how search engines deal with these huge quantities of data.

 The problem is not just the amount of data.  IBM describes the current information environment as The Three V’s, volume, variety, and velocity [2].  The sheer volume of stored data is exploding.  IBM predicts that there will be 35 zettabytes stored by 2020, and the amount of data available is growing by about 60% every year.   The velocity of data depends on not just the speed at which the data is flowing but also the pace at which it must be collected, analyzed, and retrieved.  And finally, this data comes in a bewildering variety of structured and unstructured formats.  Whereas traditional data warehouses use a relational database (like Excel rows and columns), search engines must also handle unstructured data in non-relational databases, sometimes called NoSQL.  Unstructured data is often text-based and may include large amounts of metadata.  The combination of the three V’s is often described as “Big Data.”

Search engine providers long ago realized that new computational methods are needed to analyze Big Data.  In order to power its searches, Google developed a strategy called MapReduce.  A search is divided into many small components, each of which is assigned to one of the available processors for execution (Google has over a million servers) and then the results from each processor are recombined.  The most popular software to accomplish this is a program called Hadoop, first developed in 2005 by the Apache Software Foundation.  Doug Cutting, who was one of the developers, named the program after his son's toy elephant.  There are currently six different freeware versions of Hadoop available.  Hadoop is designed to collect data, even if it does not fit nicely into tables, distribute a query across a large number of separate computer processors, and then combine the results into a single answer set in order to deliver results in almost real time.
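The divide-and-recombine idea can be illustrated with a toy example.  The Python sketch below is not Hadoop code and does not reflect how Google implements search; it is a minimal single-machine illustration in which a thread pool stands in for separate processors, the "index" is a few hypothetical dictionaries of document text, and all function names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def map_search(shard, term):
    """Map step: search one shard and return the IDs of documents containing the term."""
    term = term.lower()
    return [doc_id for doc_id, text in shard.items() if term in text.lower().split()]


def reduce_hits(partial_results):
    """Reduce step: recombine the per-shard hits into a single sorted answer set."""
    merged = []
    for hits in partial_results:
        merged.extend(hits)
    return sorted(merged)


def search(shards, term, workers=4):
    """Send the same query to every shard in parallel, then combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda shard: map_search(shard, term), shards)
        return reduce_hits(partials)


# Hypothetical index split into two shards (document ID -> text).
shards = [
    {1: "benzene is aromatic", 2: "cyclohexane is not aromatic"},
    {3: "Benzene ring substitution", 4: "biphenyl has two rings"},
]
print(search(shards, "benzene"))
```

Each shard is searched independently, so adding shards and workers spreads the work out without changing the map or reduce functions, which is the property that makes this style attractive for very large indexes.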

Are Big Data procedures relevant for Chemists?

Although chemical applications don’t usually involve data sets that are as big as those described above, Big Data tools can be useful even for moderately large data sets.  For example, some chemical literature searches and Quantitative Structure–Activity Relationship (QSAR) models can benefit from the application of Hadoop and similar techniques.  The Royal Society of Chemistry has digitized all of the articles from their journals going back to 1841 and plans to make this database available (http://www.slideshare.net/AntonyWilliams/digitally-enabling-the-rsc-archive?from_search=1).  The goal is to extract reactions, spectra, data, and as many small molecules as possible.  It is hoped that processing this information with Big Data tools will promote the discovery of new relationships that had previously been overlooked.

 Chemical search services, like Chemical Abstracts, are now beginning to use Hadoop to work with datasets that are too large for conventional methods.  Anthony Trippe writes that those who work with chemical patents have had to deal with Big Data even before the name became popular (http://www.patinformatics.com/blog/first-look-new-stn-big-data-creates-chemistry-without-limits/).  He points out that,  “The universe of available patent documents, worldwide, is well over 80 million, and in the CAS world of chemistry, the running count of known organic, and inorganic molecules currently stands at over 73 million substances, not including an additional nearly 65 million sequences. These types of numbers, as well as the interconnectedness of the data, certainly allow patent, and chemical information to qualify as sources of Big Data.” 

 In July of this year, FIZ Karlsruhe and Chemical Abstracts Service (CAS) announced the launch of Version One of a new STN platform.  Trippe reports that this new version of STN is powered by Hadoop.  Up until recently, data analysts sometimes found it to be difficult to run a broad structure search for patents dealing with compounds of interest, but Trippe says the new platform “. . . puts the entire collection of chemical, and patent data at their fingertips, and allows them to manipulate it at will.”

To demonstrate the power of the new STN powered by Hadoop, Trippe describes how a pharmaceutical company could search for compounds that might function like Januvia, a member of a new class of anti-diabetic agents that inhibit Dipeptidyl Peptidase-IV (DPP4).  This requires a search for all compounds that are structurally similar to sitagliptin (the free-base of Januvia), and have been studied in conjunction with DPP4.  Sitagliptin has two ring systems, one a phenyl, and the other a 6,5 system with four nitrogen atoms.  Any compound that involves this basic skeleton would be of interest.  This search involved more than 10 million substances, and all of the references related to these items.  A query of this magnitude would have been very difficult to run on the old version of STN.  It runs in less than 30 seconds on the new STN, and produces almost 2.5 million structures.  Trippe concludes by writing, “Combining the breadth, and depth of the collections available, with the deep indexing that has been created by the database producers, generates a powerful combination that opens the door to exploring chemistry, and patents in a way that has never existed before.”

 

Conclusion

 Search engines have become such a ubiquitous part of everyday life that it is easy to assume that there are few changes occurring.  This is far from the truth.  Google is leading the way towards new methods for search, and when these become fully functional they may be very useful to chemists.  Beyond this, the techniques that are being developed for search are opening new avenues for science.  In particular, the application of ideas from Big Data is already playing a significant role in Chemistry, and may become even more important in the future. 

 Big Data is important not just because of size but also because of how it connects data, people, and information structures.  It enables us to see patterns that weren’t visible before.  Major companies, like Google, Amazon, and Netflix, are already using large data sets to offer new services to their customers, and the National Security Agency is also using big data for less user-friendly purposes.  danah boyd (sic) summarizes the potential of Big Data by writing, “Just as Ford changed the way we make cars – and then transformed work itself - Big Data has emerged as a system of knowledge that is already changing the objects of knowledge, while also having the power to inform how we understand human networks and community (http://softwarestudies.com/cultural_analytics/Six_Provocations_for_Big_Data.pdf).”   boyd adds, “And finally, how will the harvesting of Big Data change the meaning of learning and what new possibilities and limitations may come from these systems of knowing?”

 

References:

1.       Weinberger, D., Too Big to Know: Rethinking Knowledge Now That the Facts Aren't the Facts, Experts Are Everywhere, and the Smartest Person in the Room Is the Room. 2012, New York, NY: Basic Books.

2.       Zikopoulos, P. C., et al., Understanding Big Data. 2012, New York, NY: McGraw-Hill.

Comments

Economics and Informatics

First, thanks for the nice paper.  A lot to think about has been reduced to something closer to manageable.

I wonder what is going on with search engines which are not public, e.g. pay for use?  Just the software available on things like MS and NMR has moved ahead massively over the years and it often ran on whatever onboard system the instrument ran on.  The other side is of course the data itself.  Many large companies have a lot of proprietary data of their own.  It would seem a logical progression that a search engine could do much more if it could use that data and look at results out in the public internet?

In short, I am just wondering what search capabilities money can buy these days?

Thanks

<Richard>

 

 

 

Chemistry specific search engines

All public and proprietary chemical databases have associated search engines, search interfaces and result manipulation tools. A survey of those would be a great thing to go with Prof. Pence's article, which focuses on capabilities of general purpose search engines.

Is such a (non-marketing) survey available?

Thanks

~Milind

Chemistry specific search engines

I agree that this would be an excellent resource, but I am not aware of any such surveys.

 

search engines

Hi Harry,

An interesting paper and I wonder if you find Google Scholar better than Google search for some searches.  Do you get better results using chemical structures, names or formulas?

Thanks,

Brian

Google Scholar

Dear Brian,

For serious searches I usually use both regular Google and Google Scholar. For straight chemistry I don't usually find much in Google Scholar, since most of the best stuff is locked up in walled gardens. The open access articles seem to come up even in regular Google (if you can design a smart search). I haven't tried chemical structures or formulas on Scholar.

I share your concern for the

I share your concern for the public and am dismayed that even science-oriented magazines, such as Discover, Scientific American, and Popular Science, have amazingly few articles on chemistry even when noted chemists are on their boards.  Few popular authors are concerned with chemistry; it seems we could look to physicists, who are very effective in publicizing their big projects.

search engines

Good question, Brian. Unfortunately, I don't have a good answer, because I have been working from a different direction.

When I started looking at Big Data, my main concern was to discover who are the influencers who determine what the general public thinks about chemistry. Is it the ACS, some other professional lobbying organization, or is it some group of people that most chemists are not aware of? It is no secret that many people have an unfavorable attitude towards Chemistry. On a whole range of issues, from fracking to pollution to genetically modified foods, we take a hit. Some criticism is fair; some is not. Who is telling the Chemistry story to the public? The ACS likes to focus on social media outlets that it controls, so that doesn't reach Joe Sixpack. If we could see how public opinion about chemistry is formed, we might have a better opportunity to make our case. This is the sort of problem that students, who are very much into social media, might enjoy looking at.

The problem is to find tags or key words that are specific to chemistry. For example, chemists often use ACS to identify our society, but it could also refer to the American Cancer Society, the American College of Surgeons, or the American Cetacean Society. The term chemistry is often used in posts about love affairs. It is by no means a trivial problem, but I slog bravely onward. Hopefully I can get to your very interesting question sometime in the future.
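One simple way to narrow an ambiguous tag is to require that it appear together with other chemistry-related words.  The Python sketch below is only an illustration of that idea, not a method that was actually used; the list of context terms and the function names are hypothetical, and a real study would need a far more careful vocabulary.

```python
import re

# Hypothetical context words suggesting that a post is about chemistry as a science.
CONTEXT_TERMS = {"chemist", "chemists", "chemical", "chemicals", "molecule", "reaction", "lab"}


def mentions_acs(text):
    """True if the post contains the standalone, upper-case token 'ACS'."""
    return re.search(r"\bACS\b", text) is not None


def looks_like_chemistry(text):
    """Keep a post only if 'ACS' co-occurs with at least one context term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return mentions_acs(text) and bool(words & CONTEXT_TERMS)


posts = [
    "The ACS national meeting had a great session on reaction mechanisms",
    "ACS funds new cancer screening research",
    "Great chemistry between the two leads in that movie",
]
for post in posts:
    print(looks_like_chemistry(post), "-", post)
```

The word "chemistry" is deliberately left out of the context list because of its everyday romantic sense, which shows how quickly such hand-built filters run into the ambiguity described above.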

Harry

 

Big Data and Knime

Knime has been used for Cheminformatics for quite some time now, and Knime also has Big Data extensions (reportedly!). I would look at "big" not only from the size/complexity point of view, but also from the continuous "big" role over time of raw data in the domain under consideration. I don't know whether this applies to Cheminformatics...

Big Data and Knime

Thanks for your suggestion about Knime. I plan to look into that soon.

I originally intended to focus my article mainly on Big Data, but then changed my mind because I ran across some interesting developments in traditional search and also because I thought that starting a discussion on the topic of Big Data in Chemistry might be a little premature. The comments are leading me to conclude that Big Data and related topics are on the radar for at least a number of chemists, and so I'm glad that I included at least a teaser on the topic.

 

Harry

Apart from the three Vs mentioned above....

... there are two more Vs involved: Visualization and Verifiability of results. Users tell me that verifiability (omissions and false positives) is a big question mark.

Then, there is the search engine user interface: which engine provides best facilities to search for "molecules with a six member carbon ring and a five member carbon ring separated by 3 carbon atoms"? I am not sure how I would do it with Google, or with Bing for that matter.

Google did not endear itself to the cheminformatics community the way they went after Chmoogle.

But anyway, the search engines discussed here are all general purpose engines. What is the development on the special purpose search engines (like Hadoop on STN)? (I have never used any in the last 5 years)

Which search engines are good for searching for data on my desktop or company server?

I hope I have not gone off the subject.

Regards

I didn't come up with only 3

I didn't come up with only 3 Vs on my own; that is the IBM definition. I agree that Visualization and Verifiability of results are related to Big Data, but I don't think that either is required to define Big Data. These are related to how you use Big Data after you have identified it and try to understand it.

As far as being able to search for "molecules with a six member carbon ring and a five member carbon ring separated by 3 carbon atoms" I am still looking for such an engine myself. The closest I have found is Google's new image search, which will return things that look similar. You have not gone off the subject; you are thinking the same way that I am. I don't know of anything that will do what you are asking, but I thought if I shared my imperfect understanding, someone on the list would trump my ace and go beyond what I knew - - - therefore educating me as well as many other chemists. Worst case, perhaps my tentative comments will encourage someone else to create the product we are all looking for.

 

 

Actually, I was not thinking in terms of Big Data...

...when I mentioned the two additional Vs, but in terms of general search results. Visualization of results is difficult unless the search engine provides additional tools to the user to weed out the unimportant results, and get a result set that is small enough to comprehend... As for verification, there is just no way to ensure it, as far as I have been given to understand.

Prof. Pence, your article is indeed a lucid and comprehensive one, and I am sure would be useful to professionals from other fields too. Thanks.

Two more Vs

I totally agree that in many areas, not just molecular structure, we are entering a time where data visualization is not only essential but when we also have many more powerful tools to do this than ever before. All you have to do is run a Google image search on visualization tools and you will begin to see how true this is. I think we are coming to the point where we need to think more about how to introduce these tools into our teaching. For example, I found the article Seven Data Visualization Tools You Can Use in the Classroom to be very thought provoking. (http://www.teachthought.com/technology/5-free-data-visualization-tools-you-can-use-in-the-classroo/)

The problem of verification is also key. As we become more dependent upon our data processing tools, it also becomes harder to make sure that what they are telling us is really what is happening. I have already found this to be frustrating as I was using Big Data tools, and I fear that it will only become worse as the tools become more powerful.

Thank you for your kind comments on my paper.

Harry

What is Big?

Dear Harry,

In this article you start by discussing Google and then introduce big data, and ask if big data procedures are relevant to chemists? This has me asking several questions.  First, does Google make big small? When I use Google I feel it is personal, something I can manage, but when I think of big data, I think of Microsoft Research 4th Paradigm type stuff, which is a bit intimidating, and involves resources above and beyond anything available in my classroom. But Google, I can use that, my students use that. Is the trick to getting big data into education to make it look small?

My second thought on "What is big?", deals with accessibility. Is PubChem big data? Is ChemSpider big data? I am not only asking this from the data-set size, which I understand is what is meant by big data, but also from the accessibility side. When I was an undergraduate student and had to identify an unknown we used resources like the Merck Manual and the CRC Handbook. I suspect that is still done in many schools today, even though these online resources could easily be integrated into the curriculum of high schools, community colleges, PUIs and R1 institutions. Kids can even pull the data up on their cell phones. Are people on the list using these technologies in today's classrooms?

Does allowing more people access to the data make it bigger? I would really like to hear more on how big data can become of practical value to this community of chemistry educators.

Thanks for the great article.
Bob

What is Big?

I loved your suggestion that the trick to getting big data into education is to make it look small. I couldn't agree more! Let me be honest and admit that I am just starting on Big Data, and it still looks very large to me. Perhaps as I learn more I can find an answer to the question that both of us are asking.

I have done my humble best to make more chemical educators aware of the new resources that are available. For example, see

Bennett and Pence, J. Chem. Educ., 2011, 88 (6), pp 761–763
Williams and Pence, J. Chem. Educ., 2011, 88 (6), pp 683–686
Pence and Williams, J. Chem. Educ., 2010, 87 (11), pp 1123–1124
Pence and Pence, J. Chem. Educ., 2008, 85 (10), p 1449

I can only hope that someone is listening.

I can only run as fast as I can run. When I find out, I will do my best to share it.

 

Use of data in chemical education

In my experience web-based data collection (either big or small) AND use of that data can impact greatly how one teaches. It is like opening your eyes after years of working blind. The first step in getting chemical educators to use data is for them to appreciate how it can inform their teaching, starting with small data acquired from their own students. These days many chemical educators have access to such data (from Moodle quizzes or something similar), and a flick through the results of the quiz assigned as homework can influence the emphasis in a classroom presentation. However, in my experience dealing with hundreds of teachers, trying to convince them that their response data is relevant or even vaguely interesting is an uphill battle. Let us hope that changes because I think that the potential for either big or small data from students to influence the direction that teaching takes is huge.

Use of data in chemical education

I totally agree that we need to add this type of material to our lessons. Keep up the good fight!

Contact me after the conference is over if you think there are some projects that we might cooperate on.

Big Data for chemistry is not all that big

While chemistry datasets are increasing in size, I'm not convinced that chemistry has big data problems the way that biology or search, etc., have big data problems. With a few exceptions (the GDB series of databases is one example) most chemistry problems can still be solved on local machines or clusters of them. It is certainly interesting to see the use of Hadoop in cheminformatics applications (http://blog.rguha.net/?p=748, http://blog.rguha.net/?p=325, http://blog.rguha.net/?p=297) but fundamentally most chemistry Hadoop applications are 'chunking' procedures - conceptually identical to running multiple jobs on a cluster. I don't know the details of the STN Hadoop implementation, but it would be interesting to find out.

What would be really cool is to see Hadoop (or other distributed computing paradigms) applied at the algorithmic level for chemical problems, that go beyond mere input data chunking.

Starting_Material_Search and Retrosynthesis graphs: BigData?

I understood from some process chemistry practitioners (industrial -- non-academic) a few years ago that both the problems above involve massive usage and recursive reusage of lots of data, and both require heavy parallelism... would they be candidates for BigData?

~Milind

seeking further understanding

Dear Rajarshi,

I am probably speaking for most of the people on this list, who are chemical educators, but I looked at your blog, and must confess, at this point in time, that most of it is over my head. But as an educator, I want to teach my students the best I can, and give them the best opportunities I can. I was wondering if you could take a minute and expound on your last statement, "What would be really cool is to see Hadoop (or other distributed computing paradigms) applied at the algorithmic level for chemical problems, that go beyond mere input data chunking", and parse this in a way where I could explain the significance of this to my students. Could you sort of do this at a level you would do it if the BBC or CNN were to interview you. What do you mean by chemical problems? Data chunking?...

With thanks and respect,
Bob

parallel cheminformatics

So, many Hadoop applications are implemented in a fashion that is very similar to using a cluster of computers. You have a program that performs some analysis and you divide the input data into chunks. So a single SD file of 10,000 compounds might be divided into 100 SD files of 100 compounds each. In a traditional cluster setting, each 100-molecule SD file is processed on a separate node, using the same program. That is, only the input data changes. As a result you can run the same program at the same time for 100 SD files. This gives you a (theoretical) 100x speedup.

Many Hadoop programs are written in a similar fashion - the advantage of the Hadoop version is that the job management (i.e., sending a chunk of input to a node to be processed) is handled by the Hadoop infrastructure, rather than by the user. So the user is only responsible for writing the code that will analyze a chunk of data.
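The chunking approach described above can be sketched in a few lines of Python.  This is not Hadoop code: a thread pool stands in for the cluster nodes, the per-chunk "analysis" is a hypothetical placeholder that just returns each record's title line, and all function names are invented for the example.  The only format detail relied on is that each record in an SD file ends with a line containing $$$$.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sd_records(sd_text):
    """Split SD file text into records; each record is terminated by a '$$$$' line."""
    records, current = [], []
    for line in sd_text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records  # any text after the last '$$$$' is ignored


def make_chunks(records, size):
    """Group records into chunks of at most `size` records each."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def analyze_chunk(chunk):
    """Hypothetical stand-in for the real analysis: return each record's title line."""
    return [record.splitlines()[0] if record else "" for record in chunk]


def run_chunked(sd_text, chunk_size, workers=4):
    """Run the same analysis on every chunk in parallel; only the input differs."""
    chunks = make_chunks(split_sd_records(sd_text), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(analyze_chunk, chunks))
    return [title for result in per_chunk for title in result]


# Five fake records (not valid molecules) just to exercise the splitting logic.
fake_sd = "".join(f"mol{i}\n  placeholder\n\nM  END\n$$$$\n" for i in range(5))
print(run_chunked(fake_sd, chunk_size=2))
```

The user-written part is analyze_chunk; the splitting and scheduling in run_chunked corresponds to the job management that a framework would otherwise take care of.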

When I refer to Hadoop at the algorithmic level, I mean that rather than the input being chunked, the problem itself can be reframed so that different parts can be run in parallel. For example, for graph theoretical problems, many graph algorithms can be parallelized in Hadoop. In such a case, there is just one input - a graph - and the analysis itself (say finding clusters) can operate in parallel on that input. The problem is that for this to be cost-effective, you need to be working on massive graphs. Chemical structures are not massive, so we don't benefit from a Hadoop version of methods that operate on chemical graphs.
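To make the distinction concrete, here is a toy Python sketch of one commonly described way to recast a graph problem in map/reduce form: finding connected components by repeatedly propagating the smallest node label to neighbours.  It is offered only as an illustration of algorithm-level parallelism, it is not Hadoop code, it runs in a single process, and the function names are hypothetical.  In a real distributed setting the map and reduce steps of each round would be spread over many machines.

```python
from collections import defaultdict


def map_phase(edges, labels):
    """Map: every edge sends each endpoint's current label to the other endpoint."""
    for u, v in edges:
        yield u, labels[v]
        yield v, labels[u]
    # Each node also keeps its own label, so isolated nodes are not lost.
    for node, label in labels.items():
        yield node, label


def reduce_phase(pairs):
    """Reduce: each node adopts the smallest label it received."""
    received = defaultdict(list)
    for node, label in pairs:
        received[node].append(label)
    return {node: min(candidates) for node, candidates in received.items()}


def connected_components(nodes, edges):
    """Repeat map/reduce rounds until no label changes."""
    labels = {node: node for node in nodes}
    while True:
        new_labels = reduce_phase(map_phase(edges, labels))
        if new_labels == labels:
            return labels
        labels = new_labels


# One graph as the single input: a chain 1-2-3, a pair 4-5, and an isolated node 6.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (2, 3), (4, 5)]
print(connected_components(nodes, edges))
```

Here the input is never split into independent jobs; every round works on the whole graph, and nodes that end up sharing a label belong to the same cluster.  As noted above, the overhead of repeated rounds only pays off when the graph is very large.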

So my original statement really asks whether, for a given input, there are cheminformatics algorithms and problems that can employ a divide and conquer approach on that input.

In general, I have not been able to come up with cheminformatics algorithms that are amenable to the map/reduce style parallelization implemented in Hadoop, but would love to hear of other people's experiences. However, recently, my colleagues and I did come up with a possible application, viz., bioisostere analysis. I have some slides that provide a summary of this (as well as big data for chemistry in general).

parallel cheminformatics

The other thing that Hadoop does is to read files that are not structured. Would this capability be more important for cheminformatics?

Harry

parallel cheminformatics

Hi Harry, but this would not be specific to Hadoop, right? In general, Hadoop has no explicit support for chemical file formats, and so Hadoop users must implement that before working with stuff like SD files, etc.

parallel cheminformatics

I believe that you are correct, but I have not gone this far yet in my investigation of Big Data and Hadoop.

Harry

Big Data for Chemistry is not that new

I agree that Big Data is not critical for most Chemistry applications, but there are some areas where it is already being used, for example, drug development, medicine, and environmental chemistry. As sensors of various kinds become increasingly inexpensive, the amount of data will surely increase to the point where Big Data software becomes useful.

I also agree with the comment that, "What would be really cool is to see Hadoop (or other distributed computing paradigms) applied at the algorithmic level for chemical problems, that go beyond mere input data chunking." So far I haven't seen much of this, but I hoped that if anyone is doing this sort of thing and read my comments they might share their information.
