Robert edited this page May 22, 2015 · 11 revisions

Setting up the pipeline for collecting and processing article data, references and citations from PubMed is quite a manual and fiddly process and is ripe for extracting to a Node.js module. Issue #8 has been opened.

The pipeline comprises various functions that coordinate three classes. The main functions are batchInsert, batchReferencesCollect and batchCitationsCollect, which call articleData, update_references and collect_citations respectively. The idea is that data collection can be done in stages from the command line by simply typing node batchInsert or node batchReferencesCollect. These functions coordinate the classes CitationController, DocumentController and ReferencesController. An instance of a class manages requests to PubMed and inserts documents into MongoDB via ArticleModel and/or JournalModel (see the Schemas page).

###How is the data collected?

This assumes there is one document in the database that at a bare minimum looks like this:

```
{
    "_id" : ObjectId("54170353222a2s7c22935274"),
    "__v" : 0,
    "is_ref_of" : [ ],
    "pmid" : 20614589,
    "references" : [ ]
}
```

From this starting point either the full article data (title, author, doi, year), the references or the citations can be found. Let's start with references so we can build up a large number of documents quickly.

####Collecting references

batchReferencesCollect is simply a helper that sets an interval between calls to update_references. The interval is used to comply with PubMed's request to stick to no more than three requests a second.
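A hypothetical sketch of that helper (the function names come from the text above; the exact timing value is an assumption derived from the three-requests-a-second limit):

```javascript
// Space out calls to update_references so we stay within PubMed's
// limit of three requests a second.
const REQUESTS_PER_SECOND = 3;
const INTERVAL_MS = Math.ceil(1000 / REQUESTS_PER_SECOND); // 334ms between calls

function batchReferencesCollect(updateReferences) {
  // Fire update_references at most three times a second.
  return setInterval(updateReferences, INTERVAL_MS);
}
```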

update_references creates an instance of a ReferencesController and uses the async.waterfall function to organise the callbacks.
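For readers unfamiliar with the pattern: async.waterfall runs an array of tasks in order, passing each task's results to the next, and stops at the first error. A minimal stand-in, shown only to illustrate the callback chaining (the real code uses the async library itself):

```javascript
// Minimal stand-in for async.waterfall: each task receives the previous
// task's results plus a callback; an error short-circuits to done.
function waterfall(tasks, done) {
  function next(err) {
    var args = Array.prototype.slice.call(arguments, 1);
    if (err || tasks.length === 0) {
      return done.apply(null, [err].concat(args));
    }
    var task = tasks.shift();
    task.apply(null, args.concat(next));
  }
  next(null);
}

// Usage shaped like update_references (step names are assumptions):
waterfall([
  function getPmid(cb) { cb(null, 20614589); },
  function fetchReferences(pmid, cb) { cb(null, pmid, '<xml/>'); }
], function (err, pmid, xml) {
  // handle err, or carry on with pmid and xml
});
```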

On typing node batchReferencesCollect from the pipeline directory, the following things happen.

  1. update_references starts by calling the getPmid method of referencesController, which queries the database for a document that matches the following conditions: the size of the references array is zero and the pmid isn't in the ignorePmid array. A pmid is ignored if its value is zero or if the request to PubMed comes back saying there are no references. The ignorePmid array allows the getPmid query to find the next pmid.

  2. fetchReferences takes the pmid, adds it to the options object used by the request call, and makes a request to http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?

  3. The XML response (an example is here) is passed to referencesExist and checked for a LinkName called pubmed_pubmed_refs. If found, the Links are returned and the waterfall carries on; if not, false is returned, the pmid is added to the ignorePmid array and the waterfall chain is broken at this point. After the set interval it starts back again at step 1.

  4. The pmids from pubmed_pubmed_refs Links are then extracted into a pmids array in the parseResponse method.

  5. The pmids are then added to the database, or updated if they already exist, in the upsertDocs method. As each pmid is added or updated, the ObjectId of the original document is added to the is_ref_of field (which really should be called citations) of the document being updated or added. This sets up the next step of populating the references field of the original document.

  6. Finally the populateReferences method of ArticleModel is called. This method searches for all documents that have the ObjectId of the original article in the is_ref_of field and adds the ObjectIds from the results (the documents that were added in the previous step) to the references field of the original document. The reference_count field is also updated at this point.
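Steps 1, 3 and 4 above can be sketched roughly as follows. This is a hedged reconstruction, not the actual code: the function names come from the text, the query uses standard MongoDB operators, and the elink response shape is simplified from PubMed's real XML.

```javascript
// Step 1: condition used by getPmid -- an empty references array,
// and a pmid we haven't already added to ignorePmid.
function getPmidQuery(ignorePmid) {
  return { references: { $size: 0 }, pmid: { $nin: ignorePmid } };
}

// Step 3: look through the parsed elink LinkSets for pubmed_pubmed_refs;
// return its Links if present, or false so the pmid can be ignored.
function referencesExist(linkSets) {
  for (var i = 0; i < linkSets.length; i++) {
    if (linkSets[i].LinkName === 'pubmed_pubmed_refs') {
      return linkSets[i].Link;
    }
  }
  return false;
}

// Step 4: extract the referenced pmids into a flat array of numbers.
function parseResponse(links) {
  return links.map(function (link) { return Number(link.Id); });
}
```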

####Collecting article data

Now that we have the pmids of an article's references we can retrieve the article data (title, author, journal, doi) and the all important pmcid from PubMed too.

The code for this part is a bit messier than the last, so hopefully I can make it clear what's happening. No excuses, but it seems I wrote this code before I knew what a callback was. Anyway, articleData jumps straight into creating a document to upload to the database, and I'm sorry.

Like in collecting references the articleData "helper" function uses an instance of DocumentController. The first call is to createDocument which does the following:

  1. gets a pmid for a document that has no title field and therefore no article data
  2. fetches the article data and checks whether the journal title already exists in the journals collection
  3. turns the XML response into a hash object (which might not be a good idea - needs more research)

Next, the document object is checked by the validate method for any undefined properties. This is not MongoDB validation but a method that iterates over the object, deleting properties that are undefined so that MongoDB doesn't throw any errors. No log is kept of the properties that come back undefined and are deleted prior to uploading to the database. This might cause a problem with step one of createDocument.
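A sketch of what that validate step might look like (an assumption based on the description above, not the actual method):

```javascript
// Walk the document and silently drop any property whose value is
// undefined before handing it to MongoDB. Note: nothing is logged,
// which is the gap mentioned above.
function validate(doc) {
  Object.keys(doc).forEach(function (key) {
    if (doc[key] === undefined) {
      delete doc[key];
    }
  });
  return doc;
}
```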

The updateDocument method adds the document to the database.

Once all the article data has been collected the free_access field can be set in the mongo shell by checking for a pmcid field and setting the free_access field to true.

```
db.articlemodels.find({pmcid: {$exists: true}}).forEach(
    function(article) {
        article.free_access = true;
        db.articlemodels.save(article);
    }
)
```

####Collecting citation data

TODO

####What can go wrong?

Massive TODO
