New website address!

Pending the (hopeful) defense of my dissertation on August 8th I’ll be starting a postdoc at the Lamont-Doherty Earth Observatory at Columbia University.  Since there’s generally a little downtime between turning in a thesis and defending it I’ve migrated this site to a server at Columbia University.  The site at students.washington.edu/bowmanjs will no longer be maintained, but all posts and comments can be found at the new site.  I hope to see you there!

Please note that the Facebook page associated with this site (The Deming Ecosystem in Antarctica) will no longer be maintained.

Jeff

http://ldeo.columbia.edu/~bowmanjs

Posted in Uncategorized | Leave a comment

Some thoughts on modeling

I’m not a modeler, but I played one once in grad school. Or at least that’s how I’m feeling at the moment. I’m currently working on the last chapter of my dissertation and it became necessary to explore the mechanisms underlying some empirical observations. This encouraged (forced?) me to undertake my first foray into modeling, developing a simple model of how a protein might evolve given a set of conditions and using it to predict what conditions might be responsible for the observed state. Because of this deviation from my normal observational and experimental work I’ve been thinking a lot about what models actually are and what they’re useful for. Considering the degree to which everyone is exposed to models on a daily basis it’s troubling how little people, including most scientists, think about them.

Currently scientific modeling is so compartmentalized that “modeler” is considered an adequate description of someone’s research area, as in “oh, she’s a modeler”. That label is at the same time complimentary and derogatory. On the one hand it implies some mastery of computer programming, mathematics, statistics, and probably a healthy publication list. On the other hand it can imply a reductionist scientific philosophy (wherein interesting phenomena are reduced to simple, predictable “parameters”) and a tendency to see a better model as the overarching scientific goal. I once went to a talk where a climate modeler lectured the audience for 45 minutes on how to “be a good observational scientist” for the sake of model improvement. It was too easy to walk away from the talk with the sense that the modeler’s primary goal was not an improved understanding of the system but the development of a model that best reproduces the observed state (and thus might reasonably predict a future state). However, that isn’t necessarily a bad thing, so long as the limitations are understood.

I see models as falling into two categories, which I’ll call mechanistic and predictive. A predictive model, such as a global climate model or a protein structural prediction model, isn’t about understanding the system. It’s about accurately predicting the end state. Any natural process contains interesting dynamics that have no impact on the outcome. Representing all of them in a model is time consuming and computationally expensive. Consider a model designed to predict how well an automobile functions, where “well” is defined as driving in an efficient and safe fashion. Lots of phenomena contribute to this: the fuel delivery system, the various engine components, the exhaust system, the brakes, etc. It might be very interesting that the driver chooses to listen to the radio or run the AC while driving. Even though this has an impact on the other subsystems in the car, it doesn’t have a direct impact on the car running well. If we wanted to represent the operation of the vehicle as a model we would probably decline to include these phenomena as variables. To do so would require developing equations that predict their operation in relation to one another, and solving those equations repeatedly every time the model is run, with no improvement in our ability to predict the car’s performance.

A mechanistic model on the other hand can be thought of as an inventory of all the tiny pieces of a system that contribute to the system’s operation. Yes, we can represent the performance of the car in the previous model perfectly well without including the radio or air conditioner, but if we want to explore how these systems interact with other subsystems they need to be included. For example, suppose a driver likes the interior of the car really, really cold and hopes to install an industrial grade air conditioner that exceeds the wattage of the alternator. This would certainly impact the performance of the vehicle but, since we opted not to include it in the predictive model of automobile performance, we’d never know it from running that model.
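To make the distinction concrete, here is a toy sketch of the car example (the formulas, numbers, and function names are entirely my own invention, for illustration only): a predictive model that ignores accessory loads, and a mechanistic one that includes them. The two agree until the air conditioner demands more than the alternator can supply, at which point only the mechanistic model notices.

```python
def predictive_mpg(engine_eff, speed):
    ## predictive model: only the variables judged to matter for the end state
    return engine_eff * 50.0 / (1 + 0.01 * speed)

def mechanistic_mpg(engine_eff, speed, ac_watts, alternator_watts):
    ## mechanistic model: accessory subsystems included, so we can ask
    ## "what if" questions about them
    overload = max(0.0, ac_watts - alternator_watts)
    penalty = 1 + overload / 5000.0  ## invented cost of overloading the alternator
    return engine_eff * 50.0 / ((1 + 0.01 * speed) * penalty)

## with a stock air conditioner the two models agree
stock = mechanistic_mpg(0.9, 60, ac_watts=500, alternator_watts=1000)

## with an industrial grade air conditioner only the mechanistic model
## sees the performance hit
industrial = mechanistic_mpg(0.9, 60, ac_watts=3000, alternator_watts=1000)
```

The predictive model is cheaper to run and just as accurate for the ordinary case; it simply can’t answer the industrial air conditioner question.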

The hard reality is that no one will ever raise enough money to hire enough researchers and purchase enough computer time to run mechanistic models of the Earth system, or even a local ecosystem. As valuable as that would be it’s just too difficult to do (without lots of time, people, and money). Thus developing useful predictive models depends on very careful selection of the parameters that will be included.

In predictive oceanographic models an oft-cited example is the representation of phytoplankton. These single-celled primary producers are responsible for a huge chunk of global carbon fixation and carbon sequestration. Often a single “average” set of nutrient requirements and carbon uptake and export rates is used for phytoplankton in climate and ecosystem models, despite the fact that there are thousands of different phytoplankton species, often with radically different nutrient requirements and uptake/export rates. Representing phytoplankton as a single variable is efficient, but can produce an erroneous result in a predictive model. Furthermore, such a predictive model is useless for experiments that explore ecosystem impacts of changing phytoplankton populations. However, it isn’t practical to represent thousands of phytoplankton species as separate variables. Not only would it be computationally inefficient, we don’t know enough about most of those species to estimate their nutrient requirements or uptake/export rates. So where do we draw the line? What about prokaryotes, which are even more diverse than phytoplankton? Viruses? Heterotrophic protists?
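A toy calculation shows why the “average phytoplankton” shortcut can bias a model. Suppose (my numbers are invented purely for illustration) three species take up a nutrient with Michaelis-Menten kinetics. Because uptake is a nonlinear function of the parameters, averaging the parameters first does not give the same community uptake as averaging the per-species uptake rates:

```python
## invented per-species Michaelis-Menten parameters
vmax = [1.0, 2.5, 0.4]   ## maximum uptake rates
ks = [0.1, 1.0, 0.05]    ## half-saturation constants
n = 0.2                  ## ambient nutrient concentration

def uptake(vmax, ks, n):
    return vmax * n / (ks + n)

## mean of the per-species uptake rates (the "real" community uptake)
community = sum(uptake(v, k, n) for v, k in zip(vmax, ks)) / len(vmax)

## one "average" phytoplankton: average the parameters, then compute uptake
averaged = uptake(sum(vmax) / len(vmax), sum(ks) / len(ks), n)
```

The gap between the two grows with the spread in the parameters, which is exactly the worry when thousands of radically different species are collapsed into one variable.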

Right now I’m on my way to a workshop at the Bigelow Laboratory for Ocean Sciences in Boothbay, Maine, for three days of exploration of this issue. A major question for the attendees is how much observing is required to get some of these parameters approximately right. How many phytoplankton does one need to sequence, for example, before we have a reasonable understanding of the genetic diversity (and thus metabolic potential) of this group? The workshop, sponsored by the Ocean Carbon and Biogeochemistry program of the US Carbon Cycle Science Program, was organized by the Woods Hole Oceanographic Institution, and the organizers did an excellent job of including students and postdocs in addition to senior researchers. Thanks for the opportunity to join the conversation and I’m looking forward to digging into these issues!


Making maps in R

I don’t have a lot of experience with geospatial data analysis, but following a recent cruise I had the need to plot some geospatial data.  The go-to program for this seems to be Matlab (or even ArcGIS, for those who are serious about their map making), but I do almost all of my plotting in R.  This sent me on a bit of an exploration of the map making options for R.  Although the maturity of the R geospatial libraries is far below that of commercial products, it’s possible to make some pretty decent plots.  Here’s a synopsis of the method I ended up using.

To plot data on a map you first need a map to work from.  A map comes in the form of a shape file, which is nothing more than a collection of lines or shapes describing the outline of various map features.  Shape files come in various flavors, including lines and polygons.  For reasons that will become clear, working with polygons is much preferable to working with lines, but I found it difficult (impossible) to consistently get the right polygon format into R.  R does have some built in polygons in the maps and mapdata packages.  Using these it is possible to access a global polygon map, however the resolution of the map is quite low.  The documentation doesn’t provide a maximum resolution, but it is low enough to be of little value on scales below hundreds of miles.  Here’s a quick plot of our study area (the Antarctic Peninsula) with no further information plotted, just to get a sense of the resolution.

library(maps)
library(mapdata)
map(database = 'world',
    regions = "antarctica",
    xlim = c(-77, -55),
    ylim = c(-70, -60),
    fill = T,
    col = 'grey',
    resolution = 0,
    bg = 'white',
    mar = c(1,1,2,1)
)

box()

[Figure basic_plot: the Antarctic Peninsula at the maps package’s low resolution]

Not too bad, but the y dimension is over 1000 km.  If our map covered 10 km or even 100 km we’d have at best a large block to define any landmass.  To produce a more detailed map we need to use a shape file with a much higher resolution.  This is where things get a little tricky.  There are high resolution shape files available from a number of sources, primarily government agencies, but finding them can be difficult.  After a lot of searching I discovered the GSHHG High-resolution Geography Database available from NOAA.  The database is quite large, but it can be subset with the GEODAS Coastline Extractor, which also makes use of the WDBII database (included with the GSHHG link) containing rivers and political boundaries.  It takes a little fussing around, but once you get the hang of the coastline extractor it isn’t too bad to work with.  On first use you have to point the extractor to the databases.  Then select “plot” from the drop down menu and enter the desired lat and long range.  This returns a nice graphic of the selected region, which you can export in the desired shape file format using “Save as”.

The export step is where I’m a bit stuck.  Ideally you would export the coastline as a polygon so that R could color it, if desired (as in the previous example).  I can only get R to recognize shape files containing lines however, exported in the ESRI format.  Here’s the region plotted earlier but now from the GSHHG database:

library(maps)
library(mapdata)
library(maptools)
## read the ESRI shape file exported from the coastline extractor
plter_shape <- readShapeLines('plter_region_shore_shore',
                              repair = T)

## same region as the previous plot
xlim <- c(-77, -55)
ylim <- c(-70, -60)

## set up an empty projected plot; type = 'n' suppresses the
## low-resolution coastline
map("world",
    "antarctica",
    xlim = xlim,
    ylim = ylim,
    fill = T,
    col = 'grey',
    resolution = 0,
    bg = 'white',
    mar = c(6,3,2,1),
    type = 'n'
)

plot(plter_shape, add = T)
box()

Which produces this plot:

[Figure Rplot02: the same region drawn from the GSHHG shape file]

That’s quite a bit more detail.  If we zoomed in the improvement would be even more pronounced.  You’ll notice that I plotted the ESRI shape file on a plot produced from maps, even though I had no intention of using maps to reproduce the coastline.  Using maps allows R to plot the data as a projection, which is particularly important at high latitudes.  The default maps projection isn’t great for 60 degrees south (see the maps documentation for a list of available projections), but it gets the point across.  A square projection would be much worse.  What’s particularly nice about all this is that R will use the projected coordinates for all future calls to this plot.  Points or lines added to the plot will thus show up in the right place.

Now that we have our basic map we can add some data.  One of our science tasks was to evaluate bacterial production along the Palmer Long Term Ecological Research (LTER) study grid.  Plotting depth-integrated values on our map lets us see where bacterial production was high.  I’m not going to bother showing how I arrived at depth-integrated values because the code is too specific to my data format, but suffice it to say that I used the interp1 function to interpolate between depth points at each station, then summed the interpolated values.  This gives me a data frame (depth_int) with three variables: lon, lat, and bp (bacterial production).
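The depth-integration step itself is generic, though.  Here is a minimal sketch, in Python rather than R for brevity (the profile values are invented; interpolating linearly between measurements and summing, as I did with interp1, amounts to the trapezoidal rule):

```python
## invented bacterial production profile for one station
depths = [0, 10, 25, 50, 100]        ## sample depths (m)
bp = [12.0, 10.5, 7.0, 3.0, 0.5]     ## production measured at each depth

## trapezoidal integration over depth: the area of each depth interval
## under a linear interpolation between measurements
depth_integrated = sum((bp[i] + bp[i + 1]) / 2.0 * (depths[i + 1] - depths[i])
                       for i in range(len(depths) - 1))
```

Repeating this per station is what yields a table of lon, lat, and depth-integrated bp.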

First, let’s generate some colors to use for plotting:

z <- depth_int$bp
zcol <- colorRampPalette(c('white', 'blue', 'red'))(100)[as.numeric(cut(z, breaks = 100))]

Then we add the points to the plot.  This is no different than building any other R plot:

points(depth_int$lon,
       depth_int$lat,
       bg = zcol,
       pch = 21,
       cex = 2,
       col = 'grey'
)

Snazz it up a little by adding grid lines and axis labels:

axis(1, labels = T)
axis(2, labels = T)
grid()

Titles…

plot_title <- expression(paste('Bacterial production gC m'^{-2}, ' d'^{-1}))
title(xlab = 'Longitude',
      ylab = 'Latitude',
      main = plot_title)

And the coup de grâce, a scale bar.  I can’t find a good scale bar function, but it’s easy enough to build one from scratch:

## set the location and the colorbar gradation
xleft <- -74
xright <- -73
ybot <- -65
yint <- (65 - 62.5) / 100  ## bar spans -65 to -62.5 in 100 steps
ytop <- ybot + yint

## create the bar by stacking a bunch of colored rectangles,
## drawing each one before stepping upwards
for(c in colorRampPalette(c('white', 'blue', 'red'))(100)){
  rect(xleft, ybot, xright, ytop, border = NA, col = c)
  ybot <- ybot + yint
  ytop <- ytop + yint
}

## generate labels
labels <- round(seq(min(z), max(z), length.out = 5),2)

## add the labels to the plot
text(c(xright + 0.2),
     seq(-65, -62.5, length.out = 5),
     labels = as.character(labels),
     cex = 0.8,
     pos = 4)

Done correctly this should approximate the plot published in my earlier post.  Plenty of tweaks one could make (I’d start by shifting the location of the scale bar and drawing an outlined white box around it), but it’s not a bad start!

[Figure Rplot03: depth-integrated bacterial production plotted on the finished map]


Dispatch from AAAS-CASE

Salaries for University of Washington professors. There is some weirdness at the low end driven by mid-year promotions and positions with inexplicably low salaries. Positions at the high end are primarily within the medical school. Excluding positions < $30,000 per year the mean is $114,000 per year.

Size of single investigator awards granted by Antarctic Organisms and Ecosystems, within the Division of Polar Programs, Directorate for Geosciences.  The mean award amount is a bit shy of $400,000.

As mentioned in my previous post I spent the last three days at the AAAS-CASE workshop in Washington, DC.  The purpose of the workshop was to introduce junior scientists to the legislative process and the ins and outs of science funding at the federal level.  After two days of intense introduction, including a fascinating lecture from congressional expert Judy Schneider, we spent a day on Capitol Hill meeting with staff members of our congressional delegations.  I made the rounds of the Washington delegation with Kelly Fleming, a graduate student in chemical engineering at the UW.

The Washington congressional delegation is strongly pro R&D.  The University of Washington brings a lot of federal money to the state, as does the Pacific Northwest National Laboratory and NOAA’s Pacific Marine Environmental Laboratory.  Additionally Boeing, Microsoft, and other large commercial interests in the state benefit heavily from federal R&D investment and a highly educated labor pool, and presumably lobby on behalf of sustained (or improved) funding.

Despite this pro-science stance I was initially surprised at how little congressional staffers understand about how funding works at the institutional level.  In retrospect this isn’t surprising at all.  After six years in graduate school I’m still not 100 % clear on how science funding works at the UW (see previous post on this issue here).  It quickly became clear that what we could best provide is some perspective on how increased grant funding directly translates to more jobs at institutions that receive federal funds and an improved quality of life.

Perhaps contrary to popular opinion, academic salaries are not high.  For public universities salaries are a matter of public record; the salaries of UW professors can be found here.  It’s a bit difficult to pull a mean out, because certain appointments receive no salary, and there are duplicate entries for faculty members who were promoted in a given year.  I would guesstimate the mean at $114,000 per year – and this is dragged upwards a bit by some outlier salaries in well-paying departments (engineering, medicine, etc.).  On the surface the mean seems like a healthy salary.  Although it’s much lower than in other fields with a comparable amount of graduate education (e.g. law, medicine), and much lower than what industry pays PhD researchers, this is offset by certain advantages (namely academic freedom and prestige).  If we dig a little deeper, however, we see that the reality is quite different.

Most faculty appointments are only funded at 50 % of the stated salary (taking the mean, a paltry $57,000 per year).  That’s not a bad salary when you’re 30 – a bit worse if you’re trying to raise a family – but for a full professor at the end of a productive career it’s an undervaluation.  Why the limit to 50 % funding?  This guaranteed pay is for teaching classes and performing other service directly to the university.  The remainder of a faculty member’s salary is supposed to be raised from grants.  Salaries aren’t a great use of grant funding, but the system might be considered rational if grant funding were adequate.  To determine if that’s the case, let’s pull a typical grant apart.

In the course of recently preparing a grant proposal of my own I determined the mean size of single investigator awards for the NSF Antarctic Organisms and Ecosystems program, within the Division of Polar Programs, to be ~$400,000 over three years.  Here’s how a typical award to that program might break down; costs are very approximate but loosely based on UW rates:

Institutional Overhead at 56 % (i.e. indirect cost recovery, varies): $224,000
Equipment/analytical services (i.e. doing the actual science): $50,000
Consumables (the day to day lab stuff): $20,000
Graduate student operating costs (i.e. tuition for 3 years): $45,000
Graduate student stipend and benefits: $90,000
Partial salary for laboratory technician (someone’s got to teach the grad student): $30,000
Travel costs (assuming this isn’t fieldwork based): $10,000
Remaining for faculty salary: $-69,000

And therein lies the problem.  Faculty work around the financial limits as best they can by not hiring laboratory technicians, cutting back on analysis, and in some cases just dealing with less than 100 % pay (for far more than 40 hours of work a week).  Education, academic service, and research all suffer as a result.  So how to fix this problem?  The largest chunk of money that comes out of a grant is overhead, but how would the UW and other institutions function without the income derived from indirect cost recovery?  Vastly more funding to NSF, NIH, NASA, and the other granting agencies would help (and is a laudable goal anyway), but this isn’t likely to happen anytime soon.

I think the solution might lie in pursuing the more modest goal of retooling the way funds are allocated to investigators, while slowly pushing for more overall science funding.  Taking our previous example, the itemized costs above total $469,000, so increasing the award to $469,000 at least breaks even (keeping in mind that the hypothetical project was pretty cheap, requiring no expensive field work, equipment, or analysis).  Adding three months of support for our beleaguered faculty member each year brings the total to approximately $559,000.  If we assume each faculty member has two active grants at a time (lucky them!) the math works okay, however two issues remain.

1) If every grant were 50 % larger, we’d need 50 % more funds to support the same number of awards.

2)  Generally optimists say that 10 % of submitted grants get funded.  The number can be much lower for some divisions.

I don’t think that pushing to get 50 % more funds to the grant-providing agencies is unrealistic (it’s a minuscule amount of the federal budget), but the political will to do this isn’t even on the horizon.  In the meantime making sure that grants are distributed equitably will help, though this goes against the grain of the brutally Darwinian academic culture, where all that matters is that an idea and its implementation be the best.  I think there is a balance to be struck between funding the very best work and funding work that is really good but also supports a promising researcher and their research team.  Encouraging adequate institutional support (which for the UW really means adequate state support) is also important.  Reduced overhead and alternate funding sources for graduate students and technical staff would really help stretch the federal grant dollars.

Number two is really the bigger issue.  I’m not familiar with NIH grant applications, but NSF grants require, in addition to a great idea, weeks of preparation.  I’m a reasonably fast writer and reader, but based on my own experiences I’d estimate that it takes about 120 hours of work to put together a good proposal and all the necessary supporting documents.  That’s assuming that you’ve already got a reasonably well-baked idea in your head and some preliminary data to show that the idea is well founded.

In our scenario a faculty member needs to fund two of these things every three years, which means that they have to submit about 20 (of course there will be some resubmissions in this total).  That’s 2,400 hours of work, or 800 hours each year – 20 work weeks at 40 hours per week, if we lived in a sane world where that was the norm.  Given the requirements for teaching, advising, service, and oh yeah, doing the actual research, I don’t think this is realistic.  One solution is to move grants to a five year cycle so that a faculty member needs to fund two grants every five years.  Yes, that makes each grant a bit larger, but it also reduces the number of active awards and has the additional advantage of bringing the grant cycle into (approximate) alignment with the length of time it takes a graduate student to get a PhD.
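Putting the proposal-treadmill arithmetic in one place (all figures are the approximate ones used in this post):

```python
success_rate = 0.10        ## optimistic fraction of proposals funded
grants_needed = 2          ## grants to keep active at any one time
cycle_years = 3            ## length of a typical award
hours_per_proposal = 120   ## rough prep time for a good proposal

proposals = grants_needed / success_rate        ## ~20 submissions per cycle
total_hours = proposals * hours_per_proposal    ## ~2,400 hours of writing
hours_per_year = total_hours / cycle_years      ## ~800 hours each year
weeks_per_year = hours_per_year / 40            ## ~20 work weeks

## moving to a five year cycle spreads the same writing load thinner
hours_per_year_5 = total_hours / 5              ## ~480 hours each year
```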

I had a great conversation with my colleagues at the Blue Marble Space Institute of Science about this issue this morning during the monthly “Beer with BMSIS” event (to be fair it was the afternoon in most of the country, and we only discuss a beer of choice, we don’t typically drink it).  These events are recorded, so you can listen here (and of course tune in the first Thursday of each month, you don’t need to be a member of BMSIS to join us – but it does help if you like beer).



AAAS-CASE

I’m currently in Washington, DC for the pilot AAAS-CASE (Catalyzing Advocacy in Science and Engineering) workshop.  I wanted to share a great short video that was sent to the attendees before the workshop.

http://www.innovationdeficit.org/

The video describes the “innovation deficit”: the gap between the federal investment in science and engineering and what is needed to sustain new advances in energy, medicine, agriculture, technology, and in our fundamental understanding of the world around us.  The problem isn’t difficult to understand: the entire NSF research budget is something like $7 billion (roughly 35 F-35 joint strike fighters at the going rate – every single one of which flies or fails based on past federal expenditures on both basic and applied research).

How do we address the innovation deficit?  By building stronger connections between scientists and policy makers – two groups naturally ill at ease with one another.  The idea behind AAAS-CASE is to start a conversation among junior scientists about how they can guide better policy, and to equip them with the tools to engage with policy makers.  We’ll see how it goes…


Converting pfams to cogs

Anyone with a sequence to classify faces a bewildering array of database choices.  You’ve got your classic NCBI quiver: nt, nr, refseq, cdd, etc.  Then you’ve got uniprot, seed, pfam, and numerous others.  Some of these have obvious advantages or disadvantages.  The NCBI nr database, for example, is comprehensive but cumbersome.  Choosing between others can at times feel arbitrary.

One database that I use a lot is pfam.  I like it because, in addition to being directly supported by the excellent protein classifier HMMER3.0, it is well curated and has helpful Wikipedia-like articles for some of the protein families.  Classification is based on the presence of conserved domains, which is also a little easier to wrap your head around than the implications of sequence similarity alone.  The downside to pfam is that the presence of a conserved domain doesn’t always clue you in to the function of an unknown protein sequence.  There are also many thousands of protein domains.  Sometimes you can guess at the function from the name.  Aldedh, for example, is the aldehyde dehydrogenase family.  Other names are more obtuse.  Abhydrolase_5 is a family containing the alpha/beta hydrolase fold domain, but unless you’re a biochemist the implications of that are probably a mystery (they were to me; fortunately there are the wiki articles).  What’s needed is a broader level of classification for these protein families.

NCBI used to maintain a really great database called COG (Clusters of Orthologous Groups).  Proteins were grouped into COGs, which were given descriptive names like Glyceraldehyde-3-phosphate dehydrogenase/erythrose-4-phosphate dehydrogenase.  Even better, the COGs were assigned letters based on their perceived physiological function.  Some might argue that this is too coarse a classification, and they might be right, but it is still helpful when evaluating the ecological relevance of a protein.  Here are the COG codes, taken from the fun.txt file at the COG ftp site:

INFORMATION STORAGE AND PROCESSING
[J] Translation, ribosomal structure and biogenesis
[A] RNA processing and modification
[K] Transcription
[L] Replication, recombination and repair
[B] Chromatin structure and dynamics

CELLULAR PROCESSES AND SIGNALING
[D] Cell cycle control, cell division, chromosome partitioning
[Y] Nuclear structure
[V] Defense mechanisms
[T] Signal transduction mechanisms
[M] Cell wall/membrane/envelope biogenesis
[N] Cell motility
[Z] Cytoskeleton
[W] Extracellular structures
[U] Intracellular trafficking, secretion, and vesicular transport
[O] Posttranslational modification, protein turnover, chaperones

METABOLISM
[C] Energy production and conversion
[G] Carbohydrate transport and metabolism
[E] Amino acid transport and metabolism
[F] Nucleotide transport and metabolism
[H] Coenzyme transport and metabolism
[I] Lipid transport and metabolism
[P] Inorganic ion transport and metabolism
[Q] Secondary metabolites biosynthesis, transport and catabolism

POORLY CHARACTERIZED
[R] General function prediction only
[S] Function unknown

I decided to try and map the pfam database to the COG database so that I could have my cake and eat it too.  To do that I ran blastp on the Pfam-A.fasta file against a protein blast database created from cog0303.fa and used the following script to parse the tab-delimited blast output and various other files to create the mapping.  This was a reasonably large blast job and I ran into a computer issue part way through.  As a result the job didn’t complete.  When I have the chance to re-run I’ll post the completed pfam to cog table here. In the meantime here’s the script for anyone who would like to create their own pfam to cog map.  It is assumed that all database files are in the working directory.

## first step was to blast pfam-A against cog0303.fa
## blastp -db COG/cog0303 -query Pfam-A.fasta -out Pfam-A_cog.txt -outfmt 6 -num_threads 8 -evalue 1e-5
## if you want to conduct a new mapping, for example from a new blast file, remove the old pickle first

import gzip
import cPickle
import os

cog_cats = {}
cogs_seqs = {}
cog_names = {}
pfam_seqs = {}
pfam_cog = {}

if 'pfam_cog_dict.p' not in os.listdir('.'):

    ## map cog name to cog category
    print 'mapping cog name to cog category'
    with open('cogs.csv', 'r') as cog_file:
        for line in cog_file:
            line = line.rstrip()
            line = line.split(',')
            cog_cats[line[0]] = line[1]
            cog_names[line[0]] = line[2]

    ## map cog sequence to cog name
    print 'mapping cog sequence to cog name'
    with open('domdat.csv', 'r') as cogs_seqs_file:
        for line in cogs_seqs_file:
            line = line.rstrip()
            line = line.split(',')
            cogs_seqs[line[0]] = line[6]

    ## map pfam sequence to pfam
    print 'mapping pfam sequence to pfam'
    with open('Pfam-A.fasta', 'r') as pfam_seqs_file:
        for line in pfam_seqs_file:
            if line.startswith('>'):
                line = line.strip('>')
                line = line.rstrip()
                line = line.split()
                pfam_seqs[line[0]] = line[2]

    ## map each pfam to the cogs its member sequences hit in the blast search
    print 'mapping blast'
    with gzip.open('Pfam-A_cog.txt.gz', 'rb') as blast:
        for line in blast:
            line = line.rstrip()
            line = line.split()
            pfam_seq = line[0]
            cog_seq = line[1]
            pfam = pfam_seqs[pfam_seq]
            cog = cogs_seqs[cog_seq]
            try:
                cog_name = cog_names[cog]
            except KeyError:
                cog_name = 'NULL'
            try:
                cog_cat = cog_cats[cog]
            except KeyError:
                cog_cat = 'NULL'
            ## accumulate the sets of cogs, categories, and names for this pfam
            try:
                temp_cog, temp_cog_cat, temp_cog_name = pfam_cog[pfam]
            except KeyError:
                temp_cog, temp_cog_cat, temp_cog_name = set(), set(), set()
            temp_cog.add(cog)
            temp_cog_cat.add(cog_cat)
            temp_cog_name.add(cog_name)
            pfam_cog[pfam] = temp_cog, temp_cog_cat, temp_cog_name
            print pfam, cog, cog_cat, cog_name

    cPickle.dump(pfam_cog, open('pfam_cog_dict.p', 'wb'))

else:
    print 'using existing pfam to cog map'
    pfam_cog = cPickle.load(open('pfam_cog_dict.p', 'rb'))

## write the tab-delimited pfam to cog category map, whether the mapping
## was just built or loaded from the pickle
with open('pfam_cog_cat_map.txt', 'w') as output:
    for pfam in pfam_cog.keys():
        temp = pfam_cog[pfam]
        pfam_temp = pfam.split(';')
        pfam_code = pfam_temp[0].split('.')[0]
        for cog_cat in list(temp[1]):
            print >> output, pfam_code + '\t' + pfam_temp[1] + '\t' + cog_cat

This script produces a pickle with the mapping in case you want to format the output differently later without having to redo everything, and a tab-delimited file with the COG categories that each pfam belongs to:

PF07792     Afi1     NULL
PF05346     DUF747     NULL
PF12695     Abhydrolase_5     E
PF12695     Abhydrolase_5     I
PF12695     Abhydrolase_5     FGR
PF12695     Abhydrolase_5     Q
PF12695     Abhydrolase_5     R
PF12695     Abhydrolase_5     U
PF12695     Abhydrolase_5     NULL
PF12644     DUF3782     S

Note that each pfam can belong to many COGs (maybe too many to make the mapping useful in some cases).  Hopefully it will help clarify the role of some of these pfams, however!
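For downstream use the tab-delimited map can be read back into a dictionary of COG category sets per pfam.  Here is a minimal sketch assuming the three-column format shown above (the function name is my own):

```python
from collections import defaultdict

def read_pfam_cog_map(path):
    ## map each pfam accession to the set of COG categories it hit
    pfam_to_cats = defaultdict(set)
    with open(path) as map_file:
        for line in map_file:
            pfam_code, pfam_name, cog_cat = line.rstrip('\n').split('\t')
            pfam_to_cats[pfam_code].add(cog_cat)
    return pfam_to_cats
```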


Where goes the carbon in the Palmer LTER?

In the last post I broadly described some of the ecological changes underway in the Palmer LTER.  If we consider only the biological components of an ecosystem (excluding the chemical and physical components) we can diagram their interactions as a food web, with primary producers at the base of the web and top predators at the top.  Changes to the structure of the food web are generally driven by changes at the top (‘top-down’ effects) or at the bottom (‘bottom-up’ effects).  Whaling during the 19th and 20th centuries, for example, exerted a very strong top-down effect on the structure of Antarctic food webs.  Rapidly changing sea ice conditions during this century are exerting a strong bottom-up effect by altering the pattern of primary production.

The classic Antarctic food web. The traditional food web is a good learning tool, but is incomplete. In reality a substantial amount of biomass cycles through the marine microbial community. Taken from http://science.jrank.org/kids/pages/63/FITTING-ALL-TOGETHER.html.

The traditional food web that we all learned in elementary school is (not surprisingly) an incomplete representation of the trophic structure of an ecosystem.  If we consider the carbon flowing through such a food web we see it move only in one direction, up from the primary producers to the top consumers.  A rule of thumb is that the biomass of each trophic level is one tenth of the one below it, so as carbon moves up through the food web most of it must be exiting the system.  Where does it go?  Much of it is lost through respiration – we and nearly all other multicellular organisms exhale carbon dioxide.  The rest is lost through inefficient feeding, excretion, and death in the consumer trophic levels.
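The consequence of that rule of thumb is easy to see in a few lines of arithmetic (purely illustrative):

```python
# Start with 1000 units of carbon fixed by primary producers and retain
# roughly a tenth at each step up the food web.
levels = ['primary producers', 'herbivores', 'primary carnivores', 'top predators']
biomass = 1000.0
for level in levels:
    print('%s: %g units' % (level, biomass))
    biomass *= 0.1
# Only ~1 of the original 1000 units reaches the top predators; the other
# ~999 units exited the system through respiration, inefficient feeding,
# excretion, and death along the way.
```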

A pelagic food web with the microbial loop. The diagram represents an open-ocean environment different from the west Antarctic Peninsula, but does a good job at highlighting the function of the microbial loop (at left). DOC leaks from the food web at every trophic level. Bacteria incorporate this DOC into microbial biomass and are consumed by small zooplankton, returning the carbon to the food web. Taken from http://oceanworld.tamu.edu/resources/oceanography-book/microbialweb.htm.

If that was the end of the story carbon would be raining down to the deep ocean at a rapid pace in the form of dead phytoplankton, dead consumers, and fecal pellets (this is actually the goal of ocean fertilization), and the photic zone, the sunlit portion of the ocean that supports photosynthesis, would be awash in small particles of organic carbon (we call this dissolved organic carbon or DOC) that aren’t practical for larger organisms to eat.

What limits this in the ocean are bacteria, which consume DOC and partially degrade detrital particles.  Bacteria are large enough to be prey items for some consumers, so the carbon they incorporate is recycled into the food web in a process known as the microbial loop.  Bacteria take up a huge amount of the carbon that leaves the food web, but not all of it is looped back up to the higher trophic levels.  Like multicellular organisms, heterotrophic bacteria respire carbon dioxide.  If that carbon dioxide isn’t taken back up by phytoplankton (a further loop embedded in the food web) it will eventually leave the system.

Depth integrated bacterial production for the Palmer LTER during the 2014 LTER cruise, estimated from the uptake of tritiated leucine.

How much carbon is taken up by primary production, how much is incorporated into bacterial biomass, and how much is exported from the photic zone – these are huge research questions.  None of these questions are particularly easy to answer, and the numbers vary widely across time and space.  On the Palmer LTER cruise we used three different methods to estimate these values for the west Antarctic Peninsula.  I’ll talk a little about the first two here.  Bacterial production is measured by the uptake of tritiated leucine.  Leucine is an amino acid that is broadly taken up by the marine microbial community.  Tritium is a radioactive isotope of hydrogen; tritiated leucine is leucine with tritium in place of some of its hydrogen atoms.  It is possible to quantitatively measure very small amounts of radioactivity, which makes it possible to measure the incorporation of very small amounts of tritiated leucine into microbial biomass.

To do this we take water samples from multiple depths at each station and incubate them with tritiated leucine at the in-situ temperature.  After a few hours the bacteria in the water samples are killed (“fixed”), carefully washed, and measured for radioactivity.  From this a rate of leucine incorporation is calculated.  This number tells us something about the rate at which bacteria are increasing their biomass.  What we actually want to know, however, is how much carbon they are taking up to do this.  Many years of experiments have allowed the development of a conversion factor to calculate the rate of carbon incorporation from the rate of leucine incorporation.  What this method doesn’t tell us, however, is the amount of carbon that is consumed during microbial respiration – a much harder number to get at!
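As a rough sketch, the conversion works something like the following.  The 1.5 kg C per mol leucine used here is a commonly cited theoretical value, not necessarily the factor we applied; empirically derived factors vary by study and region.

```python
# Hypothetical leucine-to-carbon conversion, assuming a theoretical factor
# of ~1.5 kg C produced per mol leucine incorporated.
LEU_TO_C = 1.5e3  # g C per mol leucine

def bacterial_production(leu_rate_pmol_l_h):
    """Convert leucine incorporation (pmol L-1 hr-1) to ug C L-1 day-1."""
    mol_leu_per_day = leu_rate_pmol_l_h * 1e-12 * 24.0  # mol leucine L-1 day-1
    return mol_leu_per_day * LEU_TO_C * 1e6             # g C -> ug C

print(bacterial_production(20.0))  # 20 pmol leu L-1 hr-1 -> ~0.72 ug C L-1 day-1
```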

Depth integrated primary production for the Palmer LTER during the 2014 LTER cruise. Estimated from the uptake of 14C bicarbonate.

The method for measuring primary production is similar to that for measuring bacterial production, except that radioactive bicarbonate (labeled with the 14C isotope of carbon) is used in place of radioactive leucine.  In seawater bicarbonate exists in equilibrium with dissolved carbon dioxide, which is incorporated into phytoplankton biomass during photosynthesis.

All of this fixed carbon exits the system in one of two ways.  Either it is respired as carbon dioxide as it works its way up the food web, eventually ending up in the atmosphere, or it is exported out of the photic zone as sinking detrital particles.  Particulate matter that is not consumed by bacteria below the photic zone will eventually end up on the seafloor.  The consumption of this carbon by the microbial community is often very high right at the seafloor, but rapidly diminishes below the sediment surface as oxidized chemical species are depleted.  Carbon that makes it to this depth is effectively sequestered, until a geological process (such as subduction) returns it to the surface.  Measuring the flux of particulate matter to the seafloor relies on the use of sediment traps or the measurement of radioactive thorium-234, a process that is far more complex than the measurement of tritiated leucine or 14C bicarbonate.  To learn more about it check out the website of Café Thorium, Ken Buesseler’s research group at WHOI, particularly this article.

The two figures above show our estimates of primary and bacterial production across the study area.  Both sets of values are pretty high.  There was a strong phytoplankton bloom this year (perhaps connected to improved sea ice coverage) and the bacterial community seems to have responded in kind to this input of fixed carbon.  Note however, that bacterial production is an order of magnitude below primary production.  Most of the carbon is exiting the system through respiration or particle export.  A small amount is contained within the biomass of phytoplankton, krill, and the higher trophic levels.  If you look carefully at the two plots you’ll also see that bacterial production is somewhat decoupled from primary production, being high where primary production is low and sometimes low when it is high.  The devil is in the detail and it will take some digging to understand the dynamics of carbon transfer in this system!

The bacterial abundance, production, and export team for the 2014 Palmer LTER cruise, on the deck of the ARSV Laurence M. Gould. The odd looking contraption behind us is a French made sediment trap. Sediment traps are invaluable tools, but famously undersample particles sinking to the seafloor. The French design is supposed to minimize these effects. Fun fact – tenured WHOI faculty are impervious to the cold and to head injuries.

Posted in Uncategorized | Leave a comment

A cruise in the Palmer LTER

The Laurence M. Gould in one of the many coves off the West Antarctic Peninsula.

For the last five weeks I’ve been on a research cruise aboard the ARSV Laurence M. Gould off the west Antarctic Peninsula (WAP).  This region has received a lot of attention in recent years as one of the fastest warming places on the planet – annual temperatures have increased 7°C over the last 50 years.  Because of past sealing and whaling and ongoing fishing, the WAP is hardly a pristine ecosystem (remember, say no to Chilean sea bass, aka the Patagonian toothfish, one of the most pressured predators in the region).  Nonetheless it’s under much less direct human pressure than most places on Earth which, combined with the rapid rate of warming, makes it a good site to observe the impact of changing climate on marine ecosystems.  To carry out these observations the WAP hosts a long term ecological research site (the Palmer LTER), one of 25 NSF funded LTER sites.  The Palmer LTER is unique in being one of only two marine-pelagic LTERs, and also one of only two LTERs located in Antarctica (the other is in the McMurdo Dry Valleys).  The cruise I participated in takes place each year during the Antarctic summer.  It is tasked with making a variety of measurements throughout the Palmer LTER and otherwise supporting science activities connected with long-term observations there.

Sample sites within the Palmer LTER. Each year the Gould occupies approximately three points spanning each “line”, depending on sea ice conditions.

The Laurence M. Gould has extremely limited access to the internet, so unfortunately it wasn’t possible to upload articles in real time.  Instead I’ll try to summarize the cruise in a couple of articles after the fact.

I mentioned that the climate of the WAP is changing fast.  Twenty years ago the whole peninsula would have been considered a polar environment, hosting a polar marine ecosystem.  The outlying islands that extend in a sweeping arc out to South Georgia Island would have been considered subpolar.  The subpolar ecosystem is now moving south in a hurry, changing the distribution of nutrients, primary production, and marine species.  Because it’s hard to picture what’s going on below the ocean’s surface it’s a little difficult to visualize these changes in the marine ecosystem.  A comparable change in the terrestrial environment might be the encroachment of desert on a region of rich grassland.  Deserts aren’t all bad (I rather like them), but they can’t support the biomass or diversity of grasslands.  In the parlance of economics they can’t support the same level of ecosystem services.

In the WAP a huge driver of ecological change is sea ice.  The species that occupy the Antarctic Peninsula are divided into two groups: ice dependent and ice independent.  The ice dependent species have an absolute requirement for sea ice and include Adélie penguins, leopard seals, crabeater seals, sea ice algae, and most importantly, krill.  The krill relationship with sea ice isn’t totally straightforward.  Adult krill are happy to feed on phytoplankton in the water column and have no real sea ice requirement.  Juvenile krill on the other hand, are dependent on the dense layer of lipid rich algae that grow on the underside of sea ice to support their growth.  No juvenile krill means no adult krill, a problem because everything else in the Antarctic marine ecosystem depends on adult krill.  Without krill there is no way to transfer biomass from primary producers at the base of the food web to higher trophic levels.  To return to the grassland analogy, removing the krill from the ecosystem would be like removing all the hoofed mammals from the Serengeti.  All the predators and scavengers would disappear too.  Ultimately even the grass would fail, because the presence of large grazers has benefits like soil aeration and nutrient and seed dispersal.  In the WAP it would be worse than this because the biomass far exceeds that of a grassland of comparable size, and this biomass fits critically into the ecology of such globe-trotting species as the humpback whale, orca, and Wilson’s storm petrel (thought by some to be the world’s most abundant bird).

Sea ice cover in the Palmer LTER from 1981 to 2011.  Anvers Island is at the northern extent of the LTER, Adelaide Island near the center, and Charcot Island at the southern extent.  There is a lot of year to year variation, but the overall trend of reduced ice cover is clear.  From Ducklow et al. 2013, see link at end of post.

The general trend in the WAP is for reduced sea ice and primary production in the north of the study region, and increased primary production in the south.  Not all the factors influencing this change are yet clear, nor is it clear how these changes will manifest in the future.  It will take decades of observation to clarify the trend and the mechanisms behind it.  One likely driver of the southward shift in primary production is the reduced availability of micronutrients supplied by glacial meltwater.  The Peninsula, like the rest of the continent, is covered with glaciers.  Glaciers melt at their base where they contact rock, and this mineral-rich meltwater is a key source of iron and other micronutrients to the marine ecosystem.  It’s counterintuitive that warming would reduce the availability of meltwater.  The availability of these micronutrients has to do with the distribution of meltwater in the ocean, however, not the rate of melt.  Freshwater is much less dense than seawater, so glacial meltwater “floats” on top of seawater, forming a micronutrient rich lens.  In the presence of sea ice this lens is protected from wind driven vertical mixing, and for a brief period each summer following sea ice retreat there is a strong spatial overlap between micronutrient rich water and sunlight.  The result is a “bloom”, a sudden explosion of growth among the primary producers.  In the absence of sea ice, storms dilute this surface water with micronutrient poor water from deeper in the ocean.  By summer the amount of photosynthesis that can be supported is much reduced.

Leopard seals in the WAP, near the British Antarctic Survey’s Rothera Station. Leopard seals are a key ice dependent predator in the region.

Nothing is that simple however.  In the WAP the upwelling of micronutrient poor deep water through marine canyons appears to strongly support primary production.  Like upwelling water the world over, this deep water, while deficient in micronutrients, is rich in macronutrients like nitrate.  So the reality is probably that both water sources, and a reasonably stable water column, are required to support the level of primary production the rest of the ecosystem depends on.  Returning to the grassland analogy, tweaking the delivery and mixing rate of surface and deep water is equivalent to the jetstream shifting its course over land, depriving a historically wet region of water and limiting the amount of plant growth that it can support.

So, from the standpoint of primary production, two important things are happening in the WAP.  First, the reduction in sea ice means there are fewer ice algae to support the growth of juvenile krill.  Second, the reduction in sea ice is increasing the rate of vertical mixing in the water column, reducing the growth of phytoplankton – the food source of adult krill.

This year was an exception to the current trend, with ice conditions rebounding a little after years of decline.  It was nothing approaching the coverage that was normal 15 or 20 years ago, but it was sufficient to support an impressive phytoplankton bloom and will hopefully slow the loss of some of the ice dependent species like the Adélie penguin, whose numbers have been dwindling.  Next post I’ll talk a bit more about some of our surprising findings regarding the fate of all that carbon produced during the bloom…

For more on the Palmer LTER check out this excellent, non-technical article in the magazine of the Oceanography Society: West Antarctic Peninsula: an ice-dependent coastal marine ecosystem in transition

Posted in Uncategorized | Leave a comment

Android, Mendeley, Referey, and taming the reference beast

Like everyone else in Academia I’m in a constant struggle against information overload.  This comes not only from the overwhelming amount of data from analyses I’m working on, but from an ever-growing stream of scientific literature.  I gave up trying to stay on top of journal tables-of-contents a long time ago; instead I rely on Google Scholar alerts to notify me when relevant publications come out.  Unfortunately this switch didn’t reduce the volume of literature that catches my eye, it merely ensured that all the literature would be relevant, interesting, and even more important to file away somewhere and remember.

Some people seem to have a great mental filing system for authors, dates, and subjects of key articles, and can pull a critical reference from the depths of their hard drive based on title alone.  I’m not one of those people.  If I can’t find a reference by a keyword search it doesn’t exist in my universe.  Fortunately, in front of a laptop it is generally possible to find what you want by keyword search.  Both Windows Search and OSX Spotlight automatically index the contents of pdfs, making it easy to find all the papers in a personal library pertaining to say, Wigglesworthia.  To make this even easier about a year ago I started using Mendeley Desktop, a pretty decent program for organizing and annotating pdfs.  I don’t use Mendeley quite the way it’s meant to be used however.  The free version of the program comes with a generous chunk of online storage space, which allows you to maintain an online copy of your library that’s accessible from anywhere.  My pdf library is pretty large, however, and I’m hesitant to upgrade to the paid version of Mendeley – which comes with more storage space – until I’m really sold on the program (which is happening quickly, aided by the hope that Mendeley’s cite-while-you-write add-on will forever remove from my life the horrible, evil, travesty which is Endnote…).

Enter the Android tablet that I recently acquired.  My vision is to be able to sit in a class/seminar/meeting, with that critical paper/fact/reference on the tip of my tongue, and actually be able to LOOK IT UP.  Should be simple enough, right?  Well, maybe.  An additional requirement is that I need to be able to look things up when I’m at sea, in the field, or otherwise not connected to the internet.  This makes it a little trickier.  After reading a number of blog posts from people with similar problems I was able to patch together a solution that works (so far) remarkably well.  The limitations of Android and Apple are equal in this respect, so this solution should be relevant for iPad users too.

My initial thought was to migrate my pdf library to Dropbox, and use the Android Dropbox app and native search function to find what I wanted.  Two problems: 1) The Dropbox app doesn’t actually sync files to the tablet (crazy!) and 2) There IS NO native search function on either Droid or Apple iOS that searches file contents (crazier still!).  Fortunately I came across Dropsync, a competent little app that picks up where Dropbox falls flat.  With the paid version you can keep select Dropbox folders synced to the SD card in your Droid.  Problem one solved.

If I didn’t need the search function to work without internet I could have downloaded one of a couple of Mendeley-compatible apps, upgraded to the paid version of Mendeley, and lived the life of online library access (iPad owners can cut out the middleman and make use of an official Mendeley app).  Needing an offline search feature I was pretty well stuck until I found out about Referey (thanks Mendeley, for posting links to third-party apps that help your users…).  Referey was designed with my exact problem in mind.  Here’s how it works:

Having migrated your entire reference library to a folder in Dropbox, you create a symlink in that folder to the sqlite database where Mendeley keeps the index for your library.  On Windows 7 I found the database in C:\Users\Jeff\AppData\Local\Mendeley Ltd\Mendeley Desktop\largedatabasewithyourmendeleyusername.sqlite.  From the destination folder in Dropbox I created a symlink as such:

mklink largedatabasewithyourmendeleyusername.sqlite "C:\Users\Jeff\AppData\Local\Mendeley Ltd\Mendeley Desktop\largedatabasewithyourmendeleyusername.sqlite"

On OSX or Linux the equivalent command is “ln -s”.  That keeps both my reference library and the Mendeley index up to date on my tablet.  On your tablet, in preferences, you tell Referey where in your Dropbox (synced with Dropsync, not Dropbox) you’ve stashed the database and your library.  Referey does no further indexing, it simply uses what Mendeley already knows to find what you want in the library.  Pdfs returned in a search can be opened and annotated using whatever app you want (I’m currently using iAnnotate PDF).
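For the OSX/Linux side, here’s a minimal sketch of the same setup.  Every path below is hypothetical – substitute your own Dropbox library folder and the actual location of your Mendeley sqlite database (on OSX it typically sits under ~/Library/Application Support/Mendeley Desktop/, but check your install):

```shell
# Hypothetical paths -- replace with your own library folder and database.
LIBRARY="$HOME/Dropbox/papers"
DB="$HOME/Library/Application Support/Mendeley Desktop/yourname.sqlite"

mkdir -p "$LIBRARY"
# Note ln -s takes the target first and the link name second,
# the reverse of Windows mklink.
ln -sf "$DB" "$LIBRARY/$(basename "$DB")"
ls -l "$LIBRARY"
```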

For a while I really doubted whether a solution existed for what seems like an obvious problem.  I was pretty happy to find one…

Posted in Uncategorized | Leave a comment

Frost flower metagenome paper submitted

I just hit the “submit” button for our latest frost flower paper, which reviewer-be-willing will appear in an upcoming polar and alpine special issue of FEMS Microbial Ecology.  Several scattered bits of analysis from the paper have appeared in this blog, here’s a recap of what we did and why.

In a 2013 paper we described an unusual community of bacteria in Arctic frost flowers composed primarily of members of the bacterial order Rhizobiales.  This was a bit odd because Rhizobiales, while not unknown in the marine environment, are not typically found in high abundance there.  Rhizobiales had definitely not been observed in association with sea ice before.  To try and figure out what was going on with this particular community we went back to the same samples (originally collected in April, 2010) and used whole-genome amplification to amplify all the DNA until we had enough to sequence metagenomes from one frost flower and one young sea ice sample.  The good people at Argonne National Lab did the actual sequencing (great work, great prices…).

Following our failed effort in February of 2010 we received an email in April that natural frost flowers were forming.  I went up the next day and sampled from this lead, these are the samples from our 2013 paper (Bowman et al., EMIR, 2013).  That night it snowed a foot, eradicating all frost flowers.  If I'd been a day later I would have missed them.

The field of frost flowers yielding the samples for this paper, and for Bowman et al. 2013.

I’ve never worked with metagenomic data before, and once I had the actual sequence reads it took me a while to figure out how to get the most information out of them.  For a first cut I threw them into MG-RAST.  This confirmed the Rhizobiales dominance easily enough, but for publication quality results you really need to do the analysis yourself.  Ultimately I used a four-pronged approach to evaluate the metagenomes.

MG-RAST, great for a first cut of your data, and for wowing your advisor with snazzy graphics, but no good for rigorous analysis. This figure from MG-RAST shows the recruitment of metagenomic sequence reads to the genome of Sinorhizobium meliloti 1021.

1.  Extract 16S rRNA reads and analyze taxonomy.  For this I use mothur, treating the 16S associated reads like any 16S amplicon dataset (except that you can’t cluster them!).

2.  Align all the reads against all available completed genomes.  We’ve got 6,000 or so genomes in GenBank – may as well do something with them.  This actually provides a ton of information, and backs up the 16S based taxonomy.

3.  Assemble.  I’m not that good at this – the final contigs weren’t that long.  Cool though.

4.  Targeted annotation.  One would love to blastx all the reads against the nr database, or even RefSeq protein, but this is silly.  It would tie up thousands of CPUs for days; this computing power is better spent elsewhere.  If you have some idea what you’re looking for, however, you can blastx against small protein databases pretty easily.  After the blast search you can use pplacer or a similar tool to conduct a more fine-scale analysis of the reads with positive matches.
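The post-blast filtering step in point 4 can be sketched in a few lines.  In BLAST’s standard 12-column tabular output (-outfmt 6) the e-value is the 11th column; the read names and hit values below are made up for illustration:

```python
# Keep reads whose blastx hit passes an e-value cutoff before handing
# them to pplacer.  Column 11 (0-indexed 10) is the e-value.
sample_hits = [
    'read1\tdmsA_1\t78.2\t50\t11\t0\t1\t150\t10\t59\t1e-20\t95.1',
    'read2\tdmsA_2\t35.0\t40\t26\t1\t3\t120\t5\t44\t0.5\t22.3',
]

passing = set()
for line in sample_hits:
    fields = line.split('\t')
    if float(fields[10]) <= 1e-5:
        passing.add(fields[0])

print(sorted(passing))  # ['read1']
```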

So what’d we find?  The frost flower metagenome contains lots of things that look like Rhizobiales, but that are pretty different from most of the Rhizobiales with sequenced genomes in GenBank.  Among other things they don’t have nodulation genes or genes for nitrogen fixation, unlike many terrestrial Rhizobiales.  The community does have genes for some interesting processes however, including the utilization of dimethylsulfide (DMS) – a climatically active biogenic gas, the degradation of glycine betaine – which may be an important compound in the nitrogen cycle, and haloperoxidases – oxidative stress enzymes that release climatically active methylhalides.

You can find more details on some of this analysis on this poster.

Posted in Uncategorized | 1 Comment