Intro
Computational biologists often have to use both Python and R in a single project. Either components of the project will rely on libraries in different languages, or all the Stack Overflow help for the components will be in R or Python only. However, if you can limit a project to a single language, you’ll have fewer dependencies and less installation time for future users.
This post documents in Python one task that’s already well-documented in R: converting ENSEMBL ids to gene symbols and vice versa.
The R Method
If you only care about the Python solution, go ahead and skip forward one section.
library("biomaRt")
ensembl = useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# ensemblsIDS is a character vector of ENSEMBL gene ids that you define beforehand
ensembl_to_gensymbol <- getBM(attributes = 'hgnc_symbol',
                              filters = 'ensembl_gene_id',
                              values = ensemblsIDS,
                              mart = ensembl)
There are several variations on this theme that show up if you google “convert ensembl to gene symbol,” and the parameters are well documented here.
In short, you have attributes, which are the types of results you want (e.g. gene symbols); filters, which are the type of data you’re passing in (e.g. ensembl gene ids); values, which are the data you want to be converted; and mart, which is created in the previous step and tells biomaRt where to look for the info.
The Python Method
We’ll use the Python biomart package to make interacting with BioMart servers easier [1].
To install it, use pip install biomart.
The Python version of the code is more verbose (though, to be fair, most of the code is for parsing the output into a dict), so I’ll walk through it in chunks. If you want the whole function for copy-pasting, scroll to the end of the section.
import biomart

def get_ensembl_mappings():
    # Set up connection to server
    server = biomart.BiomartServer('http://uswest.ensembl.org/biomart')
    mart = server.datasets['mmusculus_gene_ensembl']
This first block makes a connection to the server and tells the library which dataset to use. The first line won’t typically need to be changed, but if you’re working with organisms other than mice, you may want to change your dataset. You can find a list of available datasets in the “Selecting a BioMart database and dataset” section of this post.
    # List the types of data we want
    attributes = ['ensembl_transcript_id', 'mgi_symbol',
                  'ensembl_gene_id', 'ensembl_peptide_id']
This list contains the different ids that we want to map to each other.
The names of these attributes are dependent on the database, so you’ll want to change them if you aren’t mapping mouse gene ids to each other.
If you’re unsure which attributes are present in your dataset, you can call mart.show_attributes() to print them out.
    # Get the mapping between the attributes
    response = mart.search({'attributes': attributes})
This line sends your query to the database, telling it to retrieve the attributes you requested.
The example code here is a little different than the R version since it doesn’t have a filters or values field.
If you want to only return results for your favorite ENSEMBL gene ids, you could use something like the line below for your search parameters:
# ALTERNATE SEARCH PARAMETERS
{'attributes': attributes,
 'filters': {'ensembl_gene_id': <A LIST OF YOUR IDS>}}
Because of the dictionary structure of the search parameters, you don’t have to use values = like in R.
Instead, pass the gene ids as a list of values with the filter name (here, ensembl_gene_id) as the key.
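To make the filtered and unfiltered cases concrete, here is a minimal sketch of a helper that builds either version of the search parameters. The function name build_search_params and the placeholder ids are mine, not part of the biomart package; the only thing taken from the code above is the shape of the dict.

```python
def build_search_params(attributes, gene_ids=None):
    """Build the dict passed to mart.search(); add a filter only if ids are given."""
    params = {'attributes': list(attributes)}
    if gene_ids:
        # The filter name is the key; the ids to look up are the value
        params['filters'] = {'ensembl_gene_id': list(gene_ids)}
    return params


attributes = ['ensembl_gene_id', 'mgi_symbol']
# Placeholder ids for illustration only; substitute your own
params = build_search_params(attributes, ['ENSMUSG_PLACEHOLDER_1',
                                          'ENSMUSG_PLACEHOLDER_2'])
# response = mart.search(params)  # needs the `mart` object created earlier
```

Calling the helper without gene ids gives you the unfiltered query used in the rest of this post.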
    # If someone puts an emoji in a gene name, this line will break
    data = response.raw.data.decode('ascii')
If you’re familiar with the requests library, this statement might look familiar to you. It takes the request results and converts them from a binary string to an easier-to-work-with text string.
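If the emoji scenario worries you, one option is a more forgiving decode step. This is a sketch of an alternative, not what the function in this post does: it assumes the server would send UTF-8 for any non-ASCII text, and decode_response is a name I made up for illustration.

```python
def decode_response(raw_bytes):
    """Decode response bytes to text, falling back from ASCII to UTF-8."""
    try:
        return raw_bytes.decode('ascii')
    except UnicodeDecodeError:
        # Assumption: non-ASCII content is UTF-8. Anything that still
        # can't be decoded becomes the Unicode replacement character.
        return raw_bytes.decode('utf-8', errors='replace')


text = decode_response(b'id_1\tsymbol_1\n')  # plain ASCII passes through
```

In the function you would then call decode_response(response.raw.data) instead of decoding with 'ascii' directly.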
    ensembl_to_genesymbol = {}
    for line in data.splitlines():
        line = line.split('\t')
        # The entries are in the same order as in the `attributes` variable
        transcript_id = line[0]
        gene_symbol = line[1]
        ensembl_gene = line[2]
        ensembl_peptide = line[3]

        # Some of these keys may be an empty string. If you want, you can
        # avoid having a '' key in your dict by ensuring the attributes
        # have a nonzero length before adding them to the dict
        ensembl_to_genesymbol[transcript_id] = gene_symbol
        ensembl_to_genesymbol[ensembl_gene] = gene_symbol
        ensembl_to_genesymbol[ensembl_peptide] = gene_symbol
This part is all base Python. Each line in the response from BioMart is a tab-separated list containing the attributes you requested in the order you requested them. If an attribute doesn’t have an entry (e.g. the gene doesn’t code for a peptide), that field will be an empty string instead.
Knowing this, you can split the line, assign the values you want to variables, and add each entry to the mapping dict.
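The code comments mention that you can keep '' out of the dict by checking each id before adding it. Here is a minimal sketch of that variant, run on made-up rows so it works without a server. The function name parse_mappings and the sample values are hypothetical; the column order matches the attributes list above.

```python
def parse_mappings(data):
    """Map transcript, gene, and peptide ids to gene symbols, skipping blanks."""
    mapping = {}
    for line in data.splitlines():
        fields = line.split('\t')
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        transcript_id, gene_symbol, ensembl_gene, ensembl_peptide = fields[:4]
        for key in (transcript_id, ensembl_gene, ensembl_peptide):
            if key:  # an empty string means the attribute had no entry
                mapping[key] = gene_symbol
    return mapping


# Made-up rows: transcript id, gene symbol, gene id, peptide id
sample = 'T1\tSymA\tG1\tP1\nT2\tSymB\tG2\t\n'
mapping = parse_mappings(sample)
```

The second sample row has no peptide id, so it adds only two keys to the dict rather than three.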
import biomart

def get_ensembl_mappings():
    # Set up connection to server
    server = biomart.BiomartServer('http://uswest.ensembl.org/biomart')
    mart = server.datasets['mmusculus_gene_ensembl']

    # List the types of data we want
    attributes = ['ensembl_transcript_id', 'mgi_symbol',
                  'ensembl_gene_id', 'ensembl_peptide_id']

    # Get the mapping between the attributes
    response = mart.search({'attributes': attributes})
    data = response.raw.data.decode('ascii')

    ensembl_to_genesymbol = {}
    # Store the data in a dict
    for line in data.splitlines():
        line = line.split('\t')
        # The entries are in the same order as in the `attributes` variable
        transcript_id = line[0]
        gene_symbol = line[1]
        ensembl_gene = line[2]
        ensembl_peptide = line[3]

        # Some of these keys may be an empty string. If you want, you can
        # avoid having a '' key in your dict by ensuring the
        # transcript/gene/peptide ids have a nonzero length before
        # adding them to the dict
        ensembl_to_genesymbol[transcript_id] = gene_symbol
        ensembl_to_genesymbol[ensembl_gene] = gene_symbol
        ensembl_to_genesymbol[ensembl_peptide] = gene_symbol

    return ensembl_to_genesymbol
Once you put all the pieces together, you get the function above.
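The intro promised the "vice versa" direction too. Since several ids share one gene symbol, the reverse lookup has to map each symbol to a collection of ids. Below is a rough sketch under that assumption; invert_mapping is a hypothetical helper, and the example dict is made up to stand in for the real output of get_ensembl_mappings().

```python
from collections import defaultdict


def invert_mapping(ensembl_to_genesymbol):
    """Group Ensembl ids by gene symbol, since one symbol can have many ids."""
    symbol_to_ensembl = defaultdict(set)
    for ensembl_id, gene_symbol in ensembl_to_genesymbol.items():
        if ensembl_id and gene_symbol:  # skip blank ids and unnamed genes
            symbol_to_ensembl[gene_symbol].add(ensembl_id)
    return dict(symbol_to_ensembl)


# Made-up entries standing in for the output of get_ensembl_mappings()
example = {'T1': 'SymA', 'G1': 'SymA', 'G2': '', '': 'SymB'}
symbol_to_ensembl = invert_mapping(example)
```

I used sets because the same id can't usefully appear twice under one symbol. If you only want gene ids in the reverse lookup, filter the keys before inverting.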
Conclusion
Writing a project in two languages is painful. With luck, this information was the one piece you were missing, and now you can avoid one language entirely. If not, hopefully it at least saved you some time.
Footnotes
1. biomart is a Python package that interacts with BioMart web servers, not to be mistaken for biomaRt, an R package for the same thing. I tried to get the capitalization correct to minimize confusion, but capitalization is hard.