An in-depth overview of your genetic code using the SciPy Stack

Bioinformatics involves creating methods and software to help us understand biological data.

It’s like data science, but for biology.

Genomics data, especially from next-generation DNA sequencing (NGS), is increasing rapidly. Stephens, Zachary D et al. shows that we’re now gathering genomics data at an exabyte-per-year rate.

The Comprehensive Introduction To Your Genome With SciPy
SciPy gathers many Python modules for scientific computing, which is ideal for many bioinformatic needs.

This post demonstrates how to analyze human genome data in a GFF3 file using the SciPy Stack. The Generic Feature Format Version 3 (GFF3) is the standard format for storing genomic features. This post will show you how to use SciPy to answer these questions:

  1. How much of the human genome is incomplete?
  2. How many genes are in the human genome?
  3. What is the average size of a gene?
  4. How are genes distributed among chromosomes?

You can get the latest human genome GFF3 file from here. The README. This directory also includes a file explaining the GFF3 format, and you can find a complete specification here.

We’ll use Pandas, a powerful SciPy tool, to work with the GFF3 data.

Setting Up

First, we’ll set up a virtual environment and install SciPy. This can take a while if done manually because SciPy has many packages. Using Miniconda simplifies the process.

1
2
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b

The -b flag tells bash to run in batch mode. After installing Miniconda, create a virtual environment for genomics and install SciPy.

1
2
3
mkdir -p genomics
cd genomics
conda create -p venv ipython matplotlib pandas

This installs the three packages needed for this post. You can add more SciPy packages to the conda create command.

Use conda search to find package names. Now, activate the virtual environment and start IPython.

1
2
source activate venv/
ipython

IPython is a more powerful alternative to the standard Python interpreter. Anything you do in the default interpreter, you can do in IPython. If you haven’t tried it, I highly recommend it.

Downloading the Annotation File

Now that we’re set up, let’s download the human genome annotation (GFF3 format).

It’s around 37 MB, much smaller than the entire human genome (about 3 GB in plain text). That’s because it only contains annotations, while the actual sequence data is stored in the FASTA format. You can download the FASTA file here, but we won’t be using it in this tutorial.

1
!wget ftp://ftp.ensembl.org/pub/release-85/gff3/homo_sapiens/Homo_sapiens.GRCh38.85.gff3.gz

The ! tells IPython that this is a shell command, not a Python command. However, IPython can process common shell commands like ls, pwd, rm, mkdir, and rmdir without the !.

Looking at the beginning of the GFF file, you’ll see lines starting with ## or #! containing metadata/pragmas/directives.

According to the README, ## indicates stable metadata, while #! indicates experimental metadata.

You’ll also see ###, which has a more specific meaning defined in the specification.

Comments meant for humans appear after a single #. For this analysis, we’ll ignore all lines starting with #.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
##gff-version   3
##sequence-region   1 1 248956422
##sequence-region   10 1 133797422
##sequence-region   11 1 135086622
##sequence-region   12 1 133275309
...
##sequence-region   MT 1 16569
##sequence-region   X 1 156040895
##sequence-region   Y 2781480 56887902
#!genome-build  GRCh38.p7
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.22
#!genebuild-last-updated 2016-06

The first line indicates that the file uses GFF version 3.

Following that are summaries of all sequence regions, which are also found later in the file.

Lines beginning with #! provide information about the specific genome build this annotation file corresponds to, which is GRCh38.p7.

The Genome Reference Consortium (GCR) is an international group that manages updates to reference genome assemblies for various species, including humans, mice, zebrafish, and chickens.

Here are the first few annotation lines in the file:

1
2
3
4
5
6
7
8
1       GRCh38  chromosome      1       248956422       .       .       .       ID=chromosome:1;Alias=CM000663.2,chr1,NC_000001.11
###
1       .       biological_region       10469   11240   1.3e+03 .       .       external_name=oe %3D 0.79;logic_name=cpg
1       .       biological_region       10650   10657   0.999   +       .       logic_name=eponine
1       .       biological_region       10655   10657   0.999   -       .       logic_name=eponine
1       .       biological_region       10678   10687   0.999   +       .       logic_name=eponine
1       .       biological_region       10681   10688   0.999   -       .       logic_name=eponine
...

The columns are seqid, source, type, start, end, score, strand, phase, attributes. Some are self-explanatory. For example:

1
1       GRCh38  chromosome      1       248956422       .       .       .       ID=chromosome:1;Alias=CM000663.2,chr1,NC_000001.11

This line represents the first chromosome, with a seqid of 1. It starts at the first base and extends to the 24,895,622nd base, meaning the chromosome is about 25 million bases long.

Our analysis won’t use the score, strand, and phase columns, so we can ignore them. The attributes column tells us that Chromosome 1 has three aliases: CM000663.2, chr1, and NC_000001.11. That’s the basic structure of a GFF3 file. We won’t examine every line; instead, let’s load the whole file into Pandas.

Pandas is ideal for working with GFF3 files because they are tab-delimited, and Pandas excels at handling CSV-like formats.

One exception is when the GFF3 file contains ##FASTA.

As explained in the specification, ##FASTA marks the end of an annotation section and the beginning of one or more sequences in FASTA format (which is not tab-delimited). However, our GFF3 file doesn’t have this.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
In [1]: import pandas as pd


In [2]: pd.__version__
Out[2]: '0.18.1'


In [3]: col_names = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
Out[3]: df = pd.read_csv('Homo_sapiens.GRCh38.85.gff3.gz', compression='gzip',
                         sep='\t', comment='#', low_memory=False,
                         header=None, names=col_names)

This code snippet uses pandas.read_csv to load the GFF3 file.

Since it’s not a standard CSV, we need some adjustments.

header=None tells Pandas that the file lacks a header.

names=col_names assigns specific names to each column. Without this, Pandas would use numbers.

sep='\t' indicates that tabs separate the columns, not commas. You can use regular expressions (regex) for more complex separators. comment='#' tells Pandas to ignore lines starting with #.

compression='gzip' specifies that the file is gzip-compressed.

pandas.read_csv offers many options for reading different CSV-like formats.

The result is a DataFrame, Pandas’s primary data structure for representing 2D data.

Pandas also provides Series and Panel for 1D and 3D data, respectively. The documentation introduces Pandas’s data structures.

Let’s view the first few entries with .head:

1
2
3
4
5
6
7
8
In [18]: df.head()
Out[18]:
  seqid  source               type  start        end    score strand phase                                         attributes
0     1  GRCh38         chromosome      1  248956422        .      .     .  ID=chromosome:1;Alias=CM000663.2,chr1,NC_00000...
1     1       .  biological_region  10469      11240  1.3e+03      .     .           external_name=oe %3D 0.79;logic_name=cpg
2     1       .  biological_region  10650      10657    0.999      +     .                                 logic_name=eponine
3     1       .  biological_region  10655      10657    0.999      -     .                                 logic_name=eponine
4     1       .  biological_region  10678      10687    0.999      +     .                                 logic_name=eponine

The output is neatly formatted, with long strings in the attributes column truncated and replaced with ....

To display full strings, use pd.set_option('display.max_colwidth', -1). Pandas offers many customizable options.

Now, let’s get basic information about the dataframe with .info:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
In [20]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2601849 entries, 0 to 2601848
Data columns (total 9 columns):
seqid         object
source        object
type          object
start         int64
end           int64
score         object
strand        object
phase         object
attributes    object
dtypes: int64(2), object(7)
memory usage: 178.7+ MB

This tells us that the GFF3 file has 2,601,848 lines, each with nine columns.

We also see the data type for each column.

Start and end are integers (int64) representing genome positions.

Other columns have the object type, meaning they likely contain a mix of integers, floats, and strings.

The data occupies 178.7+ MB of memory, making it more compact than the uncompressed file (around 402 MB), as confirmed below:

1
2
gunzip -c Homo_sapiens.GRCh38.85.gff3.gz > /tmp/tmp.gff3 && du -s /tmp/tmp.gff3
402M    /tmp/tmp.gff3

We’ve loaded the entire GFF3 file into a Python DataFrame. All subsequent analysis will be based on this object.

Let’s examine the first column, seqid.

1
2
3
4
5
6
7
8
9
In [29]: df.seqid.unique()
Out[29]:
array(['1', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19',
       '2', '20', '21', '22', '3', '4', '5', '6', '7', '8', '9',
       'GL000008.2', 'GL000009.2', 'GL000194.1', 'GL000195.1',
       ...
       'KI270757.1', 'MT', 'X', 'Y'], dtype=object)
In [30]: df.seqid.unique().shape
Out[30]: (194,)

Accessing column data in a dataframe can be done with df.seqid or df['seqid']. The latter is more versatile, especially when column names are Python keywords (like class) or contain periods or spaces.

The output shows 194 unique seqids: Chromosomes 1 to 22, X, Y, mitochondrion (MT) DNA, and 169 others.

Seqids starting with KI and GL are DNA segments – or scaffolds – that haven’t been assembled into chromosomes.

This is a crucial point in genomics. While the first draft of the human genome was published over 15 years ago, it remains incomplete. Assembling these sequences is challenging, largely due to complex repetitive regions in the genome.

Now, let’s look at the source column.

The README file explains that the source describes the algorithm or process that generated the feature.

1
2
3
4
5
6
7
8
9
In [66]: df.source.value_counts()
Out[66]:
havana            1441093
ensembl_havana     745065
ensembl            228212
.                  182510
mirbase              4701
GRCh38                194
insdc                  74

This demonstrates the value_counts method, invaluable for quickly counting categorical variables.

We see seven unique values in this column. Most GFF3 entries originate from havana, ensembl, and ensembl_havana.

You can delve deeper into these sources and their relationships in this post.

For simplicity, we’ll focus on entries from GRCh38, havana, ensembl, and ensembl_havan.a.

How Much of the Genome Is Incomplete?

Entries from source GRCh38 contain information about each chromosome. Let’s filter out the rest and store the result in a new variable, gdf.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
In [70]: gdf = df[df.source == 'GRCh38']
In [87]: gdf.shape
Out[87]: (194, 9)
In [84]: gdf.sample(10)
Out[84]:
              seqid  source         type  start        end score strand phase                                         attributes
2511585  KI270708.1  GRCh38  supercontig      1     127682     .      .     .  ID=supercontig:KI270708.1;Alias=chr1_KI270708v...
2510840  GL000208.1  GRCh38  supercontig      1      92689     .      .     .  ID=supercontig:GL000208.1;Alias=chr5_GL000208v...
990810           17  GRCh38   chromosome      1   83257441     .      .     .  ID=chromosome:17;Alias=CM000679.2,chr17,NC_000...
2511481  KI270373.1  GRCh38  supercontig      1       1451     .      .     .  ID=supercontig:KI270373.1;Alias=chrUn_KI270373...
2511490  KI270384.1  GRCh38  supercontig      1       1658     .      .     .  ID=supercontig:KI270384.1;Alias=chrUn_KI270384...
2080148           6  GRCh38   chromosome      1  170805979     .      .     .  ID=chromosome:6;Alias=CM000668.2,chr6,NC_00000...
2511504  KI270412.1  GRCh38  supercontig      1       1179     .      .     .  ID=supercontig:KI270412.1;Alias=chrUn_KI270412...
1201561          19  GRCh38   chromosome      1   58617616     .      .     .  ID=chromosome:19;Alias=CM000681.2,chr19,NC_000...
2511474  KI270340.1  GRCh38  supercontig      1       1428     .      .     .  ID=supercontig:KI270340.1;Alias=chrUn_KI270340...
2594560           Y  GRCh38   chromosome  2781480  56887902     .      .     .  ID=chromosome:Y;Alias=CM000686.2,chrY,NC_00002...

Filtering is straightforward in Pandas.

The expression df.source == 'GRCh38' generates a series of True and False values for each entry, aligned with df’s index. Using this in df[] selects only entries with a corresponding True value.

There are 194 keys in df[] where df.source == 'GRCh38'.

Since there are 194 unique seqids, each entry in gdf represents a specific seqid.

We randomly choose 10 entries with the sample method for examination.

Unassembled sequences have a type of supercontig, while others are classified as chromosomes. To calculate the proportion of the genome that’s incomplete, we need the total genome length (the sum of all seqid lengths).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
In [90]: gdf = gdf.copy()
In [91]: gdf['length'] = gdf.end - gdf.start + 1
In [93]: gdf.head()
Out[93]:
       seqid  source        type  start        end score strand phase                                         attributes     length
0          1  GRCh38  chromosome      1  248956422     .      .     .  ID=chromosome:1;Alias=CM000663.2,chr1,NC_00000...  248956421
235068    10  GRCh38  chromosome      1  133797422     .      .     .  ID=chromosome:10;Alias=CM000672.2,chr10,NC_000...  133797421
328938    11  GRCh38  chromosome      1  135086622     .      .     .  ID=chromosome:11;Alias=CM000673.2,chr11,NC_000...  135086621
483370    12  GRCh38  chromosome      1  133275309     .      .     .  ID=chromosome:12;Alias=CM000674.2,chr12,NC_000...  133275308
634486    13  GRCh38  chromosome      1  114364328     .      .     .  ID=chromosome:13;Alias=CM000675.2,chr13,NC_000...  114364327
In [97]: gdf.length.sum()
Out[97]: 3096629532
In [99]: chrs = [str(_) for _ in range(1, 23)] + ['X', 'Y', 'MT']
In [101]: gdf[-gdf.seqid.isin(chrs)].length.sum() / gdf.length.sum()
Out[101]: 0.0037021917421198327

First, we create a copy of gdf with .copy(). Directly modifying a slice of df would trigger a SettingWithCopyWarning (see here for more details).

We calculate each entry’s length and add it to gdf as a new column called “length”. The total length is roughly 3.1 billion bases, with unassembled sequences constituting about 0.37%.

Here’s how slicing works in the last two commands.

We create a list, chrs, containing all well-assembled sequences (chromosomes and mitochondria). We use the isin method to filter for entries whose seqid is in chrs.

The minus sign (-) before the index reverses the selection, keeping entries not in the list (the unassembled sequences starting with KI and GL).

Note: We could also filter by the type column to achieve the same result:

1
gdf[(gdf['type'] == 'supercontig')].length.sum() / gdf.length.sum()

How Many Genes Are There?

Now, we’ll focus on entries from sources ensembl, havana, and ensembl_havana, which contain most annotation entries.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
In [109]: edf = df[df.source.isin(['ensembl', 'havana', 'ensembl_havana'])]
In [111]: edf.sample(10)
Out[111]:
        seqid          source  type      start        end score strand phase                                         attributes
915996     16          havana   CDS   27463541   27463592     .      -     2  ID=CDS:ENSP00000457449;Parent=transcript:ENST0...
2531429     X          havana  exon   41196251   41196359     .      +     .  Parent=transcript:ENST00000462850;Name=ENSE000...
1221944    19  ensembl_havana   CDS    5641740    5641946     .      +     0  ID=CDS:ENSP00000467423;Parent=transcript:ENST0...
243070     10          havana  exon   13116267   13116340     .      +     .  Parent=transcript:ENST00000378764;Name=ENSE000...
2413583     8  ensembl_havana  exon  144359184  144359423     .      +     .  Parent=transcript:ENST00000530047;Name=ENSE000...
2160496     6          havana  exon  111322569  111322678     .      -     .  Parent=transcript:ENST00000434009;Name=ENSE000...
839952     15          havana  exon   76227713   76227897     .      -     .  Parent=transcript:ENST00000565910;Name=ENSE000...
957782     16  ensembl_havana  exon   67541653   67541782     .      +     .  Parent=transcript:ENST00000379312;Name=ENSE000...
1632979    21  ensembl_havana  exon   37840658   37840709     .      -     .  Parent=transcript:ENST00000609713;Name=ENSE000...
1953399     4          havana  exon  165464390  165464586     .      +     .  Parent=transcript:ENST00000511992;Name=ENSE000...
In [123]: edf.type.value_counts()
Out[123]:
exon                             1180596
CDS                               704604
five_prime_UTR                    142387
three_prime_UTR                   133938
transcript                         96375
gene                               42470
processed_transcript               28228
...
Name: type, dtype: int64

Again, we use isin for filtering.

A quick value count reveals that most entries are exons, coding sequences (CDS), and untranslated regions (UTR). These are sub-gene elements.

We see 42,470 genes, but let’s dig deeper. What are their names and functions? This information is in the attributes column.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
In [127]: ndf = edf[edf.type == 'gene']
In [173]: ndf = ndf.copy()
In [133]: ndf.sample(10).attributes.values
Out[133]:
array(['ID=gene:ENSG00000228611;Name=HNF4GP1;biotype=processed_pseudogene;description=hepatocyte nuclear factor 4 gamma pseudogene 1 [Source:HGNC Symbol%3BAcc:HGNC:35417];gene_id=ENSG00000228611;havana_gene=OTTHUMG00000016986;havana_version=2;logic_name=havana;version=2',
       'ID=gene:ENSG00000177189;Name=RPS6KA3;biotype=protein_coding;description=ribosomal protein S6 kinase A3 [Source:HGNC Symbol%3BAcc:HGNC:10432];gene_id=ENSG00000177189;havana_gene=OTTHUMG00000021231;havana_version=5;logic_name=ensembl_havana_gene;version=12',
       'ID=gene:ENSG00000231748;Name=RP11-227H15.5;biotype=antisense;gene_id=ENSG00000231748;havana_gene=OTTHUMG00000018373;havana_version=1;logic_name=havana;version=1',
       'ID=gene:ENSG00000227426;Name=VN1R33P;biotype=unitary_pseudogene;description=vomeronasal 1 receptor 33 pseudogene [Source:HGNC Symbol%3BAcc:HGNC:37353];gene_id=ENSG00000227426;havana_gene=OTTHUMG00000154474;havana_version=1;logic_name=havana;version=1',
       'ID=gene:ENSG00000087250;Name=MT3;biotype=protein_coding;description=metallothionein 3 [Source:HGNC Symbol%3BAcc:HGNC:7408];gene_id=ENSG00000087250;havana_gene=OTTHUMG00000133282;havana_version=3;logic_name=ensembl_havana_gene;version=8',
       'ID=gene:ENSG00000177108;Name=ZDHHC22;biotype=protein_coding;description=zinc finger DHHC-type containing 22 [Source:HGNC Symbol%3BAcc:HGNC:20106];gene_id=ENSG00000177108;havana_gene=OTTHUMG00000171575;havana_version=3;logic_name=ensembl_havana_gene;version=5',
       'ID=gene:ENSG00000249784;Name=SCARNA22;biotype=scaRNA;description=small Cajal body-specific RNA 22 [Source:HGNC Symbol%3BAcc:HGNC:32580];gene_id=ENSG00000249784;logic_name=ncrna;version=1',
       'ID=gene:ENSG00000079101;Name=CLUL1;biotype=protein_coding;description=clusterin like 1 [Source:HGNC Symbol%3BAcc:HGNC:2096];gene_id=ENSG00000079101;havana_gene=OTTHUMG00000178252;havana_version=7;logic_name=ensembl_havana_gene;version=16',
       'ID=gene:ENSG00000229224;Name=AC105398.3;biotype=antisense;gene_id=ENSG00000229224;havana_gene=OTTHUMG00000152025;havana_version=1;logic_name=havana;version=1',
       'ID=gene:ENSG00000255552;Name=LY6G6E;biotype=protein_coding;description=lymphocyte antigen 6 complex%2C locus G6E (pseudogene) [Source:HGNC Symbol%3BAcc:HGNC:13934];gene_id=ENSG00000255552;havana_gene=OTTHUMG00000166419;havana_version=1;logic_name=ensembl_havana_gene;version=7'], dtype=object)

This column contains semicolon-separated tag-value pairs. We’ll extract gene names, IDs, and descriptions using regular expressions.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
import re


RE_GENE_NAME = re.compile(r'Name=(?P<gene_name>.+?);')
def extract_gene_name(attributes_str):
    res = RE_GENE_NAME.search(attributes_str)
    return res.group('gene_name')


ndf['gene_name'] = ndf.attributes.apply(extract_gene_name)

First, we extract gene names.

In the regex Name=(?P<gene_name>.+?); , +? ensures a non-greedy match, stopping at the first semicolon. Otherwise, it would match up to the last semicolon.

We compile the regex with re.compile for efficiency, as we’ll apply it to many attribute strings.

extract_gene_name is a helper function used with pd.apply, which applies a function to each DataFrame or Series entry.

In this case, we extract the gene name for each entry in ndf.attributes and add it to ndf as a new column called gene_name.

Gene IDs and descriptions are extracted similarly.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
RE_GENE_ID = re.compile(r'gene_id=(?P<gene_id>ENSG.+?);')
def extract_gene_id(attributes_str):
    res = RE_GENE_ID.search(attributes_str)
    return res.group('gene_id')


ndf['gene_id'] = ndf.attributes.apply(extract_gene_id)


RE_DESC = re.compile('description=(?P<desc>.+?);')
def extract_description(attributes_str):
    res = RE_DESC.search(attributes_str)
    if res is None:
        return ''
    else:
        return res.group('desc')


ndf['desc'] = ndf.attributes.apply(extract_description)

RE_GENE_ID is more specific, as every gene_id must start with ENSG, where ENS represents ensembl and G stands for gene.

We return an empty string for entries without a description. As we won’t need the attributes column anymore, let’s remove it with .drop:

1
2
3
4
5
6
7
8
9
In [224]: ndf.drop('attributes', axis=1, inplace=True)
In [225]: ndf.head()
Out[225]:
   seqid          source  type  start    end score strand phase          gene_id gene_name                                               desc
16     1          havana  gene  11869  14409     .      +     .  ENSG00000223972   DDX11L1  DEAD/H-box helicase 11 like 1 [Source:HGNC Sym...
28     1          havana  gene  14404  29570     .      -     .  ENSG00000227232    WASH7P  WAS protein family homolog 7 pseudogene [Sourc...
71     1          havana  gene  52473  53312     .      +     .  ENSG00000268020    OR4G4P  olfactory receptor family 4 subfamily G member...
74     1          havana  gene  62948  63887     .      +     .  ENSG00000240361   OR4G11P  olfactory receptor family 4 subfamily G member...
77     1  ensembl_havana  gene  69091  70008     .      +     .  ENSG00000186092     OR4F5  olfactory receptor family 4 subfamily F member...

Here, attributes specifies the column to drop.

axis=1 indicates column removal (the default is axis=0 for rows).

inplace=True modifies the DataFrame directly instead of creating a copy.

A quick .head check confirms that the attributes column is gone, replaced by gene_name, gene_id, and desc.

Out of curiosity, let’s see if gene_id and gene_name are unique:

1
2
3
4
5
6
In [232]: ndf.shape
Out[232]: (42470, 11)
In [233]: ndf.gene_id.unique().shape
Out[233]: (42470,)
In [234]: ndf.gene_name.unique().shape
Out[234]: (42387,)

There are fewer gene names than gene IDs, implying that some gene names are associated with multiple gene IDs. Let’s find them.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
In [243]: count_df = ndf.groupby('gene_name').count().ix[:, 0].sort_values().ix[::-1]
In [244]: count_df.head(10)
Out[244]:
gene_name
SCARNA20           7
SCARNA16           6
SCARNA17           5
SCARNA15           4
SCARNA21           4
SCARNA11           4
Clostridiales-1    3
SCARNA4            3
C1QTNF9B-AS1       2
C11orf71           2
Name: seqid, dtype: int64
In [262]: count_df[count_df > 1].shape
Out[262]: (63,)
In [263]: count_df.shape
Out[263]: (42387,)
In [264]: count_df[count_df > 1].shape[0] / count_df.shape[0]
Out[264]: 0.0014863047632528842

We group entries by gene_name and count the items in each group with .count().

Inspecting the output of ndf.groupby('gene_name').count() shows that all columns are counted for each group, but most have the same value.

Note that NA values are excluded from the count. We only need the count from the first column, seqid, which has no NA values (ensured by .ix[:, 0]).

We then sort the counts using .sort_values and reverse the order with .ix[::-1].

The result shows that a gene name can be shared by up to seven gene IDs.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
In [255]: ndf[ndf.gene_name == 'SCARNA20']
Out[255]:
        seqid   source  type      start        end score strand phase          gene_id gene_name                                                                    desc
179399   1     ensembl  gene  171768070  171768175  .     +      .     ENSG00000253060  SCARNA20  Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
201037   1     ensembl  gene  204727991  204728106  .     +      .     ENSG00000251861  SCARNA20  Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
349203   11    ensembl  gene  8555016    8555146    .     +      .     ENSG00000252778  SCARNA20  Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
718520   14    ensembl  gene  63479272   63479413   .     +      .     ENSG00000252800  SCARNA20  Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
837233   15    ensembl  gene  75121536   75121666   .     -      .     ENSG00000252722  SCARNA20  Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
1039874  17    ensembl  gene  28018770   28018907   .     +      .     ENSG00000251818  SCARNA20  Small Cajal body specific RNA 20 [Source:RFAM%3BAcc:RF00601]
1108215  17    ensembl  gene  60231516   60231646   .     -      .     ENSG00000252577  SCARNA20  small Cajal body-specific RNA 20 [Source:HGNC Symbol%3BAcc:HGNC:32578]

Examining the SCARNA20 genes reveals that they are distinct, located at different genomic positions despite sharing the same name. Their descriptions aren’t helpful in differentiating them.

Key takeaway: Gene names are not always unique to gene IDs, with about 0.15% shared by multiple genes.

How Long Is a Typical Gene?

Let’s add a length column to ndf:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
In [277]: ndf['length'] = ndf.end - ndf.start + 1
In [278]: ndf.length.describe()
Out[278]:
count    4.247000e+04
mean     3.583348e+04
std      9.683485e+04
min      8.000000e+00
25%      8.840000e+02
50%      5.170500e+03
75%      3.055200e+04
max      2.304997e+06
Name: length, dtype: float64

.describe() provides basic length statistics:

  • Mean gene length: ~36,000 bases

  • Median gene length: ~5,200 bases

  • Minimum gene length: 8 bases

  • Maximum gene length: ~2.3 million bases

The mean being much larger than the median suggests a right-skewed length distribution. Let’s visualize this.

1
2
3
4
5
import matplotlib as plt


ndf.length.plot(kind='hist', bins=50, logy=True)
plt.show()

Pandas simplifies matplotlib integration, making DataFrame and Series plotting easy.

Here, kind='hist' requests a histogram with 50 bins and a logarithmic y-axis (logy=True).

The histogram shows that most genes fall within the first bin. However, some are over two million bases long. Let’s identify them:

1
2
3
4
5
6
7
8
9
In [39]: ndf[ndf.length > 2e6].sort_values('length').ix[::-1]
Out[39]:
        seqid          source  type      start        end score strand phase gene_name          gene_id                                               desc   length
2309345     7  ensembl_havana  gene  146116002  148420998     .      +     .   CNTNAP2  ENSG00000174469  contactin associated protein-like 2 [Source:HG...  2304997
2422510     9  ensembl_havana  gene    8314246   10612723     .      -     .     PTPRD  ENSG00000153707  protein tyrosine phosphatase%2C receptor type ...  2298478
2527169     X  ensembl_havana  gene   31097677   33339441     .      -     .       DMD  ENSG00000198947    dystrophin [Source:HGNC Symbol%3BAcc:HGNC:2928]  2241765
440886     11  ensembl_havana  gene   83455012   85627922     .      -     .      DLG2  ENSG00000150672  discs large MAGUK scaffold protein 2 [Source:H...  2172911
2323457     8  ensembl_havana  gene    2935353    4994972     .      -     .     CSMD1  ENSG00000183117  CUB and Sushi multiple domains 1 [Source:HGNC ...  2059620
1569914    20  ensembl_havana  gene   13995369   16053197     .      +     .   MACROD2  ENSG00000172264  MACRO domain containing 2 [Source:HGNC Symbol%...  2057829

The longest gene is CNTNAP2 (contactin-associated protein-like 2). According to its wikipedia page:

This gene spans almost 1.6% of chromosome 7 and is one of the largest in the human genome.

Our analysis confirms this! Now, let’s look at the shortest genes:

1
2
3
4
5
6
7
8
In [40]: ndf.sort_values('length').head()
Out[40]:
        seqid  source  type      start        end score strand phase   gene_name          gene_id                                               desc  length
682278     14  havana  gene   22438547   22438554     .      +     .       TRDD1  ENSG00000223997  T cell receptor delta diversity 1 [Source:HGNC...       8
682282     14  havana  gene   22439007   22439015     .      +     .       TRDD2  ENSG00000237235  T cell receptor delta diversity 2 [Source:HGNC...       9
2306836     7  havana  gene  142786213  142786224     .      +     .       TRBD1  ENSG00000282431  T cell receptor beta diversity 1 [Source:HGNC ...      12
682286     14  havana  gene   22449113   22449125     .      +     .       TRDD3  ENSG00000228985  T cell receptor delta diversity 3 [Source:HGNC...      13
1879625     4  havana  gene   10238213   10238235     .      -     .  AC006499.9  ENSG00000271544                                                         23

The length difference between the extremes is five orders of magnitude (2.3 million vs. 8), highlighting life’s incredible diversity.

A single gene can produce multiple proteins through alternative splicing, a concept we haven’t explored here. The GFF3 file contains this information, but it’s beyond this post’s scope.

Gene Distribution Among Chromosomes

Finally, let’s analyze gene distribution across chromosomes, which also demonstrates the .merge method for combining DataFrames. Intuitively, longer chromosomes should contain more genes. Let’s see.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
In [53]: ndf = ndf[ndf.seqid.isin(chrs)]
In [54]: chr_gene_counts = ndf.groupby('seqid').count().ix[:, 0].sort_values().ix[::-1]
Out[54]: chr_gene_counts
seqid
1     3902
2     2806
11    2561
19    2412
17    2280
3     2204
6     2154
12    2140
7     2106
5     2002
16    1881
X     1852
4     1751
9     1659
8     1628
10    1600
15    1476
14    1449
22     996
20     965
13     872
18     766
21     541
Y      436
Name: source, dtype: int64

We reuse the chrs variable to exclude unassembled sequences. The output confirms that the largest chromosome, Chromosome 1, has the most genes. While Chromosome Y has the fewest genes, it’s not the smallest chromosome.

Surprisingly, there seem to be no genes in the mitochondrion (MT). This is incorrect.

Filtering the original df DataFrame reveals that all MT genes come from the insdc source, which we excluded when creating edf (focusing on havana, ensembl, and ensembl_havana).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
In [134]: df[(df.type == 'gene') & (df.seqid == 'MT')]
Out[134]:
        seqid source  type  start    end score strand phase                                         attributes
        2514003    MT  insdc  gene    648   1601     .      +     .  ID=gene:ENSG00000211459;Name=MT-RNR1;biotype=M...
        2514009    MT  insdc  gene   1671   3229     .      +     .  ID=gene:ENSG00000210082;Name=MT-RNR2;biotype=M...
        2514016    MT  insdc  gene   3307   4262     .      +     .  ID=gene:ENSG00000198888;Name=MT-ND1;biotype=pr...
        2514029    MT  insdc  gene   4470   5511     .      +     .  ID=gene:ENSG00000198763;Name=MT-ND2;biotype=pr...
        2514048    MT  insdc  gene   5904   7445     .      +     .  ID=gene:ENSG00000198804;Name=MT-CO1;biotype=pr...
        2514058    MT  insdc  gene   7586   8269     .      +     .  ID=gene:ENSG00000198712;Name=MT-CO2;biotype=pr...
        2514065    MT  insdc  gene   8366   8572     .      +     .  ID=gene:ENSG00000228253;Name=MT-ATP8;biotype=p...
        2514069    MT  insdc  gene   8527   9207     .      +     .  ID=gene:ENSG00000198899;Name=MT-ATP6;biotype=p...
        2514073    MT  insdc  gene   9207   9990     .      +     .  ID=gene:ENSG00000198938;Name=MT-CO3;biotype=pr...
        2514080    MT  insdc  gene  10059  10404     .      +     .  ID=gene:ENSG00000198840;Name=MT-ND3;biotype=pr...
        2514087    MT  insdc  gene  10470  10766     .      +     .  ID=gene:ENSG00000212907;Name=MT-ND4L;biotype=p...
        2514091    MT  insdc  gene  10760  12137     .      +     .  ID=gene:ENSG00000198886;Name=MT-ND4;biotype=pr...
        2514104    MT  insdc  gene  12337  14148     .      +     .  ID=gene:ENSG00000198786;Name=MT-ND5;biotype=pr...
        2514108    MT  insdc  gene  14149  14673     .      -     .  ID=gene:ENSG00000198695;Name=MT-ND6;biotype=pr...
        2514115    MT  insdc  gene  14747  15887     .      +     .  ID=gene:ENSG00000198727;Name=MT-CYB;biotype=pr...

This snippet demonstrates combining conditions during filtering with & (logical “and”). The logical “or” operator is |.

Parentheses around each condition are mandatory, unlike in standard Python, which uses literal and and or.

Let’s use the gdf DataFrame from earlier to get each chromosome’s length:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
In [61]: gdf = gdf[gdf.seqid.isin(chrs)]
In [62]: gdf.drop(['start', 'end', 'score', 'strand', 'phase' ,'attributes'], axis=1, inplace=True)
In [63]: gdf.sort_values('length').ix[::-1]
Out[63]:
        seqid  source        type     length
0           1  GRCh38  chromosome  248956422
1364641     2  GRCh38  chromosome  242193529
1705855     3  GRCh38  chromosome  198295559
1864567     4  GRCh38  chromosome  190214555
1964921     5  GRCh38  chromosome  181538259
2080148     6  GRCh38  chromosome  170805979
2196981     7  GRCh38  chromosome  159345973
2514125     X  GRCh38  chromosome  156040895
2321361     8  GRCh38  chromosome  145138636
2416560     9  GRCh38  chromosome  138394717
328938     11  GRCh38  chromosome  135086622
235068     10  GRCh38  chromosome  133797422
483370     12  GRCh38  chromosome  133275309
634486     13  GRCh38  chromosome  114364328
674767     14  GRCh38  chromosome  107043718
767312     15  GRCh38  chromosome  101991189
865053     16  GRCh38  chromosome   90338345
990810     17  GRCh38  chromosome   83257441
1155977    18  GRCh38  chromosome   80373285
1559144    20  GRCh38  chromosome   64444167
1201561    19  GRCh38  chromosome   58617616
2594560     Y  GRCh38  chromosome   54106423
1647482    22  GRCh38  chromosome   50818468
1616710    21  GRCh38  chromosome   46709983
2513999    MT  GRCh38  chromosome      16569

We drop irrelevant columns for clarity.

.drop can remove multiple columns in one go.

The MT entry remains; we’ll address it shortly. Now, let’s merge the two datasets based on the seqid column.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
In [73]: cdf = chr_gene_counts.to_frame(name='gene_count').reset_index()
In [75]: cdf.head(2)
Out[75]:
  seqid  gene_count
0     1        3902
1     2        2806
In [78]: merged = gdf.merge(cdf, on='seqid')
In [79]: merged
Out[79]:
   seqid  source        type     length  gene_count
0      1  GRCh38  chromosome  248956422        3902
1     10  GRCh38  chromosome  133797422        1600
2     11  GRCh38  chromosome  135086622        2561
3     12  GRCh38  chromosome  133275309        2140
4     13  GRCh38  chromosome  114364328         872
5     14  GRCh38  chromosome  107043718        1449
6     15  GRCh38  chromosome  101991189        1476
7     16  GRCh38  chromosome   90338345        1881
8     17  GRCh38  chromosome   83257441        2280
9     18  GRCh38  chromosome   80373285         766
10    19  GRCh38  chromosome   58617616        2412
11     2  GRCh38  chromosome  242193529        2806
12    20  GRCh38  chromosome   64444167         965
13    21  GRCh38  chromosome   46709983         541
14    22  GRCh38  chromosome   50818468         996
15     3  GRCh38  chromosome  198295559        2204
16     4  GRCh38  chromosome  190214555        1751
17     5  GRCh38  chromosome  181538259        2002
18     6  GRCh38  chromosome  170805979        2154
19     7  GRCh38  chromosome  159345973        2106
20     8  GRCh38  chromosome  145138636        1628
21     9  GRCh38  chromosome  138394717        1659
22     X  GRCh38  chromosome  156040895        1852
23     Y  GRCh38  chromosome   54106423         436

As chr_gene_counts is a Series (which doesn’t support merging), we convert it to a DataFrame with .to_frame.

.reset_index() turns the original index (seqid) into a new column and resets the index to 0-based integers.

cdf.head(2) shows the result. Next, we use .merge to combine the DataFrames based on seqid.

The merged DataFrame still lacks the MT entry because .merge defaults to an inner join. You can perform left, right, or outer joins by adjusting the how parameter (see documentation for details).

You might encounter the related .join method. Both .merge and .join combine DataFrames but have different APIs.

The official documentation states:

The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.

In essence, .merge is more general and used internally by .join.

Finally, let’s calculate the correlation between chromosome length and gene_count.

1
2
3
4
5
In [81]: merged[['length', 'gene_count']].corr()
Out[81]:
              length  gene_count
length      1.000000    0.728221
gene_count  0.728221    1.000000

.corr calculates the Pearson correlation between all column pairs in a DataFrame.

We have only one pair, and the correlation is positive (0.73).

This indicates that larger chromosomes tend to have more genes. Let’s visualize this after sorting by length:

1
2
3
4
5
6
7
8
ax = merged[['length', 'gene_count']].sort_values('length').plot(x='length', y='gene_count', style='o-')
# add some margin to both ends of x axis
xlim = ax.get_xlim()
margin = xlim[0] * 0.1
ax.set_xlim([xlim[0] - margin, xlim[1] + margin])
# Label each point on the graph
for (s, x, y) in merged[['seqid', 'length', 'gene_count']].sort_values('length').values:
    ax.text(x, y - 100, str(s))

While there’s a positive correlation overall, it doesn’t hold true for every chromosome. Notably, Chromosomes 17, 16, 15, 14, and 13 exhibit a negative correlation, meaning the gene count decreases as chromosome size increases.

Findings and Future Research

This concludes our exploration of manipulating a human genome annotation file (GFF3 format) using the SciPy stack, primarily employing IPython, Pandas, and matplotlib. We learned common Pandas operations and uncovered intriguing facts about our genome:

  1. Despite the first draft being over 15 years old, about 0.37% of the human genome remains incomplete.
  2. This particular GFF3 file reveals approximately 42,000 genes in the human genome.
  3. Gene lengths vary drastically, from a few dozen to over two million bases.
  4. Gene distribution across chromosomes is uneven. Generally, larger chromosomes harbor more genes, but some show an inverse correlation.

The GFF3 file is rich in information, and we’ve only scratched the surface. Further exploration could involve these questions:

  1. How many transcripts does a gene typically have? What proportion of genes have multiple transcripts?
  2. What is the typical number of isoforms per transcript?
  3. How many exons, CDS, and UTRs does a transcript typically have? What are their size distributions?
  4. Can we categorize genes based on their function described in the description column?
Licensed under CC BY-NC-SA 4.0