Welcome to gffpandas’s documentation!¶
What is gffpandas?¶
gffpandas is a Python library for working with annotation data. It simplifies processing GFF3 files and extracting the desired annotation entries from them, which makes it easy to use and time-saving.
gffpandas is an alternative to gffutils or bcbio-gff and is inspired by the Python library pandas: it reads in a GFF3 file and builds a data frame from it, so the annotation data is handled through the data frame structure. gffpandas offers several functions for processing a GFF3 file. One big advantage is that these functions can be combined to select exactly the required annotation entries. Furthermore, the processed annotation data can be saved again as a GFF3 file or as a csv or tsv file.
The individual functions are described in How to use gffpandas.
Background information¶
The Python library gffpandas facilitates working with generic feature format version 3 (GFF3) files.
A GFF3 file contains location and attribute information about features, e.g. genes, of DNA, RNA or protein sequences. It has one general format. This format includes a header, whose lines are marked with a hash at the beginning of the line. The header describes metadata about the features. The location and attribute information is described in nine columns, which are the following:
seq_id | source | type | start | end | score | strand | phase | attributes |
- seq_id: identifier of the sequence.
- source: describes how the annotation was generated. Normally, it is a database name or software name.
- type: describes the feature type, e.g. gene, CDS, tRNA, exon etc.
- start: it gives the start position of the feature [base pair (bp)].
- end: it gives the end position of the feature [bp].
- score: describes the score of the feature. It is written as a floating point number.
- strand: indicates whether the feature is encoded on the plus (+) or minus (-) strand. The strand can also be ‘.’ for features which are not stranded or ‘?’ when the strand of the feature is unknown.
- phase: the phase is required for all coding sequence (CDS) features and indicates at which position the CDS begins relative to the reading frame. It can be 0, 1 or 2.
- attributes: the attributes column is written in a ‘tag=value’ format and can contain, for example, the following tags [1]: ID, Dbxref, gbkey, genome, genomic, mol_type, serovar, strain, Name, gene, locus_tag, Parent, Genbank, product, protein_id, transl_table
So far, only files that consist of a single GFF3 data set, i.e. that contain exactly one header, can be used.
[1] https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
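To make the nine-column layout concrete, the following sketch splits one tab-separated annotation line into named fields. It uses only the Python standard library and is an illustration, not part of gffpandas; the names `GffRecord` and `parse_gff3_line` are hypothetical.

```python
from typing import NamedTuple


class GffRecord(NamedTuple):
    """One GFF3 annotation line (hypothetical helper, not gffpandas API)."""
    seq_id: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: str


def parse_gff3_line(line: str) -> GffRecord:
    """Split a tab-separated GFF3 line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 columns, got %d" % len(fields))
    seq_id, source, ftype, start, end, score, strand, phase, attrs = fields
    if strand not in {"+", "-", ".", "?"}:
        raise ValueError("invalid strand: %r" % strand)
    # start and end are integer positions; the other columns stay as text.
    return GffRecord(seq_id, source, ftype, int(start), int(end),
                     score, strand, phase, attrs)


line = ("NC_016810.1\tRefSeq\tgene\t1\t20\t.\t+\t.\t"
        "ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001")
record = parse_gff3_line(line)
print(record.type, record.start, record.end, record.strand)
```

Score and phase are kept as strings here because both may hold the placeholder ‘.’.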
How to use gffpandas¶
Example Tutorial:¶
The following GFF3 file is used as an example to show how gffpandas is used. It contains a header and eleven annotation entries:
##gff-version 3
##sequence-region NC_016810.1 1 20
NC_016810.1 RefSeq region 1 4000 . + . Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=SL1344
NC_016810.1 RefSeq gene 1 20 . + . ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001
NC_016810.1 RefSeq CDS 13 235 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 1 20 . + . ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002
NC_016810.1 RefSeq CDS 341 523 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 1 600 . - . ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_tag=SL1344_0003
NC_016810.1 RefSeq CDS 21 345 . - 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene3;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 41 255 . + . ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_tag=SL1344_0004
NC_016810.1 RefSeq CDS 61 195 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 170 546 . + . ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_tag=SL1344_0005
NC_016810.1 RefSeq CDS 34 335 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene5;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
The library can be imported as follows:
import gffpandas.gffpandas as gffpd
The first step is to read in the GFF3 file with the method ‘read_gff3’. Afterwards, the header (.header) and the data frame (.df) can be accessed:
>>> annotation = gffpd.read_gff3('annotation.gff')
>>> print(annotation.header)
>>> print(annotation.df)
Out[1]:
##gff-version 3
##sequence-region NC_016810.1 1 20
seq_id source type start end score strand phase \
0 NC_016810.1 RefSeq region 1 4000 . + .
1 NC_016810.1 RefSeq gene 1 20 . + .
2 NC_016810.1 RefSeq CDS 13 235 . + 0
3 NC_016810.1 RefSeq gene 1 20 . + .
4 NC_016810.1 RefSeq CDS 341 523 . + 0
5 NC_016810.1 RefSeq gene 1 600 . - .
6 NC_016810.1 RefSeq CDS 21 345 . - 0
7 NC_016810.1 RefSeq gene 41 255 . + .
8 NC_016810.1 RefSeq CDS 61 195 . + 0
9 NC_016810.1 RefSeq gene 170 546 . + .
10 NC_016810.1 RefSeq CDS 34 335 . + 0
attributes
0 Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=ge...
1 ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_...
2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_...
4 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
5 ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_...
6 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_...
8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
9 ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_...
10 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
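The separation into a header and a table of entries can be pictured with a small standard-library sketch. This is not how gffpandas is implemented; `split_gff3` is a hypothetical helper, and the two annotation lines are shortened versions of the example entries.

```python
def split_gff3(text):
    """Return (header_lines, rows) for GFF3 text with a single header."""
    header, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            header.append(line)  # header lines start with a hash
        else:
            rows.append(line.split("\t"))  # nine tab-separated columns
    return header, rows


example = (
    "##gff-version 3\n"
    "##sequence-region NC_016810.1 1 20\n"
    "NC_016810.1\tRefSeq\tgene\t1\t20\t.\t+\t.\tID=gene1;Name=thrL\n"
    "NC_016810.1\tRefSeq\tCDS\t13\t235\t.\t+\t0\tID=cds0;Parent=gene1\n"
)
header, rows = split_gff3(example)
print("\n".join(header))
print(len(rows), "annotation entries")
```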
The created data frame contains all eleven annotation entries and can now be modified. Depending on which annotation entries are desired, different gffpandas methods can be used and combined.
In this example, the user wants to write a GFF3 file that contains only the coding sequences (‘CDS’) with a length of at least 10 bp and at most 250 bp. Therefore, the following methods are combined:
>>> annotation.filter_feature_of_type(['CDS']).filter_by_length(10, 250).to_gff3('temp.gff')
>>> gff_content = open('temp.gff').read()
>>> print(gff_content)
Out[2]:
##gff-version 3
##sequence-region NC_016810.1 1 20
NC_016810.1 RefSeq CDS 13 235 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq CDS 341 523 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq CDS 61 195 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
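Chaining works because each filter returns a new object of the same kind, which in turn offers the same filters. The sketch below illustrates that pattern with a hypothetical `Annotation` class and plain tuples. It is not the gffpandas implementation, and it assumes that the length is computed as end - start with inclusive bounds, which is consistent with the three entries written above.

```python
class Annotation:
    """Hypothetical stand-in for a chainable annotation container."""

    def __init__(self, header, rows):
        self.header = header
        self.rows = rows  # list of (type, start, end) tuples

    def filter_feature_of_type(self, feature_types):
        kept = [row for row in self.rows if row[0] in feature_types]
        return Annotation(self.header, kept)  # new object, header kept

    def filter_by_length(self, min_length, max_length):
        # Assumption: length is end - start and both bounds are inclusive.
        kept = [row for row in self.rows
                if min_length <= row[2] - row[1] <= max_length]
        return Annotation(self.header, kept)


rows = [
    ("region", 1, 4000), ("gene", 1, 20), ("CDS", 13, 235),
    ("gene", 1, 20), ("CDS", 341, 523), ("gene", 1, 600),
    ("CDS", 21, 345), ("gene", 41, 255), ("CDS", 61, 195),
    ("gene", 170, 546), ("CDS", 34, 335),
]
annotation = Annotation(["##gff-version 3"], rows)
selected = annotation.filter_feature_of_type(["CDS"]).filter_by_length(10, 250)
print(selected.rows)
```

The original object is left untouched, so further selections can start from the full set of entries again.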
Methods included in gffpandas:¶
In this subsection, the methods of gffpandas are presented.
filter_feature_of_type¶
For example:
>>> filtered_df = annotation.filter_feature_of_type(['gene'])
>>> print(filtered_df.df)
Out[2]:
seq_id source type start end score strand phase \
1 NC_016810.1 RefSeq gene 1 20 . + .
3 NC_016810.1 RefSeq gene 1 20 . + .
5 NC_016810.1 RefSeq gene 1 600 . - .
7 NC_016810.1 RefSeq gene 41 255 . + .
9 NC_016810.1 RefSeq gene 170 546 . + .
attributes
1 ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_...
3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_...
5 ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_...
7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_...
9 ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_...
filter_by_length¶
For example:
>>> filtered_by_length = annotation.filter_by_length(min_length=10, max_length=300)
>>> print(filtered_by_length.df)
Out[3]:
seq_id source type start end score strand phase \
1 NC_016810.1 RefSeq gene 1 20 . + .
2 NC_016810.1 RefSeq CDS 13 235 . + 0
3 NC_016810.1 RefSeq gene 1 20 . + .
4 NC_016810.1 RefSeq CDS 341 523 . + 0
7 NC_016810.1 RefSeq gene 41 255 . + .
8 NC_016810.1 RefSeq CDS 61 195 . + 0
attributes
1 ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_...
2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_...
4 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_...
8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
get_feature_by_attribute¶
For example:
>>> feature_by_attribute = annotation.get_feature_by_attribute('gbkey', ['CDS'])
>>> print(feature_by_attribute.df)
Out[4]:
seq_id source type start end score strand phase \
2 NC_016810.1 RefSeq CDS 13 235 . + 0
4 NC_016810.1 RefSeq CDS 341 523 . + 0
6 NC_016810.1 RefSeq CDS 21 345 . - 0
8 NC_016810.1 RefSeq CDS 61 195 . + 0
10 NC_016810.1 RefSeq CDS 34 335 . + 0
attributes
2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
4 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
6 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
10 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
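The idea of selecting entries by a tag and a list of accepted values can be sketched with the standard library alone. The helpers `parse_attributes` and `select_by_attribute` are hypothetical and only mimic the behaviour described here; the attribute strings are shortened versions of the example entries.

```python
def parse_attributes(attributes):
    """Turn 'tag=value;tag=value' into a dict of tag -> value."""
    pairs = (item.split("=", 1) for item in attributes.split(";") if "=" in item)
    return {tag: value for tag, value in pairs}


def select_by_attribute(rows, tag, values):
    """Keep the attribute strings whose `tag` has one of the given values."""
    return [row for row in rows if parse_attributes(row).get(tag) in values]


rows = [
    "ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001",
    "ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;transl_table=11",
    "ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002",
]
cds_rows = select_by_attribute(rows, "gbkey", ["CDS"])
missing = select_by_attribute(rows, "gbkey", ["tRNA"])  # value not present
print(len(cds_rows), len(missing))
```

A value that occurs in no entry simply yields an empty selection.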
attributes_to_columns¶
For example:
>>> attr_to_columns = annotation.attributes_to_columns()
>>> print(attr_to_columns)
Out[5]:
seq_id source type start end score strand phase \
0 NC_016810.1 RefSeq region 1 4000 . + .
1 NC_016810.1 RefSeq gene 1 20 . + .
2 NC_016810.1 RefSeq CDS 13 235 . + 0
3 NC_016810.1 RefSeq gene 1 20 . + .
4 NC_016810.1 RefSeq CDS 341 523 . + 0
5 NC_016810.1 RefSeq gene 1 600 . - .
6 NC_016810.1 RefSeq CDS 21 345 . - 0
7 NC_016810.1 RefSeq gene 41 255 . + .
8 NC_016810.1 RefSeq CDS 61 195 . + 0
9 NC_016810.1 RefSeq gene 170 546 . + .
10 NC_016810.1 RefSeq CDS 34 335 . + 0
attributes \
0 Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=ge...
1 ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_...
2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_...
4 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
5 ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_...
6 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_...
8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
9 ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_...
10 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
Dbxref ... gbkey \
0 taxon:216597 ... Src
1 None ... Gene
2 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS
3 None ... Gene
4 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS
5 None ... Gene
6 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS
7 None ... Gene
8 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS
9 None ... Gene
10 UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_00517... ... CDS
gene genome locus_tag mol_type product \
0 None genomic None genomic DNA None
1 thrL None SL1344_0001 None None
2 None None None None thr operon leader peptide
3 thrA None SL1344_0002 None None
4 None None None None thr operon leader peptide
5 thrX None SL1344_0003 None None
6 None None None None thr operon leader peptide
7 thrB None SL1344_0004 None None
8 None None None None thr operon leader peptide
9 thrC None SL1344_0005 None None
10 None None None None thr operon leader peptide
protein_id serovar strain transl_table
0 None Typhimurium SL1344 None
1 None None None None
2 YP_005179941.1 None None 11
3 None None None None
4 YP_005179941.1 None None 11
5 None None None None
6 YP_005179941.1 None None 11
7 None None None None
8 YP_005179941.1 None None 11
9 None None None None
10 YP_005179941.1 None None 11
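The column layout above can be reproduced in miniature: every tag that occurs in any entry becomes a column, and entries without that tag get None. The following standard-library sketch assumes alphabetically sorted columns, which matches the order visible in the output; the function here is a hypothetical stand-in and not the gffpandas method of the same name.

```python
def attributes_to_columns(attribute_strings):
    """Build one dict per entry with a key for every tag seen anywhere."""
    parsed = [
        dict(item.split("=", 1) for item in text.split(";") if "=" in item)
        for text in attribute_strings
    ]
    # Union of all tags; sorted() places upper-case tags first.
    tags = sorted({tag for entry in parsed for tag in entry})
    return [{tag: entry.get(tag) for tag in tags} for entry in parsed]


table = attributes_to_columns([
    "ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001",
    "ID=cds0;Parent=gene1;gbkey=CDS;product=thr operon leader peptide",
])
print(list(table[0]))
print(table[1]["product"], table[1]["gene"])
```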
overlaps_with¶
For example:
>>> overlapings = annotation.overlaps_with(seq_id='NC_016810.1', type='gene',
...                                        start=40, end=300, strand='+')
>>> no_overlap = annotation.overlaps_with(seq_id='NC_016810.1', start=1, end=4000,
...                                       strand='+', complement=True)
>>> print(overlapings.df)
>>> print(no_overlap.df)
Out[6]:
seq_id source type start end score strand phase \
0 NC_016810.1 RefSeq region 1 4000 . + .
2 NC_016810.1 RefSeq CDS 13 235 . + 0
7 NC_016810.1 RefSeq gene 41 255 . + .
8 NC_016810.1 RefSeq CDS 61 195 . + 0
9 NC_016810.1 RefSeq gene 170 546 . + .
10 NC_016810.1 RefSeq CDS 34 335 . + 0
attributes
0 Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=ge...
2 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
7 ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_...
8 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
9 ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_...
10 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:Y...
Out[7]:
Empty DataFrame
Columns: [seq_id, source, type, start, end, score, strand, phase, attributes]
Index: []
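The underlying interval test can be sketched independently of gffpandas. The code below assumes closed intervals (a shared boundary position counts as an overlap) and compares only entries on the requested strand; both points are assumptions of this sketch, and `intervals_overlap` and `overlapping_indices` are hypothetical names.

```python
def intervals_overlap(start_a, end_a, start_b, end_b):
    """True if the closed intervals share at least one position."""
    return start_a <= end_b and start_b <= end_a


def overlapping_indices(features, start, end, strand, complement=False):
    """Indices of same-strand features that overlap (or, with complement, do not)."""
    selected = []
    for index, (f_start, f_end, f_strand) in enumerate(features):
        if f_strand != strand:
            continue
        hit = intervals_overlap(f_start, f_end, start, end)
        if hit != complement:
            selected.append(index)
    return selected


# (start, end, strand) of the eleven example entries, in file order.
features = [
    (1, 4000, "+"), (1, 20, "+"), (13, 235, "+"), (1, 20, "+"),
    (341, 523, "+"), (1, 600, "-"), (21, 345, "-"), (41, 255, "+"),
    (61, 195, "+"), (170, 546, "+"), (34, 335, "+"),
]
print(overlapping_indices(features, 40, 300, "+"))
print(overlapping_indices(features, 1, 4000, "+", complement=True))
```

With these assumptions the first call selects the entries 0, 2, 7, 8, 9 and 10, and the complement of the range 1 to 4000 is empty.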
find_duplicated_entries¶
For example:
>>> redundant_entries = annotation.find_duplicated_entries(seq_id='NC_016810.1', type='gene')
>>> print(redundant_entries.df)
Out[8]:
seq_id source type start end score strand phase \
3 NC_016810.1 RefSeq gene 1 20 . + .
attributes
3 ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_...
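Redundancy in this sense means that a later entry of the same type repeats the start, end and strand of an earlier one. A minimal standard-library sketch of that check follows; `find_duplicates` is a hypothetical helper, and reporting only the repeated occurrence (not the first one) is an assumption chosen to match the output above.

```python
def find_duplicates(rows, feature_type):
    """Indices of entries that repeat an earlier (start, end, strand) key."""
    seen = set()
    duplicates = []
    for index, (ftype, start, end, strand) in enumerate(rows):
        if ftype != feature_type:
            continue
        key = (start, end, strand)
        if key in seen:
            duplicates.append(index)  # a later copy of an earlier entry
        else:
            seen.add(key)
    return duplicates


# (type, start, end, strand) of the eleven example entries, in file order.
rows = [
    ("region", 1, 4000, "+"), ("gene", 1, 20, "+"), ("CDS", 13, 235, "+"),
    ("gene", 1, 20, "+"), ("CDS", 341, 523, "+"), ("gene", 1, 600, "-"),
    ("CDS", 21, 345, "-"), ("gene", 41, 255, "+"), ("CDS", 61, 195, "+"),
    ("gene", 170, 546, "+"), ("CDS", 34, 335, "+"),
]
print(find_duplicates(rows, "gene"))
```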
The following methods of the library won’t return a data frame:
to_gff3¶
For example:
>>> annotation.to_gff3('temp.gff')
>>> gff3_file = open('temp.gff').read()
>>> print(gff3_file)
Out[9]:
##gff-version 3
##sequence-region NC_016810.1 1 20
NC_016810.1 RefSeq region 1 4000 . + . Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=SL1344
NC_016810.1 RefSeq gene 1 20 . + . ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001
NC_016810.1 RefSeq CDS 13 235 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 1 20 . + . ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002
NC_016810.1 RefSeq CDS 341 523 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 1 600 . - . ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_tag=SL1344_0003
NC_016810.1 RefSeq CDS 21 345 . - 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene3;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 41 255 . + . ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_tag=SL1344_0004
NC_016810.1 RefSeq CDS 61 195 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 170 546 . + . ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_tag=SL1344_0005
NC_016810.1 RefSeq CDS 34 335 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene5;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
to_csv¶
For example:
>>> annotation.to_csv('temp.csv')
>>> csv_file = open('temp.csv').read()
>>> print(csv_file)
Out[9]:
seq_id,source,type,start,end,score,strand,phase,attributes
NC_016810.1,RefSeq,region,1,4000,.,+,.,Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=SL1344
NC_016810.1,RefSeq,gene,1,20,.,+,.,ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001
NC_016810.1,RefSeq,CDS,13,235,.,+,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1,RefSeq,gene,1,20,.,+,.,ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002
NC_016810.1,RefSeq,CDS,341,523,.,+,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1,RefSeq,gene,1,600,.,-,.,ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_tag=SL1344_0003
NC_016810.1,RefSeq,CDS,21,345,.,-,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene3;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1,RefSeq,gene,41,255,.,+,.,ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_tag=SL1344_0004
NC_016810.1,RefSeq,CDS,61,195,.,+,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1,RefSeq,gene,170,546,.,+,.,ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_tag=SL1344_0005
NC_016810.1,RefSeq,CDS,34,335,.,+,0,Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene5;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
to_tsv¶
For example:
>>> annotation.to_tsv('temp.tsv')
>>> tsv_file = open('temp.tsv').read()
>>> print(tsv_file)
Out[10]:
seq_id source type start end score strand phase attributes
NC_016810.1 RefSeq region 1 4000 . + . Dbxref=taxon:216597;ID=id0;gbkey=Src;genome=genomic;mol_type=genomic DNA;serovar=Typhimurium;strain=SL1344
NC_016810.1 RefSeq gene 1 20 . + . ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001
NC_016810.1 RefSeq CDS 13 235 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene1;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 1 20 . + . ID=gene2;Name=thrA;gbkey=Gene;gene=thrA;locus_tag=SL1344_0002
NC_016810.1 RefSeq CDS 341 523 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene2;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 1 600 . - . ID=gene3;Name=thrX;gbkey=Gene;gene=thrX;locus_tag=SL1344_0003
NC_016810.1 RefSeq CDS 21 345 . - 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene3;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 41 255 . + . ID=gene4;Name=thrB;gbkey=Gene;gene=thrB;locus_tag=SL1344_0004
NC_016810.1 RefSeq CDS 61 195 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene4;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
NC_016810.1 RefSeq gene 170 546 . + . ID=gene5;Name=thrC;gbkey=Gene;gene=thrC;locus_tag=SL1344_0005
NC_016810.1 RefSeq CDS 34 335 . + 0 Dbxref=UniProtKB%252FTrEMBL:E1W7M4%2CGenbank:YP_005179941.1;ID=cds0;Name=YP_005179941.1;Parent=gene5;gbkey=CDS;product=thr operon leader peptide;protein_id=YP_005179941.1;transl_table=11
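The csv and tsv outputs shown above differ only in the delimiter. The following sketch writes the same rows in both flavours with the standard library `csv` module; it is an illustration rather than the gffpandas implementation, and `write_table` is a hypothetical helper.

```python
import csv
import io

COLUMNS = ["seq_id", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes"]


def write_table(rows, delimiter):
    """Serialize a column header plus rows using the given delimiter."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    ["NC_016810.1", "RefSeq", "gene", 1, 20, ".", "+", ".",
     "ID=gene1;Name=thrL;gbkey=Gene;gene=thrL;locus_tag=SL1344_0001"],
]
csv_text = write_table(rows, ",")
tsv_text = write_table(rows, "\t")
print(csv_text)
print(tsv_text)
```

Unlike the GFF3 output, both tabular formats start with the column names instead of the GFF3 header.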
stats_dic¶
For example:
>>> statistics = annotation.stats_dic()
>>> print(statistics)
Out[11]:
{'Maximal_bp_length': 599, 'Minimal_bp_length': 19, 'Counted_strands': + 9
- 2
Name: strand, dtype: int64, 'Counted_feature_types': gene 5
CDS 5
region 1
Name: type, dtype: int64}
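The numbers above can be checked by hand with a few lines of standard-library Python. Two assumptions are made in this sketch: the length is end - start, and the region entry is left out of the length statistics (the output reports 599 as the maximum although the region spans 1 to 4000). `summarize` is a hypothetical helper, not the gffpandas method.

```python
from collections import Counter


def summarize(rows):
    """Length extremes plus strand and feature-type counts."""
    # Assumption: region entries do not take part in the length statistics.
    lengths = [end - start for ftype, start, end, _ in rows if ftype != "region"]
    return {
        "Maximal_bp_length": max(lengths),
        "Minimal_bp_length": min(lengths),
        "Counted_strands": Counter(strand for _, _, _, strand in rows),
        "Counted_feature_types": Counter(ftype for ftype, _, _, _ in rows),
    }


# (type, start, end, strand) of the eleven example entries, in file order.
rows = [
    ("region", 1, 4000, "+"), ("gene", 1, 20, "+"), ("CDS", 13, 235, "+"),
    ("gene", 1, 20, "+"), ("CDS", 341, 523, "+"), ("gene", 1, 600, "-"),
    ("CDS", 21, 345, "-"), ("gene", 41, 255, "+"), ("CDS", 61, 195, "+"),
    ("gene", 170, 546, "+"), ("CDS", 34, 335, "+"),
]
stats = summarize(rows)
print(stats["Maximal_bp_length"], stats["Minimal_bp_length"])
print(dict(stats["Counted_strands"]), dict(stats["Counted_feature_types"]))
```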
gffpandas API¶
gffpandas.gffpandas module¶
class gffpandas.gffpandas.Gff3DataFrame(input_gff_file=None, input_df=None, input_header=None)¶
Bases: object
This class contains the header information in the header attribute and the actual annotation data as a pandas data frame in the df attribute.
attributes_to_columns() → pandas.core.frame.DataFrame¶
Saving each attribute tag to a single column.
The attribute column is split by tag into separate columns. This method returns only a pandas DataFrame and not a Gff3DataFrame. Therefore, the returned data frame cannot be saved as a GFF3 file.
Returns: pandas data frame in which the attribute column of the GFF3 file is split into the different attribute tags
Return type: pandas DataFrame
filter_by_length(min_length=None, max_length=None) → gffpandas.gffpandas.Gff3DataFrame¶
Filtering the pandas data frame by the gene_length.
For this method the desired minimal and maximal bp length have to be given.
Parameters: min_length, max_length
Returns: original header and data frame with the features whose lengths fit the set parameters, saved as an object of the class Gff3DataFrame
Return type: class ‘gffpandas.gffpandas.Gff3DataFrame’
filter_feature_of_type(feature_type_list) → gffpandas.gffpandas.Gff3DataFrame¶
Filtering the pandas data frame by feature_type.
For this method a list of feature type(s) has to be given, e.g. [‘CDS’, ‘ncRNA’].
Parameters: feature_type_list (list) – List of name(s) of the desired feature(s)
Returns: original header and data frame of the selected features, saved as an object of the class Gff3DataFrame
Return type: class ‘gffpandas.gffpandas.Gff3DataFrame’
find_duplicated_entries(seq_id=None, type=None) → gffpandas.gffpandas.Gff3DataFrame¶
Find entries which are redundant.
For this method the chromosome accession number (seq_id) as well as the feature type have to be given. Then all entries which are redundant according to start and end position as well as strand type will be found.
Parameters: seq_id, type
Returns: original header and data frame containing the duplicated entries, both saved as an object of the class Gff3DataFrame
Return type: class ‘gffpandas.gffpandas.Gff3DataFrame’
get_feature_by_attribute(attr_tag, attr_value_list) → gffpandas.gffpandas.Gff3DataFrame¶
Filtering the pandas data frame by an attribute.
The 9th column of a GFF3 file contains the list of feature attributes in a tag=value format. For this method the desired attribute tag as well as the corresponding value(s) have to be given. If the value is not available, an empty data frame is returned.
Parameters: attr_tag, attr_value_list
Returns: original header and data frame with the entries which contain the desired attribute values, both saved as an object of the class Gff3DataFrame
Return type: class ‘gffpandas.gffpandas.Gff3DataFrame’
overlaps_with(seq_id=None, start=None, end=None, type=None, strand=None, complement=False) → gffpandas.gffpandas.Gff3DataFrame¶
To see which entries overlap with a feature to compare against.
For this method the chromosome accession number has to be given. The start and end bp position of the feature to compare against have to be given as well; optionally, its feature type and whether it is on the sense (+) or antisense (-) strand can be given.
Possible overlaps (see code): [diagram of the overlap cases between two features]
By selecting ‘complement=True’, all features which do not overlap with the feature to compare against are returned.
Parameters: seq_id, start, end, type, strand, complement
Returns: original header and data frame containing the entries which overlap or do not overlap (complement=True) with the given parameters, both saved as an object of the class Gff3DataFrame
Return type: class ‘gffpandas.gffpandas.Gff3DataFrame’
stats_dic() → dict¶
Gives the following statistics for the data:
The maximal bp length, the minimal bp length, the count of features on the sense (+) and antisense (-) strand as well as the count of each available feature type.
Returns: information about the given data frame, namely the length of the longest and shortest feature entry (in bp), the number of features on the sense and antisense strand and the number of entries per feature type
Return type: dictionary
to_csv(output_file=None) → None¶
Create a csv file.
The pandas data frame is saved as a csv file.
Parameters: output_file (str) – Desired name of the output csv file
Returns: csv file with the content of the data frame
Return type: data file in csv format
Requirements and installation¶
Requirements:¶
The Python library gffpandas was developed with Python 3. Users are therefore advised to run gffpandas on Python 3.4 or higher. For an easy installation, pip and setuptools should be installed. gffpandas depends on the Python library pandas, which must be installed in order to use gffpandas.
Installation:¶
gffpandas is hosted on PyPI and can thus be installed with pip3:
$ pip3 install gffpandas
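After the installation, a short check such as the following sketch can confirm that the interpreter is recent enough and that the required packages can be found. It uses only the standard library; `check_environment` is a hypothetical helper and not part of gffpandas.

```python
import importlib.util
import sys


def check_environment(minimum=(3, 4), packages=("pandas", "gffpandas")):
    """Report whether the Python version and the packages are available."""
    report = {"python_ok": sys.version_info >= minimum}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


report = check_environment()
print(report)
```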