4. gffpandas API

4.1. gffpandas.gffpandas module

class gffpandas.gffpandas.Gff3DataFrame(input_gff_file=None, input_df=None, input_header=None)[source]

Bases: object

This class holds the header information in the header attribute and the actual annotation data as a pandas dataframe in the df attribute.

attributes_to_columns() → pandas.core.frame.DataFrame[source]

Save each attribute tag to its own column.

The attribute column is split by tag into separate columns. This method returns a plain pandas DataFrame, not a Gff3DataFrame, so the result cannot be saved as a gff3 file.

Returns:pandas dataframe in which the attribute column of the gff3 file is split into the different attribute tags
Return type:pandas DataFrame
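
The splitting described above can be illustrated without pandas. The following is a minimal sketch of the idea only, not the gffpandas implementation: rows are plain dictionaries, the function is a stand-in written for this example, and keeping the original attributes entry next to the new tag columns is an assumption.

```python
def attributes_to_columns(rows):
    """Expand each row's 'attributes' string into one key per tag."""
    parsed = []
    tags = []  # tags in order of first appearance
    for row in rows:
        pairs = {}
        for field in row["attributes"].split(";"):
            if not field:
                continue  # tolerate a trailing ';'
            tag, _, value = field.partition("=")
            pairs[tag] = value
            if tag not in tags:
                tags.append(tag)
        parsed.append(pairs)
    # Every output row gets every tag; tags a row lacks become None.
    return [
        {**row, **{tag: pairs.get(tag) for tag in tags}}
        for row, pairs in zip(rows, parsed)
    ]


rows = [
    {"type": "gene", "attributes": "ID=gene1;Name=dnaA"},
    {"type": "CDS", "attributes": "ID=cds1;Parent=gene1"},
]
expanded = attributes_to_columns(rows)
print(expanded[0]["Name"], expanded[1]["Name"])  # dnaA None
```

Because rows that lack a tag still receive a column for it, the result is a rectangular table, which is why it no longer maps back onto the nine gff3 columns.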
filter_by_length(min_length=None, max_length=None) → gffpandas.gffpandas.Gff3DataFrame[source]

Filtering the pandas dataframe by the gene_length.

For this method the desired minimal and maximal bp length have to be given.

Parameters:
  • min_length (int) – minimal bp length of the feature
  • max_length (int) – maximal bp length of the feature
Returns:

original header and dataframe with the features whose lengths fit the set parameters, saved as object of the class Gff3DataFrame

Return type:

class ‘gffpandas.gffpandas.Gff3DataFrame’
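
As a rough illustration of the length filter, the sketch below works on plain dictionaries instead of a Gff3DataFrame. It assumes that the length is end - start + 1 (the usual convention for 1-based, inclusive gff3 coordinates) and that both bounds are inclusive; the library's own definition of gene_length may differ.

```python
def filter_by_length(rows, min_length=None, max_length=None):
    """Keep rows whose feature length lies within the given bounds."""
    kept = []
    for row in rows:
        # Assumption: inclusive coordinates, so a feature 1..100 is 100 bp.
        length = row["end"] - row["start"] + 1
        if min_length is not None and length < min_length:
            continue
        if max_length is not None and length > max_length:
            continue
        kept.append(row)
    return kept


features = [
    {"type": "gene", "start": 1, "end": 100},      # 100 bp
    {"type": "tRNA", "start": 200, "end": 275},    # 76 bp
    {"type": "gene", "start": 300, "end": 2299},   # 2000 bp
]
selected = filter_by_length(features, min_length=80, max_length=1000)
print([f["start"] for f in selected])  # [1]
```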

filter_feature_of_type(feature_type_list) → gffpandas.gffpandas.Gff3DataFrame[source]

Filtering the pandas dataframe by feature_type.

For this method a list of feature type(s) has to be given, e.g. ['CDS', 'ncRNA'].

Parameters:feature_type_list (list) – List of name(s) of the desired feature(s)
Returns:original header and dataframe of the selected features saved as object of the class Gff3DataFrame
Return type:class ‘gffpandas.gffpandas.Gff3DataFrame’
find_duplicated_entries(seq_id=None, type=None) → gffpandas.gffpandas.Gff3DataFrame[source]

Find entries which are redundant.

For this method the chromosome accession number (seq_id) as well as the feature type have to be given. All entries that are redundant with respect to start and end position as well as strand are then returned.

Parameters:
  • seq_id (str) – corresponding accession number
  • type (str) – feature type
Returns:

original header and dataframe containing the duplicated entries, both saved as object of the class Gff3DataFrame

Return type:

class ‘gffpandas.gffpandas.Gff3DataFrame’
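
The redundancy criterion (same start, end and strand within one seq_id and feature type) can be sketched as follows. This is an illustration on plain dictionaries, not the library code; in particular, reporting only the later repeats and not the first occurrence is an assumption of this sketch.

```python
def find_duplicated_entries(rows, seq_id, feature_type):
    """Return rows that repeat the (start, end, strand) of an earlier row."""
    seen = set()
    duplicates = []
    for row in rows:
        if row["seq_id"] != seq_id or row["type"] != feature_type:
            continue
        key = (row["start"], row["end"], row["strand"])
        if key in seen:
            duplicates.append(row)  # a later repeat of a known position
        else:
            seen.add(key)
    return duplicates


rows = [
    {"seq_id": "chr1", "type": "gene", "start": 1, "end": 100, "strand": "+"},
    {"seq_id": "chr1", "type": "gene", "start": 1, "end": 100, "strand": "+"},
    {"seq_id": "chr1", "type": "gene", "start": 1, "end": 100, "strand": "-"},
    {"seq_id": "chr1", "type": "CDS", "start": 1, "end": 100, "strand": "+"},
]
duplicates = find_duplicated_entries(rows, seq_id="chr1", feature_type="gene")
print(len(duplicates))  # 1
```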

get_feature_by_attribute(attr_tag, attr_value_list) → gffpandas.gffpandas.Gff3DataFrame[source]

Filtering the pandas dataframe by an attribute.

The 9th column of a gff3 file contains the list of feature attributes in a tag=value format. For this method the desired attribute tag as well as the corresponding value(s) have to be given. If no entry has a matching value, an empty dataframe is returned.

Parameters:
  • attr_tag (str) – Name of attribute tag, by which the df will be filtered
  • attr_value_list (list) – List of one or more values that have to be associated with the attribute tag. An entry is selected if it contains one of these values for the corresponding tag
Returns:

original header and dataframe with the entries, which contain the desired attribute values, both saved as object of the class Gff3DataFrame

Return type:

class ‘gffpandas.gffpandas.Gff3DataFrame’
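
A minimal sketch of attribute-based selection is shown below, using plain dictionaries. It assumes exact matching of the value against the given list; how gffpandas itself matches values (exact or substring) is not stated above, so treat that detail as an assumption.

```python
def get_feature_by_attribute(rows, attr_tag, attr_value_list):
    """Keep rows whose attribute `attr_tag` has one of the given values."""
    wanted = set(attr_value_list)
    selected = []
    for row in rows:
        # Parse 'tag=value;tag=value' into a dictionary.
        pairs = dict(
            field.split("=", 1)
            for field in row["attributes"].split(";")
            if "=" in field
        )
        if pairs.get(attr_tag) in wanted:
            selected.append(row)
    return selected


rows = [
    {"type": "gene", "attributes": "ID=gene1;gbkey=Gene"},
    {"type": "CDS", "attributes": "ID=cds1;gbkey=CDS"},
    {"type": "tRNA", "attributes": "ID=rna1;gbkey=tRNA"},
]
hits = get_feature_by_attribute(rows, "gbkey", ["CDS", "tRNA"])
print([row["type"] for row in hits])  # ['CDS', 'tRNA']
```

If no row carries a matching value, the result is simply an empty list, mirroring the empty dataframe mentioned above.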

overlaps_with(seq_id=None, start=None, end=None, type=None, strand=None, complement=False) → gffpandas.gffpandas.Gff3DataFrame[source]

Find the entries that overlap with a given feature.

For this method the chromosome accession number has to be given, together with the start and end bp positions of the feature to compare against. Optionally, the feature type and the strand, sense (+) or antisense (-), can be given as well.

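Possible overlaps (see code):
--------=================------------------
--------------=====================--------

-------===================-----------------
-------===================-----------------

------========================-------------
------============-------------------------

---------=====================-------------
------------------============-------------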

With complement=True, all features that do not overlap with the feature to compare against are returned instead.

Parameters:
  • seq_id (str) – accession number of the feature
  • start (int) – start position of the feature
  • end (int) – end position of the feature
  • type (str) – type of the feature
  • strand (str) – minus (-) for antisense and plus (+) for sense strand
  • complement (bool) – if True, return the entries that do not overlap instead of those that do
Returns:

original header and dataframe, containing the entries which overlap or do not overlap (complement=True) with the given parameters, both saved as object of the class Gff3DataFrame

Return type:

class ‘gffpandas.gffpandas.Gff3DataFrame’
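
All four overlap cases shown in the diagram reduce to a single interval test. The sketch below illustrates that test on plain dictionaries; it is not the library's code. Treating touching boundaries as an overlap, and applying complement only within the entries that match seq_id, type and strand, are assumptions of this sketch.

```python
def overlaps_with(rows, seq_id, start, end, feature_type=None,
                  strand=None, complement=False):
    """Return rows overlapping [start, end], or the non-overlapping ones."""
    candidates = [
        row for row in rows
        if row["seq_id"] == seq_id
        and (feature_type is None or row["type"] == feature_type)
        and (strand is None or row["strand"] == strand)
    ]

    def overlaps(row):
        # Two inclusive intervals overlap if each starts before the other ends.
        return row["start"] <= end and row["end"] >= start

    # With complement=True the selection is inverted.
    return [row for row in candidates if overlaps(row) != complement]


rows = [
    {"seq_id": "chr1", "type": "gene", "start": 10, "end": 50, "strand": "+"},
    {"seq_id": "chr1", "type": "gene", "start": 40, "end": 90, "strand": "-"},
    {"seq_id": "chr1", "type": "gene", "start": 100, "end": 150, "strand": "+"},
]
inside = overlaps_with(rows, "chr1", start=45, end=60)
outside = overlaps_with(rows, "chr1", start=45, end=60, complement=True)
print([r["start"] for r in inside], [r["start"] for r in outside])  # [10, 40] [100]
```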

stats_dic() → dict[source]

Gives the following statistics for the data:

The maximal bp-length, minimal bp-length, the count of sense (+) and antisense (-) strands as well as the count of each available feature.

Returns:information about the given dataframe: the length of the longest and of the shortest feature entry (in bp), the number of features on the sense and on the antisense strand, and the count of each feature type.
Return type:dictionary
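
The statistics listed above could be computed along the following lines. The dictionary keys used here are hypothetical names chosen for this sketch, not the keys gffpandas returns, and the length is again assumed to be end - start + 1.

```python
from collections import Counter


def stats_dic(rows):
    """Summarise feature lengths, strand usage and feature types."""
    lengths = [row["end"] - row["start"] + 1 for row in rows]
    return {
        "max_length": max(lengths),      # hypothetical key names
        "min_length": min(lengths),
        "strands": dict(Counter(row["strand"] for row in rows)),
        "feature_types": dict(Counter(row["type"] for row in rows)),
    }


rows = [
    {"type": "gene", "start": 1, "end": 100, "strand": "+"},
    {"type": "CDS", "start": 1, "end": 100, "strand": "+"},
    {"type": "gene", "start": 200, "end": 275, "strand": "-"},
]
stats = stats_dic(rows)
print(stats["max_length"], stats["min_length"])  # 100 76
```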
to_csv(output_file=None) → None[source]

Create a csv file.

The pandas data frame is saved as a csv file.

Parameters:output_file (str) – Desired name of the output csv file
Returns:csv file with the content of the dataframe
Return type:data file in csv format
to_gff3(gff_file) → None[source]

Create a gff3 file.

The pandas dataframe is saved as a gff3 file.

Parameters:gff_file (str) – Desired name of the output gff file
Returns:gff3 file with the content of the dataframe
Return type:data file in gff3 format
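
Conceptually, writing a gff3 file means emitting the stored header followed by one tab-separated line per entry. The sketch below shows this with a hypothetical helper, write_gff3, and an in-memory buffer instead of a file; the column names source, score and phase follow the GFF3 column order and are assumptions here, as is the header ending with a newline.

```python
import io

GFF3_COLUMNS = ["seq_id", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def write_gff3(header, rows, handle):
    """Write the header, then one tab-separated line per row."""
    handle.write(header)  # assumption: header already ends with a newline
    for row in rows:
        handle.write("\t".join(str(row[col]) for col in GFF3_COLUMNS) + "\n")


header = "##gff-version 3\n"
rows = [{
    "seq_id": "chr1", "source": "example", "type": "gene",
    "start": 1, "end": 100, "score": ".", "strand": "+",
    "phase": ".", "attributes": "ID=gene1",
}]
buffer = io.StringIO()
write_gff3(header, rows, buffer)
output = buffer.getvalue()
print(output, end="")
```

Passing an open file object instead of the buffer would write the same content to disk.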
to_tsv(output_file=None) → None[source]

Create a tsv file.

The pandas data frame is saved as a tsv file.

Parameters:output_file (str) – Desired name of the output tsv file
Returns:tsv file with the content of the dataframe
Return type:data file in tsv format
gffpandas.gffpandas.read_gff3(input_file)[source]
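
Reading a gff3 file amounts to separating the header lines (those starting with #) from the nine-column, tab-separated annotation lines. The following is a minimal sketch of that split using only the standard library. The helper name parse_gff3_text is hypothetical and operates on a string, not a file path, and it returns plain Python objects, not a Gff3DataFrame.

```python
GFF3_COLUMNS = ["seq_id", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_text(text):
    """Split gff3 text into a header string and a list of row dicts."""
    header_lines, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            header_lines.append(line)
            continue
        row = dict(zip(GFF3_COLUMNS, line.split("\t")))
        # Positions are integers; the remaining columns stay strings.
        row["start"], row["end"] = int(row["start"]), int(row["end"])
        rows.append(row)
    return "\n".join(header_lines) + "\n", rows


text = (
    "##gff-version 3\n"
    "chr1\texample\tgene\t1\t100\t.\t+\t.\tID=gene1\n"
    "chr1\texample\tCDS\t1\t100\t.\t+\t0\tID=cds1;Parent=gene1\n"
)
header, rows = parse_gff3_text(text)
print(len(rows), rows[1]["type"], rows[1]["end"])  # 2 CDS 100
```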