Removing degenerate characters

Degenerate IUPAC base symbols represent a site that may hold any one of several possible characters. For example, in DNA the symbol “Y” denotes a pyrimidine, so the site can be either “C” or “T”.
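To make the idea concrete, the standard IUPAC nucleotide ambiguity codes can be written as a lookup table. This is a plain-Python sketch for illustration only; the names IUPAC_DNA and expand are hypothetical and are not part of any library API.

```python
# Standard IUPAC DNA ambiguity codes mapped to the bases they can stand for.
IUPAC_DNA = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(symbol):
    """Return the set of canonical bases a symbol can represent.

    Canonical bases (A, C, G, T) simply map to themselves.
    """
    symbol = symbol.upper()
    return set(IUPAC_DNA.get(symbol, symbol))


pyrimidines = expand("Y")  # the two bases "Y" can stand for
```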

Note

In many molecular evolutionary and phylogenetic analyses, the gap character “-” is treated as “N”, meaning any base.

Let’s create sample data with degenerate characters
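As a stand-in for the sample data, the sketch below uses a plain Python dict rather than a library alignment object. The sequence names and contents are hypothetical: one sequence contains a gap and the other contains the degenerate symbol “Y”.

```python
# Hypothetical sample data: two aligned DNA sequences of equal length.
# s1 contains a gap ("-"); s2 contains the degenerate symbol "Y".
seqs = {
    "s1": "ACGA-GACG",
    "s2": "GATGATGYT",
}

# In an alignment, every sequence must have the same length.
lengths = {len(s) for s in seqs.values()}
assert len(lengths) == 1
```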

Omit aligned columns containing a degenerate character
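The effect of omitting degenerate columns can be sketched in plain Python. This is an illustration of the behaviour, not the library's implementation: the function omit_degenerate_columns and the dict-based alignment are assumptions, and gaps are treated as degenerate here.

```python
CANONICAL = set("ACGT")


def omit_degenerate_columns(seqs):
    """Keep only alignment columns in which every character is A, C, G or T."""
    names = list(seqs)
    # zip(*...) walks the alignment column by column.
    columns = zip(*(seqs[name] for name in names))
    kept = [col for col in columns if set(col) <= CANONICAL]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


aln = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
filtered = omit_degenerate_columns(aln)
# The column holding the gap and the column holding "Y" are both dropped,
# so 7 of the 9 columns remain.
```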

Omit all degenerate characters except gaps from an alignment

If we create the omit_degenerates app with the argument gap_is_degen=False, we can omit degenerate characters but retain gaps.
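A minimal plain-Python sketch of this switch follows. The parameter name gap_is_degen is borrowed from the text above; the function itself and the dict-based alignment are hypothetical and only mimic the described behaviour.

```python
CANONICAL = set("ACGT")


def omit_degenerates_sketch(seqs, gap_is_degen=True):
    """Drop columns with degenerate symbols; optionally tolerate gaps."""
    # When gaps are not considered degenerate, "-" joins the allowed characters.
    allowed = CANONICAL if gap_is_degen else CANONICAL | {"-"}
    names = list(seqs)
    columns = zip(*(seqs[name] for name in names))
    kept = [col for col in columns if set(col) <= allowed]
    return {name: "".join(col[i] for col in kept) for i, name in enumerate(names)}


aln = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
with_gaps = omit_degenerates_sketch(aln, gap_is_degen=False)
# Only the column containing "Y" is removed; the gap in s1 survives.
```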

Omit k-mers which contain degenerate characters

If we create omit_degenerates with the argument motif_length, it will split sequences into non-overlapping tuples of the specified length and exclude any tuple that contains a degenerate character.
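The tuple-based filtering can be sketched as below. This is a hedged illustration rather than the library's code: the function name is hypothetical, gaps are treated as degenerate, a tuple is dropped from every sequence if any sequence has a degenerate character in it, and the assumption is made that a trailing incomplete tuple is discarded.

```python
CANONICAL = set("ACGT")


def omit_degenerate_motifs(seqs, motif_length):
    """Drop non-overlapping tuples that contain any non-canonical character."""
    names = list(seqs)
    length = len(seqs[names[0]])
    # Assumption: positions that do not fill a complete tuple are discarded.
    usable = length - length % motif_length
    kept = {name: [] for name in names}
    for start in range(0, usable, motif_length):
        chunk = {name: seqs[name][start:start + motif_length] for name in names}
        # Keep the tuple only if it is fully canonical in every sequence.
        if all(set(part) <= CANONICAL for part in chunk.values()):
            for name in names:
                kept[name].append(chunk[name])
    return {name: "".join(parts) for name, parts in kept.items()}


aln = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
dinucs = omit_degenerate_motifs(aln, motif_length=2)
# Tuples at positions 4-5 (gap in s1) and 6-7 ("Y" in s2) are excluded.
```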