API documentation

Examples

You can follow the interactive API example documentation in a Juypyter notebook using mybinder:

https://mybinder.org/badge_logo.svg

Working with motifs

The Motif class stores motif information. There are several ways to create a Motif instance.

from gimmemotifs.motif import Motif,read_motifs

# Read from file
motifs = read_motifs("example.pfm")

for motif in motifs:
    print(motif)
AP1_nTGAGTCAy
CTCF_CCAsyAGrkGGCr
# Create from scratch
m = Motif([[0,1,0,0],[0,0,1,0]])
m.id = "CpG"
print(m)
CpG_CG
# Or from a consensus sequence
from gimmemotifs.motif import motif_from_consensus
ap1 = motif_from_consensus("TGASTCA")
print(ap1.to_ppm())
>TGASTCA
0.0001  0.0001  0.0001  0.9998
0.0001  0.0001  0.9998  0.0001
0.9998  0.0001  0.0001  0.0001
0.0001  0.4999  0.4999  0.0001
0.0001  0.0001  0.0001  0.9998
0.0001  0.9998  0.0001  0.0001
0.9998  0.0001  0.0001  0.0001

Read motifs from files in other formats.

motifs = read_motifs("MA0099.3.jaspar", fmt="jaspar")
print(motifs[0])

You can convert a motif to several formats.

motifs = read_motifs("example.pfm")

# pwm
print(motifs[0].to_ppm())
>AP1
0.5558      0.1469  0.2734  0.0240
0.0020      0.0015  0.0017  0.9948
0.0039      0.0019  0.9502  0.0439
0.9697      0.0220  0.0018  0.0065
0.0377      0.3311  0.6030  0.0283
0.0033      0.0031  0.0043  0.9893
0.0188      0.9775  0.0023  0.0014
0.9951      0.0021  0.0012  0.0015
0.0121      0.3096  0.1221  0.5561
# pfm
print(motifs[0].to_pfm())
>AP1
555.8       146.9   273.4   24.0
2.0 1.5     1.7     994.8000000000001
3.9 1.9     950.2   43.9
969.7       22.0    1.8     6.5
37.699999999999996  331.1   603.0   28.299999999999997
3.3 3.1     4.3     989.3
18.8        977.5   2.3     1.4
995.1       2.1     1.2     1.5
12.1        309.59999999999997      122.1   556.1
# consensus sequence
print(motifs[0].to_consensus())
ATGAsTCAy
# TRANSFAC
print(motifs[0].to_transfac())
DE  AP1     unknown
0   555     146     273     24      A
1   2       1       1       994     T
2   3       1       950     43      G
3   969     22      1       6       A
4   37      331     603     28      s
5   3       3       4       989     T
6   18      977     2       1       C
7   995     2       1       1       A
8   12      309     122     556     y
XX
# MEME
print(motifs[0].to_meme())
MOTIF AP1
BL   MOTIF AP1 width=0 seqs=0
letter-probability matrix: alength= 4 w= 9 nsites= 1000.1 E= 0
0.5558      0.1469  0.2734  0.024
0.002       0.0015  0.0017  0.9948
0.0039      0.0019  0.9502  0.0439
0.9697      0.022   0.0018  0.0065
0.0377      0.3311  0.603   0.0283
0.0033      0.0031  0.0043  0.9893
0.0188      0.9775  0.0023  0.0014
0.9951      0.0021  0.0012  0.0015
0.0121      0.3096  0.1221  0.5561

Some other useful tidbits.

m = motif_from_consensus("NTGASTCAN")
print(len(m))
9
# Trim by information content
m.trim(0.5)
print(m.to_consensus(), len(m))
TGAsTCA 7
# Slices
print(m[:3].to_consensus())
TGA
# Shuffle
random_motif = motif_from_consensus("NTGASTGAN").randomize()
print(random_motif)
random_snCTAGTAn

To convert a motif to an image, use plot_logo(). Many different file formats are supported, such as png, svg and pdf.

m = motif_from_consensus("NTGASTCAN")
m.plot_logo(fname="ap1.png")
_images/ap1.png

Motif scanning

For very simple scanning, you can just use a Motif instance. Let’s say we have a FASTA file called test.fa that looks like this:

>seq1
AAAAAAAAAAAAAAAAAAAAAA
>seq2
CGCGCGTGAGTCACGCGCGCGCG

TGASTCAAAAAAAAAATGASTCA

Now we can use this file for scanning.

from gimmemotifs.motif import motif_from_consensus
from gimmemotifs.fasta import Fasta

f = Fasta("test.fa")
m = motif_from_consensus("TGAsTCA")

m.scan(f)
{'seq1': [], 'seq2': [6, 6], 'seq3': [0, 16, 0, 16]}

This return a dictionary with the sequence names as keys. The value is a list with positions where the motif matches. Here, as the AP1 motif is a palindrome, you see matches on both forward and reverse strand. This is more clear when we use scan_all() that returns position, score and strand for every match.

m.scan_all(f)
{'seq1': [],
 'seq2': [(6, 9.02922042678255, 1), (6, 9.02922042678255, -1)],
 'seq3': [(0, 8.331251500673487, 1),
 (16, 8.331251500673487, 1),
 (0, 8.331251500673487, -1),
 (16, 8.331251500673487, -1)]}

The number of matches to return is set to 50 by default, you can control this by setting the nreport argument. Use scan_rc=False to only scan the forward orientation.

m.scan_all(f, nreport=1, scan_rc=False)
{'seq1': [],
 'seq2': [(6, 9.02922042678255, 1)],
 'seq3': [(0, 8.331251500673487, 1)]}

While this functionality works, it is not very efficient. To scan many motifs in potentially many sequences, use the functionality in the scanner module. If you only want the best match per sequence, is a utility function called scan_to_best_match, otherwise, use the Scanner class.

from gimmemotifs.motif import motif_from_consensus
from gimmemotifs.scanner import scan_to_best_match

m1 = motif_from_consensus("TGAsTCA")
m1.id = "AP1"
m2 = motif_from_consensus("CGCG")
m2.id = "CG"
motifs = [m1, m2]

print("motif\tpos\tscore")
result = scan_to_best_match("test.fa", motifs)
for motif, matches in result.items():
    for match in matches:
        print(f"{motif}\t{match[1]}\t{match[0]}")
motif       pos     score
CG  0       -18.26379789133924
CG  0       5.554366880674296
CG  0       -7.743307225501047
AP1 0       -20.052563923836903
AP1 6       9.029486018303187
AP1 0       8.331550321011443

The matches are in the same order as the sequences in the original file.

While this function can be very useful, a Scanner instance is much more flexible. You can scan different input formats (BED, FASTA, regions), and control the thresholds and output.

As an example we will use the file Gm12878.CTCF.top500.w200.fa that contains 500 top CTCF peaks. We will get the CTCF motif and scan this file in a number of different ways.

from gimmemotifs.motif import read_motifs
from gimmemotifs.scanner import Scanner
from gimmemotifs.fasta import Fasta
import numpy as np

# Input file
fname = "Gm12878.CTCF.top500.w200.fa"

# Select the CTCF motif from the default motif database
motifs = [m for m in read_motifs() if "CTCF" in m.factors['direct']][:1]

# Initialize the scanner
s = Scanner()
s.set_motifs(motifs)

Now let’s get the best score for the CTCF motif for each sequence.

scores = [r[0] for r in s.best_score("examples/Gm12878.CTCF.top500.w200.fa")]
print("{}\t{:.2f}\t{:.2f}\t{:.2f}".format(
    len(scores),
    np.mean(scores),
    np.min(scores),
    np.max(scores)
    ))
500 11.00   1.45    15.07

In many cases you’ll want to set a threshold. In this example we’ll use a 1% FPR threshold, based on scanning randomly selected sequences from the ghg38 genome. The first time you run this, it will take a while. However, the tresholds will be cached. This means that for the same combination of motifs and genome, the previously generated threshold will be used.

# Set a 1% FPR threshold based on random hg38 sequence
s.set_genome("hg38")
s.set_threshold(fpr=0.01)

# get the number of sequences with at least one match
counts = [n[0] for n in s.count("Gm12878.CTCF.top500.w200.fa", nreport=1)]
print(counts[:10])
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# or the grand total of number of sequences with 1 match
print(s.total_count("examples/Gm12878.CTCF.top500.w200.fa", nreport=1))
[404]
# Scanner.scan() just gives all information
seqs = Fasta("Gm12878.CTCF.top500.w200.fa")[:10]
for i,result in enumerate(s.scan(seqs)):
    seqname = seqs.ids[i]
    for m,matches in enumerate(result):
        motif = motifs[m]
        for score, pos, strand in matches:
            print(seqname, motif, score, pos, strand)
chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.4959558370929 143 -1
chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.821440417077262 22 -1
chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.658439190070851 82 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 120 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.72460285196088 83 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.450778540447134 27 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.037330832055455 7 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.998038360035828 159 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.668660161058972 101 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 145 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.848270770440264 185 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.668171128367552 165 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 12.785329839873164 27 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 126 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.25063146496227 67 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 28 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 185 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.261096435278661 67 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.450778540447134 147 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.022594547749485 126 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.194691222675097 7 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 37 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 95 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 153 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.972530270193543 75 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.949273408029276 17 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.949273408029276 133 -1
chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 92 -1
chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.25063146496227 34 -1
chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.246743494108388 15 -1
chr6:53036754-53036954 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.764279993851783 62 1
chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.697109967765122 33 -1
chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.204664711685334 149 -1
chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.222131525579154 92 -1
chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.222131525579154 130 -1
chr14:50328834-50329034 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.148765667117496 133 1
chr1:114889205-114889405 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.752478102244137 124 1

Finding de novo motifs

Let’s take the Gm12878.CTCF.top500.w200.fa file as example again. For a basic example we’ll just use two motif finders, as they’re quick to run.

from gimmemotifs.denovo import gimme_motifs

peaks = "Gm12878.CTCF.top500.w200.fa"
outdir = "CTCF.gimme"
params = {
    "tools": "Homer,BioProspector",
    }

motifs = gimme_motifs(peaks, outdir, params=params)
2017-06-30 07:37:00,079 - INFO - starting full motif analysis
2017-06-30 07:37:00,082 - INFO - preparing input (FASTA)
2017-06-30 07:37:32,949 - INFO - starting motif prediction (medium)
2017-06-30 07:37:32,949 - INFO - tools: BioProspector, Homer
2017-06-30 07:37:40,540 - INFO - BioProspector_width_5 finished, found 5 motifs
2017-06-30 07:37:41,308 - INFO - BioProspector_width_7 finished, found 5 motifs
2017-06-30 07:37:41,609 - INFO - BioProspector_width_6 finished, found 5 motifs
2017-06-30 07:37:42,003 - INFO - BioProspector_width_8 finished, found 5 motifs
2017-06-30 07:37:44,054 - INFO - Homer_width_5 finished, found 5 motifs
2017-06-30 07:37:45,201 - INFO - Homer_width_6 finished, found 5 motifs
2017-06-30 07:37:48,246 - INFO - Homer_width_7 finished, found 5 motifs
2017-06-30 07:37:50,503 - INFO - Homer_width_8 finished, found 5 motifs
2017-06-30 07:37:54,649 - INFO - BioProspector_width_9 finished, found 5 motifs
2017-06-30 07:37:56,169 - INFO - BioProspector_width_10 finished, found 5 motifs
2017-06-30 07:37:56,656 - INFO - Homer_width_9 finished, found 5 motifs
2017-06-30 07:37:59,313 - INFO - Homer_width_10 finished, found 5 motifs
2017-06-30 07:37:59,314 - INFO - all jobs submitted
2017-06-30 07:39:21,298 - INFO - predicted 60 motifs
2017-06-30 07:39:21,326 - INFO - 53 motifs are significant
2017-06-30 07:39:21,410 - INFO - clustering significant motifs.
2017-06-30 07:39:47,031 - INFO - creating reports
2017-06-30 07:40:41,024 - INFO - finished
2017-06-30 07:40:41,024 - INFO - output dir: CTCF.gimme
2017-06-30 07:40:41,024 - INFO - report: CTCF.gimme/motif_report.html

This will basically run the same pipeline as the gimme motifs command. All output files will be stored in outdir and gimme_motifs returns a list of Motif instances. If you only need the motifs but not the graphical report, you can decide to skip it by setting create_report to False. Additionally, you can choose to skip clustering (cluster=False) or to skip calculation of significance (filter_significant=False). For instance, the following command will only predict motifs and cluster them.

motifs = gimme_motifs(peaks, outdir,
    params=params, filter_significant=False, create_report=False)

All parameters for motif finding are set by the params argument.

Although the gimme_motifs() function is probably the easiest way to run the de novo finding tools, you can also run any of the tools directly. In this case you would also have to supply the background file if the specific tool requires it.

from gimmemotifs.tools import get_tool
from gimmemotifs.background import MatchedGcFasta

m = get_tool("homer")  # tool name is case-insensitive

# Create a background fasta file with a similar GC%
fa = MatchedGcFasta("TAp73alpha.fa", number=1000)
fa.writefasta("bg.fa")

# Run motif prediction
params = {
    "background": "bg.fa",
    "width": "20",
    "number": 5,
}

motifs, stdout, stderr = m.run("TAp73alpha.fa", params=params)
print(motifs[0].to_consensus())
nnnCnTGynnnGrCwTGyyn

Motif statistics

With some motifs, a sample file and a background file you can calculate motif statistics. Let’s say I wanted to know which of the p53-family motifs is most enriched in the file TAp73alpha.fa.

First, we’ll generate a GC%-matched genomic background. Then we only select p53 motifs.

from gimmemotifs.background import MatchedGcFasta
from gimmemotifs.fasta import Fasta
from gimmemotifs.stats import calc_stats
from gimmemotifs.motif import default_motifs

sample = "TAp73alpha.fa"
bg = MatchedGcFasta(sample, genome="hg19", number=1000)

motifs = [m for m in default_motifs() if any(f in m.factors['direct'] for f in ["TP53", "TP63", "TP73"])]

stats = calc_stats(motifs=motifs, fg_file=sample, bg_file=bg, genome="hg19")

print("Stats for", motifs[0])
for k, v in stats[str(motifs[0])].items():
    print(k,v)

print()

best_motif = sorted(motifs, key=lambda x: stats[str(x)]["recall_at_fdr"])[-1]
print("Best motif (recall at 10% FDR):", best_motif)
Stats for GM.5.0.p53.0001_rCATGyCCnGrCATGy
recall_at_fdr 0.833
fraction_fpr 0.416
score_at_fpr 9.05025905735
enr_at_fpr 41.6
max_enrichment 55.5
phyper_at_fpr 3.33220067463e-132
mncp 1.85474606318
roc_auc 0.9211925
roc_auc_xlim 0.0680115
pr_auc 0.927368602993
max_fmeasure 0.867519181586
ks_pvalue 0.0
ks_significance inf

Best motif (recall at 10% FDR): GM.5.0.p53.0001_rCATGyCCnGrCATGy

A lot of statistics are generated and you will not always need all of them. You can choose one or more specific metrics with the additional stats argument.

metrics = ["roc_auc", "recall_at_fdr"]
stats = calc_stats(motifs=motifs, fg_file=sample, bg_file=bg, stats=metrics, genome="hg19")

for metric in metrics:
    for motif in motifs:
        print("{}\t{}\t{:.2f}".format(
            motif.id, metric, stats[str(motif)][metric]
            ))
p53_M5923_1.01        roc_auc 0.63
p53_M5922_1.01 roc_auc 0.64
p53_Average_10  roc_auc 0.83
p53_Average_8   roc_auc 0.93
p53_M3568_1.01  roc_auc 0.83
p53_M5923_1.01  recall_at_fdr   0.00
p53_M5922_1.01  recall_at_fdr   0.00
p53_Average_10  recall_at_fdr   0.24
p53_Average_8   recall_at_fdr   0.83
p53_M3568_1.01  recall_at_fdr   0.42

Motif comparison

Compare two motifs.

from gimmemotifs.comparison import MotifComparer
from gimmemotifs.motif import motif_from_consensus
from gimmemotifs.motif import read_motifs
m1 = motif_from_consensus("RRRCATGYYY")
m2 = motif_from_consensus("TCRTGT")

mc = MotifComparer()
score, pos, orient = mc.compare_motifs(m1, m2)

if orient == -1:
    m2 = m2.rc()
pad1, pad2 = "", ""
if pos < 0:
    pad1 = " " * -pos
elif pos > 0:
    pad2 =" " * pos
print(pad1 + m1.to_consensus())
print(pad2 + m2.to_consensus())
rrrCATGyyy
  ACAyGA

Find closest match in a motif database.

motifs = [
    motif_from_consensus("GATA"),
    motif_from_consensus("NTATAWA"),
    motif_from_consensus("ACGCG"),
]

mc = MotifComparer()
results = mc.get_closest_match(motifs, dbmotifs=read_motifs("HOMER"), metric="seqcor")

# Load motifs
db = read_motifs("HOMER", as_dict=True)

for motif in motifs:
    match, scores = results[motif.id]
    print("{}: {} - {:.3f}".format(motif.id, match, scores[0]))
    dbmotif = db[match]
    orient = scores[2]
    if orient == -1:
        dbmotif = dbmotif.rc()
    padm, padd = 0, 0
    if scores[1] < 0:
        padm = -scores[1]
    elif scores[1] > 0:
        padd = scores[1]
    print(" " * padm + motif.to_consensus())
    print(" " * padd + dbmotif.to_consensus())
    print()
GATA: AGATAASR_GATA3(Zf)/iTreg-Gata3-ChIP-Seq(GSE20898)/Homer - 0.823
 GATA
AGATAAnr

NTATAWA: NGYCATAAAWCH_CDX4(Homeobox)/ZebrafishEmbryos-Cdx4.Myc-ChIP-Seq(GSE48254)/Homer - 0.747
  nTATAwA
nnynrTAAAnnn

ACGCG: NCCACGTG_c-Myc(bHLH)/LNCAP-cMyc-ChIP-Seq(Unpublished)/Homer - 0.744
 ACGCG
CACGTGGn

Auto-generated

This part of the API documentation is not yet complete.

The Motif class

Motif core functions

class gimmemotifs.motif.Motif(pfm=None, ppm=None, places=4)

Representation of a transcription factor binding motif.

Examples

>>> motif = Motif([[0,1,0,0], [0.5,0,0,0.5], [0,0,1,0]])
>>> print(motif.to_ppm())
>
0   1   0   0
0.5 0   0   0.5
0   0   1   0
>>> print(motif.to_consensus())
CwG
average_motifs(other, pos, orientation, include_bg=False)

Return the average of two motifs.

Combine this motif with another motif and return the average as a new Motif object. The position and orientatien need to be supplied. The pos parameter is the position of the second motif relative to this motif.

For example, take the following two motifs: Motif 1: CATGYT Motif 2: GGCTTGY

With position -2, the motifs are averaged as follows: xxCATGYT GGCTTGYx

Parameters:
  • other (Motif object) – Other Motif object.

  • pos (int) – Position of the second motif relative to this motif.

  • orientation (int) – Orientation, should be 1 or -1. If the orientation is -1 then the reverse complement of the other motif is used for averaging.

  • include_bg (bool , optional) – Extend both motifs with background frequencies (0.25) before averaging. False by default.

Returns:

motif – New Motif object containing average motif.

Return type:

motif object

property consensus

Motif converted to consensus sequence.

Returns:

Consensus sequence.

Return type:

str

consensus_scan(fa)

Scan FASTA with the motif as a consensus sequence.

Parameters:

fa (Fasta object) – Fasta object to scan

Returns:

matches – Dictionaru with matches.

Return type:

dict

property hash

Return hash of motif.

This is an unique identifier of a motif, regardless of the id.

Returns:

Hash of motif.

Return type:

str

ic_pos(row1, row2=None)

Calculate the information content of one position.

Returns:

score – Information content.

Return type:

float

property information_content

Return the total information content of the motif.

Returns:

Motif information content.

Return type:

float

property max_score

Return the maximum logodds score.

Returns:

score – Maximum logodds score.

Return type:

float

property min_score

Return the minimum logodds score.

Returns:

score – Minimum logodds score.

Return type:

float

pfm_to_ppm(pfm, pseudo=0.001)

Convert PFM with counts to a PFM with fractions (PPM).

Parameters:
  • pfm (array_like) – 2-dimensional array_like with counts.

  • pseudo (float) – Pseudocount used in conversion.

Returns:

2-dimensional array with probability count matrix

Return type:

array_like

Plot motif logo.

This is an implementation of the logo presented here: http://www.ensembl.info/2018/10/15/new-ensembl-motif-features/

Parameters:
  • fname (str, optional) – If fname is set, the plot will be saved with fname as filename.

  • ic (bool, optional) – Use the bit score. If this is set to False, the frequency will be used.

  • title (bool, optional) – Plot the motif id as the title.

  • letters (bool, optional) – Plot the nucleotides in the bars.

  • height (float, optional) – Height of the plot.

Plot motif logo

Parameters:
  • fname (str, optional) – If fname is set, the plot will be saved with fname as filename.

  • kind (str, optional) – Type of logo to plot, can be ‘information’, ‘frequency’, ‘energy’ or ‘ensembl’.

  • title (bool, optional) – Plot the motif id as the title.

  • ylabel (bool, optional) – Plot the Y axis label.

  • add_left (int, optional) – Add non-informative positions to the left (to align logo)

pwm_max_score()

Return the maximum PWM score.

DEPRECATED: use max_score instead.

Returns:

score – Maximum PWM score.

Return type:

float

pwm_min_score()

Return the minimum PWM score.

DEPRECATED: use min_score instead.

Returns:

score – Minimum PWM score.

Return type:

float

pwm_scan_score(fa, cutoff=0, nreport=1, scan_rc=True)

Scan sequences with this motif.

Scan sequences from a FASTA object with this motif. Less efficient than using a Scanner object. By setting the cutoff to 0.0 and nreport to 1, the best match for every sequence will be returned. Only the score of the matches is returned.

Parameters:
  • fa (Fasta object) – Fasta object to scan.

  • cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif lengh.

  • nreport (int , optional) – Maximum number of matches to report.

  • scan_rc (bool , optional) – Scan the reverse complement. True by default.

Returns:

matches – Dictionary with motif matches. Only the score of the matches is returned.

Return type:

dict

pwm_scan_to_gff(fa, gfffile, cutoff=0.9, nreport=50, scan_rc=True, append=False)

Scan sequences with this motif and save to a GFF file.

Scan sequences from a FASTA object with this motif. Less efficient than using a Scanner object. By setting the cutoff to 0.0 and nreport to 1, the best match for every sequence will be returned. The output is save to a file in GFF format.

Parameters:
  • fa (Fasta object) – Fasta object to scan.

  • gfffile (str) – Filename of GFF output file.

  • cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif lengh.

  • nreport (int , optional) – Maximum number of matches to report.

  • scan_rc (bool , optional) – Scan the reverse complement. True by default.

  • append (bool , optional) – Append to GFF file instead of overwriting it. False by default.

randomize()

Create a new motif with shuffled positions.

Shuffle the positions of this motif and return a new Motif instance.

Returns:

m – Motif instance with shuffled positions.

Return type:

Motif instance

rc()

Return the reverse complemented motif.

Returns:

New Motif instance with the reverse complement of the input motif.

Return type:

Motif instance

sample(n_seqs, rng=None)

Sample n_seqs random sequences from a motif. The sequences follow the distribution of the motif ppm.

Parameters:
  • n_seqs (int) – number of sequences to sample

  • rng (np.random.Generator) – random number generator, optional

Returns:

sequences – A list of all the samples sequences

Return type:

List[str]

scan(fa, cutoff=0.9, nreport=50, scan_rc=True)

Scan sequences with this motif.

Scan sequences from a FASTA object with this motif. Less efficient than using a Scanner object. By setting the cutoff to 0.0 and nreport to 1, the best match for every sequence will be returned. Only the position of the matches is returned.

Parameters:
  • fa (Fasta object) – Fasta object to scan.

  • cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif lengh.

  • nreport (int , optional) – Maximum number of matches to report.

  • scan_rc (bool , optional) – Scan the reverse complement. True by default.

Returns:

matches – Dictionary with motif matches. Only the position of the matches is returned.

Return type:

dict

scan_all(fa, cutoff=0.9, nreport=50, scan_rc=True)

Scan sequences with this motif.

Scan sequences from a FASTA object with this motif. Less efficient than using a Scanner object. By setting the cutoff to 0.0 and nreport to 1, the best match for every sequence will be returned. The score, position and strand for every match is returned.

Parameters:
  • fa (Fasta object) – Fasta object to scan.

  • cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif lengh.

  • nreport (int , optional) – Maximum number of matches to report.

  • scan_rc (bool , optional) – Scan the reverse complement. True by default.

Returns:

matches – Dictionary with motif matches. The score, position and strand for every match is returned.

Return type:

dict

score_kmer(kmer)

Calculate the log-odds score for a specific k-mer.

Note: this is not necessarily the fastest way for scanning.

Parameters:

kmer (str) – String representing a kmer. Should be the same length as the motif.

Returns:

score – Log-odd score.

Return type:

float

shuffle()

Create a new motif with shuffled positions.

Shuffle the positions of this motif and return a new Motif instance.

Returns:

Motif instance with shuffled positions.

Return type:

Motif instance

to_consensus(ppm=None, precision=4)

Convert position probability matrix to consensus sequence.

Parameters:
  • ppm (array_like, optional) – If not supplied, the ppm of the Motif object will be used.

  • precision (int, optional) – Precision used for rounding.

Returns:

Consensus sequence.

Return type:

str

to_meme()

Return motif formatted in MEME format

Returns:

m – String of motif in MEME format.

Return type:

str

to_motevo()

Return motif formatted in MotEvo (TRANSFAC-like) format

Returns:

m – String of motif in MotEvo format.

Return type:

str

to_ppm(precision=4, extra_str='')

Return ppm as string.

Parameters:
  • precision (int, optional, default 4) – Floating-point precision.

  • | (extra_str) – Extra text to include with motif id line.

Returns:

motif_str – Motif formatted in ppm format.

Return type:

str

to_pwm(precision=4, extra_str='')

Return ppm as string.

Parameters:
  • precision (int, optional, default 4) – Floating-point precision.

  • | (extra_str) – Extra text to include with motif id line.

Returns:

motif_str – Motif formatted in ppm format.

Return type:

str

to_transfac()

Return motif formatted in TRANSFAC format

Returns:

m – String of motif in TRANSFAC format.

Return type:

str

trim(edge_ic_cutoff=0.4)

Trim positions with an information content lower than the threshold.

The default threshold is set to 0.4. The Motif will be changed in-place.

Parameters:

edge_ic_cutoff (float, optional) – Information content threshold. All motif positions at the flanks with an information content lower thab this will be removed.

Returns:

m

Return type:

Motif instance

gimmemotifs.motif.gimme_motifs(inputfile, outdir, params=None, filter_significant=True, cluster=True, create_report=True)

De novo motif prediction based on an ensemble of different tools.

Parameters:
  • inputfile (str) – Filename of input. Can be either BED, narrowPeak or FASTA.

  • outdir (str) – Name of output directory.

  • params (dict, optional) – Optional parameters.

  • filter_significant (bool, optional) – Filter motifs for significance using the validation set.

  • cluster (bool, optional) – Cluster similar predicted (and significant) motifs.

  • create_report (bool, optional) – Create output reports (both .txt and .html).

Returns:

motifs – List of predicted motifs.

Return type:

list

Examples

>>> from gimmemotifs.motif import gimme_motifs
>>> gimme_motifs("input.fa", "motifs.out")
gimmemotifs.motif.motif_from_align(align)

Convert alignment to motif.

Converts a list with sequences to a motif. Sequences should be the same length.

Parameters:

align (list) – List with sequences (A,C,G,T).

Returns:

m – Motif created from the aligned sequences.

Return type:

Motif instance

gimmemotifs.motif.motif_from_consensus(cons, n=1200)

Convert consensus sequence to motif.

Converts a consensus sequences using the nucleotide IUPAC alphabet to a motif.

Parameters:
  • cons (str) – Consensus sequence using the IUPAC alphabet.

  • n (int , optional) – Count used to convert the sequence to a PFM.

Returns:

m – Motif created from the consensus.

Return type:

Motif instance

gimmemotifs.motif.parse_motifs(motifs)

Parse motifs in a variety of formats to return a list of motifs.

Parameters:

motifs (list or str) – Filename of motif, list of motifs or single Motif instance.

Returns:

motifs – List of Motif instances.

Return type:

list

gimmemotifs.motif.read_motifs(infile=None, fmt='pfm', as_dict=False)

Read motifs from a file or stream or file-like object.

Parameters:
  • infile (string or file-like object, optional) – Motif database, filename of motif file or file-like object. If infile is not specified the default motifs as specified in the config file will be returned.

  • fmt (string, optional) – Motif format, can be ‘pfm’, ‘transfac’, ‘xxmotif’, ‘jaspar’ or ‘align’.

  • as_dict (boolean, optional) – Return motifs as a dictionary with motif_id, motif pairs.

Returns:

motifs – List of Motif instances. If as_dict is set to True, motifs is a dictionary.

Return type:

list or dict

Prediction of de novo motifs

De novo motif prediction.

This module contains functions to predict de novo motifs using one or more de novo motif tools. The main function is gimme_motifs, which is likely the only method that you’ll need from this module.

Examples

from gimmemotifs.motif import gimme_motifs

peaks = “Gm12878.CTCF.top500.w200.fa” outdir = “CTCF.gimme” params = {“tools”: “Homer,BioProspector”, “genome”: “hg38”}

motifs = gimme_motifs(peaks, outdir, params=params)

Motif scanning

Scanner core functions

class gimmemotifs.scanner.Scanner(ncpus=None, random_state=None, progress=None)

scan sequences with motifs

best_match(seqs, scan_rc=True, zscore=False, gc=False)

give the best match of each motif in each sequence returns an iterator of nested lists containing tuples: (score, position, strand)

best_score(seqs, scan_rc=True, zscore=False, gc=False)

give the score of the best match of each motif in each sequence returns an iterator of lists containing floats

count(seqs, nreport=100, scan_rc=True)

count the number of matches above the cutoff returns an iterator of lists containing integer counts

property progress

Whether to show progress bars for slow operations.

scan(seqs, nreport=100, scan_rc=True, zscore=False, gc=False)

Scan a set of regions or sequences.

set_background(fname=None, genome=None, size=200, nseq=None, gc=False, gc_bins=None)

Set the background to use for FPR and z-score calculations.

Background can be specified either as a genome name or as the name of a FASTA file.

Parameters:
  • fname (str, optional) – Name of FASTA file to use as background.

  • genome (str, optional) – Name of genome to use to retrieve random sequences.

  • size (int, optional) – Size of genomic sequences to retrieve. The default is 200.

  • nseq (int, optional) – Number of genomic sequences to retrieve.

set_genome(genome=None, genomes_dir=None)

Set the genome to converting regions to sequences

Parameters:
  • genome (str, optional) – Path to the genome fasta or the genome name as found in the genomepy genomes directory.

  • genomes_dir (str, optional) – Path to the genomepy genomes directory. Taken from the genomepy config if unspecified.

set_threshold(fpr=None, threshold=None, gc=False)

Set motif scanning threshold based on background sequences.

Parameters:
  • fpr (float, optional) – Desired FPR, between 0.0 and 1.0.

  • threshold (float or str, optional) – Desired motif threshold, expressed as the fraction of the difference between minimum and maximum score of the PWM. Should either be a float between 0.0 and 1.0 or a filename with thresholds as created by ‘gimme threshold’.

total_count(seqs, nreport=100, scan_rc=True)

count the number of matches above the cutoff returns an iterator of lists containing integer counts

gimmemotifs.scanner.scan_regionfile_to_table(input_table, genome, scoring, pfmfile=None, ncpus=None, zscore=True, gc=True, random_state=None, progress=None)

Scan regions in input table for motifs. Return a dataframe with the motif count/score per region.

Parameters:
  • input_table (str) – Filename of a table with regions as first column. Accepts a feather file.

  • genome (str) – Genome name. Can be a FASTA file or a genomepy genome name.

  • scoring (str) – “count” or “score”. “count” returns the occurrence of each motif (with an FPR threshold of 0.01). “score” returns the best match score of each motif.

  • pfmfile (str, optional) – Specify a PFM file for scanning (or a list of Motif instances).

  • zscore (bool, optional) – Use z-score normalized motif scores. Only used if scoring=”score”.

  • gc (bool, optional) – Equally distribute GC percentages in background sequences.

  • ncpus (int, optional) – If defined this specifies the number of cores to use.

  • random_state (numpy.random.RandomState object, optional) – make predictions deterministic (where possible).

  • progress (bool or None, optional) – provide progress bars for long computations.

Returns:

table – DataFrame with motifs as column names and regions as index. Values are either motif counts or best motif match scores per region, depending on the ‘scoring’ parameter.

Return type:

pandas.DataFrame

gimmemotifs.scanner.scan_to_best_match(fname, motifs, ncpus=None, genome=None, score=False, zscore=False, gc=False, random_state=None, progress=None)

Scan a FASTA file for motifs. Return a dictionary with the best match per motif.

Parameters:
  • fname (str or Fasta) – Filename of a sequence file in FASTA format.

  • motifs (str or list, optional) – Specify a PFM file for scanning (or a list of Motif instances).

  • genome (str) – Genome name. Can be either the name of a FASTA-formatted file or a genomepy genome name.

  • score (bool, optional) – return the best score instead of the best match

  • zscore (bool, optional) – Use z-score normalized motif scores.

  • gc (bool, optional) – Equally distribute GC percentages in background sequences.

  • ncpus (int, optional) – If defined this specifies the number of cores to use.

  • random_state (numpy.random.RandomState object, optional) – make predictions deterministic (where possible).

  • progress (bool or None, optional) – provide progress bars for long computations.

Returns:

result – Dictionary with motif as key and best score/match as values.

Return type:

dict

gimmemotifs.scanner.scan_to_file(inputfile, pfmfile, filepath_or_buffer=None, nreport=1, fpr=0.01, cutoff=None, bed=False, scan_rc=True, table=False, score_table=False, bgfile=None, genome=None, ncpus=None, zscore=True, gcnorm=True, random_state=None, progress=None)

Scan file for motifs.

Parameters:
  • inputfile (str) – path to FASTA, BED or regions file.

  • pfmfile (str or list, optional) – Specify a PFM file for scanning (or a list of Motif instances).

  • filepath_or_buffer (Any, optional) – where to write the output. If unspecified, writes to stdout.

  • nreport (int , optional) – Maximum number of matches to report.

  • fpr (float, optional) – Desired false positive rate, between 0.0 and 1.0.

  • cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif length.

  • scan_rc (bool , optional) – Scan the reverse complement. default: True.

  • table (bool, optional) – output motif counts in tabular format

  • score_table (bool, optional) – output motif scores in tabular format

  • bed (bool, optional) – outputs BED6 format, instead of GTF/GFF format (default).

  • bgfile (str, optional) – FASTA file to use as background sequences. Required if no genome is given.

  • genome (str, optional) – Genome name. Can be either the name of a FASTA-formatted file or a genomepy genome name. Required if no bgfile is given.

  • zscore (bool, optional) – Use z-score normalized motif scores.

  • gcnorm (bool, optional) – Equally distribute GC percentages in background sequences.

  • ncpus (int, optional) – If defined this specifies the number of cores to use.

  • random_state (numpy.random.RandomState object, optional) – make predictions deterministic (where possible).

  • progress (bool or None, optional) – provide progress bars for long computations.

Maelstrom

Class implementing the maelstrom method (Bruse & van Heeringen, 2018)

Examples

run_maelstrom(input, “hg38”, outdir) mr = MaelstromResult(outdir)

class gimmemotifs.maelstrom.MaelstromResult(outdir)

Class for working with maelstrom output.

plot_heatmap(kind='final', min_freq=0.01, threshold=2, name=True, indirect=True, figsize=None, max_number_factors=5, aspect=1, cmap='RdBu_r', **kwargs)

Plot clustered heatmap of predicted motif activity.

Parameters:
  • kind (str, optional) – Which data type to use for plotting. Default is ‘final’, which will plot the result of the rank aggregation. Other options are ‘freq’ for the motif frequencies, or any of the individual activities such as ‘rf.score’.

  • min_freq (float, optional) – Minimum frequency of motif occurrence.

  • threshold (float, optional) – Minimum activity (absolute) of the rank aggregation result.

  • name (bool, optional) – Use factor names instead of motif names for plotting.

  • indirect (bool, optional) – Include indirect factors (computationally predicted or non-curated). Default is True.

  • max_number_factors (int, optional) – Truncate the list of factors to this maximum size.

  • figsize (tuple, optional) – Tuple of figure size (width, height).

  • aspect (int, optional) – Aspect ratio for tweaking the plot.

  • cmap (str, optional) – Color paletter to use, RdBu_r by default.

  • kwargs (other keyword arguments) – All other keyword arguments are passed to sns.heatmap

Returns:

cg – A seaborn ClusterGrid instance.

Return type:

ClusterGrid

plot_scores(motifs, name=True, max_len=50)

Create motif scores boxplot of different clusters. Motifs can be specified as either motif or factor names. The motif scores will be scaled and plotted as z-scores.

Parameters:
  • motifs (iterable or str) – List of motif or factor names.

  • name (bool, optional) – Use factor names instead of motif names for plotting.

  • max_len (int, optional) – Truncate the list of factors to this maximum length.

Returns:

g – Returns the seaborn FacetGrid object with the plot.

Return type:

FacetGrid

class gimmemotifs.maelstrom.Moap

Moap base class.

Motif activity prediction.

classmethod create(name, ncpus=None, **kwargs)

Create a Moap instance based on the predictor name.

Parameters:
  • name (str) – Name of the predictor (eg. Xgboost, BayesianRidge, …)

  • ncpus (int, optional) – Number of threads. Default is the number specified in the config.

  • kwargs (any, optional) – keyword arguments passed to predictor. Unused arguments are dropped.

Returns:

moap – moap instance.

Return type:

Moap instance

classmethod list_classification_predictors()

List available classification predictors.

classmethod list_predictors()

List available predictors.

classmethod list_regression_predictors()

List available regression predictors.

classmethod register_predictor(name)

Register method to keep list of predictors.

gimmemotifs.maelstrom.moap(inputfile, method='hypergeom', scoring=None, outfile=None, motiffile=None, pfmfile=None, genome=None, zscore=True, gc=True, subsample=None, random_state=None, ncpus=None, progress=None)

Run a single motif activity prediction algorithm.

inputfilestr

:1File with regions (chr:start-end) in first column and either cluster name in second column or a table with values.

methodstr, optional

Motif activity method to use. Any of ‘bayesianridge’, ‘xgboost’, ‘mwu’, ‘hypergeom’, ‘rf’, ‘multitasklasso’, ‘svr’. Default is ‘hypergeom’.

scoring: str, optional

Either ‘score’ or ‘count’

outfilestr, optional

Name of outputfile to save the fitted activity values.

motiffilestr, optional

Table with motif scan results. First column should be exactly the same regions as in the inputfile.

pfmfilestr, optional

File with motifs in pfm format. Required when motiffile is not supplied.

genomestr, optional

Genome name, as indexed by gimme. Required when motiffile is not supplied.

zscorebool, optional

Use z-score normalized motif scores.

gcbool, optional

Equally distribute GC percentages in background sequences.

subsamplefloat, optional

Fraction of regions to use.

random_statenumpy.random.RandomState object, optional

make predictions deterministic (where possible).

ncpusint, optional

Number of threads to use. Default is the number specified in the config.

progressbool or None, optional

provide progress bars for long computations.

pandas DataFrame with motif activity

gimmemotifs.maelstrom.run_maelstrom(infile, genome, outdir, pfmfile=None, filter_redundant=True, filter_cutoff=0.8, plot=True, cluster=False, score_table=None, count_table=None, methods=None, ncpus=None, zscore=True, gc=True, center=True, aggregation='int_stouffer', plot_all_motifs=False, plot_no_motifs=False, random_state=None)

Find differential motifs.

Parameters:
  • infile (str) – Filename of input table. Can be either a tab-separated text file or a feather file.

  • genome (str) – Genome name. Can be either the name of a FASTA-formatted file or a genomepy genome name.

  • outdir (str) – Output directory for all results.

  • pfmfile (str, optional) – Specify a PFM file for scanning.

  • filter_redundant (bool, optional) – Create a non-redundant set of motifs based on correlation of motif scores in the input data.

  • filter_cutoff (float, optional) – Cutoff to use for non-redundant motif selection. Default is 0.8.

  • plot (bool, optional) – Create heatmaps.

  • cluster (bool, optional) – If True and if the input table has more than one column, the data is clustered and the cluster activity methods are also run. Not well-tested.

  • score_table (str, optional) – Filename of pre-calculated table with motif scores.

  • count_table (str, optional) – Filename of pre-calculated table with motif counts.

  • methods (list, optional) – Activity methods to use. By default are all used.

  • ncpus (int, optional) – If defined this specifies the number of cores to use.

  • zscore (bool, optional) – Use z-score normalized motif scores.

  • gc (bool, optional) – Use GC% bins to normalize motif scores.

  • center (bool, optional) – Mean-center the input table.

  • aggregation (str, optional) – How to combine scores of the predictors. The default is “int_stouffer”, for inverse normal transform followed by Stouffer’s methods to combine z-scores. Alternatively, “stuart” performs rank aggregation and reports the -log10 of the rank aggregation p-value.

  • plot_all_motifs (bool, optional) – Specify to plot all motifs

  • plot_no_motifs (bool, optional) – Specify to plot no motifs

  • random_state (numpy.random.RandomState object, optional) – make predictions deterministic (where possible).

Motif activity prediction

Motif statistics

Calculate motif enrichment statistics.

gimmemotifs.stats.calc_stats(fg_file=None, bg_file=None, fg_table=None, bg_table=None, motifs=None, stats=None, genome=None, zscore=True, gc=True, ncpus=None)

Calculate motif enrichment metrics.

Parameters:
  • fg_file (str) – Filename of a FASTA, BED or region file with positive sequences.

  • bg_file (str) – Filename of a FASTA, BED or region file with negative sequences.

  • fg_table (str) – Filename of a table with motif scan results of positive sequences.

  • bg_table (str) – Filename of a table with motif scan results of negative sequences.

  • motifs (str, list or Motif instance) – A file with motifs in pwm format, a list of Motif instances or a single Motif instance.

  • genome (str, optional) – Genome or index directory in case of BED/regions.

  • stats (list, optional) – Names of metrics to calculate. See gimmemotifs.rocmetrics.__all__ for available metrics.

  • ncpus (int, optional) – Number of cores to use.

Returns:

result – Dictionary with results where keys are motif ids and the values are dictionary with metric name and value pairs.

Return type:

dict

gimmemotifs.stats.calc_stats_iterator(fg_file=None, bg_file=None, fg_table=None, bg_table=None, motifs=None, stats=None, genome=None, zscore=True, gc=True, ncpus=None)

Calculate motif enrichment metrics.

Parameters:
  • fg_file (str, optional) – Filename of a FASTA, BED or region file with positive sequences.

  • bg_file (str, optional) – Filename of a FASTA, BED or region file with negative sequences.

  • fg_table (str, optional) – Filename of a table with motif scan results of positive sequences.

  • bg_table (str, optional) – Filename of a table with motif scan results of negative sequences.

  • motifs (str, list or Motif instance, optional) – A file with motifs in pfm format, a list of Motif instances or a single Motif instance. If motifs is None, the default motif database is used.

  • genome (str, optional) – Genome or index directory in case of BED/regions.

  • stats (list, optional) – Names of metrics to calculate. See gimmemotifs.rocmetrics.__all__ for available metrics.

  • ncpus (int, optional) – Number of cores to use.

Returns:

result – Dictionary with results where keys are motif ids and the values are dictionary with metric name and value pairs.

Return type:

dict

gimmemotifs.stats.rank_motifs(stats, metrics=('roc_auc', 'recall_at_fdr'))

Determine mean rank of motifs based on metrics.

gimmemotifs.stats.write_stats(stats, fname, header=None)

write motif statistics to text file.

Motif comparison

Module to compare DNA sequence motifs (positional frequency matrices)

class gimmemotifs.comparison.MotifComparer

Class for motif comparison.

Compare two or more motifs using a variety of metrics. Probably the best metric to compare motifs is seqcor. The implementation of this metric is similar to the one used in Grau (2015), where motifs are scored according to the Pearson correlation of the scores along sequence. In this case a de Bruijn of k=7 is used.

Valid metrics are: seqcor - Pearson correlation of motif scores along sequence. pcc - Pearson correlation coefficient of motif PFMs. ed - Euclidean distance-based similarity of motif PFMs. distance - Distance-based similarity of motif PFMs. wic - Weighted Information Content, see van Heeringen 2011. chisq - Chi-squared similarity of motif PFMs. akl - Similarity based on average Kullback-Leibler similarity, see Mahony, 2011. ssd - Sum of squared distances of motif PFMs.

Examples

mc = MotifComparer()

# Compare two motifs score, pos, strand = mc.compare_motifs(m1, m2, metric=”seqcor”)

# Compare a list of motifs to another list of motifs mc.get_all_scores(motifs, dbmotifs, match, metric, combine)

# Get the best match for every motif in a list of reference motifs get_closest_match(motifs, dbmotifs=None)

compare_motifs(m1, m2, match='total', metric='wic', combine='mean', pval=False)

Compare two motifs.

The similarity metric can be any of seqcor, pcc, ed, distance, wic, chisq, akl or ssd. If match is ‘total’ the similarity score is calculated for the whole match, including positions that are not present in both motifs. If match is partial or subtotal, only the matching psotiions are used to calculate the score. The score of individual position is combined using either the mean or the sum.

Note that the match and combine parameters have no effect on the seqcor similarity metric.

Parameters:
  • m1 (Motif instance) – Motif instance 1.

  • m2 (Motif instance) – Motif instance 2.

  • match (str, optional) – Match can be “partial”, “subtotal” or “total”. Not all metrics use this.

  • metric (str, optional) – Distance metric.

  • combine (str, optional) – Combine positional scores using “mean” or “sum”. Not all metrics use this.

  • pval (bool, optional) – Calculate p-vale of match.

Return type:

score, position, strand

get_all_scores(motifs, dbmotifs, match, metric, combine, pval=False, parallel=True, trim=None, ncpus=None)

Pairwise comparison of a set of motifs compared to reference motifs.

Parameters:
  • motifs (list) – List of Motif instances.

  • dbmotifs (list) – List of Motif instances.

  • match (str) – Match can be “partial”, “subtotal” or “total”. Not all metrics use this.

  • metric (str) – Distance metric.

  • combine (str) – Combine positional scores using “mean” or “sum”. Not all metrics use this.

  • pval (bool , optional) – Calculate p-vale of match.

  • parallel (bool , optional) – Use multiprocessing for parallel execution. True by default.

  • trim (float or None) – If a float value is specified, motifs are trimmed used this IC cutoff before comparison.

  • ncpus (int or None) – Specifies the number of cores to use for parallel execution.

Returns:

scores – Dictionary with scores.

Return type:

dict

get_best_matches(motifs, nmatches=1, dbmotifs=None, match='partial', metric='wic', combine='mean', parallel=True, ncpus=None)

Return best match in database for motifs.

Parameters:
  • motifs (list or str) – Filename of motifs or list of motifs.

  • nmatches (int, optional) – Number of matches to return, default is 1.

  • dbmotifs (list or str, optional) – Database motifs, default will be used if not specified.

  • match (str, optional) –

  • metric (str, optional) –

  • combine (str, optional) –

  • ncpus (int, optional) – Number of threads to use.

Returns:

closest_match

Return type:

dict

get_closest_match(motifs, dbmotifs=None, match='partial', metric='wic', combine='mean', parallel=True, ncpus=None)

Return best match in database for motifs.

Parameters:
  • motifs (list or str) – Filename of motifs or list of motifs.

  • dbmotifs (list or str, optional) – Database motifs, default will be used if not specified.

  • match (str, optional) –

  • metric (str, optional) –

  • combine (str, optional) –

  • ncpus (int, optional) – Number of threads to use.

Returns:

closest_match

Return type:

dict

gimmemotifs.comparison.akl(p1, p2)

Calculates motif position similarity based on average Kullback-Leibler similarity.

See Mahony, 2007.

Parameters:
  • p1 (list) – Motif position 1.

  • p2 (list) – Motif position 2.

Returns:

score

Return type:

float

gimmemotifs.comparison.chisq(p1, p2)

Calculates motif position similarity based on chi-square.

Parameters:
  • p1 (list) – Motif position 1.

  • p2 (list) – Motif position 2.

Returns:

score

Return type:

float

gimmemotifs.comparison.seqcor(m1, m2, seq=None)

Calculates motif similarity based on Pearson correlation of scores.

Based on Kielbasa (2015) and Grau (2015). Scores are calculated based on scanning a de Bruijn sequence of 7-mers. This sequence is taken from ShortCAKE (Orenstein & Shamir, 2015). Optionally another sequence can be given as an argument.

Parameters:
  • m1 (Motif instance) – Motif 1 to compare.

  • m2 (Motif instance) – Motif 2 to compare.

  • seq (str, optional) – Sequence to use for scanning instead of k=7 de Bruijn sequence.

Return type:

score, position, strand

gimmemotifs.comparison.ssd(p1, p2)

Calculates motif position similarity based on sum of squared distances.

Parameters:
  • p1 (list) – Motif position 1.

  • p2 (list) – Motif position 2.

Returns:

score

Return type:

float