API documentation¶
Examples¶
You can follow the interactive API example documentation in a Juypyter notebook using mybinder:
Working with motifs¶
The Motif class stores motif information. There are several ways to create a Motif instance.
from gimmemotifs.motif import Motif,read_motifs
# Read from file
motifs = read_motifs("example.pfm")
for motif in motifs:
print(motif)
AP1_nTGAGTCAy
CTCF_CCAsyAGrkGGCr
# Create from scratch
m = Motif([[0,1,0,0],[0,0,1,0]])
m.id = "CpG"
print(m)
CpG_CG
# Or from a consensus sequence
from gimmemotifs.motif import motif_from_consensus
ap1 = motif_from_consensus("TGASTCA")
print(ap1.to_ppm())
>TGASTCA
0.0001 0.0001 0.0001 0.9998
0.0001 0.0001 0.9998 0.0001
0.9998 0.0001 0.0001 0.0001
0.0001 0.4999 0.4999 0.0001
0.0001 0.0001 0.0001 0.9998
0.0001 0.9998 0.0001 0.0001
0.9998 0.0001 0.0001 0.0001
Read motifs from files in other formats.
motifs = read_motifs("MA0099.3.jaspar", fmt="jaspar")
print(motifs[0])
You can convert a motif to several formats.
motifs = read_motifs("example.pfm")
# pwm
print(motifs[0].to_ppm())
>AP1
0.5558 0.1469 0.2734 0.0240
0.0020 0.0015 0.0017 0.9948
0.0039 0.0019 0.9502 0.0439
0.9697 0.0220 0.0018 0.0065
0.0377 0.3311 0.6030 0.0283
0.0033 0.0031 0.0043 0.9893
0.0188 0.9775 0.0023 0.0014
0.9951 0.0021 0.0012 0.0015
0.0121 0.3096 0.1221 0.5561
# pfm
print(motifs[0].to_pfm())
>AP1
555.8 146.9 273.4 24.0
2.0 1.5 1.7 994.8000000000001
3.9 1.9 950.2 43.9
969.7 22.0 1.8 6.5
37.699999999999996 331.1 603.0 28.299999999999997
3.3 3.1 4.3 989.3
18.8 977.5 2.3 1.4
995.1 2.1 1.2 1.5
12.1 309.59999999999997 122.1 556.1
# consensus sequence
print(motifs[0].to_consensus())
ATGAsTCAy
# TRANSFAC
print(motifs[0].to_transfac())
DE AP1 unknown
0 555 146 273 24 A
1 2 1 1 994 T
2 3 1 950 43 G
3 969 22 1 6 A
4 37 331 603 28 s
5 3 3 4 989 T
6 18 977 2 1 C
7 995 2 1 1 A
8 12 309 122 556 y
XX
# MEME
print(motifs[0].to_meme())
MOTIF AP1
BL MOTIF AP1 width=0 seqs=0
letter-probability matrix: alength= 4 w= 9 nsites= 1000.1 E= 0
0.5558 0.1469 0.2734 0.024
0.002 0.0015 0.0017 0.9948
0.0039 0.0019 0.9502 0.0439
0.9697 0.022 0.0018 0.0065
0.0377 0.3311 0.603 0.0283
0.0033 0.0031 0.0043 0.9893
0.0188 0.9775 0.0023 0.0014
0.9951 0.0021 0.0012 0.0015
0.0121 0.3096 0.1221 0.5561
Some other useful tidbits.
m = motif_from_consensus("NTGASTCAN")
print(len(m))
9
# Trim by information content
m.trim(0.5)
print(m.to_consensus(), len(m))
TGAsTCA 7
# Slices
print(m[:3].to_consensus())
TGA
# Shuffle
random_motif = motif_from_consensus("NTGASTGAN").randomize()
print(random_motif)
random_snCTAGTAn
To convert a motif to an image, use plot_logo()
.
Many different file formats are supported, such as png, svg and pdf.
m = motif_from_consensus("NTGASTCAN")
m.plot_logo(fname="ap1.png")
Motif scanning¶
For very simple scanning, you can just use a Motif instance.
Let’s say we have a FASTA file called test.fa
that looks like this:
>seq1
AAAAAAAAAAAAAAAAAAAAAA
>seq2
CGCGCGTGAGTCACGCGCGCGCG
TGASTCAAAAAAAAAATGASTCA
Now we can use this file for scanning.
from gimmemotifs.motif import motif_from_consensus
from gimmemotifs.fasta import Fasta
f = Fasta("test.fa")
m = motif_from_consensus("TGAsTCA")
m.scan(f)
{'seq1': [], 'seq2': [6, 6], 'seq3': [0, 16, 0, 16]}
This return a dictionary with the sequence names as keys.
The value is a list with positions where the motif matches.
Here, as the AP1 motif is a palindrome, you see matches on both forward and reverse strand.
This is more clear when we use scan_all()
that returns position, score and strand for every match.
m.scan_all(f)
{'seq1': [],
'seq2': [(6, 9.02922042678255, 1), (6, 9.02922042678255, -1)],
'seq3': [(0, 8.331251500673487, 1),
(16, 8.331251500673487, 1),
(0, 8.331251500673487, -1),
(16, 8.331251500673487, -1)]}
The number of matches to return is set to 50 by default, you can control this by setting the nreport
argument.
Use scan_rc=False
to only scan the forward orientation.
m.scan_all(f, nreport=1, scan_rc=False)
{'seq1': [],
'seq2': [(6, 9.02922042678255, 1)],
'seq3': [(0, 8.331251500673487, 1)]}
While this functionality works, it is not very efficient.
To scan many motifs in potentially many sequences, use the functionality in the scanner
module.
If you only want the best match per sequence, is a utility function called scan_to_best_match
, otherwise, use the Scanner
class.
from gimmemotifs.motif import motif_from_consensus
from gimmemotifs.scanner import scan_to_best_match
m1 = motif_from_consensus("TGAsTCA")
m1.id = "AP1"
m2 = motif_from_consensus("CGCG")
m2.id = "CG"
motifs = [m1, m2]
print("motif\tpos\tscore")
result = scan_to_best_match("test.fa", motifs)
for motif, matches in result.items():
for match in matches:
print(f"{motif}\t{match[1]}\t{match[0]}")
motif pos score
CG 0 -18.26379789133924
CG 0 5.554366880674296
CG 0 -7.743307225501047
AP1 0 -20.052563923836903
AP1 6 9.029486018303187
AP1 0 8.331550321011443
The matches are in the same order as the sequences in the original file.
While this function can be very useful, a Scanner
instance is much more flexible.
You can scan different input formats (BED, FASTA, regions), and control the thresholds and output.
As an example we will use the file Gm12878.CTCF.top500.w200.fa
that contains 500 top CTCF peaks.
We will get the CTCF motif and scan this file in a number of different ways.
from gimmemotifs.motif import read_motifs
from gimmemotifs.scanner import Scanner
from gimmemotifs.fasta import Fasta
import numpy as np
# Input file
fname = "Gm12878.CTCF.top500.w200.fa"
# Select the CTCF motif from the default motif database
motifs = [m for m in read_motifs() if "CTCF" in m.factors['direct']][:1]
# Initialize the scanner
s = Scanner()
s.set_motifs(motifs)
Now let’s get the best score for the CTCF motif for each sequence.
scores = [r[0] for r in s.best_score("examples/Gm12878.CTCF.top500.w200.fa")]
print("{}\t{:.2f}\t{:.2f}\t{:.2f}".format(
len(scores),
np.mean(scores),
np.min(scores),
np.max(scores)
))
500 11.00 1.45 15.07
In many cases you’ll want to set a threshold. In this example we’ll use a 1% FPR threshold, based on scanning randomly selected sequences from the ghg38 genome. The first time you run this, it will take a while. However, the tresholds will be cached. This means that for the same combination of motifs and genome, the previously generated threshold will be used.
# Set a 1% FPR threshold based on random hg38 sequence
s.set_genome("hg38")
s.set_threshold(fpr=0.01)
# get the number of sequences with at least one match
counts = [n[0] for n in s.count("Gm12878.CTCF.top500.w200.fa", nreport=1)]
print(counts[:10])
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
# or the grand total of number of sequences with 1 match
print(s.total_count("examples/Gm12878.CTCF.top500.w200.fa", nreport=1))
[404]
# Scanner.scan() just gives all information
seqs = Fasta("Gm12878.CTCF.top500.w200.fa")[:10]
for i,result in enumerate(s.scan(seqs)):
seqname = seqs.ids[i]
for m,matches in enumerate(result):
motif = motifs[m]
for score, pos, strand in matches:
print(seqname, motif, score, pos, strand)
chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.4959558370929 143 -1
chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.821440417077262 22 -1
chr11:190037-190237 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.658439190070851 82 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 120 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.72460285196088 83 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.450778540447134 27 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.037330832055455 7 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.998038360035828 159 -1
chr14:106873577-106873777 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.668660161058972 101 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 145 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.848270770440264 185 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.668171128367552 165 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 12.785329839873164 27 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 126 -1
chr14:106765204-106765404 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.25063146496227 67 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 28 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 185 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.261096435278661 67 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.450778540447134 147 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.022594547749485 126 -1
chr15:22461178-22461378 C2H2_ZF_Average_200_CCAsyAGrkGGCr 10.194691222675097 7 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 37 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 95 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.886792072933595 153 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.972530270193543 75 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.949273408029276 17 -1
chr14:107119996-107120196 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.949273408029276 133 -1
chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 14.16061638444734 92 -1
chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.25063146496227 34 -1
chr14:107238917-107239117 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.246743494108388 15 -1
chr6:53036754-53036954 C2H2_ZF_Average_200_CCAsyAGrkGGCr 8.764279993851783 62 1
chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.697109967765122 33 -1
chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 13.204664711685334 149 -1
chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.222131525579154 92 -1
chr14:107147705-107147905 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.222131525579154 130 -1
chr14:50328834-50329034 C2H2_ZF_Average_200_CCAsyAGrkGGCr 11.148765667117496 133 1
chr1:114889205-114889405 C2H2_ZF_Average_200_CCAsyAGrkGGCr 9.752478102244137 124 1
Finding de novo motifs¶
Let’s take the Gm12878.CTCF.top500.w200.fa
file as example again.
For a basic example we’ll just use two motif finders, as they’re quick to run.
from gimmemotifs.denovo import gimme_motifs
peaks = "Gm12878.CTCF.top500.w200.fa"
outdir = "CTCF.gimme"
params = {
"tools": "Homer,BioProspector",
}
motifs = gimme_motifs(peaks, outdir, params=params)
2017-06-30 07:37:00,079 - INFO - starting full motif analysis
2017-06-30 07:37:00,082 - INFO - preparing input (FASTA)
2017-06-30 07:37:32,949 - INFO - starting motif prediction (medium)
2017-06-30 07:37:32,949 - INFO - tools: BioProspector, Homer
2017-06-30 07:37:40,540 - INFO - BioProspector_width_5 finished, found 5 motifs
2017-06-30 07:37:41,308 - INFO - BioProspector_width_7 finished, found 5 motifs
2017-06-30 07:37:41,609 - INFO - BioProspector_width_6 finished, found 5 motifs
2017-06-30 07:37:42,003 - INFO - BioProspector_width_8 finished, found 5 motifs
2017-06-30 07:37:44,054 - INFO - Homer_width_5 finished, found 5 motifs
2017-06-30 07:37:45,201 - INFO - Homer_width_6 finished, found 5 motifs
2017-06-30 07:37:48,246 - INFO - Homer_width_7 finished, found 5 motifs
2017-06-30 07:37:50,503 - INFO - Homer_width_8 finished, found 5 motifs
2017-06-30 07:37:54,649 - INFO - BioProspector_width_9 finished, found 5 motifs
2017-06-30 07:37:56,169 - INFO - BioProspector_width_10 finished, found 5 motifs
2017-06-30 07:37:56,656 - INFO - Homer_width_9 finished, found 5 motifs
2017-06-30 07:37:59,313 - INFO - Homer_width_10 finished, found 5 motifs
2017-06-30 07:37:59,314 - INFO - all jobs submitted
2017-06-30 07:39:21,298 - INFO - predicted 60 motifs
2017-06-30 07:39:21,326 - INFO - 53 motifs are significant
2017-06-30 07:39:21,410 - INFO - clustering significant motifs.
2017-06-30 07:39:47,031 - INFO - creating reports
2017-06-30 07:40:41,024 - INFO - finished
2017-06-30 07:40:41,024 - INFO - output dir: CTCF.gimme
2017-06-30 07:40:41,024 - INFO - report: CTCF.gimme/motif_report.html
This will basically run the same pipeline as the gimme motifs
command.
All output files will be stored in outdir
and gimme_motifs
returns a list of Motif instances.
If you only need the motifs but not the graphical report, you can decide to skip it by setting create_report
to False
.
Additionally, you can choose to skip clustering (cluster=False
) or to skip calculation of significance (filter_significant=False
).
For instance, the following command will only predict motifs and cluster them.
motifs = gimme_motifs(peaks, outdir,
params=params, filter_significant=False, create_report=False)
All parameters for motif finding are set by the params
argument.
Although the gimme_motifs()
function is probably the easiest way to run the de novo finding tools, you can also run any of the tools directly. In this case you would also have to supply the background file if the specific tool requires it.
from gimmemotifs.tools import get_tool
from gimmemotifs.background import MatchedGcFasta
m = get_tool("homer") # tool name is case-insensitive
# Create a background fasta file with a similar GC%
fa = MatchedGcFasta("TAp73alpha.fa", number=1000)
fa.writefasta("bg.fa")
# Run motif prediction
params = {
"background": "bg.fa",
"width": "20",
"number": 5,
}
motifs, stdout, stderr = m.run("TAp73alpha.fa", params=params)
print(motifs[0].to_consensus())
nnnCnTGynnnGrCwTGyyn
Motif statistics¶
With some motifs, a sample file and a background file you can calculate motif statistics.
Let’s say I wanted to know which of the p53-family motifs is most enriched in the file TAp73alpha.fa
.
First, we’ll generate a GC%-matched genomic background. Then we only select p53 motifs.
from gimmemotifs.background import MatchedGcFasta
from gimmemotifs.fasta import Fasta
from gimmemotifs.stats import calc_stats
from gimmemotifs.motif import default_motifs
sample = "TAp73alpha.fa"
bg = MatchedGcFasta(sample, genome="hg19", number=1000)
motifs = [m for m in default_motifs() if any(f in m.factors['direct'] for f in ["TP53", "TP63", "TP73"])]
stats = calc_stats(motifs=motifs, fg_file=sample, bg_file=bg, genome="hg19")
print("Stats for", motifs[0])
for k, v in stats[str(motifs[0])].items():
print(k,v)
print()
best_motif = sorted(motifs, key=lambda x: stats[str(x)]["recall_at_fdr"])[-1]
print("Best motif (recall at 10% FDR):", best_motif)
Stats for GM.5.0.p53.0001_rCATGyCCnGrCATGy
recall_at_fdr 0.833
fraction_fpr 0.416
score_at_fpr 9.05025905735
enr_at_fpr 41.6
max_enrichment 55.5
phyper_at_fpr 3.33220067463e-132
mncp 1.85474606318
roc_auc 0.9211925
roc_auc_xlim 0.0680115
pr_auc 0.927368602993
max_fmeasure 0.867519181586
ks_pvalue 0.0
ks_significance inf
Best motif (recall at 10% FDR): GM.5.0.p53.0001_rCATGyCCnGrCATGy
A lot of statistics are generated and you will not always need all of them.
You can choose one or more specific metrics with the additional stats
argument.
metrics = ["roc_auc", "recall_at_fdr"]
stats = calc_stats(motifs=motifs, fg_file=sample, bg_file=bg, stats=metrics, genome="hg19")
for metric in metrics:
for motif in motifs:
print("{}\t{}\t{:.2f}".format(
motif.id, metric, stats[str(motif)][metric]
))
p53_M5923_1.01 roc_auc 0.63
p53_M5922_1.01 roc_auc 0.64
p53_Average_10 roc_auc 0.83
p53_Average_8 roc_auc 0.93
p53_M3568_1.01 roc_auc 0.83
p53_M5923_1.01 recall_at_fdr 0.00
p53_M5922_1.01 recall_at_fdr 0.00
p53_Average_10 recall_at_fdr 0.24
p53_Average_8 recall_at_fdr 0.83
p53_M3568_1.01 recall_at_fdr 0.42
Motif comparison¶
Compare two motifs.
from gimmemotifs.comparison import MotifComparer
from gimmemotifs.motif import motif_from_consensus
from gimmemotifs.motif import read_motifs
m1 = motif_from_consensus("RRRCATGYYY")
m2 = motif_from_consensus("TCRTGT")
mc = MotifComparer()
score, pos, orient = mc.compare_motifs(m1, m2)
if orient == -1:
m2 = m2.rc()
pad1, pad2 = "", ""
if pos < 0:
pad1 = " " * -pos
elif pos > 0:
pad2 =" " * pos
print(pad1 + m1.to_consensus())
print(pad2 + m2.to_consensus())
rrrCATGyyy
ACAyGA
Find closest match in a motif database.
motifs = [
motif_from_consensus("GATA"),
motif_from_consensus("NTATAWA"),
motif_from_consensus("ACGCG"),
]
mc = MotifComparer()
results = mc.get_closest_match(motifs, dbmotifs=read_motifs("HOMER"), metric="seqcor")
# Load motifs
db = read_motifs("HOMER", as_dict=True)
for motif in motifs:
match, scores = results[motif.id]
print("{}: {} - {:.3f}".format(motif.id, match, scores[0]))
dbmotif = db[match]
orient = scores[2]
if orient == -1:
dbmotif = dbmotif.rc()
padm, padd = 0, 0
if scores[1] < 0:
padm = -scores[1]
elif scores[1] > 0:
padd = scores[1]
print(" " * padm + motif.to_consensus())
print(" " * padd + dbmotif.to_consensus())
print()
GATA: AGATAASR_GATA3(Zf)/iTreg-Gata3-ChIP-Seq(GSE20898)/Homer - 0.823
GATA
AGATAAnr
NTATAWA: NGYCATAAAWCH_CDX4(Homeobox)/ZebrafishEmbryos-Cdx4.Myc-ChIP-Seq(GSE48254)/Homer - 0.747
nTATAwA
nnynrTAAAnnn
ACGCG: NCCACGTG_c-Myc(bHLH)/LNCAP-cMyc-ChIP-Seq(Unpublished)/Homer - 0.744
ACGCG
CACGTGGn
Auto-generated¶
This part of the API documentation is not yet complete.
The Motif class¶
Motif core functions
- class gimmemotifs.motif.Motif(pfm=None, ppm=None, places=4)¶
Representation of a transcription factor binding motif.
Examples
>>> motif = Motif([[0,1,0,0], [0.5,0,0,0.5], [0,0,1,0]]) >>> print(motif.to_ppm()) > 0 1 0 0 0.5 0 0 0.5 0 0 1 0 >>> print(motif.to_consensus()) CwG
- average_motifs(other, pos, orientation, include_bg=False)¶
Return the average of two motifs.
Combine this motif with another motif and return the average as a new Motif object. The position and orientatien need to be supplied. The pos parameter is the position of the second motif relative to this motif.
For example, take the following two motifs: Motif 1: CATGYT Motif 2: GGCTTGY
With position -2, the motifs are averaged as follows: xxCATGYT GGCTTGYx
- Parameters:
other (Motif object) – Other Motif object.
pos (int) – Position of the second motif relative to this motif.
orientation (int) – Orientation, should be 1 or -1. If the orientation is -1 then the reverse complement of the other motif is used for averaging.
include_bg (bool , optional) – Extend both motifs with background frequencies (0.25) before averaging. False by default.
- Returns:
motif – New Motif object containing average motif.
- Return type:
motif object
- property consensus¶
Motif converted to consensus sequence.
- Returns:
Consensus sequence.
- Return type:
str
- consensus_scan(fa)¶
Scan FASTA with the motif as a consensus sequence.
- Parameters:
fa (Fasta object) – Fasta object to scan
- Returns:
matches – Dictionaru with matches.
- Return type:
dict
- property hash¶
Return hash of motif.
This is an unique identifier of a motif, regardless of the id.
- Returns:
Hash of motif.
- Return type:
str
- ic_pos(row1, row2=None)¶
Calculate the information content of one position.
- Returns:
score – Information content.
- Return type:
float
- property information_content¶
Return the total information content of the motif.
- Returns:
Motif information content.
- Return type:
float
- property max_score¶
Return the maximum logodds score.
- Returns:
score – Maximum logodds score.
- Return type:
float
- property min_score¶
Return the minimum logodds score.
- Returns:
score – Minimum logodds score.
- Return type:
float
- pfm_to_ppm(pfm, pseudo=0.001)¶
Convert PFM with counts to a PFM with fractions (PPM).
- Parameters:
pfm (array_like) – 2-dimensional array_like with counts.
pseudo (float) – Pseudocount used in conversion.
- Returns:
2-dimensional array with probability count matrix
- Return type:
array_like
- plot_ensembl_logo(fname=None, ic=True, title=True, letters=True, height=2)¶
Plot motif logo.
This is an implementation of the logo presented here: http://www.ensembl.info/2018/10/15/new-ensembl-motif-features/
- Parameters:
fname (str, optional) – If fname is set, the plot will be saved with fname as filename.
ic (bool, optional) – Use the bit score. If this is set to False, the frequency will be used.
title (bool, optional) – Plot the motif id as the title.
letters (bool, optional) – Plot the nucleotides in the bars.
height (float, optional) – Height of the plot.
- plot_logo(fname=None, kind='information', title=True, ylabel=True, add_left=0, ax=None)¶
Plot motif logo
- Parameters:
fname (str, optional) – If fname is set, the plot will be saved with fname as filename.
kind (str, optional) – Type of logo to plot, can be ‘information’, ‘frequency’, ‘energy’ or ‘ensembl’.
title (bool, optional) – Plot the motif id as the title.
ylabel (bool, optional) – Plot the Y axis label.
add_left (int, optional) – Add non-informative positions to the left (to align logo)
- pwm_max_score()¶
Return the maximum PWM score.
DEPRECATED: use max_score instead.
- Returns:
score – Maximum PWM score.
- Return type:
float
- pwm_min_score()¶
Return the minimum PWM score.
DEPRECATED: use min_score instead.
- Returns:
score – Minimum PWM score.
- Return type:
float
- pwm_scan_score(fa, cutoff=0, nreport=1, scan_rc=True)¶
Scan sequences with this motif.
Scan sequences from a FASTA object with this motif. Less efficient than using a Scanner object. By setting the cutoff to 0.0 and nreport to 1, the best match for every sequence will be returned. Only the score of the matches is returned.
- Parameters:
fa (Fasta object) – Fasta object to scan.
cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif lengh.
nreport (int , optional) – Maximum number of matches to report.
scan_rc (bool , optional) – Scan the reverse complement. True by default.
- Returns:
matches – Dictionary with motif matches. Only the score of the matches is returned.
- Return type:
dict
- pwm_scan_to_gff(fa, gfffile, cutoff=0.9, nreport=50, scan_rc=True, append=False)¶
Scan sequences with this motif and save to a GFF file.
Scan sequences from a FASTA object with this motif. Less efficient than using a Scanner object. By setting the cutoff to 0.0 and nreport to 1, the best match for every sequence will be returned. The output is save to a file in GFF format.
- Parameters:
fa (Fasta object) – Fasta object to scan.
gfffile (str) – Filename of GFF output file.
cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif lengh.
nreport (int , optional) – Maximum number of matches to report.
scan_rc (bool , optional) – Scan the reverse complement. True by default.
append (bool , optional) – Append to GFF file instead of overwriting it. False by default.
- randomize()¶
Create a new motif with shuffled positions.
Shuffle the positions of this motif and return a new Motif instance.
- Returns:
m – Motif instance with shuffled positions.
- Return type:
Motif instance
- rc()¶
Return the reverse complemented motif.
- Returns:
New Motif instance with the reverse complement of the input motif.
- Return type:
Motif instance
- sample(n_seqs, rng=None)¶
Sample n_seqs random sequences from a motif. The sequences follow the distribution of the motif ppm.
- Parameters:
n_seqs (int) – number of sequences to sample
rng (np.random.Generator) – random number generator, optional
- Returns:
sequences – A list of all the samples sequences
- Return type:
List[str]
- scan(fa, cutoff=0.9, nreport=50, scan_rc=True)¶
Scan sequences with this motif.
Scan sequences from a FASTA object with this motif. Less efficient than using a Scanner object. By setting the cutoff to 0.0 and nreport to 1, the best match for every sequence will be returned. Only the position of the matches is returned.
- Parameters:
fa (Fasta object) – Fasta object to scan.
cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif lengh.
nreport (int , optional) – Maximum number of matches to report.
scan_rc (bool , optional) – Scan the reverse complement. True by default.
- Returns:
matches – Dictionary with motif matches. Only the position of the matches is returned.
- Return type:
dict
- scan_all(fa, cutoff=0.9, nreport=50, scan_rc=True)¶
Scan sequences with this motif.
Scan sequences from a FASTA object with this motif. Less efficient than using a Scanner object. By setting the cutoff to 0.0 and nreport to 1, the best match for every sequence will be returned. The score, position and strand for every match is returned.
- Parameters:
fa (Fasta object) – Fasta object to scan.
cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif lengh.
nreport (int , optional) – Maximum number of matches to report.
scan_rc (bool , optional) – Scan the reverse complement. True by default.
- Returns:
matches – Dictionary with motif matches. The score, position and strand for every match is returned.
- Return type:
dict
- score_kmer(kmer)¶
Calculate the log-odds score for a specific k-mer.
Note: this is not necessarily the fastest way for scanning.
- Parameters:
kmer (str) – String representing a kmer. Should be the same length as the motif.
- Returns:
score – Log-odd score.
- Return type:
float
- shuffle()¶
Create a new motif with shuffled positions.
Shuffle the positions of this motif and return a new Motif instance.
- Returns:
Motif instance with shuffled positions.
- Return type:
Motif instance
- to_consensus(ppm=None, precision=4)¶
Convert position probability matrix to consensus sequence.
- Parameters:
ppm (array_like, optional) – If not supplied, the ppm of the Motif object will be used.
precision (int, optional) – Precision used for rounding.
- Returns:
Consensus sequence.
- Return type:
str
- to_meme()¶
Return motif formatted in MEME format
- Returns:
m – String of motif in MEME format.
- Return type:
str
- to_motevo()¶
Return motif formatted in MotEvo (TRANSFAC-like) format
- Returns:
m – String of motif in MotEvo format.
- Return type:
str
- to_ppm(precision=4, extra_str='')¶
Return ppm as string.
- Parameters:
precision (int, optional, default 4) – Floating-point precision.
| (extra_str) – Extra text to include with motif id line.
- Returns:
motif_str – Motif formatted in ppm format.
- Return type:
str
- to_pwm(precision=4, extra_str='')¶
Return ppm as string.
- Parameters:
precision (int, optional, default 4) – Floating-point precision.
| (extra_str) – Extra text to include with motif id line.
- Returns:
motif_str – Motif formatted in ppm format.
- Return type:
str
- to_transfac()¶
Return motif formatted in TRANSFAC format
- Returns:
m – String of motif in TRANSFAC format.
- Return type:
str
- trim(edge_ic_cutoff=0.4)¶
Trim positions with an information content lower than the threshold.
The default threshold is set to 0.4. The Motif will be changed in-place.
- Parameters:
edge_ic_cutoff (float, optional) – Information content threshold. All motif positions at the flanks with an information content lower thab this will be removed.
- Returns:
m
- Return type:
Motif instance
- gimmemotifs.motif.gimme_motifs(inputfile, outdir, params=None, filter_significant=True, cluster=True, create_report=True)¶
De novo motif prediction based on an ensemble of different tools.
- Parameters:
inputfile (str) – Filename of input. Can be either BED, narrowPeak or FASTA.
outdir (str) – Name of output directory.
params (dict, optional) – Optional parameters.
filter_significant (bool, optional) – Filter motifs for significance using the validation set.
cluster (bool, optional) – Cluster similar predicted (and significant) motifs.
create_report (bool, optional) – Create output reports (both .txt and .html).
- Returns:
motifs – List of predicted motifs.
- Return type:
list
Examples
>>> from gimmemotifs.motif import gimme_motifs >>> gimme_motifs("input.fa", "motifs.out")
- gimmemotifs.motif.motif_from_align(align)¶
Convert alignment to motif.
Converts a list with sequences to a motif. Sequences should be the same length.
- Parameters:
align (list) – List with sequences (A,C,G,T).
- Returns:
m – Motif created from the aligned sequences.
- Return type:
Motif instance
- gimmemotifs.motif.motif_from_consensus(cons, n=1200)¶
Convert consensus sequence to motif.
Converts a consensus sequences using the nucleotide IUPAC alphabet to a motif.
- Parameters:
cons (str) – Consensus sequence using the IUPAC alphabet.
n (int , optional) – Count used to convert the sequence to a PFM.
- Returns:
m – Motif created from the consensus.
- Return type:
Motif instance
- gimmemotifs.motif.parse_motifs(motifs)¶
Parse motifs in a variety of formats to return a list of motifs.
- Parameters:
motifs (list or str) – Filename of motif, list of motifs or single Motif instance.
- Returns:
motifs – List of Motif instances.
- Return type:
list
- gimmemotifs.motif.read_motifs(infile=None, fmt='pfm', as_dict=False)¶
Read motifs from a file or stream or file-like object.
- Parameters:
infile (string or file-like object, optional) – Motif database, filename of motif file or file-like object. If infile is not specified the default motifs as specified in the config file will be returned.
fmt (string, optional) – Motif format, can be ‘pfm’, ‘transfac’, ‘xxmotif’, ‘jaspar’ or ‘align’.
as_dict (boolean, optional) – Return motifs as a dictionary with motif_id, motif pairs.
- Returns:
motifs – List of Motif instances. If as_dict is set to True, motifs is a dictionary.
- Return type:
list or dict
Prediction of de novo motifs¶
De novo motif prediction.
This module contains functions to predict de novo motifs using one or more de novo motif tools. The main function is gimme_motifs, which is likely the only method that you’ll need from this module.
Examples
from gimmemotifs.motif import gimme_motifs
peaks = “Gm12878.CTCF.top500.w200.fa” outdir = “CTCF.gimme” params = {“tools”: “Homer,BioProspector”, “genome”: “hg38”}
motifs = gimme_motifs(peaks, outdir, params=params)
Motif scanning¶
Scanner core functions
- class gimmemotifs.scanner.Scanner(ncpus=None, random_state=None, progress=None)¶
scan sequences with motifs
- best_match(seqs, scan_rc=True, zscore=False, gc=False)¶
give the best match of each motif in each sequence returns an iterator of nested lists containing tuples: (score, position, strand)
- best_score(seqs, scan_rc=True, zscore=False, gc=False)¶
give the score of the best match of each motif in each sequence returns an iterator of lists containing floats
- count(seqs, nreport=100, scan_rc=True)¶
count the number of matches above the cutoff returns an iterator of lists containing integer counts
- property progress¶
Whether to show progress bars for slow operations.
- scan(seqs, nreport=100, scan_rc=True, zscore=False, gc=False)¶
Scan a set of regions or sequences.
- set_background(fname=None, genome=None, size=200, nseq=None, gc=False, gc_bins=None)¶
Set the background to use for FPR and z-score calculations.
Background can be specified either as a genome name or as the name of a FASTA file.
- Parameters:
fname (str, optional) – Name of FASTA file to use as background.
genome (str, optional) – Name of genome to use to retrieve random sequences.
size (int, optional) – Size of genomic sequences to retrieve. The default is 200.
nseq (int, optional) – Number of genomic sequences to retrieve.
- set_genome(genome=None, genomes_dir=None)¶
Set the genome to converting regions to sequences
- Parameters:
genome (str, optional) – Path to the genome fasta or the genome name as found in the genomepy genomes directory.
genomes_dir (str, optional) – Path to the genomepy genomes directory. Taken from the genomepy config if unspecified.
- set_threshold(fpr=None, threshold=None, gc=False)¶
Set motif scanning threshold based on background sequences.
- Parameters:
fpr (float, optional) – Desired FPR, between 0.0 and 1.0.
threshold (float or str, optional) – Desired motif threshold, expressed as the fraction of the difference between minimum and maximum score of the PWM. Should either be a float between 0.0 and 1.0 or a filename with thresholds as created by ‘gimme threshold’.
- total_count(seqs, nreport=100, scan_rc=True)¶
count the number of matches above the cutoff returns an iterator of lists containing integer counts
- gimmemotifs.scanner.scan_regionfile_to_table(input_table, genome, scoring, pfmfile=None, ncpus=None, zscore=True, gc=True, random_state=None, progress=None)¶
Scan regions in input table for motifs. Return a dataframe with the motif count/score per region.
- Parameters:
input_table (str) – Filename of a table with regions as first column. Accepts a feather file.
genome (str) – Genome name. Can be a FASTA file or a genomepy genome name.
scoring (str) – “count” or “score”. “count” returns the occurrence of each motif (with an FPR threshold of 0.01). “score” returns the best match score of each motif.
pfmfile (str, optional) – Specify a PFM file for scanning (or a list of Motif instances).
zscore (bool, optional) – Use z-score normalized motif scores. Only used if scoring=”score”.
gc (bool, optional) – Equally distribute GC percentages in background sequences.
ncpus (int, optional) – If defined this specifies the number of cores to use.
random_state (numpy.random.RandomState object, optional) – make predictions deterministic (where possible).
progress (bool or None, optional) – provide progress bars for long computations.
- Returns:
table – DataFrame with motifs as column names and regions as index. Values are either motif counts or best motif match scores per region, depending on the ‘scoring’ parameter.
- Return type:
pandas.DataFrame
- gimmemotifs.scanner.scan_to_best_match(fname, motifs, ncpus=None, genome=None, score=False, zscore=False, gc=False, random_state=None, progress=None)¶
Scan a FASTA file for motifs. Return a dictionary with the best match per motif.
- Parameters:
fname (str or Fasta) – Filename of a sequence file in FASTA format.
motifs (str or list, optional) – Specify a PFM file for scanning (or a list of Motif instances).
genome (str) – Genome name. Can be either the name of a FASTA-formatted file or a genomepy genome name.
score (bool, optional) – return the best score instead of the best match
zscore (bool, optional) – Use z-score normalized motif scores.
gc (bool, optional) – Equally distribute GC percentages in background sequences.
ncpus (int, optional) – If defined this specifies the number of cores to use.
random_state (numpy.random.RandomState object, optional) – make predictions deterministic (where possible).
progress (bool or None, optional) – provide progress bars for long computations.
- Returns:
result – Dictionary with motif as key and best score/match as values.
- Return type:
dict
- gimmemotifs.scanner.scan_to_file(inputfile, pfmfile, filepath_or_buffer=None, nreport=1, fpr=0.01, cutoff=None, bed=False, scan_rc=True, table=False, score_table=False, bgfile=None, genome=None, ncpus=None, zscore=True, gcnorm=True, random_state=None, progress=None)¶
Scan file for motifs.
- Parameters:
inputfile (str) – path to FASTA, BED or regions file.
pfmfile (str or list, optional) – Specify a PFM file for scanning (or a list of Motif instances).
filepath_or_buffer (Any, optional) – where to write the output. If unspecified, writes to stdout.
nreport (int , optional) – Maximum number of matches to report.
fpr (float, optional) – Desired false positive rate, between 0.0 and 1.0.
cutoff (float , optional) – Cutoff to use for motif scanning. This cutoff is not specifically optimized and the strictness will vary a lot with motif length.
scan_rc (bool , optional) – Scan the reverse complement. default: True.
table (bool, optional) – output motif counts in tabular format
score_table (bool, optional) – output motif scores in tabular format
bed (bool, optional) – outputs BED6 format, instead of GTF/GFF format (default).
bgfile (str, optional) – FASTA file to use as background sequences. Required if no genome is given.
genome (str, optional) – Genome name. Can be either the name of a FASTA-formatted file or a genomepy genome name. Required if no bgfile is given.
zscore (bool, optional) – Use z-score normalized motif scores.
gcnorm (bool, optional) – Equally distribute GC percentages in background sequences.
ncpus (int, optional) – If defined this specifies the number of cores to use.
random_state (numpy.random.RandomState object, optional) – make predictions deterministic (where possible).
progress (bool or None, optional) – provide progress bars for long computations.
Maelstrom¶
Class implementing the maelstrom method (Bruse & van Heeringen, 2018)
Examples
run_maelstrom(input, “hg38”, outdir) mr = MaelstromResult(outdir)
- class gimmemotifs.maelstrom.MaelstromResult(outdir)¶
Class for working with maelstrom output.
- plot_heatmap(kind='final', min_freq=0.01, threshold=2, name=True, indirect=True, figsize=None, max_number_factors=5, aspect=1, cmap='RdBu_r', **kwargs)¶
Plot clustered heatmap of predicted motif activity.
- Parameters:
kind (str, optional) – Which data type to use for plotting. Default is ‘final’, which will plot the result of the rank aggregation. Other options are ‘freq’ for the motif frequencies, or any of the individual activities such as ‘rf.score’.
min_freq (float, optional) – Minimum frequency of motif occurrence.
threshold (float, optional) – Minimum activity (absolute) of the rank aggregation result.
name (bool, optional) – Use factor names instead of motif names for plotting.
indirect (bool, optional) – Include indirect factors (computationally predicted or non-curated). Default is True.
max_number_factors (int, optional) – Truncate the list of factors to this maximum size.
figsize (tuple, optional) – Tuple of figure size (width, height).
aspect (int, optional) – Aspect ratio for tweaking the plot.
cmap (str, optional) – Color paletter to use, RdBu_r by default.
kwargs (other keyword arguments) – All other keyword arguments are passed to sns.heatmap
- Returns:
cg – A seaborn ClusterGrid instance.
- Return type:
ClusterGrid
- plot_scores(motifs, name=True, max_len=50)¶
Create motif scores boxplot of different clusters. Motifs can be specified as either motif or factor names. The motif scores will be scaled and plotted as z-scores.
- Parameters:
motifs (iterable or str) – List of motif or factor names.
name (bool, optional) – Use factor names instead of motif names for plotting.
max_len (int, optional) – Truncate the list of factors to this maximum length.
- Returns:
g – Returns the seaborn FacetGrid object with the plot.
- Return type:
FacetGrid
- class gimmemotifs.maelstrom.Moap¶
Moap base class.
Motif activity prediction.
- classmethod create(name, ncpus=None, **kwargs)¶
Create a Moap instance based on the predictor name.
- Parameters:
name (str) – Name of the predictor (eg. Xgboost, BayesianRidge, …)
ncpus (int, optional) – Number of threads. Default is the number specified in the config.
kwargs (any, optional) – keyword arguments passed to predictor. Unused arguments are dropped.
- Returns:
moap – moap instance.
- Return type:
Moap instance
- classmethod list_classification_predictors()¶
List available classification predictors.
- classmethod list_predictors()¶
List available predictors.
- classmethod list_regression_predictors()¶
List available regression predictors.
- classmethod register_predictor(name)¶
Register method to keep list of predictors.
- gimmemotifs.maelstrom.moap(inputfile, method='hypergeom', scoring=None, outfile=None, motiffile=None, pfmfile=None, genome=None, zscore=True, gc=True, subsample=None, random_state=None, ncpus=None, progress=None)¶
Run a single motif activity prediction algorithm.
- inputfilestr
:1File with regions (chr:start-end) in first column and either cluster name in second column or a table with values.
- methodstr, optional
Motif activity method to use. Any of ‘bayesianridge’, ‘xgboost’, ‘mwu’, ‘hypergeom’, ‘rf’, ‘multitasklasso’, ‘svr’. Default is ‘hypergeom’.
- scoring: str, optional
Either ‘score’ or ‘count’
- outfilestr, optional
Name of outputfile to save the fitted activity values.
- motiffilestr, optional
Table with motif scan results. First column should be exactly the same regions as in the inputfile.
- pfmfilestr, optional
File with motifs in pfm format. Required when motiffile is not supplied.
- genomestr, optional
Genome name, as indexed by gimme. Required when motiffile is not supplied.
- zscorebool, optional
Use z-score normalized motif scores.
- gcbool, optional
Equally distribute GC percentages in background sequences.
- subsamplefloat, optional
Fraction of regions to use.
- random_statenumpy.random.RandomState object, optional
make predictions deterministic (where possible).
- ncpusint, optional
Number of threads to use. Default is the number specified in the config.
- progressbool or None, optional
provide progress bars for long computations.
pandas DataFrame with motif activity
- gimmemotifs.maelstrom.run_maelstrom(infile, genome, outdir, pfmfile=None, filter_redundant=True, filter_cutoff=0.8, plot=True, cluster=False, score_table=None, count_table=None, methods=None, ncpus=None, zscore=True, gc=True, center=True, aggregation='int_stouffer', plot_all_motifs=False, plot_no_motifs=False, random_state=None)¶
Find differential motifs.
- Parameters:
infile (str) – Filename of input table. Can be either a tab-separated text file or a feather file.
genome (str) – Genome name. Can be either the name of a FASTA-formatted file or a genomepy genome name.
outdir (str) – Output directory for all results.
pfmfile (str, optional) – Specify a PFM file for scanning.
filter_redundant (bool, optional) – Create a non-redundant set of motifs based on correlation of motif scores in the input data.
filter_cutoff (float, optional) – Cutoff to use for non-redundant motif selection. Default is 0.8.
plot (bool, optional) – Create heatmaps.
cluster (bool, optional) – If True and if the input table has more than one column, the data is clustered and the cluster activity methods are also run. Not well-tested.
score_table (str, optional) – Filename of pre-calculated table with motif scores.
count_table (str, optional) – Filename of pre-calculated table with motif counts.
methods (list, optional) – Activity methods to use. By default are all used.
ncpus (int, optional) – If defined this specifies the number of cores to use.
zscore (bool, optional) – Use z-score normalized motif scores.
gc (bool, optional) – Use GC% bins to normalize motif scores.
center (bool, optional) – Mean-center the input table.
aggregation (str, optional) – How to combine scores of the predictors. The default is “int_stouffer”, for inverse normal transform followed by Stouffer’s methods to combine z-scores. Alternatively, “stuart” performs rank aggregation and reports the -log10 of the rank aggregation p-value.
plot_all_motifs (bool, optional) – Specify to plot all motifs
plot_no_motifs (bool, optional) – Specify to plot no motifs
random_state (numpy.random.RandomState object, optional) – make predictions deterministic (where possible).
Motif activity prediction¶
Motif statistics¶
Calculate motif enrichment statistics.
- gimmemotifs.stats.calc_stats(fg_file=None, bg_file=None, fg_table=None, bg_table=None, motifs=None, stats=None, genome=None, zscore=True, gc=True, ncpus=None)¶
Calculate motif enrichment metrics.
- Parameters:
fg_file (str) – Filename of a FASTA, BED or region file with positive sequences.
bg_file (str) – Filename of a FASTA, BED or region file with negative sequences.
fg_table (str) – Filename of a table with motif scan results of positive sequences.
bg_table (str) – Filename of a table with motif scan results of negative sequences.
motifs (str, list or Motif instance) – A file with motifs in pwm format, a list of Motif instances or a single Motif instance.
genome (str, optional) – Genome or index directory in case of BED/regions.
stats (list, optional) – Names of metrics to calculate. See gimmemotifs.rocmetrics.__all__ for available metrics.
ncpus (int, optional) – Number of cores to use.
- Returns:
result – Dictionary with results where keys are motif ids and the values are dictionary with metric name and value pairs.
- Return type:
dict
- gimmemotifs.stats.calc_stats_iterator(fg_file=None, bg_file=None, fg_table=None, bg_table=None, motifs=None, stats=None, genome=None, zscore=True, gc=True, ncpus=None)¶
Calculate motif enrichment metrics.
- Parameters:
fg_file (str, optional) – Filename of a FASTA, BED or region file with positive sequences.
bg_file (str, optional) – Filename of a FASTA, BED or region file with negative sequences.
fg_table (str, optional) – Filename of a table with motif scan results of positive sequences.
bg_table (str, optional) – Filename of a table with motif scan results of negative sequences.
motifs (str, list or Motif instance, optional) – A file with motifs in pfm format, a list of Motif instances or a single Motif instance. If motifs is None, the default motif database is used.
genome (str, optional) – Genome or index directory in case of BED/regions.
stats (list, optional) – Names of metrics to calculate. See gimmemotifs.rocmetrics.__all__ for available metrics.
ncpus (int, optional) – Number of cores to use.
- Returns:
result – Dictionary with results where keys are motif ids and the values are dictionary with metric name and value pairs.
- Return type:
dict
- gimmemotifs.stats.rank_motifs(stats, metrics=('roc_auc', 'recall_at_fdr'))¶
Determine mean rank of motifs based on metrics.
- gimmemotifs.stats.write_stats(stats, fname, header=None)¶
write motif statistics to text file.
Motif comparison¶
Module to compare DNA sequence motifs (positional frequency matrices)
- class gimmemotifs.comparison.MotifComparer¶
Class for motif comparison.
Compare two or more motifs using a variety of metrics. Probably the best metric to compare motifs is seqcor. The implementation of this metric is similar to the one used in Grau (2015), where motifs are scored according to the Pearson correlation of the scores along sequence. In this case a de Bruijn of k=7 is used.
Valid metrics are: seqcor - Pearson correlation of motif scores along sequence. pcc - Pearson correlation coefficient of motif PFMs. ed - Euclidean distance-based similarity of motif PFMs. distance - Distance-based similarity of motif PFMs. wic - Weighted Information Content, see van Heeringen 2011. chisq - Chi-squared similarity of motif PFMs. akl - Similarity based on average Kullback-Leibler similarity, see Mahony, 2011. ssd - Sum of squared distances of motif PFMs.
Examples
mc = MotifComparer()
# Compare two motifs score, pos, strand = mc.compare_motifs(m1, m2, metric=”seqcor”)
# Compare a list of motifs to another list of motifs mc.get_all_scores(motifs, dbmotifs, match, metric, combine)
# Get the best match for every motif in a list of reference motifs get_closest_match(motifs, dbmotifs=None)
- compare_motifs(m1, m2, match='total', metric='wic', combine='mean', pval=False)¶
Compare two motifs.
The similarity metric can be any of seqcor, pcc, ed, distance, wic, chisq, akl or ssd. If match is ‘total’ the similarity score is calculated for the whole match, including positions that are not present in both motifs. If match is partial or subtotal, only the matching psotiions are used to calculate the score. The score of individual position is combined using either the mean or the sum.
Note that the match and combine parameters have no effect on the seqcor similarity metric.
- Parameters:
m1 (Motif instance) – Motif instance 1.
m2 (Motif instance) – Motif instance 2.
match (str, optional) – Match can be “partial”, “subtotal” or “total”. Not all metrics use this.
metric (str, optional) – Distance metric.
combine (str, optional) – Combine positional scores using “mean” or “sum”. Not all metrics use this.
pval (bool, optional) – Calculate p-vale of match.
- Return type:
score, position, strand
- get_all_scores(motifs, dbmotifs, match, metric, combine, pval=False, parallel=True, trim=None, ncpus=None)¶
Pairwise comparison of a set of motifs compared to reference motifs.
- Parameters:
motifs (list) – List of Motif instances.
dbmotifs (list) – List of Motif instances.
match (str) – Match can be “partial”, “subtotal” or “total”. Not all metrics use this.
metric (str) – Distance metric.
combine (str) – Combine positional scores using “mean” or “sum”. Not all metrics use this.
pval (bool , optional) – Calculate p-vale of match.
parallel (bool , optional) – Use multiprocessing for parallel execution. True by default.
trim (float or None) – If a float value is specified, motifs are trimmed used this IC cutoff before comparison.
ncpus (int or None) – Specifies the number of cores to use for parallel execution.
- Returns:
scores – Dictionary with scores.
- Return type:
dict
- get_best_matches(motifs, nmatches=1, dbmotifs=None, match='partial', metric='wic', combine='mean', parallel=True, ncpus=None)¶
Return best match in database for motifs.
- Parameters:
motifs (list or str) – Filename of motifs or list of motifs.
nmatches (int, optional) – Number of matches to return, default is 1.
dbmotifs (list or str, optional) – Database motifs, default will be used if not specified.
match (str, optional) –
metric (str, optional) –
combine (str, optional) –
ncpus (int, optional) – Number of threads to use.
- Returns:
closest_match
- Return type:
dict
- get_closest_match(motifs, dbmotifs=None, match='partial', metric='wic', combine='mean', parallel=True, ncpus=None)¶
Return best match in database for motifs.
- Parameters:
motifs (list or str) – Filename of motifs or list of motifs.
dbmotifs (list or str, optional) – Database motifs, default will be used if not specified.
match (str, optional) –
metric (str, optional) –
combine (str, optional) –
ncpus (int, optional) – Number of threads to use.
- Returns:
closest_match
- Return type:
dict
- gimmemotifs.comparison.akl(p1, p2)¶
Calculates motif position similarity based on average Kullback-Leibler similarity.
See Mahony, 2007.
- Parameters:
p1 (list) – Motif position 1.
p2 (list) – Motif position 2.
- Returns:
score
- Return type:
float
- gimmemotifs.comparison.chisq(p1, p2)¶
Calculates motif position similarity based on chi-square.
- Parameters:
p1 (list) – Motif position 1.
p2 (list) – Motif position 2.
- Returns:
score
- Return type:
float
- gimmemotifs.comparison.seqcor(m1, m2, seq=None)¶
Calculates motif similarity based on Pearson correlation of scores.
Based on Kielbasa (2015) and Grau (2015). Scores are calculated based on scanning a de Bruijn sequence of 7-mers. This sequence is taken from ShortCAKE (Orenstein & Shamir, 2015). Optionally another sequence can be given as an argument.
- Parameters:
m1 (Motif instance) – Motif 1 to compare.
m2 (Motif instance) – Motif 2 to compare.
seq (str, optional) – Sequence to use for scanning instead of k=7 de Bruijn sequence.
- Return type:
score, position, strand
- gimmemotifs.comparison.ssd(p1, p2)¶
Calculates motif position similarity based on sum of squared distances.
- Parameters:
p1 (list) – Motif position 1.
p2 (list) – Motif position 2.
- Returns:
score
- Return type:
float