Command-line reference¶

In addition to gimme motifs the GimmeMotifs package contains several other tools that can perform the various substeps of GimmeMotifs, as well as other useful tools. Run them to see the options.

List of tools¶

gimme motifs
gimme scan
gimme maelstrom
gimme background
gimme logo
gimme match
gimme cluster
gimme threshold
gimme location
gimme diff
gimme motif2factors

Input formats¶

Most tools in this section take a file in PFM format as input. This is actually a file with Position Specific Scoring Matrices (PSSMs) containing frequencies. It looks like this:

>motif1
3611  0.0769  0.4003  0.1664
2716  0.0283  0.5667  0.1381
6358  0.0016  0.3344  0.0330
0016  0.9859  0.0016  0.0157
8085  0.0063  0.0502  0.1397
>motif2
2276  0.0157  0.0330  0.7284
0031  0.0016  0.9984  0.0016
0377  0.3799  0.0016  0.5856
0816  0.7096  0.0173  0.1962
1350  0.4035  0.0675  0.3987

The frequencies are separated by tabs, and in the order A,C,G,T.

Command: gimme motifs¶

The gimme motifs command can be used for known and/or de novo motif analysis. By default it runs both.

Quick example of de novo motif analysis¶

You can try GimmeMotifs with a small example dataset included in the examples directory, included with GimmeMotifs. This example does not require any additional configuration if GimmeMotifs is installed correctly.

Change to a directory where you have write permissions and run the following command (substitute the filename with the location of the file on your system):

gimme motifs /usr/share/gimmemotifs/examples/TAp73alpha.fa p73 --denovo -g hg19

The first argument is the name of the input file and the second argument defines the name of the output directory that is created. All output files are stored in this directory. The genome is set to the hg38 genome. This requires you to have installed hg38 using genomepy. Alternatively, you can also supply the path to a genome FASTA file with the -g option.

Depending on your computer, this analysis will take around 15-20 minutes. By default, the three top-performing de novo motif tools will be used: MEME, Homer and BioProspector. Once GimmeMotifs is finished you can open p73/p73_motif_report.html in your browser.

You can also run the same analysis with a BED file or narrowPeak file as input, or a text file with regions in chrom:start-end format

Best practices and tips¶

GimmeMotifs is multi-threaded¶

GimmeMotifs runs multi-threaded and by default uses 12 threads. All the de novo programs will be run in parallel as much as possible. Of course some programs are still single-threaded, and will not really benefit from multithreading. You can change the number of threads that are used using the -N parameter.

Running time¶

The running time of the de novo part of GimmeMotifs largely depends on three factors:

the size of the input dataset;
the motif prediction tools you use;
the size of the motifs to be identified.

Size of input dataset

While GimmeMotifs is developed specifically for ChIP-seq datasets, most motif prediction tools are not. In practice this means that it does not make much sense to predict motifs on a large amount of sequences, as this will usually not result in higher quality motifs. Therefore GimmeMotifs uses an absolute limit for the prediction set. By default 20% of the sequences are used as input for motif prediction, but with an absolute maximum. This is controlled by the abs_max parameter in the configuration file, which is set to 1000 by default. In general, if you have a large amount of peaks, you can also consider to run GimmeMotifs on the top sequences of your input, for instance the 5000 highest peaks.

Motif prediction tools

By default, gimme motifs uses three de novo motif prediction tools: MEME, BioProspector and Homer. These we found to be the best performing programs for ChIP-seq data (Bruse & van Heeringen, 2018). You can include a large variety of other tools by using the -t parameter. This will result in an increased running time and some tools, such as GADEM, will take a very long time. The following tools are supported:

With the exception of RPMCMC and YAMDA, all tools come installed with GimmeMotifs when using the bioconda package. AMD, HMS, Improbizer, MotifSampler and DiNAMO are not supported on OSX.

Please note: all these programs include their own license and many are free for academic or non-commercial use only. For commercial use of any of these programs, please consult the respective author! GimmeMotifs itself can be freely used commercially.

Motif size

The default setting for motif size is -a xl, which searches for motifs with a length of up to 20. You can use different analysis sizes: small (up to 8), medium (up to 10) or large (up to 14). The running time can be significantly shorter for shorter motifs. However, keep in mind that the xl analysis setting results in the best motifs in general.

Small input sets¶

GimmeMotifs is developed for larger datasets, such as ChIP-seq peaks, where you have the luxury to use a large fraction of your input for validation. This means that at least several hundred sequences would be optimal. If you want to run GimmeMotifs on a small input dataset, it might be worthwile to increase the fraction used for motif prediction with the -f argument, for instance -f 0.5.

Running on FASTA files¶

It is possible to run GimmeMotifs on a FASTA file as input instead of a BED file. This is detected automatically if your inputfile is correctly formatted according to FASTA specifications. Please note that for best results, all the sequences should be of the same length. This is not necessary for motif prediction, but the statistics and positional preference plots will be wrong if sequences have different lengths.

Intermediate results¶

GimmeMotifs produces a lot of intermediate results, such as all predicted motifs, FASTA files used for validation and so on. These are deleted by default (as they can get quite large), but if you are interested in them, you can specify the -k option.

Detailed options for gimme motifs¶

Positional arguments

INPUT

The inputfile needs to be in BED, FASTA, narrowPeak or region format. By default gimme motifs will take the center of these features, and extend those to the size specified by the -s or --size argument. By default this is 200 bp. Keep in mind that the smaller the regions are, the better motif discovery will work. BED-fomatted files need to contain at least three tab-separated columns describing chromosome name, start and end. The fourth column is optional. If it is specified it will be used by some motif prediction tools to sort the features before motif prediction. FASTA files can be used as input for motif prediction. For best results it is recommended to use sequences of the same size. Peak files in narrowPeak format, such as produced by MACS2, can also directly be used as input. With these files, gimme motifs will use the summit of the peak and create regions of size 200 centered at this summit. Use the -s parameter to change this size. Finally, region files can be used. These contain one column, with regions specified in chrom:start-end format.
OUTDIR

The name of the output directory. All output files will be saved in this directory. If the directory already exists files will be overwritten.

Optional arguments

-b BACKGROUND, --background BACKGROUND

Type of background to use. There are five options: gc, genomic, random, promoter or the path to file with background sequences (FASTA, BED or regions). By default gc is used, which generates random regions from the genome with a similar GC% as your input sequences. The genomic background will select random genomic regions without taking the sequence composition into account. The random background will create artificial sequences with a similar nucleotide distribution as your input sequences. The promoter background will select random promoters. For this option, your genome needs to be installed with genomepy using the --annotation option. Finally, you can select your own custom background by supplying the path to a file.
-g GENOME

Name of the genome to use. This can be the name of a genome installed with genomepy or the path to a FASTA file.
--denovo

Only run de novo motif analysis. By default, the analysis includes known motifs. For specific arguments for de novo motif analysis, see below.
--known

Only run known motif analysis. By default, the analysis includes de novo motifs. For specific arguments for known motif analysis, see below.
--noreport

Don’t create a HTML report, only plain text output files.
--rawscore

Don’t use z-score normalization for motif scores. The raw logodds motif score are dependent on motif length. This means that the same logodds score will mean different things for motifs with a different length. By default, GimmeMotifs uses the scores in a set of genomic background regions to determine the a background distribution of scores. The logodds score is then scaled using this distribution.
--nogc

By default GimmeMotifs calculates the motif logodds score distribution for regions with a different GC%. The score is then normalized according to the GC% bin per input sequence. Use this argument to turn this off.
-N INT, --threads INT

Number of threads to use (default is 12).

Optional arguments for known motif analysis

-p PFMFILE

PFM file with motifs to use for known motif analysis. You can use a custom PFM file, or use any of the databases included with GimmeMotifs such as, for instance, JASPAR2020_vertebrates, HOMER, HOCOCOMOv11_HUMAN or CIS-BP. By default, a database of clustered vertebrate motifs is used, gimme.vertebrate.v5.0. This database has a limited motif redundancy.

Optional arguments for de novo motif analysis

-t TOOLS, --tools TOOLS

The de novo motif prediction tools to use, separated by commas. This can be any combination of the following: AMD, BioProspector, ChIPMunk, DiNAMO, GADEM, DREME, HMS, Homer, Improbizer, MDmodule, MEME, MEMEW, MotifSampler, POSMO, ProSampler, RPMCMC, Trawler, Weeder, XXmotif, YAMDA. By default TOOLS is BioProspector,Homer,MEME. Note that some tools may not be installed. Running gimme motifs -h will always list the tools that are supported on your installation of GimmeMotifs.
-a, --analysis

The size of motifs to look for: small (5-8), medium (5-12), large (6-15) or xl (6-20). The larger the motifs, the longer the de novo motif prediction will take. By default, xl will be used as this generally yields the best motifs. However, some prediction tools take a very long time in combination with the xl setting.
k, --keepintermediate

Keep intermediate files.
-s, -–singlestrand

Only use the forward strand for prediction. By default both strands are used.
-f FRACTION, --fraction FRACTION

This parameter controls the fraction of the sequences used for prediction. This 0.2 by default, so in this case a randomly chosen 20% of the sequences will be used for prediction. The remaining sequences will be used for validation (enrichment, ROC curves etc.). If you have a large set of sequences (ie. most ChIP-seq peak sets), this is fine. However, if your set is smaller, it might be worthwile to increase this prediction fraction. The number of sequences that is used is also influenced by the abs_max parameter in the configuration file. Regardless of the -f parameter, the total number of sequences used for motif prediction will never exceed the number specified by abs_max.
-s N, --size N

This is the size of the sequences used for motif prediction. Smaller sequences will result in a faster analysis, but you are of course limited by the accuracy of your data. For the tested ChIP-seq data sets 200 performs fine. If this parameter is set to 0, the original size of the regions in the input file will be used.

Command: gimme maelstrom¶

This command can be used to identify differential motifs between two or more data sets. See the maelstrom tutorial for more details.

Positional arguments:

INPUTFILE             file with regions and clusters
GENOME                genome
DIR                   output directory

Optional arguments:

-h, --help            show this help message and exit
-p pfmfile, --pfmfile pfmfile
                      PFM file with motifs (default:
                      gimme.vertebrate.v5.0.pfm)
--no-filter           Don't remove redundant motifs.
-F FLOAT, --filter_cutoff FLOAT
                      Cutoff to select non-redundant motifs. Default is 0.8,
                      increase this value to get fewer motifs.
--nocenter            Don't mean-center the rows by default
-m NAMES, --methods NAMES
                      Run with specific methods
-a method, --aggregation method
                      How to combine motifs from individual methods. Default
                      is "int_stouffer", for inverse normal transform of
                      ranks, followed by Stouffer's method to combine
                      z-scores. Alternatively, specify "stuart" for log-
                      transformed rank aggregation p-values.
-N INT, --nthreads INT
                      Number of threads (default 12)
--rawscore            Don't z-score normalize motif scores
--nogc                Don't use GC% bins

Input file formats

The input can be in one of two possible formats. In both cases the genomic location should be present as chrom:start-end in the first column. The first option is a two-column format and looks like this:

loc    cluster
chr15:49258903-49259103    NK
chr10:72370313-72370513    NK
chr4:40579259-40579459    Monocytes
chr10:82225678-82225878    T-cells
chr5:134237941-134238141    B-cells
chr5:58858731-58858931    B-cells
chr20:24941608-24941808    NK
chr5:124203116-124203316    NK
chr17:40094476-40094676    Erythroblast
chr17:28659327-28659527    T-cells

This can be the result of a clustering analysis, for instance.

The second option looks like this:

loc    NK    Monocytes    T-cells    B-cells
chr12:93507547-93507747    3.118    2.522    1.933    0.197
chr7:38236460-38236660    1.098    0.502    0.201    0.190
chr10:21357147-21357347    0.528    -0.066    -1.04    -0.343
chr6:115521512-115521712    0.406    -0.376    -0.480    -0.667
chr2:97359808-97360008    1.501    0.905    0.719    0.031
chr16:16684549-16684749    0.233    -0.362    -0.837    -0.746
chrX:138964544-138964744    0.330    -0.291    -0.686    -0.777
chr2:186923973-186924173    0.430    -0.258    -1.164    -0.723
chrX:113834470-113834670    0.560    -0.036    -0.686    -0.692

This is a tab-separated table, with a header describing the experiments. In case of sequencing data, such as ChIP-seq, ATAC-seq or DNaseI seq, we recommend to use log-transformed read counts which are mean-centered per row. For optimal results, it is recommended to normalize between experiments (columns) after the log-transformatiion step, for instance by quantile normalization or scaling. By default, gimme maelstrom will mean-center the input, disable this with --nocenter.

The second input format generally gives better results than the first one and would be the recommended format.

The output scores of gimme maelstrom represent the combined result of multiple methods. This z-score represents the combined result of multiple methods. The individual results from different methods are ranked from high-scoring motif to low-scoring motif and converted to z-scores using the inverse normal transformation. The z-scores from individual methods are then combined using Stouffer’s method. The score that is shown is the aggregated z-score. A higher z-score means that presence of the motif or a higher motif score is associated with higher signal in a specific sample.

By default, gimme maelstrom selects a non-redundant set of motifs by clustering the motifs based on scores in the set of input sequences. You can disable this by using the --no-filter argument. You can tweak the number of selected motifs by changing the --filter-cutoff parameter. By default this is set to 0.8. Increase this value to select fewer motifs, decrease it to select more motifs. Keep in mind that you may start to lose biologically relevant motifs if you set this too high.

Command: gimme scan¶

Scan a set of sequences with a set of motifs, and get the resulting matches in GFF, BED or table format. If the FASTA header includes a chromosome location in chrom:start-end format, the BED output will return the genomic location of the motif match. The GFF file will always have the motif location relative to the input sequence.

A basic command would look like this:

$ gimme scan peaks.bed -g hg38 -b > motifs.bed

The threshold that is used for scanning can be specified in a number of ways. The default threshold is set to a motif-specific 1% FPR by scanning random genomic sequences. You can change the FPR with the -f option and/or the set of sequences that is used to determine the FPR with the -B option.

For instance, this command would scan with thresholds based on 5% FPR with random genomic mouse sequences.

$ gimme scan input.fa -g mm10 -f 0.05 -b > gimme.scan.bed

And this command would base a 10% FPR on the input file hg38.promoters.fa:

$ gimme scan input.fa -f 0.1 -B hg38.promoters.fa -b > gimme.scan.bed

Alternatively, you can specify the threshold as a single score. This score is relative and is based on the maximum and minimum possible score for each motif. For example, a score of 0.95 means that the score of a motif should be at least 95% of the (maximum score - minimum score). This should probably not be set much lower than 0.8, and should be generally at least 0.9-0.95 for good specificity. Generally, as the optimal threshold might be different for each motif, the use of the FPR-based threshold is preferred. One reason to use a single score as threshold is when you want a match for each motif, regardless of the score. This command would give one match for every motif for every sequence, regardless of the score.

$ gimme scan input.bed -g hg38 -c 0 -n 1 -b > matches.bed

Finally, gimme scan can return the scanning results in table format. The -t will yield a table with number of matches, while the -T will have the score of the best match.

Positional arguments:

INPUT

The inputfile needs to be in BED, FASTA or region format. BED-fomatted files need to contain at least three tab-separated columns describing chromosome name, start and end. Region files can also be used. These contain one column, with regions specified in chrom:start-end format.

Optional arguments

-g GENOME

Name of the genome to use. This can be the name of a genome installed with genomepy or the path to a FASTA file.
-p PFMFILE, --pfmfile PFMFILE

PFM file with motifs to use for known motif analysis. You can use a custom PFM file, or use any of the databases included with GimmeMotifs such as, for instance, JASPAR2020_vertebrates, HOMER, HOCOCOMOv11_HUMAN or CIS-BP. By default, a database of clustered vertebrate motifs is used, gimme.vertebrate.v5.0. This database has a limited motif redundancy.
-f, --fpr

Base the motif score threshold on this FPR. By default this is set to 1%, equivalent to -f 0.01. The score threshold is based on scanning random genomic regions with the same size and the same GC% distribution. This threshold is calculated once for a specific sequence size and cached. Therefore, scanning will take longer the first time you use a specific FPR with a specific input sequence size.
-B, --bgfile

Specify a FASTA file to use for FPR calculation, instead of taking random genomic regions.
-c, --cutoff

Use this score cutoff instead of an FPR-based threshold. This score is relative and is based on the maximum and minimum possible score for each motif. For example, a score of 0.95 means that the score of a motif should be at least 95% of the (maximum score - minimum score). This should probably not be set much lower than 0.8, and should be generally at least 0.9-0.95 for good specificity. Generally, as the optimal threshold might be different for each motif, the use of the FPR-based threshold is preferred.
-n, --nreport

Maximum number of matches to report per motif per sequence. By default this is set to 1.
-r, --norc

Don’t scan the reverse complement of the sequence. By default both strands will be scanned.
-b, --bed

Output motif matches in BED format.
-t, --table

Ouput number of matches in a table format, where columns represent motifs and rows represent input sequences.
-T, --score_table

Ouput maximum motif score in a table format, where columns represent motifs and rows represent input sequences. The score will be reported for each motif, regardless if it is a good match or not.
-z, --zscore

Use z-score normalization for motif scores. The raw logodds motif score are dependent on motif length. This means that the same logodds score will mean different things for motifs with a different length. By default, GimmeMotifs uses the scores in a set of genomic background regions to determine the a background distribution of scores. The logodds score is then scaled using this distribution.
--gc

Use this option to calculate the motif logodds score distribution based on regions with a similar GC%.
-N INT, --threads INT

Number of threads to use (default is 12).

Command: gimme background¶

Generate random sequences according to one of several methods:

random - randomly generated sequence with the same dinucleotide distribution as the input sequences according to a 1st order Markov model
genomic - sequences randomly chosen from the genome
gc - sequences randomly chosen from the genome with the same GC% as the input sequences
promoter - random promoter sequences

The background types gc and random need a set of input sequences in BED or FASTA format. If the input sequences are in BED format, the genome version needs to be specified with -g.

Positional arguments:

FILE        outputfile
TYPE        type of background sequences to generate
            (random,genomic,gc,promoter)

Optional arguments:

-h, --help  show this help message and exit
-i FILE     input sequences (BED or FASTA)
-f TYPE     output format (BED or FASTA
-l INT      length of random sequences
-n NUMBER   number of sequence to generate
-g GENOME   genome version (not for type 'random')
-m N        order of the Markov model (only for type 'random', default 1)

Command: gimme logo¶

Convert one or more motifs in a PFM file to a sequence logo. Most of these logos are made possible by the excellent Logomaker package. You can optionally supply a PFM file, otherwise gimme logo uses the default gimme.vertebrate.v5.0. With the -i option, you can choose one or more motifs to convert.

This will convert all the motifs in CTCF.pfm to a sequence logo:

$ gimme logo -p CTCF.pfm

This will create a logo for GM.5.0.Ets.0026 from the default database.

$ gimme logo -i GM.5.0.Ets.0026

You can specify four types of sequence logos:

information
frequency
energy
ensembl

You can leave the motif title out with the --notitle argument.

$ gimme logo JASPAR2020_vertebrates -i MA1115.1_POU5F1 -k energy --notitle

Positional arguments:

pfmfile

PFM file with motifs. You can use a custom PFM file, or use any of the databases included with GimmeMotifs such as, for instance, JASPAR2020_vertebrates, HOMER, HOCOCOMOv11_HUMAN or CIS-BP.

Optional arguments:

-i IDS, --ids IDS

Comma-separated list of motif ids (default is all ids).
-k TYPE, --kind TYPE

Type of motif (information, frequency, energy or ensembl). The default is information.
--notitle

Don’t include motif ID as title.
-h, --help

Show help message.

Command: gimme match¶

Find the the best match of every motif in a PFM file with input motif(s) to a database of reference motifs. By default the gimme.vertebrate.v5.0 database is used, however, other databases can be specified using the -d argument. This can be a custom PFM file, or any of the databases included with GimmeMotifs such as, for instance, JASPAR2020_vertebrates, HOMER, HOCOCOMOv11_HUMAN or CIS-BP. If an ouput file is specified, a graphical output with aligned motifs will be created. However, this is slow for many motifs and can consume a lot of memory (see issue). It works fine for a few motifs at a time.

Positional arguments:

PFMFILE     File with input pfms

Optional arguments:

-h, --help  show this help message and exit
-d DBFILE   File with pfms to match against (default:
            gimme.vertebrate.v5.0.pfm)
-n INT      Number of top matches to report
-o FILE     Output file with graphical report (png, svg, ps, pdf)

Command: gimme cluster¶

Cluster a set of motifs with the WIC metric.

Positional arguments:

INPUTFILE     Inputfile (PFM format)
OUTDIR        Name of output directory

Optional arguments:

-h, --help    show this help message and exit
-s            Don't compare reverse complements of motifs
-t THRESHOLD  Cluster threshold

Command: gimme threshold¶

Create a file with motif-specific thresholds based on a specific background file and a specific FPR. The FPR should be specified as a float between 0.0 and 1.0. You can use this threshold file with the -c argument of gimme scan. Note that gimme scan by default determines an FPR based on random genomic background sequences. You can use this command to create the threshold file explicitly, or when you want to determine the threshold based on a different type of background. For instance, this command would create a file with thresholds for the motifs in custom.pwm with a FPR of 1%, based on the sequences in promoters.fa.

$ gimme threshold custom.pwm 0.05 promoters.fa > custom.threshold.txt

Positional arguments:

PFMFILE     File with pwms
FAFILE      FASTA file with background sequences
FPR         Desired fpr

Command: gimme location¶

Create the positional preference plots for all the motifs in the input PWM file. This will give best results if all the sequences in the FASTA-formatted inputfile have the same length. Keep in mind that this only makes sense if the sequences are centered around a similar feature (transcription start site, highest point in a peak, etc.). The default threshold for motif scanning is 0.95, see gimme scan for more details.

Positional arguments:

PFMFILE     File with pwms
FAFILE      Fasta formatted file

Optional arguments:

-h, --help  show this help message and exit
-w WIDTH    Set width to W (default: determined from fastafile)
-i IDS      Comma-separated list of motif ids to plot (default is all ids)
-c CUTOFF   Cutoff for motif scanning (default 0.95)

Command: gimme diff¶

This is a simple command to visualize differential motifs between different data sets. You are probably better of using gimme maelstrom, however, in some cases this visualization might still be informative. The input consists of a number of FASTA files, separated by a comma. These are compared to a background file. The last two arguments are a file with pwms and and output image. The gimme diff command then produces two heatmaps (enrichment and frequency) of all enriched, differential motifs. Reported motifs are at least 3 times enriched compared to the background (change with the -e argument) and have a minimum frequency in at least one of the input data sets of 1% (change with the -f argument). You can specify motif threshold with the -c argument (which can be a file generated with gimme threshold).

For a command like this…

$ gimme diff VEGT_specific.summit.200.fa,XBRA_specific.summit.200.fa,XEOMES_specific.summit.200.fa random.w200.fa gimme_diff_tbox.png -p tbox.pwm -f 0.01 -c threshold.0.01.txt

…the output will look like this (based on ChIP-seq peaks of T-box factors from Gentsch et al. 2013):

The image layout is not always optimal. If you want to customize the image, you can either save it as a .svg file, or use the numbers that are printed to stdout. The columns are in the same order as the image, the row order may be different as these are clustered before plotting.

Note that the results might differ quite a lot depending on the threshold that is chosen! Compare for instance an FPR of 1% vs an FPR of 5%.

Positional arguments:

FAFILES               FASTA-formatted inputfiles OR a BED file with an
                      identifier in the 4th column, for instance a cluster
                      number.
BGFAFILE              FASTA-formatted background file
PNGFILE               outputfile (image)

Optional arguments:

-h, --help            show this help message and exit
-p PFMFILE, --pfmfile PFMFILE
                      PWM file with motifs (default:
                      gimme.vertebrate.v3.1.pwm)
-c , --cutoff         motif score cutoff or file with cutoffs (default 0.9)
-e MINENR, --enrichment MINENR
                      minimum enrichment in at least one of the datasets
                      compared to background
-f MINFREQ, --frequency MINFREQ
                      minimum frequency in at least one of the datasets
-g VERSION, --genome VERSION
                      Genome version. Only necessary in combination with a
                      BED file with clusters as inputfile.

Command: gimme motif2factors¶

With motif2factors you can convert an existing motif database to a motif database for your species of interest. This conversion is done by orthology, which is not the ideal way to do this. When converting the original database to a database of your favourite critter, only the relations between motifs and transcription factors are changed. A method like this; based on orthology is not capable of inferring whether or not the motif has changed.

However the main advantage of this method is that it is a easy and fast way to get a species-specific database, works surprisingly well, and does not require any special expertise or infrastructure to work.

The method starts by downloading the genome assemblies of your species-of-interest (new-reference), the species the database is based on (database-references), and some other related species for better orthology inference (ortholog-references); by default a selection of vertebrate species. From each of these assemblies, for each gene the longest protein is taken, and compared with orthofinder for orthology:

Then based on orthology, we can replace the names of transcription factors in the original database with the names of our new species. One problem with this method is that the names of transcription factors in the database not necessarily (necessarily not, bioinformatics…) have to overlap with the names used in the genome assembly. To overcome this problem mygene.info is queried to still link differently named TFs can still be linked to genes and thus to orthologs. We tested this for gimme.vertebrate.v5.0, and worked well in our case. However it might be possible that this generates too many false positives in your case, and you can tweak the lookup on mygene.info with the --strict/--medium/--lenient flags.

If you happen to work with e.g. a non-public genome or a genome with a different type of annotation, you can supply your own files. This can either be:

a fasta and gtf file with the naming schemes
- {genomes_dir}/{genome_name}/{genome_name}.fa
- {genomes_dir}/{genome_name}/{genome_name}.annotation.gtf
a peptide file with the naming scheme
- {genomes_dir}/{genome_name}/{genome_name}.pep.fa

The genome_name can be the name of the assembly/genome you use, e.g. hg38 for human. The genomes_dir can optionally be specified on the command line. Otherwise the default genomes dir is used (use genomepy config show to get this location). If one or more genomes were previously installed with genomepy, their full path can be used as well (regardless of the genomes_dir). The peptide file should be of the format:

>identifier1
MKNTMKKQSGVVDTFKKAITAKSQWHDKDEFLDVIYWFKQIIGVILGLLWGFIPLKGFLG
>identifier2
MRPLRIIIQRKGKSYGELHGYGQIYKSKMPVSKILLAVNEKRNNHNNISILDDFRRVSSI

where the identifier can be any

1
1|2
3|1|2

and gimmemotifs will use key 1. Usually 1 represents the gene_name, 2 the gene_id, and 3 the transcript_id.

Optional arguments:

--new-reference ASSEMBLY [ASSEMBLY ...]
                      The assembly the new motif2factors file will be based on.
--database db         The database you want to change convert to your species of interest. (default is gimme.vertebrate.v5.0)
--database-references ASSEMBLY [ASSEMBLY ...]
                      The assembly(s) on which the orginal motif2factors is based on. (default is human and mouse)
--ortholog-references ASSEMBLY [ASSEMBLY ...]
                      Extra assemblies for better orthology inference between the new reference and database reference. (default is a range of vertebrate species)
--genomes_dir DIR     Where to find/store genomepy genomes. Defaults to the genomepy config settings.
--tmpdir DIR          Where to place intermediate files. Defaults to system temp.
--outdir OUTDIR       Where to save the results to. Defaults to current working directory.
--strict, --medium, --lenient
                      How strict should the names of the genes in the assembly be followed. Strict: base names only on what is in the annotation file; Medium: base on annotation file, as well as on
                      mygene.info name and symbol query; Lenient: based on annotation file, and mygeneinfo name, symbol, alias, other_names, accession, accession.protein, refseq, refseq.protein,
                      ensembl, ensembl.gene. Lenient is the default, but in case of false-positive hits you can tune this stricter.
--threads INT         Maximum number of parallel threads used.
--keep-intermediate   Keep temporary files, do not delete tmpdir.