.. _`command-line`:

Command-line reference
======================

In addition to ``gimme motifs`` the GimmeMotifs package contains several other tools that can perform the various substeps of GimmeMotifs, as well as other useful tools. Run any of them with ``-h`` to see its options.

List of tools
-------------

* :ref:`gimme motifs<gimme_motifs>`
* :ref:`gimme scan<gimme_scan>`
* :ref:`gimme maelstrom<gimme_maelstrom>`
* :ref:`gimme background<gimme_background>`
* :ref:`gimme logo<gimme_logo>`
* :ref:`gimme match<gimme_match>`
* :ref:`gimme cluster<gimme_cluster>`
* :ref:`gimme threshold<gimme_threshold>`
* :ref:`gimme location<gimme_location>`
* :ref:`gimme diff<gimme_diff>`
* :ref:`gimme motif2factors<gimme_motif2factors>`

Input formats
-------------

Most tools in this section take a file in PFM format as input. This is actually a file with Position Specific Scoring Matrices (PSSMs) containing *frequencies*. It looks like this:

::

    >motif1
    0.3611	0.0769	0.4003	0.1664
    0.2716	0.0283	0.5667	0.1381
    0.6358	0.0016	0.3344	0.0330
    0.0016	0.9859	0.0016	0.0157
    0.8085	0.0063	0.0502	0.1397
    >motif2
    0.2276	0.0157	0.0330	0.7284
    0.0031	0.0016	0.9984	0.0016
    0.0377	0.3799	0.0016	0.5856
    0.0816	0.7096	0.0173	0.1962
    0.1350	0.4035	0.0675	0.3987

The frequencies are separated by tabs, and in the order A,C,G,T.

.. _`gimme_motifs`:

Command: gimme motifs
---------------------

The ``gimme motifs`` command can be used for known and/or *de novo* motif analysis. By default it runs both.

Quick example of *de novo* motif analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can try GimmeMotifs with the small example dataset in the examples directory that ships with GimmeMotifs. This example does not require any additional configuration if GimmeMotifs is installed correctly. Change to a directory where you have write permissions and run the following command (substitute the filename with the location of the file on your system):

::

    gimme motifs /usr/share/gimmemotifs/examples/TAp73alpha.fa p73 --denovo -g hg38

The first argument is the name of the input file and the second argument defines the name of the output directory that is created.
All output files are stored in this directory. The genome is set to the ``hg38`` genome. This requires you to have installed ``hg38`` using `genomepy `_. Alternatively, you can also supply the path to a genome FASTA file with the ``-g`` option.

Depending on your computer, this analysis will take around 15-20 minutes. By default, the `three top-performing `_ *de novo* motif tools will be used: MEME, Homer and BioProspector. Once GimmeMotifs is finished you can open `p73/p73\_motif\_report.html `_ in your browser.

You can also run the same analysis with a BED file or narrowPeak file as input, or a text file with regions in ``chrom:start-end`` format.

Best practices and tips
~~~~~~~~~~~~~~~~~~~~~~~

GimmeMotifs is multi-threaded
+++++++++++++++++++++++++++++

GimmeMotifs runs multi-threaded and by default uses 12 threads. All the *de novo* programs will be run in parallel as much as possible. Of course some programs are still single-threaded, and will not really benefit from multithreading. You can change the number of threads that are used using the ``-N`` parameter.

Running time
++++++++++++

The running time of the *de novo* part of GimmeMotifs largely depends on three factors:

- the size of the input dataset;
- the motif prediction tools you use;
- the size of the motifs to be identified.

**Size of input dataset**

While GimmeMotifs is developed specifically for ChIP-seq datasets, most motif prediction tools are not. In practice this means that it does not make much sense to predict motifs on a large number of sequences, as this will usually not result in higher quality motifs. Therefore GimmeMotifs uses an absolute limit for the prediction set. By default 20% of the sequences are used as input for motif prediction, but with an absolute maximum. This is controlled by the ``abs_max`` parameter in the configuration file, which is set to 1000 by default.
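
As a rough illustration of how the prediction fraction and ``abs_max`` interact, the sketch below computes the size of the prediction set for a few input sizes. The function name is hypothetical and rounding down is an assumption; GimmeMotifs itself may round or sample differently.

```python
def prediction_set_size(n_sequences, fraction=0.2, abs_max=1000):
    """Number of input sequences handed to the de novo tools.

    fraction and abs_max mirror the defaults described in the text;
    rounding down with int() is an assumption of this sketch.
    """
    return min(int(n_sequences * fraction), abs_max)


# Small sets are limited by the fraction, large sets by abs_max.
for n in (500, 5000, 50000):
    print(n, prediction_set_size(n))
```

With these defaults, any input of 5000 sequences or more uses the same number of sequences for prediction, so adding more peaks only enlarges the validation set.
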
In general, if you have a large number of peaks, you can also consider running GimmeMotifs on the top sequences of your input, for instance the 5000 highest peaks.

**Motif prediction tools**

By default, ``gimme motifs`` uses three *de novo* motif prediction tools: MEME, BioProspector and Homer. We found these to be the best performing programs for ChIP-seq data (`Bruse & van Heeringen, 2018 `_). You can include a large variety of other tools by using the ``-t`` parameter. This will result in an increased running time and some tools, such as GADEM, will take a very long time. The following tools are supported:

- `AMD `_
- `BioProspector `_
- `ChIPMunk `_
- `DiNAMO `_
- `GADEM `_
- `DREME `_
- `HMS `_
- `Homer `_
- `Improbizer `_
- `MDmodule `_
- `MEME `_
- `MotifSampler `_
- `POSMO `_
- `ProSampler `_
- `RPMCMC `_
- `Trawler `_
- `Weeder `_
- `XXmotif `_
- `YAMDA `_

With the exception of RPMCMC and YAMDA, all tools come installed with GimmeMotifs when using the bioconda package. AMD, HMS, Improbizer, MotifSampler and DiNAMO are not supported on OSX.

**Please note:** all these programs include their own license and many are free **for academic or non-commercial** use only. For commercial use of any of these programs, please consult the respective author! GimmeMotifs itself can be freely used commercially.

**Motif size**

The default setting for motif size is ``-a xl``, which searches for motifs with a length of up to 20. You can use different analysis sizes: ``small`` (up to 8), ``medium`` (up to 12) or ``large`` (up to 15). The running time can be significantly shorter for shorter motifs. However, keep in mind that the ``xl`` analysis setting results in the best motifs in general.

Small input sets
++++++++++++++++

GimmeMotifs is developed for larger datasets, such as ChIP-seq peaks, where you have the luxury to use a large fraction of your input for validation. This means that at least several hundred sequences would be optimal.
If you want to run GimmeMotifs on a small input dataset, it might be worthwhile to increase the fraction used for motif prediction with the ``-f`` argument, for instance ``-f 0.5``.

Running on FASTA files
++++++++++++++++++++++

It is possible to run GimmeMotifs on a FASTA file as input instead of a BED file. This is detected automatically if your input file is correctly formatted according to FASTA specifications. Please note that for best results, all the sequences should be of the same length. This is not necessary for motif prediction, but the statistics and positional preference plots will be wrong if sequences have different lengths.

Intermediate results
++++++++++++++++++++

GimmeMotifs produces a lot of intermediate results, such as all predicted motifs, FASTA files used for validation and so on. These are deleted by default (as they can get quite large), but if you are interested in them, you can specify the ``-k`` option.

Detailed options for gimme motifs
+++++++++++++++++++++++++++++++++

**Positional arguments**

- ``INPUT``

  The input file needs to be in BED, FASTA, narrowPeak or region format. By default ``gimme motifs`` will take the center of these features, and extend those to the size specified by the ``-s`` or ``--size`` argument. By default this is 200 bp. Keep in mind that the smaller the regions are, the better motif discovery will work.

  **BED-formatted** files need to contain at least three tab-separated columns describing chromosome name, start and end. The fourth column is optional. If it is specified it will be used by some motif prediction tools to sort the features before motif prediction.

  **FASTA** files can be used as input for motif prediction. For best results it is recommended to use sequences of the same size.

  Peak files in **narrowPeak** format, such as produced by MACS2, can also directly be used as input. With these files, ``gimme motifs`` will use the summit of the peak and create regions of size 200 centered at this summit.
  Use the ``-s`` parameter to change this size.

  Finally, **region** files can be used. These contain one column, with regions specified in ``chrom:start-end`` format.

- ``OUTDIR``

  The name of the output directory. All output files will be saved in this directory. If the directory already exists, files will be overwritten.

**Optional arguments**

- ``-b BACKGROUND``, ``--background BACKGROUND``

  Type of background to use. There are five options: ``gc``, ``genomic``, ``random``, ``promoter`` or the path to a file with background sequences (FASTA, BED or regions). By default ``gc`` is used, which generates random regions from the genome with a similar GC% as your input sequences. The ``genomic`` background will select random genomic regions without taking the sequence composition into account. The ``random`` background will create artificial sequences with a similar nucleotide distribution as your input sequences. The ``promoter`` background will select random promoters. For this option, your genome needs to be installed with `genomepy `_ using the ``--annotation`` option. Finally, you can select your own custom background by supplying the path to a file.

- ``-g GENOME``

  Name of the genome to use. This can be the name of a genome installed with `genomepy `_ or the path to a FASTA file.

- ``--denovo``

  Only run *de novo* motif analysis. By default, the analysis includes known motifs. For specific arguments for *de novo* motif analysis, see below.

- ``--known``

  Only run known motif analysis. By default, the analysis includes *de novo* motifs. For specific arguments for known motif analysis, see below.

- ``--noreport``

  Don't create an HTML report, only plain text output files.

- ``--rawscore``

  Don't use z-score normalization for motif scores. The raw logodds motif scores are dependent on motif length. This means that the same logodds score will mean different things for motifs with a different length.
  By default, GimmeMotifs uses the scores in a set of genomic background regions to determine a background distribution of scores. The logodds score is then scaled using this distribution.

- ``--nogc``

  By default GimmeMotifs calculates the motif logodds score distribution for regions with a different GC%. The score is then normalized according to the GC% bin per input sequence. Use this argument to turn this off.

- ``-N INT``, ``--threads INT``

  Number of threads to use (default is 12).

**Optional arguments for known motif analysis**

- ``-p PFMFILE``

  PFM file with motifs to use for known motif analysis. You can use a custom PFM file, or use any of the databases included with GimmeMotifs such as, for instance, ``JASPAR2020_vertebrates``, ``HOMER``, ``HOCOCOMOv11_HUMAN`` or ``CIS-BP``. By default, a database of clustered vertebrate motifs is used, ``gimme.vertebrate.v5.0``. This database has a limited motif redundancy.

**Optional arguments for de novo motif analysis**

- ``-t TOOLS``, ``--tools TOOLS``

  The *de novo* motif prediction tools to use, separated by commas. This can be any combination of the following: AMD, BioProspector, ChIPMunk, DiNAMO, GADEM, DREME, HMS, Homer, Improbizer, MDmodule, MEME, MEMEW, MotifSampler, POSMO, ProSampler, RPMCMC, Trawler, Weeder, XXmotif, YAMDA. By default ``TOOLS`` is ``BioProspector,Homer,MEME``. Note that some tools may not be installed. Running ``gimme motifs -h`` will always list the tools that are supported on your installation of GimmeMotifs.

- ``-a``, ``--analysis``

  The size of motifs to look for: small (5-8), medium (5-12), large (6-15) or xl (6-20). The larger the motifs, the longer the *de novo* motif prediction will take. By default, xl will be used as this generally yields the best motifs. However, some prediction tools take a very long time in combination with the xl setting.

- ``-k``, ``--keepintermediate``

  Keep intermediate files.

- ``-S``, ``--singlestrand``

  Only use the forward strand for prediction.
  By default both strands are used.

- ``-f FRACTION``, ``--fraction FRACTION``

  This parameter controls the fraction of the sequences used for prediction. This is 0.2 by default, so in this case a randomly chosen 20% of the sequences will be used for prediction. The remaining sequences will be used for validation (enrichment, ROC curves etc.). If you have a large set of sequences (i.e. most ChIP-seq peak sets), this is fine. However, if your set is smaller, it might be worthwhile to increase this prediction fraction.

  The number of sequences that is used is also influenced by the ``abs_max`` parameter in the configuration file. Regardless of the ``-f`` parameter, the total number of sequences used for motif prediction will never exceed the number specified by ``abs_max``.

- ``-s N``, ``--size N``

  This is the size of the sequences used for motif prediction. Smaller sequences will result in a faster analysis, but you are of course limited by the accuracy of your data. For the tested ChIP-seq data sets 200 performs fine. If this parameter is set to ``0``, the original size of the regions in the input file will be used.

.. _`gimme_maelstrom`:

Command: gimme maelstrom
------------------------

This command can be used to identify differential motifs between two or more data sets. See the :ref:`maelstrom tutorial` for more details.

**Positional arguments:**

::

    INPUTFILE             file with regions and clusters
    GENOME                genome
    DIR                   output directory

**Optional arguments:**

::

    -h, --help            show this help message and exit
    -p pfmfile, --pfmfile pfmfile
                          PFM file with motifs (default:
                          gimme.vertebrate.v5.0.pfm)
    --no-filter           Don't remove redundant motifs.
    -F FLOAT, --filter_cutoff FLOAT
                          Cutoff to select non-redundant motifs. Default is 0.8,
                          increase this value to get fewer motifs.
    --nocenter            Don't mean-center the rows by default
    -m NAMES, --methods NAMES
                          Run with specific methods
    -a method, --aggregation method
                          How to combine motifs from individual methods.
                          Default is "int_stouffer", for inverse normal
                          transform of ranks, followed by Stouffer's method to
                          combine z-scores. Alternatively, specify "stuart" for
                          log-transformed rank aggregation p-values.
    -N INT, --nthreads INT
                          Number of threads (default 12)
    --rawscore            Don't z-score normalize motif scores
    --nogc                Don't use GC% bins

**Input file formats**

The input can be in one of two possible formats. In both cases the genomic location should be present as ``chrom:start-end`` in the first column. The first option is a two-column format and looks like this:

::

    loc	cluster
    chr15:49258903-49259103	NK
    chr10:72370313-72370513	NK
    chr4:40579259-40579459	Monocytes
    chr10:82225678-82225878	T-cells
    chr5:134237941-134238141	B-cells
    chr5:58858731-58858931	B-cells
    chr20:24941608-24941808	NK
    chr5:124203116-124203316	NK
    chr17:40094476-40094676	Erythroblast
    chr17:28659327-28659527	T-cells

This can be the result of a clustering analysis, for instance.

The second option looks like this:

::

    loc	NK	Monocytes	T-cells	B-cells
    chr12:93507547-93507747	3.118	2.522	1.933	0.197
    chr7:38236460-38236660	1.098	0.502	0.201	0.190
    chr10:21357147-21357347	0.528	-0.066	-1.04	-0.343
    chr6:115521512-115521712	0.406	-0.376	-0.480	-0.667
    chr2:97359808-97360008	1.501	0.905	0.719	0.031
    chr16:16684549-16684749	0.233	-0.362	-0.837	-0.746
    chrX:138964544-138964744	0.330	-0.291	-0.686	-0.777
    chr2:186923973-186924173	0.430	-0.258	-1.164	-0.723
    chrX:113834470-113834670	0.560	-0.036	-0.686	-0.692

This is a tab-separated table, with a header describing the experiments. In case of sequencing data, such as ChIP-seq, ATAC-seq or DNaseI seq, we recommend using **log-transformed** read counts which are **mean-centered per row**. For optimal results, it is recommended to normalize between experiments (columns) after the log-transformation step, for instance by quantile normalization or scaling. By default, ``gimme maelstrom`` will mean-center the input; disable this with ``--nocenter``.
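
The recommended preparation of the second input format can be sketched in plain Python. This is only an illustration: the function names, the read counts and the pseudocount of 1 are assumptions, and the between-experiment normalization step (quantile normalization or scaling) is left out for brevity.

```python
import math


def log_mean_center(counts, pseudocount=1.0):
    """Log2-transform read counts and mean-center every row.

    counts maps a region ("chrom:start-end") to one read count per
    experiment. The pseudocount avoids log(0) and is an assumption.
    """
    table = {}
    for loc, row in counts.items():
        logged = [math.log2(value + pseudocount) for value in row]
        row_mean = sum(logged) / len(logged)
        table[loc] = [value - row_mean for value in logged]
    return table


def format_table(table, experiments):
    """Render the table as tab-separated text with a 'loc' header."""
    lines = ["\t".join(["loc"] + experiments)]
    for loc, row in table.items():
        lines.append("\t".join([loc] + [f"{value:.3f}" for value in row]))
    return "\n".join(lines)


# Hypothetical read counts for two regions in four experiments.
counts = {
    "chr12:93507547-93507747": [255, 63, 15, 3],
    "chr7:38236460-38236660": [7, 7, 7, 7],
}
experiments = ["NK", "Monocytes", "T-cells", "B-cells"]
table = log_mean_center(counts)
print(format_table(table, experiments))
```

After this step every row sums to zero, so a positive value means the signal in that experiment is above the average for that region.
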

The second input format generally gives better results than the first one and would be the recommended format.

The output scores of ``gimme maelstrom`` are z-scores that represent the combined result of multiple methods. The individual results from different methods are ranked from high-scoring motif to low-scoring motif and converted to z-scores using the inverse normal transformation. The z-scores from individual methods are then combined using Stouffer's method. The score that is shown is the aggregated z-score. A higher z-score means that presence of the motif or a higher motif score is associated with higher signal in a specific sample.

By default, ``gimme maelstrom`` selects a non-redundant set of motifs by clustering the motifs based on scores in the set of input sequences. You can disable this by using the ``--no-filter`` argument. You can tweak the number of selected motifs by changing the ``--filter_cutoff`` parameter. By default this is set to ``0.8``. Increase this value to select fewer motifs, decrease it to select more motifs. Keep in mind that you may start to lose biologically relevant motifs if you set this too high.

.. _`gimme_scan`:

Command: gimme scan
-------------------

Scan a set of sequences with a set of motifs, and get the resulting matches in GFF, BED or table format. If the FASTA header includes a chromosome location in ``chrom:start-end`` format, the BED output will return the genomic location of the motif match. The GFF file will always have the motif location relative to the input sequence.

A basic command would look like this:

::

    $ gimme scan peaks.bed -g hg38 -b > motifs.bed

The threshold that is used for scanning can be specified in a number of ways. The default threshold is set to a motif-specific 1% FPR by scanning random genomic sequences. You can change the FPR with the ``-f`` option and/or the set of sequences that is used to determine the FPR with the ``-B`` option.
For instance, this command would scan with thresholds based on 5% FPR with random genomic mouse sequences.

::

    $ gimme scan input.fa -g mm10 -f 0.05 -b > gimme.scan.bed

And this command would base a 10% FPR on the input file ``hg38.promoters.fa``:

::

    $ gimme scan input.fa -f 0.1 -B hg38.promoters.fa -b > gimme.scan.bed

Alternatively, you can specify the threshold as a single score. This score is relative and is based on the maximum and minimum possible score for each motif. For example, a score of 0.95 means that the score of a motif should be at least 95% of the (maximum score - minimum score). This should probably not be set much lower than 0.8, and should be generally at least 0.9-0.95 for good specificity. Generally, as the optimal threshold might be different for each motif, the use of the FPR-based threshold is preferred. One reason to use a single score as threshold is when you want a match for each motif, regardless of the score. This command would give one match for every motif for every sequence, regardless of the score.

::

    $ gimme scan input.bed -g hg38 -c 0 -n 1 -b > matches.bed

Finally, ``gimme scan`` can return the scanning results in table format. The ``-t`` option will yield a table with the number of matches, while the ``-T`` option will have the score of the best match.

**Positional arguments:**

- ``INPUT``

  The input file needs to be in BED, FASTA or region format.

  **BED-formatted** files need to contain at least three tab-separated columns describing chromosome name, start and end.

  **Region** files can also be used. These contain one column, with regions specified in ``chrom:start-end`` format.

**Optional arguments**

- ``-g GENOME``

  Name of the genome to use. This can be the name of a genome installed with `genomepy `_ or the path to a FASTA file.

- ``-p PFMFILE``, ``--pfmfile PFMFILE``

  PFM file with motifs to use for known motif analysis.
  You can use a custom PFM file, or use any of the databases included with GimmeMotifs such as, for instance, ``JASPAR2020_vertebrates``, ``HOMER``, ``HOCOCOMOv11_HUMAN`` or ``CIS-BP``. By default, a database of clustered vertebrate motifs is used, ``gimme.vertebrate.v5.0``. This database has a limited motif redundancy.

- ``-f``, ``--fpr``

  Base the motif score threshold on this FPR. By default this is set to 1%, equivalent to ``-f 0.01``. The score threshold is based on scanning random genomic regions with the same size and the same GC% distribution. This threshold is calculated once for a specific sequence size and cached. Therefore, scanning will take longer the first time you use a specific FPR with a specific input sequence size.

- ``-B``, ``--bgfile``

  Specify a FASTA file to use for FPR calculation, instead of taking random genomic regions.

- ``-c``, ``--cutoff``

  Use this score cutoff instead of an FPR-based threshold. This score is relative and is based on the maximum and minimum possible score for each motif. For example, a score of 0.95 means that the score of a motif should be at least 95% of the (maximum score - minimum score). This should probably not be set much lower than 0.8, and should be generally at least 0.9-0.95 for good specificity. Generally, as the optimal threshold might be different for each motif, the use of the FPR-based threshold is preferred.

- ``-n``, ``--nreport``

  Maximum number of matches to report per motif per sequence. By default this is set to 1.

- ``-r``, ``--norc``

  Don't scan the reverse complement of the sequence. By default both strands will be scanned.

- ``-b``, ``--bed``

  Output motif matches in BED format.

- ``-t``, ``--table``

  Output number of matches in a table format, where columns represent motifs and rows represent input sequences.

- ``-T``, ``--score_table``

  Output maximum motif score in a table format, where columns represent motifs and rows represent input sequences.
  The score will be reported for each motif, regardless of whether it is a good match or not.

- ``-z``, ``--zscore``

  Use z-score normalization for motif scores. The raw logodds motif scores are dependent on motif length. This means that the same logodds score will mean different things for motifs with a different length. By default, GimmeMotifs uses the scores in a set of genomic background regions to determine a background distribution of scores. The logodds score is then scaled using this distribution.

- ``--gc``

  Use this option to calculate the motif logodds score distribution based on regions with a similar GC%.

- ``-N INT``, ``--threads INT``

  Number of threads to use (default is 12).

.. _`gimme_background`:

Command: gimme background
-------------------------

Generate random sequences according to one of several methods:

- ``random`` - randomly generated sequence with the same dinucleotide distribution as the input sequences according to a 1st order Markov model
- ``genomic`` - sequences randomly chosen from the genome
- ``gc`` - sequences randomly chosen from the genome with the same GC% as the input sequences
- ``promoter`` - random promoter sequences

The background types ``gc`` and ``random`` need a set of input sequences in BED or FASTA format. If the input sequences are in BED format, the genome version needs to be specified with ``-g``.

**Positional arguments:**

::

    FILE        outputfile
    TYPE        type of background sequences to generate
                (random,genomic,gc,promoter)

**Optional arguments:**

::

    -h, --help  show this help message and exit
    -i FILE     input sequences (BED or FASTA)
    -f TYPE     output format (BED or FASTA)
    -l INT      length of random sequences
    -n NUMBER   number of sequence to generate
    -g GENOME   genome version (not for type 'random')
    -m N        order of the Markov model (only for type 'random', default 1)

.. _`gimme_logo`:

Command: gimme logo
-------------------

Convert one or more motifs in a PFM file to a sequence logo.
Most of these logos are made possible by the excellent `Logomaker `_ package.

You can optionally supply a PFM file, otherwise ``gimme logo`` uses the default ``gimme.vertebrate.v5.0``. With the ``-i`` option, you can choose one or more motifs to convert.

This will convert all the motifs in ``CTCF.pfm`` to a sequence logo:

::

    $ gimme logo -p CTCF.pfm

This will create a logo for ``GM.5.0.Ets.0026`` from the default database.

::

    $ gimme logo -i GM.5.0.Ets.0026

You can specify four types of sequence logos:

=========== =================================================
information .. image:: images/MA1115.1_POU5F1.information.png
               :scale: 50 %
frequency   .. image:: images/MA1115.1_POU5F1.frequency.png
               :scale: 50 %
energy      .. image:: images/MA1115.1_POU5F1.energy.png
               :scale: 50 %
ensembl     .. image:: images/MA1115.1_POU5F1.ensembl.png
               :scale: 50 %
=========== =================================================

You can leave the motif title out with the ``--notitle`` argument.

::

    $ gimme logo JASPAR2020_vertebrates -i MA1115.1_POU5F1 -k energy --notitle

.. image:: images/MA1115.1_POU5F1.energy.notitle.png
   :scale: 50 %

**Positional arguments:**

- ``pfmfile``

  PFM file with motifs. You can use a custom PFM file, or use any of the databases included with GimmeMotifs such as, for instance, ``JASPAR2020_vertebrates``, ``HOMER``, ``HOCOCOMOv11_HUMAN`` or ``CIS-BP``.

**Optional arguments:**

- ``-i IDS``, ``--ids IDS``

  Comma-separated list of motif ids (default is all ids).

- ``-k TYPE``, ``--kind TYPE``

  Type of motif (information, frequency, energy or ensembl). The default is information.

- ``--notitle``

  Don't include motif ID as title.

- ``-h``, ``--help``

  Show help message.

.. _`gimme_match`:

Command: gimme match
--------------------

Find the best match for every motif in an input PFM file in a database of reference motifs. By default the ``gimme.vertebrate.v5.0`` database is used; however, other databases can be specified using the ``-d`` argument.
This can be a custom PFM file, or any of the databases included with GimmeMotifs such as, for instance, ``JASPAR2020_vertebrates``, ``HOMER``, ``HOCOCOMOv11_HUMAN`` or ``CIS-BP``.

If an output file is specified, a graphical output with aligned motifs will be created. However, this is slow for many motifs and can consume a lot of memory (`see issue`_). It works fine for a few motifs at a time.

.. _`see issue`: https://github.com/simonvh/gimmemotifs/issues/5

**Positional arguments:**

::

    PFMFILE     File with input pfms

**Optional arguments:**

::

    -h, --help  show this help message and exit
    -d DBFILE   File with pfms to match against (default:
                gimme.vertebrate.v5.0.pfm)
    -n INT      Number of top matches to report
    -o FILE     Output file with graphical report (png, svg, ps, pdf)

.. _`gimme_cluster`:

Command: gimme cluster
----------------------

Cluster a set of motifs with the WIC metric.

**Positional arguments:**

::

    INPUTFILE     Inputfile (PFM format)
    OUTDIR        Name of output directory

**Optional arguments:**

::

    -h, --help    show this help message and exit
    -s            Don't compare reverse complements of motifs
    -t THRESHOLD  Cluster threshold

.. _`gimme_threshold`:

Command: gimme threshold
------------------------

Create a file with motif-specific thresholds based on a specific background file and a specific FPR. The FPR should be specified as a float between 0.0 and 1.0. You can use this threshold file with the ``-c`` argument of :ref:`gimme scan<gimme_scan>`. Note that :ref:`gimme scan<gimme_scan>` by default determines an FPR based on random genomic background sequences. You can use this command to create the threshold file explicitly, or when you want to determine the threshold based on a different type of background. For instance, this command would create a file with thresholds for the motifs in ``custom.pwm`` with an FPR of 1%, based on the sequences in ``promoters.fa``.

::

    $ gimme threshold custom.pwm promoters.fa 0.01 > custom.threshold.txt

**Positional arguments:**

::

    PFMFILE     File with pwms
    FAFILE      FASTA file with background sequences
    FPR         Desired fpr

.. _`gimme_location`:

Command: gimme location
-----------------------

Create the positional preference plots for all the motifs in the input PWM file. This will give best results if all the sequences in the FASTA-formatted input file have the same length. Keep in mind that this only makes sense if the sequences are centered around a similar feature (transcription start site, highest point in a peak, etc.). The default threshold for motif scanning is 0.95, see ``gimme scan`` for more details.

**Positional arguments:**

::

    PFMFILE     File with pwms
    FAFILE      Fasta formatted file

**Optional arguments:**

::

    -h, --help  show this help message and exit
    -w WIDTH    Set width to W (default: determined from fastafile)
    -i IDS      Comma-separated list of motif ids to plot (default is all ids)
    -c CUTOFF   Cutoff for motif scanning (default 0.95)

.. _`gimme_diff`:

Command: gimme diff
-------------------

This is a simple command to visualize differential motifs between different data sets. You are probably better off using :ref:`gimme maelstrom<gimme_maelstrom>`; however, in some cases this visualization might still be informative. The input consists of a number of FASTA files, separated by a comma. These are compared to a background file. The remaining arguments are an output image and a file with PWMs (``-p``). The ``gimme diff`` command then produces two heatmaps (enrichment and frequency) of all enriched, differential motifs. Reported motifs are at least 3 times enriched compared to the background (change with the ``-e`` argument) and have a minimum frequency in at least one of the input data sets of 1% (change with the ``-f`` argument). You can specify the motif threshold with the ``-c`` argument (which can be a file generated with :ref:`gimme threshold<gimme_threshold>`).

For a command like this...

::

    $ gimme diff VEGT_specific.summit.200.fa,XBRA_specific.summit.200.fa,XEOMES_specific.summit.200.fa random.w200.fa gimme_diff_tbox.png -p tbox.pwm -f 0.01 -c threshold.0.01.txt

...the output will look like this (based on ChIP-seq peaks of T-box factors from `Gentsch et al. 2013`_):

.. image:: images/gimme_diff_tbox.png

The image layout is not always optimal. If you want to customize the image, you can either save it as a ``.svg`` file, or use the numbers that are printed to stdout. The columns are in the same order as the image, the row order may be different as these are clustered before plotting.

Note that the results might differ quite a lot depending on the threshold that is chosen! Compare for instance an FPR of 1% vs an FPR of 5%.

.. _`Gentsch et al. 2013`: https://doi.org/10.1016/j.celrep.2013.08.012

**Positional arguments:**

::

    FAFILES               FASTA-formatted inputfiles OR a BED file with an
                          identifier in the 4th column, for instance a cluster
                          number.
    BGFAFILE              FASTA-formatted background file
    PNGFILE               outputfile (image)

**Optional arguments:**

::

    -h, --help            show this help message and exit
    -p PFMFILE, --pfmfile PFMFILE
                          PWM file with motifs (default:
                          gimme.vertebrate.v3.1.pwm)
    -c , --cutoff         motif score cutoff or file with cutoffs (default 0.9)
    -e MINENR, --enrichment MINENR
                          minimum enrichment in at least one of the datasets
                          compared to background
    -f MINFREQ, --frequency MINFREQ
                          minimum frequency in at least one of the datasets
    -g VERSION, --genome VERSION
                          Genome version. Only necessary in combination with a
                          BED file with clusters as inputfile.

.. _`gimme_motif2factors`:

Command: gimme motif2factors
----------------------------

With motif2factors you can convert an existing motif database to a motif database for your species of interest. This conversion is done by orthology, which is not the ideal way to do this. When converting the original database to a database of your favourite critter, only the relations between motifs and transcription factors are changed.
A method like this, based on orthology, is not capable of inferring whether or not the motif has changed. However, the main advantage of this method is that it is an easy and fast way to get a species-specific database, works surprisingly well, and does not require any special expertise or infrastructure to work.

The method starts by downloading the genome assemblies of your species-of-interest (new-reference), the species the database is based on (database-references), and some other related species for better orthology inference (ortholog-references); by default a selection of vertebrate species. From each of these assemblies, for each gene the longest protein is taken, and compared with orthofinder for orthology (`David M. Emms & Steven Kelly 2019`_).

.. _`David M. Emms & Steven Kelly 2019`: https://doi.org/10.1186/s13059-019-1832-y

Then, based on orthology, we can replace the names of transcription factors in the original database with the names of our new species. One problem with this method is that the names of transcription factors in the database do not necessarily (necessarily not, bioinformatics...) overlap with the names used in the genome assembly. To overcome this problem mygene.info is queried, so that differently named TFs can still be linked to genes and thus to orthologs. We tested this for gimme.vertebrate.v5.0, and it worked well in our case. However, it might be possible that this generates too many false positives in your case, and you can tweak the lookup on mygene.info with the ``--strict``/``--medium``/``--lenient`` flags.

If you happen to work with e.g. a non-public genome or a genome with a different type of annotation, you can supply your own files.
This can either be:

* a fasta and gtf file with the naming schemes

  * ``{genomes_dir}/{genome_name}/{genome_name}.fa``
  * ``{genomes_dir}/{genome_name}/{genome_name}.annotation.gtf``

* a peptide file with the naming scheme

  * ``{genomes_dir}/{genome_name}/{genome_name}.pep.fa``

The ``genome_name`` can be the name of the assembly/genome you use, e.g. hg38 for human. The ``genomes_dir`` can optionally be specified on the command line. Otherwise the default genomes dir is used (use ``genomepy config show`` to get this location). If one or more genomes were previously installed with genomepy, their full path can be used as well (regardless of the ``genomes_dir``).

The peptide file should be of the format:

::

    >identifier1
    MKNTMKKQSGVVDTFKKAITAKSQWHDKDEFLDVIYWFKQIIGVILGLLWGFIPLKGFLG
    >identifier2
    MRPLRIIIQRKGKSYGELHGYGQIYKSKMPVSKILLAVNEKRNNHNNISILDDFRRVSSI

where the identifier can be any of

* 1
* 1|2
* 3|1|2

and gimmemotifs will use key 1. Usually 1 represents the gene_name, 2 the gene_id, and 3 the transcript_id.

**Optional arguments:**

::

    --new-reference ASSEMBLY [ASSEMBLY ...]
                          The assembly the new motif2factors file will be based
                          on.
    --database db         The database you want to change convert to your
                          species of interest. (default is
                          gimme.vertebrate.v5.0)
    --database-references ASSEMBLY [ASSEMBLY ...]
                          The assembly(s) on which the orginal motif2factors is
                          based on. (default is human and mouse)
    --ortholog-references ASSEMBLY [ASSEMBLY ...]
                          Extra assemblies for better orthology inference
                          between the new reference and database reference.
                          (default is a range of vertebrate species)
    --genomes_dir DIR     Where to find/store genomepy genomes. Defaults to the
                          genomepy config settings.
    --tmpdir DIR          Where to place intermediate files. Defaults to system
                          temp.
    --outdir OUTDIR       Where to save the results to. Defaults to current
                          working directory.
    --strict, --medium, --lenient
                          How strict should the names of the genes in the
                          assembly be followed.
                          Strict: base names only on what is in the annotation
                          file; Medium: base on annotation file, as well as on
                          mygene.info name and symbol query; Lenient: based on
                          annotation file, and mygeneinfo name, symbol, alias,
                          other_names, accession, accession.protein, refseq,
                          refseq.protein, ensembl, ensembl.gene. Lenient is the
                          default, but in case of false-positive hits you can
                          tune this stricter.
    --threads INT         Maximum number of parallel threads used.
    --keep-intermediate   Keep temporary files, do not delete tmpdir.
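
The identifier convention for custom peptide files described earlier in this section (``1``, ``1|2`` or ``3|1|2``, of which key 1 is used) can be illustrated with a small Python sketch. The function name and the example identifiers are hypothetical, and the position of key 1 is read off the three layouts listed above; this is not the actual GimmeMotifs parsing code.

```python
def gene_key(identifier):
    """Return key 1 from a peptide-file identifier.

    Assumed layouts: "1", "1|2" or "3|1|2", where 1 is usually the
    gene_name, 2 the gene_id and 3 the transcript_id.
    """
    fields = identifier.lstrip(">").strip().split("|")
    if len(fields) in (1, 2):
        # "1" or "1|2": key 1 is the first field
        return fields[0]
    if len(fields) == 3:
        # "3|1|2": key 1 is the middle field
        return fields[1]
    raise ValueError(f"unexpected identifier layout: {identifier!r}")


# Hypothetical FASTA headers in the three supported layouts.
for header in (">geneA", ">geneA|gid001", ">tx001|geneA|gid001"):
    print(gene_key(header))
```

Checking your own peptide file against these layouts before running ``gimme motif2factors`` may help to spot headers that do not follow the expected scheme.
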