 #######################################################################
 #                                                                     #
 #  The Manual of RSPB (Repetitive Sequences with Precise Boundaries)  #
 #                                                                     #
 #                        Version 0.20  2012-7-25                      #
 #                               LU, Chen                              #
 #                          chenlubio@gmail.com                        #
 #                                                                     #
 #                                                                     #
 #######################################################################

1. What is RSPB?

       RSPB is a series of Perl scripts that identify related repetitive 
   sequences with precise boundaries in a genome. It is currently used to find 
   miniature inverted-repeat transposable elements (MITEs). RSPB is designed to 
   run on Linux.

2. System requirements
   
   Linux OS (64-bit version recommended)
   Multi-core CPU
   8 GB or more memory 
   Free disk space of at least thirty times the genome size
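
   The disk space rule above can be checked before starting a run. The sketch 
   below is only an illustration: the function names and the example paths are 
   hypothetical and not part of RSPB, and it assumes the genome is stored as a 
   single uncompressed FASTA file.

```shell
#!/usr/bin/env bash
# Sketch: check that a directory has at least 30x the genome size free.
# Function names and example paths are illustrative, not part of RSPB.

# Print the space RSPB may need (in KB): 30 times the genome file size.
required_kb() {
    local genome=$1
    local bytes
    bytes=$(wc -c < "$genome")
    echo $(( bytes * 30 / 1024 ))
}

# Print the free space (in KB) on the file system holding a directory.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Return 0 if the directory has enough free space for the genome, 1 otherwise.
check_disk() {
    local genome=$1 dir=$2
    local need have
    need=$(required_kb "$genome")
    have=$(free_kb "$dir")
    if [ "$have" -ge "$need" ]; then
        echo "OK: ${have} KB free, ${need} KB needed"
    else
        echo "Not enough space: ${have} KB free, ${need} KB needed"
        return 1
    fi
}

# Example (hypothetical paths): check_disk genome.fa /path/to/target_path
```

   For a 400 MB genome file this asks for roughly 12 GB of free space.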

3. Prerequisites

   Before running RSPB, complete the following preparations:

   (1) Install Perl 5.10.1 or higher (including the 
       Algorithm::ClusterPoints module) and BioPerl 1.6.1 or higher 
       (including BioPerl and BioPerl-run).
      *Note: If you are not the system administrator, modify the first line 
       of every RSPB script to point to your Perl interpreter, and add a line 
       that puts the BioPerl path in @INC, for example:
         use lib "$ENV{HOME}/perl5/site/lib";

   (2) Download NCBI blast (ftp://ftp.ncbi.nih.gov/blast/executables/release/, 
       not blast+), Muscle (http://www.drive5.com/muscle/) and Mdust 
       (in RSPB package or http://sourceforge.net/projects/gicl/files/other/), 
       and then copy the executables blastall, formatdb, muscle and mdust 
       to a directory listed in the PATH environment variable.
      *Note: Remember to rename the Muscle executable to "muscle" (without 
       the version number or platform name).

   (3) Download and install RepeatMasker 
       (http://www.repeatmasker.org/RMDownload.html)

   (4) Make the directory containing the genome sequences writable.

   (5) If you have already run MITE-hunter to identify MITEs, first run 
       hunter2ref.pl to combine the MITE-hunter results:
          perl hunter2ref.pl -p MITE-hunter_output_path -r output_file
       Then pass the output file as the -r parameter of RSPB_manage.pl.
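
   Before moving on, it can help to confirm that the external programs named 
   above are reachable through PATH. The following is a minimal sketch; the 
   function name missing_tools is hypothetical and not part of RSPB.

```shell
#!/usr/bin/env bash
# Sketch: report which required executables cannot be found on PATH.
# missing_tools is an illustrative helper, not part of RSPB.

# Print each tool that is not found on PATH; return 1 if any is missing.
missing_tools() {
    local status=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return $status
}

# Example:
#   missing_tools perl blastall formatdb muscle mdust RepeatMasker
# A Perl module can be checked in a similar way, for instance:
#   perl -MAlgorithm::ClusterPoints -e 1
```

   An empty output means every listed program was found.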

4. Install RSPB

   Download RSPB_0.20.zip and decompress it with this command:
     unzip RSPB_0.20.zip

5. Run RSPB to identify MITEs

   RSPB has 7 steps. Steps 1-3 can be run together with RSPB_manage.pl.
    Usage:
      perl RSPB_manage.pl -i input_file -q built_query_path -d blastdb_name 
      [optional parameters]

        Option  Required  Default  Description
   
          -i      Yes        -     Query sequences file name (FASTA format).
                                   If the query sequences are split into 
                                   several files by chromosome, please combine 
                                   them into one file first.
          -q      Yes        -     Directory containing query sequences and 
                                   outputs of subsequent scripts.
          -d      Yes        -     Blast database name of genome sequences.
                                   User can specify an existing database.
          -r       No        -     The known MITEs or other repetitive sequences 
                                   file name (FASTA format).
          -F       No       30     Cut-off value of Mdust to filter low-complexity 
                                   sequences (lower values might mask more).
                                   Recommend 28 for large genomes (>500 Mb).
          -e       No      1e-20   E-value threshold of blast.
                                   Recommend lower E-value for large genomes.
          -W       No       18     Word size of blastn.
                                   Recommend 21 for large genomes.
          -n       No        5     Minimum number of members in a group of 
                                   related sequences. Recommend higher value 
                                   (e.g. 10) for large genomes.
          -m       No       1000   Maximum length of related sequences.
          -a       No        4     Thread number used by blastn.
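
     As the -i description notes, per-chromosome files must be merged into one 
     FASTA file first. A minimal sketch of that step follows; combine_fasta and 
     all file, directory and database names in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: merge per-chromosome FASTA files into one input file for RSPB.
# combine_fasta and the example file names are illustrative only.

# Concatenate FASTA files into $1, making sure every input ends with a
# newline so that records from different files do not run together.
combine_fasta() {
    local out=$1
    shift
    : > "$out"
    local f
    for f in "$@"; do
        cat "$f" >> "$out"
        # Add a newline if the last byte of the file is not one.
        if [ -s "$f" ] && [ -n "$(tail -c 1 "$f")" ]; then
            echo >> "$out"
        fi
    done
}

# Example (hypothetical names), using the options from the table above:
#   combine_fasta genome.fa chr01.fa chr02.fa
#   nohup perl RSPB_manage.pl -i genome.fa -q rspb_out -d genome_db \
#       -n 10 -W 21 -a 8 > rspb_manage.out 2>&1 &
```

     The output file must not be one of the input files, because it is 
     emptied before the inputs are read.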
     
     
   User can also run RSPB step by step:
   (1) Cleave the genome sequence into query sequences 
       Usage: 
         perl 01_cleave_objectseq.pl -i input_file -q target_path -d 
         blastdb_name [-F mdust_score]

        Option  Required  Default  Description

          -i      Yes        -     Genome sequences file (FASTA format). 
                                   If the genome sequences are split into 
                                   several files by chromosome, please combine 
                                   them into one file before this step.
          -q      Yes        -     Directory containing query sequences and 
                                   outputs of subsequent scripts ($target_path).
          -d      Yes        -     Blast database name of genome.
                                   User can specify an existing database.
          -F       No        30    Cut-off value of Mdust to filter low-complexity
                                   sequences (lower values might mask more).
                                   Recommend 28 for large genomes (>500 Mb).

        Output: a directory "$target_path" containing the query sequences 
                "genome_query/" and a list file "querylist" of query sequences, 
                and a blast database of the genome sequence.
        
   (2) Blast the query sequences against the genome
        Usage:
          perl 02_blast_queryseq.pl -l query_list -d blastdb [-e e-value] 
          [-W word_size] [-a thread_num]

        Option  Required  Default  Description

          -l      Yes        -     "querylist" file of query sequences.
          -d      Yes        -     Blast database name of genome.
                                   User can specify an existing database.
          -e       No      1e-20   E-value threshold of blast.
                                   Recommend lower E-value for large genomes.
          -W       No       18     Word size of blastn.
                                   Recommend 21 for large genomes.
          -a       No        4     Thread number used by blastn.

        Output: blast result of each query sequence and a log file 
                "$target_path/step2.log".

        *Note: a. This step usually takes more than one day, and can take more 
                  than one week for a large genome that contains many repetitive 
                  sequences, or for a draft genome made up of many scaffolds. 
                  I therefore recommend running the program in the background, 
                  for example:
                   nohup perl 02_blast_queryseq.pl [options] &

               b. If the program stops before it finishes, run redo.pl to 
                  generate a new list file named "redolist":
                    perl redo.pl -q querylist -l step2.log -o redolist
                  Then rerun 02_blast_queryseq.pl with the redolist as the 
                  -l parameter.
              
   (3) Identify the related repetitive sequences with precise boundaries
        Usage:
          perl 03_identify_RS.pl -l query_list -g genome_seq 
          [-r known_repetitive_seq] [-n minimum_number] [-m max_hit_length]
          
        Option  Required  Default  Description

          -l      Yes        -     "querylist" file of query sequences.
          -g      Yes        -     Genome sequences file name (FASTA format).
          -r       No        -     The known MITEs or other repetitive sequences 
                                   file name (FASTA format).
          -n       No        5     Minimum number of members in a group of 
                                   related sequences. Recommend higher value 
                                   (e.g. 10) for large genomes.
          -m       No       1000   Maximum length of related sequences.

        Output: a folder named "$target_path/rspb_group" containing several ".aln" 
                files of multiple sequence alignments (MSA) with 100 bp flanking 
                sequences, and a log file "$target_path/step3.log".
        *Note:  This step usually takes more than three days, and can take more 
                than half a month for a large genome. Please refer to the notes 
                of step 2.


   After finishing the above three steps, the directory 
   "$target_path/genome.query" can be deleted to save disk space.

   Steps (4) and (5) must be done manually.
   (4) Open each ".aln" file (GeneDoc or BioEdit is recommended) and check 
       whether the MSA has typical TIR and TSD sequences. 
      *Note: In our analysis of the rice genome, a typical MITE sequence is 
       shorter than 800 bp.
       
   (5) Choose one typical MITE from each identified MITE group as a seed 
       (excluding the flanking sequences) and divide the seeds into families. 
       Then rename the seed sequences with their family name. If one family has 
       several members, add a dash "-" and a number after the family name, 
       e.g. hAT3-2.

       
   (6) Use RepeatMasker to scan the genome with the seed sequences as the 
       reference library. I used these parameters:
         RepeatMasker genome_seq -lib seeds -nolow -no_is -s -cutoff 250 -pa 8
       
   (7) Parse the RepeatMasker results to get all MITEs in genome
        Usage:
          perl 07_get_mite_from_RM.pl -s seeds_file -R RepeatMasker_ori.out 
          -o mite_list_file [-l max_MITE_length]
          
        Option  Required  Default  Description

          -s      Yes        -     Seeds file (FASTA format)
          -R      Yes        -     The .ori.out file of RepeatMasker output
          -o      Yes        -     MITE list file 
          -l       No       800    Maximum length of MITE elements.

        Output: a set of list files.

6. Output format of MITE list:

 # family_name        Chr_number   Start_in_chr     End_in_chr  Description
   -----------------------------------------------------------------------   
   PIF/Harbinger_2      Chr01        33443188        33443342     partial
   PIF/Harbinger_2      Chr01        33465353        33465557     partial
   PIF/Harbinger_2      Chr01        33466538        33466779        full
   PIF/Harbinger_2      Chr01        33533125        33533342     partial
   PIF/Harbinger_2      Chr01        33541019        33541230     partial
   PIF/Harbinger_2      Chr01        33613820        33614055        full
   PIF/Harbinger_2      Chr01        33658933        33659129     partial
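
   A list in this format can be summarized per family with standard tools. The 
   sketch below assumes the layout shown above: five whitespace-separated 
   columns, a header line starting with "#" and a separator line of dashes. 
   The function name summarize_mites is hypothetical and not part of RSPB.

```shell
#!/usr/bin/env bash
# Sketch: count total and full-length copies per family in a MITE list.
# summarize_mites is an illustrative helper, not part of RSPB.

# Print "family <TAB> total_copies <TAB> full_copies" for each family.
# The header line ("#") and separator line ("-") are skipped, as are
# lines with fewer than five columns.
summarize_mites() {
    awk '
        $1 ~ /^[#-]/ || NF < 5 { next }
        {
            total[$1]++
            if ($5 == "full") full[$1]++
        }
        END {
            for (fam in total)
                printf "%s\t%d\t%d\n", fam, total[fam], full[fam] + 0
        }
    ' "$1"
}

# Example (hypothetical file name): summarize_mites mite_list.txt
```

   The order of the families in the output is not fixed; pipe the result 
   through sort if a stable order is needed.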