#######################################################################
#                                                                     #
#  The Manual of RSPB (Repetitive Sequences with Precise Boundaries)  #
#                                                                     #
#                       Version 0.20  2012-7-25                       #
#                              LU, Chen                               #
#                         chenlubio@gmail.com                         #
#                                                                     #
#                                                                     #
#######################################################################

1. What is RSPB?

    RSPB is a set of Perl scripts that identify related repetitive
    sequences with precise boundaries in a genome. It is currently used
    to find miniature inverted-repeat transposable elements (MITEs).
    RSPB is designed to run on Linux.

2. System requirements

    Linux OS (64-bit version recommended)
    Multi-core CPU
    8 GB or more memory
    Free disk space of at least thirty times the genome size

3. Prerequisites

    To run RSPB, make the following preparations first:

    (1) Install Perl 5.10.1 or higher (including the
        Algorithm::ClusterPoints module) and BioPerl 1.6.1 or higher
        (including BioPerl and BioPerl-run).

        *Note: If you are not the system administrator, modify the first
        line of every RSPB script to point to your Perl interpreter, and
        add a line that includes the BioPerl path in @INC, like this:

            use lib "$ENV{HOME}/perl5/site/lib";

    (2) Download NCBI blast
        (ftp://ftp.ncbi.nih.gov/blast/executables/release/, not blast+),
        Muscle (http://www.drive5.com/muscle/) and Mdust (included in the
        RSPB package, or from
        http://sourceforge.net/projects/gicl/files/other/), then copy the
        executables blastall, formatdb, muscle and mdust to a directory
        listed in the PATH environment variable.

        *Note: Rename the Muscle executable to "muscle" (without the
        version number or platform name).

    (3) Download and install RepeatMasker
        (http://www.repeatmasker.org/RMDownload.html).

    (4) Make the directory containing the genome sequences writable.

    (5) If you have previously run MITE-hunter to identify MITEs, run
        hunter2ref.pl first to combine the MITE-hunter results:

            perl hunter2ref.pl -p MITE-hunter_output_path -r output_file

        Then pass the output file as the -r parameter of RSPB_manage.pl.

4. Install RSPB

    Download RSPB_0.20.zip and decompress it with this command:

        unzip RSPB_0.20.zip

5. Run RSPB to identify MITEs

    RSPB has 7 steps. Steps 1-3 can be run together with RSPB_manage.pl.

    Usage: perl RSPB_manage.pl -i input_file -q built_query_path -d blastdb_name [optional parameters]

    Option  Required  Default  Description
    -i      Yes       -        Query sequences file name (FASTA format).
                               If the query sequences are split into
                               several files by chromosome, combine them
                               into one file first.
    -q      Yes       -        Directory containing the query sequences
                               and the outputs of subsequent scripts.
    -d      Yes       -        Blast database name of the genome
                               sequences. An existing database can be
                               specified.
    -r      No        -        File name of known MITEs or other
                               repetitive sequences (FASTA format).
    -F      No        30       Cut-off value of Mdust to filter
                               low-complexity sequences (lower values
                               might mask more). 28 is recommended for
                               large genomes (>500 Mb).
    -e      No        1e-20    E-value threshold of blast. A lower
                               E-value is recommended for large genomes.
    -W      No        18       Word size of blastn. 21 is recommended
                               for large genomes.
    -n      No        5        Minimum number of sequences in a group of
                               related sequences. A higher value (e.g.
                               10) is recommended for large genomes.
    -m      No        1000     Maximum length of related sequences.
    -a      No        4        Number of threads used by blastn.

    You can also run RSPB step by step:

    (1) Cleave the genome sequence into query sequences

    Usage: perl 01_cleave_objectseq.pl -i input_file -q target_path -d blastdb_name [-F mdust_score]

    Option  Required  Default  Description
    -i      Yes       -        Genome sequences file (FASTA format). If
                               the genome sequences are split into
                               separate files by chromosome, combine
                               them into one file before this step.
    -q      Yes       -        Directory containing the query sequences
                               and the outputs of subsequent scripts
                               ($target_path).
    -d      Yes       -        Blast database name of the genome. An
                               existing database can be specified.
    -F      No        30       Cut-off value of Mdust to filter
                               low-complexity sequences (lower values
                               might mask more). 28 is recommended for
                               large genomes (>500 Mb).
    Output: a directory "$target_path" containing the query sequences
    ("genome_query/") and a list file "querylist" of the query sequences,
    plus a blast database of the genome sequence.

    (2) Blast the query sequences against the genome

    Usage: perl 02_blast_queryseq.pl -l query_list -d blastdb [-e e-value] [-W word_size] [-a thread_num]

    Option  Required  Default  Description
    -l      Yes       -        "querylist" file of query sequences.
    -d      Yes       -        Blast database name of the genome. An
                               existing database can be specified.
    -e      No        1e-20    E-value threshold of blast. A lower
                               E-value is recommended for large genomes.
    -W      No        18       Word size of blastn. 21 is recommended
                               for large genomes.
    -a      No        4        Number of threads used by blastn.

    Output: the blast result of each query sequence and a log file
    "$target_path/step2.log".

    *Note:
    a. This step usually takes more than one day, and can take more than
       a week for a large genome that contains many repetitive
       sequences, or for a draft genome made up of many scaffolds. I
       recommend running the program in the background, for example:

           nohup perl 02_blast_queryseq.pl [options] &

    b. If the program stops before it has finished, run redo.pl to
       generate a new list file named "redolist":

           perl redo.pl -q querylist -l step2.log -o redolist

       Then rerun 02_blast_queryseq.pl with redolist as the -l
       parameter.

    (3) Identify the related repetitive sequences with precise boundaries

    Usage: perl 03_identify_RS.pl -l query_list -g genome_seq [-r known_repetitive_seq] [-n minimum_number] [-m max_hit_length]

    Option  Required  Default  Description
    -l      Yes       -        "querylist" file of query sequences.
    -g      Yes       -        Genome sequences file name (FASTA format).
    -r      No        -        File name of known MITEs or other
                               repetitive sequences (FASTA format).
    -n      No        5        Minimum number of sequences in a group of
                               related sequences. A higher value (e.g.
                               10) is recommended for large genomes.
    -m      No        1000     Maximum length of related sequences.
    Output: a folder named "$target_path/rspb_group" containing ".aln"
    files of multiple sequence alignments (MSAs) with 100 bp flanking
    sequences, and a log file "$target_path/step3.log".

    *Note: This step usually takes more than three days, and can take
    more than half a month for a large genome. See the notes for step 2.

    After the three steps above have finished, the directory
    "$target_path/genome.query" can be deleted to save disk space.

    Steps (4) and (5) must be done manually.

    (4) Open each ".aln" file (GeneDoc, BioEdit or a similar viewer is
        recommended) and check whether the MSA has typical TIR and TSD
        sequences.

        *Note: In our analysis of the rice genome, a typical MITE
        sequence is shorter than 800 bp.

    (5) Choose one typical MITE from each identified MITE group as a
        seed (excluding the flanking sequences) and assign the seeds to
        families. Then rename each seed sequence with its family name.
        If one family has several members, append a dash "-" and a
        number to the family name, e.g. hAT3-2.

    (6) Use RepeatMasker to scan the genome with the seed sequences as
        the reference library. I used these parameters:

            RepeatMasker genome_seq -lib seeds -nolow -no_is -s -cutoff 250 -pa 8

    (7) Parse the RepeatMasker results to get all MITEs in the genome

    Usage: perl 07_get_mite_from_RM.pl -s seeds_file -R RepeatMasker_ori.out -o mite_list_file [-l max_MITE_length]

    Option  Required  Default  Description
    -s      Yes       -        Seeds file (FASTA format).
    -R      Yes       -        The .ori.out file of the RepeatMasker
                               output.
    -o      Yes       -        MITE list file.
    -l      No        800      Maximum length of MITE elements.

    Output: a set of list files.

6. Output format of MITE list:

# family_name     Chr_number  Start_in_chr  End_in_chr  Description
-----------------------------------------------------------------------
PIF/Harbinger_2   Chr01       33443188      33443342    partial
PIF/Harbinger_2   Chr01       33465353      33465557    partial
PIF/Harbinger_2   Chr01       33466538      33466779    full
PIF/Harbinger_2   Chr01       33533125      33533342    partial
PIF/Harbinger_2   Chr01       33541019      33541230    partial
PIF/Harbinger_2   Chr01       33613820      33614055    full
PIF/Harbinger_2   Chr01       33658933      33659129    partial
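
The MITE list is plain whitespace-separated text, so it can be summarized with standard Unix tools. The following is a minimal sketch and is not part of the RSPB package: the function name summarize_mite_list and the file name mite_list.txt are made up for illustration, and the code assumes the five-column layout shown above, with "full" or "partial" in the last column. It counts full and partial copies per family.

```shell
# Hypothetical helper (not shipped with RSPB): summarize a MITE list file.
# Assumed columns: family_name Chr_number Start_in_chr End_in_chr Description
# Prints one tab-separated line per family: family, full count, partial count.
summarize_mite_list() {
    # $1: path to a MITE list file
    awk '
        # Skip the "#" header, the dashed rule, and blank or short lines.
        /^#/ || /^-+$/ || NF < 5 { next }
        $5 == "full"    { full[$1]++ }
        $5 == "partial" { partial[$1]++ }
        { seen[$1] = 1 }
        END {
            for (f in seen)
                printf "%s\t%d\t%d\n", f, full[f], partial[f]
        }
    ' "$1" | LC_ALL=C sort   # fixed locale so the family order is stable
}
```

Usage: summarize_mite_list mite_list.txt. For the seven sample rows shown above, this prints PIF/Harbinger_2 with 2 full and 5 partial copies.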