Rather than sitting on my hard disk getting dusty, I thought I should start publishing the bioinformatics scripts that I’ve written over the past few years of my PhD.
The first to go up is a Perl script called “Genome RE Sites” – it searches a genome of your choice for a restriction endonuclease recognition site and outputs the co-ordinates of all cut sites.
[ Update ] : You can find an online version of this tool here.
I use a technique called Chromosome Conformation Capture, which uses restriction enzymes, so I frequently need to generate a list of cut sites to help me analyse data.
This script needs to be run on the command line, either on a linux system or using something like ActivePerl on Windows. Usage of this script is:
perl genome_RE_sites.pl [output file] [search string]
..where [output file] is the filename that will hold your results and [search string] is the restriction enzyme recognition site. The latter can be ignored and will default to HindIII (AAGCTT). There are a few other configuration options that you’ll need to edit before using the script, such as location of downloaded genome sequences and chromosome names. Please leave a comment if you have any problems with these.
Download the code here, or copy and paste from below:
#!/usr/bin/perl
use warnings;
use strict;
#############################################################
# Name: Genome RE Sites #
# Author: Phil Ewels #
# Version 1.0 - 25/05/2011 #
# --------------------------------------------------------- #
# Outputs a file with locations of restriction endonuclease #
# sites (not resulting fragments) #
# --------------------------------------------------------- #
# Genome RE Sites is licensed under a Creative Commons #
# Attribution-ShareAlike 3.0 Unported License. #
# Based on a work at www.tallphil.co.uk. #
#############################################################
#===============================
#==== CONFIGURATION OPTIONS ====
#===============================
# Set by command line (optional)
my ($output,$re_search) = @ARGV;
if (!defined $output) {
die "Usage is genome_RE_sites.pl [output file] [search string]\nLeave blank to use defaults\n";
} elsif(!defined $re_search) {
# Restriction site to use
$re_search = 'AAGCTT'; # HindIII as default
warn "Using file defaults: search string = $re_search, output = $output\n";
} else {
warn "Using command line variables: search string = $re_search, output = $output\n";
}
# Path to chromosome fasta files. Replace chromosome number with %s
my $fn_base = 'D:\Genome Sequences\Mouse\chr%s.fa';
warn "Looking for Genome Sequences in $fn_base\n\n";
# Chromosomes to use. Default is (1..19,'X','Y') - other options are 'MT' etc.
my @chromosomes = (1..19,'X','Y');
#==================================
#== END OF CONFIGURATION OPTIONS ==
#==================================
open (OUT,'>',$output) or die $!;
# go through each chromosome
foreach my $chromosome (@chromosomes) {
my $filename = sprintf($fn_base, $chromosome);
open (IN,$filename) or die "Can't read file: $!";
warn "Starting Chromosome $chromosome ($filename)\n";
my $sequence = '';
$_ = ; # Remove fasta header
while (my $line = ) {
chomp ($line);
$sequence .= uc($line); # Make everything upper case
}
my $offset = 0;
my $pos = index($sequence, $re_search, $offset);
while ($pos != -1) {
print OUT $chromosome."\t". # Chromosome Name
$pos."\t". # Position Start
( $pos + length($re_search) )."\n"; # Position Finish
$offset = $pos + length($re_search);
$pos = index($sequence, $re_search, $offset);
}
close IN;
}
I hope to make this into a server-side app some time soon, to make it more user friendly (no downloading or customisation needed – just selection from some boxes and then downloading the results). Stay tuned for updates…
There is now an online version of this tool here, with a more user-friendly interface (no command-line knowledge needed!).