tallphil

Genome RE Sites

Wed, 19th Oct, 2011, 15:50:44  |  Category: Bioinformatics

Rather than sitting on my hard disk getting dusty, I thought I should start publishing the bioinformatics scripts that I’ve written over the past few years of my PhD.

The first to go up is a Perl script called “Genome RE Sites” – it searches a genome of your choice for a restriction endonuclease recognition site and outputs the co-ordinates of all cut sites.

[ Update ] : You can find an online version of this tool here.

I use a technique called Chromosome Conformation Capture, which uses restriction enzymes, so I frequently need to generate a list of cut sites to help me analyse data.

This script needs to be run on the command line, either on a linux system or using something like ActivePerl on Windows. Usage of this script is:

perl genome_RE_sites.pl [output file] [search string]

..where [output file] is the filename that will hold your results and [search string] is the restriction enzyme recognition site. The latter can be ignored and will default to HindIII (AAGCTT). There are a few other configuration options that you’ll need to edit before using the script, such as location of downloaded genome sequences and chromosome names. Please leave a comment if you have any problems with these.

Download the code here, or copy and paste from below:

#!/usr/bin/perl
use warnings;
use strict;
#############################################################
# Name: Genome RE Sites                                     #
# Author: Phil Ewels                                        #
# Version 1.0 - 25/05/2011                                  #
# --------------------------------------------------------- #
# Outputs a file with locations of restriction endonuclease #
# sites (not resulting fragments)                           #
# --------------------------------------------------------- #
# Genome RE Sites is licensed under a Creative Commons      #
# Attribution-ShareAlike 3.0 Unported License.              #
# Based on a work at www.tallphil.co.uk.                    #
#############################################################

#===============================
#==== CONFIGURATION OPTIONS ====
#===============================

# Set by command line (optional)
my ($output,$re_search) = @ARGV;
if (!defined $output) {
	die "Usage is genome_RE_sites.pl [output file] [search string]\nLeave blank to use defaults\n";
} elsif(!defined $re_search) {
	# Restriction site to use
	$re_search = 'AAGCTT'; # HindIII as default
	warn "Using file defaults: search string = $re_search, output = $output\n";
} else {
	warn "Using command line variables: search string = $re_search, output = $output\n";
}

# Path to chromosome fasta files. Replace chromosome number with %s
my $fn_base = 'D:\Genome Sequences\Mouse\chr%s.fa';
warn "Looking for Genome Sequences in $fn_base\n\n";

# Chromosomes to use. Default is (1..19,'X','Y') - other options are 'MT' etc.
my @chromosomes = (1..19,'X','Y');

#==================================
#== END OF CONFIGURATION OPTIONS ==
#==================================

open (OUT,'>',$output) or die $!;

# go through each chromosome
foreach my $chromosome (@chromosomes) {
	my $filename = sprintf($fn_base, $chromosome);
	open (IN,$filename) or die "Can't read file: $!";
	warn "Starting Chromosome $chromosome ($filename)\n";
	my $sequence = '';
	$_ = ; # Remove fasta header
	while (my $line = ) {
		chomp ($line);
		$sequence .= uc($line); # Make everything upper case
	}

	my $offset = 0;
	my $pos = index($sequence, $re_search, $offset);
	while ($pos != -1) {
		print OUT $chromosome."\t". # Chromosome Name
			      $pos."\t". # Position Start
			    ( $pos + length($re_search) )."\n"; # Position Finish
		$offset = $pos + length($re_search);
		$pos = index($sequence, $re_search, $offset);
	}

	close IN;
}

2 Responses to “Genome RE Sites”

  1. Phil says:

    I hope to make this into a server-side app some time soon, to make it more user friendly (no downloading or customisation needed – just selection from some boxes and then downloading the results). Stay tuned for updates…

  2. Phil says:

    There is now an online version of this tool here, with a more user-friendly interface (no command-line knowledge needed!).

Leave a Reply

Live Preview