Stothard Research Group


We have completed sequencing of more than 400 cattle genomes!


BASys - performs automated, in-depth annotation of bacterial genomic (chromosomal and plasmid) sequences. It accepts raw DNA sequence data and an optional list of gene identification information and provides extensive textual and hyperlinked image output. BASys uses more than 30 programs to determine nearly 60 annotation subfields for each gene, including gene/protein name, GO function, COG function, possible paralogues and orthologues, molecular weight, isoelectric point, operon structure, subcellular localization, signal peptides, transmembrane regions, secondary structure, 3-D structure, reactions, and pathways. The textual annotations and images that are provided by BASys can be generated in approximately 24 hours for an average bacterial chromosome (5 megabases).
Availability: http://wishart.biology.ualberta.ca/basys/

BacMap - an interactive visual database containing all publicly available bacterial genomes. A fully labeled and zoomable genome map is provided for each genome. Sequence and text queries can be used to identify genes of interest, or maps can be navigated using a simple interface. BacMap is designed to serve as an intuitive and convenient tool for identifying orthologues and paralogues, studying operon conservation, and determining gene function.
Availability: http://bacmap.wishartlab.com/

CGView Server - a comparative genomics tool for circular genomes (plasmid, bacterial, mitochondrial, and chloroplast) that allows sequence feature information to be visualized in the context of sequence analysis results. A genome sequence is supplied to the program in FASTA, GenBank, EMBL, or raw format. Up to three comparison sequences (or sequence sets) in FASTA format can also be submitted. The CGView Server uses BLAST to compare the genome sequence to the comparison sequences, and then converts the results and any available feature information (from the GenBank, EMBL, or optional GFF file) or analysis information (from an optional GFF file) into a high-quality graphical map showing the entire genome sequence, or a zoomed view of a region of interest. Several options are available for specifying how the BLAST comparisons are conducted, and for controlling how results are displayed.
Availability: http://stothard.afns.ualberta.ca/cgview_server/

DrugBank - a unique bioinformatics/cheminformatics resource that combines detailed drug (i.e. chemical) data with comprehensive drug target (i.e. protein) information.
Availability: http://www.drugbank.ca/

GView - a Java package used to display and navigate bacterial genomes. GView is useful for producing high-quality genome maps for use in publications and websites, or as a visualization tool in a sequence annotation pipeline. Users can interact with the genome using a powerful pan-and-zoom interface, or GView can write static images of a genome to a file. GView can draw a genome using either circular or linear layouts, with additional layout types planned for future release.
Availability: https://www.gview.ca/

HMDB - a freely available electronic database containing detailed information about small molecule metabolites found in the human body. It is intended to be used for applications in metabolomics, clinical chemistry, biomarker discovery and general education.
Availability: http://www.hmdb.ca/

PolySearch - supports more than 50 different classes of queries against nearly a dozen different types of text, scientific abstract or bioinformatic databases. The typical query supported by PolySearch is "Given X, find all Y's" where X or Y can be diseases, tissues, cell compartments, gene/protein names, SNPs, mutations, drugs and metabolites. PolySearch also exploits a variety of techniques in text mining and information retrieval to identify, highlight and rank informative abstracts, paragraphs or sentences.
Availability: http://wishart.biology.ualberta.ca/polysearch/

PROSESS - a web server designed to evaluate and validate protein structures solved by either X-ray crystallography or NMR spectroscopy. PROSESS integrates a variety of previously developed, well-known and thoroughly tested methods.
Availability: http://www.prosess.ca/

Sequence Extractor - generates a clickable restriction map and PCR primer map of a DNA sequence. Protein translations and intron/exon boundaries are also shown. Use Sequence Extractor to build DNA constructs in silico.
Availability: http://bioinformatics.org/seqext/

The Sequence Manipulation Suite - a collection of simple programs for generating, formatting, and analyzing short DNA and protein sequences. The Sequence Manipulation Suite is commonly used by molecular biologists, for teaching purposes, and for program and algorithm testing.
Availability: http://bioinformatics.org/sms2/


backup.sh - this shell script archives directories of interest on a Linux-based system. When it is first run, and on the first of each month, this script generates a full backup of the files and directories listed in the include.conf file. Files and directories listed in the exclude.conf file are not included in the archive. These full backups are not overwritten by future backups. Each Sunday the script performs a full backup that is overwritten the following Sunday. Every day the script performs an incremental backup, storing the files that have changed since the last full backup. These incremental backup files are named after the day of the week they are performed, and are overwritten each week. The script sends an email on the first of each month, and whenever any backup fails. The script splits each full backup into a series of smaller files, suitable for burning to CD or DVD. When a full backup is generated, the MD5 hash value of the complete backup file is written to a README file in the same directory as the backup files. Included in the README are directions for assembling the split backup files into the original file.
Developed by: Paul Stothard.
Availability: backup_script.zip, backup_script.tar.gz.
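
The rotation described above can be sketched in Python. This is an illustration only, not the shell script's actual code: the function name, the label format, and the decision to run the monthly, weekly, and incremental backups together when their dates coincide are all assumptions.

```python
from datetime import date

# Day names indexed by date.weekday(), where Monday is 0 and Sunday is 6.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def backups_due(today: date, first_run: bool = False) -> list[str]:
    """Return hypothetical labels for the backups to perform on a given day."""
    due = []
    # Full backup on the first run and on the first of each month; the
    # year-month in the label means it is never overwritten.
    if first_run or today.day == 1:
        due.append(f"full-monthly-{today.year:04d}-{today.month:02d}")
    # Full backup each Sunday; the fixed label means next Sunday's run
    # overwrites it.
    if today.weekday() == 6:
        due.append("full-weekly")
    # Incremental backup every day, named after the weekday and therefore
    # overwritten one week later.
    due.append(f"incremental-{DAY_NAMES[today.weekday()]}")
    return due
```

For example, `backups_due(date(2024, 9, 4))` (a Wednesday) yields only the incremental backup, while a Sunday that falls on the first of a month yields all three labels.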

blast_hit_features.pl - this Perl script accepts BLAST results obtained from local_blast_client.pl or remote_blast_client.pl. The results must have been obtained using blastn, tblastn, or tblastx searches (i.e. nucleotide databases), since this script uses GenBank files to obtain feature information for sequence hits. For each entry in the BLAST results, the GI number of the hit, if available, is used to obtain the corresponding sequence record from NCBI, in GenBank format. The features in the GenBank file are compared to the coordinates of the HSP, and features overlapping with the HSP are added to the existing BLAST results. The modified results are written to a new file. The three nearest features preceding the HSP (located to the left of the HSP) and the three nearest features located after the HSP (located to the right of the HSP) are also added to the output.
Developed by: Paul Stothard.
Availability: blast_hit_features.zip, blast_hit_features.tar.gz.
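
The feature selection described above (features overlapping the HSP, plus the three nearest on each side) can be sketched in Python. This is a simplified illustration, not the Perl script's code: features are assumed to be (start, end, name) tuples with start <= end, and the function and key names are hypothetical.

```python
def annotate_hsp(hsp_start, hsp_end, features, flank_count=3):
    """Group features by their position relative to an HSP.

    `features` is a list of (start, end, name) tuples using 1-based,
    inclusive coordinates on the hit sequence.
    """
    # HSP coordinates may be given in either order (e.g. minus-strand hits).
    low, high = sorted((hsp_start, hsp_end))
    # Two closed intervals overlap when each starts before the other ends.
    overlapping = [f for f in features if f[0] <= high and f[1] >= low]
    # Features ending before the HSP, nearest first.
    preceding = sorted((f for f in features if f[1] < low),
                       key=lambda f: f[1], reverse=True)[:flank_count]
    # Features starting after the HSP, nearest first.
    following = sorted((f for f in features if f[0] > high),
                       key=lambda f: f[0])[:flank_count]
    return {"overlapping": overlapping,
            "preceding": preceding,
            "following": following}
```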

blast_hit_flanking_sequence.pl - this Perl script accepts blastn search results obtained from local_blast_client.pl or remote_blast_client.pl. In addition to the BLAST results file, the script requires the query sequences and database sequences in FASTA format. For each BLAST result, the script constructs a modified query sequence, in which the query is extended using sequence extracted from the hit sequence. The amount of hit sequence added to the ends of the query can be specified using the -u and -d options.
Developed by: Paul Stothard.
Availability: blast_hit_flanking_sequence.zip, blast_hit_flanking_sequence.tar.gz.

blast_hits_in_ucsc_genome_browser.pl - this Perl script accepts BLAST results obtained from local_blast_client.pl or remote_blast_client.pl. The results must have been obtained using blastn, tblastn, or tblastx searches (i.e. nucleotide databases) and database sequences downloaded from the UCSC Genome Browser site (http://hgdownload.cse.ucsc.edu/downloads.html). The BLAST results are converted to annotation files for the UCSC Genome Browser, and a separate HTML file containing links to each feature in the annotation files is created. Clicking on a link in the HTML file loads the genome region involving the BLAST HSP into the UCSC Genome Browser and passes the annotations in the relevant annotation file to the browser for inclusion in the view.
Developed by: Paul Stothard.
Availability: blast_hits_in_ucsc_genome_browser.zip, blast_hits_in_ucsc_genome_browser.tar.gz.

build_cluster_script.pl - this Perl script creates an executable shell script with the specified command repeated n times. Every occurrence of the '$' symbol in the command is replaced by a number, from 1 to n. Alternatively, the "-l" option causes letters to be used in place of numbers (e.g. 'aa' instead of '1', 'ab' instead of '2'). This script can be used to generate scripts for batch processing on a computer cluster.
Developed by: Jason Grant and Paul Stothard.
Availability: build_cluster_script.zip, build_cluster_script.tar.gz.
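
The substitution described above can be sketched in Python. This is an illustration, not the Perl script's code: the function names are hypothetical, and the assumption that letter labels are fixed-width two-letter strings (so 27 becomes 'ba') extends the 'aa' and 'ab' examples given in the description.

```python
from string import ascii_lowercase


def letter_label(index: int, width: int = 2) -> str:
    """Convert 1 -> 'aa', 2 -> 'ab', ... using fixed-width base-26 letters."""
    value = index - 1
    if not 0 <= value < 26 ** width:
        raise ValueError(f"index {index} does not fit in {width} letters")
    chars = []
    for _ in range(width):
        value, remainder = divmod(value, 26)
        chars.append(ascii_lowercase[remainder])
    # Digits were produced least-significant first, so reverse them.
    return "".join(reversed(chars))


def build_cluster_script(command: str, n: int, use_letters: bool = False) -> str:
    """Return shell script text with `command` repeated n times.

    Every '$' in the command is replaced by the repeat's number or letters.
    """
    lines = ["#!/bin/sh"]
    for i in range(1, n + 1):
        label = letter_label(i) if use_letters else str(i)
        lines.append(command.replace("$", label))
    return "\n".join(lines) + "\n"
```

Calling `build_cluster_script("./analyze input_$.txt > output_$.txt", 3)`, where `./analyze` is a placeholder command, produces a three-line script operating on input_1.txt through input_3.txt.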

cDNA_library_entropy.pl - this Perl script accepts a directory containing one or more sequence files in multi-FASTA format. Typically each file will contain the sequences obtained from a single EST library or tissue type. By default the script looks for a UniGene annotation identifier in each sequence title, for example 'Bt.22094'. A different ID type can be specified using the -m option. The number of sequences present for each ID is determined. The script uses these counts to calculate the information entropy of the library in bits. This value increases as the number of distinct sequences in a library increases, and decreases as the number of replicates of a particular sequence increases. The -d option can be used to obtain the information entropy of combinations of libraries. For example, specifying '-d 2' causes all possible combinations of two libraries to be evaluated. This script is intended to aid in the selection of tissues for SNP discovery by mRNA sequencing.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.
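
The entropy calculation described above can be sketched in Python using the standard Shannon formula, H = -sum(p * log2(p)), where p is the fraction of sequences carrying a given ID. This is an illustration rather than the Perl script's code: the ID regular expression is a guess based on the 'Bt.22094' example, and pooling the counts of each library combination is an assumption about how the -d option works.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Assumed shape of a UniGene-style ID such as 'Bt.22094'.
ID_PATTERN = re.compile(r"\b[A-Z][a-z]{1,3}\.\d+\b")


def id_counts(fasta_lines):
    """Count how many FASTA title lines carry each annotation ID."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            match = ID_PATTERN.search(line)
            if match:
                counts[match.group()] += 1
    return counts


def entropy_bits(counts):
    """Shannon entropy, in bits, of the ID frequency distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def combined_entropies(libraries, d):
    """Entropy of every combination of d libraries, with counts pooled.

    `libraries` maps a library name to its Counter of ID counts.
    """
    return {
        names: entropy_bits(sum((libraries[n] for n in names), Counter()))
        for names in combinations(sorted(libraries), d)
    }
```

A library with four distinct IDs seen once each has an entropy of 2 bits, whereas a library consisting of four copies of one ID has an entropy of 0 bits, matching the behaviour described above.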

CGView - a Java package for generating high quality, zoomable maps of circular genomes. Its primary purpose is to serve as a component of sequence annotation pipelines. Feature information and rendering options are supplied to the program using an XML file, a tab delimited file, or an NCBI ptt file. CGView converts the input into a graphical map (PNG, JPG, or Scalable Vector Graphics format), complete with labels, a title, and legends. In addition to the default full view map, the program can generate a series of hyperlinked maps showing expanded views. The linked maps can be explored using any web browser, allowing rapid genome browsing, and facilitating data sharing. The feature labels in maps can be hyperlinked to external resources, allowing CGView maps to be integrated with existing web site content or databases.
Developed by: Paul Stothard.
Availability: http://bioinformatics.org/cgview/

CGView Comparison Tool (CCT) - a package for visually comparing bacterial, plasmid, chloroplast, or mitochondrial sequences of interest to existing genomes or sequence collections. The comparisons are conducted using BLAST, and the BLAST results are presented in the form of graphical maps that can also show sequence features, gene and protein names, COG category assignments, and sequence composition characteristics. CCT can generate maps in a variety of sizes, including 400 Megapixel maps suitable for posters. Comparisons can be conducted within a particular species or genus, or all available genomes can be used. The entire map creation process, from downloading sequences to redrawing zoomed maps, can be completed easily using scripts included with the CCT. User-defined features or analysis results can be included on maps, and maps can be extensively customized. To simplify program setup, a CCT virtual machine that includes all dependencies preinstalled is available. Detailed tutorials illustrating the use of CCT are included with the CCT documentation.
Developed by: Paul Stothard and Jason Grant.
Availability: http://stothard.afns.ualberta.ca/downloads/CCT/

cgview_xml_builder.pl - this Perl script accepts a variety of input files pertaining to circular genomes and generates an XML file for the CGView genome drawing program. This script can create the XML to display a variety of sequence composition plots, gene expression data, COG information, BLAST results, and more. See the included README file for additional information.
Developed by: Paul Stothard.
Availability: cgview_xml_builder.zip, cgview_xml_builder.tar.gz.

combine_output_files.pl - this Perl script combines files that are part of a file series (created by split_records.pl for example). Several options are available for controlling how comments and header lines are handled. This script can be used to combine results files generated on a computer cluster.
Developed by: Paul Stothard.
Availability: combine_output_files.zip, combine_output_files.tar.gz.

genome_pattern_search.pl - a Perl program that reads a genomic sequence in FASTA format and searches for the patterns you specify using regular expressions. A summary is generated for each sequence match, including: the sequence fragment that matched the pattern; the position of the first base; the position of the last base; the strand on which the match was found; the name of the gene containing the match or "not in gene"; the name of the nearest downstream gene; a description of the gene; the distance to the nearest downstream gene; the total times this exact sequence was found; the percentage of the instances of this exact sequence that were found inside of genes; and the average number of base pairs to the downstream gene for this exact sequence.
Developed by: Paul Stothard.
Availability: genome_pattern_search.zip, genome_pattern_search.tar.gz.
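
The core of such a search, matching a regular expression on both strands and reporting the match in forward-strand coordinates, can be sketched in Python. This is an illustration, not the Perl program's code: the function names are hypothetical, matches are non-overlapping (as returned by `re.finditer`), and the gene-related summary fields are omitted.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def find_matches(sequence: str, pattern: str):
    """Return (fragment, first_base, last_base, strand) for each match.

    Positions are 1-based and always refer to the forward strand.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    length = len(sequence)
    hits = []
    for match in regex.finditer(sequence):
        hits.append((match.group(), match.start() + 1, match.end(), "+"))
    for match in regex.finditer(reverse_complement(sequence)):
        # Convert reverse-strand offsets back to forward-strand numbering.
        first = length - match.end() + 1
        last = length - match.start()
        hits.append((match.group(), first, last, "-"))
    return sorted(hits, key=lambda hit: (hit[1], hit[3]))
```

A palindromic site such as GAATTC is reported once per strand at the same coordinates, while a pattern such as "AAA+" searched against a run of T's is reported only on the "-" strand.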

get_cds.pl - this Perl script accepts a GenBank or EMBL file and extracts the protein translations or the DNA coding sequences and writes them to a new file in FASTA format. Information indicating the reading frame and position of the coding sequence relative to the source sequence is added to the titles.
Developed by: Paul Stothard.
Availability: get_cds.zip, get_cds.tar.gz.

get_genes_in_area.pl - this Perl script accepts as input a position or list of positions in a genome and returns descriptions of nearby genes. The descriptions include position and function information, along with identifiers that can be used to access related records in other databases.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.

get_orfs.pl - this Perl script accepts a sequence file as input and extracts the open reading frames (ORFs) greater than or equal to the size you specify. The resulting ORFs can be returned as DNA sequences or as protein sequences. The titles of the sequences include start, stop, strand, and reading frame information. The sequence numbering includes the stop codon (when encountered) but the translations do not include a stop codon character.
Developed by: Paul Stothard.
Availability: get_orfs.zip, get_orfs.tar.gz.
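
A simplified version of ORF extraction can be sketched in Python. This is not the Perl script's code, and several choices are assumptions: an ORF is taken to run from an ATG to the next in-frame stop codon, only the forward strand is scanned, ORFs lacking a stop codon are ignored, the minimum size is measured in bases, and the standard genetic code is used. As in the description above, the reported coordinates include the stop codon but the translation does not.

```python
from itertools import product

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}


def find_orfs(sequence: str, min_length: int = 30):
    """Return forward-strand ORFs of at least min_length bases."""
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3  # the stop codon is part of the ORF coordinates
                if end - start >= min_length:
                    dna = seq[start:end]
                    # Translate every codon except the final stop codon;
                    # codons with ambiguous bases become 'X'.
                    protein = "".join(
                        CODON_TABLE.get(dna[i:i + 3], "X")
                        for i in range(0, len(dna) - 3, 3)
                    )
                    orfs.append({"start": start + 1, "stop": end,
                                 "strand": "+", "frame": frame + 1,
                                 "dna": dna, "protein": protein})
                start = None
    return orfs
```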

get_snps_by_gene_ontology.pl - this Perl script accepts a species name and a Gene Ontology (GO) accession number, and returns a list of SNPs located in or near genes associated with the GO accession. Several fields of information are provided for each SNP, including ID, location, flanking sequence, and alleles. Gene and transcript identifiers and descriptions of gene function are also provided.
Developed by: Paul Stothard.
Availability: get_snps_by_gene_ontology.zip, get_snps_by_gene_ontology.tar.gz.

local_blast_client.pl - this Perl script accepts a FASTA file containing multiple sequences as input. It then submits each sequence to a locally installed version of the blastall program. For each of the hits obtained, the script retrieves a descriptive title by performing a separate Entrez search of NCBI's databases. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.
Developed by: Paul Stothard.
Availability: local_blast_client.zip, local_blast_client.tar.gz.

md5_sums.pl - this Perl script accepts a list of directories and recursively generates a list of the files in the directories and their MD5 values. An optional list of directories and files to exclude from the calculation can also be supplied. The MD5 calculation can be skipped for large files, using the optional size parameter.
Developed by: Jason Grant.
Availability: md5_sums.zip, md5_sums.tar.gz.
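
The recursive checksum listing can be sketched in Python with the standard library. This is an illustration, not the Perl script's code: the function and parameter names are hypothetical, and recording skipped large files with a value of None is an assumption about how they might be reported.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def md5_sums(directories, exclude=(), max_size=None):
    """Map each file path under `directories` to its MD5 value.

    Paths listed in `exclude` (files or directories) are skipped. Files
    larger than `max_size` bytes are listed with None instead of a digest.
    """
    excluded = {os.path.abspath(p) for p in exclude}
    results = {}
    for top in directories:
        for dirpath, dirnames, filenames in os.walk(top):
            # Prune excluded directories in place so os.walk skips them.
            dirnames[:] = sorted(
                d for d in dirnames
                if os.path.abspath(os.path.join(dirpath, d)) not in excluded
            )
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                if os.path.abspath(path) in excluded:
                    continue
                if max_size is not None and os.path.getsize(path) > max_size:
                    results[path] = None
                    continue
                results[path] = md5_of_file(path)
    return results
```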

ncbi_monitor.pl - this Perl script performs NCBI Entrez searches to identify publications related to genomic regions of interest in a species of interest. More specifically, this script accepts an organism name, chromosome name, and base position as input. It then retrieves the IDs for all Entrez Gene records located within a certain distance of the base position (the distance can be adjusted using the -f option). For each Gene record the script obtains IDs of PubMed records identified by NCBI as being related to the Gene record. If the script has previously written output to the specified output directory (i.e. the directory supplied using the -o option), it examines the previously obtained PubMed IDs to see which IDs are new. An email message describing the newly obtained records is then sent to the email address supplied using the -e option. The PubMed results are also written to a file in the output directory. If the -h option is specified, NCBI's HomoloGene database is also queried for each Gene record, in an attempt to obtain additional PubMed records, linked to the HomoloGene hits. These PubMed records may describe results obtained in other species, but could be relevant nonetheless.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.

ncbi_search.pl - this Perl script uses NCBI's Entrez Programming Utilities to perform searches of NCBI databases. The script can return complete database records, or sequence IDs.
Developed by: Paul Stothard.
Availability: ncbi_search.zip, ncbi_search.tar.gz.

NGS-SNP - this collection of scripts annotates raw SNP lists returned from programs such as Maq. SNPs are classified as synonymous, nonsynonymous, 3' UTR, etc. regardless of whether or not they match existing SNP records. Included among the annotations, several of which are not available from any existing SNP annotation tools, are the results of detailed comparisons with orthologous sequences. These comparisons allow, for example, SNPs to be sorted or filtered based on how drastically the SNP changes the score of a protein alignment. Other fields indicate whether or not the SNP-altered residue exhibits co-evolution with other residues in the protein, the names of overlapping protein domains or features, and the conservation of both the SNP site and flanking regions. NCBI, Ensembl, and UniProt IDs are provided for genes, transcripts, and proteins when applicable, along with Gene Ontology terms, a gene description, phenotypes linked to the gene, and an indication of whether the SNP is novel or known. A "Model_Annotations" field provides several annotations obtained by transferring in silico the SNP to an orthologous gene, typically in a well-characterized species.
Developed by: Paul Stothard, Jason Grant, and Xiaoping Liao.
Availability: NGS-SNP.

obtain_reference_transcripts.pl - this Perl script builds a FASTA file consisting of the canonical transcripts for all the genes in Ensembl for a given organism. The canonical transcript is defined as either the longest CDS (if the gene encodes a protein), or as the longest cDNA. Ensembl gene entries can be associated with many transcripts; this script aims to get the "best" single transcript for each gene. The resulting FASTA file is suitable for sequence searches and for mapping sequence reads derived from cDNAs. The -a option can be used to specify that all transcripts should be downloaded, not just the canonical ones.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.
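
The selection rule described above (longest CDS for protein-coding genes, otherwise longest cDNA) can be sketched in Python. This is an illustration only: the `Transcript` class and its fields are hypothetical and do not correspond to the Ensembl API, and ties are resolved by taking the first transcript encountered.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical minimal record for one transcript of a gene."""
    transcript_id: str
    cdna_length: int
    cds_length: int = 0  # 0 means the transcript has no coding sequence


def pick_canonical(transcripts):
    """Pick the single 'best' transcript for a gene."""
    coding = [t for t in transcripts if t.cds_length > 0]
    if coding:
        # Protein-coding gene: the longest CDS wins.
        return max(coding, key=lambda t: t.cds_length)
    # Non-coding gene: fall back to the longest cDNA.
    return max(transcripts, key=lambda t: t.cdna_length)
```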

obtain_reference_chromosomes.pl - this Perl script builds a FASTA file consisting of the chromosome sequences in Ensembl for a given organism. The resulting FASTA file is suitable for sequence searches and for mapping sequence reads.
Developed by: Paul Stothard.
Availability: Included in the NGS-SNP package.

random_sequence_sample.pl - this Perl script selects a random sample of sequences from a FASTA file containing multiple sequences. The sample is written to a new text file. Sampling can be performed with or without replacement.
Developed by: Paul Stothard.
Availability: random_sequence_sample.zip, random_sequence_sample.tar.gz.
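
Sampling records from a multi-sequence FASTA file, with or without replacement, can be sketched in Python. This is an illustration, not the Perl script's code: the function names and the optional `seed` parameter are assumptions added for reproducibility.

```python
import random


def read_fasta(lines):
    """Parse FASTA lines into a list of (title, sequence) tuples."""
    records = []
    title, parts = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(parts)))
            title, parts = line[1:], []
        elif line and title is not None:
            parts.append(line)
    if title is not None:
        records.append((title, "".join(parts)))
    return records


def sample_records(records, size, replace=False, seed=None):
    """Return `size` randomly chosen records.

    With replace=True a record may be chosen more than once; otherwise
    `size` must not exceed the number of records.
    """
    rng = random.Random(seed)
    if replace:
        return rng.choices(records, k=size)
    return rng.sample(records, size)
```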

remote_blast_client.pl - this Perl script accepts a FASTA file containing multiple sequences as input. It submits each sequence to NCBI's BLAST servers, to identify related sequences in a database of interest. An optional 'limit by entrez query' value can be supplied to restrict the search. For each BLAST hit a descriptive title is obtained using a separate Entrez search. Each BLAST hit and its descriptive title are written to a single tab-delimited output file.
Developed by: Paul Stothard.
Availability: remote_blast_client.zip, remote_blast_client.tar.gz.

remote_in_silico_pcr.rb - this Ruby script accepts as input a list of primer sequences and uses the remote "UCSC In-Silico PCR" site to perform in silico PCR on the specified genome. By default only the top hit is returned for each primer pair; all the hits can be returned by using the '-m' option.
Developed by: Jason Grant.
Availability: remote_in_silico_pcr.zip, remote_in_silico_pcr.tar.gz.

space_check.sh - this shell script monitors hard drive space and sends an email when space becomes scarce. On the first day of each month the script sends an email report of hard drive space.
Developed by: Paul Stothard.
Availability: space_check.zip, space_check.tar.gz.

split_records.pl - this Perl script splits an input file into multiple output files, to allow analysis jobs to be divided among nodes in a computer cluster. Several options are included for handling header lines, for specifying the record separator, and for controlling how files are named.
Developed by: Jason Grant and Paul Stothard.
Availability: split_records.zip, split_records.tar.gz.