Patterns in sequences
Searching for information within sequences.
Most common problems and their solutions:
Question: I have a gene sequence. I want to know where the restriction sites are.
Solution: Search your DNA for restriction sites
[WebCutter]
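As a rough illustration of what a restriction-site search does, the sketch below scans a sequence for three well-known recognition sites. The enzyme table is a tiny hand-picked sample and the function name `find_restriction_sites` is made up for this example; it is not how WebCutter is implemented. Because these sites are palindromic, searching one strand is enough.

```python
# Recognition sites of three common enzymes (a small sample, not a full enzyme list).
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}

def find_restriction_sites(dna, sites=SITES):
    """Return {enzyme: [1-based start positions]} for every site found in dna."""
    dna = dna.upper()
    hits = {}
    for enzyme, site in sites.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + 1)          # report 1-based coordinates
            start = dna.find(site, start + 1)    # continue, allowing overlaps
        if positions:
            hits[enzyme] = positions
    return hits

print(find_restriction_sites("ttGAATTCaaGGATCCccGAATTC"))
# {'EcoRI': [3, 19], 'BamHI': [11]}
```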
Question: I have a gene sequence. I want to
know where the coding regions are.
This is a non-trivial problem, particularly for higher eukaryotes with
complex exon-intron structure and highly variable GC-content.
Simple solution:
Search for Open Reading Frames (ORFs).
[ORF Finder]
[Alces Webtranslator]
[FramePlot]
[AAT]
Simple programs look only for start and stop codons and show
you the regions that could, in principle, code for protein. This method fails in
eukaryotes because of introns, and even in prokaryotes the existence of an ORF
does not prove the existence of a protein-coding gene. Both ORF Finder
and FramePlot let you search your ORF product against
general databases to find homologous genes. A database hit gives you additional
evidence that the coding region is real.
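The start/stop scanning described above can be sketched in a few lines. This is a minimal illustration, not the algorithm of ORF Finder or any other server listed here; the function name `find_orfs` and the `min_codons` threshold are assumptions made for the example. It reports the first ATG in each frame up to the next in-frame stop, on both strands.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def find_orfs(dna, min_codons=30):
    """Yield (strand, frame, start, end) for ATG...stop stretches in six frames.

    start and end are 0-based offsets on the scanned strand; end is exclusive
    and includes the stop codon. min_codons counts the stop codon as well.
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    for strand, seq in strands.items():
        for frame in range(3):
            orf_start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if orf_start is None and codon == "ATG":
                    orf_start = i                      # first start codon wins
                elif orf_start is not None and codon in STOPS:
                    if (i + 3 - orf_start) // 3 >= min_codons:
                        yield strand, frame, orf_start, i + 3
                    orf_start = None                   # look for the next ORF
```

For example, `list(find_orfs("CCATGGCTGCTGCTTAAG", min_codons=3))` yields a single ORF on the plus strand in frame 2. An ORF found this way is only a candidate; as noted above, it still needs supporting evidence such as a database hit.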
In GC-rich bacterial genomes stop codons are rare and the beginnings of ORFs are
extremely difficult to find. In this case it is wise to use FramePlot. It
calculates the GC content at the 3rd codon position (GC3) for all codons of a
potential ORF and plots it against the average GC content. Regions with higher
GC3 are likely to be coding regions.
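To make the GC3 idea concrete, here is a small sketch that computes GC content separately for the three codon positions of an in-frame sequence. The function names are hypothetical, and FramePlot's actual windowing and plotting are not reproduced here.

```python
def gc_fraction(bases):
    """Fraction of G or C in a string of bases (0.0 for an empty string)."""
    bases = bases.upper()
    return sum(b in "GC" for b in bases) / len(bases) if bases else 0.0

def gc_by_codon_position(orf):
    """Return (GC1, GC2, GC3) for an in-frame coding sequence."""
    usable = orf[: len(orf) - len(orf) % 3]      # drop an incomplete last codon
    return tuple(gc_fraction(usable[pos::3]) for pos in range(3))

gc1, gc2, gc3 = gc_by_codon_position("ATGGCCGCG")
print(round(gc1, 2), round(gc2, 2), round(gc3, 2))   # 0.67 0.67 1.0
```

Comparing GC3 with `gc_fraction` of the whole region gives the kind of contrast the text describes: in a GC-rich genome, the correct reading frame tends to show the highest GC3.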
The real solution: Find ORFs with complex mathematical methods.
[See the special section on this page]
FIND FUNCTIONAL DOMAINS, PROMOTERS, SPLICING SITES, SIGNALS AND PATTERNS IN YOUR SEQUENCE
In some cases the sequence of an unknown protein is too distantly related to any
protein of known structure to detect its resemblance by overall sequence
alignment, but it can be identified by the occurrence in its sequence of a
particular cluster of residue types which is variously known as a pattern,
motif, signature, or fingerprint. These motifs arise because of particular
requirements on the structure of specific region(s) of a protein which may be
important, for example, for their binding properties or for their enzymatic
activity. These requirements impose very tight constraints on the evolution of
those limited (in size) but important portion(s) of a protein sequence.
Question: I have a protein sequence.
I want to know where it might contain functional domains, phosphorylation
sites, transport signals, ATP binding sites, enzyme active sites, etc.
Solution 1:
Search protein against PROSITE database.
[PPSRCH]
The current release of PROSITE
contains 997 documentation entries that describe 1335
different patterns, rules and profiles/matrices. PROSITE is curated by hand: all
patterns and profiles are verified to represent real biological information.
Some servers also let you search the PROSITE prerelease database,
whose entries have not yet been confirmed.
Sequence motifs are stored
(and searched) either as a pattern, such as [FY]-C-R-N-P-[DNR], or as a profile.
A profile is a table giving, for each position, the probability of each amino
acid occurring there. Profiles are more sensitive and usually longer. Patterns
are simpler and shorter, but they do not detect rare exceptions to the common
consensus sequence.
Some examples of patterns:
C-x-[DN]-x(4)-[FY]-x-C-x-C
Aspartic acid and Asparagine hydroxylation site
x(k) means ANY k amino acids
[] means ANY of the enclosed amino acids is permitted
{DERK}(6)-[LIVMFWSTAG](2)-[LIVMFYSTAGCQ]-[AGS]-C
Prokaryotic membrane lipoprotein lipid attachment site.
{} means NONE of the enclosed amino acids is permitted
(k) means that the previous amino acid type is repeated k times
[KRHQSA]-[DENQ]-E-L>
Endoplasmic reticulum targeting sequence
> means the C-terminal
Glutamine amidotransferases class-II active site
< means the N-terminal
(k1,k2) means from k1 to k2 occurrences of the previous amino acid type
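The notation rules listed above map almost one-to-one onto regular expressions. The converter below is a minimal sketch under that assumption; the function name `prosite_to_regex` is made up here, and it handles only the syntax elements described on this page (x, [], {}, (k), (k1,k2), < and >).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().split("-"):
        prefix = suffix = ""
        if element.startswith("<"):              # '<' anchors to the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):                # '>' anchors to the C-terminus
            suffix, element = "$", element[:-1]
        # Split the residue specification from an optional (k) or (k1,k2) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                          # x = any amino acid
            core = "."
        elif core.startswith("{"):               # {..} = none of these residues
            core = "[^" + core[1:-1] + "]"
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)

print(prosite_to_regex("C-x-[DN]-x(4)-[FY]-x-C-x-C"))   # C.[DN].{4}[FY].C.C
print(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>"))         # [KRHQSA][DENQ]EL$
```

The resulting string can be passed to `re.search` together with a protein sequence in one-letter code, e.g. `re.search(prosite_to_regex("[KRHQSA]-[DENQ]-E-L>"), "MKSAKDEL")`.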
PPSRCH accepts multiple sequences as input. Analyze the results carefully: all
of these databases contain both eukaryotic and prokaryotic patterns.
Solution 2:
Search protein against BLOCKS database.
[Blocks WWW server]
Blocks are multiply aligned
ungapped segments corresponding to the most highly conserved regions of
proteins. Block Searcher, Get Blocks and BlockMaker are aids to the detection
and verification of protein sequence homology. They compare a protein
or DNA sequence to a database of protein blocks, retrieve blocks,
and create new blocks, respectively. BLOCKS also lets you search for a
suitable conserved region in which to design PCR primers!
BLOCKS (together with the smaller PRINTS database) currently consists of 5188
entries representing 1204 protein groups. Blocks are generated automatically
from protein database entries. BLOCKS entries are not checked manually,
so they might be of lower quality; on the other hand, BLOCKS is
currently the most comprehensive motif database.
Block Searcher does not accept multiple sequence entries, but it does accept DNA!
BLOCKS database entries are poorly documented. Fortunately, the search result
file contains links to similar entries in the PROSITE or PRINTS databases, which
are well documented.
Solution 3: Search your protein against the protein
superfamilies (Pfam) database.
[Pfam]
Pfam is a database of multiple alignments of protein domains
or conserved protein regions. Each represents an evolutionarily conserved
structure that has implications for the protein's function. Pfam is
built in two parts. Pfam-A consists of accurate, hand-curated multiple
alignments, whereas Pfam-B is an automatic clustering of the rest of SwissProt
made with the program Domainer. Pfam-A contains about 570 protein superfamilies
(covering 50% of the SwissProt database). Pfam entries are usually longer than
BLOCKS or PROSITE patterns or motifs. Unlike PROSITE, Pfam also contains a
multiple alignment for each conserved domain.
Question: Where could my protein be located in the cell?
Solution: Search for protein sorting signals.
[Psort]
Predicts localization of both eukaryotic and prokaryotic proteins.
Question: I have a gene sequence. I want to know
where the transcription factor binding sites and promoter(s) are.
Solution A: Search your DNA sequence against
TRANSFAC or TFD database.
[SignalScan]
[FastM]
TRANSFAC is a database of
eukaryotic cis-acting regulatory DNA elements and trans-acting factors.
It covers the whole range from yeast to human and currently contains
information on 4602 binding sites and 2285 transcription factors.
The FastM server lets you search for interesting combinations, e.g. genes in
which binding sites for factors A and B occur within a defined distance of each other.
Solution B: Analyze your DNA
with some promoter-finding program.
[NNPP]
Question: I have a gene sequence. I want to know how good (typical) the
translation start and stop codons of that gene are. Will my gene be expressed
normally in ... cells?
Solution:
Compare your gene start and end to other genes from the same species.
[TransTerm]
The TransTerm database contains
statistics and species-specific consensus sequences around the start and stop
codons of genes from several hundred different organisms. Remember that
translation is initiated differently in eukaryotes and prokaryotes. Eukaryotic
ribosomes recognize the start codon itself. In human cells, a good consensus
for starting translation is RNNATGG... In prokaryotic genes ribosomes
bind to the so-called Shine-Dalgarno sequence, which is located 10-20
nucleotides upstream of the start codon.
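A quick way to check a eukaryotic start codon against the RNNATGG consensus quoted above is a regular expression. This is only a sketch; the function name `start_codon_context` is invented for the example, and a species-specific consensus from TransTerm would be a better yardstick than this single pattern.

```python
import re

# R = purine (A or G) at position -3, two arbitrary bases, ATG, then G at +4.
KOZAK = re.compile(r"[AG]..ATGG")

def start_codon_context(dna, atg_index):
    """Return True if the ATG at atg_index (0-based) sits in an RNNATGG context."""
    window = dna.upper()[max(atg_index - 3, 0):atg_index + 4]
    return KOZAK.fullmatch(window) is not None

print(start_codon_context("GCCACCATGGCG", 6))   # True
print(start_codon_context("TTTTTTATGTTT", 6))   # False
```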
IDENTIFY SEQUENCE PATTERN IN PROTEIN FAMILY.
Question: I have a family of protein sequences.
All of them contain a common motif (an active site, a binding site for other
molecules, a transport signal, etc.). I want to extract the general consensus
sequence of this motif.
Solution: Identify pattern in family of sequences
You could try the following specialized pages:
[Blockmaker]
BLOCKS server. Uses the same algorithms as those used for preparing BLOCKS
database entries.
[eMOTIF] in Stanford.
Needs aligned sequences for input.
[PRATT] in EBI.
Pattern identification from non-aligned sequences.
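For sequences that are already aligned, the simplest possible consensus is a column-by-column list of the residues observed. The sketch below does just that and writes the result in PROSITE-like notation. It is a naive illustration, not the algorithm used by eMOTIF, PRATT or Blockmaker, and the function name is hypothetical; real tools also group similar residues, allow wildcards and handle gaps.

```python
def alignment_to_pattern(aligned):
    """Build a PROSITE-like pattern from equal-length, gap-free aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("all sequences must have the same length")
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                      # conserved position
        else:
            elements.append("[" + "".join(residues) + "]")    # observed alternatives
    return "-".join(elements)

print(alignment_to_pattern(["FCRNPD", "YCRNPN", "FCRNPR"]))
# [FY]-C-R-N-P-[DNR]
```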
SEARCH DNA PATTERN AGAINST DATABASE.
Question: I am working with a transcription
factor. I have identified (refined) its binding site. I want to know what other
genes could contain this binding site.
Solution:
Search the pattern against a DNA database.
Currently the only program (that I know of) that does this is
PATSCAN.
PATSCAN accepts both patterns and matrices.
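For a single sequence, searching with a consensus pattern can be sketched as follows. The code expands IUPAC ambiguity codes into a regular expression and scans both strands. It is an illustration only and does not reproduce PATSCAN's pattern language or its matrix scoring; the function name `find_sites` is an assumption.

```python
import re

IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "[AG]", "Y": "[CT]",
         "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]", "B": "[CGT]",
         "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def find_sites(consensus, dna):
    """Return sorted (1-based position, strand) hits of an IUPAC consensus.

    Positions always refer to the leftmost base on the given (plus) strand.
    """
    dna = dna.upper()
    consensus = consensus.upper()
    reverse = consensus.translate(COMPLEMENT)[::-1]
    hits = set()
    for strand, motif in (("+", consensus), ("-", reverse)):
        # The lookahead lets overlapping occurrences be reported as well.
        regex = re.compile("(?=" + "".join(IUPAC[c] for c in motif) + ")")
        for m in regex.finditer(dna):
            hits.add((m.start() + 1, strand))
    return sorted(hits)

print(find_sites("GGGCGG", "AAGGGCGGAACCGCCCTT"))
# [(3, '+'), (11, '-')]
```

A palindromic consensus is reported once per strand at the same position, which is expected.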
ANALYSIS OF CODON USAGE. CORRESPONDENCE ANALYSIS OF GENES.
Question:
I have many genes from one species. I want to know which codons are
preferred and how the genes are distributed according to their codon usage.
Solution:
Do codon usage analysis and/or correspondence analysis
on your genes.
Codon usage
Correspondence analysis is a statistical
method that describes the distribution and similarity of your genes based on
their codon usage. The Lyon server
can run correspondence analysis on many genes, but it is complicated to use.
If you need to do a lot of codon usage analysis, I suggest downloading and
installing the program codonW.
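Counting codons is simple enough to do by hand for a few genes. The sketch below tallies codons in an in-frame coding sequence and reports, for each amino acid, the fraction contributed by each synonymous codon. It assumes the standard genetic code; the function name `codon_usage` is made up for this example and the output format is not that of codonW or any server mentioned here.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, '*' marks stop codons.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(cds):
    """Return {amino_acid: {codon: fraction}} for an in-frame coding sequence."""
    cds = cds.upper().replace("U", "T")          # accept RNA or DNA letters
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    by_aa = defaultdict(dict)
    for codon, n in counts.items():
        if codon in CODON_TABLE:                 # skip codons with ambiguous bases
            by_aa[CODON_TABLE[codon]][codon] = n
    return {aa: {codon: n / sum(codons.values()) for codon, n in codons.items()}
            for aa, codons in by_aa.items()}

usage = codon_usage("CCGCCGCCGCCAACC")
print(usage["P"])   # {'CCG': 0.75, 'CCA': 0.25}
```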
COMPLEX ANALYSIS OF LONG SEQUENCES. AUTOMATIC IDENTIFICATION OF GENES AND PROMOTERS
Question: I have a eukaryotic gene sequence. I
want to know whether it contains any exons, promoters, splice sites,
transcription factor binding sites, polyadenylation sites or other eukaryotic
gene features.
Solution:
Victor Solovyov's collection of programs
in EBI, England
in BMC, Texas, US
Both collections include well-known eukaryotic
exon-finding programs such as Grail II, FGENEH, Genie and Hexon, as well as programs
for predicting promoters, splice sites and transcription factor binding sites.
Most of those programs try to find potential splice junctions, open reading
frames and promoters in combination. They are based on advanced learning and
recognition methods, such as Hidden Markov Models and Neural Networks, that use
sample datasets for training. Therefore, they work well only on DNA from the
species they were trained on (mostly human and other mammals).
Question: I have a prokaryotic gene sequence. I
want to know where the ORFs, promoters, ribosome binding sites or other
prokaryotic gene features are.
Solution: GeneMark by M. Borodovsky
in EBI, England
in GIT,
Georgia, US
GeneMark is a learning program:
it needs to see at least 10 kb of sequence before it can make correct predictions.
Fortunately, data for the most common prokaryotes and lower eukaryotes are
already included in the program. The EBI version also has data for human and
A. thaliana genes, based on their GC content.
For a short overview of sequence comparisons and alignments, read the
additional tutorial.
Collection of programs in Pasteur Institute, France
Collection of
programs in CMS, Italy
Recent paper in
Nature Genetics, with hyperlinks.
1. Download complete genes for elongation factor Tu (tuf)
and elongation factor G (fus) from the following organisms:
- Rickettsia prowazekii
- Chlamydia trachomatis
- Mycoplasma pneumoniae
- Bacillus subtilis
- Treponema pallidum
- Escherichia coli
- Mycobacterium leprae
- Mycobacterium tuberculosis
There may be more than one gene from each organism.
Hint: A gene record retrieved from a
DNA databank may contain additional
sequences or other genes. Click on the CDS link beside the correct gene to
retrieve only the coding part of the gene.
Use the following
form to calculate
basic characteristics for those genes: total GC content, GC content at each codon
position and the observed Nc (effective number of codons) value. There are two
Nc values in the form output: the first is the expected Nc (calculated
theoretically from the GC3 content), the second is the observed Nc.
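The expected Nc is commonly computed with Wright's (1990) formula, which gives the effective number of codons for a gene whose codon choice is shaped only by its GC3s. The sketch below implements that formula; whether the form on this page uses exactly this expression is an assumption.

```python
def expected_nc(gc3s):
    """Expected effective number of codons for a GC3s given as a fraction (0 to 1)."""
    s = gc3s
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)

for s in (0.2, 0.5, 0.8):
    print(s, round(expected_nc(s), 1))
```

The curve peaks at 60.5 for GC3s = 0.5 and falls towards about 31 at either extreme, so a gene whose observed Nc lies well below the curve uses fewer codons than its base composition alone would predict.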
Make two plots from the data:
Plot A. GC1, GC2 and GC3 plotted against total GC of the gene.
Plot B. Observed Nc plotted against GC3s.
Send or show the plots. On plot A, draw three lines, one each
through the GC1, GC2 and GC3 points. Do they have different slopes? Why?
On plot B, see how much the observed codon usage differs from the theoretical
codon usage. What could cause the difference from the theoretical value?
Which genome shows the largest difference between theoretical
and observed codon usage?
2. Analyze the frequency of codons (codon usage) in the tuf genes using the
links mentioned above. Which codons are preferred
in your genes? Are the preferred codons the same in each organism?
Hint:
Here is the codon table to help you find which codons code for the same
amino acid. You do not have to compare all codon families; take 2 or 3 (Proline
and Threonine are good examples for analyzing codon usage).
Phe UUU   Ser UCU   Tyr UAU   Cys UGU
Phe UUC   Ser UCC   Tyr UAC   Cys UGC
Leu UUA   Ser UCA   TER UAA   TER UGA
Leu UUG   Ser UCG   TER UAG   Trp UGG
Leu CUU   Pro CCU   His CAU   Arg CGU
Leu CUC   Pro CCC   His CAC   Arg CGC
Leu CUA   Pro CCA   Gln CAA   Arg CGA
Leu CUG   Pro CCG   Gln CAG   Arg CGG
Ile AUU   Thr ACU   Asn AAU   Ser AGU
Ile AUC   Thr ACC   Asn AAC   Ser AGC
Ile AUA   Thr ACA   Lys AAA   Arg AGA
Met AUG   Thr ACG   Lys AAG   Arg AGG
Val GUU   Ala GCU   Asp GAU   Gly GGU
Val GUC   Ala GCC   Asp GAC   Gly GGC
Val GUA   Ala GCA   Glu GAA   Gly GGA
Val GUG   Ala GCG   Glu GAG   Gly GGG
Hint:
If you are using the Alces server for
codon usage analysis, remember that it has three different modes:
A. Translate
B. Codon Table (what you probably need)
C. CAI values
To change the output you need to change the button at the end of the form, just
above the submit button. Another trap is the first button at the top of the form:
this has to be set to "Raw" if you use copy-paste.
For all codon usage programs you have to use FASTA format
or just plain sequence; numbers have to be removed. SRS5 has those options
(save sequence in FASTA format), but they might be difficult to find. The SRS
forms are currently being changed, so their look is not entirely consistent.
Try the various buttons in the SRS form to convert your output
sequence to FASTA format, or use the sequence converters in Singapore or
in Texas.
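If no converter is at hand, stripping the numbers yourself is straightforward. The sketch below assumes the input consists only of sequence lines (with position numbers and spaces, as in a databank flat file) and optionally an old FASTA header; other annotation lines would have to be removed first. The function name and default header are placeholders.

```python
import re

def to_fasta(raw, name="my_sequence", width=60):
    """Strip digits, whitespace and other non-letters, then format as FASTA."""
    lines = [l for l in raw.splitlines() if not l.startswith(">")]  # drop old header
    sequence = re.sub(r"[^A-Za-z]", "", "".join(lines)).upper()
    wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + name] + wrapped)

print(to_fasta("  1 atgcatgcat gcatgcatgc\n 21 atgc\n", name="test", width=10))
```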
3. Identify Open Reading Frames (ORFs) on your personal contig, which was
assigned for the
previous
homework.
Try several different ORF-finding programs. Choose 3 ORFs of your own choice
for further analysis.
Hint:
FramePlot might not be able to return the GIF image if your
sequence is longer than 35 kb. In that case use only half of your contig for
finding ORFs.
Which program was most convenient for finding ORFs? Why?
4. Find patterns in protein sequences. First read some good
documentation pages about the
PROSITE,
BLOCKS and
Pfam
databases.
Follow the links and try to understand what patterns are, what
matrices are and how they are selected for the databases.
Now find potential active sites and other patterns using the PROSITE, BLOCKS and
Pfam databases. Take one sample protein
and see what results you can get from those databases.
After getting the results, try to identify the sample protein by a BLAST search.
Compare the sequence description in the database with the data you got from your
pattern search. Are the discovered patterns mentioned in the
database description?
Now try to identify patterns in your own sequences.
Translate some (at least 3) ORFs from task 3 to protein sequence.
Alternatively, use some proteins from your own scientific project.
Which homologies do you find in each database? Did you get any useful
information from those searches? Can you predict protein function based on this
search?
5. Test the promoter identification programs: try to identify promoters in that
piece of eukaryotic DNA.
6. Generate a result file with answers to each task. Answer the questions marked
in red. Send it to me by email or with the form below.
Feel free to email or see me if you have any questions about interpreting your
results or understanding the program input, output, algorithms, etc.
A form for sending your results:
Maido Remm