LIU   Bio 537 Computer Lab

This course is designed to introduce the basic computer methods and bioinformatics used to support molecular analysis of genes. We will generate a lot of data. You will need a floppy disk to store your data. A Zip disk is highly recommended. Organize your results on the disk or it will become impossible to find anything.



RESOURCES

PubMed Find out what is already known.

EBI European site; Lots of tools for analysis of protein and DNA

Restriction map option 1, option 2

ExPASy - Translate tool Translate DNA

NCBI, National Center for Biotechnology Information Access to GenBank, The human genome and much more.

Primer3 PCR primer design. May be more than you need.

SDSC Biology Workbench Big and extremely useful site with free registration. Allows you to save data on their supercomputer and use many tools that work together.

NCBI BLAST Home Page For sequence homology searches

Free RasMol download The RasMol program is the easiest way to display proteins in 3-D

Clustalw Align multiple proteins and make trees (also see Biology Workbench)

Clustal W on-line help

Protein Data Bank Database of protein 3-D structures.

 

EXERCISES

Lab 1. Statistical and graphing analysis of data using Excel.

  1. Linear regression using linest(...) to find the best line on a graph with the least squares method.
  2. Standard deviation using stdev(...) to determine how variable the data are, and standard error to evaluate how likely it is that two conditions are giving different results.
  3. Chi-square, chitest(...), to analyze data that comes in classes rather than as a continuous variable.
  4. Correlation analysis (r) using correl(...) to see if common factors affect two variables.
  5. Lab assignment to be submitted as the lab report for this lab. Get your personalized data for this plotting problem here.
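For those who want to check their Excel answers by hand, here is a minimal Python sketch of the same calculations. The function names are my own, not Excel's, and the chi-square function stops at the statistic, whereas chitest(...) goes on to return a probability.

```python
import math


def linear_fit(xs, ys):
    """Least-squares slope and intercept for one x variable (compare linest)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def sample_stdev(values):
    """Sample standard deviation with an n - 1 denominator (compare stdev)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def standard_error(values):
    """Standard error of the mean: standard deviation / sqrt(n)."""
    return sample_stdev(values) / math.sqrt(len(values))


def correlation(xs, ys):
    """Pearson correlation coefficient r (compare correl)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

Calling linear_fit on your x and y columns should give the same slope and intercept that linest(...) reports for a straight-line fit.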

Labs 2-4. GENE PROJECTS: Manipulating DNA and protein sequences to support molecular biology.

Hand-in reports on this section (Labs 2, 3, 4) are due February 19. For each lab I have indicated what I want; surprise me with a few additional analyses if you would like extra credit but do not print out pages of output. Raw printouts will just be discarded. You may use 'copy and paste' to put a relevant segment of output into your report if it is not too big.

Imagine that you have just begun a lab project on some gene. What information would you like to have to help you understand your project, permit PCR manipulation, and identify and modify key features?

Lab 2

  1. Choose a gene to study. Suggested choices include C. elegans ras, tlr-3, SUMO, Macaque hemoglobin, human swi2, superoxide dismutase, E. coli UmuDC, rat rhodopsin, CD44, p53, kaposin B, catenin. For some exercises involving noncoding DNA you may use Human chromosome 17 (AC083783)(any 2000 bases). You can choose a gene and change later if you need to.
  2. Investigate your ‘gene-of-interest’ (GOI) using PubMed and the web. What does your protein do? What are its special features? Are there active sites or points of interaction with other proteins? It is essential that you understand the function of your gene and the special characteristics of your gene. I want to know that you know what you are working on.
  3. Get the DNA. Go to NCBI's web page. From the upper left scroll menu next to the search box, choose 'nucleotide' to search GenBank. Enter search terms. You will get lots of responses. Choose the best. Click to see the GenBank Nucleotide DNA record for your gene. Extract important information about the location of sites. Switch the record to Fasta format (used by most programs) using the pull-down menu. Reformatting should occur automatically. To save the Fasta-formatted gene, use the 'Send to' pull-down menu. You can also copy the gene to the clipboard, which will be convenient for many purposes.
  4. What are the intron/exon boundaries in your gene based on the GenBank record?
  5. Make a restriction map to find out how to manipulate your gene. Go to the EBI site and explore the gene analysis tools (focus on DNA for now). Make a restriction map of your gene. Translate in six reading frames. EBI and other web sites have many features ... these might provide a good source for extra credit analyses.
  6. Try finding ORFs (open reading frames, i.e. possible coding regions) with ORF Finder at NCBI or the ExPASy site. Note that all DNA has six reading frames. ORFs are regions that can make protein. What is the largest ORF that is not the real (biggest) ORF?
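To see what the six-frame translation and ORF tools in steps 5 and 6 are doing, here is a minimal Python sketch. It assumes the standard genetic code and a single-record Fasta file, and it treats an ORF as the stretch from the first ATG to the next stop codon. The function names are mine; the web tools use their own options and minimum lengths.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def read_fasta(text):
    """Return the sequence from single-record Fasta text, uppercased."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return "".join(line for line in lines if not line.startswith(">")).upper()


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become X, stops become *."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Translations of frames +1, +2, +3 and -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def find_orfs(dna, min_codons=2):
    """Return (frame, protein) pairs, first ATG to stop, longest first."""
    orfs = []
    for frame, protein in six_frames(dna).items():
        # The last piece after splitting on '*' has no stop codon, so skip it.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start >= min_codons:
                orfs.append((frame, segment[start:]))
    return sorted(orfs, key=lambda orf: len(orf[1]), reverse=True)
```

The first entry returned by find_orfs is the largest ORF; the second entry is a candidate answer to the question in step 6.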

Report: Describe the function of your gene and major features from GenBank. How many exons are present? What are their sizes? Paste the sequence of your gene in Fasta format under your description. For all labs in this section make certain that you label what gene you are using. Find a restriction enzyme that cuts once within the main ORF of your gene, dividing it as nearly as possible in two. What is the enzyme? This enzyme could be used, for example, in subcloning. What length of DNA would be in each gene fragment after restriction digestion? Show a simple map. Show the main ORFs in your DNA. Indicate an ORF not used by the cell.
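The search for a single-cutting enzyme can be sketched in a few lines of Python. This is only an illustration: it assumes linear DNA, checks only three common enzymes (a real restriction map checks hundreds), and uses function names of my own.

```python
# Recognition site and cut offset on the top strand for three common enzymes.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def cut_positions(dna, site, offset):
    """Return every position where the enzyme cuts the top strand."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start + offset)
        start = dna.find(site, start + 1)
    return positions


def single_cutters(dna, enzymes=ENZYMES):
    """Map each enzyme that cuts exactly once to its two fragment lengths."""
    dna = dna.upper()
    result = {}
    for name, (site, offset) in enzymes.items():
        cuts = cut_positions(dna, site, offset)
        if len(cuts) == 1:
            result[name] = (cuts[0], len(dna) - cuts[0])
    return result


def most_central(dna, enzymes=ENZYMES):
    """Return the single cutter whose two fragments are closest in size."""
    cutters = single_cutters(dna, enzymes)
    if not cutters:
        return None
    return min(cutters, key=lambda name: abs(cutters[name][0] - cutters[name][1]))
```

Pass in only the main ORF, not the whole record, if you want the enzyme that divides the ORF itself.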

Lab 3

  1. PCR is a powerful method for amplifying a segment of DNA. The region to be amplified is defined by two primers, forward and reverse. PCR primers are designed from the DNA sequence: the direct sequence for forward primers and the reverse complement for reverse primers.

    Design PCR primers (16-mers) that would amplify only the coding region of your gene. These primers could be 16-mers located at the gene boundaries and pointing toward each other. Do not use Primer3 for this exercise since it will give the wrong primers. Make another pair that would amplify only ~200 bp of one exon. Why might you wish to be able to amplify the entire coding region? What would the template be? Why might you want primers to amplify within one exon? What might the template be? If you don't know the exons of your gene, you can find them from the chromosomal GenBank record for your gene (you may be using a cDNA or mRNA copy now). Or use part of AC083783 chromosome 17 DNA, find an ORF using Molecular Toolkit, and make primers for the ORF you have identified (probably an exon). Make a third pair that spans an exon/intron boundary. Why might these be useful? Question: Why do you want to avoid repeats for PCR primers? Note that the Primer3 program makes primers within a region, not for the entire region, and so is not appropriate for most of this exercise. But now try Primer3 and compare the Primer3 result to what you have done so far.

  2. Find sequences similar to yours (often referred to as 'hits') by running NCBI Blastn with your Fasta DNA file (go to NCBI then click 'Blast' near the top of the page or on the left side. Choose nucleotide Blast, paste your sequence, submit, format and then check for results, which may take a few minutes). What are the hits to your gene? Did you get back your original gene as one of the hits? What hit is closest to your gene but not identical (check the alignment to make certain that there are differences; otherwise the hit may be just your gene with another name)? What fraction of residues are identical for this hit? What is the Score and 'E value'? (High score and low E is good). What region of the gene is generally most highly conserved? Do not print out your entire Blast results. Read the Blast site carefully for data. Make certain that you understand each piece of information in Blast since it is one of the most important methods in molecular biology. Please read the results and extract the data you need.

    Gene homologs, as the term is used in this lab, are direct evolutionary descendants of the same gene in different species (strictly speaking, orthologs), e.g. beta-adrenergic receptors in mouse and humans. Gene paralogs are similar sequences within one organism that diverged long ago, e.g. beta-adrenergic receptor and alpha-adrenergic receptors in human. In short, the homologs we look for are from different species; paralogs are from the same species.

  3. Find 'homologs' of your human gene in another mammal (they should be similar). How similar are they in '% identity'? Design PCR primers that would amplify 100-200 bases of DNA from both your gene and a homolog. Why would these primers be useful?
  4. Now try to find a 'paralog', a related gene in humans. How similar is your gene to its paralog? Design PCR primers that would amplify your gene, but not the paralog. Why might you use such primers?
  5. Find a homolog in insects or a lower eukaryote. How similar are the genes (E-value, % identity)?
  6. Run Blastn with a large piece of chromosome 17 (AC083783). What genes are present in the segment? Are the functions of all the genes known?
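The boundary-primer design from step 1 can be sketched in Python as follows. This is a minimal illustration under simple assumptions: the template is given 5' to 3', you already know where the region starts and ends, and nothing is checked for melting temperature or repeats. The function names are mine.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def boundary_primers(template, start, end, length=16):
    """Primers that amplify template[start:end] (0-based, end exclusive).

    The forward primer is the direct sequence at the left boundary; the
    reverse primer is the reverse complement of the right boundary.
    """
    region = template[start:end].upper()
    if len(region) < 2 * length:
        raise ValueError("region is too short for two non-overlapping primers")
    forward = region[:length]
    reverse = reverse_complement(region[-length:])
    return forward, reverse


def amplicon(template, forward, reverse):
    """Return the product a primer pair defines, or None if a primer fails."""
    template = template.upper()
    start = template.find(forward)
    end = template.find(reverse_complement(reverse))
    if start == -1 or end == -1 or end < start:
        return None
    return template[start:end + len(reverse)]
```

Running amplicon on your primers is a quick check that the pair really brackets the region you intended. To test primers against a paralog (step 4), call amplicon with the paralog as the template and confirm that it returns None.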

Report:
1. Sequence of each PCR primer pair, and the process you used to design them. How did you determine the start and stop of the coding region? How did you know that you were within an exon or at an exon/intron boundary?
2. Blastn. Show name and E-value of best alignment that is not identical to query. Now do the same for the worst alignment.
3. Name a homolog and paralog of your gene.
4. Cut and paste in one alignment from the bottom of Blastn. What is the region of highest homology?
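If you want to check the 'fraction of residues identical' for an alignment you have pasted, here is a small Python sketch. It assumes you give it the two aligned lines exactly as Blast shows them, with '-' for gaps, and it counts identities over the full alignment length. The function names are mine.

```python
def percent_identity(aligned_a, aligned_b):
    """Identical columns divided by alignment length, as a percentage."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-"
    )
    return 100.0 * matches / len(aligned_a)


def mismatch_positions(aligned_a, aligned_b):
    """1-based alignment columns where the two sequences differ."""
    return [
        i + 1
        for i, (a, b) in enumerate(zip(aligned_a, aligned_b))
        if a != b
    ]
```

The mismatch positions are also a quick way to confirm that a hit really differs from your query rather than being the same gene under another name.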

Lab 4

  1. Do a Blastp search to find proteins similar to yours. Are there any interesting/informative hits? Do conserved regions of protein correspond to conserved areas of DNA? How do these correlate with functional domains or active site regions? Note that the regions of conservation supplied by the Blast server are a useful resource but are too broad for our purpose.
  2. With your knowledge of sequences of your protein and close homologs, try to design a mutagenic primer, ~40-mer, that could be used to alter a key residue of your protein (e.g. a phosphorylation site) to test a hypothesis about its role in function (use the genetic code to guide your design). If you do not know a functional site in your protein, choose a site at random. Try to design a primer that could be used to introduce the same mutation into several different homologs (i.e. will hybridize to DNA of genes homologous to your gene derived from several distinct species).
  3. Alternative: If you have trouble finding a feature in your protein to mutate, try designing an oligo to introduce a drug resistance mutation into HIV protease or reverse transcriptase, or find a variant of influenza C hemagglutinin and design an oligo to mutate the parent viral strain. (HINT: You may want to Blast your gene first. Mutant versions of the original gene will have very low E-values and only one or a few mismatches in alignment. Using the Blast alignments you can find where a mutation lies.) One possible example would be to reverse a drug resistance mutation in the HIV reverse transcriptase gene (AY237815). The protein file for HIV RT is AAO84275. Gln151 in the WT virus is mutated to drug resistance by a change to Gly151 in this sequence. The position of the change with a bit of surrounding sequence is indicated below.
    QYNYLPQGWKGSF

    Design a mutagenic oligo to reverse the drug resistance mutation in AY237815.
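The mechanics of building a mutagenic oligo can be sketched in Python. This is a minimal illustration that assumes your DNA starts at codon 1 and is in frame; the sequence in the usage note is made up and is not AY237815. The function name and the 18-base flank (giving a 39-mer, close to the ~40-mer asked for) are my own choices.

```python
def mutagenic_oligo(coding_dna, codon_number, new_codon, flank=18):
    """Return an oligo with new_codon in place of the chosen codon.

    codon_number is 1-based, counting from the first codon of coding_dna.
    The replaced codon sits in the middle, with `flank` unchanged bases on
    each side so the oligo can still hybridize to the original gene.
    """
    coding_dna = coding_dna.upper()
    new_codon = new_codon.upper()
    if len(new_codon) != 3:
        raise ValueError("new_codon must be exactly three bases")
    start = 3 * (codon_number - 1)
    if start - flank < 0 or start + 3 + flank > len(coding_dna):
        raise ValueError("not enough flanking sequence around this codon")
    left = coding_dna[start - flank:start]
    right = coding_dna[start + 3:start + 3 + flank]
    return left + new_codon + right
```

For example, with a made-up gene in which codon 8 is GGA (Gly), mutagenic_oligo(gene, 8, "CAA") returns an oligo that changes that residue to Gln. Use the genetic code to choose a new codon that needs as few base changes as possible.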



Find homologs and paralogs. Try searching for paralogs of human hemoglobin alpha by searching a human database. Try finding distant orthologs by searching other genomes. For example search E. coli by following the path NCBI, Genomic biology, genome resources, microbial genomes, E. coli, little 'L' box at end of line. Or search for E. coli topoisomerase homologs in another bacterial species. Do they share homology over the entire gene?

Find out more about your gene. Go to NCBI. Search Entrez 'Gene' for your gene. Open the file. There may be lots of information about your gene and human disease. Click the link SNP: GeneView. This tells you about mutations or variation in your gene (which could be useful for the exercise above).
Report: Compare Blastn to Blastp. How did your results differ? As before, list the name and E-value of a closely related protein that is not the same as your query protein. Also do the worst match. Describe functional region(s) of your protein. Use GenBank, Medline, and alignment in Blastp (the motif finder in Blastp is helpful). Write out and describe your mutagenic oligo and what it is designed to do. (This is partly a test of your ability to find important regions of your protein. If you find them, mutate them!)

 

Labs 5-7. Protein structure and function. Joint report for Labs 5, 6, 7 due March 12.

Lab 5. Major resource sites for genes and proteins

  1. Locate your gene on a larger map or genome of the organism it is derived from. Go to the UCSC Human Genome site. Click on 'server' to search the genome. For the site, enter your accession number if human (e.g. NM_005591). Use my example number if your gene is not human, or find a human gene that you would like to use for this session. Submitting your number should take you to a map of the region around your gene. What chromosome is your gene on? What genes are on either side of it? Are they known genes? What is the code of a piece of DNA (clone or BAC) that contains the genomic sequence for your gene? Explore the resources at this important site a bit.
  2. Visit another site for getting genomic data: NCBI Map Viewer.
    Go to the NCBI site. Click for Map Viewer on the right side of the page. First try blasting your gene against one of the genomes available (e.g. the human genome). How do the results differ from Blasting the nr database?
    Now try exploring a genome. Try clicking on the human link in Map Viewer and you will get a diagram of human chromosomes. Enter a name in the Find window (try rhodopsin or try fmr1 if your gene name doesn't work). Searching will lead to small red marks on the chromosomes indicating where the gene (or related genes) lie. Beneath the diagram is a list of hits. Click on 'Map Element' to see the region of the chromosome surrounding the gene. Try your gene. What genes flank it?
  3. Explore the genetics of your gene at the OMIM (Online Mendelian Inheritance in Man) site. This site catalogs what is known about human genes and the genetic disorders associated with them. The 'Statistics' page gives an overview of known mutations. Enter your accession number in the main page to see if mutations in your gene cause disease in humans. Remember that this must be a human accession number. Or just enter a word in the search box (but you may get data for other genes with similar names or that act with your gene). Try rhodopsin or fmr1 as examples. Extract a known mutation within your gene or the examples I gave.
  4. Biology Workbench is an integrated environment for complex analyses. We will use it for analyzing a group of genes simultaneously. Go to Biology Workbench. Follow instructions for free registration. Registration assigns you space in the memory of the San Diego supercomputer. For now, choose the Protein entry point. You will get a screen with multiple program options. Choose Ndjinn to select a protein. Then 'Run'. You will be given options to select a protein database (some nonredundant database such as GenPept might be good) and enter keywords (e.g. name of your protein). Do that, then 'Run' to get hits. Select genes you want using checkboxes. Select several related genes from different species. Run 'Import'. Now you should have several genes on your workbench. Ndjinn is somewhat like 'Entrez', but there are more options and you can select several genes at once. Also, unlike NCBI, Workbench remembers your work. A year later you could log on and your sequences would still be there.
    Click to activate all of your genes on your workbench and choose the ClustalW program which will let you compare several genes at once. Click Run. Accept the data, and Run again, this time for real. ClustalW aligns the proteins. Now you can see which residues are conserved and which vary. Make note of several variable and conserved residues (you will need them in a week or two). Lower in the output is a phylogenetic tree of your protein. (This is not a very sophisticated phylogenetic tree, but it will be accurate enough for what we are doing). Which homologs are most similar?
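Once you have a ClustalW alignment, picking out conserved and variable residues can be sketched in Python. This is a rough illustration only: it assumes you have copied the aligned sequences out as equal-length strings with '-' for gaps, and it measures distance as the simple fraction of differing positions, which is cruder than the method ClustalW uses for its tree. The function names are mine.

```python
from itertools import combinations


def column_conservation(aligned_seqs):
    """Return a marker line: '*' under fully conserved columns, else ' '."""
    lengths = {len(seq) for seq in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned_seqs):
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)


def pairwise_distance(seq_a, seq_b):
    """Fraction of differing residues over columns with no gap in either."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 1.0
    return sum(1 for a, b in pairs if a != b) / len(pairs)


def closest_pair(alignment):
    """Return the two names in a {name: aligned sequence} dict that are most similar."""
    return min(
        combinations(sorted(alignment), 2),
        key=lambda pair: pairwise_distance(alignment[pair[0]], alignment[pair[1]]),
    )
```

Using max in place of min in closest_pair gives the two most distant sequences, which the Lab 5 report asks for.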

 

Report: What is the location of your gene (or some gene) in a genome? What genes flank yours in the genome it comes from? What diseases do mutations in the gene cause? What is the location of a known mutation in the gene? What are the best homologs when you Blast a specific species genome? Present a phylogenetic tree from Biology Workbench of proteins related to your gene. Which two genes are most distant?

 

Lab 6. Visualizing structure

We will now look at protein structure. Retrieve the structure of the protein most similar to your protein. Go to NCBI Blastp and search for your protein, now searching the 'pdb' database (select from the scroll of database options).
  1. When you get your list of hits look for the best hit (lowest E-value). Find the 4-character PDB (Protein Data Bank) identifier (e.g. 1A6M) buried in the numbers returned or click on the accession number to get more data.
    Go to the Protein Data Bank (PDB) website. Enter your identifier in the search window. You should find yourself at a page with information about the 3-D structure of your protein or a similar protein. Most of the proteins I originally assigned are in PDB, but some of you are working with associated proteins that may or may not be present in the database. Choose a structure, noting the identifier code, and follow links to download/view. First take a look at the protein using the display option 'KING'. This lets you see your protein in 3D on the web and manipulate it a bit, but may take a while to load. Now download the uncompressed, entire file (not just the header). NOTE: you may not have permission to download files on your machine. If you do not, do the exercise using one of the PDB files available on the desktop of your machine in a folder labeled RasMol, or Protein or PDB. Any PDB file will do. Here are some that may be available. Rhodopsin: 1F88.pdb, Whale myoglobin: 1A6M.pdb, human hemoglobin (sickle form): 2HBS.pdb.
  2. RasMol/RasWin is a program for viewing protein and DNA 3-D structure. Start RasMol by clicking on the three-ball icon. Use 'File, Load' to load your downloaded PDB file. Experiment with various 'display' options. Then read the RasMol tutorial to better understand the program. Learn to use RasMol using PDB files on your computer and a guided set of examples using the program.
    Choose a conserved part of the protein (using results from clustalW or blast) and color it blue. Choose a non-conserved portion of the protein and color it red. Use 'select within' to color nearby residues orange.
  3. Using PubMed, OMIM, GenBank file annotation, etc., find an important functional part(s) of the protein: a binding site, an active site, or a residue that when mutated causes a phenotype or disease. Highlight these sites. 'Export' the file to a GIF format file that can be used on a web page or inserted into an MS Word document.
    Try finding neighboring residues of the heme in myoglobin, 1A6M.pdb. Open 1A6M.pdb then type on the command line:
    select within (4.0, hetero and not water)
    spacefill
    color red

    This selects atoms within 4 angstroms of the heme, but avoids water molecules. Effectively you have selected the ligand-binding residues.

  4. View activation conformational changes
    Ras changes conformation when it binds GTP. It is hard to see differences in the two structures though. Aligning will help. Go to the CE server (Google: ce protein alignment; or go to cl.sdsc.edu). Choose the ‘two chains’ option. Enter two PDB codes (use ras + GDP = 1LF5:A and ras + GTP = 1LF0:A) to align the proteins. You will get back both proteins aligned in one file so you can see the differences. Study the regions of similarity and differences for the two conformations of ras. Can you see the switch I and switch II domains? It is easier to see the differences if you experiment with the view using the 'difference' option.
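To see what the RasMol 'select within' command in step 3 is computing, here is a rough Python sketch that reads ATOM and HETATM lines from a PDB file and lists protein residues with any atom within 4 angstroms of a non-water hetero atom. It assumes standard fixed-column PDB lines and treats only residues named HOH as water, which is a simplification of RasMol's own definition. RasMol also selects the hetero atoms themselves; this sketch reports only the protein residues. The function names are mine.

```python
import math


def parse_atoms(pdb_text):
    """Read ATOM/HETATM records using the fixed PDB column positions."""
    atoms = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record not in ("ATOM", "HETATM"):
            continue
        atoms.append({
            "hetero": record == "HETATM",
            "resname": line[17:20].strip(),
            "chain": line[21],
            "resseq": int(line[22:26]),
            "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        })
    return atoms


def residues_near_ligand(atoms, cutoff=4.0):
    """Protein residues with an atom within `cutoff` of a non-water hetero atom."""
    ligand = [a for a in atoms if a["hetero"] and a["resname"] != "HOH"]
    near = set()
    for atom in atoms:
        if atom["hetero"]:
            continue
        if any(math.dist(atom["xyz"], lig["xyz"]) <= cutoff for lig in ligand):
            near.add((atom["chain"], atom["resname"], atom["resseq"]))
    return sorted(near, key=lambda residue: (residue[0], residue[2]))
```

To try it on a downloaded file, read the file into a string with open('1A6M.pdb').read() and pass that string to parse_atoms.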

Report: Include your protein's name, species, accession number, and the name of the nearest PDB homolog (i.e. whose structure is known.) Show at least two images of proteins, one of which should be your protein (or homolog) with an important region highlighted. The other from one of the exercises with RasMol above. When you print RasMol pictures always remember to 'set background white' on the command line so you don't end up using a whole cartridge of black ink! Also, I use your mastery of the white background as evidence that you have learned to use the command line in RasMol. List PDB codes and equivalent protein names and atomic resolution for your PDB and one other. For the ras conformational change report two residues (from switch I and switch II) that change position between the GTP/GDP-bound states.

Lab 7. Protein Motifs and Interacting proteins

Now we want to look at our proteins in the context of the cell. Go to the UCSC human genome link (above) and choose 'Gene Sorter'. Type in a gene name or accession number. You get the tissue distribution of your gene (green is overexpressed, red is underexpressed). Record the most obvious results. One way to understand protein function is to find ‘motifs’, small regions of the protein that are modified (phosphorylated, glycosylated) or that carry out some function or interaction. We will use ProSite (you can Google: Prosite protein). Go to the site and read the info there. Paste your protein sequence into the ‘Tools for ProSite’ window and hit the ‘Quickscan’ button. You will get a list of motifs and where they occur in your protein. Note that just because a motif is present doesn’t mean it is used in the cell or that interactions occur. ProSite lists the potential sites. You have to study them in greater detail. Record at least three motifs in your protein.
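A ProSite pattern is essentially a search expression, and the scan can be imitated in a few lines of Python. This is a minimal sketch that handles only the common pattern elements (x, [ ], { }, repeat counts and the < > anchors), and the function names are mine. The example in the usage note is the N-glycosylation pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple ProSite pattern into a Python regular expression."""
    regex = pattern.rstrip(".").replace("-", "")
    # {P} means "any residue except P"; do this before parentheses become braces.
    regex = regex.replace("{", "[^").replace("}", "]")
    # x(2,4) means "2 to 4 of any residue".
    regex = regex.replace("(", "{").replace(")", "}")
    regex = regex.replace("x", ".")
    # < and > anchor the pattern to the N- or C-terminus.
    return regex.replace("<", "^").replace(">", "$")


def scan(sequence, pattern):
    """Return (1-based position, matched residues), including overlapping hits."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]
```

For example, scan(your_protein, "N-{P}-[ST]-{P}") lists potential N-glycosylation sites. As noted above, a match only marks a potential site.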

Now to look at protein interaction directly. So far, in the lab, we have looked at one protein, with a specific function. We have also looked at homologs of this protein, and studied how it has changed or mutated. Now we want to get a broader picture of the function of the protein. Proteins do not function in a vacuum. They interact with other proteins, physically, and biochemically. Here we will look for proteins that physically interact with your protein. Presumably these are involved in some common process. The following exercise will work best with proteins that have known interaction partners. If your protein does not work, try p53.


Go to the Database of Interacting Proteins (DIP). Choose 'Search'. Search a 'node'. A node in the interaction map of all proteins is a spot where proteins interact. Choose the Node Annotation [name of your protein] and Organism. Click the Links button to see the results. You should see your original protein and a list of others that interact with it. Write them down.

For human genes another option is the human genome resources at NCBI. From NCBI home, choose the toolbar link to Human Genome Resources. If your gene is not human you can use p53 for this exercise. Search for your human gene. You will get a list of possible genes. If you are using p53, use TP53 as your gene and click on it. If you are using another gene, search the options to find the one that is your gene.

On the page for your gene is a huge collection of data! You might want to look it over. For this exercise, go to the Table of contents to the right near the top of the page and choose 'Interactions'. You will be directed to a report listing other proteins that interact with your protein. Write down the other proteins. If you used the same protein for DIP and NCBI, are the listed partners the same?
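Comparing the two partner lists is easy to do by eye for short lists, but for long ones a few lines of Python help. This sketch assumes you have typed each list in yourself, and it only tidies capitalization and spaces; the two databases may use different names for the same protein, which it will not detect. The function names are mine.

```python
def normalize(names):
    """Uppercase and trim names so 'Mdm2 ' and 'MDM2' compare equal."""
    return {name.strip().upper() for name in names}


def compare_partners(dip_names, ncbi_names):
    """Split partners into those in both lists and those in only one."""
    dip = normalize(dip_names)
    ncbi = normalize(ncbi_names)
    return {
        "both": sorted(dip & ncbi),
        "dip_only": sorted(dip - ncbi),
        "ncbi_only": sorted(ncbi - dip),
    }
```

Partners that appear in both databases are a good place to start your PubMed reading.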

Now go to PubMed and see what these interacting proteins do. Are they subunits of a single enzyme? Are they inhibitors of your protein? Activators of your protein? Do they just use your protein to localize? Can you get a better idea of the processes your protein is involved with?

The study of the interactions and processes of molecules in the cell is called 'Systems Biology'. Systems biology treats the cell as the sum of all of the relationships and regulation of its individual genes.
To look further at systems biology resources, Google 'Reactome'. You will get a site with curated information about systems. Try looking at the information on apoptosis, or other topics. For a more technical systems biology site, try Googling 'JWS online', which allows you to model metabolic reactions. Choose the first model and run, then when you get the model choose 'evaluate'. You get a nice graph of the predicted rate of this complex reaction. The eventual goal of researchers is to duplicate the dynamics of the complete human metabolism in the computer (a virtual person).

Report:

Report three ProSite motifs. Where are they in your protein? What do they do? What other proteins might interact with your protein based on the motif evidence?

Report two genes that interact with your protein (or p53) based on interaction databases. Explain the functional significance of the interaction based on PubMed study of the roles of the proteins. Compare interactions reported by DIP to those reported by NCBI.

Define your gene in terms of gene regulation, the main process that it is involved in, and how it functions in that process with other genes. Show a JWS online graph.

All reports for the Bioinformatics section of Bio 537 are due no later than March 12, 2009 (M816). Hard copies only; no e-reports will be accepted.

 


Updated 2/19/09      Dr. Lorraine Marsh