Survey of Genomics

Townley Chisholm and Ayush Noori | EduSTEM Advanced Biology

Please complete the following exercises for a fun and challenging survey of bioinformatics, with a focus on genomics. After you are finished, check your work using the attached answer key.

Open a web browser and search for NCBI. Bookmark the NCBI home page, then click on the Taxonomy link in the left-hand column under Resource List. Now click on Taxonomy again under Databases to go to the Taxonomy homepage.

1. Click on Taxonomy Statistics.

a. How many organisms were in the sequence database as of 2018 (all dates)?

b. How many organisms were in the database in 2009 (click on 2009)?

2. Now visit the extinct organisms link (scroll down the left-hand column and click on extinct organisms). On the new page, choose an organism (Homo sapiens neanderthalensis is a good choice) and see how many nucleotide sequences have been sequenced for that organism.

Nucleotide: _______ (mRNA based)
Nucleotide Genome Survey Sequence: ________ (DNA based)

a. What are most of the genes? (Click on the number next to Nucleotide in Entrez records to find out.)

b. Why were so many copies of these genes sequenced?

c. Go back to the previous page and look at the lineage for this organism.

d. Note that Svante Pääbo et al. have published a complete genome for Homo sapiens neanderthalensis derived from a single toe bone 38,000 years old.
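The record counts you read off the Taxonomy pages can also be requested programmatically. The sketch below only builds a query address for NCBI's E-utilities ESearch service; it does not contact the server. The helper name build_count_url is hypothetical and not part of the worksheet, and the organism shown is simply the one used in the next exercise.

```python
from urllib.parse import urlencode

# Base address of NCBI's public E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_count_url(database: str, organism: str) -> str:
    """Return an ESearch URL that asks only for the number of matching records.

    build_count_url is a hypothetical helper written for this worksheet.
    """
    params = {
        "db": database,                   # Entrez database, e.g. "nucleotide"
        "term": f"{organism}[Organism]",  # restrict the search to one organism
        "rettype": "count",               # return the hit count, not the ID list
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Any organism name from the Taxonomy Browser can be substituted here.
url = build_count_url("nucleotide", "Drosophila melanogaster")
print(url)
```

Pasting the printed address into a browser should return a short XML document containing a Count element. That number may differ from the one on the Taxonomy page, since the databases are updated continuously.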

3. Now go back to the Taxonomy Browser and click on Drosophila melanogaster. In the new page, click on Drosophila melanogaster again.

a. How many entries are there under Nucleotide?

b. How many entries are there under EST?

c. What are ESTs? Start by looking here:

4. The Drosophila genome has been sequenced many times; this sequence is the “Reference Genome”. Click on the 1 beside Genome, then click on Representative, and then scroll down to see the X chromosome in the chart.

a. How big is the X in base pairs?

b. How many genes are annotated here?

c. Now click on the X above the X chromosome to view the Genome Data Viewer. Explore a bit.

Now return to the NCBI homepage and click on OMIM (scroll down under All Databases). Type in hypercholesterolemia, autosomal dominant. On the new page click on #143890 and read the Text (under Clinical Synopsis).

d. Why are there different gene map loci for this disorder?

e. You are interested in 19p13.2 Hypercholesterolemia, familial. What is the molecular cause of this condition? Look under Pathogenesis. Why is it autosomal dominant?

Click on the Clinical Synopsis label.

f. What are the effects on cholesterol levels in the blood and on the cardiovascular system for heterozygotes?

g. What are the effects on cholesterol levels in the blood and on the cardiovascular system for homozygotes?

5. Click on the Population Genetics label. What are the frequencies of heterozygotes ___________ and homozygotes ___________ in the population?

a. How common is this heterozygous disorder in the general population?

b. How common is this heterozygous disorder among heart attack survivors?

c. Why is it so common among heart attack survivors?
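Heterozygote and homozygote frequencies are linked by simple arithmetic if you assume Hardy-Weinberg equilibrium (random mating, no selection). That assumption is ours, not OMIM's, and founder populations may not satisfy it. Under it, carriers occur at frequency 2q(1 - q) and affected homozygotes at q², where q is the frequency of the disease allele. The sketch below uses a made-up carrier frequency of 1 in 100; substitute the value you recorded above.

```python
import math


def allele_frequency_from_carriers(het_freq: float) -> float:
    """Solve 2q(1 - q) = het_freq for the rarer allele frequency q.

    Assumes Hardy-Weinberg equilibrium, which the worksheet does not state.
    """
    if not 0 <= het_freq <= 0.5:
        raise ValueError("heterozygote frequency must lie between 0 and 0.5")
    # Smaller root of the quadratic 2q^2 - 2q + het_freq = 0.
    return (1 - math.sqrt(1 - 2 * het_freq)) / 2


def homozygote_frequency(het_freq: float) -> float:
    """Expected frequency of homozygotes (q squared) for a given carrier frequency."""
    q = allele_frequency_from_carriers(het_freq)
    return q * q


# Hypothetical input: 1 carrier per 100 people (not the OMIM figure).
carriers = 1 / 100
print(f"allele frequency q = {allele_frequency_from_carriers(carriers):.5f}")
print(f"homozygotes: about 1 in {1 / homozygote_frequency(carriers):,.0f}")
```

Compare the homozygote frequency predicted from your recorded heterozygote frequency with the one OMIM reports. A large mismatch would suggest that the equilibrium assumption does not hold.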

6. Why are the Afrikaner and Ashkenazi Jewish populations particularly at risk?

a. How common are heterozygotes in the Afrikaans-speaking population of South Africa?

b. When was the Jewish community established in Lithuania? ___________

c. What did that population do afterward? With what result?

7. Click on Text Description and read the text. What roles do the SNPs play in the forms of hypercholesterolemia other than the form due to mutations in the LDLR? How do you think these associations were found?

In the Phenotype-Gene Relationships table, click on 606945 (LDLR, or low density lipoprotein receptor, in the Gene/Locus column). Once there, read the Description, Cloning and Expression, and Gene Function sections.

a. What is this gene’s cytogenetic location? What is its genomic location?

b. Under Cloning and Expression: 13 of the 18 exons of LDLR are homologous to parts of other genes. Describe these similarities.

c. What is a “mosaic protein” and why can’t it be assigned to a supergene family?

d. What is the significance of the DNA sequence homology shared by the LDL receptor and Epidermal Growth Factor?

e. How did this LDL receptor evolve? Include a definition of exon shuffling in your answer.

f. Under Gene Function: What is the main function of the gene and its protein?

g. Under LDL Receptor as a Viral Receptor: How does hepatitis C virus use the receptor?

8. Now click on the Evolution option (under Table of Contents) to read about Alu sequences in this outdated entry.

a. How long are they?

b. How many copies do you have?

c. What percent of your DNA do they make up?

d. Where did they come from and what do they do?

e. How long have these Alu sequences been in our DNA (on average)? Why are they found only in primates?

f. How often should Alu elements be found in the genome if they were distributed at random? Describe the clustering of Alu elements.
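The arithmetic behind parts b, c, and f can be sketched as follows. If the copies of a repeat were spread evenly, the average distance between them would be the genome size divided by the copy number, and the fraction of the genome they occupy would be element length times copy number divided by genome size. The function names are hypothetical, and the numbers in the example are toy values, not the figures from the OMIM entry.

```python
def expected_spacing_bp(genome_size_bp: float, copy_number: float) -> float:
    """Average distance between copies if they were spread evenly over the genome."""
    return genome_size_bp / copy_number


def genome_fraction(element_length_bp: float, copy_number: float,
                    genome_size_bp: float) -> float:
    """Fraction of the genome occupied by all copies of the element."""
    return element_length_bp * copy_number / genome_size_bp


# Toy values for illustration only; replace them with the numbers you recorded.
toy_genome = 1_000_000   # hypothetical genome size in base pairs
toy_copies = 200         # hypothetical number of repeat copies
toy_length = 100         # hypothetical repeat length in base pairs

print(expected_spacing_bp(toy_genome, toy_copies))          # one copy every 5000.0 bp
print(genome_fraction(toy_length, toy_copies, toy_genome))  # 0.02, i.e. 2% of the genome
```

If the spacing you observe in a real region is much smaller than the evenly spread expectation, the elements are clustered there.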

9. Return to the NCBI homepage and then click on PubMed (under All Databases) and search for “autosomal dominant hypercholesterolemia and ALU sequences”. Under the search bar, click on Sort by: Most Recent.

Click on the “Free PMC Article” link under the 2010 study, “Genomic characterization of large rearrangements of the LDLR gene in Czech patients with familial hypercholesterolemia” (the fourth article in the list).

a. What % of familial hypercholesterolemia is due to large DNA rearrangements?

b. How many Alu elements are within the LDLR gene? Where are they?

c. What percentage of the introns do Alu elements comprise?

d. According to this more recent article how many Alu copies lie in the human genome?

e. What % of the genome do they comprise?

f. What is the connection between Alu elements and faulty LDL receptors? Explain.

Navigate back to the list of articles. Read the abstract for the fifth entry in the list, “Genomic characterization of five deletions in the LDL receptor gene in Danish Familial Hypercholesterolemic subjects.”

g. What deletions caused the LDLR mutations in this study?

10. Now return to the NCBI homepage and search Nucleotide (under Popular Resources) for “Drosophila melanogaster twin of eyeless (toy)”; scroll down to Items.

How many bases are in the twin of eyeless (toy) mRNA, transcript variant C (the first entry)?

a. Return to the NCBI main page, search PubMed for PMID 7914031, and read the abstract. What surprising conclusion does the article reach?

b. Return to the NCBI homepage and search all databases for human aniridia by typing in NM_001604. On this new page record the number of bases in this mRNA ____________.

c. Click on FASTA. Now look under “Analyze this sequence” (on the right) and click on Run BLAST to go to the site that performs alignment comparisons.

d. The Query Sequence is already entered for you. Under “CHOOSE SEARCH SET” click on “Others (nr etc.)” and make sure that “Nucleotide collection” is selected as the database.

e. Under “PROGRAM SELECTION” click on “Somewhat similar sequences (blastn).”

f. Now hit BLAST (at the bottom of the page) and wait for your results.

BLAST reports a “score” and an “E value” for each alignment. Higher scores indicate longer or better-matched alignments, while the E value estimates how many alignments with that score or better would be expected by chance in a database of this size. An E value close to zero therefore means the similarity is very unlikely to be due to chance and is best explained by homology.
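To see how scores and E values relate, the sketch below uses the textbook approximation E = m × n × 2^(−S'), where S' is the bit score, m the query length, and n the total length of the database. This is a simplification: BLAST itself uses corrected "effective" lengths, so the values on your results page will not match exactly. The lengths used here are hypothetical.

```python
def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Approximate BLAST E value from a bit score: E = m * n * 2**(-S').

    Simplified sketch: ignores the effective-length corrections BLAST applies.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 1,000-base query against a database of one billion bases.
m, n = 1_000, 1_000_000_000
for score in (40, 100, 200):
    print(score, expect_value(score, m, n))
```

Each additional bit of score halves the E value, which is why high-scoring alignments have E values so close to zero that BLAST often displays them as 0.0.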

g. How long is the human gene (query)?

h. Look at the Color Key for alignment scores. How similar are these sequences?

i. What are the organisms with similar, nearly identical PAX6 genes?

j. Scroll down to the first entry in “Alignments” and click on the blue Gene label beside the first entry and click again on Genomic Context on the page that appears to find where the gene is located. Scroll down and look under “Genomic Context.” Record that position here: ____________.
Nucleotides? ____________ Exons? ____________

k. While on this page, look in the “Summary” box to see where the gene is expressed and record that information here:

If you click on Phenotypes you will find a list of developmental defects caused by defects in this gene.

l. Now go back to the FASTA sequence page from Step 10c and click on the blue UniGene label (under the Related Information column on the right). Now click on Paired Box 6. How similar is it to other PAX6 proteins in:

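Organism | Percent Similarity | Aligned Amino Acid Length
Mouse (Mus musculus) | ____________ | ____________
Chicken (Gallus gallus) | ____________ | ____________
Zebrafish (Danio rerio) | ____________ | ____________
Fly (Drosophila melanogaster) | ____________ | ____________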

m. Now go back to the BLAST results page. Scroll to the first entry under Alignments. Click on Genome Data Viewer (right-hand side) beside the first entry to see the genes adjacent to this gene on Chromosome 11. How many base pairs are you seeing in this first view? ____________
How many genes? ____________

n. What are the adjacent genes (mouse over the gene initials to view)?
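The base counts requested in steps 10b and 10g can be checked against the FASTA text itself. A FASTA record is a header line beginning with ">" followed by one or more lines of sequence, so the length is the total number of characters on the sequence lines. The sketch below is a minimal parser; the record shown is a toy sequence, not the real NM_001604 entry.

```python
def fasta_lengths(fasta_text: str) -> dict[str, int]:
    """Map each record identifier in FASTA-formatted text to its sequence length."""
    lengths: dict[str, int] = {}
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">".
            words = line[1:].split()
            current = words[0] if words else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths


# Toy record for illustration; paste the FASTA text from NCBI in its place.
example = """>toy_record hypothetical sequence, not a real accession
ATGCATGCAT
GGCC
"""
print(fasta_lengths(example))  # {'toy_record': 14}
```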

11. Now go back to the NCBI homepage and click on Structure under All Databases; then type 6PAX in the Search box and click Go.

a. Look at the result. What animal is it from? ____________

b. Click on View Structure (in the right-hand box) to view the graphic. How many polypeptides are here?

c. What other molecule is here?
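Questions b and c can also be approached from the structure file rather than the graphic. In the fixed-column PDB text format, each ATOM line carries a residue name in columns 18-20 and a chain identifier in column 22, so listing the chains and their residue types shows how many polypeptides and nucleic acid strands are present. The sketch below runs on toy lines built in the code, not on the real 6PAX file; the helper names are hypothetical.

```python
AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}
NUCLEOTIDES = {"DA", "DC", "DG", "DT", "A", "C", "G", "U"}


def classify_chains(pdb_text: str) -> dict[str, str]:
    """Label each chain by the type of the first residue seen on its ATOM lines."""
    chains: dict[str, str] = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 22:
            continue
        residue = line[17:20].strip()  # columns 18-20: residue name
        chain = line[21]               # column 22: chain identifier
        if residue in AMINO_ACIDS:
            kind = "protein"
        elif residue in NUCLEOTIDES:
            kind = "nucleic acid"
        else:
            kind = "other"
        chains.setdefault(chain, kind)
    return chains


def toy_atom_line(serial: int, atom: str, residue: str, chain: str, number: int) -> str:
    """Build the first 26 columns of an ATOM line (coordinates omitted)."""
    return f"ATOM  {serial:>5} {atom:<4} {residue:>3} {chain}{number:>4}"


# Toy structure: one short protein chain and two DNA strands (hypothetical).
toy_pdb = "\n".join([
    toy_atom_line(1, "N", "MET", "A", 1),
    toy_atom_line(2, "N", "GLY", "A", 2),
    toy_atom_line(3, "P", "DA", "B", 1),
    toy_atom_line(4, "P", "DT", "C", 1),
])
print(classify_chains(toy_pdb))
```

To try this on a real entry, download the PDB-format file for the structure, read it into a string, and pass it to classify_chains.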

12. What is the role of this protein? Read the Citation abstract to find out. Note that the full text version of the article is available free. Summarize its role:

13. Look at the molecular model and click on the “Full Featured 3D viewer”. Try selecting Style > Rendering Shortcuts > Toggle Side Chains. Now rotate this molecule with the mouse and zoom in and out (using controls at the top of the page under View).

a. How would you describe the degree of fit between the protein and the other molecule?

b. What is the difference between the major and minor DNA grooves?

c. How do different parts of Pax-6 bind to the bases in these grooves?

With these courses, we hope to further our mission to make high-quality STEMX education accessible for all. For questions or support, please feel free to reach out to me at

Best Regards,

Ayush Noori

EduSTEM Boston Chapter Founder


  1. NCBI PubMed

The premier source of past and present medical literature. Most supplemental information in Extensions is available via PubMed. When searching PubMed, be sure to use the “Free full text” and “Sort by: Best Match” filters to find relevant and accessible results.

  2. RCSB Protein Data Bank (PDB)

A large database of useful 3D structures of large biological molecules, including proteins and nucleic acids. Use the search bar to find a molecule of interest, which can then be examined using the Web-based 3D viewer.