Survey of Genomics Answer Key - Part 1

Ayush Noori | EduSTEM Advanced Biology

Below, please find the answers for the Survey of Genomics exercises, questions 1-8. Check your work after you’ve finished!

Open a web browser and search for NCBI. Bookmark the NCBI home page, then click on the Taxonomy link in the left-hand column under Resource List. Now click on Taxonomy again under Databases to go to the Taxonomy homepage.

1. Click on Taxonomy Statistics.

a. How many organisms were in the sequence database as of 2018 (all dates)?

In 2009, there were 22,010 total taxonomy nodes in the database, with 16,952 species.

b. How many organisms were in the database in 2009 (click on 2009)?

In 2009, there were 22,010 total taxonomy nodes in the database, with 16,952 species.

2. Now visit the extinct organisms link (scroll down the left-hand column and click on extinct organisms), choose an organism (Homo sapiens neanderthalensis is a good choice) on the new page, and see how many nucleotide sequences have been sequenced for that organism.

Nucleotide: 46 (mRNA based)

Nucleotide Genome Survey Sequence: 1,326 (DNA based)

a. What are most of the genes? (Click on the number next to Nucleotide in Entrez records to find out.)

Most of these genes are mitochondrial genes. Out of the 46 nucleotide sequences, 40 are mitochondrial genes.

b. Why were so many copies of these genes sequenced?

Mitochondrial DNA (mtDNA) has largely been sequenced over nuclear DNA for several reasons:

  • Neanderthal cells contained from 500 to over 1,000 copies of the mtDNA genome, but only two copies of the nuclear genome. Therefore, it is vastly easier to extract mtDNA from the fossilized bones.
  • Hominid mtDNA is composed of approximately 16,500 base pairs encoding 37 genes. Compared to the 3,235 Mb nuclear genome, the mtDNA genome is significantly easier to sequence; this was especially true in 1997, when Pääbo et al. published the first Neanderthal mtDNA sequence.
  • Due to the higher mutation rate of mtDNA compared to nuclear DNA, mtDNA is often more useful for resolving phylogenetic relationships between species that have recently diverged in evolutionary time.
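The arithmetic behind the first two bullets can be checked with a short sketch. It uses only the figures quoted above (16,500 bp, 3,235 Mb, and 500 to 1,000 mtDNA copies per cell). The value of two nuclear genome copies per cell is an assumption (a diploid cell), and the function names are illustrative only.

```python
MTDNA_BP = 16_500             # approximate hominid mtDNA length quoted above
NUCLEAR_BP = 3_235_000_000    # 3,235 Mb nuclear genome quoted above
MTDNA_COPIES = (500, 1_000)   # quoted range of mtDNA copies per cell
NUCLEAR_COPIES = 2            # assumption: a diploid cell


def size_ratio(nuclear_bp: int, mtdna_bp: int) -> float:
    """How many times longer the nuclear genome is than the mtDNA genome."""
    return nuclear_bp / mtdna_bp


def copy_excess(mtdna_copies: int, nuclear_copies: int) -> float:
    """How many mtDNA templates exist per nuclear genome copy in one cell."""
    return mtdna_copies / nuclear_copies


if __name__ == "__main__":
    print(f"Nuclear genome is ~{size_ratio(NUCLEAR_BP, MTDNA_BP):,.0f}x longer")
    for copies in MTDNA_COPIES:
        excess = copy_excess(copies, NUCLEAR_COPIES)
        print(f"{copies} mtDNA copies -> {excess:.0f}x more templates per locus")
```

Under these assumptions, any given mtDNA position starts out hundreds of times more abundant than a nuclear position, while the target to be sequenced is about five orders of magnitude shorter.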

c. Go back to the previous page and look at the lineage for this organism.


Eukaryota; Opisthokonta; Metazoa; Eumetazoa; Bilateria; Deuterostomia; Chordata; Craniata; Vertebrata; Gnathostomata; Teleostomi; Euteleostomi; Sarcopterygii; Dipnotetrapodomorpha; Tetrapoda; Amniota; Mammalia; Theria; Eutheria; Boreoeutheria; Euarchontoglires; Primates; Haplorrhini; Simiiformes; Catarrhini; Hominoidea; Hominidae; Homininae; Homo; Homo sapiens

d. Note that Svante Pääbo et al. have published a complete genome for Homo sapiens neanderthalensis derived from a single toe bone 38,000 years old.

Above is a dorsal view of the Neanderthal toe bone from the Pääbo lab (image not included). Pääbo et al. published the first paper reporting a Neanderthal mtDNA sequence in 1997. Below is the 3.5 g section of the right humerus, which was hydrolyzed in acid, then cloned via PCR in a plasmid vector and sequenced (image not included).

Source: Krings M, Stone A, Schmitz RW, Krainitzki H, Stoneking M, Pääbo S. Neandertal DNA sequences and the origin of modern humans. Cell. 1997 Jul 11;90(1):19-30.

Pääbo et al. then published a draft Neanderthal genome in 2010 based on data from three Neanderthals, and ultimately were delighted to find a proximal toe phalanx containing an abundance of Neanderthal DNA. Their findings, published in Nature in 2013, may be useful in elucidating the evolutionary differences between Neanderthals and Denisovans, another group of ancient hominids. Denisovan DNA was sequenced from a finger phalanx bone discovered in the same cave as the Neanderthal toe bone: the East Gallery of Denisova Cave in the Altai Mountains.


  1. Green RE, Krause J, Pääbo S, et al. A Draft Sequence of the Neandertal Genome. Science. 2010 May 7; 328(5979): 710–722.

  2. Prüfer K, Racimo F, Pääbo S, et al. The complete genome sequence of a Neanderthal from the Altai Mountains. Nature. 2014 Jan 2;505(7481):43-9.

3. Now go back to the Taxonomy Browser and click on Drosophila melanogaster. In the new page, click on Drosophila melanogaster again.

a. How many entries are there under Nucleotide?

There are 225,995 entries under Nucleotide.

b. How many entries are there under EST?

There are 821,005 entries under Nucleotide EST.

c. What are ESTs? Start by looking here: .

Expressed sequence tags, or ESTs, are cDNA sequences from 600 bp to 1,000 bp in length which correspond to expressed mRNAs. The RNA population within a given cell is reverse transcribed to yield a cDNA library; high-throughput single-pass sequencing of cDNA clones then produces ESTs. EST analysis is used to identify protein-coding genes and characterize gene sets expressed across a wide variety of organisms.

4. The Drosophila genome has been sequenced many times; this sequence is the “Reference Genome”. Click on the 1 beside Genome, then click on Representative, and then scroll down to see the X chromosome in the chart.

a. How big is the X in base pairs?

The X chromosome is 23.54 Mb, or 23,542,271 base pairs, in length.

b. How many genes are annotated here?

2,671 genes are annotated here.
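As a quick sanity check on the two answers above, the following sketch converts them into an average gene density. It uses only the chromosome length and gene count reported above. The function names are illustrative, and the result is a simple average that ignores how unevenly genes are actually distributed along the chromosome.

```python
X_LENGTH_BP = 23_542_271   # X chromosome length reported above
X_GENES = 2_671            # annotated genes reported above


def genes_per_mb(genes: int, length_bp: int) -> float:
    """Average number of annotated genes per megabase."""
    return genes / (length_bp / 1_000_000)


def bp_per_gene(genes: int, length_bp: int) -> float:
    """Average chromosome length available per annotated gene."""
    return length_bp / genes


if __name__ == "__main__":
    print(f"{genes_per_mb(X_GENES, X_LENGTH_BP):.1f} genes per Mb")
    print(f"one gene per {bp_per_gene(X_GENES, X_LENGTH_BP):,.0f} bp on average")
```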

c. Now click on the X above the X chromosome to view the Genome Data Viewer. Explore a bit.

Below is a map of the X chromosome (NC_004354.4).

Now return to the NCBI homepage and click on OMIM (scroll down under All Databases). Type in hypercholesterolemia, autosomal dominant. On the new page click on #143890 and read the Text (under Clinical Synopsis).

d. Why are there different gene map loci for this disorder?

Familial hypercholesterolemia (FHC) is caused by dysfunctional or absent low-density lipoprotein receptors (LDLRs), which normally bind to low-density lipoproteins (LDLs) and remove them from the bloodstream. There are different gene map loci for this disorder because mutations in several different but related genes can each impair LDL clearance and cause FHC. Some LDLR mutations result in functional defects in the receptors, while others cause the LDLRs to be underexpressed, in both cases leading to elevated LDL levels and abnormal LDL deposition in skin, tendons, and arteries. The most common mutations are in the LDLR gene on chromosome 19p13; however, mutations in the APOB, LDLRAP1, and PCSK9 genes, among others, can also cause FHC.

e. You are interested in 19p13.2 Hypercholesterolemia, familial. What is the molecular cause of this condition? Look under Pathogenesis. Why is it autosomal dominant?

As described above, FHC is caused by defects in the LDLRs. LDL is normally bound by LDLRs at the plasma membrane and trafficked to lysosomes, where it is degraded to repress cholesterol synthesis (via suppression of HMG CoA reductase). In FHC, mutations cause dysfunctional binding to LDLRs, or underexpression of the receptors in liver cells. In addition, stimulation of cholesterol synthesis via the uptake of cholesterol-rich very low-density lipoproteins (VLDLs) exacerbates the condition.

FHC is autosomal dominant because a single mutant allele of the LDL receptor gene is sufficient to cause disease symptoms: affected individuals are typically heterozygous for the mutant allele. Since the LDLR gene lies on chromosome 19, an autosome, the disorder is inherited in an autosomal dominant fashion.

Click on the Clinical Synopsis label.

f. What are the effects on cholesterol levels in the blood and on the cardiovascular system for heterozygotes?

Elevated LDL levels in FHC can lead to corneal arcus (opaque discoloration of the corneal margin), xanthelasma (yellow deposits of cholesterol and fat around the eyelids), coronary disease after 30 years of age, and tendinous xanthomas (yellow depositions of cholesterol-rich fat in tendons of the hands, feet, and heel) after 20 years of age. The formation of arterial atheromatous plaques can obstruct blood flow and result in a heart attack or stroke, as well as ischemia of various organs or muscles.

g. What are the effects on cholesterol levels in the blood and on the cardiovascular system for homozygotes?

Homozygotes experience similar symptoms, but with greater severity. Along with corneal arcus and xanthelasma, they can suffer from coronary disease in childhood, tendinous xanthomas during the first four years of life, and planar xanthomas (plaques or papules containing cholesterol spread over large areas of the body).

5. Click on the Population Genetics label. What are the frequencies of heterozygotes (1 in 500) and homozygotes (1 in 1,000,000) in the population? Thus, FHC is the most frequent Mendelian disorder.

a. How common is this heterozygous disorder in the general population?

Among the general population, the prevalence of heterozygous FHC is 1 in 500.
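The heterozygote and homozygote frequencies quoted in question 5 (1 in 500 and 1 in 1,000,000) are consistent with each other, as the following sketch illustrates. It assumes Hardy-Weinberg equilibrium with a single mutant allele class, which is a simplifying assumption and not something stated in the OMIM entry; the function names are illustrative.

```python
import math


def mutant_allele_frequency(het_freq: float) -> float:
    """Solve 2q(1 - q) = het_freq for the rare-allele root q.

    Assumes Hardy-Weinberg equilibrium (random mating, no selection).
    """
    return (1 - math.sqrt(1 - 2 * het_freq)) / 2


def homozygote_frequency(q: float) -> float:
    """Expected frequency of mutant homozygotes, q squared."""
    return q * q


if __name__ == "__main__":
    q = mutant_allele_frequency(1 / 500)
    print(f"mutant allele frequency q = {q:.6f}")
    print(f"expected homozygotes: about 1 in {1 / homozygote_frequency(q):,.0f}")
```

Under this assumption the mutant allele frequency comes out close to 1 in 1,000, and squaring it gives a homozygote frequency close to the quoted 1 in 1,000,000.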

b. How common is this heterozygous disorder among heart attack survivors?

The prevalence of FHC among survivors of heart attack (myocardial infarction) is 1 in 20.

c. Why so common for heart attack survivors?

The prevalence of FHC among heart attack survivors is likely greater than in the general population because the symptoms, specifically the formation of arterial atheromatous plaques as described above, can promote early-onset myocardial infarction.
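To put a number on "greater than", the two prevalences quoted above can be compared directly. This is a minimal sketch using exact fractions; the function name is hypothetical and the calculation is a plain ratio of prevalences, not a formal risk estimate.

```python
from fractions import Fraction

GENERAL_POPULATION = Fraction(1, 500)     # heterozygous FHC, general population
HEART_ATTACK_SURVIVORS = Fraction(1, 20)  # FHC among heart attack survivors


def fold_enrichment(subgroup: Fraction, baseline: Fraction) -> Fraction:
    """Ratio of prevalence in a subgroup to prevalence in the baseline group."""
    return subgroup / baseline


if __name__ == "__main__":
    ratio = fold_enrichment(HEART_ATTACK_SURVIVORS, GENERAL_POPULATION)
    print(f"FHC is {ratio}-fold more common among heart attack survivors")
```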

6. Why are the Afrikaans and Ashkenazi populations particularly at risk?

The Afrikaner and Ashkenazi populations are particularly at risk due to the founder effect, the loss of genetic variation in the gene pool which occurs when “a new isolated population is founded by a small number of individuals possessing limited genetic variation relative to the larger population from which they have migrated.” Afrikaners are descended from a small population of Dutch and Huguenot settlers of the 17th century, while Ashkenazi Jews are descended from communities along the Rhine River in Western Germany and Northern France (and later in Lithuania) in the Middle Ages who were isolated from the population at large for political and religious reasons. The prevalence of mutant LDLR alleles must have been elevated in the initial founding communities, causing increased prevalence in the descendant Afrikaner and Ashkenazi populations.

Source: Merriam-Webster

a. How common are heterozygotes in Afrikaans-speaking populations in S. Africa?

The prevalence of heterozygous FHC in Afrikaans-speaking rural populations of South Africa is 1 in 71. Note that this study (Steyn et al.) only examined three communities in rural South Africa.

b. When was the Jewish community established in Lithuania?

The Jewish community was established in Lithuania in 1338 AD.

c. What did that population do afterward? With what result?

The establishment of the Lithuanian population of Ashkenazi Jews was followed by a diaspora of the Ashkenazi Jewish population throughout Eastern Europe, thereby spreading the mutant deletion of gly197 (which causes FHC) across the world, eventually to Israel, South Africa, Russia, the Netherlands, and the United States, among other countries.

7. Click on Text Description and read the text. What roles do the SNPs play in the forms of hypercholesterolemia other than the form due to mutations in the LDLR? How do you think these associations were found?

The FHC phenotype can be modified by various single-nucleotide polymorphisms (SNPs). For example, the phenotype of individuals with the mutation IVS14+1G-A can be altered by a SNP in APOA2 (-265T-C, which can alter cholesterol and LDL levels), a SNP in EPHX2, or a SNP in GHR. A L5261 substitution SNP in the promoter of the G-substrate gene (GSBS) increases cholesterol levels in the blood, while a SNP in intron 17 of the ITIH4 gene increases susceptibility to FHC in the Japanese population.

a. In the Phenotype-Gene Relationships Table, click on 606945 (LDLR, or Low Density Lipoprotein Receptor, in the Gene/Locus column). Once there read the description, cloning and gene function parts.

b. What is this gene’s cytogenetic location? What is its genomic location?

The cytogenetic location of LDLR is 19p13.2. The genomic coordinates of LDLR are 19:11,089,361-11,133,829.
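The genomic coordinates can be turned into a gene length with a few lines of code, which gives a useful check against the roughly 45 kilobase figure that Südhof et al. report for the LDLR gene. This is a sketch only: the helper names are hypothetical, and it assumes the coordinates are 1-based and inclusive of both endpoints.

```python
def parse_region(region: str) -> tuple[str, int, int]:
    """Split a region like '19:11,089,361-11,133,829' into its parts."""
    chrom, span = region.split(":")
    start_text, end_text = span.split("-")
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    return chrom, start, end


def region_length(start: int, end: int) -> int:
    """Length in bp, assuming 1-based coordinates inclusive of both ends."""
    return end - start + 1


if __name__ == "__main__":
    chrom, start, end = parse_region("19:11,089,361-11,133,829")
    length = region_length(start, end)
    print(f"LDLR spans {length:,} bp on chromosome {chrom}")
```

The span works out to just under 44.5 kb, in line with the 45 kb gene size.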

Under Cloning and Expression: 13 of the 18 exons of LDLR are homologous to parts of other genes. Describe these similarities.

By studying a 5.3 kilobase cDNA for LDLR, Südhof et al. found that 13 of the 18 exons in the 45 kilobase LDLR gene are homologous to those of other genes. They report that mature LDLR is composed of 839 amino acids with five domains.

  • Five of the exons are similar to the C9 component of complement. Composed of multiple repeat sequences of 40 amino acids each, the first domain is approx. 300 amino acid residues in total length and contains the binding site for LDL apoproteins B and E. The repeat sequence of 40 amino acids is homologous to a single 40 amino acid sequence of human complement component C9, the final component of the complement system, which participates in the formation of the Membrane Attack Complex (MAC), a defense mechanism against bacteria. Below is Figure 7 from Südhof et al., “Comparison of consensus sequence in the binding domain of the LDL receptor with the homologous sequence from complement factor C9.”
  • Three exons encode a sequence similar to a repeat sequence in the precursor for human and mouse EGF (epidermal growth factor, which stimulates cell growth and differentiation by binding to EGFR) and in three zymogens of the coagulation cascade: factor IX (cleaved by factor XIa or VIIa into the active factor IXa, which in turn hydrolyzes an arginine-isoleucine bond in factor X), factor X (activated into factor Xa by factor IXa; it cleaves prothrombin [the thrombin precursor] into active thrombin, which then acts directly to form cross-linked fibrin clots), and protein C (the active form proteolytically inactivates factor Va and factor VIIIa, acting as an anticoagulant). Below is Figure 8 from Südhof et al., “Amino acid alignment of segments A, B, and C from the LDL receptor with homologous regions from the EGF precursor and several proteins of the blood clotting system.”
  • Five other exons encode nonrepeat sequences that are shared with the EGF precursor, as described below.

Source: Südhof TC, Goldstein JL, Brown MS, Russell DW. The LDL Receptor Gene: A Mosaic of Exons Shared with Different Proteins. Science. 1985 May 17; 228(4701): 815–822.

c. What is a “mosaic protein” and why can’t it be assigned to a supergene family?

A mosaic protein is a protein composed of different protein domains, while a supergene family is a group of genes which are located nearby on a chromosome and are functionally related. Since a mosaic protein like LDLR is composed of different, functionally distinct domains inherited from ancestral genes of different families, it belongs to several supergene families at once and cannot be assigned to a single ancestral gene or gene cluster.

d. What is the significance of the DNA sequence homology shared by the LDL receptor and Epidermal Growth Factor?

The genes for human EGF and human LDLR show 33% homology over 400 residues, with eight homologous exons separated by nine introns. Furthermore, Südhof et al. report that “of the nine introns that separate these exons, five are located in identical positions in the two protein sequences,” and one has shifted over several codons. This suggests that the EGF and LDLR genes evolved by duplication of the same ancestral gene, and that they diverged over time by incorporating exons from other gene families (such as the five C9 exons).

Source: Südhof TC, Russell DW, Goldstein JL, Brown MS, Sanchez-Pescador R, Bell GI. Cassette of eight exons shared by genes for LDL receptor and EGF precursor. Science. 1985 May 17;228(4701):893-5.

e. How did this LDL receptor evolve? Include a definition of exon shuffling in your answer.

As described above, the EGF and LDLR genes evolved by duplication of the same ancestral gene and diverged over time by incorporating introns and exons from other gene families. The incorporation of exons from other gene families is called exon shuffling: recombination within unexpressed intron sequences rearranges exons between genes, producing proteins with new combinations of domains and hence novel functions.

Source: Gilbert W. Genes-in-pieces revisited. Science. 1985 May 17;228(4701):823-4.

f. Under Gene Function: What is the main function of the gene and its protein?

As described earlier, the LDLR gene encodes the LDL receptor protein, which binds apoprotein B100 in the outer phospholipid layer of LDL and facilitates endocytosis of LDL via clathrin-coated pits. Lysosomal degradation of LDL releases cholesterol, which suppresses cholesterol synthesis by repressing HMG CoA reductase activity.

g. Under LDL Receptor as a Viral Receptor: How does hepatitis C virus use the receptor?

The hepatitis C virus (HCV) colocalizes with very low-density lipoproteins (VLDLs). Endocytosis of the VLDLs via LDLR recognition of apoproteins B and E allows HCV to enter the cell along with them. Other flaviviruses such as hepatitis G virus and bovine viral diarrheal virus (BVDV) use this method too. Though this is the principal mechanism of HCV entry into cells, HCV may also be able to use other receptors. Other groups have reported that human rhinovirus (HRV2) and Rous sarcoma virus A enter the cell via the same mechanism of action.

Source: Agnello V, Ábel G, Elfahal M, Knight GB, Zhang Q-X. Hepatitis C virus and other Flaviviridae viruses enter cells via low density lipoprotein receptor. Proc Natl Acad Sci U S A. 1999 Oct 26; 96(22): 12766–12771.

8. Now click on the Evolution option (under Table of Contents) to read about Alu sequences in this outdated entry.

a. How long are they?

Alu sequences are approximately 300 bp long.

b. How many copies do you have?

We have 300,000 to 500,000 copies.

c. What percent of your DNA do they make up?

They make up 3% of total DNA.
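The three answers above (length, copy number, and percentage) can be cross-checked against one another. The sketch below multiplies the quoted Alu length by the quoted copy-number range and divides by the 3,235 Mb nuclear genome size used in answer 2b. Treating every copy as a full-length 300 bp element is a simplifying assumption, and the function name is illustrative.

```python
ALU_BP = 300                       # approximate Alu length quoted above
COPY_RANGE = (300_000, 500_000)    # copy-number range quoted above
GENOME_BP = 3_235_000_000          # nuclear genome size quoted in answer 2b


def alu_fraction(copies: int, alu_bp: int, genome_bp: int) -> float:
    """Fraction of the genome occupied, assuming full-length copies."""
    return copies * alu_bp / genome_bp


if __name__ == "__main__":
    for copies in COPY_RANGE:
        fraction = alu_fraction(copies, ALU_BP, GENOME_BP)
        print(f"{copies:,} copies -> {fraction:.1%} of the genome")
```

With these inputs the estimate falls between roughly 3% and 5%, so the quoted 3% sits at the low end of the quoted copy-number range.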

d. Where did they come from and what do they do?

Alu elements are evolutionarily derived from the gene for 7SL RNA. Ullu and Tschudi first proposed that deletion of a 155-base pair sequence of 7SL RNA gave rise to mammalian Alu sequences. Kriegs et al. report that the deletion in 7SL RNA generated the fossil Alu monomer (FAM), which in turn gave rise to the primate-specific free right Alu monomer (FRAM) and the free left Alu monomer type A (FLAM-A). Fusion of FLAM-C, a descendant of FLAM-A, with FRAM gave rise to the modern Alu sequence, a dimer where FRAM contains a 31 bp insert absent from FLAM-C, which contains a unique RNA polymerase III promoter.

Please see Figure 2 from Kriegs et al ., “Evolutionary history of 7SL RNA-derived SINEs in


Ullu E and Tschudi C. Alu sequences are processed 7SL RNA genes. Nature. 1984 Nov 8-14;312(5990):171-2.

Kriegs JO, Churakov G, Jurka J, Brosius J, Schmitz J. Evolutionary history of 7SL RNA-derived SINEs in Supraprimates. Trends Genet. 2007 Apr;23(4):158-61. Epub 2007 Feb 20.

The 7SL RNA is a component of the signal-recognition particle (SRP), which is made of six polypeptides and a single 7SL RNA. The SRP traffics secretory proteins to their required cellular location: it recognizes the signal peptide and binds the ribosome to halt protein synthesis until the SRP binds the SRP receptor in the destination membrane, which contains a transmembrane pore.

Alu repeats comprise 85% of LDLR intronic sequences (outside exon-intron junctions). While the function of 7SL RNA is clearly defined, the biological functions of Alu sequences remain unclear. According to Hobbs et al., recombination of the Alu sequences in intron 4 and intron 5 of the LDLR gene is one cause of FHC. Lehrman et al. report that recombination between Alu sequences in intron 15 and exon 18 caused a 7.8 kilobase deletion in LDLR and subsequent homozygous FHC. Similarly, Horsthemke et al. report that unequal crossing-over between two Alu sequences in intron 12 and intron 14 led to a 4 kilobase deletion and FHC. In summary, Rudiger et al. discuss that recombination hotspots often involve Alu sequences, and hence aberrant recombination between Alu sequences can cause FHC.

e. How long have these Alu sequences been in our DNA (on average)? Why are they found only in primates?

The average Alu sequence probably integrated into its current genomic location between 15 and 30 million years ago. These Alu repeats are specific to primates and therefore were likely not present 65 million years ago. According to Kriegs et al. (see Figure 2 above), the fusion of FRAM with FLAM-C to generate Alu repeat sequences occurred after primates diverged from the ancestor they share with other groups such as Dermoptera (flying lemurs).

f. How often should Alu's be found in the genome if distributed at random? Describe the clustering of Alu's.

The high copy number of the Alu sequence suggests that one should expect to find one sequence every 3,000 to 5,000 base pairs in the human genome if they were randomly distributed. However, studies of the albumin/alpha-fetoprotein family, thymidine kinase, and beta-tubulin genes demonstrate clustering of Alu repeat sequences in some parts of the genome, where they are located in much closer proximity than expected. The beta-tubulin gene has 10 Alu repeats in less than 5 kilobases, and the thymidine kinase gene has 13 Alu repeats in a 10 kilobase stretch. This is also true for the LDLR gene. As I mentioned above, Alu repeats comprise 85% of LDLR intronic sequences (outside exon-intron junctions), or 55% of the total LDLR gene.
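The degree of clustering can be made concrete by comparing the observed spacing in the two example genes with the expected spacing of one Alu per 3,000 to 5,000 base pairs. The sketch below uses only the figures quoted in this answer; the function names are illustrative, and because the beta-tubulin region is "less than 5 kilobases", its spacing of 500 bp is an upper bound.

```python
EXPECTED_SPACING_BP = (3_000, 5_000)   # expected range quoted above

# (region length in bp, number of Alu repeats) as quoted above
OBSERVED = {
    "beta-tubulin": (5_000, 10),       # "less than 5 kilobases": upper bound
    "thymidine kinase": (10_000, 13),
}


def mean_spacing(region_bp: int, alu_count: int) -> float:
    """Average number of base pairs per Alu repeat in a region."""
    return region_bp / alu_count


def fold_denser(expected_spacing: float, observed_spacing: float) -> float:
    """How many times more densely packed the repeats are than expected."""
    return expected_spacing / observed_spacing


if __name__ == "__main__":
    for gene, (region_bp, count) in OBSERVED.items():
        spacing = mean_spacing(region_bp, count)
        low = fold_denser(EXPECTED_SPACING_BP[0], spacing)
        high = fold_denser(EXPECTED_SPACING_BP[1], spacing)
        print(f"{gene}: one Alu per {spacing:.0f} bp, "
              f"{low:.1f}x to {high:.1f}x denser than expected")
```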

With these courses, we hope to further our mission to make high-quality STEMX education accessible for all. For questions or support, please feel free to reach out to me at

Best Regards,

Ayush Noori

EduSTEM Boston Chapter Founder


  1. NCBI PubMed

The premier source of past and present medical literature. Most supplemental information in Extensions is available via PubMed. When searching PubMed, be sure to use the “Free full text” and “Sort by: Best Match” filters to find relevant and accessible results.

  2. RCSB Protein Data Bank (PDB)

A large database of useful 3D structures of large biological molecules, including proteins and nucleic acids. Use the search bar to find a molecule of interest, which can then be examined using the Web-based 3D viewer.