HUMAN GENETICS - BIO 442
THE HUMAN GENOME, DNA, CHROMOSOMES AND GENE STRUCTURE
The Human Genome Project, an international effort to map and sequence all the human chromosomes and also the genomes of several other organisms, began in 1990 with a 15 year timetable that was later revised to target completion in 2003. A working draft was announced in 2000, three years ahead of that target, mostly because of the entry of Craig Venter in 1998. The Human Genome Project began as a publicly funded, international consortium of scientists led by Francis Collins. The funding came primarily from the National Institutes of Health and the Department of Energy and also from a British charity, the Wellcome Trust. Then in 1998, Craig Venter (who had been at NIH) announced that his new company, Celera, could do the job faster and cheaper. And he did! While much work remains to fill in the gaps, this is an amazing accomplishment and it was done in an amazingly short number of years. It took about 75 years from the time the German scientist Friedrich Miescher first isolated nucleic acids from pus (white blood cells) taken from bandages (1869) for scientists to show that nucleic acids were the genetic material (1944). And less than 60 years after that, the entire human genome was sequenced! One scientist has compared this accomplishment to the 1543 publication of the first book on human anatomy. Even though that book identified almost every part of the human body, today we are still struggling to understand how many of the parts work and how they interact. So the party has only begun!
Many genes, and the order in which genes are arranged along a chromosome (synteny), have been found to be shared among many organisms.
In humans, all of our 3 billion base pairs and approximately 20,000 protein coding genes (early estimates ran above 30,000) are compacted and packaged into 23 pairs of chromosomes. In most animals and plants that reproduce sexually, chromosomes come in pairs, with one member of each pair from each of the two parents. Each eucaryotic chromosome is composed of a single molecule of double stranded DNA, five different histones, and some other non histone proteins. The basic eucaryotic chromosome structure consists of DNA wrapped around the evolutionarily conserved histones. There are five types of histones: H1, H2A, H2B, H3 and H4. Approximately two turns of DNA wrap around an octamer composed of two molecules each of H2A, H2B, H3 and H4; this unit is the nucleosome. Histone H1 binds in the region where the DNA enters and exits the nucleosome, presumably stabilizing the DNA at this point. The histones contain a large number of basic amino acids (lysine and arginine), which carry a positive charge and offset the negative charge of the phosphate groups on the DNA molecule. Each nucleosome unit includes approximately 200 base pairs of DNA, with about 146 of them wrapped around the octamer of histones. Because they are essential to the structure of chromosomes, new histones must be synthesized as the DNA is replicated during the S period of the cell cycle.
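As a rough back-of-the-envelope check on the packaging figures above, the following Python sketch estimates how many nucleosomes one haploid genome would need. The function name is made up for illustration, and the inputs are simply the rounded numbers from the text (3 billion bp, about 200 bp per nucleosome, about 146 bp on the histone core); it is an estimate under those assumptions, not a model of real chromatin.

```python
def nucleosome_estimate(genome_bp=3_000_000_000, bp_per_nucleosome=200, core_bp=146):
    """Rough nucleosome count, core histone count, and share of DNA on the core."""
    nucleosomes = genome_bp // bp_per_nucleosome
    # Each octamer holds two molecules each of H2A, H2B, H3 and H4 (8 total).
    core_histones = nucleosomes * 8
    fraction_on_core = core_bp / bp_per_nucleosome
    return nucleosomes, core_histones, fraction_on_core


n, histones, frac = nucleosome_estimate()
print(f"{n:,} nucleosomes, {histones:,} core histones, {frac:.0%} of DNA on cores")
# prints: 15,000,000 nucleosomes, 120,000,000 core histones, 73% of DNA on cores
```

The point of the exercise is the scale: on these assumptions a cell must supply on the order of a hundred million new core histone molecules per haploid genome copy during S period.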
To be functional, a eucaryotic chromosome must contain the following essential components: a centromere, which contains satellite DNA unique to each chromosome and carries the kinetochore, a protein structure to which the spindle fibers attach; a telomere at each end of the chromosome, which contains a special type of repetitive DNA necessary to prevent shortening of the chromosome through the numerous rounds of replication; and origin(s) of replication, which are consensus DNA sequences that bind the various proteins and enzymes required for replication.
Homologous chromosomes are the two members of a chromosome pair, one received from each parent. They contain the same genes for the same traits in the same order, although the two copies of a gene may be different forms (alleles). Sister chromatids are exact replicas of one chromosome. They are formed during the S period of the cell cycle and are connected to one another at the centromere region until they separate at anaphase.
Each gene occupies a specific locus (plural, loci). The locus is the gene's "address." Different versions of a gene found at the same locus on homologous chromosomes are called alleles (short for allelomorphs). Alleles are alternative forms of a gene; alleles that are common in the population are also known as polymorphisms. They arise by mutation.
As a budding human geneticist it is important for you to understand that the only genetic disorders that can be detected by looking at chromosomes (karyotyping) are abnormalities involving changes in the number or structure of chromosomes. These include disorders such as trisomy 21 or Down Syndrome and structural rearrangements such as translocations, additions, and deletions. Some microdeletions can be detected by a procedure known as FISH (fluorescence in situ hybridization). Single gene defects cannot be detected by karyotyping an individual.
Even after a gene has been identified for a genetic disorder, we may not be able to tell if a person or fetus has a mutation in that gene. An example of this is Marfan Syndrome, which is often due to a new spontaneous mutation that can occur anywhere within the fibrillin gene. Detection of genetic disorders is usually possible only if the disorder is caused by one (or a few) different known mutations. It is possible to detect a sickle cell mutation because the same single base change causes the disorder. Although sequencing of genes to find mutations is becoming more common, it is still expensive and it may not be able to distinguish a normal polymorphism from a harmful mutation. The situation in which many different mutations in the same gene can cause a disorder is called allelic heterogeneity, and we will discuss it frequently as we proceed with the course.
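The difference between testing for one known mutation and hunting for an unknown one can be sketched in a few lines of Python. Everything here is illustrative: the function name is hypothetical, the 30 base reference is a short stretch modeled on the start of the beta globin coding sequence, and the sickle cell change is encoded as the A to T substitution that turns the GAG codon into GTG. A real test works on much longer, properly aligned sequences.

```python
def check_known_mutations(sample, reference, known):
    """Return (names of known mutations found, positions of unclassified changes).

    `known` maps a label to (0-based position, reference base, mutant base).
    Changes that match no known mutation are reported but cannot be judged
    harmful or harmless by this kind of test.
    """
    if len(sample) != len(reference):
        raise ValueError("sample and reference must be aligned and equal length")
    found = [name for name, (pos, ref, alt) in known.items()
             if reference[pos] == ref and sample[pos] == alt]
    known_hits = {(pos, alt) for pos, _ref, alt in known.values()}
    unclassified = [i for i, (r, s) in enumerate(zip(reference, sample))
                    if r != s and (i, s) not in known_hits]
    return found, unclassified


# Illustrative reference: ATG GTG CAT CTG ACT CCT GAG GAG AAG TCT
reference = "ATGGTGCATCTGACTCCTGAGGAGAAGTCT"
known = {"sickle cell (GAG to GTG)": (19, "A", "T")}

sickle_sample = "ATGGTGCATCTGACTCCTGTGGAGAAGTCT"
novel_sample = "ATGGTACATCTGACTCCTGAGGAGAAGTCT"

print(check_known_mutations(sickle_sample, reference, known))
print(check_known_mutations(novel_sample, reference, known))
```

The first sample is flagged by name because the change is on the known list. The second sample carries a change the test has never seen, so all the sketch can do is report the position, which mirrors the Marfan Syndrome problem described above.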
There are coding and non coding sequences in nuclear DNA.
Seventy five percent of our genome is unique or single copy DNA, which includes the genes that code for proteins. "Real" genes have coding sequences (exons, including the start and stop codons) and non coding DNA sequences (regulatory and other) associated with them. We know that many of these non coding regions are absolutely necessary for the functioning of the gene. These include DNA binding motifs for regulatory proteins, promoters (housekeeping and other), 5' and 3' untranslated regions, introns, polyadenylation signals, and RNA processing signals. The coding regions of genes are only a small proportion of the single copy sequences, since genes have introns and other non coding regions and there are non coding regions between genes. The remaining 25% of our DNA is highly or moderately repetitive. Repetitive DNA can be dispersed (15% of the genome) or tandemly arranged (10%). For much of this DNA the functions have not been completely established. The tandemly repeated DNA is often called "satellite DNA" because when centrifuged in a density gradient this DNA forms bands separate from the bulk of genomic DNA. There are three satellite bands.
One type of repetitive DNA codes for rRNA and tRNA, and these genes form clusters. The rRNA genes in humans are tandemly arranged on the p arms of the five acrocentric chromosomes of the D and G groups (13, 14, 15, 21, 22). These regions are referred to as nucleolar organizing regions (NORs), and they form the nucleolus of the interphase cell. The nucleolus has a fibrous portion, which is open rDNA being transcribed into rRNA, and a granular region where ribosomes are being assembled (the ribosomal proteins are made in the cytoplasm and must be transported into the nucleus).
Genes may be unique sequences or belong to a repetitive gene family such as the globins, actins, myosins or tubulins. A gene family is a set of genes with similar DNA sequences which arose through duplication of an ancestral gene followed by generations of mutation. Members of a gene family may form a cluster on the same chromosome or they may be dispersed on different chromosomes. The alpha globin gene cluster is on human chromosome 16 and the related beta globin cluster is on chromosome 11. Examples of gene families include the rDNAs, tDNAs, the histone genes, the P450 enzyme superfamily, the hemoglobin genes and the actin genes. Pseudogenes may be part of a gene cluster or family. These gene copies are now evolutionary relics, and they arose in one of two ways. The first type arose from the duplication of a gene; the copy can no longer be transcribed or translated because it has accumulated inactivating mutations such as nonsense codons or promoter mutations. The second type is referred to as a retropseudogene because it arose by reinsertion of a cDNA made by a reverse transcriptase using an mRNA template. Retropseudogenes contain no introns, since these were spliced out of the original RNA transcript, and no promoter region, since the promoter is not part of the transcript; therefore, they cannot be transcribed.
The classic macro satellite DNA has repeats of 100 to 6500 bp. This category includes tandemly repeated satellite DNA from the centromeric repeats (171 bp) and the telomeric repeats. The centromeric repeats are referred to as alpha satellite DNA, and each chromosome has its own unique sequence. Because of this, it is possible to make DNA probes specific to each of our 24 different chromosomes (22 autosomes plus the X and Y). When a fluorescent label is added to the probe, it is possible to count the number of each type of chromosome even in an interphase cell. Therefore, it is possible to check for trisomies in interphase amniotic fluid cells prior to culturing them for karyotyping.
Another type of tandemly repetitive DNA is referred to as mini satellite sequences or VNTRs (variable number of tandem repeats). They are composed of 20 to 100 bp repeats. The third type of tandemly repetitive DNA is referred to as micro satellite sequences or STRs (short tandem repeats), composed of 2 to 10 bp repeats. Since the number of repeats in micro and mini satellites is highly variable (polymorphic), they are very useful in gene mapping and in DNA profiling for paternity testing, forensic testing, confirmation of relatedness and identification of human remains. Both VNTRs and STRs are polymorphisms in non coding regions and are inherited in a codominant pattern. They are formed by mutations which add or remove repeats. Most individuals in the population are heterozygous at each of these loci.
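To make the idea of a repeat number genotype concrete, here is a minimal Python sketch that counts the longest run of back-to-back copies of a repeat unit in each of two alleles. The function name, the GATA repeat unit and the short flanking sequences are all made up for illustration; real DNA profiling measures fragment lengths or reads at defined loci rather than scanning raw strings like this.

```python
import re


def longest_tandem_run(sequence, motif):
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = [len(m.group(0)) // len(motif) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Hypothetical alleles at one STR locus, one inherited from each parent.
maternal = "TTC" + "GATA" * 7 + "CCG"
paternal = "TTC" + "GATA" * 10 + "CCG"

genotype = (longest_tandem_run(maternal, "GATA"),
            longest_tandem_run(paternal, "GATA"))
print(genotype)  # (7, 10)
print("heterozygous" if genotype[0] != genotype[1] else "homozygous")
```

Because both repeat numbers can be read directly from the individual, the two alleles are codominant: neither one masks the other.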
About two thirds of the repetitive non coding DNA consists of more complex repeated sequences dispersed or scattered throughout the genome. These can be further divided into short and long interspersed sequences, SINEs and LINEs. LINEs are up to 7000 bp in length and represent about 4% of our total human genome. LINEs are transposable elements that make an RNA coding for reverse transcriptase. The reverse transcriptase can make cDNA from the RNA, which can then reintegrate at another site. SINEs are shorter interspersed elements, 90 to 500 bp in length. One kind of SINE is the Alu sequence, which is about 300 bp in length. Alu sequences are found only in primates, and they are the most frequent human SINE (approximately 5 x 10^5 copies, 3 - 6% of the total human genome). These SINEs are named for the restriction enzyme, AluI, which cuts at AGCT, a site commonly found in the repeat. (AluI is named for the bacterium, Arthrobacter luteus, from which the enzyme came.) These transposable elements can be a significant source of mutation (e.g., hemophilia) and can play a role in rearrangements and gene duplication (e.g., beta globin genes).
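Two quick checks on the Alu figures can be sketched in Python. The first multiplies the copy number by the element length to see what share of a 3 billion bp genome that represents; the second lists where the AGCT recognition site occurs in a sequence. The function names and the short test sequence are made up for illustration, and the inputs are just the rounded numbers quoted above.

```python
def genome_fraction(copies, element_bp, genome_bp=3_000_000_000):
    """Fraction of the genome occupied by `copies` elements of `element_bp` each."""
    return copies * element_bp / genome_bp


def find_sites(sequence, site="AGCT"):
    """Return the 0-based start position of every occurrence of `site`."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


# About 5 x 10^5 copies of a roughly 300 bp element.
print(f"{genome_fraction(5e5, 300):.1%}")  # 5.0%

# Hypothetical sequence containing two AGCT sites.
print(find_sites("GGAGCTTTAGCTA"))  # [2, 8]
```

The 5% result falls inside the 3 - 6% range given in the text, so the copy number, element length and genome share are at least consistent with one another.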
Gene Structure
The definition of a gene has evolved over time. It is no longer a "bead" on a string, nor is it merely a sequence of bases that codes for amino acids in a single polypeptide chain. While the Beadle and Tatum model of "one gene, one enzyme" is enticingly simple, we have had to move on to acknowledge that genes are far more complex. There are non coding regions of DNA associated with genes. These include promoters, transcriptional regulatory sequences, introns, 5' and 3' untranslated regions (UTRs) and polyadenylation signals. Post transcriptional processes that modify the initial RNA transcript usually include 5' cap addition, 3' poly A addition, splicing out of introns and, sometimes, alternative splicing of exons to form different mRNAs from the same gene. Post translational cleavage of proteins, while rare, can also occur, as in the case of insulin and some other hormones. The use of alternative promoters is common and is used to generate cell type specific mRNAs. These alternative promoters may be found within introns of the gene. The human dystrophin (DMD) gene, which has more than 79 exons, has at least eight different alternative promoters! In humans, the vast majority of genes are transcribed individually and, in these cases, the terms gene and transcription unit are essentially equivalent. The usual linear order at a gene site is: regulatory element(s) (where activator or repressor proteins bind); promoter region (where the RNA polymerase complex binds); transcription start site, including the CAP site, which begins the 5' UTR; ATG, the translation initiation codon; exon(s) (variable number); introns (between exons, beginning with 5' GT and ending with 3' AG, variable number); a translation stop codon (TAA, TGA, or TAG); the 3' UTR, containing the AATAAA polyadenylation signal; and the site for addition of the poly A tail.
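The linear order described above can be walked through with a toy example in Python. The sketch below removes an intron after checking that it begins with GT and ends with AG, then reads codons from the first ATG to the first in-frame stop codon. The gene sequence, the intron coordinates and the function names are all invented for illustration, and the sequence is written in the DNA alphabet (T rather than U) to match the text; real splicing depends on much more than the two boundary dinucleotides.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}


def splice(gene, introns):
    """Remove introns given as 0-based, half-open (start, end) intervals."""
    pieces, pos = [], 0
    for start, end in sorted(introns):
        intron = gene[start:end]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            raise ValueError(f"intron at {start}-{end} lacks GT...AG boundaries")
        pieces.append(gene[pos:start])
        pos = end
    pieces.append(gene[pos:])
    return "".join(pieces)


def open_reading_frame(mrna):
    """Return codons from the first ATG through the first in-frame stop codon."""
    start = mrna.find("ATG")
    if start == -1:
        return []
    codons = []
    for i in range(start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        codons.append(codon)
        if codon in STOP_CODONS:
            break
    return codons


# Toy gene: 5' UTR + exon 1 + intron + exon 2 + 3' UTR with the AATAAA signal.
gene = "CCACC" + "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA" + "GCAATAAAGC"
introns = [(11, 23)]

mrna = splice(gene, introns)
print(mrna)                      # CCACCATGGCCAAATGAGCAATAAAGC
print(open_reading_frame(mrna))  # ['ATG', 'GCC', 'AAA', 'TGA']
```

Note that the stop codon TGA only comes into frame once the intron is removed, and that the AATAAA signal sits downstream of the stop codon in the 3' UTR.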