Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain...

11
GENOMICS 13, 1198-1208 (1992) Nucleotide Sequence Analysis of 95 kb near the 3’ End of the Murine T-Cell Receptor CY/~ Chain Locus: Strategy and Methodology RICHARD K. WILSON,’ BEN F. KOOP, CHIA CHEN, NICOLETTE HALLORAN,’ ROGER SCIAMMIS, AND LEROY HOOD Division of Biology, California Institute of Technology, Pasadena, California 91125 Received February 14, 1992 The nucleotide sequence of a region at the 3’ terminus of the murine T-cell receptor CX/~ chain locus is pre- sented. This region, which encodes the constant region genes for LYand 6 chain polypeptides and all 50 joining gene segments for the a chain polypeptide, spans 94,647 bp and includes more than 50 noncoding se- quence elements important for T-cell receptor gene re- arrangement and expression. DNA sequencing of this region included complete analysis of two cosmid clones and five additional restriction fragments using a ran- dom subcloning approach with various manual and au- tomated sequencing strategies. The automated se- quencing strategies hold considerable promise for fu- ture large-scale DNA sequencing efforts. 0 i99z Academic Press. Inc. INTRODUCTION T lymphocytes, one of the two major classes of effector cells in the vertebrate immune system, are capable of recognizing and responding to a virtually limitless uni- verse of antigenic determinants. This specific response, which results in subsequent activation of cell-mediated defenses, is regulated by the membrane-bound T-cell antigen receptor (Tcr), a disulfide-linked heterodimeric glycoprotein composed of cx and /I or 7 and 6 polypeptide chains (reviewed in Wilson et al., 1988a). Like the heavy and light chains of immunoglobulins, the Tcr is encoded by discontinuous elements from a large pool of gene seg- ments that rearrange during the course of T-cell develop- ment. For the a and ?Ichains, variable (TcrV) and join- ing (TcrJ) gene segments are combined to form a func- tional variable region sequence that is spliced to a constant (TcrC) region sequence following transcrip- tion. The fl and ?I polypeptide chains are encoded by TcrV and TcrJ gene segments as well as a small diversity (TcrD) gene segment. This mechanism of Tcr gene seg- ment rearrangement provides in part the high degree of Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under Accession No. M64239. ’ Present address: Department of Genetics, Washington University School of Medicine. St. Louis, MO 63110. receptor diversity required by the immune system to re- spond to an infinite number of antigens. Since the initial analysis of Tcr cDNA sequences, the germline organization and the repertoire of the genes encoding all four receptor polypeptide chains have been partially characterized in mice and humans (reviewed in Lai et al, 1989). Physical maps for portions of the mu- rine Q and /3 chain loci have been constructed using cos- mid cloning and field inversion gel electrophoresis (Lai et al., 1987; K. Wang et al., in preparation). The size of the murine a chain locus has not been determined, al- though its human counterpart appears to span at least 1000 kb (Griesser et al., 1988; Roth et ul., 1988; E. Lai, unpublished data). The murine a locus, located at chro- mosomal position 14C-D (Dembic et al., 1985; Kranz et al., 1985), has been estimated to contain approximately 90-100 TcraV and 50 TcroJ gene segments in addition to a single Tcrc& gene (Lai et al, 1989). One of the more striking features of Tcr gene organization is the inclu- sion of the ?Ichain genes within the much larger CY chain locus (Chien et al., 1987a). Experimental evidence has suggested that the 716 receptor is expressed first in devel- opment and that the 6 chain genes are deleted upon later rearrangement of o chain gene segments. Additionally, the ?jand a! chains may share a limited number of TcrV gene segments (Chien et ul., 1987b). The obvious opportunities for subsequent molecular studies of Tcr gene rearrangement, regulation, and ex- pression and their effects in T-cell development and mat- uration make the CY/~ chain locus an outstanding candi- date for large-scale regional DNA sequence characteriza- tion. To date, the majority of large-scale DNA sequencing efforts (e.g., greater than 50,000 bp) have been aimed at determining the complete nucleotide se- quences of viral and organellar genomes (Chee et al., 1990; Goebel et al., 1990; Baer et al., 1984; Shinozaki et al., 1986; McGeoch et al., 1988; Davison and Scott, 1986; Ohyama et al., 1988; Cummings et al., 1990; Chen et al., 1989; Shehee et al., 1989). These projects have utilized manual radioisotopic DNA sequencing methods exclu- sively, with most employing a random or “shotgun” se- quencing approach. Only one project of greater than 50 kb, the human HPRT locus, has been completed using nonradioisotopic DNA sequencing methods (Edwards et 0888-7543/9’2 $5.00 Copyright 61 1992 by Academic Press, Inc. All rights of reproduction in any form reserved. 1198

Transcript of Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain...

Page 1: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

GENOMICS 13, 1198-1208 (1992)

Nucleotide Sequence Analysis of 95 kb near the 3’ End of the Murine T-Cell Receptor CY/~ Chain Locus: Strategy and Methodology

RICHARD K. WILSON,’ BEN F. KOOP, CHIA CHEN, NICOLETTE HALLORAN,’ ROGER SCIAMMIS, AND LEROY HOOD

Division of Biology, California Institute of Technology, Pasadena, California 91125

Received February 14, 1992

The nucleotide sequence of a region at the 3’ terminus of the murine T-cell receptor CX/~ chain locus is pre- sented. This region, which encodes the constant region genes for LY and 6 chain polypeptides and all 50 joining gene segments for the a chain polypeptide, spans 94,647 bp and includes more than 50 noncoding se- quence elements important for T-cell receptor gene re- arrangement and expression. DNA sequencing of this region included complete analysis of two cosmid clones and five additional restriction fragments using a ran- dom subcloning approach with various manual and au- tomated sequencing strategies. The automated se- quencing strategies hold considerable promise for fu- ture large-scale DNA sequencing efforts. 0 i99z Academic

Press. Inc.

INTRODUCTION

T lymphocytes, one of the two major classes of effector cells in the vertebrate immune system, are capable of recognizing and responding to a virtually limitless uni- verse of antigenic determinants. This specific response, which results in subsequent activation of cell-mediated defenses, is regulated by the membrane-bound T-cell antigen receptor (Tcr), a disulfide-linked heterodimeric glycoprotein composed of cx and /I or 7 and 6 polypeptide chains (reviewed in Wilson et al., 1988a). Like the heavy and light chains of immunoglobulins, the Tcr is encoded by discontinuous elements from a large pool of gene seg- ments that rearrange during the course of T-cell develop- ment. For the a and ?I chains, variable (TcrV) and join- ing (TcrJ) gene segments are combined to form a func- tional variable region sequence that is spliced to a constant (TcrC) region sequence following transcrip- tion. The fl and ?I polypeptide chains are encoded by TcrV and TcrJ gene segments as well as a small diversity (TcrD) gene segment. This mechanism of Tcr gene seg- ment rearrangement provides in part the high degree of

Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under Accession No. M64239.

’ Present address: Department of Genetics, Washington University School of Medicine. St. Louis, MO 63110.

receptor diversity required by the immune system to re- spond to an infinite number of antigens.

Since the initial analysis of Tcr cDNA sequences, the germline organization and the repertoire of the genes encoding all four receptor polypeptide chains have been partially characterized in mice and humans (reviewed in Lai et al, 1989). Physical maps for portions of the mu- rine Q and /3 chain loci have been constructed using cos- mid cloning and field inversion gel electrophoresis (Lai et al., 1987; K. Wang et al., in preparation). The size of the murine a chain locus has not been determined, al- though its human counterpart appears to span at least 1000 kb (Griesser et al., 1988; Roth et ul., 1988; E. Lai, unpublished data). The murine a locus, located at chro- mosomal position 14C-D (Dembic et al., 1985; Kranz et al., 1985), has been estimated to contain approximately 90-100 TcraV and 50 TcroJ gene segments in addition to a single Tcrc& gene (Lai et al, 1989). One of the more striking features of Tcr gene organization is the inclu- sion of the ?I chain genes within the much larger CY chain locus (Chien et al., 1987a). Experimental evidence has suggested that the 716 receptor is expressed first in devel- opment and that the 6 chain genes are deleted upon later rearrangement of o chain gene segments. Additionally, the ?j and a! chains may share a limited number of TcrV gene segments (Chien et ul., 1987b).

The obvious opportunities for subsequent molecular studies of Tcr gene rearrangement, regulation, and ex- pression and their effects in T-cell development and mat- uration make the CY/~ chain locus an outstanding candi- date for large-scale regional DNA sequence characteriza- tion. To date, the majority of large-scale DNA sequencing efforts (e.g., greater than 50,000 bp) have been aimed at determining the complete nucleotide se- quences of viral and organellar genomes (Chee et al., 1990; Goebel et al., 1990; Baer et al., 1984; Shinozaki et al., 1986; McGeoch et al., 1988; Davison and Scott, 1986; Ohyama et al., 1988; Cummings et al., 1990; Chen et al., 1989; Shehee et al., 1989). These projects have utilized manual radioisotopic DNA sequencing methods exclu- sively, with most employing a random or “shotgun” se- quencing approach. Only one project of greater than 50 kb, the human HPRT locus, has been completed using nonradioisotopic DNA sequencing methods (Edwards et

0888-7543/9’2 $5.00 Copyright 61 1992 by Academic Press, Inc. All rights of reproduction in any form reserved.

1198

Page 2: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

THE MURINE T-CELL RECEPTOR o/h CHAIN LOCUS 1199

Large scale M I3 DNA preparation Asymmetric PCR

I

+I Autoradlography

Manual analysis and data entry

Electrophoresis with real-time

analysis and data entry

I

Manual editing of COntlguOuS nUCleOtide

FIG. 1. Chart of the tasks required for DNA sequence analysis using the radioisotopic and fluorescent dideoxynucleotide chemis- tries.

al., 1990). Although several large projects have been completed successfully using manual sequencing meth- ods, the time and personnel required for such efforts is prohibitive for routine DNA sequence analysis of inter- esting regions greater than 50 kb. Since we have devel- oped tools to facilitate high-throughput DNA sequence determination (Smith et al., 1986; Wilson et al., 19&3b, 199Oa, 199Ob), we wished to apply these tools to deter- mine the nucleotide sequence of a large region at the 3’ terminus of the Tcr CX/& chain locus. This region, span- ning 94,647 bp, includes one of two known TcraJ gene segments, the TcnX gene, an inverted TcrfiV gene seg- ment, all of the TcraJ gene segments, and the Tcr& gene. Additionally, this region contains positive and neg- ative regulatory elements for Tcr gene expression (Wi- noto and Baltimore, 1989a, 1989b). Initial DNA se- quencing work employed robotic processing of radioiso- topic dideoxynucleotide sequencing chemistry with subsequent manual data acquisition. Later, fluorescent dye-primer and fluorescent dye-terminator chemistries with real-time automated data acquisition were utilized. The tasks involved for both radioisotopic and fluores- cent sequence analysis are outlined in Fig. 1. During the course of this sequencing project, various strategies were evaluated for random and directed subcloning, DNA

template preparation, processing of DNA sequencing re- actions, sample handling, gel electrophoresis, data ac- quisition, sequence fragment assembly, editing and con- firmation, sequence gap closure, and data management and analysis. These evaluations have resulted in the de- velopment of a useful organizational strategy for rapid large-scale DNA sequence analysis of additional large regions of mouse and human T-cell receptor loci. Here, we report the nucleotide sequence of the 3’ terminal re- gion of the murine Tcr a/?j chain locus, and we describe the strategies and methods used for sequence analysis of this region. In the accompanying paper (Koop et al., 1992), we discuss the molecular genetics of the Tcr cod- ing regions and noncoding regulatory elements con- tained in this region and describe experiments based on the DNA sequence that provide insight into the biologi- cal function of regions contained within Tcr a/a locus.

MATERIALS AND METHODS

Plasmid and cosmid clones containing regions of the murine o/h chain locus are shown in Fig. 2. These clones, isolated from a C57/ B10.D2-H-2ti liver DNA library (Winoto et ul., 1985), were provided by Drs. E. Lai, A. Winoto, and K. Wang (California Institute of Tech- nology). pUCll8 DNA was a gift from Dr. Ellson Chen (Genentech, Inc.). Restriction enzymes, T4 DNA polymerase, T4 DNA ligase, and calf intestinal alkaline phosphatase were obtained from Boehringer- Mannheim Biochemicals (Indianapolis, IN). Recombinant Taq DNA polymerase and chemically modified T7 DNA polymerase were pur- chased from U.S. Biochemicals (Cleveland, OH). T7 DNApolymerase and deoxy- and dideoxynucleotides were obtained from Pharmacia- LKB Biotechnology (Piscataway, NJ); fluorescent dye-labeled primers and dideoxynucleotides were purchased from Applied Biosys- terns, Inc. (Foster City, CA). Radioisotopes were purchased from Amersham, Inc. (Arlington, IL). Oligonucleotide primers for sequenc- ing and PCR amplification were synthesized by the Caltech Applied Microchemical Facility. .4 cup horn sonicator for random DNA shear- ing was purchased from Heat Systems-Ultrasonics (Plainview, NY). The Biomek 1000 robot.ic workstation was provided by the Spinco Division of Beckman Instruments (Palo Alto, CA) and was modified for DNA sequencing by addition of an HCB-1000 heater/cooler unit manufactured by Time-Logic, Inc. (Moorpark, CA). PCR amplifica- tion was performed on a Perkin-Elmer-Cetus (Norwalk, CT) DNA thermal cycler. Fluorescent-labeled sequencing reactions were ana- lyzed on a 373A automated DNA sequencer purchased from Applied Biosystems. DNA sequence data were assembled and analyzed us- ing a variety of software, including Microgenie v. 6.0 (Beckman Instruments), Gene Master (Bio-Rad Laboratories, Richmond, CA), SeqMan (DNASTAR, Inc., Madison, WI), SAP (Dr. Roger Staden, MRC-LMB, Cambridge, UK), and the Lasergene package (DNASTAR, Inc.).

Subcloning. Most of the regions to be sequenced were subcloned using a sonication approach as previously described (Bankier and Barrell, 1983; Deinenger, 1983). Additionally, 20-30 pg of intact ce- sium chloride gradient-purified recombinant cosmid DNA was sheared using the same sonication conditions. After sonication and end repair with T4 DNA polymerase, 0% to 1.6-kb fragments were purified by preparative agarose gel electrophoresis and cloned in the SazuI site of Ml3mplO. Two small restriction fragments (BK6.5 and KK4.2, Fig. 2) that proved difficult to subclone using the sonication method were partially digested with Suu3A, HueIII, or RsoI, and re- sulting fragments of a size range of 0.8 to 1.6 kb were purified by preparative agarose gel electrophoresis and cloned in Ml3mplO. Re- gions of cosmid and plasmid clones that were not represented by ana- lyzed subclones were amplified by polymerase chain reaction using 7’oo DNA polymerase (Saiki et ol., 1985), phosphorylated with poly- nucleotide kinase, and cloned in Ml3mplO or pUCll8.

Page 3: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

1200 WILSON ET AL.

Prepuration of sequencing templates. For radioisotopic DNA se- quencing, single-stranded Ml3 DNA was prepared using a standard miniprep procedure (Messing, 1983). From 2 ml of overnight culture, suficient DNA was prepared for several sequencing reactions. For fluorescent DNA sequencing using the dye-primer chemistry, the method used to prepare single-stranded Ml3 DNA rarely provided sufficient template for a single set of reactions. Increased amounts of Ml3 DNA were prepared by increasing bacterial cultures from 2 to 10 ml. DNA preparation was begun from 10 ml of overnight cultures grown in 50-ml centrifuge tubes. Bacterial cells and polyethylene gly- col-precipitated Ml3 phage were pelleted by centrifugation at 4200 rpm for 5 and 20 min, respectively. Phage pellets were dissolved in 200 ~1 of TE buffer (10 rnM Tris-HCl, pH 8.0, and 0.1 rnM EDTA) and transferred to microcentrifuge tubes for subsequent phenol extrac- tion. From this point, the method progressed as in the standard Ml3 DNA miniprep. At the conclusion of the procedure, DNA yields were carefully quantitated by UV spectrophotometry.

As an alternative, we employed asymmetric polymerase chain reac- tion (APCR) as a means of rapid template DNA preparation [Gyllen- sten and Erlich, 1989). As we have described previously (Wilson et al., 19SOa), oligonucleotide primers were designed for optimal amplifica- tion of insert DNA directly from Ml3 plaques. Following the amplifi- cation reaction, sequencing templates were partially purified by iso- propanol precipitation and used in fluorescent dye-primer sequencing reactions. The APCR method allowed reliable yields of DNA from all Ml3 subclones and typically provided sufficient DNA for two fluores- cent reaction sets.

Regions not present in the random subfragment libraries were am- plified using standard PCR conditions and cloned in pUCll8 or Ml3 vectors. From pUCll8 subclones, template DNA for sequencing reac- tions was prepared from 2-ml overnight cultures using a small-scale alkaline lysis procedure (Birnboim and Doly, 1979), followed by a l-h incubation at 37’C with 200 wg/ml RNase A and subsequent phenol extraction and ethanol precipitation. Recombinant plasmid DNA was ethanol precipitated, dissolved in 100 ~1 of TE buffer, and used in linear amplification sequencing reactions with fluorescent dye primers.

DNA sequencing reactions. Dideoxynucleotide sequencing reac- tions were performed manually or using an automated robotic work- station as previously described (Sanger et al., 1977,198@, Wilson et al., 1988b, 1SSOb). During the course of the project, a method for high- throughput manual processing of DNA sequencing reactions also was developed (Koop et al., 1990). All radioisotopic sequencing reactions utilized [a-‘*P]dATP (800 Ci/mmol) with modified or unmodified T7 DNA polymerase. Typically, 24 Ml3 subclones were analyzed per se- quencing experiment, using a standard two-step extension protocol (Tabor and Richardson, 1987). Sequencing reactions employed 7- deaza-dGTP to minimize intrastrand base pairing and MnC& to pro- duce uniform band intensities (Tabor and Richardson, 1989).

Fluorescent dye-primer sequencing reactions utilized either modi- fied or unmodified T7 DNA polymerase with approximately 2.5 fig of single-stranded template DNA. For both APCR and Ml3 templates, reactions were performed using a one-step extension protocol as previ- ously described (Wilson et al., 199Oa). Linear amplification sequenc- ing reactions (Craxton, 1991) using fluorescent dye-labeled primers were performed using approximately 100 ng of double-stranded plas- mid DNA. The parameters for linear amplification sequencing were 95OC for 30 s, 55’C for 30 s, and 72’C for 1 min, for a total of 15 cycles, then 95’C for 30 s and 72’C for 1 min, for an additional 15 cycles. At the conclusion of the reactions, fluorescent-labeled products from A, C, G, and T reactions were combined in ethanol, precipitated, and loaded on the automated sequencer. DNA sequencing reactions per- formed using fluorescent dye-labeled dideoxynucleotides with custom oligonucleotide primers utilized approximately 1 pg of double- stranded plasmid DNA or 10 pg of cosmid DNA. Linear amplification parameters for dye-terminator reactions were 98’C for 2 s, 55’C for 30 s, and 65OC for 2 min, for a total of 27 cycles. Reaction products subse- quently were purified away from unincorporated dye-labeled dideox- ynucleotides by Sephadex G-50 chromatography and analyzed on the automated DNA sequencer.

Electrophoresis and data acquisition. Radioisotopic DNA sequenc- ing reactions were analyzed by autoradiography following standard 7 M urea, 6% polyacrylamide gel electrophoresis. Typical DNA se- quencing gels were 38 cm (w) x 50 cm (h), cast in 0.2-mm-thin or 0.2. to 0.6-mm wedge forms and contained 48 to 60 sample wells. Gels were loaded either manually or using a plotter-based automated sample loading device. DNA sequence data from autoradiographs were en- tered into a raw sequence database using a sonic digitizer. Software for data entry was provided by Bio-Rad (Gene Master and Gel Read) and DNASTAR, Inc. (EditSeq). Fluorescent DNA sequencing reactions were analyzed on an Applied Biosystems 37OA/373A automated DNA sequencer. Gels were typically 7 M urea, 6% polyacrylamide, contain- ing 16 standard or 24 sharkstooth sample wells. DNA sequencer runs were 12 to 14 h at 26-30 W. DNA sequence data were analyzed in real time by the automated instrument and downloaded to a microcom- puter for sequence assembly and analysis.

Sequence assembly, editing, and data anulysis. Raw DNA sequence data from autoradiographs and from the automated fluorescent se- quencer were assembled using several different programs running on a variety of micro- and minicomputers. For relatively small restriction fragments, sequences were easily assembled using Microgenie v. 6.0, running on an IBM-PC/AT computer or using Gene Master, running on a Compaq 386/20 computer. Large restriction fragments and cos- mid clone sequences were assembled using SeqMan (PC version), run- ning on a Compaq 386/20 computer; SAP, running on a DEC VAXsta- tion computer; or SeqMan (Lasergene version), running on a Macin- tosh SE30 or IIcx computer. Sequence files were transferred between computers using either Apple File Exchange or NCSA Telnet.

Editing of consensus sequences at points of conflict was performed manually from autoradiographs or fluorescent chromatograms. Re- gions of the consensus that were represented only once or contained unreadable sequences were reanalyzed on the opposite strand by PCR amplification and/or direct sequencing using the linear amplification method with fluorescent dye primers or dye terminators. Additional confirmation of many areas in the final consensus sequence was per- formed by restriction enzyme analysis.

Analysis of the final DNA sequence for TcrJ gene segments was performed by comparing the consensus to a file containing all known TcraJ gene segment sequences. Since all TcrJ gene segments contain a highly conserved core sequence and franking rearrangement and splice signals, this method was useful for detecting TcrcyJ gene seg- ments that had not been identified previously by T-cell receptor-spe- cific cDNA sequencing. Additional homologies were detected by com- parison of the MTcr sequence to various nucleotide sequence data- bases.

RESULTS AND DISCUSSION

DNA Sequencing Strategies

Cosmid clones and restriction fragments used for DNA sequence analysis of the 3’ terminal 94.6-kb region of the murine T-cell receptor a/a chain locus are shown in Fig. 2. The random subcloning and sequencing ap- proach that we employed for one cosmid clone, cTA25.1, and three large restriction fragments, designated KK10.5, SS9.6, and SN12.2, essentially followed the classical sonication methodology described by Deininger (1983) and Bankier and Barrel1 (1983). As previously demonstrated, a random approach provides rapid accu- mulation of DNA sequence data in the early stages of the project, but later is encumbered by increasing redun- dancy (Anderson, 1981; Deininger, 1983; Bankier and Barrell, 1983). At a point where approximately 95% of the double-stranded sequence of a fragment has been determined, it is convenient to switch to a directed strat-

Page 4: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

THE MURINE T-CELL RECEPTOR a/8 CHAIN LOCUS

0 IO 20 30 40 50 60 70 90 90 ; IOOkb I

. J&l C8 “$5 1

Ja gene segments 1 ’ ca 1 ha CSB-I

T' 2" I TA 52.1 I

TA 49.3 I I TA S.1

1

I TA 2S.l

I I TA4.1

BK65KK4.2 KKlO5 5 9.6 SNl22

1201

FIG. 2. Physical map of the 3’ terminal region of the murine Tcr a/8 chain locus. Solid bar with scale in kilobases indicates the 94,647.bp region sequenced. Tcr gene segment coding regions are indicated as solid boxes, untranslated exons are indicated by shaded boxes, and pseudogenes are indicated by open boxes. The regulatory elements are indicated by hatched boxes. Transcriptional orientation of all coding regions is 5’ to 3’ except as indicated for the TcrW5 gene segment. The inverted orientation of the two silencer elements is indicated, also. Cosmid clones mapped to the region are shown below the gene segments, with thick lines indicating the two cosmids that were utilized for DNA sequence analysis. In addition, the locations of five restriction fragments utilized for DNA sequence analysis are shown below the cosmid map.

egy to obtain the remainder of the sequence. Using the sonication method to randomly shear DNA fragments, our best results were obtained with whole circular cos- mid DNA or restriction fragments larger than 7 kb. In our experiments, we chose to subclone DNA fragments with a size range of 0.8 to 1.6 kb in Ml3mplO. Ligation of larger DNA fragments (i.e., 2-4 kb) into the Ml3 vector greatly reduced the number of subclones that could be produced in a sonication experiment, presumably be- cause larger inserts in Ml3 are less stable (Messing, 1983). Sonication experiments with intact cosmid clones proved very successful as a means of preparing random subclones. Starting with approximately 20 pg of cosmid DNA, we were able to produce several hundred Ml3 sub- clones per ligation reaction. Since the cosmid vector rep- resented only about 12% of the entire clone, it was not physically separated from the insert DNA prior to sub- cloning and sequence analysis. Similarly, the restriction fragment KKlO.5 was sheared as an intact plasmid clone along with the pUCl9 vector.

Initial DNA sequencing of random subclones utilized the Ml3 universal primer. Using this approach, we ex- pected to complete approximately 95% of the double- stranded sequence of each fragment with five- to sixfold average redundancy. The number of subclones required for this analysis was estimated using statistical profiles described by Bankier and Barrel1 (1983). Upon DNA se- quence analysis of the appropriate number of random subclones, contigs were assembled for each restriction fragment or clone, and oligonucleotide primers were syn- thesized to extend sequences from the ends of contigs. This primer-directed or “walking” strategy was used with existing Ml3 subclones and continued until gaps between contigs had been closed or the end of the sub-

clone insert was reached. In the latter case, oligonucleo- tide primers that flanked the remaining gap region were synthesized and used in a series of PCR amplification experiments. Through these experiments, contigs could be oriented and spaced relative to each other. Previously mapped Tcr joining region gene segments often were present in contigs and facilitated the pairing of some of the amplification primers. The DNA products of these PCR experiments were subcloned and sequenced using the universal and reverse as well as custom sequencing primers. As shown in Table 1, the restriction fragments KK10.5, SS9.6, and SN12.2 and the cosmid clone TA25.1 were sequenced using mainly a random strategy and required relatively few custom primers for gap clo- sure. For a second cosmid clone, cTA5.1, we again ini- tially pursued a classical random strategy, although sub- clones that contained sequences from the cosmid vector or overlapping, previously sequenced regions were re- moved by computer screening prior to contig assembly. After several sequence contigs had been assembled, we applied the PCR-directed method for closure of the re- maining gaps. Although the use of PCR as a method for contig joining did not greatly reduce the overall average sequencing redundancy for this particular cosmid clone, it proved an effective method for quickly closing the gap regions. The restriction fragments BK6.5 and KK4.2 produced unusual cloning artifacts upon shearing by the sonication method, and alternative procedures were em- ployed. These fragments were partially digested with re- striction enzymes, and resulting subfragments with a size range of 0.8 to 1.6 kb were cloned in Ml3mplO. Sub- clones produced by this method were sequenced using universal and custom primers. The partial digestions did not provide a random distribution of DNA fragments,

Page 5: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

1202 WILSON ET AL,

TABLE 1

Sequencing Strategies and Statistics

Clone or fragment DNA size (kb) Sequencing strategy Sequencing chemistry Useful subclones (for assembly) Total subclones sequenced Avg. bp per subclone (vector) Total bp read (for assembly) Final sequence redundancy Oligonucleotide primers required

BK6.5 KK4.2 6.5 4.2

Partial RE Partial RE rad rad

60 40 200 91 300 400

19,500 16,000 3 4

10 10

KK10.5 10.5 + 2.8~

Sonic rad

140 200 500

70,000

2

SS9.6 SN12.2 9.6 12.2

Sonic Sonic flr rad

147 153 170 179 432 434

55,749 68,437 5.8 5.6

2 8

cTA25.1 41 + 6v

Sonic rad

653 850 470

310,200 6.6

8

cTA5.1 25.8’

Sonic + PCR flr

384a 709 406

155,980” 6.0 20

Totals 118.6

-

1577 2399

441 695,866

5.9 60

’ Subclones from cosmid vector and regions contained on other clones were removed prior to sequence assembly (n = 162).

and the use of multiple 4-bp-specific restriction enzymes was necessary to obtain approximately 80% coverage for the two fragments. The remaining gaps in the sequence were filled as described above.

Automated DNA Sequencing Methods

Early in the project, DNA sequence analysis was per- formed using radioisotopic methods. Relatively high throughput was maintained using modified robotic sys- tems to perform DNA sequencing reactions and to load polyacrylamide sequencing gels. Standard small-scale procedures for purification of single-stranded recombi- nant Ml3 DNA were utilized and allowed manual prepa- ration of up to 100 subclones per man-day. Using the robotic systems, sequence analysis of this large number of subclones was possible, although the resulting data interpretation task severely limited the rate of data compilation. One day of actual DNA sequencing work producing four films (24 subclones, short and long elec- trophoresis runs) required nearly a second full day for manual analysis of the films using a sonic digitizer tablet.

To reduce the burden of manual data entry from auto- radiographs, we employed fluorescent dye-labeled primer sequencing chemistq (Smith et al., 1986) with automated real-time detection of reaction products. At the time we began using fluorescent sequencing meth- ods, the automated sequencing instrument had a daily capacity of 16 samples, With the later introduction of new programming and a more sophisticated computer system, 24 samples could be processed per run. Although the automated sequencer allowed direct real-time data entry, the fluorescent DNA sequencing chemistry carried additional restrictions that had to be overcome to maintain a throughput compatible with the instru- ments capabilities. Most notably, since four fiuorescent labels were employed, rather than a single radioisotopic label, sequencing reactions were more laborious to per- form and required more template DNA. To facilitate pro- cessing of fluorescent DNA sequencing reactions, the robotic system that was developed for high-throughput radioisotopic DNA sequencing was modified to perform the four-dye fiuorescent chemistry. As previously de- scribed (Wilson et al, 199Ob), this system is capable of

processing fluorescent DNA sequencing reactions for 48 DNA samples in three daily runs. To provide sufficient template DNA for fluorescent DNA sequencing reac- tions, the standard method for preparation of single- stranded DNA necessitated scaling up to larger culture volumes, requiring extensive use of high-speed centrifu- gation equipment. Although this method produces tem- plates of sufficient quantity and quality, it is labor-in- tensive on a large scale, and careful quantitation of each DNA sample is required to ensure successful sequencing results. As an alternative method, we employed the asymmetric PCR method to produce single-stranded template DNA directly from Ml3 plaques. In addition to providing a rapid means of generating sufficient tem- plate DNA for fluorescent DNA sequencing reactions, APCR resulted in more uniform DNA yields, thereby obviating the need to quantitate individual DNA sam- ples. Despite the advantages of APCR as discussed here, it should be stressed that this method can be difficult to optimize for a particular primer-template system and results often are unsatisfactory. Additionally, we experi- enced greatly reduced success rates with APCR template preparation when using different thermal cycling in- struments from the same manufacturer.

Most typically, the automated fluorescent sequencing instrument was run once daily, allowing processing of up to 120 Ml3 subclones per instrument-week. If APCR was used as the means of preparing template DNA, anal- ysis of this number of subclones was possible in a s-day period. When the large-scale procedure for single- stranded DNA preparation was used, at least one addi- tional day was required. As shown in Table 1 for the SS9.6 and cTA5.1 clones, the fluorescent gel reader pro- vided average reads of 434 and 406 bp per subclone, re- spectively. Longer reads were possible for subclones an- alyzed using radioisotopic methods, although film analy- sis tasks limited throughput to approximately 96 subclones per man-week.

Using the Ml3 cloning system, a significant fraction of the subclones analyzed were not useful, because they either lacked an insert or priming site or, utilizing stan- dard single stranded DNA preparative methods, could not be produced in sufficient amounts for DNA sequence analysis. Together with reactions that failed analysis for other miscellaneous causes, nonuseful subclones (n =

Page 6: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

THE MURINE T-CELL RECEPTOR a/6 CHAIN LOCUS 1203

660) represented 27.5% of the total 2399 analyzed. This failure rate was roughly equivalent for both radioisoto- pit and fluorescent DNA sequencing methods.

Depending on the subfragment library used for se- quence analysis, the length of read and the percentage of subclones that failed analysis varied. Nearly all subclone failures were due either to a low yield of single-stranded DNA or to a deletion of the priming site. As with any automated system, instrument failures and downtime must be considered when estimating throughput (for US approximately 15-20%). With the fluorescent gel reader employed in our experiments, occassional reanalysis of a subclone was required when the initial automated analy- sis was not performed properly. These types of problems did not significantly reduce sequencing throughput. Seri- ous hardware problems with the instrument did occur intermittently, causing a complete disruption of DNA sequence analysis. Many of these hardware problems are preceded by symptoms that may be recognized by an experienced operator; steps can be quickly taken to initi- ate repairs so that sequencing throughput is not com- promised. An important example of this is laser perfor- mance. A diagnostic test that will alert the operator to impending laser failure may be performed weekly. If the results of this diagnostic indicate that the laser has be- gun to draw more power than is usually necessary, a new laser can be ordered and repairs scheduled. Laser re- placement usually requires no more than a few hours and should not interrupt the daily routine. In a similar manner, other hardware problems unique to the fluores- cent gel reader may be prediagnosed and quickly solved so as to maintain sequencing throughput.

Determination of error rates for DNA sequence analy- sis is difficult and somewhat subjective. To gain some estimation of the accuracy of our sequencing methods and the final product, we have looked at error percent- ages in unedited raw sequence data as well as in the final assembled sequence. To estimate error rates in raw se- quences, we compared regions of random subclones con- taining Ml3mplO and pTL5 vector sequences to pub- lished final sequences. Since the pTL5 vector was sheared and subcloned along with insert DNA for the two cosmids analyzed, approximately 12% of the result- ing subclones contained sequences from pTL5. Using the fluorescent dye-primer sequencing chemistry with automated real-time detection, raw sequences that were read beyond 400 bp (av = 480 bp) contained an average of 3.2% error. Sequences that were read to less than 400 bp (av = 300 bp) contained an average of 2.8% error. Of this total error rate, approximately one-third were “flags” or ambiguities (“N”). These flags often properly alert one to potential problems that should be resolved during editing after viewing the complementary or other subclone sequences. At least one important observation should be made concerning the error rate analysis: in the group of sequences that were read longer than 400 bp, most errors occurred late in the sequence and often were present as extra bases in a run of two or more of the same nucleotide. The remainder of errors present in the raw

data usually were the result of secondary structure in the template DNA. The random sequencing strategy that we employed provided an effective method of correcting errors that occurred in the raw sequence data. Once raw sequences were assembled by the computer, manual editing was necessary to resolve differences and make corrections to the consensus sequence. As discussed be- low, a high degree of redundancy was provided by the random strategy, and most of the sequence was deter- mined on both DNA strands. Regions of the consensus sequence that were sequenced only once or on one strand were carefully proofread to minimize errors caused by secondary structure effects. Questionable areas were re- sequenced on the complementary strand. The final test of accuracy for the consensus sequence was comparison to coding region sequences that had previously been de- termined by cDNA or partial germline sequence analy- sis. In nearly all cases, discrepencies were due to strain polymorphisms rather than to errors in the nucleotide sequence. As to the accuracy of the final assembled se- quence, we would estimate an error frequency between 0.02 and 0.1%. This estimate is based on the 5.9-fold sequence coverage, the calculation of error rate for raw sequence data, and comparisons of the sequence to cDNA and previously analyzed germline sequences and takes into account that we are unable to distinguish se- quence errors from polymorphisms in noncoding re- gions. Sequence

The 94,647-bp nucleotide sequence of a region at the 3’ terminus of the murine T-cell receptor a/a chain locus has been determined (data not shown, GenBank acces- sion No. M64239). The sequence begins with a BumHI site at the 5’ end of restriction fragment BK6.5. This BamHI site is 2454 bp 5’ to the nonamer recombination recognition sequence of the TcraJl (formerly Jh2) gene segment. The sequence terminates with an NruI site lo- cated 4866 bp 3’ of the putative polyadenylation signal for the Tcra gene. To complete the nucleotide sequence of this region, two cosmid clones and five restriction fragments comprising a total of 118.5 kb were analyzed. As listed in Table 1, each base was sequenced an average of 5,9 times. The positional sequencing redundancy for the entire 94,647-bp region is charted in Fig. 3. A total of 84% of the nucleotide sequence was determined on both strands, and 98.3% of the bases were sequenced more than one time. It should be noted that no regions of the two cosmid clones and five restriction fragments in- cluded in this study proved unclonable in the Ml3 and pUC vectors used for sequence analysis. Regions that were underrepresented in the random libraries were eas- ily cloned following PCR, and there is every reason to expect that subclones spanning these regions would eventually have been found with additional random se- quencing. Sequence Features

Coding regions. The 3’ terminal region of the T-cell receptor a/a chain locus is replete with gene segment

Page 7: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

1204 WILSON ET AL.

FIG. 3. DNA sequencing redundancy. The number of independent sequences covering each base in the final sequence is plotted for both DNA strands. Sequences reading in the forward orientation are plotted above the origin, and reverse reads are plotted below the origin. The scale in kilobases is shown below the plot,

coding sequences. All of the TcraJ gene segments are contained in this region; 35 unique TcraJ gene segments had been previously characterized by cDNA sequence analysis (Arden et cd., 1985; Becker et ~1.; Bill et czl.; 1989; Chien et al., 1984; Chou et al., 1986; Dembic et al., 1985; Hayday et al., 1985; Hochgeschwender et al., 1987; Mar- olleau et al., 1988; Saito et al., 1984; Spinella et al., Take- shita et al., 1989; Winoto et al., 1985; Yoshikai et al., 1985; N. Costlow, unpublished data; D. Kono, personal communication), 15 additional TcraJ gene segment se- quences are reported here. The single Tcr6J gene seg- ment present in the region analyzed in this study was identified previously in a Tcr6 cDNA (Chien et ul., 1987b). In the accompanying paper (Koop et al., 1992), we propose a new system of nomenclature for TcraJ and TcraJ gene segments in which the gene segments are numbered in a 3’ to 5’ direction, beginning with the first gene segment immediately 5 to its respective Tcr& gene.

The majority of TcraJ gene segments contain con- served amino acid residues and are flanked on the left by recombination recognition sequences and on the right by the 5’ consensus sequence necessary for splicing to the first exon of the constant region gene. The consensus nonamer and heptamer recombination sequences are re- quired for rearrangement to a TcraV (or TcrW gene segment during T-cell development. The consensus nonamer sequence is A/G-G-T-T-T-T-T-G-T. The nonamer sequence is followed in germline DNA by a 12- bp spacer and the consensus heptamer sequence C/G-A/ G-C-T-G-T-G. The coding sequence for the joining re- gion immediately follows the heptamer sequence and ends at a typical eucaryote exon-intron junction with the consensus sequence G-G-T-A/G-A-G. Identification of previously uncharacterized TcraJ gene segments was facilitated by comparison of new sequences to a database containing all of the known Tcra!J gene segments. This method of analysis resulted in the identification of 15 additional potential joining region coding sequences.

As discussed in detail in the accompanying paper (Koop et al., 1992), at least 10 TcraJ pseudogenes are

contained in the analyzed region of the murine ti/6 chain locus. Comparison of the mouse Tcr sequence to human Tcr germline sequences has revealed that one pseudo- TcraJ gene segment, the 3’-most TcraJ gene segment in the human CX/~ chain locus (Yoshikai et al., 1985), is unrecognizable in the murine germline, although signifi- cant similarity is observed in the sequences flanking this 3’-most TcraJ gene segment (Koop et al., 1992). Since sequence comparison with the homologous region in hu- man germline DNA reveals strong similarity, the site of the partially deleted TcraJ gene segment has been la- beled TcraJl.

All of the TcrDJ gene segments in the murine genome are contained within a 62,831-bp region. The 5’-most TcraJ gene segment, TcraJ50, is located 8065 bp distal to exon 1 of the inverted TcrcrV5 gene segment. The 3’-most functional TcraJ gene segment, TcraJ2, is lo- cated 2617 bp upstream from exon 1 of the Tcr& gene. The average spacing between TcraJ gene segments is 1347 bp. The largest intragene segment interval is 2723 bp, between TcrcrJ46 and TcraJ47, and the smallest in- terval is 440 bp, between TcraJl8 and TcraJl9. Intra- genie spacing between Tcra!J gene segments appears to be random across the locus, and clustering of the TcraJ gene segments is not observed. The TcraJl gene segment is located 6029 bp upstream from exon 1 of the TcnX gene. Previous reports indicate that the intragenic spac- ing between Tcr&Jl and Tcr6J2 is approximately 12,000 bp (Takashita et al., 1989). The Tcr6Jl gene segment and all of the TcraJ gene segments are arranged in the same transcriptional orientation.

Constant region genes for the LY and 6 chains were identified by homology to cDNA and partial germline sequences (Hayday et al., 1985; Chien et al., 1987a). Al- though both constant region genes are homologous with respect to organization and exon size, their coding re- gions share little amino acid sequence similarity (22%). The majority of Tcr& and TcrE coding sequences are contained in a large first exon, and large fourth exons contain only 3’ untranslated sequences. A consensus polyadenylation signal (AATAAA) is present at the end

Page 8: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

THE MURINE T-CELL RECEPTOR a/b CHAIN LOCUS 1205

of the fourth exon of the TcrX gene. A consensus poly- adenylation signal for the Q chain gene was not observed, although polyadenylation most likely is initiated within the sequence AAAAACCAAA located 510 bp down- stream from the start of TcraC exon 4. The 3’ terminus of the LX chain gene has been identified by cDNA se- quence analysis as AAAAACC(A),, (Hayday et al., 1985).

A single variable gene segment, Tcr&V5, is located 2506 bp distal to the ?I chain polyadenylation signal. As previously reported (Iwashima et al., 1988), Tcr6V5 is set in opposite translational orientation relative to the TcraJl and TcrZ sequences and must rearrange by in- version. Comparison of the 94,647-bp region presented here with a database containing mouse, human, and rab- bit TcraV, Tcr&’ and TcrpV gene segments indicated that no additional identifiable variable gene segments are contained in this region in either transcriptional ori- entation.

Regulatory elements. T-cell receptor molecules, like immunoglobulins and MHC polypeptides, are encoded by discontinuous gene segments organized as discrete loci in the genomes of higher vertebrates. During the course of cellular development, these gene segments are rearranged and joined to produce the final coding se- quence. Since all of the gene segments that rearrange to form a Tcr gene are well characterized with respect to conservation of functionally important sequence char- acteristics, it is fairly straightforward to identify gene segment coding sequences that have not been identified previously. Interspersed between the gene segment cod- ing regions of the Tcr loci, however, are sequences that are necessary for the expression of Tcr genes. These reg- ulatory sequences apparently may or may not share nu- cleotide sequence homology with similar elements from other loci types or across species. At least three noncod- ing sequences representing Tcr gene regulatory function are present in the 94,647-bp region presented here. Re- gions containing these three sequences, the cx enhancer and the a/3 silencers, sil I and sil II, were identified by functional assays (Winoto and Baltimore, 1989a,b). The exact sequences responsible for regulatory action of the enhancer and silencer elements have been determined by comparing the homologous regions of the mouse and human a chain loci (Koop et al., in preparation). In a similar manner, a fourth noncoding region in the murine e/J chain locus, designated CSB-1, is of unknown func- tion and was identified by sequence comparison to hu- man Tcra germline sequences (Koop et al., 1992).

CONCLUSIONS AND SUMMARY

We have used manual and automated methods to complete the nucleotide sequence of 94,647 contiguous bp at the 3’ terminal region of the murine Tcr a/J chain locus. This is the largest contiguous region of metazoan DNA sequenced to date. Since it is likely that in the immediate future, large-scale DNA sequencing projects will begin from individual cosmid clone inserts of 35-45

kb, we wished to define an efficient and relatively inex- pensive strategy for analyzing large regions of DNA cloned in cosmid vectors. In our experiments, subclones with an average insert size of 1 kb were prepared in Ml3 vectors, following direct sonication of whole cosmid DNA. Subclones that contained the cosmid vector were not identified and excluded from sequence analysis, but were identified by sequence homology screening prior to the assembly process and removed. Others have de- scribed the identification and exclusion of Ml3 sub- clones containing vector sequences by plaque hybridiza- tion (Edwards et al., 1990; F. Blattner, personal commu- nication). While this is a necessary task for subclones prepared from bacteriophage A clones, we chose not to direct any efforts into prescreening but rather to process the relatively small number (approximately 12%) of subclones containing regions of the cosmid vector. This approach allowed us to eliminate tedious plaque hybrid- ization steps with only a slight increase in the number of random subclones necessary to complete the initial 90 to 95% of the double-stranded sequence of the cosmid in- sert. Additionally, it allowed for error analysis by com- parison of raw sequence data with known vector se- quences. DNA sequence analysis of subclones prepared by sonication of entire cosmid clones indicated that ran- dom fragmentation was observed as was the case for 6 to 12-kb restriction fragments, and the resulting subfrag- ment library contained the subclones necessary to com- plete the nucleotide sequence of 90 to 95% of the original clone. Regions of the cosmid insert that were missing after random sequence analysis generally were found to be less than a few kilobases and could be cloned in Ml3 or pUCll8 following amplification by PCR. As shown in Table 1, the overall redundancy for the complete se- quence was 5.9. While this degree of oversequencing provided a higher confidence of accuracy, some regions of the sequence were analyzed up to 15 times (Fig. 3). An ideal sequencing strategy would provide three- to four- fold overall redundancy, thereby maintaining a high confidence in the accuracy of the final sequence while reducing the number of sequences required to complete 90 to 95% of the cosmid insert. To this end, one of us (R.W.) believes a high-throughput random sequencing strategy with a limited number of subclones (e.g., 400- 600 per cosmid) is best initially, followed by a switch to a more directed strategy before redundancy becomes too high. Switching to an effective primer-directed sequenc- ing strategy following completion of 80 to 90% of the double-stranded sequence will provide for more even cov- erage of the cosmid insert and more efficient closure. Indeed, the implementation of such a limited random- plus-walking strategy has allowed rapid completion of several cosmid clones in the recently begun Caenorhab- ditis elegans genome sequencing project (Sulston et al,, 1992). Alternatively, because of the high throughput of automated fluorescent sequencing, the high redundancy of the shotgun strategy is attractive, and less time-in- tensive directed sequencing is required. Either approach could be facilitated by preparing random subclones with

Page 9: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

1206 WILSON ET AL.

larger inserts (2-6 kb) and by sequencing all subclones with the forward primer and selected subclones with a reverse primer to establish linkage between “minicon- tigs.”

It is clear from our observations that automated entry of DNA sequence data is a critical step for a large-scale sequencing project. While we were able to process reac- tions and gels for a large number of DNA samples using semiautomated radioisotopic sequencing methods early in the project, accurate and timely entry of the raw se- quence data into databases was unquestionably the limit- ing step for higher throughput. As mentioned previ- ously, high-throughput radioisotopic DNA sequencing methods required nearly a full day of data interpretation for every day spent performing the actual sequencing work. The implementation of fluorescent DNA sequenc- ing chemistry and the automated gel reader provided a dependable and reasonably accurate solution to this problem without significantly compromising through- put. Since personnel costs are undoubtedly the greatest expense in large-scale DNA sequencing projects utiliz- ing the current technology, the most efficient strategy will make the best use of available labor. As the radioiso- topic sequencing progressed, the manual process of en- tering DNA sequence data from several autoradiographs per day into a computer quickly became a cheerless task, and persons performing the sequencing work often were unmotivated to keep throughput at a constant level. By employing the automated fluorescent gel reader system and thereby eliminating the manual data interpretation task, we were able to achieve a much more uniform on- going rate of data accumulation throughout the random sequencing phases of the project. Additionally, as we have described above, the average accuracy of sequences produced by the automated DNA sequencer was approxi- mately 97%, with most errors occurring late in the read. We estimate that manual interpretation of autoradio- graphs performed earlier in the sequencing project had a similar average accuracy figure. While an error rate of 3% certainly would be unacceptable for a sequencing project with no more than single or double coverage, it poses no problem for a random strategy producing five- to sixfold redundancy.

In addition to the data entry task, the process of as- sembling raw DNA sequence data into a contiguous re- gion and resolving discrepancies between individual se- quences is very time-consuming and tedious. In the course of this sequencing project, we investigated several computer programs for the assembly of random se- quences. Two programs, the VAX-VMS version of SAP written by Rodger Staden and SeqMan from DNAS- TAR, Inc., were most useful in the assembly of 40-kb projects. Although the actual sequence assembly process was much slower when using the SeqMan program, it was preferred because of the window-based editing facili- ties. Both programs required considerable interaction with the computer during assembly and editing. After raw DNA sequences were merged by the assembly pro- gram, discrepancies between individual reads were

corrected by reviewing the original autoradiographs or chromatograms. If a discrepancy could not be solved with the available data, additional reads were required. Since the redundancy of the final sequence was fairly high, additional reads usually were required only when filling the last few sequence gaps. The majority of the editing work involved searching for the appropriate film or trace and finding a specific sequence within. Since primary data from more than one source must be com- pared at any particular base in the sequence, and often on the complementary strand, this searching task is in- creased accordingly. In future large-scale DNA sequenc- ing projects, the speed and ease of contig editing can be greatly facilitated by programs that allow direct on- screen viewing of primary fluorescent sequence data within the contig editor facility. We are currently inves- tigating the use of three new assembly programs that make use of on-screen chromatogram editing; the re- cently completed UNIX/Xl1 version of Staden’s SAP, designated XDAP (Dear and Staden, 1991), DNAS- TAR’s Macintosh version of Seqman, and ABI’s Inherit.

Our experience with this initial excursion into large- scale sequencing leads us to several conclusions for fu- ture megabase sequencing efforts. (i) Large-scale se- quencing is a multistep process that must be automated as rapidly as possible (Fig. 1). In this regard, major bot- tlenecks must be identified and eliminated in a system- atic and orderly manner. Currently, preparing DNA subclones for sequencing and the DNA fragment assem- bly and editing process represent two of the most signifi- cant bottlenecks. (ii) A production line sequencing core effort must be established and isolated from all responsi- bilities other than to carry out continuous and high- throughput sequencing. Devices for the automation of both the DNA sequencing are critical in this regard (see Hunkapiller et ul., 199la,b). (iii) Finally, different groups must be assigned discrete tasks that relate to large-scale DNA sequencing, tasks that include quality control, testing of improved protocols, trouble shooting of chemistries and instruments, and the associated in- formatics (assembly, data analysis, etc.). Our view is that incremental improvements in the current ap- proaches to automated fluorescent DNA sequencing will lead to a lo-fold or more improvement in throughput over the next 5 years (Hunkapillar et ul., 199la,b). Cer- tainly this advance will necessitate the application of robotic and automation techniques to the many steps of the large-scale DNA sequencing process.

Complete DNA sequence analysis of the 3’ terminal region of the murine Tcra/J chain locus has provided a great deal of information concerning the organization and structure of T-cell receptor genes and additionally gives new insight into some of the mechanisms by which these genes rearrange and are expressed in lymphoid cells. A total of 15 novel TcraJ gene segments were dis- covered, bringing the total number of TcraJ gene seg- ments in the mouse germline to 50. All of the TcraJ gene segments appear to be randomly dispersed and are lo- cated in a region bounded proximally by the TcrE gene and distally by the Tcr& gene. All of the TcraJ gene

Page 10: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

THE MURINE T-CELL RECEPTOR a/a CHAIN LOCUS 1207

segments are oriented in the same transcriptional direc- tion. By germline sequence analysis, the large majority of Tcra!J gene segments appear to be functional coding regions. Interestingly, the first and last TcraJ gene seg- ments in the region are pseudogenes, and the 3’-most pseudogene (TcraJl) has lost nearly all of its coding se- quences. Aside from the previously identified inverted TcrW5 gene segment located between the TcrX gene and TcraJ50 (Iwashima et al., 1988), no additional open reading frames of obvious significance were discovered. Noncoding sequences that may play a role in the regula- tion of Tcr gene expression may be identified by conser- vation over evolution. At least one striking example of cross-species homology in noncoding regions of the mouse Tcr sequence was observed following preliminary sequence comparison. Possible functions of this region, a conserved sequence block of 200 bp located between TcraJ3 and TcraJ4, are discussed in greater detail in the accompanying paper (Koop et a!., 1992).

Large-scale DNA sequence analysis of specific regions of eucaryote genomes eventually will change the general approach of biological research. The ongoing construc- tion of genetic databases containing map and sequence information will allow opportunities for studying gene structure, expression and regulation, polymorphism, evolution and function, chromosomal structure and rear- rangement, functions of repetitive and other novel non- coding sequences, and disease correlations. Further- more, large-scale genome sequencing will provide the in- formation required to use molecular tools such as restriction enzymes, antibodies, and specific DNA am- plification assays. As discussed in the accompanying paper, it is clear that obtaining the nucleotide sequence of the 3’ terminal region of the murine Tcra/rY chain locus has allowed the design of experiments that will lead to a better understanding of Tcr gene structure and expression. Additional DNA sequence analysis of the murine a/6 chain locus, the fl and 7 chain loci, and the homologous regions in the human and other vertebrate genomes will further increase our knowledge of the mechanisms of Tcr genes and their role in the normal immune response and in diseases of the immune system.

ACKNOWLEDGMENTS

The authors thank the following individuals: B. Jones, R. DeBruyn, K. Gregorich, J. Walker, A. Wilson, G. Campbell, J. Lindelien, S. Lane, and L. Hillier for technical assistance, Dr. N. Avdalovic and Mr. T. McGurk of Beckman Instruments for the gift of two robotic work- stations, Dr. F. Blattner for computer software, Dr. E. Lai, Dr. D. Zaller, Mr. T. Hunkapiller, and Dr. E. Chen for helpful discussions, and Dr. R. Waterston for critical reading of this manuscript. This work was supported by a grant from NIH-NCHGR. B.F.K. was sup- ported by the Alfred Sloan Foundation.

REFERENCES

Anderson, S. (1981). Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Bes. 9: 3015-3027.

Arden, B., Klotz, J. L., Siu, G., and Hood, L. (1985). Diversity and structure of genes of the cx family in mouse T-cell antigen receptor. Nature (London) 316: 783-787.

Baer, R., Bankier, A. T., Biggin, M. D., Deininger, P. L., Farrell, P. J., Gibson, T. J., Hatfull, G., Hudson, G. S., Satchwell, S. C., Seguin,

C., Tuffnell, P. S., and Barrell, B. G. (1984). DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature CLorz-

don) 310:207-211. Baer, R., Lefranc, M.-P., Minowada, J., Forster, A., Stinson, M. A.,

and Rabbitts, T. H. (1986). Organization of the T-cell receptor a chain gene and rearrangement in human T-cell leukaemias. Mol.

Biol, Med. 3: 2655277. Bankier, A. T., and Barrell, B. G. (1983). Shotgun DNA sequencing.

Technol. Nucleic Acid Biochemistry B5: l-34. Becker, D. M., Patten, P., Chien, Y.-H., Yokota, T., Eshhar, Z., Gied-

lin, M., Gascoigne, N. R. J., Goodnow, C. Wolf, R., Arai, K., and Davis, M. M. (1985). Variability and repertoire size of T-cell recep- tor Va gene segments. Nuture (London) 3 17: 430-434.

Bill, J., Yague, J., Appel, V. B., White, J., Horn, G., Eriich, H. A., and Palmer, E. (1989). Molecular genetic analysis of 178 I-Abml’-reac- tive T cells. J. Exp. Med. 169: 115-133.

Birnboim, H. C., and Doly, J. (1979). A rapid alkaline extraction pro- cedure for screening recombinant plasmid DNA. Nucleic Acids Res. 7: 1513-1523.

Chee, M. S., Bankier, A. T., Beck, S., Bohni, C., Brown, C. M., Cerny, R., Horsnell, T., Hutchison, C. A., Kouzarides, T., Martignetti, J. A., Preddie, E., Satchwell, S. C., Tomlinson, P., Weston, K. M., and Barrell, B. (1990). Analysis of the protein-coding content of the sequence of human cytomegalovirus strain ADl69. Curr. Top. Mi- crobiol. lmmunol. 154: 1255169.

Chen, E. Y., Liao, Y. C., Smith, D. H., Barrera-Saldana, H. A., Ge- linas, R. E., and Seeburg, P. H. (1989). The human growth hormone locus: Nucleotide sequence, biology, and evolution. Genomics 4: 479-497.

Chien, Y.-H., Becker, D. M., Lindsten, T., Okamura, M., Cohen, D. I., and Davis, M. M. (1984). A third type of murine T-cell receptor gene. Nuture (London) 312: 31-35.

Chien, Y.-H., Iwashima, M., Kaplan, K. B., Elliot, J. F., and Davis, M. M. (1987a). A new T-cell receptor gene located within the alpha locus and expressed early in T-cell differentiation. Nature (London) 327: 677-682

Chien, Y.-H., Iwashima, M., Wettstein, D. A., Kaplan, K. B., Elliot, J. F., Born, W., and Davis, M. M. (1987b). T-cell receptor 6 gene rear- rangements in early thymocytes. Nature (London) 330: 722-727.

Chou, H. S., Behlke, M. A., Godambe, S. A., Russell, J. H., Brooks, C. G., and Loh, D. Y. (1986). T-cell receptor genes in an alloreactive CTL clone: Implications for rearrangement and germline diversity of variable gene segments. EMBG J. 5: 2149-2155.

Craxton, M. (1991). Linear amplification sequencing, a powerful method for sequencing DNA. Methods: Companion Methods Enzy- mol. 3: 20-26.

Cummings, D. J., McNally, K. L., Domenico, J. M., and Matsuura, E. T. (1990). The complete DNA sequence of the mitochondrial ge- nome of Podospora anserina. Curr. Gene. 17: 375-402.

Davison, A. J., and Scott, J. E. (1986). The complete DNA sequence of Varicella-Zoster virus. J. Gen. Vn-ol. 67: 1759-1816.

Dear, S., and Staden, R. (1991). A sequence assembly and editing pro- gram for efficient management of large projects. Nucleic Acids Res. 19(14):3907-3911.

Deininger, P. (1983). Random subcloning of sonicated DNA Applica- tion to shotgun DNA sequence analysis. Anul. Biochem. 129: 216- 223.

Dembic, Z., Haas, W., Weiss, S., McCubrey, J., Kiefer, H., von- Boehmer, H., and Steinmetz, M. (1985). Transfer of specificity by murine a and 0 T-cell receptor genes. Nature (London) 320: 232- 238.

Edwards, A., Voss, H., Rice, P., Civitello, A., Stegemann, J., Schwager, C., Zimmermann, J., Erfle, H., Caskey, C. T., and Ansorge, W. (1990). Automated DNA sequencing of the human HPRT locus. Genomics 6: 593-608.

Goebel, S. J., Johnson, G. P., Perkus, M. E., Davis, S. W., Winslow, J. P., and Paoletti, E. (1990). The complete DNA sequence of Vac- cinia virus. Virology 179: 247-266.

Page 11: Nucleotide sequence analysis of 95 kb near the 3′ end of the murine T-cell receptor αδ chain locus: Strategy and methodology

WILSON ET AL.

Griesser, H., Champagne, E., Tkachuk, D., Takihara, Y., Lalande, M., Ballie, E., Minden, M., and Mak, T. (1988). The human T-cell re- ceptor a-6 locus: A physical map of the variable, joining and con- stant region genes. Eur. J. Zmmunol. 18: 641-644.

Gyllensten, U. B., and Erlich, H. A. (1989). Generation of single- stranded DNA by the polymerase chain reaction and its application to direct sequencing of the HLA-DQa locus. Proc. Natl. Acad. Ski. USA 85: 7652-7656.

Hayday, A. C., Diamond, D. J., Tanigawa, G., Heilig, J. S., Folsom, V,, Saito, H., and Tonegawa (1985). Unusual organization and diver- sity of T-cell receptor e chain genes. Nature (London] 316: 828- 832.

Hochgeschwender, U., Simon, H.-G., Weltzien, H. U., Bartels, F., Becker, A., and Epplen, J. T. (1987). Dominance of one T-cell re- ceptor in the H-2Kb/TNP response. Nature (London) 326: 307- 309.

Hunkapiller, T., Kaiser, R. J., Koop, B. F., and Hood, L. (199la). Large-scale DNA sequencing. Curr. Opin. Biotech. 2: 92-101.

Hunkapiller, T., Kaiser, R. .J., Koop, B. F., and Hood, L. (199lb). Large-scale and automated DNA sequence analysis. Science 254: 59-67.

Iwashima, M., Green, A., Davis, M. M., and Chien, Y.-H. (1988). Vari- able region (Va) gene segment most frequently utilized in adult thy- mocytes is 3’ of the constant (C?J) region. Proc. Natl. Acad. Sci. USA 85: 8161-8165.

Koop, B. F., Wilson, R. K., Chen, C., Halloran, N., Sciammis, R., and Hood, L. (1990). Sequencing reactions in microtiter plates. Bio- Techniques 9: 32-37.

Koop, B. F., Wilson, R. K., Wang, K., Vernooij, B., Zaller, D., Kuo, C. L., Seto, D., Toda, M., and Hood, L. (1992). Organization, struc- ture, and function of 95 kb of DNA spanning the murine T-cell receptor Ca,/Ci? region. Genornics 13: 120991230.

Kranz, D. M., Saito, H., Heller, M., Takagaki, Y., Haas, W., Eisen, H. N., and Tonegawa, S. (1985). Limited diversity of the rearranged T cell y gene. Nature (London) 313: 752-755.

Lai, E., Barth, R., and Hood, L. (1987). Genomic organization of the mouse T-cell receptor p-gene family. Proc. Natl. Acad. Sci. USA 84: 3846-3850.

Lai, E., Wilson, R. K., and Hood, L. (1989). Physical maps of immuno- globulin-like loci. Adu. Zmmunology 46: l-59.

Marolleau, J.-P., Fondell, J. D., Malissen, M., Trucy, J., Barbier, E.. Marcu, K. B., Cazenave, P.-A., and Primi, D. (1988). The joining of germline Va to Ja genes replaces the pre-existing Va-Ja complexes in a T-cell receptor a,@ positive T cell line. Cell 55: 291-300.

McGeoch, D. J., Dalrymple, M. A., Davison, A. J., Dolan, A., Frame, M. C., McNab, D., Perry, L. J., Scott, ,J. E., and Taylor, P. (1988). The complete DNA sequence of the long unique region in the ge- nome of Herpes simplex virus type 1. J. Gen. Viral. 69: 1531-1574.

Messing, .J. (1983). New Ml3 vectors for cloning. Meth. En~ym01. 101:20-78.

Ohyama, K., Fukuzawa, H., Kohchi, T., Sano, T., Sane, S., Shirai, H., Umesono, K., Shiki, Y., Takeuchi, M., Chang, Z., Aota, S., Inoku- chi, H., and Ozeki, H. (1988). Structure and organization of iWar- &a&a pol.vmorpha chloroplast genome. J. &fol. Biol. 203: 281-298.

Roth, M. S., Collins, F. S., and Ginsberg, D. (1988). Sizing of the human T cell receptor cr locus and detection of a large deletion in the Molt-4 cell line. Blood 71: 1744-1747.

Saiki, R. K., Scharf, S., Faloona, F., Mullis, K. B., Horn, G. T., Erlich, H. A., and Arnheim, N. (1985). Enzymatic amplification of 0.globin genomic sequences and restriction site analysis for diagnosis of sickle cell anemia. Science 230: 1350-1354.

Saito, H., Kranz, D., Takagaki, Y., Hayday, A., Eisen, H., and Tone- gawa, S. (1984). A third rearranged and expressed gene in a clone of cytotoxic T lymphocyte. Nature (London) 312: 36-39.

Sanger, F., Nicklen, S., and Coulson, A. R. (1977). DNA sequencing

with chain terminating inhibitors. Proc. Natl, Acad. Sci, USA 74: 5463-5467.

Sanger, F., Coulson, A., Barrell, B., Smith, A., and Roe, B. (1980). Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing. J. Mol. Biol. 143: 161-178.

Shehee, W. R., Loeb, D. D., Adey, N. B., Burton, F. H., Casavant. N. C., Cole, P., Davies, C. .J., McGraw, R. A., Schichman, S. A., Severynse, D. M., Voliva, C. F., Weyter, F. W., Wisely, G. B., Edgell, M. H., and Hutchison, C. A. (1989). Nucleotide sequence of the BALB/c mouse /%globin complex. J. Mol. Biok 205: 41-62.

Shinozaki, K., Ohme, M., Tanaka, M., Wakasugi, T., Hayashida, N., Matsubayashi, T., Zaita, N., Chunwongse, J., Obokata, J., Yamagu chi-Shinozaki, K., Ohto, C., Torazawa, K., Meng, B. Y., Sugita, M., Deno, H., Kamogashira, T., Yamada, K., Kusuda, J., Takaiwa, F., Kate, A., Tohdoh, N., Shimada, H., and Sugiura, M. (1986). The complete nucleotide sequence of the tobacco chloroplast genome: Its gene organization and expression. EMBO J. 5: 2043-2049.

Smith, L. M., Sanders, J. Z., Kaiser, R. J., Hughes, P., Dodd, C., Con- nell, C. R., Heiner, C., Kent, S. B. H., and Hood, L. E. (1986). Fluo- rescence detection in automated DNA sequence analysis. Nature (London) 321:674-679.

Spinella, D. G., Hansen, T. H., Walsh, W. D., Behlke, M. A., Tilling- hast, T. P., Chou, H. S., Whiteley, P. J., Kapp, J. A., Pierce, C. W., Shevach, E. W.- and Loh, D. Y. (1987). Receptor diversity of insu- lin-specific T cell lines from C57BL (H-2? mice. J. Zmmunol. 138: 3991-3995.

Sulston, <J., Du, Z., Thomas, K., Wilson, R., Hillier, L., Staden, R., Halloran, N., Green, P., Thierry-Mieg, J., Qiu, L., Dear, S., Coulson, A., Craxton, M., Durbin, R., Berks, M., Metzstein, M., Hawkins, T., Ainscough, R., and Waterston, R. (1992). The C. elegans genome sequencing project: A beginning. Nature (London) 356, 37-41.

Tabor, S., and Richardson, C, C. (1987). DNA sequence analysis with a modified bacteriophage T7 DNA polymerase. Proc. Natl. Acad. Ski. USA 84: 4767-4??1.

Tabor, S., and Richardson, C. C. (1989). Effect of manganese ions on the incorporation of dideoxynucleotides by bacteriophage T7 DNA polymerase and Escherichia coli DNA polymerase I. Proc. Natl. Acad. Sci. USA 86: 4076-4080.

Takeshita, S., Toda, M., and Yamagishi, H. (1989). Excision products of the T-cell receptor gene support a progressive rearrangement model of the e/J locus. EMBO J. 8: 3261-3270.

Wilson, R. K., Lai, E., Concannon, P., Barth, R. K., and Hood, L. (1988a). Structure, organization and polymorphism of murine and human T-cell receptor m and $ chain gene families. Zmmunol. Reu. 101: 1499172.

Wilson, R. K., Yuen, A. S., Clark, S. M., Spence, C., Arakelian, P., and Hood, L. (1988b). Automation of dideoxynucleotide DNA sequenc- ing reactions using a robotic workstation. BioTechrziques 6: 776- 787.

Wilson, R. K., Chen, C., and Hood, L. (199Oa). Optimization of asym- metric polymerase chain reaction for fluorescent DNA sequencing. BioTechniques 8: 1844189.

Wilson, R. K., Chen, C., Avdalovic, N., Burns, J., and Hood, L. (199Ob). Development of an automated procedure for fluorescent DNA sequencing. Genomks 6: 626-634.

Winoto, A., Mjolsness, S., and Hood, L. (1985). Genomic organization of the genes encoding mouse T-cell receptor 0 chain. Nature (Len- don) 316:832-836.

Winoto, A., and Baltimore, D. (1989a). A novel, inducible and T cell- specific enhancer located at the 3’ end of the T cell receptor <y locus. EMBO J. 8: 729-733.

Winoto, A., and Baltimore, D. (1989b). afi lineage-specific expression of the 0 T cell receptor gene by nearby silencers. Cell 59: 649-655.

Yoshikai, Y., Clark, S. P., Taylor, S., Sohn, U., Wilson, B. I., Minden, M. D., and Mak, T. W. (1985). Organization and sequences of the variable, joining and constant region genes of the human T-cell receptor a-chain. Nature (London) 316: 837-840.