This file describes how we made the browser database on the
Dec 12, 2000 freeze.

CREATING DATABASE AND STORING mRNA/EST SEQUENCE AND AUXILIARRY INFO 

o - ln -s /projects/hg2/gs.6/oo.27 ~/oo
o - Create the database.
     - ssh cc94
     - Enter mysql via:
           mysql -u hguser -phguserstuff
     - At mysql prompt type:
	create database hg6;
	quit
     - make a semi-permanent alias:
        alias hg6 mysql -u hguser -phguserstuff -A hg6
o - Make sure there is at least 5 gig free on cc94:/usr/local/mysql 
o - Store the mRNA (non-alignment) info in database.
     - hgLoadRna new hg6
     - hgLoadRna add hg6 /projects/hg2/mrna.122/mrna.fa /projects/hg2/mrna.122/mrna.ra
     - hgLoadRna add hg6 /projects/hg2/mrna.122/est.fa /projects/hg2/mrna.122/est.ra
    The last line will take quite some time to complete.  It will count up to
    about 3,200,000 before it is done.

REPEAT MASKING 

o - Lift up the repeat masker coordinates as so:
      - ssh kks00
      - cd ~/gs/oo.27
      - edit jkStuff/liftOut.sh and update ooGreedy version number
      - source jkStuff/liftOut.sh

o - Copy over RepeatMasking for chromosomes 21 and 22:
      - cp /projects/cc/hg/gs.4/oo.18/21/chr21.fa.out ~/gs/oo.27/21

o - Load RepeatMasker output into database:
      - cd /projects/hg2/gs.6/oo.27
      - hgLoadOut hg6 ?/*.fa.out ??/*.fa.out
        (Ignore the "Strange perc. field warnings.  Maybe mention them
	 to Arian someday.)

This will keep things going until the sensitive RepeatMasker run comes in
from Victor or elsewhere.  When it does:

o - Unzip and process Victor's stuff with doMerge.sh and victor_merge_rm_out1.pl.

o - Copy in Victor's repeat masking runs over the contig runs in ~/oo/*/ctg*

o - Lift these up as so:
      - ssh kks00
      - cd ~/gs/oo.27
      - source jkStuff/liftOut2.sh

o - Load RepeatMasker output into database again:
      - ssh cc94
      - cd /projects/hg2/gs.6/oo.27
      - hgLoadOut hg6 ?/*.fa.out ??/*.fa.out
        (Ignore the "Strange perc. field warnings.  Maybe mention them
	 to Arian someday.)


STORING O+O SEQUENCE AND ASSEMBLY INFORMATION

o - Create packed chromosome sequence files and put in database
     - hg6 < ~/src/hg/lib/chromInfo.sql
     - cd /projects/hg2/gs.6/oo.27
     - hgNibSeq hg6 /projects/hg2/gs.6/oo.27/nib ?/chr*.fa ??/chr*.fa

o - Store o+o info in database.
     - cd /projects/hg2/gs.6/oo.27
     - jkStuff/liftAgp.sh
     - jkStuff/liftGl.sh ooGreedy.99.gl
     - hgGoldGapGl hg6 /projects/hg2/gs.6 oo.27 
     - cd /projects/hg2/gs.6
     - hgClonePos hg6 oo.27 ffa/sequence.inf /projects/hg2/gs.6
       (Ignore warnings about missing clones - these are in chromosomes 21 and 22)
       (Had to patch sequence.inf and fin/fa from gs.5 to take care of
       missing AL133398 and AL031034.
     - hgCtgPos hg6 oo.27

o - Make and load GC percent table
     - login to cc94
     - cd /projects/hg2/gs.6/oo.27/bed
     - mkdir gcPercent
     - cd gcPercent
     - hg6 < ~/src/hg/lib/gcPercent.sql
     - hgGcPercent hg6 ../../nib


MAKING AND STORING mRNA AND EST ALIGNMENTS

o - Generate mRNA and EST alignments as so:
      - cdnaOnOoJobs /projects/hg2/gs.6/oo.27 /projects/hg2/gs.6/oo.27/jkStuff/postPsl refseq mrna est
      - cd /projects/hg2/gs.6/oo.27/jkStuff/postPsl
      - edit in/mrna in/est in/refseq to change directory h/mrna to h/mrna
      - edit all.con to remove the big contig on chromosome 6 (ctg15907)
      - split into 5 files.
      - submit each to condor
      - run psLayout to generate ctg15907.refseq.psl, ctg15907.mrna.psl 
        and ctg15907.est.psl on kks00 because it has a gigabyte of memory 
	(need 400 meg).
      - Wait 2 days or so for last stragglers.
      - Check errors by inspecting
      		cat err/* | more 
      - Do
           ls -1 log | wc
	   cat log/* | grep return | grep "value 0" | wc
	   ls -1 psl/*.psl psl/*/*.psl | wc
	and make sure two counts agree (except last should be two
	higher because of ctg15907).

o - Process EST and mRNA alignments into near best in genome.
    First the mRNAs:
      - cd /projects/hg2/gs.6/oo.27
      - cd /projects/hg2/gs.6/oo.27/jkStuff/postPsl/psl
      - mkdir /projects/hg2/gs.6/oo.27/psl
      - mkdir /projects/hg2/gs.6/oo.27/psl/mrnaRaw
      - ln */*.mrna.psl *.mrna.psl !$
      - mkdir /projects/hg2/gs.6/oo.27/psl/estRaw
      - ln */*.est.psl *.est.psl !$
      - cd /projects/hg2/gs.6/oo.27/psl
      - pslSort dirs mrnaGoodRaw.psl temp mrnaRaw
      - pslReps mrnaGoodRaw.psl mrnaBestRaw.psl mrnaBestRaw.psr
      - liftUp all_mrna.psl ../jkStuff/liftAll.lft carry mrnaBestRaw.psl
      - pslSortAcc nohead mrna temp all_mrna.psl
      - check mrna dir looks good.
      - rm -r mrnaRaw mrnaGoodRaw.psl mrnaBestRaw.psl mrnaBestRaw.psr
   (You might be able to do the EST version of this with jkStuff/postEst.sh).
   Since some of the EST files are over 2 gig, you need to do some of
   this on an Alpha.  However since some of the files are too large
   for the per-process memory on an Alpha, you'll have to do some on
   kks00 as well.  To do this run the following script
      - ssh kks00 
      - cd ~/oo
      - source jkStuff/postEst.sh

o - Load mRNA alignments into database.
      - ssh cc94
      cd /projects/hg2/gs.6/oo.27/psl/mrna
      foreach i (*.psl)
          mv $i $i:r_mrna.psl
      end
      hgLoadPsl hg6 *.psl
      cd ..
      - remove header lines from all_mrna.psl
      - hgLoadPsl hg6 all_mrna.psl

o - Load EST alignments into database.
      - ssh cc94
      cd /projects/hg2/gs.6/oo.27/psl/est
      foreach i (*.psl)
            mv $i $i:r_est.psl
      end
      hgLoadPsl hg6 *.psl
      cd ..
      - remove header lines from all_est.psl
      - hgLoadPsl hg6 all_est.psl

o - Create subset of ESTs with introns and load into database.
      - ssh kks00
      cd ~/oo
      source jkStuff/makeIntronEst.sh
      - ssh cc94
      cd ~/oo/psl/intronEst
      hgLoadPsl hg6 *.psl

SIMPLE REPEAT TRACK

o - Execute the following (which will take about 
    8 hours).
	mkdir ~/oo/bed/simpleRepeat
	cd ~/oo/nib
	foreach i (*.nib)
	    echo processing $i
	    trfBig $i ~/oo/bed/simpleRepeat/$i -bed
	end
	cd ~/oo/bed/simpleRepeat
	rm *.nib
	cat c*.bed > all.bed
    Then log onto cc94 and
        cd ~/oo/bed/simpleRepeat 
	hg6 < ~/src/hg/lib/simpleRepeat.sql
    At the mysql> prompt type
        load data local infile 'all.bed' into table simpleRepeat
	quit



PRODUCING MOUSE ALIGNMENTS

o - Make sure that contig/ctg*.fa.masked files exist already.
o - Mask simple repeats in addition to normal repeats with:
        mkdir ~/oo/jkStuff/trfCon
	cd ~/oo/allctgs
	/bin/ls -1 | grep -v '\.' > ~/oo/jkStuff/trfCon/ctg.lst
        cd ~/oo/jkStuff/trfCon
	mkdir err
	mkdir log
	mkdir out
    edit ~/oo/jkStuff/trf.gsub to update gs and oo version
	gensub ctg.lst ~/oo/jkStuff/trf.gsub
	mv gensub.out trf.con
	condor_submit trf.con
    wait for this to finish.
o - Load up the cluster with masked copies of the genome.
       ssh kks00
       cd ~/oo/allctgs
       zip -j -1 ../ctgFaTrf.zip */ctg*.fa.trf
       ssh cc01
       cd ~/oo
       ccCp ctgFaTrf.zip  /var/tmp/hg/gs.6/oo.27/ctgFaTrf.zip
    Do the following to unpack
       cd ~/cluster
    edit parOne to read
       cd /var/tmp/hg/gs.6/oo.27
       mkdir tanMaskNib
       cd tanMaskNib
       unzip ../ctgFaTrf.zip
    and then
       parAll
    Double check that all data is in place.
o - Do mouse/human alignments.
       ssh cc01
       cd ~/mouse/bigCon
       mkdir err
       mkdir out
       mkdir log
       mkdir psl
       ls -1S /var/tmp/hg/gs.6/oo.27/tanMaskNib/*.fa.trf > human.lst
       ls -1S /projects/hg3/mouse/ncbiTrace/fasta.mus_* > mouse.lst
       gensub2 mouse.lst human.lst aliMouse.gsub aliMouse.con
     break aliMouse.con into ten files and submit a couple a day.
o - Check alignments as so:
       ssh kks00
       cd ~/mouse/bigCon
       gensub2 mouse.lst human.lst fixMouse.gsub fixMouse.sh
       chmod a+x fixMouse.sh
       fixMouse.sh | tee fixMouse.out
o - If any output in fixMouse.out then resubmit jobs as so:
       ssh cc01
       cd ~/mouse/bigCon
       mv log log.old; mkdir log
       mv err err.old; mkdir err
       mv out out.old; mkdir out
       condor_submit fix.con
o - Sort alignments as so and get rid of pile-ups due to contamination:
       cd ~/mouse/bigCon
       pslCat -dir -check psl | liftUp -type=.psl stdout ~/oo.27/jkStuff/liftAll.lft warn stdin | pslSortAcc nohead chrom temp stdin
o - Get rid of big pile-ups due to contamination as so:
       cd chrom
       foreach i (*.psl)
           echo $i
           mv $i xxx
           pslUnpile -maxPile=150 xxx $i
       rm xxx
       end
o - Rename to correspond with tables as so and load into database:
       foreach i (*.psl)
	   set r = $i:r
           mv $i ${r}_blatMouse.psl
       end
o - Remove long redundant bits from read names by making a file
    called subs.in with the following line:
        gnl|ti^ti
    and running the command
        cd ~/mouse/bigCon/chrom
	subs -e -c ^ *.psl > /dev/null
o - load into database as so:
	ssh cc94
	cd ~/mouse/bigCon/chrom
        hgLoadPsl hg6 *.psl
	hgLoadRna addSeq '-abbr=gnl|' hg6 /projects/hg3/mouse/ncbiTrace/fasta.mus*
    These last two will take quite some time.  Perhaps an hour each.
      
o - Build fa file with only mouse traces that are part of alignments:
        ssh kks00
	cd ~/mouse/vsOo29
	subsetTraces ~/mouse/ncbiTrimmed blatMouse aliTrace.fa '-abbr=gnl|' '-tracePat=*.fa' '=pslPat=*.psl'

o - Load mouse trace ancillary information into the mouseTraceInfo table.
    This parses the trace id and template id out of the NCBI trace FASTA
    files.  The name of the table as well as the database is specified:

    hgTraceInfo hg6 mouseTraceInfo  /projects/hg3/mouse/ncbiTrace/fasta.mus*


PRODUCING KNOWN GENES (RefSeq)

o - Go to the Entrez browser at http://www.ncbi.nlm.nih.gov/Entrez/
    and select Nucleotide in the search box.  Go to Limits and select
    'exclude all of the above', Molecule: mRNA, Only from: RefSeq,
    and then put "Homo sapiens" [organism] in the search box and hit 
    go.  Tweak things so can save it to /projects/hg3/refseq.123/refseq.gb
    in GenBank flat file format.
o - Download ftp://ncbi.nlm.nih.gov/refseq/LocusLink/loc2acc and
    mim2loc to /projects/hg3/refseq.123
o - Similarly download refSeq proteins in fasta format to refSeq.pep
o - Align mRNAs as so:
    ssh kks00
    cd /projects/hg3/refSeq
    gbToFaRa mrna.fil refSeq.fa refSeq.ra refSeq.ta refseq.gb
    gfClient cc 17779 ~/oo/nib refSeq.fa rsVsOo27Raw.psl
        (This will take about 4 hours)
    pslReps rsVsOo27Raw.psl rsVsOo27Mixed.psl /dev/null -sizeMatters -minAli=0.97 -nearTop=0.005 -minCover=0.2

    pslSortAcc head chrom temp rsVsOo27Mixed.psl
    pslCat chrom/*.psl > rsVsOo27.psl
    grep NM_ rsVsOo27.psl > known.oo27.psl
o - Produce refGene, refPep, refMrna, and refLink tables as so:
     ssh cc94
     cd /projects/hg3/refseq.123
     hgRefSeqMrna hg6 refSeq.fa refSeq.ra known.oo27.psl loc2ref refSeqPep.fa mim2loc
    

LOADING IN EXTERNAL FILES

o - Load chromosome bands:
      - login to cc94
      - cd /projects/hg2/gs.6/oo.27/bed
      - mkdir cytoBands
      - cp /projects/cc/hg/mapplots/data/tracks/oo.27/cytobands.bed cytoBands
      - hg6 < ~/src/hg/lib/cytoBand.sql
      - Enter database with "hg6" command.
      - At mysql> prompt type in:
          load data local infile 'cytobands.bed' into table cytoBand;
      - At mysql> prompt type
          quit

o - Load STSs
     - login to cc94
     - cd ~/oo/bed
     - hg6 < ~/src/hg/lib/stsMarker.sql
     - mkdir stsMarker
     - cd stsMarker
     - cp /projects/cc/hg/mapplots/data/tracks/oo.27/stsMarker.bed stsMarker
      - Enter database with "hg6" command.
      - At mysql> prompt type in:
          load data local infile 'stsMarker.bed' into table stsMarker;
      - At mysql> prompt type
          quit
      - Ignore warnings from extra fields (7611 of them).

o - Load SNPs into database.
      - ssh cc94
      - cd ~/oo/bed
      - mkdir snp
      - cd snp
      - Download SNPs from ftp://ncbi.nlm.nih.gov/pub/sherry/dbSNPoo27.txt.gz
      - Unpack.
        grep RANDOM db*.txt > snpTsc.txt
        grep MIXED  db*.txt >> snpTsc.txt
        grep BAC_OVERLAP  db*.txt > snpNih.txt
        grep OTHER  db*.txt >> snpNih.txt
        awk -f filter.awk snpTsc.txt > snpTsc.contig.bed
        awk -f filter.awk snpNih.txt > snpNih.contig.bed
        liftUp snpTsc.bed ../../jkStuff/liftAll.lft warn snpTsc.contig.bed
        liftUp snpNih.bed ../../jkStuff/liftAll.lft warn snpNih.contig.bed
        hg6 < ~kent/src/hg/lib/snpTsc.sql
        hg6 < ~kent/src/hg/lib/snpNih.sql
      - Start up mySQL with the command 'hg6'.  At the prompt do:
         load data local infile 'snpTsc.bed' into table snpTsc;
         load data local infile 'snpNihbed' into table snpNih;

o - Load cpgIslands
     - login to cc94
     - cd /projects/hg2/gs.6/oo.27/bed
     - mkdir cpgIsland
     - cd cpgIsland
     - hg6 < ~kent/src/hg/lib/cpgIsland.sql
     - wget http://genome.wustl.edu:8021/pub/gsc1/achinwal/MapAccessions/cpg_dec12.oo27.tar
     - tar -xf cpg*.tar
     - awk -f filter.awk */ctg*/*.cpg > contig.bed
     - liftUp cpgIsland.bed ../../jkStuff/liftAll.lft warn contig.bed
     - Enter database with "hg6" command.
     - At mysql> prompt type in:
          load data local infile 'cpgIsland.bed' into table cpgIsland

o - Load rnaGene table
      - login to cc94
      - cd ~kent/src/hg/lib
      - hg6 < rnaGene.sql
      - cd /projects/hg2/gs.6/oo.27/bed
      - mkdir rnaGene
      - cd rnaGene
      - download data from ftp.genetics.wustl.edu/pub/eddy/pickup/ncrna-oo27.gff.gz
      - gunzip *.gz
      - liftUp chrom.gff ../../jkStuff/liftAll.lft carry ncrna-oo27.gff
      - hgRnaGenes hg6 chrom.gff

o - Mouse synteny track
     - login to cc94.
     - cd ~/kent/src/hg/lib
     - hg6 < mouseSyn.sql
     - mkdir ~/oo/bed/mouseSyn
     - cd ~/oo/bed/mouseSyn
     - Put Dianna Church's (church@ncbi.nlm.nih.gov) email attatchment as
       mouseSyn.txt
     - awk -f format.awk *.txt > mouseSyn.bed
     - delete first line of mouseSyn.bed
     - Enter database with "hg6" command.
     - At mysql> prompt type in:
          load data local infile 'mouseSyn.bed' into table mouseSyn

o - Load Ensembl genes:
     cd ~/oo/bed
     mkdir ensembl
     wget ftp://ftp.ensembl.org/current/data/gtf/ens100_genes_gtf_dump.tar.gz
     mkdir contig
     cd contig
     gtar -zxf ../*.gz
     cd ..
     liftUp chromRaw.gtf ../../jkStuff/liftAll.lft warn contig/*.gtf
     grep -v start_codon chromRaw.gtf | grep -v stop_codon > chrom.gtf
     ldHgGene hg6 ensGene chrom.gtf

o - Load Ensembl peptides:
     (poke around ensembl to get their peptide files as ensembl.pep)
     (substitute ENST for ENSP in ensemble.pep with subs)
     hgPepPred hg6 ensPep ensembl.pep

o - Load Softberry genes and associated peptides:
     cd ~/oo/bed
     mkdir softberry
     cd softberry
     get ftp://www.softberry.com/pub/GP_SCDEC/*
     ldHgGene hg6 softberryGene chr*.gff
     hgPepPred hg6 softberry *.pro
     hgSoftberryHom hg6 *.pro

o - Load Softberry promoters
     cd ~/oo/bed
     mkdir softPromoter
     cd softPromoter
    move Softberry_prometers.tar.gz from Victor's email
    attachment into this directory. 
     gtar -zxf *.gz
     hgSoftPromoter hg6 *.prom

o - Load exoFish table
     - login to cc94
     - cd /projects/hg2/gs.6/oo.27/bed
     - mkdir exoFish
     - cd exoFish
     - hg6 < ~kent/src/hg/lib/exoFish.sql
     - Put email attatchment from Olivier Jaillon (ojaaillon@genoscope.cns.fr)
       into /projects/hg2/gs.6/oo.27/bed/exoFish/oo27.ecore1.4jimkent
     - cp oo27.ecore1.4jimkent foo
     - Substitute "chr" for "CHR" in that file by
          subs -e foo >/dev/null
     - Copy in filter.awk from previous version's bed/exoFish dir.
     - Add dummy name column and convert to tab separated with
       awk -f filter.awk foo > exoFish.bed
     - Enter database with "hg6" command.
     - At mysql> prompt type in:
          load data local infile 'exoFish.bed' into table exoFish

>>> Done to here ~~~

o - Load Genie predicted genes and associated peptides.
     - cat */ctg*/ctg*.affymetrix.gtf > predContigs.gtf
     - liftUp predChrom.gtf ../../jkStuff/liftAll.lft warn predContigs.gtf
     - ldHgGene hg6 genieAlt predChrom.gtf

     - cat */ctg*/ctg*.affymetrix.aa > pred.aa
     - hgPepPred hg6 genie pred.aa 

     - hg6
         mysql> delete * from genieAlt where name like 'RS.%';
         mysql> delete * from genieAlt where name like 'C.%';


FAKING DATA FROM PREVIOUS VERSION
(This is just for until proper track arrives.  Rescues about
97% of data  Just an experiment, not really followed through on).

o - Rescuing STS track:
     - log onto cc94
     - mkdir ~/oo/rescue
     - cd !$
     - mkdir sts
     - cd sts
     - bedDown hg3 mapGenethon sts.fa sts.tab
     - echo ~/oo/sts.fa > fa.lst
     - pslOoJobs ~/oo ~/oo/rescue/sts/fa.lst ~/oo/rescue/sts g2g
     - log onto cc01
     - cc ~/oo/rescue/sts
     - split all.con into 3 parts and condor_submit each part
     - wait for assembly to finish
     - cd psl
     - mkdir all
     - ln ?/*.psl ??/*.psl *.psl all
     - pslSort dirs raw.psl temp all
     - pslReps raw.psl contig.psl /dev/null
     - rm raw.psl
     - liftUp chrom.psl ../../../jkStuff/liftAll.lft carry contig.psl
     - rm contig.psl
     - mv chrom.psl ../convert.psl


