Genomics Homework 1: A Crash Course



I'll use the TA's example as a demonstration, since everyone has a different gene to look up.

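Note: this tutorial only guarantees that you can finish the assignment within half an hour. It is certainly not the most elegant way to do it.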

The transcript we need to look up here is uc002rpz.3

Get output->Get BED

chr2        37428774        37458740        uc002rpz.3        0        -        37428906        37458710        0        16        272,82,59,84,197,58,98,67,69,103,54,89,184,232,1493,186,        0,1152,1318,9367,10211,10703,12232,13254,14508,14684,15352,18753,20748,21538,25912,29780,

This really is damn long
From<https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=763976383_GbLRUAHtXz4mJsZuaxa6O4lzsEDg&boolshad.hgta_printCustomTrackHeaders=0&hgta_ctName=tb_knownGene&hgta_ctDesc=table+browser+query+on+knownGene&hgta_ctVis=pack&hgta_ctUrl=&fbQual=whole&fbUpBases=200&fbExonBases=0&fbIntronBases=0&fbDownBases=200&hgta_doGetBed=get+BED>

About BED format

  1. chrom – The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart – The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd – The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature, however, the number in position format will be represented. For example, the first 100 bases of chromosome 1 are defined as chrom=1, chromStart=0, chromEnd=100, and span the bases numbered 0-99 in our software (not 0-100), but will represent the position notation chr1:1-100.

The 9 additional optional BED fields are:

  1. name – Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
  2. score – A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray).
  3. strand – Defines the strand. Either “.” (=no strand) or “+” or “-”.
  4. thickStart – The starting position at which the feature is drawn thickly (for example, the start codon in gene displays). When there is no thick part, thickStart and thickEnd are usually set to the chromStart position.
  5. thickEnd – The ending position at which the feature is drawn thickly (for example the stop codon in gene displays).
  6. itemRgb – An RGB value of the form R,G,B (e.g. 255,0,0). If the track line itemRgb attribute is set to “On”, this RGB value will determine the display color of the data contained in this BED line. NOTE: It is recommended that a simple color scheme (eight colors or less) be used with this attribute to avoid overwhelming the color resources of the Genome Browser and your Internet browser.
  7. blockCount – The number of blocks (exons) in the BED line.
  8. blockSizes – A comma-separated list of the block sizes. The number of items in this list should correspond to blockCount.
  9. blockStarts – A comma-separated list of block starts. All of the blockStart positions should be calculated relative to chromStart. The number of items in this list should correspond to blockCount.

 

From <https://genome.ucsc.edu/FAQ/FAQformat.html>
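The 0-based, half-open convention described above is easy to get wrong, so here is a minimal Python sketch of the conversion from BED coordinates to the 1-based position notation used in the Genome Browser search box. The function name `bed_to_position` is made up for this illustration.

```python
def bed_to_position(chrom: str, chrom_start: int, chrom_end: int) -> str:
    """Convert a 0-based, half-open BED interval to 1-based position notation."""
    # The start shifts by one; the end stays the same because chromEnd is exclusive.
    return f"{chrom}:{chrom_start + 1}-{chrom_end}"


# The example from the UCSC FAQ: the first 100 bases of chr1.
print(bed_to_position("chr1", 0, 100))
# The transcript used in this tutorial.
print(bed_to_position("chr2", 37428774, 37458740))
```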

 

From the bed format we can answer:

• Extract this transcript as BED12 format (5’) chr2        37428774        37458740        uc002rpz.3        0        -        37428906        37458710        0        16        272,82,59,84,197,58,98,67,69,103,54,89,184,232,1493,186,        0,1152,1318,9367,10211,10703,12232,13254,14508,14684,15352,18753,20748,21538,25912,29780,
• Chromosome (5’) chr2
• Strand (5’) “-” (the minus strand; this is field 6 of the BED line)
• Exon number (5’) This equals blockCount, which is 16 in this case
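If you would rather not read the fields off by eye, the same three answers can be pulled out of the BED line with a few lines of Python. This is only a sketch: the `Bed12` type and `parse_bed12` helper are names invented here, and the parser assumes a single well-formed 12-column line.

```python
from typing import List, NamedTuple


class Bed12(NamedTuple):
    chrom: str
    chrom_start: int
    chrom_end: int
    name: str
    strand: str
    thick_start: int
    thick_end: int
    block_count: int
    block_sizes: List[int]
    block_starts: List[int]


def parse_bed12(line: str) -> Bed12:
    """Parse one whitespace-separated BED12 line (score and itemRgb are skipped)."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    # blockSizes and blockStarts end with a trailing comma, so drop empty items.
    sizes = [int(x) for x in f[10].split(",") if x]
    starts = [int(x) for x in f[11].split(",") if x]
    return Bed12(f[0], int(f[1]), int(f[2]), f[3], f[5],
                 int(f[6]), int(f[7]), int(f[9]), sizes, starts)


BED_LINE = (
    "chr2 37428774 37458740 uc002rpz.3 0 - 37428906 37458710 0 16 "
    "272,82,59,84,197,58,98,67,69,103,54,89,184,232,1493,186, "
    "0,1152,1318,9367,10211,10703,12232,13254,14508,14684,15352,18753,"
    "20748,21538,25912,29780,"
)

rec = parse_bed12(BED_LINE)
print("Chromosome:", rec.chrom)
print("Strand:", rec.strand)
print("Exon number:", rec.block_count)
```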

Then we set output format to sequence, click get output.

Choose “protein” and submit

We can get the fasta format protein sequence

uc002rpz.3
MAAVKEPLEFHAKRPWRPEEAVEDPDEEDEDNTSEAENGFSLEEVLRLGGTKQDYLMLAT
LDENEEVIDGGKKGAIDDLQQGELEAFIQNLNLAKYTKASLVEEDEPAEKENSSKKEVKI
PKINNKNTAESQRTSVNKVKNKNRPEPHSDENGSTTPKVKKDKQNIFEFFERQTLLLRPG
GKWYDLEYSNEYSLKPQPQDVVSKYKTLAQKLYQHEINLFKSKTNSQKGASSTWMKAIVS
SGTLGDRMAAMILLIQDDAVHTLQFVETLVNLVKKKGSKQQCLMALDTFKELLITDLLPD
NRKLRIFSQRPFDKLEQLSSGNKDSRDRRLILWYFEHQLKHLVAEFVQVLETLSHDTLVT
TKTRALTVAHELLCNKPEEEKALLVQVVNKLGDPQNRIATKASHLLETLLCKHPNMKGVV
SGEVERLLFRSNISSKAQYYAICFLNQMALSHEESELANKLITVYFCFFRTCVKKKDVES
KMLSALLTGVNRAYPYSQTGDDKVREQIDTLFKVLHIVNFNTSVQALMLLFQVMNSQQTI
SDRYYTALYRKMLDPGLMTCSKQAMFLNLVYKSLKADIVLRRVKAFVKRLLQVTCQQMPP
FICGALYLVSEILKAKPGLRSQLDDHPESDDEENFIDANDDEDMEKFTDADKETEIVKKL
ETEETVPETDVETKKPEVASWVHFDNLKGGKQLNKYDPFSRNPLFCGAENTSLWELKKLS
VHFHPSVALFAKTILQGNYIQYSGDPLQDFTLMRFLDRFVYRNPKPHKGKENTDSVVMQP
KRKHFIKDIRHLPVNSKEFLAKEESQIPVDEVFFHRYYKKVAVKEKQKRDADEESIEDVD
DEEFEELIDTFEDDNCFSSGKDDMDFAGNVKKRTKGAKDNTLDEDSEGSDDELGNLDDDE
VSLGSMDDEEFAEVDEDGGTFMDVLDDESESVPELEVHSKVSTKKSKRKGTDDFDFAGSF
QGPRKKKRNLNDSSLFVSAEEFGHLLDENMGSKFDNIGMNAMANKDNASLKQLRWEAERD
DWLHNRDAKSIIKKKKHFKKKRIKTTQKTKKQRK

 

From <https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=763976383_GbLRUAHtXz4mJsZuaxa6O4lzsEDg&hgta_geneSeqType=protein&hgta_doGenePredSequence=submit>

 

Warning: be careful, the output sequence contains line breaks (EOL, i.e. ‘\n’).

You can run it through a sequence cleaner.

E.g. http://www.detaibio.com/sms2/filter_protein.html

 

This is the answer for the CDS protein sequence.
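As an alternative to an online cleaner, the line breaks can be stripped locally. Below is a minimal sketch; `clean_fasta` is a name made up here, and it assumes the header line starts with `>` as in standard FASTA. Only the first two sequence lines from above are used to keep the example short.

```python
def clean_fasta(text: str) -> str:
    """Join the sequence lines of a single-record FASTA string into one line."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # Skip empty lines and the '>' header line, keep only sequence lines.
    return "".join(line for line in lines if line and not line.startswith(">"))


fasta_text = """>uc002rpz.3
MAAVKEPLEFHAKRPWRPEEAVEDPDEEDEDNTSEAENGFSLEEVLRLGGTKQDYLMLAT
LDENEEVIDGGKKGAIDDLQQGELEAFIQNLNLAKYTKASLVEEDEPAEKENSSKKEVKI
"""

protein = clean_fasta(fasta_text)
print(protein)
```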

• Extract the sequence of the exon 12 (5’), highlight this block (5’) and take a screenshot as pdf format (5’)

 

 

Click get sequence, and you will get the data in FASTA format.

Find exon No. 12 (or block No. 12, depending on what you call it).

 

 

Warning: exon No. 12 is xxx_11, because the index starts from zero.
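The coordinates of that block can also be computed from the BED line as a cross-check. The sketch below follows the zero-based indexing from the warning above (exon 12 is block index 11, counting blocks in ascending genomic order as BED stores them). Because this is a minus-strand transcript, it also offers a transcript-order option; which numbering the assignment expects is not stated here, so treat that option as an assumption. The helper name `exon_interval` is made up for this example.

```python
CHROM_START = 37428774
STRAND = "-"
BLOCK_SIZES = [272, 82, 59, 84, 197, 58, 98, 67, 69, 103, 54, 89, 184, 232, 1493, 186]
BLOCK_STARTS = [0, 1152, 1318, 9367, 10211, 10703, 12232, 13254, 14508, 14684,
                15352, 18753, 20748, 21538, 25912, 29780]


def exon_interval(exon_number: int, by_transcript_order: bool = False):
    """Return (block_index, start, end) as 0-based, half-open genomic coordinates."""
    index = exon_number - 1  # exon 12 -> block index 11
    if by_transcript_order and STRAND == "-":
        # On the minus strand the transcript runs from the last block to the first.
        index = len(BLOCK_SIZES) - exon_number
    start = CHROM_START + BLOCK_STARTS[index]
    return index, start, start + BLOCK_SIZES[index]


print(exon_interval(12))                            # genomic (BED block) order
print(exon_interval(12, by_transcript_order=True))  # 5'->3' order of the transcript
```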

 

You can highlight a block with Shift+click and drag.

Click the gene graphic and you will land on a page full of all kinds of data

https://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene=uc002rpz.3&hgg_prot=uc002rpz.3&hgg_chrom=chr2&hgg_start=37428774&hgg_end=37458740&hgg_type=knownGene&db=hg19&hgsid=763976383_GbLRUAHtXz4mJsZuaxa6O4lzsEDg

 

Does it have isoforms?

uc003qmu.2

 

From <https://genome.ucsc.edu/cgi-bin/hgGene?hgg_gene=uc003qmu.2&hgg_prot=uc003qmu.2&hgg_chrom=chr6&hgg_start=149979288&hgg_end=150039392&hgg_type=knownGene&db=hg19&hgsid=764000265_VoPfi1D1ZReEprAdTtMipArUJMoR>

Take this one as an example. It does.

 

But for uc002rpz.3

It doesn’t

 

Or you can click it

 

 

If it has multiple isoforms, the description mentions a transcript variant.

To be sure, search for LATS1 to see how many isoforms it has.

 

• mRNA length (5’)

This size includes the polyA tail. For the mRNA length without the polyA tail, add up all the blockSizes.
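Adding up the blockSizes by hand is error-prone, so here is the same sum in Python, using the blockSizes field from the BED line above.

```python
BLOCK_SIZES = "272,82,59,84,197,58,98,67,69,103,54,89,184,232,1493,186,"

# The field ends with a trailing comma, so skip the empty last item.
sizes = [int(x) for x in BLOCK_SIZES.split(",") if x]
mrna_length = sum(sizes)
print(mrna_length)  # 3327
```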
• TSS position (5’)

What is the TSS position?

Transcription start site, the starting point of the process of creating a complementary RNA copy of a sequence of DNA

 

From <https://en.wikipedia.org/wiki/TSS>

 

Because this is a minus-strand gene (I am not sure this is the right way to put it), the TSS sits at the chromEnd side of the BED interval rather than at chromStart.
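The strand logic can be written down as a short sketch. The function name `tss_position` is made up here, and the result is reported as a 1-based coordinate; whether the assignment wants 0-based or 1-based numbers is an assumption you should check.

```python
def tss_position(chrom_start: int, chrom_end: int, strand: str) -> int:
    """Return the 1-based genomic position of the first transcribed base."""
    if strand == "-":
        # Minus strand: transcription starts at the highest coordinate.
        # chromEnd is exclusive in BED, so it equals the 1-based last base.
        return chrom_end
    # Plus strand: 0-based chromStart becomes 1-based by adding one.
    return chrom_start + 1


print(tss_position(37428774, 37458740, "-"))  # 37458740
```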

 

 

• CDS end site position (5’)
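The same strand reasoning applies to the CDS end site. In the BED line, thickStart and thickEnd mark the coding region in genomic order, so on the minus strand the CDS ends (in the direction of translation) at thickStart. The sketch below assumes a strand-aware, 1-based answer is wanted; `cds_end_site` is a hypothetical helper name.

```python
def cds_end_site(thick_start: int, thick_end: int, strand: str) -> int:
    """Return the 1-based genomic position of the last CDS base, strand-aware."""
    if strand == "-":
        # Minus strand: translation runs toward lower coordinates,
        # so the last coding base is the first base of the thick region.
        return thick_start + 1
    # Plus strand: thickEnd is exclusive, so it equals the 1-based last base.
    return thick_end


print(cds_end_site(37428906, 37458710, "-"))  # 37428907
```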

 

• CDS length (hint: not including introns) (5’)

This is equal to the ORF size: 3165.
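The value 3165 can be reproduced from the BED line by summing only the part of each exon that lies between thickStart and thickEnd. This is a sketch of that arithmetic, with all numbers taken from the BED line above.

```python
CHROM_START = 37428774
THICK_START, THICK_END = 37428906, 37458710
BLOCK_SIZES = [272, 82, 59, 84, 197, 58, 98, 67, 69, 103, 54, 89, 184, 232, 1493, 186]
BLOCK_STARTS = [0, 1152, 1318, 9367, 10211, 10703, 12232, 13254, 14508, 14684,
                15352, 18753, 20748, 21538, 25912, 29780]

cds_length = 0
for rel_start, size in zip(BLOCK_STARTS, BLOCK_SIZES):
    exon_start = CHROM_START + rel_start
    exon_end = exon_start + size
    # Overlap of this exon with the coding region [thickStart, thickEnd).
    overlap = min(exon_end, THICK_END) - max(exon_start, THICK_START)
    cds_length += max(0, overlap)

print(cds_length)  # 3165
```

As a sanity check, 3165 is divisible by 3, which is what you expect from a coding sequence made of whole codons.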

• Get all transcript IDs within 100,000 bp upstream of this transcript TSS (UCSC Genes annotation) (5’)

 

For a minus-strand gene, upstream of the gene is downstream on the chromosome (toward higher coordinates).

https://genome.ucsc.edu/cgi-bin/hgTables

 

Warning: do not include the transcript itself in the answer.
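To get the region to paste into the Table Browser position box, the 100,000 bp window can be computed from the BED line. This is a sketch under the assumption that the window starts at the base right next to the TSS and does not include the TSS itself; `upstream_window` is a name made up for this example.

```python
def upstream_window(chrom: str, chrom_start: int, chrom_end: int,
                    strand: str, distance: int = 100_000) -> str:
    """Return the upstream region of the TSS in 1-based position notation."""
    if strand == "-":
        # Minus strand: upstream means higher chromosome coordinates.
        start0, end0 = chrom_end, chrom_end + distance
    else:
        # Plus strand: upstream means lower coordinates (clamped at zero).
        start0, end0 = max(0, chrom_start - distance), chrom_start
    # Convert the 0-based, half-open interval to 1-based notation.
    return f"{chrom}:{start0 + 1}-{end0}"


print(upstream_window("chr2", 37428774, 37458740, "-"))  # chr2:37458741-37558740
```

Remember to drop uc002rpz.3 from the returned list if it shows up.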

 

Or you can do it this way


Good luck with the homework.
