'English' 카테고리의 글 목록

English

Rosalind hint - Fixing an Inconsistent Character Set 2021.01.31
Biopython - How to get NCBI taxonomy using Entrez 2021.01.24

Rosalind hint - Fixing an Inconsistent Character Set

벽장속의해리 2021. 1. 31. 14:04

2021. 1. 31. 14:04

This is the hint for "Fixing an Inconsistent Character Set"

http://rosalind.info/problems/cset/

The most important concept you will use is:

"Two splits S1|Sc1 and S2|Sc2 conflict when all four intersections S1 ∩ S2, S1 ∩ Sc2, Sc1 ∩ S2, and Sc1 ∩ Sc2 are nonempty."

Let's say we are splitting a set with six elements: 1, 2, 3, 4, 5, and 6.

The splits can be {1, 2} | {3, 4, 5, 6} or {1, 4, 5} | {2, 3, 6} etc.

We will check if two splits conflict using the concept above.

Let's say S1 = {1, 2}, Sc1 = {3, 4, 5, 6}, S2 = {1, 4, 5}, and Sc2 = {2, 3, 6}.

S1 ∩ S2 = {1, 2} ∩ {1, 4, 5} = {1}

S1 ∩ Sc2 = {1, 2} ∩ {2, 3, 6} = {2}

Sc1 ∩ S2 = {3, 4, 5, 6} ∩ {1, 4, 5} = {4, 5}

Sc1 ∩ Sc2 = {3, 4, 5, 6} ∩ {2, 3, 6} = {3, 6}

As you can see, all four intersections are non-empty. Thus, those two splits conflict.

The other two splits {1, 2} | {3, 4, 5, 6} and {1, 2, 3} | {4, 5, 6} do not conflict because

S1 ∩ S2 = {1, 2}

S1 ∩ Sc2 = {} ← Empty!

Sc1 ∩ S2 = {3}

Sc1 ∩ Sc2 = {4, 5, 6}

Now, look at the sample dataset:

We will number the columns as 1, 2, 3, 4, 5, 6 and re-write them as splits. Note it is not important to distinguish between 0 and 1.

100001 → {1, 6} | {2, 3, 4, 5}

000110 → {1, 2, 3, 6} | {4, 5}

111000 → {1, 2, 3} | {4, 5, 6}

100111 → {1, 4, 5, 6} | {2, 3}

Then, find a row that conflicts with more than one other row.

Conflicts between rows. X indicates "conflict"

Only row 1 and row 3 conflict. Thus, you can delete either one of the rows. Note you will need to remove only one row. This will make computation easier.

---
I don't put ads in my blog. If this post helps you, please consider donating a small amount of money that will be a huge motivation.
https://www.buymeacoffee.com/harryincupboard

저작자표시 (새창열림)

Biopython - How to get NCBI taxonomy using Entrez

벽장속의해리 2021. 1. 24. 14:05

2021. 1. 24. 14:05

This script takes a "taxon name" or "taxonomy ID", then outputs its NCBI lineage using the Biopython Entrez. There's another way to do it without connecting to the NCBI server by parsing the taxdump (which I posted here).

Your input would be either a taxon name like "Lentinula edodes" or a taxonomy ID like "5353." The output will be full lineage like "Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina; Agaricomycetes; Agaricomycetidae; Agaricales; Omphalotaceae; Lentinula"

The code is here:

import re
import sys
from Bio import Entrez
Entrez.email = 'A.N.Other@example.com'

def get_ncbi_tax(taxon):
    '''Getn NCBI taxonomy'''
    # If the input is a string
    if not re.match(r'\d+', taxon):
        # Get taxonomy ID using Entrez
        taxon2 = '"' + taxon + '"'
        handle = Entrez.esearch(
            db='taxonomy', term=taxon2, rettype='gb', retmode='text')
        record = Entrez.read(handle, validate=False)
        handle.close()
        # If there's no result
        if not record['IdList']:
            sys.exit(
                '[ERROR] The taxon "{}" you provided is invalid. '
                'Please check NCBI Taxonomy'.format(taxon))
        tax_id = record['IdList']
    else:
        tax_id = taxon

    # Now connect NCBI again using the tax_id
    # Entrez.efetch will give you various information
    handle2 = Entrez.efetch(db='taxonomy', id=tax_id, retmode='xml')
    record2 = Entrez.read(handle2, validate=False)
    handle2.close()

    tax_list = record2[0]['LineageEx']
    for tax_element in tax_list:
        print('{}: {}'.format(
            tax_element['Rank'], tax_element['ScientificName']))
            
# Now call the function
get_ncbi_tax('5353')  # Using tax ID
get_ncbi_tax('Lentinula edodes')  # Using tax name

Explanation

If you don't have an ID, you must search for it first using the Entrez.esearch(). The taxon name can be any level (e.g., species, genus, etc).

taxon2 = '"' + taxon + '"'
handle = Entrez.esearch(
	db='taxonomy', term=taxon2, rettype='gb', retmode='text')
record = Entrez.read(handle, validate=False)
handle.close()

2. After you get the ID, use the Entrez.efetch() to get the lineage. You can also get other data such as common name or acronym. Explore the record2 for details.

handle2 = Entrez.efetch(db='taxonomy', id=tax_id, retmode='xml')
record2 = Entrez.read(handle2, validate=False)
handle2.close()

---

I don't put ads in my blog. If this post helps you, please consider donating a small amount of money that will be a huge motivation.

https://www.buymeacoffee.com/harryincupboard

PREV 이전 1 NEXT 다음

MBN's Blog

English

Rosalind hint - Fixing an Inconsistent Character Set

Biopython - How to get NCBI taxonomy using Entrez

Explanation

+ Recent posts

티스토리툴바