This is the hint for "Fixing an Inconsistent Character Set"

http://rosalind.info/problems/cset/

 

The most important concept you will use is:

"Two splits S1|Sc1 and S2|Sc2 conflict when all four intersections S1 ∩ S2, S1 Sc2, Sc1 S2, and Sc1 Sc2 are nonempty."

 

Let's say we are splitting a set with six elements: 1, 2, 3, 4, 5, and 6.

The splits can be {1, 2} | {3, 4, 5, 6} or {1, 4, 5} | {2, 3, 6} etc.

 

We will check if two splits conflict using the concept above. 

Let's say S1 = {1, 2}, Sc1 = {3, 4, 5, 6}, S2 = {1, 4, 5}, and Sc2 = {2, 3, 6}.

 

S1 ∩ S2 = {1, 2}{1, 4, 5} = {1}

S1 ∩ Sc2 = {1, 2}{2, 3, 6} = {2}

Sc1 ∩ S2 = {3, 4, 5, 6} {1, 4, 5} = {4, 5}

Sc1 ∩ Sc2 = {3, 4, 5, 6}{2, 3, 6} = {3, 6}

 

As you can see, all four intersections are non-empty. Thus, those two splits conflict.

The other two splits {1, 2} | {3, 4, 5, 6} and {1, 2, 3} | {4, 5, 6} do not conflict because

 

S1 ∩ S2 = {1, 2}

S1 ∩ Sc2 = {}  ← Empty!

Sc1 ∩ S2 = {3}

Sc1 ∩ Sc2 = {4, 5, 6}

 

Now, look at the sample dataset:

100001
000110
111000
100111

We will number the columns as 1, 2, 3, 4, 5, 6 and re-write them as splits. Note it is not important to distinguish between 0 and 1.

100001 → {1, 6} | {2, 3, 4, 5}

000110 → {1, 2, 3, 6} | {4, 5}

111000 → {1, 2, 3} | {4, 5, 6}

100111 → {1, 4, 5, 6} | {2, 3}

 

Then, find a row that conflicts with more than one other row.

 

Conflicts between rows. X indicates "conflict"

Only row 1 and row 3 conflict. Thus, you can delete either one of the rows. Note you will need to remove only one row. This will make computation easier.

 

---
I don't put ads in my blog. If this post helps you, please consider donating a small amount of money that will be a huge motivation.
https://www.buymeacoffee.com/harryincupboard

This script takes a "taxon name" or "taxonomy ID", then outputs its NCBI lineage using the Biopython Entrez. There's another way to do it without connecting to the NCBI server by parsing the taxdump (which I posted here).

 

Your input would be either a taxon name like "Lentinula edodes" or a taxonomy ID like "5353." The output will be full lineage like "Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina; Agaricomycetes; Agaricomycetidae; Agaricales; Omphalotaceae; Lentinula"

 

The code is here:

import re
import sys
from Bio import Entrez
Entrez.email = 'A.N.Other@example.com'

def get_ncbi_tax(taxon):
    '''Getn NCBI taxonomy'''
    # If the input is a string
    if not re.match(r'\d+', taxon):
        # Get taxonomy ID using Entrez
        taxon2 = '"' + taxon + '"'
        handle = Entrez.esearch(
            db='taxonomy', term=taxon2, rettype='gb', retmode='text')
        record = Entrez.read(handle, validate=False)
        handle.close()
        # If there's no result
        if not record['IdList']:
            sys.exit(
                '[ERROR] The taxon "{}" you provided is invalid. '
                'Please check NCBI Taxonomy'.format(taxon))
        tax_id = record['IdList']
    else:
        tax_id = taxon

    # Now connect NCBI again using the tax_id
    # Entrez.efetch will give you various information
    handle2 = Entrez.efetch(db='taxonomy', id=tax_id, retmode='xml')
    record2 = Entrez.read(handle2, validate=False)
    handle2.close()

    tax_list = record2[0]['LineageEx']
    for tax_element in tax_list:
        print('{}: {}'.format(
            tax_element['Rank'], tax_element['ScientificName']))
            
# Now call the function
get_ncbi_tax('5353')  # Using tax ID
get_ncbi_tax('Lentinula edodes')  # Using tax name

 

Explanation

If you don't have an ID, you must search for it first using the Entrez.esearch(). The taxon name can be any level (e.g., species, genus, etc).

taxon2 = '"' + taxon + '"'
handle = Entrez.esearch(
	db='taxonomy', term=taxon2, rettype='gb', retmode='text')
record = Entrez.read(handle, validate=False)
handle.close()

2. After you get the ID, use the Entrez.efetch() to get the lineage. You can also get other data such as common name or acronym. Explore the record2 for details.

handle2 = Entrez.efetch(db='taxonomy', id=tax_id, retmode='xml')
record2 = Entrez.read(handle2, validate=False)
handle2.close()

 

 

---

I don't put ads in my blog. If this post helps you, please consider donating a small amount of money that will be a huge motivation.

https://www.buymeacoffee.com/harryincupboard

+ Recent posts