
CS50 Problem Set 6 (DNA), Python: I can't count non-contiguous DNA repeats; my code works with the small database but fails with the large one

I am new to programming, so I decided to take the CS50 course. For Problem Set 6 (Python) I wrote the code below. It works with the small database but fails with the large one, so I am only asking for help with the idea. Here is the course page, and you can download it here (from Google Drive).

My code

import csv
from sys import argv


class DnaTest(object):

    """CLASS HELP: the DNA test, simply give DNA sequence to the program, and it searches in the database to
       determine the person who owns the sample.

    type the following in cmd to run the program:
    python dna.py databases/small.csv sequences/1.txt """

    def __init__(self):
        # get filename from the command line without directory names "database" and "sequence"
        self.sequence_argv = str(argv[2][10:])
        self.database_argv = str(argv[1][10:])

        # Automatically open and close the database file
        with open(f"databases/{self.database_argv}", 'r') as database_file:
            self.database_file = database_file.readlines()

        # Automatically open and close the sequence file
        with open(f"sequences/{self.sequence_argv}", 'r') as sequence_file:
            self.sequence_file = sequence_file.readline()

        # Read CSV file as a dictionary, function: compare_database_with_sequence()
        self.csv_database_dictionary = csv.DictReader(self.database_file)
        # Read CSV file to take the first row, function: get_str_list()
        self.reader = csv.reader(self.database_file)
        # computed dictionary from the sequence file
        self.dict_from_sequence = {}

    # returns the first row of the CSV file (database file)
    def get_str_list(self):
        # get first row from CSV file
        self.keys = next(self.reader)

        # remove 'name' from list, get STR only.
        self.keys.remove("name")
        return self.keys

    # returns dictionary of computed STRs from the sequence file (key(STR): value(count))
    def get_str_count_from_sequence(self):  # PROBLEM HERE AND RETURN DICTIONARY FROM IT !
        for dna_seq in self.get_str_list():
            self.dict_from_sequence.update({dna_seq: self.sequence_file.count(dna_seq)})

    # compare computed dictionary with the database dictionaries and get the person name
    def compare_database_with_sequence(self):
        for dictionary in self.csv_database_dictionary:
            dict_from_database = dict(dictionary)
            dict_from_database.pop('name')

            # compare the database dictionaries with sequence computed dictionary
            shared_items = {k: self.dict_from_sequence[k] for k in self.dict_from_sequence if
                            k in dict_from_database and self.dict_from_sequence[k] == int(dict_from_database[k])}

            if len(self.dict_from_sequence) == len(shared_items):
                dict_from_database = dict(dictionary)
                print(dict_from_database['name'])
                break


# run the class and its functions (Program control)
if __name__ == '__main__':
    RunTest = DnaTest()
    RunTest.get_str_count_from_sequence()
    RunTest.compare_database_with_sequence()

The problem

In get_str_count_from_sequence(self) I use count, which works as long as all the repeats of an STR are contiguous. In some sequence files (for example 5.txt), however, the repeats are not contiguous, so count adds up every occurrence and I cannot compare the length of each consecutive run. I searched but found nothing simple: some solutions use the regex module and others use the re module, and I could not work out a solution from them.
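To make the difference concrete, here is a minimal sketch. The sample string and the helper name longest_run are made up for illustration and are not taken from the CS50 files. str.count returns every non-overlapping occurrence in the whole string, whereas the problem asks for the longest run of back-to-back repeats.

```python
def longest_run(sequence, unit):
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    if not unit:
        return 0  # an empty unit has no meaningful run
    best = 0
    for start in range(len(sequence)):
        # count how many copies of `unit` follow each other starting at `start`
        repeats = 0
        while sequence.startswith(unit, start + repeats * len(unit)):
            repeats += 1
        best = max(best, repeats)
    return best


# hypothetical sample: a run of 2, some filler, then a run of 3
sample = "AGATCAGATC" + "TTTT" + "AGATCAGATCAGATC"

print(sample.count("AGATC"))         # 5: every occurrence in the string
print(longest_run(sample, "AGATC"))  # 3: longest consecutive run
```

The scan tries every start position, which is slow but easy to reason about. It is only meant to show what the program should compute.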

TEST CASE:

From the CS50 site: Run your program as python dna.py databases/large.csv sequences/6.txt. Your program should output Luna.

Specification

From the CS50 site.

1
MrAhmedElsayed

Thanks to "Piyush Singh": I followed your advice and used re to solve the problem. First I used re to find each group (a run of consecutive repeats) and stored the match lengths in a dictionary. Then I took the largest value for each STR and cleared the dictionary before moving on to the next STR. I also updated the function that compares the two dictionaries (the one read from the database and the one computed from the sequence file).

import csv
from sys import argv
import re


class DnaTest(object):
    """CLASS HELP: the DNA test, simply give DNA sequence to the program, and it searches in the database to
       determine the person who owns the sample.

    type the following in cmd to run the program:
    python dna.py databases/small.csv sequences/1.txt """

    def __init__(self):
        # get filename from the command line without directory names "database" and "sequence"
        self.sequence_argv = str(argv[2][10:])
        self.database_argv = str(argv[1][10:])

        # Automatically open and close the database file
        with open(f"databases/{self.database_argv}", 'r') as database_file:
            self.database_file = database_file.readlines()

        # Automatically open and close the sequence file
        with open(f"sequences/{self.sequence_argv}", 'r') as sequence_file:
            self.sequence_file = sequence_file.readline()

        # Read CSV file as a dictionary, function: compare_database_with_sequence()
        self.csv_database_dictionary = csv.DictReader(self.database_file)
        # Read CSV file to take the first row, function: get_str_list()
        self.reader = csv.reader(self.database_file)
        # computed dictionary from the sequence file
        self.dict_from_sequence = {}
        self.select_max = {}

    # returns the first row of the CSV file (database file)
    def get_str_list(self):
        # get first row from CSV file
        keys = next(self.reader)

        # remove 'name' from list, get STR only.
        keys.remove("name")
        return keys

    # fills self.dict_from_sequence with the longest consecutive run of each STR (key(STR): value(count))
    def get_str_count_from_sequence(self):
        for str_key in self.get_str_list():
            regex = rf"({str_key})+"
            matches = re.finditer(regex, self.sequence_file, re.MULTILINE)

            # my code
            for match in matches:
                match_len = len(match.group())
                key_len = len(str_key)
                self.select_max[match] = match_len
                #  select max value from results dictionary (select_max)
                max_values = max(self.select_max.values())

                if max_values >= key_len:
                    result = int(max_values / key_len)
                    self.select_max[str_key] = result
                    self.dict_from_sequence[str_key] = result

            # clear compare dictionary to select new key
            self.select_max.clear()

    # compare computed dictionary with the database dictionaries and get the person name
    def compare_database_with_sequence(self):
        # comparison function between database dictionary and sequence computed dictionary
        def dicts_equal(from_sequence, from_database):
            """ return True if all keys and values are the same """
            return all(k in from_database and int(from_sequence[k]) == int(from_database[k]) for k in from_sequence) \
                and all(k in from_sequence and int(from_sequence[k]) == int(from_database[k]) for k in from_database)

        def check_result():
            for dictionary in self.csv_database_dictionary:
                dict_from_database = dict(dictionary)
                dict_from_database.pop('name')

                if dicts_equal(self.dict_from_sequence, dict_from_database):
                    dict_from_database = dict(dictionary)
                    print(dict_from_database['name'])
                    return True

        if not check_result():
            print("No match")


# run the class and its functions (Program control)
if __name__ == '__main__':
    RunTest = DnaTest()
    RunTest.get_str_count_from_sequence()
    RunTest.compare_database_with_sequence()
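The counting step in the code above can also be written more compactly. The following is only a sketch of the same regex idea, not a replacement for the class. The function names longest_run_re and count_strs and the sample string are hypothetical. The default=0 argument is an addition: it records a count of 0 for an STR that never appears, a case the loop above leaves out of the dictionary.

```python
import re


def longest_run_re(sequence, unit):
    """Largest number of consecutive copies of `unit` found by a regex scan."""
    # (?:...)+ matches one or more copies of the unit placed back to back;
    # re.escape keeps the unit literal even if it contained regex metacharacters
    pattern = re.compile(rf"(?:{re.escape(unit)})+")
    # default=0 covers an STR that never appears in the sequence
    longest = max((len(m.group()) for m in pattern.finditer(sequence)), default=0)
    return longest // len(unit)


def count_strs(sequence, units):
    """Map each STR to its longest consecutive run in `sequence`."""
    return {unit: longest_run_re(sequence, unit) for unit in units}


# hypothetical sample data, not one of the CS50 sequence files
sample = "AATGAATG" + "CC" + "TATCTATCTATC" + "G" + "AATG"
print(count_strs(sample, ["AATG", "TATC", "AGATC"]))
# {'AATG': 2, 'TATC': 3, 'AGATC': 0}
```

One caveat: re.finditer returns non-overlapping matches from left to right, so for a unit that can overlap with itself (for example "ATA" in "ATATAATA") it may report a shorter run than an exhaustive scan of every start position.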

Checking the solution

Run your program as python dna.py databases/small.csv sequences/1.txt. Your program should output Bob.
Run your program as python dna.py databases/small.csv sequences/2.txt. Your program should output No match.
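The matching step can also be tried without the course files. The sketch below assumes a database in the same layout as the CS50 CSV files (a name column followed by one column per STR). The rows, the names, and the helper find_match are invented for illustration.

```python
import csv
import io

# hypothetical database text in the same layout as the CS50 CSV files
DATABASE = """name,AGATC,AATG,TATC
Ana,2,8,3
Ben,4,1,5
"""


def find_match(database_text, counts):
    """Return the name whose STR counts all equal `counts`, or "No match"."""
    for row in csv.DictReader(io.StringIO(database_text)):
        name = row.pop("name")
        # the remaining columns are STR counts stored as strings
        if all(int(value) == counts.get(unit) for unit, value in row.items()):
            return name
    return "No match"


print(find_match(DATABASE, {"AGATC": 4, "AATG": 1, "TATC": 5}))  # Ben
print(find_match(DATABASE, {"AGATC": 1, "AATG": 1, "TATC": 1}))  # No match
```

Because counts.get returns None for a missing STR, a profile that lacks one of the columns never matches any row.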

For more checks, see the CS50 DNA problem set.

1
MrAhmedElsayed