SQLite with real "Full Text Search" and spelling mistakes (FTS + spellfix together)

Question

Let's say we have 1 million rows like this:

    import sqlite3

    db = sqlite3.connect(':memory:')
    c = db.cursor()
    c.execute('CREATE TABLE mytable (id integer, description text)')
    c.execute('INSERT INTO mytable VALUES (1, "Riemann")')
    c.execute('INSERT INTO mytable VALUES (2, "All the Carmichael numbers")')

Context:

I know how to do the following with SQLite:

Find a row with a single-word query, allowing a few spelling mistakes, using the spellfix module and Levenshtein distance (I have posted a detailed answer here on how to compile it, how to use it, ...):

    db.enable_load_extension(True)
    db.load_extension('./spellfix')
    c.execute('SELECT * FROM mytable WHERE editdist3(description, "Riehmand") < 300')
    print(c.fetchall())

    # Query: 'Riehmand'
    # Answer: [(1, u'Riemann')]
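To see why this kind of query has to visit every row, here is a minimal sketch that does not need the spellfix extension. It assumes a plain unit-cost Levenshtein distance written in Python and registered under the hypothetical SQL name `levenshtein`; this is a stand-in for `editdist3`, whose cost scale is different (hence the threshold of 3 here rather than 300).

```python
import sqlite3


def levenshtein(a, b):
    """Unit-cost edit distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete a character from a
                cur[j - 1] + 1,            # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


db = sqlite3.connect(":memory:")
# Hypothetical SQL function name; SQLite calls back into Python for each row.
db.create_function("levenshtein", 2, levenshtein)
db.execute("CREATE TABLE mytable (id integer, description text)")
db.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "Riemann"), (2, "All the Carmichael numbers")],
)

rows = db.execute(
    "SELECT id, description FROM mytable WHERE levenshtein(description, ?) <= 3",
    ("Riehmand",),
).fetchall()
print(rows)
```

The function is evaluated once per row, so no index can help: the cost grows linearly with the table size.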

With 1M rows, this would be super slow! As detailed here, PostgreSQL might have an optimization for this using trigrams. A fast solution, available with SQLite, is to use a VIRTUAL TABLE USING spellfix:

    c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
    c.execute('INSERT INTO mytable3(Word) VALUES ("Riemann")')
    c.execute('SELECT * FROM mytable3 WHERE Word MATCH "Riehmand"')
    print(c.fetchall())

    # Query: 'Riehmand'
    # Answer: [(u'Riemann', 1, 76, 0, 107, 7)], working!

Find an expression with a query matching one or several words, using FTS ("Full Text Search"):
```
c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
c.execute('INSERT INTO mytable2 VALUES (2, "All the Carmichael numbers")')
c.execute('SELECT * FROM mytable2 WHERE description MATCH "NUMBERS carmichael"')
print(c.fetchall())

# Query: 'NUMBERS carmichael'
# Answer: [(2, u'All the Carmichael numbers')]
```
It is case insensitive and you can even use a query with two words in the wrong order, etc.: FTS is quite powerful indeed. But the drawback is that each query keyword must be spelled correctly, i.e. FTS alone doesn't allow spelling mistakes.
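The limitation can be reproduced in a few lines. This is only a sketch: it assumes the SQLite library bundled with your Python was compiled with FTS4 support, and the table name `docs` is made up for the illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Assumes FTS4 is compiled into this SQLite build; "docs" is a hypothetical name.
db.execute("CREATE VIRTUAL TABLE docs USING fts4(description)")
db.execute("INSERT INTO docs VALUES (?)", ("All the Carmichael numbers",))


def search(query):
    """Run a full-text query and return the matching descriptions."""
    sql = "SELECT description FROM docs WHERE description MATCH ?"
    return db.execute(sql, (query,)).fetchall()


exact = search("NUMBERS carmichael")        # wrong case, wrong order: still matches
misspelled = search("NUMMBER carmickaeel")  # misspelled tokens: no match
print(exact)
print(misspelled)
```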

Question:

How to do a Full Text Search (FTS) with SQLite and also allow spelling mistakes? i.e. "FTS + spellfix" together

Example:

row in the DB: "All the Carmichael numbers"
query: "NUMMBER carmickaeel" should match it!

How to do this with SQLite?

It is probably possible with SQLite, since this page states:

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.

Linked question: String similarity with Python + Sqlite (Levenshtein distance / edit distance)

Martijn Pieters · Accepted Answer

The spellfix1 documentation actually tells you how to do this. From the Overview section:

If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an fts4aux table:

    INSERT INTO demo(Word) SELECT term FROM search_aux WHERE col='*';

The SELECT term from search_aux WHERE col='*' statement extracts all the indexed tokens.

Connecting this with your examples, where mytable2 is your fts4 virtual table, you can create an fts4aux table and insert those tokens into your mytable3 spellfix1 table with:

    CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
    INSERT INTO mytable3(Word) SELECT term FROM mytable2_terms WHERE col='*';

You probably want to qualify that query further so that it skips any terms already inserted into spellfix1, otherwise you end up with duplicate entries:

    INSERT INTO mytable3(Word)
        SELECT term FROM mytable2_terms
        WHERE col='*' AND
            term not in (SELECT Word from mytable3_vocab);
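The effect of the `NOT IN` filter can be checked without either extension. In this sketch, two ordinary tables with the hypothetical names `terms` and `vocab` stand in for the fts4aux table and the spellfix1 vocabulary; only the insert pattern is the same as above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE terms (term TEXT);  -- stand-in for the fts4aux table
    CREATE TABLE vocab (word TEXT);  -- stand-in for the spellfix1 vocabulary
""")


def sync_vocab(conn):
    """Copy only the terms that are not yet present in the vocabulary."""
    with conn:
        conn.execute(
            "INSERT INTO vocab(word) "
            "SELECT term FROM terms "
            "WHERE term NOT IN (SELECT word FROM vocab)"
        )


db.executemany(
    "INSERT INTO terms VALUES (?)",
    [("all",), ("the",), ("carmichael",), ("numbers",)],
)
sync_vocab(db)
sync_vocab(db)  # running it again adds nothing
db.execute("INSERT INTO terms VALUES ('riemann')")
sync_vocab(db)  # only the new term is added

words = [w for (w,) in db.execute("SELECT word FROM vocab ORDER BY word")]
print(words)
```

Because the insert is idempotent, it can be rerun after every batch of newly indexed text.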

Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query against mytable2.

Depending on your needs, you may have to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string needs to be split, each token run through the spellfix1 table to map it to an existing token, and the resulting tokens then fed to the fts4 query.

Ignoring SQL syntax handling for this, using Python to do the splitting is easy enough:

    def spellcheck_terms(conn, terms):
        cursor = conn.cursor()
        # one SELECT per term, each returning (original term, best match)
        base_spellfix = """
            SELECT :term{0} as term, Word FROM mytable3
            WHERE Word MATCH :term{0} and top=1
        """
        terms = terms.split()
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join([
            base_spellfix.format(i + 1) for i in range(len(params))])
        cursor.execute(query, params)
        correction_map = dict(cursor)
        # terms without a match are kept unchanged
        return " ".join([correction_map.get(t, t) for t in terms])

    def spellchecked_search(conn, terms):
        corrected_terms = spellcheck_terms(conn, terms)
        cursor = conn.cursor()
        fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
        cursor.execute(fts_query, (corrected_terms,))
        return cursor.fetchall()

This then returns [('All the Carmichael numbers',)] for spellchecked_search(db, "NUMMBER carmickaeel").
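If you want to try the split / correct / rejoin flow before compiling spellfix1, here is a rough stand-in. It assumes the standard library's `difflib.get_close_matches` in place of the spellfix1 lookup and a hand-written vocabulary list in place of the fts4aux terms; the function name `correct_query` and the 0.6 cutoff are illustrative choices, and the ranking will not be identical to what spellfix1 produces.

```python
import difflib


def correct_query(query, vocabulary, cutoff=0.6):
    """Replace each whitespace-separated token by its closest vocabulary word.

    Tokens with no sufficiently close word are kept unchanged.
    """
    corrected = []
    for token in query.split():
        # Vocabulary tokens are lowercase, so compare in lowercase.
        matches = difflib.get_close_matches(
            token.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


# Hypothetical vocabulary, as it would come out of the fts4aux table.
vocabulary = ["all", "the", "carmichael", "numbers"]
print(correct_query("NUMMBER carmickaeel", vocabulary))
```

The corrected string can then be passed as the parameter of the `MATCH` query, exactly as `spellchecked_search` does.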

Leaving the spellcheck handling to Python then allows you to support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools to do just that.

A complete example, wrapping the above approach in a class, which simply extracts terms as sequences of alphanumeric characters (which, by my reading of the expression syntax specs, suffices):

    import re
    import sqlite3
    import sys


    class FTS4SpellfixSearch(object):
        def __init__(self, conn, spellfix1_path):
            self.conn = conn
            self.conn.enable_load_extension(True)
            self.conn.load_extension(spellfix1_path)

        def create_schema(self):
            self.conn.executescript(
                """
                CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                    USING fts4(description text);
                CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                    USING fts4aux(fts4data);
                CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                    USING spellfix1;
                """
            )

        def index_text(self, *text):
            cursor = self.conn.cursor()
            with self.conn:
                params = ((t,) for t in text)
                cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
                cursor.execute(
                    """
                    INSERT INTO spellfix1data(Word)
                    SELECT term FROM fts4data_terms
                    WHERE col='*' AND
                        term not in (SELECT Word from spellfix1data_vocab)
                    """
                )

        # fts3 / 4 search expression tokenizer
        # no attempt is made to validate the expression, only
        # to identify valid search terms and extract them.
        # the fts3/4 tokenizer considers any alphanumeric ASCII character
        # and character in the range U+0080 and over to be terms.
        if sys.maxunicode == 0xFFFF:
            # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
            _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
        else:
            # UCS4
            _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

        def _terms_from_query(self, search_query):
            """Extract search terms from a fts3/4 query

            Returns a list of terms and a template such that
            template.format(*terms) reconstructs the original query.

            terms using partial* syntax are ignored, as you can't distinguish
            between a misspelled prefix search that happens to match existing
            tokens and a valid spelling that happens to have 'near' tokens in
            the spellfix1 database that would not otherwise be matched by fts4

            """
            template, terms, lastpos = [], [], 0
            for match in self._fts4_expr_terms.finditer(search_query):
                token, (start, end) = match.group(), match.span()
                # skip columnname: and partial* terms by checking next character
                ismeta = search_query[end:end + 1] in {":", "*"}
                # skip digits if preceded by "NEAR/"
                ismeta = ismeta or (
                    token.isdigit() and template and template[-1] == "NEAR"
                    and "/" in search_query[lastpos:start])
                if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                    # full search term, not a keyword, column name or partial*
                    terms.append(token)
                    token = "{}"
                template += search_query[lastpos:start], token
                lastpos = end
            template.append(search_query[lastpos:])
            return terms, "".join(template)

        def spellcheck_terms(self, search_query):
            cursor = self.conn.cursor()
            base_spellfix = """
                SELECT :term{0} as term, Word FROM spellfix1data
                WHERE Word MATCH :term{0} and top=1
            """
            terms, template = self._terms_from_query(search_query)
            params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
            query = " UNION ".join(
                [base_spellfix.format(i + 1) for i in range(len(params))]
            )
            cursor.execute(query, params)
            correction_map = dict(cursor)
            return template.format(*(correction_map.get(t, t) for t in terms))

        def search(self, search_query):
            corrected_query = self.spellcheck_terms(search_query)
            cursor = self.conn.cursor()
            fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
            cursor.execute(fts_query, (corrected_query,))
            return {
                "terms": search_query,
                "corrected": corrected_query,
                "results": cursor.fetchall(),
            }

and an interactive demo using the class:

    >>> db = sqlite3.connect(":memory:")
    >>> fts = FTS4SpellfixSearch(db, './spellfix')
    >>> fts.create_schema()
    >>> fts.index_text("All the Carmichael numbers")  # your example
    >>> from pprint import pprint
    >>> pprint(fts.search('NUMMBER carmickaeel'))
    {'corrected': 'numbers carmichael',
     'results': [('All the Carmichael numbers',)],
     'terms': 'NUMMBER carmickaeel'}
    >>> fts.index_text(
    ...     "They are great",
    ...     "Here some other numbers",
    ... )
    >>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
    {'corrected': 'here some',
     'results': [('Here some other numbers',)],
     'terms': 'here some'}
    >>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax
    {'corrected': 'numbers NOT carmichael',
     'results': [('Here some other numbers',)],
     'terms': 'NUMMBER NOT carmickaeel'}

Basj · Answer

The accepted answer is good (full credit to its author); here is a slight variation that, although not as complete as the accepted one for complex cases, is helpful for grasping the idea:

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.enable_load_extension(True)
    db.load_extension('./spellfix')
    c = db.cursor()
    c.execute("CREATE VIRTUAL TABLE mytable2 USING fts4(description text)")
    c.execute("CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2)")
    c.execute("CREATE VIRTUAL TABLE mytable3 USING spellfix1")
    c.execute("INSERT INTO mytable2 VALUES ('All the Carmichael numbers')")  # populate the table
    c.execute("INSERT INTO mytable2 VALUES ('They are great')")
    c.execute("INSERT INTO mytable2 VALUES ('Here some other numbers')")
    c.execute("INSERT INTO mytable3(Word) SELECT term FROM mytable2_terms WHERE col='*'")

    def search(query):
        # Correcting each query term with spellfix table
        correctedquery = []
        for t in query.split():
            spellfix_query = "SELECT Word FROM mytable3 WHERE Word MATCH ? and top=1"
            c.execute(spellfix_query, (t,))
            r = c.fetchone()
            # correct the word if any match in the spellfix table; if no match,
            # keep the word spelled as it is (then the search will give no result!)
            correctedquery.append(r[0] if r is not None else t)
        correctedquery = ' '.join(correctedquery)

        # Now do the FTS
        fts_query = 'SELECT * FROM mytable2 WHERE description MATCH ?'
        c.execute(fts_query, (correctedquery,))
        return {'result': c.fetchall(), 'correctedquery': correctedquery, 'query': query}

    print(search('NUMBBERS carmickaeel'))
    print(search('some HERE'))
    print(search('some qsdhiuhsd'))

Here is the result:

    {'query': 'NUMBBERS carmickaeel', 'correctedquery': u'numbers carmichael', 'result': [(u'All the Carmichael numbers',)]}
    {'query': 'some HERE', 'correctedquery': u'some here', 'result': [(u'Here some other numbers',)]}
    {'query': 'some qsdhiuhsd', 'correctedquery': u'some qsdhiuhsd', 'result': []}

Remark: the "Correcting each query term with spellfix table" part is done with one SQL query per term. The performance of this versus a single UNION SQL query is studied here.
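For comparison, the single-statement alternative looks roughly like this. The sketch only builds the SQL text and its named parameters; actually running it requires the spellfix1 extension to be loaded and a spellfix1 table to exist. The helper name `build_union_query` is made up, and `mytable3` is assumed to be the spellfix1 table from the code above.

```python
def build_union_query(terms, table="mytable3"):
    """Return (sql, params) that looks up every term in one UNION statement.

    The table name is interpolated into the SQL text because identifiers
    cannot be bound as parameters, so it must come from trusted code.
    """
    base = (
        "SELECT :term{0} AS term, word FROM {1} "
        "WHERE word MATCH :term{0} AND top=1"
    )
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    sql = " UNION ".join(
        base.format(i, table) for i in range(1, len(terms) + 1))
    return sql, params


sql, params = build_union_query(["NUMMBER", "carmickaeel"])
print(sql)
print(params)
```

With the extension loaded, `dict(cursor.execute(sql, params))` would give a mapping from each original term to its correction, which is the shape the accepted answer relies on.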