Build Robust Topic Modelling Capabilities In Your Python GUI App With Powerful Gensim Library

February 2, 2021 by Anbarasan

Do you want to train large-scale semantic NLP models in your Delphi GUI app? This post shows how to use the Gensim Python library via Python4Delphi in a Delphi/C++ application, and covers the core concepts of Gensim: a superfast, proven, data-streaming, platform-independent library that also offers pretrained models for specific domains such as legal or health.

Python for Delphi (P4D) is a set of free components that wrap the Python DLL into Delphi. They let you easily execute Python scripts and create new Python modules and types. You can use Python4Delphi in a number of different ways, such as:
  • Create a Windows GUI around your existing Python app.
  • Add Python scripting to your Delphi Windows apps.
  • Add parallel processing to your Python apps through Delphi threads.
  • Enhance your speed-sensitive Python apps with functions written in Delphi.
Prerequisites:
  • If Python and Python4Delphi are not installed on your machine, check the Python4Delphi installation guide first.
  • Open the Windows command prompt and type pip install -U gensim to install Gensim. For more information on installing Python modules, see the Python packaging documentation.
  • First, run the Demo1 project to execute a Python script in Python for Delphi. Then load the Gensim sample script into the Memo1 field and press the Execute Script button to see the result. On clicking the Execute button, the script strings are executed using the code below. Go to GitHub to download the Demo1 source.
Code:
procedure TForm1.Button1Click(Sender: TObject);
begin
 PythonEngine1.ExecStrings( Memo1.Lines );
end;

Gensim Core Concepts:
  1. Document: an object of Python's text sequence type (commonly known as str in Python 3). A document could be anything from a short 140-character tweet to a single paragraph (e.g., a journal article abstract), a news article, or a book.
  2. Corpus: a collection of documents. It serves two purposes:
    1. Input for training a model. During training, the model uses this training corpus to look for common themes and topics, initializing its internal model parameters.
    2. Documents to organize. After training, a topic model can be used to extract topics from new documents (documents not seen in the training corpus). Such corpora can be indexed for similarity queries, queried by semantic similarity, clustered, etc.
  3. Vector: a mathematically convenient representation of a document.
  4. Model: an algorithm for transforming vectors from one representation to another.
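The "Vector" concept can be sketched in plain Python: the bag-of-words representation that Gensim's doc2bow produces is just a sparse list of (token id, count) pairs. This is an illustrative sketch; the toy document and vocabulary here are made up, not part of the Gensim API.

```python
# A minimal pure-Python sketch of the bag-of-words "Vector" idea
# (what gensim's Dictionary.doc2bow computes for a tokenized document).
document = "human system system interface"
vocabulary = {"human": 0, "interface": 1, "system": 2}  # token -> integer id

counts = {}
for token in document.split():
    token_id = vocabulary[token]
    counts[token_id] = counts.get(token_id, 0) + 1

# gensim represents the document sparsely as sorted (token_id, count) pairs
bow = sorted(counts.items())
print(bow)  # [(0, 1), (1, 1), (2, 2)]
```

Note that only tokens present in the document get an entry: that sparsity is what lets Gensim stream very large corpora.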
Gensim Python library sample script details: the sample script helps to understand how the core concepts are implemented for a simple corpus.
  • The corpus consists of 9 documents, where each document is a string.
  • Create a set of frequent words (a stoplist) to ignore, then lowercase each document and split it on whitespace.
  • Count word frequencies and keep only the words that occur more than once.
  • Build a Dictionary from corpora over the processed corpus, and print each token and its id via token2id.
  • Create the bag-of-words representation for a new document using doc2bow, and convert the entire original corpus to a list of vectors.
  • Train a tf-idf model, which transforms vectors from the bag-of-words representation to a vector space weighted by word rarity.
  • Transform the "system minors" string through doc2bow and the tf-idf model.
  • Transform the whole corpus via TfIdf and index it, in preparation for similarity queries.
Python:
import pprint
from gensim import corpora
from gensim import models
from gensim import similarities
 
# Corpus - It consists of 9 documents, where each document is a string consisting of a single sentence.
text_corpus = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
 
# Create a set of frequent words
stoplist = set('for a of the and to in'.split(' '))
# Lowercase each document, split it by white space and filter out stopwords
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in text_corpus]
# Count word frequencies
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
# Only keep words that appear more than once
processed_corpus = [[token for token in text if frequency[token] > 1] for text in texts]
pprint.pprint(processed_corpus)
 
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
 
pprint.pprint(dictionary.token2id)
 
# create the bag-of-word representation for a document using the doc2bow
 
new_doc = "Human computer interaction"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec)
 
# convert our entire original corpus to a list of vectors:
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(bow_corpus)
 
 
# Model `tf-idf - transforms vectors from the bag-of-words representation to a vector space
# where the frequency counts are weighted according to the relative rarity of
# each word in the corpus. Here's a simple example. Let's initialize the tf-idf model, training it on
# our corpus and transforming the string "system minors":
 
# train the model
tfidf = models.TfidfModel(bow_corpus)
 
# transform the "system minors" string
words = "system minors".lower().split()
print(tfidf[dictionary.doc2bow(words)])
 
# to transform the whole corpus via TfIdf and index it, in preparation for similarity queries:
 
index = similarities.SparseMatrixSimilarity(tfidf[bow_corpus], num_features=12)
 
# and to query the similarity of our query document ``query_document`` against every document in the corpus:
query_document = 'system engineering'.split()
query_bow = dictionary.doc2bow(query_document)
sims = index[tfidf[query_bow]]
print(list(enumerate(sims)))
 
# Document 3 has a similarity score of 0.718=72%, document 2 has a similarity score of 42% etc.
# We can make this slightly more readable by sorting:
 
for document_number, score in sorted(enumerate(sims), key=lambda x: x[1], reverse=True):
    print(document_number, score)
Gensim Demo
Note: The samples used for demonstration were picked from the official Gensim documentation, with the only difference being that the outputs are printed. You can check the APIs and some more samples in the same place.

You have now read a quick overview of the Gensim library. Download it, and perform NLP tasks quickly with the help of models such as word2vec, Latent Dirichlet Allocation, FastText, etc. Check out Python4Delphi to easily build Python GUIs for Windows using Delphi.