SUMMARY


In this thesis, I examine semantic processing of unstructured text from three
perspectives. In the first part, I describe several state-of-the-art approaches
to modeling semantics, which are based on the idea of building a semantic model
from a (training) collection of unstructured text. Through the trained model,
it is then possible to express arbitrary documents (ones from the training
corpus as well as unseen ones) in a new, semantic representation. In this
representation, documents may be evaluated as closely related despite not
sharing any common words, a strict departure from the more traditional
“keyword” search systems. The advantage of using semantic models therefore
lies in assessing similarity at a higher, “topical” level.
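As a minimal sketch of this effect, consider Latent Semantic Analysis via a
truncated SVD of a toy term-document matrix (the five-word vocabulary and
three tiny documents below are invented purely for illustration). Two
documents that share no words at all can still come out as nearly identical
in the latent space, because their words co-occur with the same context word:

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Vocabulary: car, auto, engine, tree, leaf -- a made-up corpus.
A = np.array([
    [1, 0, 0],   # "car"    appears in d0
    [0, 1, 0],   # "auto"   appears in d1
    [1, 1, 0],   # "engine" appears in d0 and d1
    [0, 0, 1],   # "tree"   appears in d2
    [0, 0, 1],   # "leaf"   appears in d2
], dtype=float)

# Truncated SVD: keep only the k strongest latent "topics".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]

def to_lsa(doc_bow):
    """Fold a bag-of-words vector into the k-dimensional LSA space."""
    return Uk.T @ doc_bow

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

qa = np.array([1, 0, 0, 0, 0], float)  # a document containing only "car"
qb = np.array([0, 1, 0, 0, 0], float)  # a document containing only "auto"

print(cosine(qa, qb))                  # 0.0 -- no shared words at all
print(cosine(to_lsa(qa), to_lsa(qb)))  # ~1.0 -- same latent topic
```

The bridge is the shared context word “engine”: both “car” and “auto” load
onto the same latent dimension, so the two one-word documents end up almost
collinear in the semantic space even though their keyword overlap is zero.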


A common theme among all the algorithms is their focus on robustness to noise
in the input data and on computational tractability. The second part of the
thesis deals with my own contributions to the field: new training algorithms
for Latent Semantic Analysis and Latent Dirichlet Allocation. Their novelty
lies in their focus on scalability: the algorithms run “online”, with memory
constant in the number of training documents, so that they can process
arbitrarily large input, and with computation distributed across a cluster of
computers. Also, the input document stream does not require random access.
This last point is especially relevant in environments where data is retrieved
from slow media (compressed on disk, on tape, or accessed over the web). In
the case of LSA, the input stream does not even need to be repeatable: the
algorithm accesses each training document only once, in sequential order
(a streamed, single-pass algorithm).
This additional constraint allows processing infinite input streams: the model
is updated online, and training observations may be discarded immediately.
This online, incremental, streamed, distributed algorithm exhibits
unparalleled performance and makes LSA processing feasible on web-scale
datasets.

In the last part, I consider the applicability of these general-purpose
semantic algorithms to the domain of Information Retrieval. Useful as they
are, the algorithms are nevertheless only a first step towards a successful IR
system. Issues of document heterogeneity often hurt performance; I present
two novel algorithms for increasing topical consistency among documents: a) by
splitting documents into smaller, topically consistent blocks, and b) by
splitting multilingual documents into blocks of the same language.
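The shape of idea (b) can be sketched with a deliberately naive heuristic.
The thesis does not prescribe this particular method: a real system would use
a proper language identifier (e.g. character n-gram models), and the tiny
stopword lists below are invented stand-ins for demonstration only.

```python
# Hypothetical sketch: split a multilingual document into monolingual
# blocks by guessing each sentence's language from stopword overlap,
# then grouping consecutive sentences with the same guess.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "in", "to"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
}

def guess_language(sentence):
    """Pick the language whose stopword list overlaps the sentence most."""
    tokens = set(sentence.lower().split())
    return max(STOPWORDS, key=lambda lang: len(tokens & STOPWORDS[lang]))

def split_by_language(sentences):
    """Group consecutive sentences sharing the same guessed language."""
    blocks = []
    for sent in sentences:
        lang = guess_language(sent)
        if blocks and blocks[-1][0] == lang:
            blocks[-1][1].append(sent)   # extend the current block
        else:
            blocks.append((lang, [sent]))  # start a new block
    return blocks

doc = [
    "the cat is in the garden",
    "the dog is in the house",
    "der Hund ist nicht das Problem",
]
print(split_by_language(doc))
# [('en', [...2 sentences...]), ('de', [...1 sentence...])]
```

Each resulting block is topically and linguistically homogeneous, which is
exactly the property the semantic models of the earlier parts benefit from.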




All these algorithms represent steps in the direction of more automated and
intelligent access to the vast digital repositories of today. It has been said
that “the devil is in the details”, and this is certainly true of IR systems.
Despite the theoretical advances and complex data mining methods, expert
knowledge in tuning a system is still invaluable. It is my belief that a
relatively simple system, properly tuned by someone with a deep understanding
of the problem at hand, will most of the time outperform a complex,
out-of-the-box system, even if the latter utilizes state-of-the-art
techniques. In a way, this is a testament and a tribute to the ingenuity of
the human mind. On the other hand, the amount of raw digital data keeps
increasing while the number of human experts stays roughly the same, so
despite their imperfections, the automated methods considered in this thesis
may serve very well in their limited goal of assisting humans during data
exploration.


This idea of bringing semantic processing to non-experts (non-linguists and
non-computer-scientists) was also at the core of the design philosophy of the
software package for topical modeling that accompanies this thesis. It is my
hope (already partially fulfilled) that such software will lead to a wider
adoption of unsupervised semantic methods outside of the academic community.