Added: (Fri Feb 10 2006)
Pressbox (Press Release) -
Recent years have seen the rise of programs that search documents in various formats, data in DBMS and information systems, email messages and other data stored on the hard drive of a personal computer, in the local network of an enterprise, or in other sources of knowledge.
The need for such search systems stems from the ongoing growth in the volume of textual information available both to society as a whole and to each of its members. Until recently, search tools were aimed at the corporate sector (for home use, "direct search", that is, simply browsing through each file, was quite sufficient); now developers are working to meet the needs of the ordinary user as well. After all, the volume of information has soared. Nevertheless, the priority direction in the development of search technologies (apart from the Internet) remains the corporate sector.
The most important parameter of any search system is its speed. This applies both to indexing large amounts of data and to searching documents. Other highly important factors are the ability to work with various data sources, the list of supported file formats and additional functionality (support of morphology, synonyms, and various types of search). However, when it comes to a given set of required functions, the overwhelming majority of competing programs offer them all.
The problem of organizing data into a single database is partly solved with the help of DMS, CRM and special-purpose DBMS. However, the larger the enterprise and the more diverse its activity, the more complicated it is to process information from various sources. Documents on a disk, 1C, Oracle and various information systems: the list can go on forever. Archives of HTML pages, electronic correspondence and even ICQ logs have lately grown into a substantial "informational sector" that can easily be connected to the main data warehouses within any large company. An analysis of the diverse sources through which textual data is entered and stored points to two major "dataware" problems: the unstructured nature of information, and searching it. These problems are closely interrelated. Once you have a good system for searching information in various sources, you can dramatically systematize the results you obtain.
Where there is a problem, there is a solution: corporate search systems that work with various sources of knowledge, both on the user's computer and in the local network. Their main purpose is quick and accurate document search in large data volumes. Such special-purpose programs are the focus of this article. We will not digress into the search components built into various DMS, however splendid they may be. After all, there is no way to compare home cinemas with TV sets built into, say, refrigerators.
Modern search technologies rest on two fundamental processes: indexing the available information, and processing queries with subsequent display of the results. As regards indexing, any application (be it a desktop search system, a corporate information system or an Internet search engine) creates its own search area. In other words, it processes documents and creates an index of them (an organized structure that contains information on the processed data). The engine then uses this index to quickly produce a list of the documents that match a query. The rest of the process is not that simple in terms of technology, but is quite easy for an ordinary user to understand. The application processes the query (a key word or phrase) and displays the list of documents that contain it. Since the information is stored in a structured index, the query is processed much faster (tens or even hundreds of times) than with direct search: documents are selected by analyzing the textual information in the index rather than by browsing through every file.
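The two processes described above can be illustrated with a minimal sketch of an inverted index. This is a generic textbook structure, not the implementation used by any product in this review; the function names and the sample corpus are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def build_index(documents):
    """Indexing step: map every word to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Query step: return the documents that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        # Only the index is consulted; the files themselves are never reopened.
        result &= index.get(word, set())
    return result


# Hypothetical sample corpus, for illustration only.
documents = {
    "report.txt": "Quarterly sales report for the corporate sector",
    "memo.txt": "Memo on corporate search systems",
    "novel.txt": "A fiction extract about a long journey",
}
index = build_index(documents)
print(sorted(search(index, "corporate search")))
```

The index is built once, and each query then costs a few set intersections instead of a pass over every file, which is where the speed advantage over direct search comes from.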
The application lists the documents it finds in order of relevance, i.e., the conformity of the document to the query text. Different technologies naturally use different methods of searching and of determining the relevance of a document (the number of occurrences of the key word and the frequency with which it is mentioned in the document, the ratio of these parameters to the total number of words in the document, the distance between the words of the query phrase in the file, and so on). These parameters serve as the basis for determining the "weight" of the document, and this weight decides the position the file takes in the list of results. In the case of Internet search the matter is even more complicated, since many other factors have to be taken into account (for example, Google's PageRank). However, this is the subject of a separate article; therefore we will leave the Internet alone for the time being.
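As a rough illustration of how such a "weight" might be computed, the sketch below combines two of the parameters mentioned above: the share of query words among all words of the document, and the distance between the query words. The formula and the equal weighting of the two parts are assumptions made for this example only; none of the reviewed products documents its scoring this way.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"\w+", text.lower())


def smallest_window(words, terms):
    """Length in words of the shortest span containing every term, or None."""
    needed = set(terms)
    last_seen = {}
    best = None
    for pos, word in enumerate(words):
        if word in needed:
            last_seen[word] = pos
            if len(last_seen) == len(needed):
                span = pos - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def relevance(query, text):
    """Toy weight: term frequency plus a proximity bonus (assumed formula)."""
    words = tokenize(text)
    terms = set(tokenize(query))
    if not words or not terms:
        return 0.0
    # Occurrences of query words relative to the document length.
    frequency = sum(1 for w in words if w in terms) / len(words)
    # Proximity is 1.0 when the query words are adjacent and shrinks as they spread out.
    window = smallest_window(words, terms)
    proximity = len(terms) / window if window else 0.0
    return frequency + proximity


# Hypothetical documents, for illustration only.
docs = {
    "a.txt": "desktop search tools index files",
    "b.txt": "search for a file on the desktop",
    "c.txt": "cooking recipes",
}
ranked = sorted(docs, key=lambda name: relevance("desktop search", docs[name]), reverse=True)
print(ranked)
```

A document in which the query words stand side by side ranks above one where they are scattered, and a document without any of them gets a weight of zero.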
Participants and Disposition
This review sets out to find the fastest and smartest information search system currently available. Five software products were selected for the test: Google Desktop Search, Copernic Desktop Search, dtSearch 7.0, iSYS 7.0 and SearchInform 1.5.02. The test corpus comprised 20 gigabytes of textual information (documents in the doc, txt and html formats), including fiction extracts and various news articles from the Internet. The tests were run on a state-of-the-art office computer with processor AMD Barton 2.5 MHz, 1 gigabyte of random access memory, a 160 gigabyte Seagate IDE hard drive (7200 rpm) and the Windows XP operating system.
Developer: dtSearch Corp.
Official site: www.dtsearch.com
Distribution package size: 23.1 Mb
A product of dtSearch Corp., dtSearch Desktop with its built-in dtSearch Spider can index and search not only files on the user's computer, but also websites (to a preset depth) and local network resources. It can also use external indexes created on other computers. As was to be expected, dtSearch recognizes various character sets, including Cyrillic, as well as a number of file formats, such as .doc, .xls, .rtf, .pdf, .html and so on. The application can also search data in databases, both as a whole and by the contents of specific database fields.
In addition to conventional search in "natural language" or by means of formal queries, dtSearch offers several other types of search: morphological, fuzzy (allowing for possible errors and misprints), phonetic (matching similar-sounding words) and synonym search. These are the declared capabilities; in our tests no discrepancies with the declared functions were found.
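Fuzzy search of the kind mentioned above is commonly built on an edit distance between the query word and the words in the index. The sketch below uses the classic Levenshtein distance as an example; this is an assumption for illustration, since the article does not say which algorithm dtSearch actually uses, and the function names and word list are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, vocabulary, max_errors=1):
    """Return the indexed words within max_errors edits of the query word."""
    return [word for word in vocabulary if edit_distance(query, word) <= max_errors]


# Hypothetical index vocabulary, for illustration only.
vocabulary = ["document", "documents", "index", "search"]
print(fuzzy_matches("documnet", vocabulary, max_errors=2))
```

With two errors allowed, the misprint "documnet" still finds "document"; raising `max_errors` widens the net at the cost of more false matches.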
dtSearch Desktop 7.0 indexed the 20 gigabytes of test data in 6 hours 13 minutes, producing a 7.9 Gb index for subsequent searching.
As for document search itself, the application made no blunders, and the same proved true of morphological and fuzzy search. The system correctly found all required documents (though with a slight pause: after all, we are talking about 20 gigabytes), both for a simple one-word query and when a couple of paragraphs from a document were used as the key phrase. It should be noted, though, that when searching by a large text fragment (several dozen words) the system would "freeze" for a while before reporting the result.
The strengths of dtSearch Desktop 7.0
+ searching with account of morphology
+ searching with account of synonyms
+ fuzzy search
+ phonetic search
+ search within databases (via ODBC)
+ support of Outlook messages
+ support of various character sets
+ work in the local network
+ indexing Web pages at preset depth
The weaknesses of dtSearch Desktop 7.0
- inability to connect to various sources of information (besides DBMS and Outlook e-mail)
- low speed of searching by key phrase over 50 words
Developer: iSYS Search
Official site: www.isys-search.com
Distribution package size: 38.8 Mb
The iSYS company has been on the market for 16 years and has acquired over 10,000 customers. Since its foundation, the company has aimed its software at business users. The iSYS product range includes search programs for desktop computers, corporate networks and the Internet.
The corporate search system from iSYS is designed to provide fast and convenient search. Whether applied on a personal computer, the Internet or the corporate network of an enterprise, iSYS indexes data and searches documents by statements and key phrases, just as Internet search engines do.
iSYS supports several query methods (Command Line Query, Menu-Assisted Query, Natural Language Query) and uses a document relevance algorithm together with the linguistic features of the language, which make possible such functions as synonym search, fuzzy search (search with errors) and so on.
iSYS supports 125 file formats (including Microsoft Office documents, WordPerfect, email, PDF, XML, databases and so on) and 30 languages, including even Chinese, Japanese and Korean.
Indexing and processing the 20 gigabytes of information took iSYS 7.0 6 hours 13 minutes, a rather good time, and the resulting index was 7.9 Gb in size.
The slightly complicated way of searching, with its different query variants, may strike a newcomer as inconvenient at first (for lack of experience). On closer inspection, however, things become clear: the application refuses to search documents by a "long" query consisting of several words in the ordinary way, and this type of search is handled by additional features. Among the strengths of the application is its high-quality system of automatic document rubrication (categorization). As soon as indexing was complete, iSYS assigned all processed documents to the appropriate rubrics and presented them in a convenient form.
The strengths of iSYS Desktop 7.0
+ searching with account of synonyms
+ fuzzy search
+ support of various character sets
+ support of various query methods
+ heuristic analysis
+ support of various data sources (SQL, FTP, TRIM Context, WORLDOX 2002)
+ searching information in over 30 languages
+ a sophisticated system of automatic data rubrication
+ work in the local network
The weaknesses of iSYS Desktop 7.0
- absence of morphology support
- price
Google Desktop Search + GDE Enterprise
Official site: http://desktop.google.com/enterprise
Distribution package size with TweakGDS: 1.2 Mb
This free solution from Google is intended for searching information on a personal computer, on the Internet and in the corporate network of an enterprise.
Google Desktop Search Enterprise can index and search documents in dozens of the most widespread text formats, as well as electronic mail, audio and video file tags and images. Keep in mind that, to be able to tell the application which files and folders to index, you have to install an additional component, gdetweak. Without this add-on Google Desktop Search Enterprise will index all the information it can access on the user's computer and in the network of the enterprise. Google Desktop Search processed the 20 gigabytes of text in 8 hours 17 minutes. The size of the resulting index was 4.5 Gb. The search speed is quite satisfactory and on a par with that of other widely recognized market participants.
Unlike iSYS and dtSearch, Google Desktop Search Enterprise can rightly boast the most user-friendly interface. However, when it comes to administration and setting up operation in the local network, it falls noticeably behind its competitors. The trouble is that it is quite difficult to set up network operation the way a particular situation requires, because the system tries to do everything for you. The only way to fine-tune the application is to install additional components. This is a major disadvantage. As a desktop system, though, Google Desktop Search with the gdetweak component has no equal.
Corporate use, however, is still a long way off. The promised search for documents with similar content (known on the Internet as "similar pages") leaves much to be desired. Apparently for this very reason it is included in neither the "global" desktop version nor the network version.
The strengths of Google Desktop Search
+ searching with account of morphology
+ searching with account of synonyms
+ support of various character sets
+ a familiar Web interface
+ work in the local network (Enterprise version)
+ indexing electronic messages, audio and video files tags and images
+ free of charge
The weaknesses of Google Desktop Search
- the structure of addons*
*The point is that full-scale operation of the application requires downloading and installing a large number of additional modules. To tell the application which files and folders to index, you have to install an additional component, gdetweak. Without this add-on Google Desktop Search will index all the information it can access on the user's computer and in the network of the enterprise. The same goes for all other features of this search tool, for example, support of archives.
Copernic Desktop Search
Official site: www.copernic.com
Distribution package size: 2.56 Mb
Copernic Desktop Search can search various files, email messages (supporting Outlook Express 5.x/6.x, Outlook 2000/XP/2003, Windows Address Book), Word, Excel and PowerPoint documents, Acrobat PDF, music and video files, graphics, etc. Moreover, the search can be performed both on the local computer and on the Internet. Built-in tools for viewing various files let you see the search results. For example, if you select the thumbnail of an HTML document in the main window of the application, Copernic Desktop Search will display its contents. After installation, a small window appears in the Taskbar, where you can enter a search query and quickly adjust the search settings. The speed of the application deserves special mention, as does its low consumption of system resources.
Copernic Desktop Search indexed the 20 gigabytes of text in 10 hours 51 minutes. The size of the resulting index was 7 Gb.
The strengths of Copernic Desktop Search
+ searching with account of morphology
+ an exceptionally user-friendly interface
+ indexing electronic messages, audio and video files tags and images
+ processing Microsoft Outlook and Microsoft Outlook Express electronic messages
+ free of charge
The weaknesses of Copernic Desktop Search
- absence of a built-in document viewer
- absence of network support
Developer: SoftInform Ltd.
Official site: www.searchinform.com/site/ru
Distribution package size: 15 Mb
Last on the list, but far from last in efficiency, is the SearchInform tool from the SoftInform Company. SearchInform Desktop 1.5 indexed the 20 gigabytes of test data in record time: 3 hours 17 minutes. The resulting index was also the smallest of all, at 4.4 Gb.
The search tool from the SoftInform Company was developed on the basis of a patented technology for «similar contents documents search», SoftInform Search Technology. It incorporates all the tools necessary for structuring scattered information within an enterprise and is an efficient solution to the problems of searching and consolidating information.
The high indexing rate (up to 6 Gb/hour), the small size of the index (15-20% of the actual bulk of textual information), support of virtually all wide-spread text file formats (including .pdf and .html), as well as correct work with archives are delivered all in one package.
Add to this a minor but extremely useful feature of SearchInform, Smart Indexing, which tracks processor load in real time and adjusts the consumption of system resources during indexing, and SoftInform rightly takes the lead.
In addition, the indexing process (unlike in the other programs in this review) is very visual: it shows not only the speed, but also the number of processed documents and the number of unique words by which the search will be performed.
SearchInform Corporate proved to be the undisputed leader in search speed as well. The 20 gigabytes were a piece of cake for the application: it paused only after the first query, and the rest of the searches were completed in an instant. The relevance of the search results was irreproachable.
On top of that, SearchInform Corporate, built on the unique SoftInform Search Technology, offers a highly interesting feature: searching for documents whose content is similar to the query text. There is no need to select key words beforehand, since the search uses the whole document. The result is a list of the documents most similar to the query text fragment, each with its relevance ratio.
The strengths of SearchInform Desktop 1.5
+ searching with account of morphology
+ searching with account of synonyms
+ fuzzy search
+ Important words function for pinpointing the search
+ indexing Outlook and TheBat! electronic messages
+ search by attributes
+ rubricator
+ automatic rubrication of documents
+ support of various sources of information (DBMS, DMS, CRM, and so on)
+ network operation (the Corporate version) on the basis of NTFS inheritance of Windows authentication
+ the speed of searching and indexing
+ searching documents with a similar context*
The weaknesses of SearchInform Desktop 1.5
- problems with protected PDF-documents
*This technology is based on a mathematical model that analyzes document structure and selects similar words, phrases and text arrays. The result is a list of the documents most similar to the query text fragment, each with its relevance ratio. Unlike standard phrase search, SoftInform Search Technology removes the need to select key words beforehand, which reduces the duration of a "search session" to a minimum. At present this convenient and much-needed feature is offered by this system only.
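The general idea of searching by a text fragment rather than by key words can be illustrated with a standard technique: comparing word-count vectors by cosine similarity. This is only a generic stand-in, assumed for the example; SoftInform Search Technology is patented and its actual model is not described in this article. The function names and sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn a text into a bag-of-words vector (word -> number of occurrences)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors, from 0.0 to 1.0."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(fragment, documents):
    """Rank documents by similarity to the fragment, most similar first."""
    query = vectorize(fragment)
    scored = [(cosine_similarity(query, vectorize(text)), name)
              for name, text in documents.items()]
    return sorted(scored, reverse=True)


# Hypothetical documents and query fragment, for illustration only.
documents = {
    "contract.txt": "The supplier shall deliver the goods within thirty days",
    "news.txt": "The football team won the final match yesterday",
}
fragment = "goods must be delivered by the supplier within thirty days"
for score, name in most_similar(fragment, documents):
    print(f"{name}: {score:.2f}")
```

The whole fragment serves as the query, and each document receives a similarity score that plays the role of the relevance ratio mentioned above.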
Comparison of Indexing Speed
The 20 gigabytes of information were indexed on a computer with the following configuration: AMD Barton 2.5 MHz, 1 gigabyte of random access memory, a 160 gigabyte Seagate IDE hard drive (7200 rpm) and Windows XP with SP2.
Search system              Indexing duration       Index size
dtSearch 7.0               6 hours 3 minutes       8.6 Gb
iSYS Desktop 7.0           6 hours 13 minutes      7.9 Gb
Google Desktop Search      8 hours 17 minutes      4.5 Gb
Copernic Desktop Search    10 hours 51 minutes     7 Gb
SearchInform 1.5.02        3 hours 17 minutes      4.4 Gb
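The figures in this comparison can be converted into indexing throughput and index-to-corpus ratio with simple arithmetic. The sketch below copies the durations and index sizes from the comparison and assumes the corpus is exactly 20 Gb; the variable and function names are chosen for this example.

```python
# Figures copied from the comparison: (hours, minutes, index size in Gb).
results = {
    "dtSearch 7.0": (6, 3, 8.6),
    "iSYS Desktop 7.0": (6, 13, 7.9),
    "Google Desktop Search": (8, 17, 4.5),
    "Copernic Desktop Search": (10, 51, 7.0),
    "SearchInform 1.5.02": (3, 17, 4.4),
}
CORPUS_GB = 20  # size of the test corpus, assumed to be exactly 20 Gb


def summarize(hours, minutes, index_gb, corpus_gb=CORPUS_GB):
    """Return (indexing rate in Gb/hour, index size as a fraction of the corpus)."""
    duration_hours = hours + minutes / 60
    return corpus_gb / duration_hours, index_gb / corpus_gb


for name, (hours, minutes, index_gb) in results.items():
    rate, ratio = summarize(hours, minutes, index_gb)
    print(f"{name:<24} {rate:5.2f} Gb/hour   index = {ratio:.1%} of corpus")

fastest = max(results, key=lambda name: summarize(*results[name])[0])
print("Fastest indexer:", fastest)
```

Expressed this way, the results are easier to compare: SearchInform works out to roughly 6 Gb per hour with an index of about 22% of the corpus, while Copernic Desktop Search indexes at under 2 Gb per hour.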
Close scrutiny of the functionality and speed of the search systems brings us to a difficult decision. It turned out that the new solution from the Russian company SoftInform works much faster and more efficiently than its Western, "time-tested" counterparts. However…
The well-promoted and absolutely free GDS Enterprise can be fine-tuned and extended with additional features only by installing plug-ins; this is how support of archives is implemented. In addition, to enjoy all features of the system in full operation, the developers recommend acquiring Premium Support, which costs "next to nothing": $10,000 a year for every 1,000 users. Without well-paid experts, deploying a full-fledged working enterprise system on Google is not quite impossible, but extremely difficult. Therefore, in view of the application's rather satisfactory speed and its user-friendly IE-like interface, we do it justice by calling it a great "desktop" search tool, and give Google its due for attempting to put into practice the dream of Bill Gates, namely to come into every home. It is excellent branding, isn't it?
The tests revealed the main rivals, if we may call them so: the already established products dtSearch and iSYS, and the new solution SearchInform developed by the Russian company SoftInform. These systems offer the ability to connect to third-party sources of knowledge, such as databases, together with high indexing and search speed and advanced search features.
In addition to the highest speed of indexing and searching documents, and the unique feature of searching for documents with similar content, SearchInform Corporate can act as a system that consolidates information across the whole enterprise. It can process not only documents on a computer disk or in the network of an enterprise, but also other data sources, such as CRM, DMS, DBMS based on MS SQL and so on. Of the applications reviewed, SearchInform Corporate is the only one that can solve both of the most pressing problems of enterprise dataware: searching documents and consolidating knowledge into a single, convenient system.