Please note that haupia was renamed to SmartSearch with version 2.4.0. While we have updated our documentation accordingly, you may still encounter "haupia" in older code examples, APIs, scripts, and configuration files. In such cases, please consider SmartSearch as the current and correct term. We recommend making the appropriate substitutions when implementing solutions from older documentation.
1. Introduction
Searching a website is especially important as more and more information is available online these days and websites become larger and more complex. An effective search function allows users to quickly and easily find the information they need without having to click through countless pages and links. It is also an efficient way to save time and improve the user experience, as it allows users to go directly to the relevant content without having to deal with unnecessary information. In addition, a good search function can also help a website be perceived as user-friendly and professional, which in turn increases user confidence and can help them stay longer on the website and come back more often. Overall, effective search on a website is an important factor in a positive user experience and the success of a website.
The search function on the website can also be used as a marketing tool to provide visitors with personalized offers and recommendations. For example, if a visitor is searching for a specific product, the company can recommend similar or complementary products or make special offers to increase their willingness to buy. In addition, certain hits can be better placed in search results to draw the attention of website visitors to specific content.
Search gives the company valuable insights into the needs and interests of visitors. By analyzing search terms and results, companies can identify trends and patterns and adapt their website content accordingly to better meet the needs of the target group. In this way, the website can also serve as a market research tool.
Overall, a well-designed search function on the website can not only help improve user experience and increase conversion rates, but also provide valuable insights into the needs of the target audience and be used as a marketing tool for customer retention and acquisition.
The SmartSearch bundles requirements that customers place on the search function of an online presence: An intuitive, high-performance search solution that can be used on extensive websites and delivers relevant results. It offers both high result quality and optimum search convenience, thus retaining customers on the website.
At the same time, it provides editors with a web interface through the integrated SmartSearch cockpit that can be used without IT knowledge. Editors from specialist and marketing departments are thus enabled to control and monitor search results on the web interface. For this purpose, the cockpit provides statistics, filters and analysis functions and allows the indexing of various data types (for example XML, audio, video, media) from different data sources. With the help of individualized result lists, editors can prioritize and weight search results in the back end and display selected content for predefined search queries.
1.1. Concept
The SmartSearch functionality, e.g. settings for user and group permissions, can be managed in the browser-based SmartSearch cockpit without any prior technical knowledge.
1.1.1. Data generator
In the SmartSearch cockpit, you can create data generators to capture searchable data.
There are three types of data generators:
- Web: The web crawler makes an existing website searchable.
- XML: The XML file crawler speeds up the capture of data that already exists in the form of XML files.
- API: Allows data to be passed in via an API, for example from a self-developed application or the FirstSpirit SmartSearch Connect module.
1.1.2. Prepared Search
Using a Prepared Search, you can configure the search result, for example by bundling content from multiple data generators into one Prepared Search.
1.1.3. Adaptable Result
In an Adaptable Result, you can actively modify the search results, for example to display certain search results higher in the list, or to exclude search results from the list.
1.1.4. Synonyms and stopwords
You can specify a synonym to associate a term with the search that does not appear in the website text but is meaningfully related to another search term. Synonyms can be derived from common search queries, for example.
You can use stopwords to ignore certain terms in the search query. A list of typical stopwords for each language is provided and can be extended.
1.2. Architecture
The functionalities of the SmartSearch are realized by an architecture made up of a range of different components (see figure Architecture).
These components are:
- ZooKeeper
- Solr
- SmartSearch
The individual components always interact according to the following schema:
- Prior to creating the search index, the SmartSearch must collect the required data. For this, it accesses the information to be collected on the customer side, which can exist in the form of websites, portals or databases. In addition, a REST interface provides the option of filling the search index with further data from outside.
- After that, the SmartSearch server normalizes the data and transfers it to the Solr server. The Solr server receives the data and persists it in an index.
- The query for data works the same way in reverse: The SmartSearch server receives the request, modifies it, and forwards it to the Solr server. The Solr server responds with a search result, which the SmartSearch server returns to the customer's end application via the REST interface.
- The SmartSearch cockpit is to be seen as detached from the other components. It serves the administration of the SmartSearch server and therefore offers a simple, web-based administration interface. Among other things, search solutions can be created and configured in this interface.
- Configurations made in the SmartSearch cockpit are saved on the ZooKeeper server together with the Solr configuration data.
Communication with the outside is secured via HTTPS; between the components it takes place via HTTP.
2. SmartSearch cockpit
The SmartSearch cockpit is a component of the SmartSearch. It enables the backend-side administration of the data collected by the SmartSearch and offers a simple, web-based interface for this purpose. This is divided into the areas Configuration, Analysis, Data and System, which can be reached via the menu. The button with the globe icon also provides a language switcher for German and English.
By default, the SmartSearch cockpit is accessible via the following URL:
The first start of the cockpit must be done with the master admin. It is created automatically with the data from the application.yml at the initial start of the SmartSearch server.
If the user and group management is implemented via an LDAP server, the credentials may differ.
After valid login, the user is automatically redirected to the dashboard of the cockpit. Re-authentication is only required after an explicit logout or after the session has expired.
2.1. Configuration
The Configuration area is divided into the submenus Prepared Search, Stopwords and Synonyms. These allow the configuration of the output of the data collected by the SmartSearch.
The following subsections describe the submenus and the functions provided by them.
2.1.1. Prepared Search
The customer-side gathering of the required data is done by the so-called data generators, which are part of the Data area. For their management, the SmartSearch provides the Prepared Searches. These allow optimizing the search results by prioritizing individual data.
The creation and administration of the Prepared Searches is done in the interface of the same name, which can be called up via the corresponding menu entry.
The area shows a list of all already existing Prepared Searches and is initially empty.
In cloud mode, the list also displays the accessibility of each Prepared Search.
New Prepared Search
For the creation of a new Prepared Search there is a separate view, which can be called by clicking on the button New Prepared Search and is divided into the three tabs General, Facets and Preview.
- General
The first thing to do within the General tab is to specify a name for the new Prepared Search. In cloud mode, the additional checkbox publicly accessible is located next to the input field for the name. With it, the accessibility of a Prepared Search can be defined. Activating the checkbox enables the Prepared Search to be queried via the internet (API gateway). Otherwise, the Prepared Search is only accessible from within the cockpit.
Selecting any number of data generators in the selection list of the same name then shows their available fields. The initially activated checkbox Verbose shows or hides all technical fields. The button provided together with the checkbox enables the selected fields to be emphasized.
The list of field names per data generator is cached. When creating a new data generator and running it for the first time, it may take several minutes for the field names to appear in the list.
The selected fields can be transferred via button to the list of fields relevant for a search, which by default contains the fields content, link and title. A previously defined emphasis is automatically assigned to each of these fields. The list provides the following configuration options per field:
- Highlight: By activating this checkbox, a search word is highlighted within a text segment in the search result. The length of the text segment is freely configurable (see below).
- Output: This option defines whether the field is visible in the search result. For example, in the case of the link that only refers to the entire document, this may not be desired.
- Search: In order for the associated field to be taken into account by the search, this checkbox must be activated. Deactivating the Search option hides the subsequent Partial match and Emphasis options for the corresponding field.
- Partial match: This option enables partial matches to be taken into account. If the checkbox is selected, the search for electricity, for example, also finds the match eco-electricity provider. The search word must have a length of three to twenty characters.
- Emphasis: The emphasis offers the possibility to set a prioritization for matches of the selected fields and thus to influence the search result.
The button with the trash can icon available for each field allows deleting the corresponding field from the list.
The tab additionally contains the following general configuration options:
- Hits per page: According to its name, this setting specifies the maximum number of hits displayed on a search results page. In combination with the URL parameter page it is also possible to split the search results into multiple pages.
- Length of highlight (in characters): Using this setting, the length of the text segment in which a search term is visually highlighted can be defined, as mentioned before.
- Sort by: By default, the search results are sorted in descending order of their score. If a different sorting is desired, this text input field allows a corresponding adjustment. For this, any field is to be specified and supplemented by the expression ASC for an ascending or DESC for a descending sorting.
- Spellcheck (hits less than): If the number of search results is smaller than the value configured at this point, a spelling check is performed and the search is repeated for search words of similar spelling.
- Must Match: For searches with multiple terms, the entry in this text input field determines how they are to be linked:
  - The value 0 corresponds to an OR combination of the search terms used.
  - The value 100% corresponds to an AND combination of the search terms used.
  - An absolute value defines the number of terms that must be contained within a search hit. For example, the value 2 for five given search terms means that two of the five terms must be contained within a search result. Furthermore, the values 2 and 50% are equivalent to each other for four search terms.
- Groovy Script: In addition, Prepared Searches allow the inclusion of a self-implemented Groovy script. Such a script enables additional modifications of the dataset. For example, additional documents can be added to the dataset or the existing dataset can be edited.
- Facets
Facets provide the possibility to restrict result lists according to fields that are included in a document. Since facets always refer to the data generators selected in the General tab, the Facets tab is initially empty.
Figure 5. Facets
New facet
For the creation of a new facet there is a separate view, which can be called by clicking on the button New Facet. This is only active if at least one data generator is defined in the General tab.
Within this view, a field must first be selected in the dropdown of the same name. The available fields are taken from the selected data generators. With the selection of a field, a list of the values associated with this field appears, for which the following configuration options are available:
- Filter: This input field can be used to search for facet values.
- Show weighted values only: By selecting this checkbox, only facet values that have been weighted will be displayed.
- Weight: By clicking on the Weight field, a weighting can be assigned to a facet value. This number represents a multiplier of the score and can be between 0.00 and 9.99. A value between 0.00 and 2.00 is recommended. Results that belong to this facet value receive a weighting and are ranked accordingly higher or lower in the search results.
- Display values on number of matches greater than: This option can be used to exclude values from the facet list whose number of matches is less than the specified threshold.
- Multiple selection possible: This option allows filtering the search results list according to different aspects. For example, the search for a specific object can subsequently be restricted to a specific size as well as to a specific color. This makes the search more specific and minimizes the search results list.
- Exclude own filter: In order to provide different filter options for the search, the selection of a filter may only refer to the search result list, but not to the filter options. Otherwise, the other options would also be hidden by the search. For example, if the filter options German, English and French exist for the facet Language, the search will only return English documents when the option English is selected. If the filter English is not excluded in this case, the list of available filters will also only show English. In this case it is no longer possible to switch to another filter or to make a multiple selection in connection with the previous configuration option.
- Preview
The Preview tab provides the option to test a Prepared Search configuration on a preview page. For this purpose it is not necessary to save the Prepared Search. The settings from the other tabs are directly applied to the search queries. On the preview page there is an input field for search terms. The entered term is then looked up in the current Prepared Search and the results are displayed. Next to each search result there is a button marked with an arrow pointing upwards. Clicking this button displays all of the fields that are selected in this Prepared Search below the result.
In order to transfer a facet to the preview, it is sufficient to confirm the configuration by clicking OK. After switching back to the preview tab, the last search term is used again automatically and new filter options appear in the column next to the search results. These receive their names and values from the previously created facet. The preview feature can also be used to test weightings or Groovy scripts.
Previously created Adaptable Results will appear in the specified order in the preview.
List of existing Prepared Searches
After the Prepared Search is saved and successfully created, it is added to the list of existing Prepared Searches in the tab of the same name. The list is divided into two columns with the following information:
- Name
The column contains the name of the Prepared Search specified during generation.
- Last edit
The information in this column shows the time and the editor of the last change to the Prepared Search.
Editing a Prepared Search
The button with the symbol of a pen enables editing a Prepared Search. Clicking it opens the detailed view with all the configurations made. These can be changed according to the previously described possibilities.
Deleting a Prepared Search
The button with the symbol of a trash can enables the deletion of a Prepared Search. Clicking it opens a dialog with a confirmation request to confirm the deletion.
Alternatively, it is possible to select several Prepared Searches for deletion via the checkboxes that are visible in the overview list in front of each Prepared Search. The button Delete selected removes these selected Prepared Searches after a confirmation request has been confirmed.
2.1.2. Stopwords
Every written text contains articles, filler words and other terms which are irrelevant for a search query and therefore should be ignored. For this reason, the SmartSearch has a list of so-called stopwords for each of the 26 supported languages, which it provides in the menu area in the form of a drop-down menu. The terms contained in these lists are ignored by the data generators when the data is collected and are not included in the search index.
To edit a stopword list, first select it from the drop-down menu. The interface will then display all stopwords contained in it.
The field Add new stopword allows the comma-separated specification of further stopwords, which are transferred to the list by clicking on the button with the plus symbol. The buttons with the symbol of a pen or a trash can, which are visible per stopword, allow editing or deleting a single entry of the list.
Each change to the selected list finally requires being saved via the button with the symbol of a checkmark, which is displayed next to the drop-down menu. Only with this click will the adaptations be applied, even if they are already visible in the list.
In order for the changes to the stopword lists to take effect, a re-run of the data generators is necessary.
2.1.3. Synonyms and Substitutions
Similar to stopwords, which can be used to filter out irrelevant terms from the captured dataset, synonyms and substitutions allow for the addition of specific terms. In the menu area there is the option to select a language in a drop-down menu. Below this there are two tabs for entering substitutions and synonyms. After the language has been selected, a list of substitutions or synonyms for the selected language appears in the selected tab. This list is initially empty and can be expanded using the corresponding input field.
Each entered term will be converted to lower case when stored.
Substitutions
This is a list of substitutions. Each entry maps a search term to a series of words. The 'Add new search term' field allows you to enter a search term. After confirming with 'Enter' or clicking on the button with the plus symbol, the word will appear together with an empty row of substitutions in the list below. Via the field 'Add new substitution' the corresponding substitutions can be assigned to the search term.
It is also possible to enter several comma-separated terms in the 'Add new search term' field. In this case, the first entered term will become the search term and the remaining words will become its substitutions. The same applies to the
After saving, a search for the entered term will be replaced by the terms in the corresponding row.
Furthermore, the buttons with the symbol of a pen or a trash can, which are visible per search word, allow editing or deleting a single entry of the list.
Synonyms
This is a list of synonym series. Entries of a synonym series are treated as synonymous by the search. This means that a search for one term from a series will return results for all terms from the series. The field 'Add new synonyms' allows for the creation of a new series. Here it is also possible to enter several comma-separated terms. In this case, they will appear as individual terms in a new synonym series in the list below.
Substitutions that work like synonyms are recognized by SmartSearch and will appear in the synonyms tab after saving.
Any change made to the selected list finally requires saving it using the button labeled "Save" displayed next to the drop-down menu. Only with this click will the adjustments be applied, even if they are already visible in the list beforehand.
2.2. Analysis
The Analysis area is divided into the submenus Statistics and Adaptable Results. They allow both the analysis of the performed search queries and the manual control of the hit sequence for individual search terms.
The following subchapters describe the submenus and the functions provided by them.
2.2.1. Statistics
The Statistics area allows the analysis of the search queries of a specific Prepared Search. The field of the same name lists all existing Prepared Searches for this purpose, from which the Prepared Search to be analyzed is to be selected by a checkbox.
Every fifteen minutes, there is an automatic update of the data to be analyzed, so that the analysis is always based on the most up-to-date data.
The following fields are used to specify the data to be analyzed:
- The drop-down menu Number of results indicates whether the 10, 25, 50 or 100 most searched keywords should be output for the analysis.
- The From and To fields define the time period to be considered.
- The Tags field allows the statistic to be customized by specific filter criteria. They are used to categorize search queries and can be defined during their execution. By default, tags are entered in the basic view, where they can be selected by simply clicking on them. All selected tags are connected by an And or Or link. In addition, the Advanced button provides a switch to a more complex input of the tags.
Examples of the usage of tags can be found in the tags section.
Switching from the basic to the advanced view is possible at any time. Switching in the other direction is only possible if the entered filtering can be mapped in the basic view.
The Adopt button performs the determination of the most frequent search words that result from the configuration made for the selected Prepared Search. They are then displayed as a list in the interface.
The Download CSV button provides both the defined configuration data and the search words determined on their basis in the form of a CSV file. In addition to this information, it contains a deep link and the absolute sum of all search queries - with and without filtering. The deep link enables the file to be downloaded again.
The CSV file which was generated from the statistics above contains the deep link (csv deep link) and the following statistic parameters:
- search terms (string)
- search frequency (count)
- selected time period (start, end)
- names of the selected Prepared Searches (here Mithras)
- query totals with (total) and without (total without filter) filter.
The deep link contained in the CSV is also provided by the Copy to clipboard > Deep Link CSV button. The Copy to clipboard > Deep Link Cockpit button provides a link that can be used to call up the Statistics area again, including the configuration made. This way, both the settings and the result based on them can be made available to third parties.
2.2.2. Adaptable Results
The SmartSearch stores the contents it determines in a multitude of different fields. These can be assigned a weight during the configuration of a Prepared Search to prioritize them differently. This gives them a so-called score, which is calculated automatically and determines the order of a search result list. The manual control of these search result lists, the so-called Adaptable Results, is done in the area of the same name.
It is also possible to create Adaptable Results for search terms that do not yield any results.
The area displays a list of all existing Adaptable Results, each of which matches the search query for a keyword of a Prepared Search.
Create an Adaptable Result
For the creation of a new Adaptable Result there is a separate view, which can be called by clicking on the button New Adaptable Result.
Within the view, a Prepared Search is to be selected and any search term is to be entered for which the order of the search results should be changed. A click on the button Adopt deactivates the two fields and, on the basis of the entries, determines the corresponding result list, which is displayed in the form of a table. The Reset button hides the table and reactivates the two fields.
In addition to the title, the link and the evaluated tokens of the displayed document, each entry in the table shows the score assigned to it, from which its position in the list results. This can be influenced by the buttons Elevate and Exclude which are also displayed for each entry:
- Elevate
The Elevate button highlights the search result in green and moves it to the first position in the results list. The arrow buttons that appear instead of the Elevate button can be used to specify the position of the entry even more precisely. It should be noted, however, that the positioning via the arrow buttons refers only to the prioritized entries. It is not possible to mix prioritized and non-prioritized entries. The button with the symbol of a minus sign can be used to cancel the prioritization.
- Exclude/Include
The Exclude button highlights the search result in yellow and removes it from the results list. This means that it will no longer be displayed in the frontend. The Include button that appears instead adds the search result back to the list.
Both elevations and exclusions are based on links that are available in the index at a certain point in time. Changing the search results list can therefore lead to situations where the links are no longer available and thus no longer detectable by a data generator. Due to their missing correspondence in the index, these links can no longer be displayed in the list view. They are so-called orphaned elevations or exclusions. In case of such entries, a warning appears in the detail view of an Adaptable Result. This lists the orphaned links and their document ids. Activating the checkmark under the warning triggers a cleanup when saving.
The Add page manually field allows manual addition of further documents by specifying their URL. These must already be included in the index of the data generators assigned to the selected Prepared Search. The corresponding URL must be entered in the form http[s]://www.newdocument.com and confirmed with the plus button. If the document is included in the index, it will be added to the results list. Otherwise, a corresponding note will appear.
Each change to the search results list must be persisted using the Save button. Only with this click are the adjustments applied, even if they are already visible in the list before.
List of existing Adaptable Results
After the Adaptable Result is saved and successfully created, it is added to the list of existing Adaptable Results in the area of the same name. The list is divided into three columns with the following information:
- Name
The search word used during the creation of the Adaptable Result also acts as its name and is shown in this column. If not only one search word but a list of search words is passed, these are sorted alphabetically.
- Prepared Search
The column contains the Prepared Search to which the search query refers. It is selected during the creation of the Adaptable Result.
- Last edit
The information in this column shows both the time and the editor of the last change to the Adaptable Result.
Edit an Adaptable Result
The button with the symbol of a pen allows editing an Adaptable Result. Clicking on it opens the detailed view with the entries made. These can be changed according to the previously described possibilities.
Delete an Adaptable Result
The button with the symbol of a trash can allows deleting an Adaptable Result. A click on this button opens a dialog with a confirmation request to confirm the deletion.
Alternatively, it is possible to select multiple Adaptable Results for deletion using the checkboxes visible in front of each Adaptable Result in the overview list. The Delete selected button removes these selected Adaptable Results after a confirmation request has also been confirmed beforehand.
2.3. Data
Prior to creating a search index, the SmartSearch accesses the information to be collected on the customer side. This can exist in the form of websites, portals or databases. Their gathering is done via so-called data generators, which provide different configuration options. The data persisted on the Solr server can then be viewed via the Browser, which is also provided by the SmartSearch cockpit. In addition to the contents, this also displays all metadata.
The following subsections describe both the data generators and the Browser in detail.
2.3.1. Data Generators
The SmartSearch provides so-called data generators for the customer-side collection of the required data. Their creation and administration takes place in the interface of the same name, which can be called up via the corresponding menu entry.
The area shows the list of all already existing data generators and is initially empty. When creating further data generators, three types have to be distinguished:
- Web
- XML
- External
In addition to these data generator types relevant for manual creation, there are others which cannot be created manually. An example of this is the
Create a data generator
For the creation of a new data generator there is a separate view, which can be called by clicking on the button New Data Generator and is divided into three tabs. Besides the two tabs General and Enhancer, which are identical for all data generators, each of them has an additional specific tab.
The following sections first describe the two identical tabs before an explanation of the specific tabs follows:
- General
Within the General tab, first a name for the data generator to be created has to be specified along with a default language. The counter Minimum number of documents for synchronization defines a threshold. From this threshold upwards, the collected data is persisted into the index created by the Solr server. The remaining settings allow a regular and automatic collection of the required data and can be enabled or disabled via the checkbox Start regularly. For this purpose, a daily or weekly start interval can be selected and the next start time can be defined by specifying a date and a time.
In this case, the languages correspond to the values provided by ISO 639-1, meaning language codes consisting of two characters. If a data generator finds longer language codes on documents, everything after the second character is truncated. This way, for example, the language code de_DE becomes the value de during generation.
Figure 17. Data generator - General
- Enhancer
The Enhancer tab enables the creation of configured Enhancers. They are used to modify the data collected by the data generator before it is persisted on the Solr server. Per data generator configuration, multiple Groovy script Enhancers can be defined. These are processed in the order in which they are stored in the configuration. The creation of an Enhancer is done by clicking on the New Enhancer button.
A detailed description of the Enhancers that can be created is included in the Enhancer chapter below.
- Web Crawler
The Web Crawler tab is specific to the data generator of the same name and is therefore only displayed for it. It collects all the content of an existing web presence. This includes HTML content (but without HTML comments contained in it) as well as all files of any format. To do this, it first requires a start URL and the maximum number of pages to be covered. If the start URL corresponds to a sitemap, the links contained in the sitemap are successively captured. By specifying the value zero for the number of pages to be captured, the entire web presence is captured.
To ignore HTML content when generating the search index, it must be marked by HTML comments of the form <!-- NO INDEX --> and <!-- INDEX -->.
If the collected pages contain links to other content, the data generator follows them and gathers the linked content as well. Canonical links are also taken into account. This way, it captures the web presence specified with the start URL either completely or up to the specified limit.
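For illustration, the following sketch shows how the exclusion markers mentioned above could be used in a page; the surrounding markup is a purely hypothetical example:
<main>
  <p>This text is captured and included in the search index.</p>
  <!-- NO INDEX -->
  <nav>This navigation block is ignored when the search index is generated.</nav>
  <!-- INDEX -->
  <p>Indexing continues as usual from this point on.</p>
</main>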
Links to be ignored by the data generator must be marked with the attribute rel="nofollow".
A self-implemented Groovy script specified in the Check URL Script field can be used to individually influence the gathering and processing of linked content. The Validate button checks the created script for possible errors. The script determines whether a link will be captured: If the script returns true, every link found by the crawler will be captured, which poses the risk of creating a never-ending crawler.
return true; // Caution: This could lead to a never-ending crawler
return url.startsWith("https://www.mithrasenergy.com") && !url.contains("/shop/"); // Capture only links below the company website, excluding the shop area
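A slightly more elaborate sketch of such a script is shown below. The host names and path segments are placeholders; as in the examples above, it is only assumed that the current link is available in the url variable and that the script returns a boolean value:
// Capture only links that belong to one of the permitted hosts ...
def allowedHosts = ["https://www.example.com", "https://docs.example.com"]
boolean hostAllowed = allowedHosts.any { host -> url.startsWith(host) }
// ... and skip large binary downloads as well as endless calendar views.
boolean unwantedTarget = url.endsWith(".zip") || url.endsWith(".mp4") || url.contains("/calendar")
return hostAllowed && !unwantedTarget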
The Web Crawler recognizes sitemaps in XML format automatically if they are specified as the start URL. The documents listed in them are processed and crawled like HTML links. In addition, it is also possible to use an index sitemap as the start URL. Please note that both XML sitemaps and index sitemaps are not documents by themselves and are therefore not included in the index. A sitemap.xml referenced within an HTML file is processed as a normal document and not as a sitemap. To process a URL as a sitemap, the URL must be listed directly as the start URL or as the content of a sitemap index file.
The data generator can only follow references from HTTP to HTTP(S) pages. It cannot switch from HTTPS to HTTP. This also applies if an index sitemap is used and other sitemaps are referenced within it.
The remaining settings allow you to specify the data generator’s search queries:
- By de-/activating the checkbox Ignore no-script areas in HTML it is possible to determine whether the content contained in these areas is also to be collected.
- If the observe robots.txt checkbox is enabled, the data generator takes into account the restrictions defined in the robots.txt file of the respective web presence. Otherwise, it ignores them and gathers the entire content of the website.
- If the captured links contain additional parameters, these can be skipped by activating the Remove parameters from URL checkbox. This way, redundancies in the captured documents can be avoided.
- The Charset field specifies which charset is used to persist the collected data.
- For both collecting data and retrieving persisted documents, the SmartSearch provides the smart-search-agent user agent. Its reference name is contained in the corresponding field and can be overwritten as desired.
- The value defined for the Interval between two downloads determines how fast the SmartSearch submits the individual search queries. This way, an overload of the web server due to a too high query frequency can be avoided.
- If the data generator does not receive a response to a search request, the Read and connect timeout indicates the amount of time before the data acquisition is aborted.
When collecting the contents of an existing website, the Web Crawler may receive different status codes. These are processed as follows:
- 2xx: These status codes correspond to the standard codes. In this case the contents are processed as usual by the Web Crawler.
- 301: This status code corresponds to a redirect. In this case the Web Crawler adopts the URL of the new page.
- 302 and 307: These two status codes also indicate a redirect. Even in this case, the Web Crawler adopts the contents of the new page, but the old link is retained.
- 4xx and 5xx: These status codes specify error statuses. The affected pages are ignored by the Web Crawler and the error is logged as DEBUG.
For the redirect links resulting from a 3xx status code, the Strip Parameters configuration is taken into account.
The Web Crawler first tries to read a language tag from the following tags in the source code of the crawled HTML document:
- <html lang="en">
- <body lang="en">
- <meta name="language" content="en">
- <meta http-equiv="content-language" content="en">
- <meta property="og:locale" content="en_US">
- <meta property="DC:language" content="en">
- <meta property="og:locale:alternative" content="en_GB">
- <meta property="og:locale:alternative" content="en_AU">
For the first of the above tags that was found, the specified language will be interpreted and stored as the language of the document.
If none of the tags are present, the Web Crawler tries to detect a language automatically from the content area of the document. If this fails, the default language maintained at the data generator is stored at the document.
By default, the SmartSearch stores the input language of a document in the language_original_stored_only field. This storage happens regardless of the changes made to a document's language during language recognition and processing.
For example, if the SmartSearch processes the en_US language passed to the document in the language_keyword field and changes it to the supported language en, the value of the language_original_stored_only field will still be en_US.
Figure 18. Web Crawler
- XML Crawler
The XML Crawler tab is also specific to the data generator of the same name and is therefore only displayed for it. It captures XML data from a defined source that does not exist in the form of a web page. For this purpose, an XML URL must be specified in the corresponding field. For the target of this URL there are two file types:
- Data XML file: The data XML file corresponds to a specific source. It includes the documents to be processed by the data generator in a single file.
- Seed XML file: If there is a need to record multiple sources, the URL of a seed XML file can be passed. The data generator then follows the references contained in this file, the number of which is not limited.
Collecting XML sitemaps is possible only with the help of the web data generator.
Access to the files to be processed is performed exclusively via HTTP(S) requests. Acquiring XML files from the SmartSearch server file system or other file sources is not possible. If the data generator does not receive a response to a search request, the Read and connect timeout indicates the amount of time before the data acquisition is aborted.
For both collecting data and retrieving persisted documents, the SmartSearch provides the smart-search-agent user agent. Its reference name is contained in the corresponding field and can be overwritten as desired.
Figure 19. XML Crawler
Data XML file
The XML Crawler captures XML data that comes from either a single source or multiple sources. A data XML file corresponds to a single source. It contains any number of documents that the SmartSearch processes in the context of an XML data generation. When an XML data generator which captures such a file is started, all documents contained in the file are added to the search index. However, this assumes that the documents are valid.
A valid data XML file meets the following requirements:
- The root element is a documents element.
- The documents element contains any number of document elements:
  - A document element represents an entity to be found by the search.
  - A document element has a uid attribute whose value is a unique alphanumeric document ID.
  - The document element contains any number of field elements:
    - Different document elements may consist of different numbers of fields.
    - A field element has a name attribute whose value is the desired field name in the search index.
    - A field element contains the value that the field should have in the document.
    - Multiple values (e.g. for fields with the dynamic field type *_keyword) can be created by entering field elements with identical name attribute values but with different tag contents (see the category_keyword values in the example below).
    - The processing of a field value depends on the naming convention of the field name. For an explanation of the available naming conventions and field types, see section Available fields and field types.
    - It is recommended to always enclose field values within a CDATA block (see example).
The following example shows an exemplary structure of an XML data file. Such a file can directly serve as XML URL for an XML data generator.
Example of a data XML file
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<documents>
    <document uid="0815">
        <field name="title"><![CDATA[The explanatory video for the SmartSearch XML file]]></field>
        <field name="link"><![CDATA[https://www.youtube.com/watch?v=...]]></field>
        <field name="content">
            <![CDATA[The content here tells you how to create a SmartSearch XML file.]]>
        </field>
        <field name="language_keyword"><![CDATA[en]]></field>
        <field name="publishing_date"><![CDATA[ 2019-05-23T10:00:01Z ]]></field>
        <field name="favorites_count_long"><![CDATA[34]]></field>
        <field name="category_keyword"><![CDATA[socialmedia]]></field>
        <field name="category_keyword"><![CDATA[press]]></field>
        <field name="meta_keywords"><![CDATA[SmartSearch explanatory video facets]]></field>
    </document>
    ...
    <document uid="4711">
        <field name="title"><![CDATA[The SmartSearch help pages.]]></field>
        ...
    </document>
</documents>
A document is only valid if the mandatory fields are present. It is also recommended to pass the language of the document via the corresponding language field (see the language_keyword field in the example above).
Seed XML file
In contrast to the data XML file, a seed XML file provides data from different sources. It therefore contains an unlimited number of references to data XML files or further seed XML files. Specifying additional seed XML files allows for an arbitrarily deep tree structure. When the seed XML file is passed, the data generator follows all these references and handles them as if they had been specified directly as an XML URL. Also in this case all documents must be valid.
A valid seed XML file meets the following requirements:
- The root element is a seeds element.
- The seeds element contains any number of seed elements:
  - A seed element represents a data XML file or another seed XML file.
  - A seed element is empty, but has a url attribute with the reference target.
The following example shows a seed XML file that contains references to both data and seed XML files:
Example of a seed XML file
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<seeds>
    <!-- References to data XML files -->
    <seed url="https://www.customer.com/.../.../documents-01.xml" />
    <seed url="https://www.customer.com/.../.../documents-02.xml" />
    <!-- References to other seed XML files -->
    <seed url="https://www.customer.com/.../.../another-seed.xml" />
</seeds>
- External Crawler
The external data generator does not collect any information by itself, but only records data that it receives externally via the REST interface. It therefore provides no additional configuration options in its specific tab. The settings made in the General and Enhancer tabs are sufficient for it.
The transfer of data to the REST interface is to be realized by a self-implementation. The information required for this is contained in the documentation SmartSearch REST API for external data generators.
The external data generator will be replaced in its functionality by the generic API in the future and will not be developed further. New projects should therefore use this new approach to index externally generated documents.
List of existing data generators
After the data generator is saved and successfully created, it is added to the list of existing data generators in the tab of the same name. The list is divided into a four-column table with the following information:
- Name
The column contains the name of the data generator specified during generation.
- Type
This column indicates whether the data generator is of type Web, XML or External.
- Status
This column shows the data generator's current state. Depending on the operation, it can have one of the following statuses: Ready, Record, First or Second Enhancer, Synchronized or Error.
- Last edit
The information in this column shows the time and the editor of the last change to the data generator.
Edit a data generator
The button with the symbol of a pen enables editing the data generator. Clicking it opens the detailed view with all the configurations made. These can be changed according to the previously described possibilities.
Execute a data generator
The button with the play icon manually starts the execution of the data generator. It then collects the data from the configured source and transfers it to the Solr server.
The persistence of the data by the Solr server takes place at regular intervals. Therefore, it can take up to fifteen minutes until the data is retrievable. The Reload Collection button enables immediate synchronization. It is visible only to members of the
Deleting a data generator
The button with the symbol of a trash can enables the deletion of a data generator. Clicking it opens a dialog with a confirmation request to confirm the deletion.
Alternatively, it is possible to select several data generators for deletion via the checkboxes that are visible in the overview list in front of each data generator. The button Delete selected removes these selected data generators after a confirmation request has been confirmed.
Erasing the indexed documents of a data generator
The button with the symbol of an eraser enables the emptying of a data generator's data, deleting all documents from the index. Clicking the button Empty selected opens a dialog with a confirmation request to confirm the erasure of the data.
2.3.2. Browser
For the display of the data collected by the data generator and persisted by the Solr server, the SmartSearch provides a Browser. This can be called up via the corresponding menu item.
For the display, a data generator must first be selected in the drop-down menu. The Browser then lists all documents that have been collected by this data generator. Clicking on such a document will also display all fields and the meta information stored in them. The Previous document and Next document buttons allow browsing through the captured documents, whose total number can be read between them.
2.3.3. Preparation of data by Enhancer
In the course of data generation, it is often necessary to adjust the documents prepared for the index. This is required to increase the discoverability of documents, to remove invalid documents or to increase data quality in general.
The SmartSearch uses so-called Enhancers for this purpose. These are both supplied and project-specifically configured components that are part of a data generator configuration. During each data generation, the Enhancers are executed for each captured document and their effects are applied to the document to be indexed.
Generally, there are three types of Enhancers, which are executed in the following order:
- General Enhancer
The general Enhancers are provided by the SmartSearch. They are transparent to users and cannot be customized via configuration. They allow, for example, speech analysis to be performed, the results of which are usable in the following configured Enhancers.
- Configured Enhancer
With the configured Enhancers, the SmartSearch provides the ability to configure custom logic for processing and customizing documents as part of a data generation process. The types of configured Enhancers and their use cases are explained in the following section.
- Finishing Enhancers
The finishing Enhancers are provided by the SmartSearch, analogous to the general Enhancers. They are also transparent to the users and likewise cannot be customized via configuration.
The Enhancer tab in the configuration view of a data generator enables the creation of configured Enhancers. These are used to modify the data collected by the data generator before it is indexed on the Solr server. Per data generator configuration, multiple Groovy script Enhancers can be defined. These are processed in the order in which they are stored in the configuration. The creation of an Enhancer is done by clicking on the New Enhancer button.
Groovy script Enhancer
This Enhancer allows the inclusion of a Groovy script to be implemented by the user. The editing mask required for this opens after creating the Enhancer via the menu item Groovy Script Enhancer. Such a script allows additional modifications of the dataset, as well as access to the data on the document generated during the execution of the general Enhancers. For example, additional documents can be added to the dataset or existing documents can be edited or deleted. An input made can be checked for possible syntax errors using the Validate button.
The following code example shows the structure of a Groovy script Enhancer:
void manipulate(final Document document, final String html, final org.jsoup.nodes.Document jsoupDocument) {
// Only the method body is to be filled.
// The signature is automatically bracketed around the configured code in the edit mask.
}
The Groovy script is passed an object document, a string html and an object jsoupDocument:
- The document object contains the SmartSearch document currently being processed. Additional data can be added to or removed from this document during processing:
document.addData("field name", "field value")  // adds a field with the given value to the document
document.removeData("fieldToRemove")           // removes a field from the document
- In the context of a Web data generator, the html string contains the HTML of the page originally captured by the crawler, which is currently being converted into a SmartSearch document. For other kinds of data generators, this object is not relevant.
- In the context of a Web data generator, the object jsoupDocument contains a representation of the page as a Jsoup document. This allows object-based processing of the page:
document.addData("page title", jsoupDocument.title())
This object is also relevant only for data generators of type WEB.
If a document is not to be included in the index during generation, this can be achieved by defining the boost value 0:
document.setBoost(0)
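Putting these pieces together, the following sketch shows what the body of a Groovy script Enhancer for a Web data generator could look like; the field name description_stored_only and the selection logic are purely illustrative assumptions:
// Enrich the document with the meta description of the crawled page, if present.
String description = jsoupDocument.select("meta[name=description]").attr("content")
if (!description.isEmpty()) {
    document.addData("description_stored_only", description)
}
// Exclude documents without a usable title from the index via a boost value of 0.
if (jsoupDocument.title().trim().isEmpty()) {
    document.setBoost(0)
}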
Further examples of Groovy scripts and their usage are included in the Examples of Groovy Scripts Enhancers chapter.
Extracting names and nouns
According to its name, this Enhancer extracts names and nouns and stores them in the fields names_multi_keyword and nouns_multi_keyword.
This Enhancer can only be used for the language
2.3.4. Available fields and field types
With the SmartSearch, a data structure - the so-called Solr schema - is delivered, which defines the fields existing for captured documents. After a data gathering, these fields on the document contain structured contents, which are stored in the index in relation to the document. Since the contents of various fields potentially affect the results of a search query, some relevant fields are explained below. Some fields are automatically created and populated with values during the run of data generators. In addition, the Groovy Script Enhancer provides the possibility to add more fields and values.
Basically, a field consists of an identifier and one or more values. There are three types of fields, which can differ in their use and function:
- mandatory fields
- static fields
- dynamic fields
A field can have one or more values depending on the field type. For more information, see the section "Fields with multiple values per document".
To use the SmartSearch it is important to understand the use cases and content of the available fields. The following tables give an overview of the three field types and the relevant part of the available fields and outline their content:
Mandatory fields
The identifiers as well as the values of mandatory fields are obligatory on every document; the former are also unchangeable.
The following table shows an overview of fields of this type:
Identifier | Description | Content
---|---|---
id | Unique id of a document | With the content of this field, each document in the index is uniquely identifiable.
title | Title of a document | For example, the title of a document can be used as a headline on a search results page.
content | Content of a document | The content of the document is - along with the title - the content basis for the search function. Content that is relevant for finding the document is usually to be written into this field during data generation.
link | URL of a document | The URL of a document is the jump label with which the source of the indexed document can be retrieved by the end user during a search process. This can be, for example, a link after the title of the document, which then leads to the actual document.
Static fields
The identifiers of static fields are unchangeable, just like those of mandatory fields. Unlike the latter, however, static fields do not necessarily have a value.
The following table lists some typical static fields:
Identifier | Description | Content
---|---|---
thumbnail | Compact view as image link | To display the compact view of the documents in a search results page, the URL of such a compact view must be stored in this field.
keywords | Keywords | This field is intended for terms that should not undergo any linguistic processing, such as stemming, before being included in the index.
language | Language abbreviation | This field contains the language abbreviation according to ISO 639-1. The chapter data generators contains more information about this.
Dynamic fields
Dynamic field identifiers follow a naming convention and may or may not be present depending on the usage and document. Dynamic fields are intended to make the schema flexible for future necessary data and thus the application more robust.
Dynamic fields behave like static fields except that their identifiers contain a placeholder. If documents are indexed by starting a data generation, data that cannot be assigned to a static field can be caught by a dynamic field.
For example, a document may contain a custom_date field that does not exist in the Solr schema. The data in this field would still be included in the index via the *_date dynamic field.
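As a minimal sketch, such a field could be delivered in a data XML file as follows (the field name custom_date and its value are purely illustrative); it would then be caught by the *_date dynamic field:
<field name="custom_date"><![CDATA[2021-04-01T08:30:00Z]]></field>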
For more information on dynamic fields, see the Solr documentation.
An overview of the most important dynamic fields available can be found in the following table:
Identifier | Description | Content | Supported functionalities |
---|---|---|---|
*_integer | Integer values | Fields of this type contain values from the value range Integer (32-bit signed integers). | |
*_long | Long values | Fields of this type contain values from the value range Long (64-bit signed integers). | |
*_double | Double values | Fields of this type contain values from the value range Double (64-bit IEEE floating point). | |
*_date | Date values | Fields of this type contain date values and are to be used accordingly, e.g. for chronological sorting. The values are given in UTC format, for example 2023-05-01T12:00:00Z. | |
*_date_range | Date values | Fields of this type carry a different kind of date values: unlike *_date fields, they can additionally contain date ranges. | |
*_keyword | Keywords | Like the static field keywords, fields of this type contain terms that should not undergo any linguistic processing before being included in the index. | |
*_keyword_lc | Keywords in lowercase | Fields of this type contain keywords, which were automatically converted to lowercase during indexing. | |
*_sort | Sort values | This field type allows storing sortable values. The values in these fields are converted to lowercase and cleansed of spaces. | |
*_sort_de | Sort values - special case for German | This field type behaves analogously to *_sort, but additionally takes German language specifics such as umlauts into account. | |
*_token | Grouping values | This field type allows grouping documents based on the field contents, for example, document or product categories. | |
*_pnt | Spatial point values | Fields of this type contain spatial point values, e.g. to store geographic information such as latitude/longitude coordinates. | |
*_facet_string | Text facet values | Fields of this type contain textual values for use as facet attributes. These fields are created and populated with facet values automatically when the corresponding facet identifiers are used. | |
*_facet_double | Facet values for floating point numbers | Fields of this type contain numeric values for use as facet attributes. | |
*_stored_only | Unprocessed values | Fields of this type contain content with up to 32766 characters, which is not made searchable or processed in any other way. However, it can be queried when retrieving documents for display. | |
*_stored_only_big | Large unprocessed values | Fields of this type contain content of unlimited size, which is not made searchable or processed in any other way. However, it can be queried when retrieving documents for display. | |
*_autocomplete | Terms for auto-completion | Fields of this type contain terms that can be used for auto-completion in addition to the content of the document. Terms to be used in this sense are specified in this field as a list separated by blank characters. | |
*_expanded_autocomplete | Terms for cross-word auto-completion | Fields of this type contain terms that can be used for auto-completion across two words, in addition to the content of the document. Terms to be used in this sense are specified in this field as a list separated by blank characters. | |
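As a rough illustration of how several of these dynamic field types might appear on one document, consider the hypothetical payload below. The field names and values are invented, and the date notation is an assumption based on common Solr conventions.

```python
# Hypothetical document using some of the dynamic field types listed above.
# Field names and values are invented; the date notation is an assumption based
# on common Solr conventions and may differ in your installation.
document = {
    "stock_integer": 42,                       # caught by *_integer
    "price_double": 19.99,                     # caught by *_double
    "published_date": "2023-05-01T12:00:00Z",  # caught by *_date (assumed UTC/ISO-8601 notation)
    "category_token": "garden-furniture",      # caught by *_token, usable for grouping
    "color_facet_string": "green",             # caught by *_facet_string, usable as a facet attribute
    "teaser_stored_only": "Raw teaser text that is stored but not searchable.",
}
```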
Static/dynamic fields for language abbreviations
In addition to the types of fields described above, further fields exist which can be present per language on documents and whose naming schemes are illustrated in the following table:
Naming scheme | Description | Content | Supported functionalities |
---|---|---|---|
*_text_de | Language-dependent textual content | These fields contain content that is processed textually depending on the language. | |
*_autocomplete_de | Language-dependent terms for auto-completion | Fields of this type contain language-dependent terms, which can be used for auto-completion in addition to the content of the document. Terms to be used in this sense are specified in this field as a list separated by blank characters. | |
*_expanded_autocomplete_de | Language-dependent terms for cross-word auto-completion | Fields of this type contain language-dependent terms, which can be used for auto-completion across two words in addition to the content of the document. Terms to be used in this sense are specified in this field as a list separated by blank characters. | |
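These naming schemes combine a base name, a type suffix and the ISO 639-1 language abbreviation. The tiny helper below only illustrates this convention; the function name and the example base names are invented.

```python
# Illustrative sketch of the language-dependent naming scheme shown above.
def language_field(base: str, suffix: str, language: str) -> str:
    """Build a language-dependent field name, e.g. ('title', 'text', 'de') -> 'title_text_de'."""
    return f"{base}_{suffix}_{language}"

print(language_field("title", "text", "de"))           # -> title_text_de
print(language_field("teaser", "autocomplete", "en"))  # -> teaser_autocomplete_en
print(language_field("slogan", "expanded_autocomplete", "fr"))  # -> slogan_expanded_autocomplete_fr
```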
Fields with multiple values per document
When using SmartSearch, multiple values can in certain cases be indexed for one field name. This is not possible for every static and dynamic field.
Which static and dynamic fields support multiple values can be found in the table in section "Available fields and field types". User-defined fields, which are not mapped to any of the table values, generally allow multiple values.
If multiple values are passed for a field which does not support them, the document will still be indexed. However, only one value is stored for the field, usually the first value passed. In this case, a corresponding warning can be found in the log of the application. |
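As a sketch, a multi-valued field could be passed as shown below. The field name product_tags stands for a user-defined field, which generally allows multiple values; all values are invented.

```python
# Hypothetical payload illustrating multiple values for one field name.
# "product_tags" stands for a user-defined field (not mapped to any table value),
# which generally allows multiple values; all values are invented.
document = {
    "product_tags": ["delivery", "shipping", "returns"],  # several values for one field
    "language": "en",                                      # single-valued field
}
# If several values were passed for a field that does not support them, only one
# value (usually the first one) would be kept and a warning would appear in the log.
```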
2.3.5. Language details
The SmartSearch persists the content collected on the customer side in a variety of different fields and provides the Prepared Searches for querying them. During execution, a language can be passed to these in order to process the request in a language-dependent manner. The table below shows an overview of all languages currently configured for the SmartSearch, including their abbreviations (according to ISO 639-1) and the features available for them. The features are applied automatically when processing the captured information. More information about the individual languages is available in the Solr documentation.
Currently, the SmartSearch assigns the following features to the languages:
- Lower case: The internal persistence of the recorded data is done in lower case. This has no effect on the output of the information and is only used for its processing.
- Normalization: Normalization is used to align similar spellings of words. In Scandinavian languages, for example, the characters æÆäÄöÖøØ (as well as the variants aa, ao, ae, oe, and oo) are substituted to form åÅæÆøØ. The different spellings blåbärsyltetöj and blaabaersyltetoej thus result in the normalized form blåbærsyltetøj.
- Stemming: Stemming enables the linking of different morphological variants of a word with their common root. For example, the terms fished, fishing and fisher have the common root fish (see the illustrative sketch after this list).
- Stop words: The stop words are a list of terms that occur very frequently in a language. They include, for example, articles, pronouns and linking words. Due to their frequency, they should not be considered for the search index. The delivery contains such a list for each language, which can be configured via the SmartSearch cockpit.
- Synonyms: For languages that allow synonyms, the SmartSearch cockpit offers the possibility to create a list that assigns other, similar words to a term. These words are treated the same in a search.
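The interplay of lower casing, stop word removal and stemming can be pictured with the following toy sketch. It is purely conceptual: the actual processing is performed by the configured Solr analyzers, and the word list and suffix rules below are invented.

```python
# Purely conceptual toy sketch of lower casing, stop word removal and stemming.
# The real processing is done by Solr analyzers; word lists and rules are invented.
STOP_WORDS = {"the", "a", "and", "of"}

def toy_analyze(text: str) -> list[str]:
    tokens = [t.lower() for t in text.split()]            # lower case
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    stems = []
    for t in tokens:                                       # toy "stemming"
        for suffix in ("ing", "ed", "er"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(toy_analyze("The fisher fished and the fishing boat"))
# -> ['fish', 'fish', 'fish', 'boat']
```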
Language | Abbreviation | Lower case | Normalization | Stemming | Stop words | Synonyms |
---|---|---|---|---|---|---|
Arabic | ar | ✔ | ✔ | ✔ | ✔ | ✔ |
Bengali | be | ✔ | ✘ | ✘ | ✔ | ✔ |
Bulgarian | bg | ✔ | ✘ | ✔ | ✔ | ✔ |
Chinese Traditional | zh-tw | ✔ | ✔ | ✘ | ✔ | ✔ |
Chinese Simplified | zh-cn | ✔ | ✔ | ✔ | ✔ | ✔ |
Croatian | hr | ✔ | ✘ | ✘ | ✔ | ✔ |
Czech | cs | ✔ | ✘ | ✔ | ✔ | ✔ |
Danish | da | ✔ | ✔ | ✔ | ✔ | ✔ |
Dutch | nl | ✔ | ✘ | ✔ | ✔ | ✔ |
English | en | ✔ | ✘ | ✔ | ✔ | ✔ |
Estonian | et | ✔ | ✘ | ✘ | ✔ | ✔ |
Finnish | fi | ✔ | ✘ | ✔ | ✔ | ✔ |
French | fr | ✔ | ✔ | ✔ | ✔ | ✔ |
German | de | ✔ | ✘ | ✔ | ✔ | ✔ |
Greek | gr | ✔ | ✘ | ✘ | ✔ | ✔ |
Hungarian | hu | ✔ | ✘ | ✔ | ✔ | ✔ |
Indonesian | id | ✔ | ✘ | ✔ | ✔ | ✔ |
Italian | it | ✔ | ✔ | ✔ | ✔ | ✔ |
Japanese | ja | ✔ | ✘ | ✘ | ✔ | ✔ |
Korean | ko | ✔ | ✔ | ✘ | ✔ | ✔ |
Latvian | lv | ✔ | ✘ | ✔ | ✔ | ✔ |
Lithuanian | lt | ✔ | ✘ | ✔ | ✔ | ✔ |
Malay | ms | ✔ | ✘ | ✘ | ✔ | ✔ |
Norwegian | no | ✔ | ✘ | ✔ | ✔ | ✔ |
Polish | pl | ✔ | ✘ | ✔ | ✔ | ✔ |
Portuguese | pt | ✔ | ✘ | ✔ | ✔ | ✔ |
Romanian | ro | ✔ | ✘ | ✔ | ✔ | ✔ |
Russian | ru | ✔ | ✘ | ✔ | ✔ | ✔ |
Serbian | sr | ✔ | ✔ | ✘ | ✔ | ✔ |
Slovak | sk | ✔ | ✘ | ✘ | ✔ | ✔ |
Spanish | es | ✔ | ✘ | ✔ | ✔ | ✔ |
Swedish | sv | ✔ | ✔ | ✔ | ✔ | ✔ |
Thai | th | ✔ | ✘ | ✘ | ✔ | ✔ |
Turkish | tr | ✔ | ✔ | ✔ | ✔ | ✔ |
Ukrainian | uk | ✔ | ✘ | ✔ | ✔ | ✔ |
Vietnamese | vi | ✔ | ✘ | ✘ | ✔ | ✔ |
2.4. System
The area System has the User Management as its only subitem. It is used to create and configure users and groups and is therefore divided into two tabs with the same names. The following subchapters explain the functionality of the two tabs in detail.
If the user and group management is realized on the basis of LDAP, the SmartSearch cockpit only has read access to it. In this case, the cockpit is only used to list the users and groups, not to edit or delete them. Only the administration of permissions is still handled via the SmartSearch cockpit in the LDAP case. |
2.4.1. Users
Access to the SmartSearch cockpit requires an active user. For this reason, a master admin exists for the initial login, with which further users can be created. The permissions of a user result from the group membership defined when the user is created. Users are created and administered in the tab of the same name in the User Management.
If the user management is realized on the basis of LDAP, the SmartSearch cockpit only has read access to it. Creating, editing and deleting users is not possible within the cockpit in this case. |
The tab shows a list of all existing users, which initially contains only the master admin.
Create users
For the creation of a new user there is a separate view, which can be called up by clicking the New User button.