Please note that haupia was renamed to SmartSearch with version 2.4.0. While we have updated our documentation accordingly, you may still encounter "haupia" in older code examples, APIs, scripts, and configuration files. In such cases, please consider SmartSearch as the current and correct term. We recommend making the appropriate substitutions when implementing solutions from older documentation.
1. Introduction
Searching a website is especially important as more and more information is available online these days and websites become larger and more complex. An effective search function allows users to quickly and easily find the information they need without having to click through countless pages and links. It is also an efficient way to save time and improve the user experience, as it allows users to go directly to the relevant content without having to deal with unnecessary information. In addition, a good search function can also help a website be perceived as user-friendly and professional, which in turn increases user confidence and can help them stay longer on the website and come back more often. Overall, effective search on a website is an important factor in a positive user experience and the success of a website.
The search function on the website can also be used as a marketing tool to provide visitors with personalized offers and recommendations. For example, if a visitor is searching for a specific product, the company can recommend similar or complementary products or make special offers to increase their willingness to buy. In addition, certain hits can be better placed in search results to draw the attention of website visitors to specific content.
Search gives the company valuable insights into the needs and interests of visitors. By analyzing search terms and results, companies can identify trends and patterns and adapt their website content accordingly to better meet the needs of the target group. In this way, the website can also serve as a market research tool.
Overall, a well-designed search function on the website can not only help improve user experience and increase conversion rates, but also provide valuable insights into the needs of the target audience and be used as a marketing tool for customer retention and acquisition.
The SmartSearch bundles requirements that customers place on the search function of an online presence: An intuitive, high-performance search solution that can be used on extensive websites and delivers relevant results. It offers both high result quality and optimum search convenience, thus retaining customers on the website.
At the same time, it provides editors with a web interface through the integrated SmartSearch cockpit that can be used without IT knowledge. Editors from specialist and marketing departments are thus enabled to control and monitor search results on the web interface. For this purpose, the cockpit provides statistics, filters and analysis functions and allows the indexing of various data types (for example XML, audio, video, media) from different data sources. With the help of individualized result lists, editors can prioritize and weight search results in the back end and display selected content for predefined search queries.
1.1. Concept
The SmartSearch functionality, e.g. settings for user and group permissions, can be managed in the browser-based SmartSearch cockpit without any prior technical knowledge.
1.1.1. Data generator
In the SmartSearch cockpit, you can create data generators to capture searchable data.
There are three types of data generators:
- Web: The web crawler makes an existing website searchable.
- XML: The XML file crawler speeds up the capture of data that already exists in the form of XML files.
- API: Allows data to be passed in via an API, for example from a self-developed application or the FirstSpirit SmartSearch Connect module.
1.1.2. Prepared Search
Using a Prepared Search, you can configure the search result, for example by bundling content from multiple data generators into one Prepared Search.
1.1.3. Adaptable Result
In an Adaptable Result, you can actively modify the search results, for example to display certain search results higher in the list, or to exclude search results from the list.
1.1.4. Synonyms and stopwords
You can specify a synonym to associate a term with the search that does not appear in the website text but is meaningfully related to another search term. Synonyms can be derived from common search queries, for example.
You can use stopwords to ignore certain terms in the search query. A list of typical stopwords for each language is provided and can be extended.
1.2. Architecture
The functionalities of the SmartSearch are realized by an architecture made up of a range of different components (see figure Architecture).
These components are:
- ZooKeeper
- Solr
- SmartSearch
The individual components always interact according to the following schema:
- Prior to creating the search index, the SmartSearch must collect the required data. For this, it accesses the information to be collected on the customer side, which can exist in the form of websites, portals or databases. In addition, a REST interface provides the option of filling the search index with further data from outside.
- After that, the SmartSearch server normalizes the data and transfers it to the Solr server. The Solr server receives the data and persists it in an index.
- The query for data works the same way in reverse: The SmartSearch server receives the request, modifies it, and forwards it to the Solr server. The Solr server responds with a search result, which the SmartSearch server returns to the customer's end application via the REST interface.
- The SmartSearch cockpit is to be seen as detached from the other components. It serves the administration of the SmartSearch server and therefore offers a simple, web-based administration interface. Among other things, search solutions can be created and configured in this interface.
- Configurations made in the SmartSearch cockpit are saved on the ZooKeeper server together with the Solr configuration data.
Communication with the outside is secured via HTTPS; between the components it takes place via HTTP.
2. SmartSearch cockpit
The SmartSearch cockpit is a component of the SmartSearch. It enables the backend-side administration of the data collected by the SmartSearch and offers a simple, web-based interface for this purpose. This is divided into the areas Configuration, Analysis, Data and System, which can be reached via the menu. The button with the globe icon also provides a language switcher for German and English.
By default, the SmartSearch cockpit is accessible via the following URL:
The first start of the cockpit must be done with the master admin. It is created automatically with the data from the application.yml at the initial start of the SmartSearch server.
If the user and group management is implemented via an LDAP server, the credentials may differ.
After valid login, the user is automatically redirected to the dashboard of the cockpit. Re-authentication is only required after an explicit logout or after the session has expired.
2.1. Configuration
The Configuration area is divided into the submenus Prepared Search, Stopwords and Synonyms. These allow the configuration of the output of the data collected by the SmartSearch.
The following subsections describe the submenus and the functions provided by them.
2.1.1. Prepared Search
The customer-side gathering of the required data is done by the so-called data generators, which are part of the Data area. For their management, the SmartSearch provides the Prepared Searches. These allow optimizing the search results by prioritizing individual data.
The creation and administration of the Prepared Searches is done in the interface of the same name, which can be called up via the corresponding menu entry.
The area shows a list of all already existing Prepared Searches and is initially empty.
In cloud mode, the list also displays the accessibility of each Prepared Search.
New Prepared Search
For the creation of a new Prepared Search there is a separate view, which can be called by clicking on the button New Prepared Search and is divided into the three tabs General, Facets and Preview.
- General
The first thing to do within the General tab is to specify a name for the new Prepared Search. In cloud mode, the additional checkbox publicly accessible is located next to the input field for the name. With it, the accessibility of a Prepared Search can be defined. Activating the checkbox enables the Prepared Search to be queried via the internet (API gateway). Otherwise, the Prepared Search is only accessible from within the cockpit.
Selecting any number of data generators in the selection list of the same name then shows their available fields. The initially activated checkbox Verbose shows or hides all technical fields. The button provided together with the checkbox enables the selected fields to be emphasized.
The list of field names per data generator is cached. When creating a new data generator and running it for the first time, it may take several minutes for the field names to appear in the list.
The selected fields can be transferred via button to the list of fields relevant for a search, which by default contains the fields content, link and title. A previously defined emphasis is automatically assigned to each of these fields. The list provides the following configuration options per field:
- Highlight: By activating this checkbox, a search word is highlighted within a text segment in the search result. The length of the text segment is freely configurable (see below).
- Output: This option defines whether the field is visible in the search result. For example, in the case of the link that only refers to the entire document, this may not be desired.
- Search: In order for the associated field to be taken into account by the search, this checkbox must be activated. Deactivating the Search option hides the subsequent Partial match and Emphasis options for the corresponding field.
- Partial match: This option enables partial matches to be taken into account. If the checkbox is selected, the search for electricity, for example, also finds the match eco-electricity provider. The search word must have a length of three to twenty characters.
- Emphasis: The emphasis offers the possibility to set a prioritization for matches of the selected fields and thus to influence the search result.
The button with the trash can icon available for each field allows deleting the corresponding field from the list.
The tab additionally contains the following general configuration options:
- Hits per page: According to its name, this setting specifies the maximum number of hits displayed on a search results page. In combination with the URL parameter page it is also possible to split the search results into multiple pages.
- Length of highlight (in characters): Using this setting, the length of the text segment in which a search term is visually highlighted can be defined, as mentioned before.
- Sort by: By default, the search results are sorted in descending order of their score. If a different sorting is desired, this text input field allows a corresponding adjustment. For this, any field is to be specified and supplemented by the expression ASC for an ascending or DESC for a descending sorting.
- Spellcheck (hits less than): If the number of search results is smaller than the value configured at this point, a spelling check is performed and the search is repeated for search words of similar spelling.
- Must Match: For searches with multiple terms, the entry in this text input field determines how they are to be linked:
  - The value 0 corresponds to an OR combination of the search terms used.
  - The value 100% corresponds to an AND combination of the search terms used.
  - An absolute value defines the number of terms that must be contained within a search hit. For example, the value 2 for five given search terms means that two of the five terms must be contained within a search result. Furthermore, the values 2 and 50% are equivalent to each other for four search terms.
- Groovy Script: In addition, Prepared Searches allow the inclusion of a self-implemented Groovy script. Such a script enables additional modifications of the dataset. For example, additional documents can be added to the dataset or the existing dataset can be edited.
- Facets
Facets provide the possibility to restrict result lists according to fields that are included in a document. Since facets always refer to the data generators selected in the General tab, the Facets tab is initially empty.
Figure 5. Facets
New facet
For the creation of a new facet there is a separate view, which can be called by clicking on the button New Facet. This is only active if at least one data generator is defined in the General tab.
Within this view, a field must first be selected in the dropdown of the same name. The available fields are taken from the selected data generators. With the selection of a field, a list of the values associated with this field appears, for which the following configuration options are available:
- Filter: This input field can be used to search for facet values.
- Show weighted values only: By selecting this checkbox, only facet values that have been weighted will be displayed.
- Weight: By clicking on the Weight field, a weighting can be assigned to a facet value. This number represents a multiplier of the score and can be between 0.00 and 9.99. A value between 0.00 and 2.00 is recommended. Results that belong to this facet value receive a weighting and are ranked accordingly higher or lower in the search results.
- Display values on number of matches greater than: This option can be used to exclude values from the facet list whose number of matches is less than the specified threshold.
- Multiple selection possible: This option allows filtering the search results list according to different aspects. For example, the search for a specific object can subsequently be restricted to a specific size as well as to a specific color. This makes the search more specific and minimizes the search results list.
- Exclude own filter: In order to provide different filter options for the search, the selection of a filter may only refer to the search result list, but not to the filter options. Otherwise, the other options would also be hidden by the search. For example, if the filter options German, English and French exist for the facet Language, the search will only return English documents when the option English is selected. If the filter English is not excluded in this case, the list of available filters will also only show English. In this case it is no longer possible to switch to another filter or to make a multiple selection in connection with the previous configuration option.
- Preview
The Preview tab provides the option to test a Prepared Search configuration on a preview page. For this purpose it is not necessary to save the Prepared Search. The settings from the other tabs are directly applied to the search queries. On the preview page there is an input field for search terms. The entered term is then looked up in the current Prepared Search and the results are displayed. Next to each search result there is a button marked with an arrow pointing upwards. Clicking this button displays all of the fields that are selected in this Prepared Search below the result.
In order to transfer a facet to the preview, it is sufficient to confirm the configuration by clicking OK. After switching back to the preview tab, the last search term is used again automatically and new filter options appear in the column next to the search results. These receive their names and values from the previously created facet. The preview feature can also be used to test weightings or Groovy scripts.
Previously created Adaptable Results will appear in the specified order in the preview.
List of existing Prepared Searches
After the Prepared Search is saved and successfully created, it is added to the list of existing Prepared Searches in the tab of the same name. The list is divided into two columns with the following information:
- Name
The column contains the name of the Prepared Search specified during generation.
- Last edit
The information in this column shows the time and the editor of the last change to the Prepared Search.
Editing a Prepared Search
The button with the symbol of a pen enables editing a Prepared Search. Clicking it opens the detailed view with all the configurations made. These can be changed according to the previously described possibilities.
Deleting a Prepared Search
The button with the symbol of a trash can enables the deletion of a Prepared Search. Clicking it opens a dialog with a confirmation request to confirm the deletion.
Alternatively, it is possible to select several Prepared Searches for deletion via the checkboxes that are visible in the overview list in front of each Prepared Search. The button Delete selected removes these selected Prepared Searches after a confirmation request has been confirmed.
2.1.2. Stopwords
Every written text contains articles, filler words and other terms which are irrelevant for a search query and therefore should be ignored. For this reason, the SmartSearch has a list of so-called stopwords for each of the 26 supported languages, which it provides in the menu area in the form of a drop-down menu. The terms contained in these lists are ignored by the data generators when the data is collected and are not included in the search index.
To edit a stopword list, first select it from the drop-down menu. The interface will then display all stopwords contained in it.
The field Add new stopword allows the comma-separated specification of further stopwords, which are transferred to the list by clicking on the button with the plus symbol. The buttons with the symbol of a pen or a trash can, which are visible per stopword, allow editing or deleting a single entry of the list.
Each change to the selected list finally requires being saved via the button with the symbol of a checkmark, which is displayed next to the drop-down menu. Only with this click will the adaptations be applied, even if they are already visible in the list.
In order for the changes to the stopword lists to take effect, a re-run of the data generators is necessary.
2.1.3. Synonyms and Substitutions
Similar to stopwords, which can be used to filter out irrelevant terms from the captured dataset, synonyms and substitutions allow for the addition of specific terms. In the menu area there is the option to select a language in a drop-down menu. Below this there are two tabs for entering substitutions and synonyms. After the language has been selected, a list of substitutions or synonyms for the selected language appears in the selected tab. This list is initially empty and can be expanded using the corresponding input field.
Each entered term will be converted to lower case when stored.
Substitutions
This is a list of substitutions. Each entry maps a search term to a series of words. The 'Add new search term' field allows you to enter a search term. After confirming with 'Enter' or clicking on the button with the plus symbol, the word will appear together with an empty row of substitutions in the list below. Via the field 'Add new substitution' the corresponding substitutions can be assigned to the search term.
It is also possible to enter several comma-separated terms in the 'Add new search term' field. In this case, the first entered term will become the search term and the remaining words will become its substitutions. The same applies to the
After saving, a search for the entered term will be replaced by the terms in the corresponding row.
Furthermore, the buttons with the symbol of a pen or a trash can, which are visible per search word, allow editing or deleting a single entry of the list.
Synonyms
This is a list of synonym series. Entries of a synonym series are treated as synonymous by the search. This means that a search for one term from a series will return results for all terms from the series. The field 'Add new synonyms' allows for the creation of a new series. Here it is also possible to enter several comma-separated terms. In this case, they will appear as individual terms in a new synonym series in the list below.
Substitutions that work like synonyms are recognized by SmartSearch and will appear in the synonyms tab after saving.
Any change made to the selected list finally requires saving it using the button labeled "Save" displayed next to the drop-down menu. Only with this click will the adjustments be applied, even if they are already visible in the list beforehand.
2.2. Analysis
The Analysis area is divided into the submenus Statistics and Adaptable Results. They allow both the analysis of the performed search queries and the manual control of the hit sequence for individual search terms.
The following subchapters describe the submenus and the functions provided by them.
2.2.1. Statistics
The Statistics area allows the analysis of the search queries of a specific Prepared Search. The field of the same name lists all existing Prepared Searches for this purpose, from which the Prepared Search to be analyzed is to be selected by a checkbox.
Every fifteen minutes, there is an automatic update of the data to be analyzed, so that the analysis is always based on the most up-to-date data.
The following fields are used to specify the data to be analyzed:
- The drop-down menu Number of results indicates whether the 10, 25, 50 or 100 most searched keywords should be output for the analysis.
- The From and To fields define the time period to be considered.
- The Tags field allows the statistic to be customized by specific filter criteria. They are used to categorize search queries and can be defined during their execution. By default, tags are entered in the basic view, where they can be selected by simply clicking on them. All selected tags are connected by an And or Or link. In addition, the Advanced button provides a switch to a more complex input of the tags.
Examples of the usage of tags can be found in the tags section.
Switching from the basic to the advanced view is possible at any time. Switching in the other direction is only possible if the entered filtering can be mapped in the basic view.
The Adopt button performs the determination of the most frequent search words that result from the configuration made for the selected Prepared Search. They are then displayed as a list in the interface.
The Download CSV button provides both the defined configuration data and the search words determined on their basis in the form of a CSV file. In addition to this information, it contains a deep link and the absolute sum of all search queries - with and without filtering. The deep link enables the file to be downloaded again.
The CSV file which was generated from the statistics above contains the deep link (csv deep link) and the following statistic parameters:
- search terms (string)
- search frequency (count)
- selected time period (start, end)
- names of the selected Prepared Searches (here Mithras)
- query totals with (total) and without (total without filter) filter.
The deep link contained in the CSV is also provided by the Copy to clipboard > Deep Link CSV button. The Copy to clipboard > Deep Link Cockpit button provides a link that can be used to call up the Statistics area again, including the configuration made. This way, both the settings and the result based on them can be made available to third parties.
2.2.2. Adaptable Results
The SmartSearch stores the contents it determines in a multitude of different fields. These can be assigned a weight during the configuration of a Prepared Search to prioritize them differently. This gives them a so-called score, which is calculated automatically and determines the order of a search result list. The manual control of these search result lists, the so-called Adaptable Results, is done in the area of the same name.
It is also possible to create Adaptable Results for search terms that do not yield any results.
The area displays a list of all existing Adaptable Results, each of which matches the search query for a keyword of a Prepared Search.
Create an Adaptable Result
For the creation of a new Adaptable Result there is a separate view, which can be called by clicking on the button New Adaptable Result.
Within the view, a Prepared Search is to be selected and any search term is to be entered for which the order of the search results should be changed. A click on the button Adopt deactivates the two fields and, on the basis of the entries, determines the corresponding result list, which is displayed in the form of a table. The Reset button hides the table and reactivates the two fields.
In addition to the title, the link and the evaluated tokens of the displayed document, each entry in the table shows the score assigned to it, from which its position in the list results. This can be influenced by the buttons Elevate and Exclude which are also displayed for each entry:
- Elevate
The Elevate button highlights the search result in green and moves it to the first position in the results list. The arrow buttons that appear instead of the Elevate button can be used to specify the position of the entry even more precisely. It should be noted, however, that the positioning via the arrow buttons refers only to the prioritized entries. It is not possible to mix prioritized and non-prioritized entries. The button with the symbol of a minus sign can be used to cancel the prioritization.
- Exclude/Include
The Exclude button highlights the search result in yellow and removes it from the results list. This means that it will no longer be displayed in the frontend. The Include button that appears instead adds the search result back to the list.
Both elevations and exclusions are based on links that are available in the index at a certain point in time. Changing the search results list can therefore lead to situations where the links are no longer available and thus no longer detectable by a data generator. Due to their missing correspondence in the index, these links can no longer be displayed in the list view. They are so-called orphaned elevations or exclusions. In case of such entries, a warning appears in the detail view of an Adaptable Result. This lists the orphaned links and their document ids. Activating the checkmark under the warning triggers a cleanup when saving.
The Add page manually field allows manual addition of further documents by specifying their URL. These must already be included in the index of the data generators assigned to the selected Prepared Search. The corresponding URL must be entered in the form http[s]://www.newdocument.com and confirmed with the plus button. If the document is included in the index, it will be added to the results list. Otherwise, a corresponding note will appear.
Each change to the search results list must be persisted using the Save button. Only with this click are the adjustments applied, even if they are already visible in the list before.
List of existing Adaptable Results
After the Adaptable Result is saved and successfully created, it is added to the list of existing Adaptable Results in the area of the same name. The list is divided into three columns with the following information:
- Name
The search word used during the creation of the Adaptable Result also acts as its name and is shown in this column. If not only one search word but a list of search words is passed, these are sorted alphabetically.
- Prepared Search
The column contains the Prepared Search to which the search query refers. It is selected during the creation of the Adaptable Result.
- Last edit
The information in this column shows both the time and the editor of the last change to the Adaptable Result.
Edit an Adaptable Result
The button with the symbol of a pen allows editing an Adaptable Result. Clicking on it opens the detailed view with the entries made. These can be changed according to the previously described possibilities.
Delete an Adaptable Result
The button with the symbol of a trash can allows deleting an Adaptable Result. A click on this button opens a dialog with a confirmation request to confirm the deletion.
Alternatively, it is possible to select multiple Adaptable Results for deletion using the checkboxes visible in front of each Adaptable Result in the overview list. The Delete selected button removes these selected Adaptable Results after a confirmation request has also been confirmed beforehand.
2.3. Data
Prior to creating a search index, the SmartSearch accesses the information to be collected on the customer side. This can exist in the form of websites, portals or databases. Their gathering is done via so-called data generators, which provide different configuration options. The data persisted on the Solr server can then be viewed via the Browser, which is also provided by the SmartSearch cockpit. In addition to the contents, this also displays all metadata.
The following subsections describe both the data generators and the Browser in detail.
2.3.1. Data Generators
The SmartSearch provides so-called data generators for the customer-side collection of the required data. Their creation and administration takes place in the interface of the same name, which can be called up via the corresponding menu entry.
The area shows the list of all already existing data generators and is initially empty. When creating further data generators, three types have to be distinguished:
- Web
- XML
- External
In addition to these data generator types relevant for manual creation, there are others which cannot be created manually. An example of this is the
Create a data generator
For the creation of a new data generator there is a separate view, which can be called by clicking on the button New Data Generator and is divided into three tabs. Besides the two tabs General and Enhancer, which are identical for all data generators, each of them has an additional specific tab.
The following sections first describe the two identical tabs before an explanation of the specific tabs follows:
- General
Within the General tab, first a name for the data generator to be created has to be specified along with a default language. The counter Minimum number of documents for synchronization defines a threshold. From this threshold upwards, the collected data is persisted into the index created by the Solr server. The remaining settings allow a regular and automatic collection of the required data and can be enabled or disabled via the checkbox Start regularly. For this purpose, a daily or weekly start interval can be selected and the next start time can be defined by specifying a date and a time.
In this case, the languages correspond to the values provided by ISO 639-1, meaning language codes consisting of two characters. If a data generator finds longer language codes on documents, everything after the second character is truncated. This way, for example, the language code de_DE becomes the value de during generation.
Figure 17. Data generator - General
- Enhancer
The Enhancer tab enables the creation of configured Enhancers. They are used to modify the data collected by the data generator before it is persisted on the Solr server. Per data generator configuration, multiple Groovy script Enhancers can be defined. These are processed in the order in which they are stored in the configuration. The creation of an Enhancer is done by clicking on the New Enhancer button.
A detailed description of the Enhancers that can be created is included in the Enhancer chapter below.
- Web Crawler
The Web Crawler tab is specific to the data generator of the same name and is therefore only displayed for it. It collects all the content of an existing web presence. This includes HTML content (but without HTML comments contained in it) as well as all files of any format. To do this, it first requires a start URL and the maximum number of pages to be covered. If the start URL corresponds to a sitemap, the links contained in the sitemap are successively captured. By specifying the value zero for the number of pages to be captured, the entire web presence is captured.
To ignore HTML content when generating the search index, it must be marked by HTML comments of the form <!-- NO INDEX --> and <!-- INDEX -->.
If the collected pages contain links to other content, the data generator follows them and gathers the linked content as well. Canonical links are also taken into account. This way, it captures the web presence specified with the start URL either completely or up to the specified limit.
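For illustration, the following sketch shows how the exclusion markers mentioned above could be used in a page; the surrounding markup is a purely hypothetical example:
<main>
  <p>This text is captured and included in the search index.</p>
  <!-- NO INDEX -->
  <nav>This navigation block is ignored when the search index is generated.</nav>
  <!-- INDEX -->
  <p>Indexing continues as usual from this point on.</p>
</main>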
Links to be ignored by the data generator must be marked with the attribute rel="nofollow".
A self-implemented Groovy script specified in the Check URL Script field can be used to individually influence the gathering and processing of linked content. The Validate button checks the created script for possible errors. The script determines whether a link will be captured: If the script returns true, every link found by the crawler will be captured, which poses the risk of creating a never-ending crawler.
return true; // Caution: This could lead to a never-ending crawler
return url.startsWith("https://www.mithrasenergy.com") && !url.contains("/shop/"); // Capture only links below the company website, excluding the shop area
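A slightly more elaborate sketch of such a script is shown below. The host names and path segments are placeholders; as in the examples above, it is only assumed that the current link is available in the url variable and that the script returns a boolean value:
// Capture only links that belong to one of the permitted hosts ...
def allowedHosts = ["https://www.example.com", "https://docs.example.com"]
boolean hostAllowed = allowedHosts.any { host -> url.startsWith(host) }
// ... and skip large binary downloads as well as endless calendar views.
boolean unwantedTarget = url.endsWith(".zip") || url.endsWith(".mp4") || url.contains("/calendar")
return hostAllowed && !unwantedTarget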
The Web Crawler recognizes sitemaps in XML format automatically if they are specified as the start URL. The documents listed in them are processed and crawled like HTML links. In addition, it is also possible to use an index sitemap as the start URL. Please note that both XML sitemaps and index sitemaps are not documents by themselves and are therefore not included in the index. A sitemap.xml referenced within an HTML file is processed as a normal document and not as a sitemap. To process a URL as a sitemap, the URL must be listed directly as the start URL or as the content of a sitemap index file.
The data generator can only follow references from HTTP to HTTP(S) pages. It cannot switch from HTTPS to HTTP. This also applies if an index sitemap is used and other sitemaps are referenced within it.
The remaining settings allow you to specify the data generator’s search queries:
- By de-/activating the checkbox Ignore no-script areas in HTML it is possible to determine whether the content contained in these areas is also to be collected.
- If the observe robots.txt checkbox is enabled, the data generator takes into account the restrictions defined in the robots.txt file of the respective web presence. Otherwise, it ignores them and gathers the entire content of the website.
- If the captured links contain additional parameters, these can be skipped by activating the Remove parameters from URL checkbox. This way, redundancies in the captured documents can be avoided.
- The Charset field specifies which charset is used to persist the collected data.
- For both collecting data and retrieving persisted documents, the SmartSearch provides the smart-search-agent user agent. Its reference name is contained in the corresponding field and can be overwritten as desired.
- The value defined for the Interval between two downloads determines how fast the SmartSearch submits the individual search queries. This way, an overload of the web server due to a too high query frequency can be avoided.
- If the data generator does not receive a response to a search request, the Read and connect timeout indicates the amount of time before the data acquisition is aborted.
When collecting the contents of an existing website, the Web Crawler may receive different status codes. These are processed as follows:
- 2xx: These status codes correspond to the standard codes. In this case the contents are processed as usual by the Web Crawler.
- 301: This status code corresponds to a redirect. In this case the Web Crawler adopts the URL of the new page.
- 302 and 307: These two status codes also indicate a redirect. Even in this case, the Web Crawler adopts the contents of the new page, but the old link is retained.
- 4xx and 5xx: These status codes specify error statuses. The affected pages are ignored by the Web Crawler and the error is logged as DEBUG.
For the redirect links resulting from a 3xx status code, the Strip Parameters configuration is taken into account.
The Web Crawler first tries to read a language tag from the following tags in the source code of the crawled HTML document:
- <html lang="en">
- <body lang="en">
- <meta name="language" content="en">
- <meta http-equiv="content-language" content="en">
- <meta property="og:locale" content="en_US">
- <meta property="DC:language" content="en">
- <meta property="og:locale:alternative" content="en_GB">
- <meta property="og:locale:alternative" content="en_AU">
For the first of the above tags that was found, the specified language will be interpreted and stored as the language of the document.
If none of the tags are present, the Web Crawler tries to detect a language automatically from the content area of the document. If this fails, the default language maintained at the data generator is stored at the document.
By default, the SmartSearch stores the input language of a document in the language_original_stored_only field. This storage happens regardless of the changes made to a document's language during language recognition and processing.
For example, if the SmartSearch processes the en_US language passed to the document in the language_keyword field and changes it to the supported language en, the value of the language_original_stored_only field will still be en_US.
Figure 18. Web Crawler
- XML Crawler
The XML Crawler tab is also specific to the data generator of the same name and is therefore only displayed for it. It captures XML data from a defined source that does not exist in the form of a web page. For this purpose, an XML URL must be specified in the corresponding field. For the target of this URL there are two file types:
- Data XML file: The data XML file corresponds to a specific source. It includes the documents to be processed by the data generator in a single file.
- Seed XML file: If there is a need to record multiple sources, the URL of a seed XML file can be passed. The data generator then follows the references contained in this file, the number of which is not limited.
Collecting XML sitemaps is possible only with the help of the web data generator.
Access to the files to be processed is performed exclusively via HTTP(S) requests. Acquiring XML files from the SmartSearch server file system or other file sources is not possible. If the data generator does not receive a response to a search request, the Read and connect timeout indicates the amount of time before the data acquisition is aborted.
For both collecting data and retrieving persisted documents, the SmartSearch provides the smart-search-agent user agent. Its reference name is contained in the corresponding field and can be overwritten as desired.
Figure 19. XML Crawler
Data XML file
The XML Crawler captures XML data that comes from either a single source or multiple sources. A data XML file corresponds to a single source. It contains any number of documents that the SmartSearch processes in the context of an XML data generation. When an XML data generator which captures such a file is started, all documents contained in the file are added to the search index. However, this assumes that the documents are valid.
A valid data XML file meets the following requirements:
- The root element is a documents element.
- The documents element contains any number of document elements:
  - A document element represents an entity to be found by the search.
  - A document element has a uid attribute whose value is a unique alphanumeric document ID.
  - The document element contains any number of field elements:
    - Different document elements may consist of different numbers of fields.
    - A field element has a name attribute whose value is the desired field name in the search index.
    - A field element contains the value that the field should have in the document.
    - Multiple values (e.g. for fields with the dynamic field type *_keyword) can be created by entering field elements with identical name attribute values but with different tag contents (see the category_keyword values in the example below).
    - The processing of a field value depends on the naming convention of the field name. For an explanation of the available naming conventions and field types, see section Available fields and field types.
    - It is recommended to always enclose field values within a CDATA block (see example).
The following example shows an exemplary structure of an XML data file. Such a file can directly serve as XML URL for an XML data generator.
Example of a data XML file
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<documents>
    <document uid="0815">
        <field name="title"><![CDATA[The explanatory video for the SmartSearch XML file]]></field>
        <field name="link"><![CDATA[https://www.youtube.com/watch?v=...]]></field>
        <field name="content">
            <![CDATA[The content here tells you how to create a SmartSearch XML file.]]>
        </field>
        <field name="language_keyword"><![CDATA[en]]></field>
        <field name="publishing_date"><![CDATA[ 2019-05-23T10:00:01Z ]]></field>
        <field name="favorites_count_long"><![CDATA[34]]></field>
        <field name="category_keyword"><![CDATA[socialmedia]]></field>
        <field name="category_keyword"><![CDATA[press]]></field>
        <field name="meta_keywords"><![CDATA[SmartSearch explanatory video facets]]></field>
    </document>
    ...
    <document uid="4711">
        <field name="title"><![CDATA[The SmartSearch help pages.]]></field>
        ...
    </document>
</documents>
A document is only valid if the mandatory fields are present. It is also recommended to pass the language of the document via the corresponding language field (see the language_keyword field in the example above).
Seed XML file
In contrast to the data XML file, a seed XML file provides data from different sources. It therefore contains an unlimited number of references to data XML files or further seed XML files. Specifying additional seed XML files allows for an arbitrarily deep tree structure. When the seed XML file is passed, the data generator follows all these references and handles them as if they had been specified directly as an XML URL. Also in this case all documents must be valid.
A valid seed XML file meets the following requirements:
- The root element is a seeds element.
- The seeds element contains any number of seed elements:
  - A seed element represents a data XML file or another seed XML file.
  - A seed element is empty, but has a url attribute with the reference target.
The following example shows a seed XML file that contains references to both data and seed XML files:
Example of a seed XML file
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<seeds>
    <!-- References to data XML files -->
    <seed url="https://www.customer.com/.../.../documents-01.xml" />
    <seed url="https://www.customer.com/.../.../documents-02.xml" />
    <!-- References to other seed XML files -->
    <seed url="https://www.customer.com/.../.../another-seed.xml" />
</seeds>
- External Crawler
The external data generator does not collect any information by itself, but only records data that it receives externally via the REST interface. It therefore provides no additional configuration options in its specific tab. The settings made in the General and Enhancer tabs are sufficient for it.
The transfer of data to the REST interface is to be realized by a self-implementation. The information required for this is contained in the documentation SmartSearch REST API for external data generators.
The external data generator will be replaced in its functionality by the generic API in the future and will not be developed further. New projects should therefore use this new approach to index externally generated documents.
List of existing data generators
After the data generator is saved and successfully created, it is added to the list of existing data generators in the tab of the same name. The list is divided into a four-column table with the following information:
- Name
The column contains the name of the data generator specified during generation.
- Type
This column indicates whether the data generator is of type Web, XML or External.
- Status
This column shows the data generator's current state. Depending on the operation, it can have one of the following statuses: Ready, Record, First or Second Enhancer, Synchronized or Error.
- Last edit
The information in this column shows the time and the editor of the last change to the data generator.
Edit a data generator
The button with the symbol of a pen enables editing the data generator. Clicking it opens the detailed view with all the configurations made. These can be changed according to the previously described possibilities.
Execute a data generator
The button with the play icon manually starts the execution of the data generator. It then collects the data from the configured source and transfers it to the Solr server.
The persistence of the data by the Solr server takes place at regular intervals. Therefore, it can take up to fifteen minutes until the data is retrievable. The Reload Collection button enables immediate synchronization. It is visible only to members of the
Deleting a data generator
The button with the symbol of a trash can enables the deletion of a data generator. Clicking it opens a dialog with a confirmation request to confirm the deletion.
Alternatively, it is possible to select several data generators for deletion via the checkboxes that are visible in the overview list in front of each data generator. The button Delete selected removes these selected data generators after a confirmation request has been confirmed.
Erasing the indexed documents of a data generator
The button with the symbol of an eraser enables the emptying of a data generator's data, deleting all documents from the index. Clicking the button Empty selected opens a dialog with a confirmation request to confirm the erasure of the data.
2.3.2. Browser
For the display of the data collected by the data generator and persisted by the Solr server, the SmartSearch provides a Browser. This can be called up via the corresponding menu item.
For the display, a data generator must first be selected in the drop-down menu. The Browser then lists all documents that have been collected by this data generator. Clicking on such a document will also display all fields and the meta information stored in them. The Previous document and Next document buttons allow browsing through the captured documents, whose total number can be read between them.
2.3.3. Preparation of data by Enhancer
In the course of data generation, it is often necessary to adjust the documents prepared for the index. This is required to increase the discoverability of documents, to remove invalid documents or to increase data quality in general.
The SmartSearch uses so-called Enhancers for this purpose. These are both supplied and project-specifically configured components that are part of a data generator configuration. During each data generation, the Enhancers are executed for each captured document and their effects are applied to the document to be indexed.
Generally, there are three types of Enhancers, which are executed in the following order:
- General Enhancer
The general Enhancers are provided by the SmartSearch. They are transparent to users and cannot be customized via configuration. They allow, for example, speech analysis to be performed, the results of which are usable in the following configured Enhancers.
- Configured Enhancer
With the configured Enhancers, the SmartSearch provides the ability to configure custom logic for processing and customizing documents as part of a data generation process. The types of configured Enhancers and their use cases are explained in the following section.
- Finishing Enhancers
The finishing Enhancers are provided by the SmartSearch, analogous to the general Enhancers. They are also transparent to the users and likewise cannot be customized via configuration.
The Enhancer tab in the configuration view of a data generator enables the creation of configured Enhancers. These are used to modify the data collected by the data generator before it is indexed on the Solr server. Per data generator configuration, multiple Groovy script Enhancers can be defined. These are processed in the order in which they are stored in the configuration. The creation of an Enhancer is done by clicking on the New Enhancer button.
Groovy script Enhancer
This Enhancer allows the inclusion of a Groovy script to be implemented by the user. The editing mask required for this opens after creating the Enhancer via the menu item Groovy Script Enhancer. Such a script allows additional modifications of the dataset, as well as access to the data on the document generated during the execution of the general Enhancers. For example, additional documents can be added to the dataset or existing documents can be edited or deleted. An input made can be checked for possible syntax errors using the Validate button.
The following code example shows the structure of a Groovy script Enhancer:
void manipulate(final Document document, final String html, final org.jsoup.nodes.Document jsoupDocument) {
// Only the method body is to be filled.
// The signature is automatically bracketed around the configured code in the edit mask.
}
The Groovy script is passed an object document, a string html and an object jsoupDocument:
- The document object contains the SmartSearch document currently being processed. Additional data can be added to or removed from this document during processing:
document.addData("field name", "field value")  // adds a field with the given value to the document
document.removeData("fieldToRemove")           // removes a field from the document
- In the context of a Web data generator, the html string contains the HTML of the page originally captured by the crawler, which is currently being converted into a SmartSearch document. For other kinds of data generators, this object is not relevant.
- In the context of a Web data generator, the object jsoupDocument contains a representation of the page as a Jsoup document. This allows object-based processing of the page:
document.addData("page title", jsoupDocument.title())
This object is also relevant only for data generators of type WEB.
If a document is not to be included in the index during generation, this can be achieved by defining the boost value 0:
document.setBoost(0)
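Putting these pieces together, the following sketch shows what the body of a Groovy script Enhancer for a Web data generator could look like; the field name description_stored_only and the selection logic are purely illustrative assumptions:
// Enrich the document with the meta description of the crawled page, if present.
String description = jsoupDocument.select("meta[name=description]").attr("content")
if (!description.isEmpty()) {
    document.addData("description_stored_only", description)
}
// Exclude documents without a usable title from the index via a boost value of 0.
if (jsoupDocument.title().trim().isEmpty()) {
    document.setBoost(0)
}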
Further examples of Groovy scripts and their usage are included in the Examples of Groovy Scripts Enhancers chapter.
Extracting names and nouns
According to its name, this Enhancer extracts names and nouns and stores them in the fields names_multi_keyword and nouns_multi_keyword.
This Enhancer can only be used for the language
2.3.4. Available fields and field types
With the SmartSearch, a data structure - the so-called Solr schema - is delivered, which defines the fields existing for captured documents. After a data gathering, these fields on the document contain structured contents, which are stored in the index in relation to the document. Since the contents of various fields potentially affect the results of a search query, some relevant fields are explained below. Some fields are automatically created and populated with values during the run of data generators. In addition, the Groovy Script Enhancer provides the possibility to add more fields and values.
Basically, a field consists of an identifier and one or more values. There are three types of fields, which can differ in their use and function:
- mandatory fields
- static fields
- dynamic fields
A field can have one or more values depending on the field type. For more information, see the section "Fields with multiple values per document".
To use the SmartSearch it is important to understand the use cases and content of the available fields. The following tables give an overview of the three field types and the relevant part of the available fields and outline their content:
Mandatory fields
The identifiers as well as the values of mandatory fields are obligatory on every document; the former are also unchangeable.
The following table shows an overview of fields of this type:
Identifier | Description | Content
---|---|---
id | Unique id of a document | With the content of this field, each document in the index is uniquely identifiable.
title | Title of a document | For example, the title of a document can be used as a headline on a search results page.
content | Content of a document | The content of the document is - along with the title - the content basis for the search function. Content that is relevant for finding the document is usually to be written into this field during data generation.
link | URL of a document | The URL of a document is the jump label with which the source of the indexed document can be retrieved by the end user during a search process. This can be, for example, a link after the title of the document, which then leads to the actual document.
Static fields
The identifiers of static fields are unchangeable, just like those of mandatory fields. Unlike the latter, however, static fields do not necessarily have a value.
The following table lists some typical static fields:
Identifier | Description | Content
---|---|---
thumbnail | Compact view as image link | To display the compact view of the documents in a search results page, the URL of such a compact view must be stored in this field.
keywords | Keywords | This field is intended for terms that should not undergo any linguistic processing, such as stemming, before being included in the index.
language | Language abbreviation | This field contains the language abbreviation according to ISO 639-1. The chapter data generators contains more information about this.
Dynamic fields
Dynamic field identifiers follow a naming convention and may or may not be present depending on the usage and document. Dynamic fields are intended to make the schema flexible for future necessary data and thus the application more robust.
Dynamic fields behave like static fields except that their identifiers contain a placeholder. If documents are indexed by starting a data generation, data that cannot be assigned to a static field can be caught by a dynamic field.
For example, a document may contain a custom_date field that does not exist in the Solr schema. The data in this field would still be included in the index via the *_date dynamic field.
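As a minimal sketch, such a field could be delivered in a data XML file as follows (the field name custom_date and its value are purely illustrative); it would then be caught by the *_date dynamic field:
<field name="custom_date"><![CDATA[2021-04-01T08:30:00Z]]></field>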
For more information on dynamic fields, see the Solr documentation.
An overview of the most important dynamic fields available can be found in the following table:
Identifier | Description | Content | Supported functionalities |
---|---|---|---|
*_integer | Integer values | Fields of this type contain values from the value range Integer (32-bit signed integers). | |
*_long | Long values | Fields of this type contain values from the value range Long (64-bit signed integers). | |
*_double | Double values | Fields of this type contain values from the value range Double (64-bit IEEE floating point). | |
*_date | Date values | Fields of this type contain date values and are to be used accordingly, e.g. for chronological sorting. The values are given in UTC format, for example 2023-05-01T12:00:00Z. | |
*_date_range | Date values | Fields of this type carry a different kind of date values: unlike *_date fields, they can additionally contain date ranges. | |
*_keyword | Keywords | Like the static field keywords, fields of this type contain terms that should not undergo any linguistic processing before being included in the index. | |
*_keyword_lc | Keywords in lowercase | Fields of this type contain keywords, which were automatically converted to lowercase during indexing. | |
*_sort | Sort values | This field type allows storing sortable values. The values in these fields are converted to lowercase and cleansed of spaces. | |
*_sort_de | Sort values - special case for German | This field type behaves analogously to *_sort, but additionally takes German language specifics such as umlauts into account. | |
*_token | Grouping values | This field type allows grouping documents based on the field contents, for example, document or product categories. | |
*_pnt | Spatial point values | Fields of this type contain spatial point values, e.g. to store geographic information such as latitude/longitude coordinates. | |
*_facet_string | Text facet values | Fields of this type contain textual values for use as facet attributes. These fields are created and populated with facet values automatically when the corresponding facet identifiers are used. | |
*_facet_double | Facet values for floating point numbers | Fields of this type contain numeric values for use as facet attributes. | |
*_stored_only | Unprocessed values | Fields of this type contain content with up to 32766 characters, which is not made searchable or processed in any other way. However, it can be queried when retrieving documents for display. | |
*_stored_only_big | Large unprocessed values | Fields of this type contain content of unlimited size, which is not made searchable or processed in any other way. However, it can be queried when retrieving documents for display. | |
*_autocomplete | Terms for auto-completion | Fields of this type contain terms that can be used for auto-completion in addition to the content of the document. Terms to be used in this sense are specified in this field as a list separated by blank characters. | |
*_expanded_autocomplete | Terms for cross-word auto-completion | Fields of this type contain terms that can be used for auto-completion across two words, in addition to the content of the document. Terms to be used in this sense are specified in this field as a list separated by blank characters. | |
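As a rough illustration of how several of these dynamic field types might appear on one document, consider the hypothetical payload below. The field names and values are invented, and the date notation is an assumption based on common Solr conventions.

```python
# Hypothetical document using some of the dynamic field types listed above.
# Field names and values are invented; the date notation is an assumption based
# on common Solr conventions and may differ in your installation.
document = {
    "stock_integer": 42,                       # caught by *_integer
    "price_double": 19.99,                     # caught by *_double
    "published_date": "2023-05-01T12:00:00Z",  # caught by *_date (assumed UTC/ISO-8601 notation)
    "category_token": "garden-furniture",      # caught by *_token, usable for grouping
    "color_facet_string": "green",             # caught by *_facet_string, usable as a facet attribute
    "teaser_stored_only": "Raw teaser text that is stored but not searchable.",
}
```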
Static/dynamic fields for language abbreviations
In addition to the types of fields described above, further fields exist which can be present per language on documents and whose naming schemes are illustrated in the following table:
Naming scheme | Description | Content | Supported functionalities |
---|---|---|---|
*_text_de | Language-dependent textual content | These fields contain content that is processed textually depending on the language. | |
*_autocomplete_de | Language-dependent terms for auto-completion | Fields of this type contain language-dependent terms, which can be used for auto-completion in addition to the content of the document. Terms to be used in this sense are specified in this field as a list separated by blank characters. | |
*_expanded_autocomplete_de | Language-dependent terms for cross-word auto-completion | Fields of this type contain language-dependent terms, which can be used for auto-completion across two words in addition to the content of the document. Terms to be used in this sense are specified in this field as a list separated by blank characters. | |
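These naming schemes combine a base name, a type suffix and the ISO 639-1 language abbreviation. The tiny helper below only illustrates this convention; the function name and the example base names are invented.

```python
# Illustrative sketch of the language-dependent naming scheme shown above.
def language_field(base: str, suffix: str, language: str) -> str:
    """Build a language-dependent field name, e.g. ('title', 'text', 'de') -> 'title_text_de'."""
    return f"{base}_{suffix}_{language}"

print(language_field("title", "text", "de"))           # -> title_text_de
print(language_field("teaser", "autocomplete", "en"))  # -> teaser_autocomplete_en
print(language_field("slogan", "expanded_autocomplete", "fr"))  # -> slogan_expanded_autocomplete_fr
```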
Fields with multiple values per document
When using SmartSearch, multiple values can in certain cases be indexed for one field name. This is not possible for every static and dynamic field.
Which static and dynamic fields support multiple values can be found in the table in section "Available fields and field types". User-defined fields, which are not mapped to any of the table values, generally allow multiple values.
If multiple values are passed for a field which does not support them, the document will still be indexed. However, only one value is stored for the field, usually the first value passed. In this case, a corresponding warning can be found in the log of the application. |
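As a sketch, a multi-valued field could be passed as shown below. The field name product_tags stands for a user-defined field, which generally allows multiple values; all values are invented.

```python
# Hypothetical payload illustrating multiple values for one field name.
# "product_tags" stands for a user-defined field (not mapped to any table value),
# which generally allows multiple values; all values are invented.
document = {
    "product_tags": ["delivery", "shipping", "returns"],  # several values for one field
    "language": "en",                                      # single-valued field
}
# If several values were passed for a field that does not support them, only one
# value (usually the first one) would be kept and a warning would appear in the log.
```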
2.3.5. Language details
The SmartSearch persists the content collected on the customer side in a variety of different fields and provides the Prepared Searches for querying them. During execution, a language can be passed to these in order to process the request in a language-dependent manner. The table below shows an overview of all languages currently configured for the SmartSearch, including their abbreviations (according to ISO 639-1) and the features available for them. The features are applied automatically when processing the captured information. More information about the individual languages is available in the Solr documentation.
Currently, the SmartSearch assigns the following features to the languages:
- Lower case: The internal persistence of the recorded data is done in lower case. This has no effect on the output of the information and is only used for its processing.
- Normalization: Normalization is used to align similar spellings of words. In Scandinavian languages, for example, the characters æÆäÄöÖøØ (as well as the variants aa, ao, ae, oe, and oo) are substituted to form åÅæÆøØ. The different spellings blåbärsyltetöj and blaabaersyltetoej thus result in the normalized form blåbærsyltetøj.
- Stemming: Stemming enables the linking of different morphological variants of a word with their common root. For example, the terms fished, fishing and fisher have the common root fish (see the illustrative sketch after this list).
- Stop words: The stop words are a list of terms that occur very frequently in a language. They include, for example, articles, pronouns and linking words. Due to their frequency, they should not be considered for the search index. The delivery contains such a list for each language, which can be configured via the SmartSearch cockpit.
- Synonyms: For languages that allow synonyms, the SmartSearch cockpit offers the possibility to create a list that assigns other, similar words to a term. These words are treated the same in a search.
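The interplay of lower casing, stop word removal and stemming can be pictured with the following toy sketch. It is purely conceptual: the actual processing is performed by the configured Solr analyzers, and the word list and suffix rules below are invented.

```python
# Purely conceptual toy sketch of lower casing, stop word removal and stemming.
# The real processing is done by Solr analyzers; word lists and rules are invented.
STOP_WORDS = {"the", "a", "and", "of"}

def toy_analyze(text: str) -> list[str]:
    tokens = [t.lower() for t in text.split()]            # lower case
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    stems = []
    for t in tokens:                                       # toy "stemming"
        for suffix in ("ing", "ed", "er"):
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(toy_analyze("The fisher fished and the fishing boat"))
# -> ['fish', 'fish', 'fish', 'boat']
```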
Language | Abbreviation | Lower case | Normalization | Stemming | Stop words | Synonyms |
---|---|---|---|---|---|---|
Arabic | ar | ✔ | ✔ | ✔ | ✔ | ✔ |
Bengali | be | ✔ | ✘ | ✘ | ✔ | ✔ |
Bulgarian | bg | ✔ | ✘ | ✔ | ✔ | ✔ |
Chinese Traditional | zh-tw | ✔ | ✔ | ✘ | ✔ | ✔ |
Chinese Simplified | zh-cn | ✔ | ✔ | ✔ | ✔ | ✔ |
Croatian | hr | ✔ | ✘ | ✘ | ✔ | ✔ |
Czech | cs | ✔ | ✘ | ✔ | ✔ | ✔ |
Danish | da | ✔ | ✔ | ✔ | ✔ | ✔ |
Dutch | nl | ✔ | ✘ | ✔ | ✔ | ✔ |
English | en | ✔ | ✘ | ✔ | ✔ | ✔ |
Estonian | et | ✔ | ✘ | ✘ | ✔ | ✔ |
Finnish | fi | ✔ | ✘ | ✔ | ✔ | ✔ |
French | fr | ✔ | ✔ | ✔ | ✔ | ✔ |
German | de | ✔ | ✘ | ✔ | ✔ | ✔ |
Greek | gr | ✔ | ✘ | ✘ | ✔ | ✔ |
Hungarian | hu | ✔ | ✘ | ✔ | ✔ | ✔ |
Indonesian | id | ✔ | ✘ | ✔ | ✔ | ✔ |
Italian | it | ✔ | ✔ | ✔ | ✔ | ✔ |
Japanese | ja | ✔ | ✘ | ✘ | ✔ | ✔ |
Korean | ko | ✔ | ✔ | ✘ | ✔ | ✔ |
Latvian | lv | ✔ | ✘ | ✔ | ✔ | ✔ |
Lithuanian | lt | ✔ | ✘ | ✔ | ✔ | ✔ |
Malay | ms | ✔ | ✘ | ✘ | ✔ | ✔ |
Norwegian | no | ✔ | ✘ | ✔ | ✔ | ✔ |
Polish | pl | ✔ | ✘ | ✔ | ✔ | ✔ |
Portuguese | pt | ✔ | ✘ | ✔ | ✔ | ✔ |
Romanian | ro | ✔ | ✘ | ✔ | ✔ | ✔ |
Russian | ru | ✔ | ✘ | ✔ | ✔ | ✔ |
Serbian | sr | ✔ | ✔ | ✘ | ✔ | ✔ |
Slovak | sk | ✔ | ✘ | ✘ | ✔ | ✔ |
Spanish | es | ✔ | ✘ | ✔ | ✔ | ✔ |
Swedish | sv | ✔ | ✔ | ✔ | ✔ | ✔ |
Thai | th | ✔ | ✘ | ✘ | ✔ | ✔ |
Turkish | tr | ✔ | ✔ | ✔ | ✔ | ✔ |
Ukrainian | uk | ✔ | ✘ | ✔ | ✔ | ✔ |
Vietnamese | vi | ✔ | ✘ | ✘ | ✔ | ✔ |
2.4. System
The area System has the User Management as its only subitem. It is used to create and configure users and groups and is therefore divided into two tabs with the same names. The following subchapters explain the functionality of the two tabs in detail.
If the user and group management is realized on the basis of LDAP, the SmartSearch cockpit only has read access to it. In this case, the cockpit is only used to list the users and groups, not to edit or delete them. Only the administration of permissions is still handled via the SmartSearch cockpit in the LDAP case. |
2.4.1. Users
Access to the SmartSearch cockpit requires an active user. For this reason, a master admin exists for the initial login, with which further users can be created. The permissions of a user result from the group membership defined when the user is created. Users are created and administered in the tab of the same name in the User Management.
If the user management is realized on the basis of LDAP, the SmartSearch cockpit only has read access to it. Creating, editing and deleting users is not possible within the cockpit in this case. |
The tab shows a list of all existing users, which initially contains only the master admin.
Create users
For the creation of a new user there is a separate view, which can be called up by clicking the New User button.