SmartSearch - Administrator’s Manual

e-Spirit AG

09.09.2021
Table of Contents

1. Introduction

The SmartSearch addresses the requirements that customers place on the search function of an online presence: an intuitive, high-performance search solution that can be used on extensive websites and delivers relevant results. It offers both high hit quality and optimum search convenience, thus retaining customers on the website.

At the same time, it provides editors with a web interface through the integrated SmartSearch Cockpit that can be used without IT knowledge. Editors from specialist and marketing departments are thus enabled to control and monitor search results on the web interface. For this purpose, the cockpit provides statistics, filters and analysis functions and allows the indexing of various data types (for example XML, audio, video, media) from different data sources. With the help of individualized hit lists, editors can prioritize and weight search results in the back end and display selected content for predefined search queries.

This document is intended for administrators and therefore only describes the required installation and configuration steps. All functionalities and use cases provided by the SmartSearch are included in the SmartSearch documentation.

1.1. Architecture

The functionalities of the SmartSearch are realized by an architecture made up of a range of different components (see figure Architecture).

These components are:

  • ZooKeeper
  • Solr
  • SmartSearch
Figure 1. Architecture


The individual components always interact according to the following schema:

  • Prior to creating the search index, the SmartSearch must collect the required data. For this, it accesses the information to be collected on the customer side, which can exist in the form of websites, portals or databases. In addition, a REST interface provides the option of filling the search index with further data from outside.
  • After that, the SmartSearch server normalizes the data and transfers it to the Solr server. The server receives the data and persists it in an index.
  • The query for data is done equivalently: The SmartSearch server receives the request, modifies it, and then forwards it to the Solr server. This responds with a search result, which the SmartSearch server returns to the customer’s end application via the REST interface.
  • The SmartSearch Cockpit is to be seen detached from the other components. It serves the administration of the SmartSearch server and therefore offers a simple, web-based administration interface. Among other things, search solutions can be created and configured in this interface.
  • Configurations made in the SmartSearch Cockpit are saved on the ZooKeeper server together with the Solr configuration data.

Communication with the outside is protected by HTTPS; communication between the components takes place via HTTP.

1.2. Technical requirements

To use the SmartSearch, the following technical requirements must be met:

  • Java 11 or higher
  • ZooKeeper version 3.4.10
  • Solr version 8.1.1 in SolrCloud mode
  • SmartSearch in the latest version

ZooKeeper and Solr are not included in the delivery. They must therefore be downloaded before installation in the version specified.

2. Installation and Configuration

In order to use the functionalities of the SmartSearch, the individual components must first be installed, which can be distributed as required and are freely scalable. It is essential that they are installed in the following order so that the basic configurations can be made correctly:

  1. ZooKeeper
  2. Solr
  3. SmartSearch

ZooKeeper and Solr are not included in the delivery. They must therefore be downloaded before installation in the version specified in the technical requirements.

Performing the installation of the components as root is not recommended. Instead, use a technical user with the appropriate access rights. The name of the technical user is expected to match the name of the component in each case.

Solr creates this user automatically. Manual creation is therefore not necessary in this case.

Since most computer systems are based on Linux, the following subchapters also concentrate exclusively on installation under Linux. The specified commands refer to the following system:

  • Ubuntu 18.04 LTS (Bionic Beaver)
  • OpenJDK 11

Furthermore, the scenario of simple redundancy is assumed for the description of the installation (see figure Simple redundancy). The two nodes N1 and N2 shown in the following figure correspond to two physical or virtual systems, on each of which a ZooKeeper, Solr, and SmartSearch instance is to be installed. Such a cluster operation of at least two nodes ensures basic failover and data redundancy.

If the aspects of failover and data redundancy are not relevant, the entire stack may also be installed on a single node. The chapter on the operation of a single node describes the differences to be considered for this.

Figure 2. Simple redundancy


The described scenario can be considered as the basis for a cluster operation with any number of nodes.

Figure 3. Cluster operation with any number of nodes


2.1. Ports used

Various ports are mentioned in the context of the following chapters on the installation of the various components. The following table shows an overview of these ports and explains their meaning.

Table 1. Ports used

  Port   Meaning
  8181   Port of the SmartSearch Cockpit; also provides access to the REST API.
  8983   Access to the Solr web interface and the API.
  2181   Client communication of a ZooKeeper instance.
  2888   Communication between the ZooKeeper instances.
  3888   Also used for communication between the ZooKeeper instances.



2.2. ZooKeeper

To back up the configurations of Solr and the SmartSearch, it is necessary to install and configure two ZooKeeper instances, which are not included in the delivery. To do this, download ZooKeeper in the version specified in the technical requirements and install it on both systems in the ZooKeeper directory to be created. This folder can be located anywhere in the respective system.

By default, the assumption is that the installation directory to be created is located in the opt directory of the associated system. Deviations must be adjusted accordingly in the ZooKeeper.service file.

Then create the necessary configuration file under conf/zoo.cfg for each system with the following command.

Creating the configuration file (to be executed on both systems). 

cp ~/zookeeper/conf/zoo_sample.cfg ~/zookeeper/conf/zoo.cfg

In the same directory, create a file named java.env to adjust the amount of memory available to ZooKeeper. In this file, specify the parameters to change the memory for ZooKeeper.

Creating the java.env file (to be executed on both systems). 

touch ~/zookeeper/conf/java.env
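The file could then contain, for example, a JVMFLAGS definition such as the following sketch. The heap sizes are assumptions and must be adapted to the available hardware:

```shell
# java.env is sourced by ZooKeeper's start scripts; JVMFLAGS passes
# options to the JVM. The heap sizes below are assumptions.
export JVMFLAGS="-Xms512m -Xmx1g"
```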

By default, ZooKeeper saves its data in the tmp directory. However, since this directory is only intended for temporary storage, a ZooKeeper-data directory must be created in the installation directory of both systems instead. To make ZooKeeper use it for persistence, specify the path of the ZooKeeper-data directory as the value of the dataDir parameter in the zoo.cfg file on both systems. In the same file, add the hostnames or static IP addresses of all ZooKeeper instances to be included; in this scenario, these are the hostnames hostname-n1 and hostname-n2. By default, the ZooKeeper instances use the ports 2888 and 3888 to communicate with each other. The port specified after the semicolon is the port on which ZooKeeper listens for client connections. In addition, the so-called 4 letter words commands must be permitted, since Solr needs them.

In the described server configuration, each server is given an ID: For example, in the server.1 specification, the ID is 1. The ID must correspond to a number between 1 and 255. The definition of the server ID is done in the file myid to be created in the data directory, which has no further content beyond that.

zoo.cfg - Specification of the ZooKeeper instances (identical on both systems). 

server.1=hostname-n1:2888:3888;2181
server.2=hostname-n2:2888:3888;2181

4lw.commands.whitelist=*
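The remaining steps of this section, creating the data directory, registering it as dataDir, and writing the myid file, can be sketched as follows. The installation path ~/zookeeper matches the commands above but must be adapted if a different directory is used:

```shell
# Assumed installation directory (matches the commands in this chapter).
ZK_DIR="${ZK_DIR:-$HOME/zookeeper}"

# The conf directory normally already exists after the installation.
mkdir -p "$ZK_DIR/conf"

# Create the data directory that zoo.cfg references via dataDir.
mkdir -p "$ZK_DIR/ZooKeeper-data"

# Register the directory as dataDir in zoo.cfg.
echo "dataDir=$ZK_DIR/ZooKeeper-data" >> "$ZK_DIR/conf/zoo.cfg"

# Write the server ID into the myid file; use 1 on the first node
# (server.1) and 2 on the second node (server.2).
echo "1" > "$ZK_DIR/ZooKeeper-data/myid"
```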

After that, both ZooKeeper servers can be started using the following command.

Start of the ZooKeeper server (to be executed on both systems). 

~/zookeeper/bin/zkServer.sh start

In addition to the SmartSearch configurations, the Solr server settings are also stored in ZooKeeper. To avoid conflicts, it is necessary to separate the data. This is achieved by creating an empty Solr node that ZooKeeper uses to persist the Solr files. The node can only be created with a running ZooKeeper client, which must be terminated afterwards.

Since ZooKeeper keeps the data synchronized, creation of the Solr node is only necessary on one of the two instances.

The SmartSearch node, which is also required, is automatically created during the installation of the SmartSearch server and therefore does not need to be created manually.

Creation of the Solr node (to be executed on one system). 

~/zookeeper/bin/zkCli.sh
create /solr ""
quit

Finally, copy the ZooKeeper.service file, which is included in the systemd-samples folder in the delivery, to the /etc/systemd/system directory on both systems. This allows ZooKeeper to be used as a service and controlled via systemctl.

It must be ensured that both the user and the path of the installation directory are specified correctly in each case in the service file. By default, the user ZooKeeper and an installation under opt are assumed.

The configurations from the SmartSearch back end are also stored in ZooKeeper. The name of the root node is haupia by default, which ensures a separation from the Solr data. Within this node, the configuration data is stored in readable form; the names assigned in the configuration are also the names of the nodes.

2.3. Solr

Persistence of custom data collected by the SmartSearch server is done using two Solr servers. These are not included in the delivery and must therefore be installed manually. To do this, download Solr in the version specified in the technical requirements and install it on both systems in the Solr directory to be created. The directory can be located anywhere in the respective system.

By default, the assumption is that the installation directory to be created is located in the opt directory of the associated system. Deviations must be adjusted accordingly in the solr.service file.

Solr provides the install_solr_service.sh script for the installation, which first has to be extracted from the downloaded archive and then executed on both systems. The target directory for persisting the collected data can be chosen freely in each case. The script installs both Solr servers as a service, creates the user solr on each system, and creates all the required files and directories.

For the pending configuration, it is mandatory that both Solr servers are in a non-running state.

Installation of the Solr server (to be run on both systems). 

./install_solr_service.sh solr-8.1.1.tgz -d <VARIABLE_PATH> -n

The Solr servers require various Java variables, for example for memory usage. These must be defined per system in the configuration file /etc/default/solr.in.sh.
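A minimal sketch of such a setting in /etc/default/solr.in.sh could look as follows. The heap size is an assumption and must be chosen according to the available memory:

```shell
# Excerpt from /etc/default/solr.in.sh; the heap size is an assumption.
SOLR_HEAP="2g"
# Alternatively, explicit JVM options can be set:
# SOLR_JAVA_MEM="-Xms2g -Xmx2g"
```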

Furthermore, the use of SolrCloud must be enabled in this file on both systems, and the Solr node previously created during the ZooKeeper installation must be specified.

Before executing the script, make sure that the Solr node was created during the ZooKeeper installation and actually exists there.

Adjusting the Solr configuration (to be run on both systems). 

sed -i 's/#ZK_HOST=.*/ZK_HOST=hostname-n1:2181,hostname-n2:2181\/solr/' /etc/default/solr.in.sh

Finally, copy the solr.service file, which is included in the systemd-samples folder in the delivery, to the /etc/systemd/system directory on both systems. This allows Solr to be used as a service and controlled via systemctl.

It must be ensured that both the user and the path of the installation directory are specified correctly in each case in the service file. By default, an installation under opt is assumed.

The SmartSearch provides its own schema to be installed on the Solr server. For this, after the Solr installation, but before the first start, the following jar files must also be present in the classpath. These are part of the Solr delivery and are contained in the contrib directory.

  • morfologik-stemming-X.Y.Z.jar
  • morfologik-fsa-X.Y.Z.jar
  • morfologik-polish-X.Y.Z.jar
  • lucene-analyzers-morfologik-X.Y.Z.jar
  • lucene-analyzers-smartcn-X.Y.Z.jar
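The copy step can be sketched with a small helper like the following. The target directory server/solr/lib is an assumption; any directory on the Solr classpath works, and the installation path must be adapted accordingly:

```shell
# Copy the analyzer jars from the Solr contrib directory into a lib
# directory on the Solr classpath. The target directory server/solr/lib
# is an assumption.
copy_analyzer_jars() {
  solr_dir="$1"
  lib_dir="$solr_dir/server/solr/lib"
  mkdir -p "$lib_dir"
  find "$solr_dir/contrib" -name '*morfologik*.jar' -o -name 'lucene-analyzers-smartcn-*.jar' | while read -r jar; do
    cp "$jar" "$lib_dir"
  done
}

# Example invocation (the installation path is an assumption):
# copy_analyzer_jars /opt/solr
```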

2.4. SmartSearch

The SmartSearch is a search solution based on Spring Boot and enables the control, filtering and analysis of individualized result lists. In order to use the functionalities of the SmartSearch stack, it has to be installed on both systems in the SmartSearch directory to be created. It is provided in the form of a zip file. The installation directory created for each system must also contain the license, which can be requested from Technical Support. The directory can be created at any location in the respective system.

By default, the assumption is that the installation directory to be created is located in the opt directory of the associated system. Deviations must be adjusted accordingly in the smart-search.service file and in the application.yml file for all affected parameters.

After installation of the SmartSearch stack, it is necessary to perform the following configuration steps on both systems:

Creation of the directory data
The SmartSearch stack needs the directory data for data storage. This must be created manually within the respective installation directory.
Adaptation of the application.yml file

The application.yml file included in the delivery enables the configuration of the respective SmartSearch server. In particular, the following points must be observed:

  • The SmartSearch server uses port 8181 by default. However, you can optionally customize this per system.
  • The communication of the server to the outside is protected by SSL. For this purpose, a self-signed certificate is included in the delivery. It is intended exclusively for local development and must be replaced with an officially signed certificate for productive use.
  • For the storage of configuration data the SmartSearch server needs a connection to the ZooKeeper of its system. For this the address of the corresponding ZooKeeper server must be made known to it under the connection parameter of the ZooKeeper key.
  • The previously created data directory must also be made known to the SmartSearch stack. For this purpose, the root parameter of the haupia key must be adapted accordingly in the associated application.yml file.
  • The host name of each individual SmartSearch instance must be known to the ZooKeeper instances. This must therefore be specified as the value of the server.address parameter. If no specification is made, usage of the SmartSearch Cockpit is not possible.
  • User data is determined by accessing a cache. This requires a certain amount of time for updating in case of changes. The parameter haupia.zookeeper.users.cache.await defines the timeout in seconds to wait for such an update. Initially, the default value equals a timeout of one second.
  • The SmartSearch determines the URL scheme to be used from the Solr cluster property urlScheme. If this value is not set, http is used by default. The solr.url.scheme parameter allows overriding the behavior and must have the value http or https.
  • It is possible to secure the Solr server using basic authentication. The required credentials are to be specified via the solr.auth.username and solr.auth.password parameters.
  • When using the FirstSpirit module SmartSearch Connect, the haupia.connect.datagenerator.pool.size parameter allows configuring how many FirstSpirit projects can each be served by a separate data generator. If the value is smaller than the number of connected FirstSpirit projects, data cannot be received from all projects. If this value is not set, a maximum of number of cores x 500 applies.
  • The timeout for automatic user logout from the SmartSearch Cockpit is defined by the server.servlet.session.timeout parameter. Possible values for it are, for example, 60s or 1h. The default timeout is 15 minutes.
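Taken together, the points above could result in an application.yml similar to the following sketch. All hostnames, paths, and values are examples, and the exact nesting of the keys is an assumption derived from the parameter names mentioned above:

```yaml
server:
   port: 8181
   address: hostname-n1
   servlet:
      session:
         timeout: 15m

haupia:
   root: /opt/SmartSearch/data
   zookeeper:
      connection: hostname-n1:2181

solr:
   url:
      scheme: https
```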

The master admin is automatically created at the initial start of the first SmartSearch server and must therefore have the same data for all instances. It can neither be disabled nor deleted, thus preventing lockout from the SmartSearch Cockpit. The master admin password can be changed in the SmartSearch Cockpit. Starting the SmartSearch server with the optional resetAdminPassword parameter resets the master admin password to the value from the application.yml.

Start parameters, encoding and JVM mode
Both SmartSearch servers need the parameter -Dhaupia.master.profile=STANDALONE to start. By default, this is contained in the supplied server.conf file and must not be changed. In addition, the encoding and the mode of the Java Virtual Machine can be determined in this file, among other things. These are also already set and can be adjusted if necessary. Additionally the file offers the possibility to define the memory usage of the respective SmartSearch server.

Start the SmartSearch stack of the first system via the server.jar contained in the zip file. This contains a launch script that allows the jar file to be used as a Unix service. Only when the log file of the first SmartSearch instance contains the following message about the successful start may the second instance be started in the same way:

Log message. 

less /opt/SmartSearch/logs/SmartSearch.log
(...) Started Server in 60 seconds (JVM running for 61.2)

When the server.jar is started directly, the process runs in the foreground. The launch script automatically detects whether it was executed as a link under /etc/init.d and starts as a service in this case. To start the server.jar as a service directly, set the variable MODE to service:

MODE=service ./server.jar start

Finally, copy the smart-search.service file, which is included in the systemd-samples folder in the delivery, to the /etc/systemd/system directory on both systems. This allows the SmartSearch to be used as a service and controlled via systemctl.

Ensure that both the user and the path of the installation directory are correctly specified in the service file in each case. By default, the user SmartSearch and an installation under opt are assumed.

The installed and configured SmartSearch instances communicate with each other and perform a so-called leader election, which determines the leading instance. While the REST service behaves identically on all running instances, the SmartSearch Cockpit can only be accessed on the leader. As a result, both the configuration and the automatic or manual execution of the data generators are also only possible on the leading instance.

If an attempt is made to address the cockpit via an instance other than the leader, automatic forwarding takes place. This forwarding is realized by a leader query addressed to ZooKeeper. For this, the ZooKeeper must know the host name of each instance.

In addition, the SmartSearch license must be stored in a text file on the SmartSearch server. The license can be requested from the Technical Support of e-Spirit AG. The following key allows the license file to be referenced in the application.yml file:

Reference to SmartSearch license. 

haupia:
   licence:
      path: ./license.txt

2.5. HTTPS Proxy Configuration

If an http(s) proxy is necessary for the Web/XML crawlers to have access to their crawl targets, it can be configured via JAVA_OPTS. The relevant parameters for configuration are the following:

Table 2. Proxy configuration

  Parameter              Description                                  Value
  http|https.proxyHost   Host for the http(s) proxy                   12.13.14.15
  http|https.proxyPort   Port for the http(s) proxy                   8080
  http.nonProxyHosts     Exceptions for which the proxy is not used   localhost



Exemplary entry in the server.conf file. 

JAVA_OPTS="-server
-Dhaupia.master.profile=STANDALONE
-Dfile.encoding=UTF-8
-Dhttp.proxyHost=12.13.14.15
-Dhttp.proxyPort=8080
-Dhttps.proxyHost=12.13.14.15
-Dhttps.proxyPort=8080
-Dhttp.nonProxyHosts=localhost|127.0.0.1"

3. Further configuration

The SmartSearch is based on Spring Boot and can be configured via the application.yml file, which is located in the SmartSearch directory next to the server.jar file. The application.yml file provides a variety of configuration options, which are described in the Spring Boot documentation. For example, it offers the possibility to configure logging and security settings. The following example explains the configuration of SSL, which is a common use case.

Example: SSL configuration

By default, the SmartSearch uses SSL to enable secure data transfer, using the Jetty server embedded via Spring Boot. This works in most cases; however, problems can arise with the choice of protocol or encryption. The Spring Boot documentation describes the configuration keys server.ssl.enabled-protocols and server.ssl.ciphers, which can be used to solve such problems. To obtain a list of possible parameters, logging for Jetty must first be set to DEBUG in the right place. To do this, add the following configuration to the application.yml file:

Jetty logging configuration. 

logging:
   level:
      org:
         eclipse:
            jetty:
               util:
                  ssl: DEBUG

After configuring logging and restarting, the log contains a DEBUG output at the end of the startup process, which has approximately the following appearance:

Example of DEBUG output of the jetty’s SSL information. 

May 22 11:14:50 smart-search-cluster-1 server.jar[1208]: 2017-05-22 11:14:50.621 DEBUG 1230 --- [ main] o.e.jetty.util.ssl.SslContextFactory : Selected Protocols [TLSv1, TLSv1.1, TLSv1.2] of [SSLv2Hello, SSLv3, TLSv1, TLSv1.1, TLSv1.2]
May 22 11:14:50 smart-search-cluster-1 server.jar[1208]: 2017-05-22 11:14:50.622 DEBUG 1230 --- [ main] o.e.jetty.util.ssl.SslContextFactory : Selected Ciphers [TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256, [...]
May 22 11:14:50 smart-search-cluster-1 server.jar[1208]: A, SSL_DH_anon_WITH_DES_CBC_SHA, [...]

These two log outputs show the current values for the protocol and encryption as well as the possible values. To limit the encryption to two encryption methods if required, the application.yml file has to be adapted as follows:

Configuration of SSL encryption methods. 

server:
   ssl:
      ciphers:
         TLS_DHE_RSA_WITH_AES_128_CBC_SHA,
         TLS_ECDH_ECDSA_WITH_AES_128_GCM_SHA256

The same applies to the protocol used. After a successful configuration, it can be found in the logs during a restart.
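For the protocol, a minimal sketch using the server.ssl.enabled-protocols key mentioned above could look as follows; the selected protocol versions are an assumption and must be chosen from the values shown in the DEBUG output:

```yaml
server:
   ssl:
      enabled-protocols:
         TLSv1.2,
         TLSv1.3
```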

3.1. Configuration of the Solr replicas

Finally, after installing ZooKeeper, Solr and the SmartSearch, the following steps must be performed in the Solr web interface. On both Solr instances, they enable the automatic replication of the data that the SmartSearch needs to answer search queries.

The steps outlined refer only to the described cluster operation with at least two nodes. For a single node operation they can be ignored.

Use the URL http://hostname-<INSTANCE>:8983 to open the Solr web interface on one of the two systems and select the menu item Collections. The web interface will then display a list of all existing collections. In this list, select the collection that has the name of the client and click on the Shard: shard1 item.

The Add replica button enables the creation of another replica. Leave the dropdown in the state No specified node and create the replica using the button Create Replica. Solr selects a free node automatically.

Reload the web interface after creating the replica and verify the existence of the second replica by clicking on the Shard: shard1 item.

Next, repeat the described steps with the collection that has the suffix _signals.
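As an alternative to the web interface, a replica can also be created via the Solr Collections API. The following sketch only constructs the request URL; the hostname and collection name are assumptions and must be replaced with the actual values:

```shell
# Build the Collections API request that corresponds to the
# "Add replica" button in the Solr web interface.
addreplica_url() {
  printf 'http://%s:8983/solr/admin/collections?action=ADDREPLICA&collection=%s&shard=shard1' "$1" "$2"
}

# Example invocation (hostname and collection are assumptions):
# curl "$(addreplica_url hostname-n1 clientname)"
```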

3.2. Single node operation

If the aspects of failover and data redundancy are not relevant for the operation of the SmartSearch, the entire stack may also be installed on a single node.

Figure 4. Single node operation


The installation of the individual instances is performed in the same way as described in the previous chapters. However, the following differences should be noted:

Specification of ZooKeeper instances
In cluster mode, the hostnames or static IP addresses of all ZooKeeper instances to be included must be specified in the zoo.cfg files of all systems. This specification is omitted in the case of a single node.
Sed script to adjust the Solr configuration

The Solr configuration is adjusted in the file solr.in.sh. The sed command for this case is as follows:

Adaptation of the Solr configuration for single node operation. 

sed -i 's/#ZK_HOST=.*/ZK_HOST=localhost:2181\/solr/' /etc/default/solr.in.sh

Solr replicas
The creation of replicas refers only to the cluster mode. Therefore, for single node operation, their creation should be ignored.

3.3. LDAP

By default, the management of the users and groups necessary for the use of the SmartSearch, as well as the definition of the permissions, takes place within the SmartSearch Cockpit. The storage of the users and groups is done by the previously installed ZooKeeper server.

Alternatively, however, it is possible to implement user and group management based on LDAP. It should be noted that this option only refers to authentication (users and groups), but not to authorization (ACLs).

The LDAP connection establishes read-only access to the LDAP server. Managing users and groups within the SmartSearch Cockpit is then no longer possible.

To use LDAP, the following aspects must be present on the LDAP server:

  • On the LDAP side, the users must be assigned to the groups.
  • If the administration of the users and groups is done via the SmartSearch Cockpit, it provides a master admin for the very first login and the initial assignment of permissions. Equivalently, such a master admin is also required on the LDAP side. This user must be assigned to the group ADMIN, which is also required.

    Figure 5. User management on the LDAP side


It should be noted that any user included in the admin group is allowed to edit permissions within the SmartSearch Cockpit without restriction.

  • The successful login of a user requires that the SmartSearch Cockpit knows the password encoder used by LDAP. This can therefore be specified in the password field in the form {id}hash. The id corresponds to the id of the hashing algorithm and must be one of the following values:

    • bcrypt
    • sha, SHA or SHA-1
    • md4 or MD4
    • md5 or MD5
    • noop or NOOP
    • pbkdf2
    • SHA-256, SHA256 or sha256

Alternatively, it is possible to define the password encoder in the application.yml file of the SmartSearch server as the value of the haupia.ldap.default-password-encoder parameter.

In addition to the adaptations on the LDAP side, the following mandatory parameters must be specified in the application.yml file of the SmartSearch server:

  • haupia.ldap.enable
    This parameter enables the use of LDAP. The value of the haupia.ldap.enable parameter must be set to true.
  • spring.ldap.username and spring.ldap.password
    The connection to the LDAP server requires a technical user to be made known to the SmartSearch server. For this reason, the Distinguished Name (DN) and the password of the technical user must be specified for the spring.ldap.username and spring.ldap.password parameters.
  • haupia.ldap.user-search-base or haupia.ldap.group-search-base
    For searching the user or group objects, the corresponding Distinguished Name (DN) must also be specified as the value of the haupia.ldap.user-search-base or haupia.ldap.group-search-base parameter.
    (Example: ou=people,dc=example,dc=org or ou=groups,dc=example,dc=org)

The search for user or group objects refers exclusively to the specified level. Subtrees are excluded from this search.
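As an illustration of the structure described above, the entries for the master admin and the ADMIN group could look like the following LDIF sketch. The DNs, object classes, and the hash placeholder are assumptions and depend on the actual LDAP schema; the attributes uid, userPassword, cn, and member correspond to the defaults described below:

```
dn: uid=masteradmin,ou=people,dc=example,dc=org
objectClass: inetOrgPerson
uid: masteradmin
cn: Master Admin
sn: Admin
userPassword: {bcrypt}HASHED_PASSWORD

dn: cn=ADMIN,ou=groups,dc=example,dc=org
objectClass: groupOfNames
cn: ADMIN
member: uid=masteradmin,ou=people,dc=example,dc=org
```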

In addition to these mandatory details, further optional adaptations can also be made:

  • spring.ldap.urls
    This parameter contains a list of URLs of the available LDAP servers. By default, the parameter has the value ldap://localhost:389.
  • spring.ldap.base
    This parameter can be used to define a DN suffix for all operations performed against the LDAP server.
  • haupia.ldap.user-filter
    To find a user object in the LDAP server tree, a filter can be specified with this parameter. By default it has the value uid={0}. The placeholder {0} is replaced by the entered user name.
  • haupia.ldap.group-filter
    Equivalent to the user filter, this parameter allows specifying a group filter that finds all group objects belonging to a user. By default it has the value (member={0}). The {0} placeholder is replaced by the Distinguished Name (DN) of the corresponding user in this case.
  • haupia.ldap.default-password-encoder
    If the password encoder used is not added to the password field on the LDAP server, it can be specified using this parameter instead. By default it has the value bcrypt.
  • haupia.ldap.user-attributes.uid and haupia.ldap.user-attributes.password
    The values of these parameters correspond to the fields where the name or password of a user is stored on the LDAP side. By default they have the values uid and userPassword.
  • haupia.ldap.user-attributes.language and haupia.ldap.user-attributes.default-language
    The language parameter can be used to control in which language the SmartSearch Cockpit starts. If it is missing, the value of the default-language parameter is used instead. This parameter initially defines English as the default language.
  • haupia.ldap.group-attributes.name and haupia.ldap.group-attributes.user
    Equivalent to the user attributes, the values of these parameters correspond to the name of a group and the list of users contained in a group. By default, they have the values cn and member.

The group names are always displayed in uppercase in the SmartSearch Cockpit, even if they have a different spelling on the LDAP side.

LDAP parameters in the application.yml. 

spring:
   [...]
   ldap:
      urls: ldap://localhost:389
      password: admin
      username: cn=admin,dc=example,dc=org
      base: dc=example,dc=org

[...]

haupia:
   [...]
   ldap:
      enable: false
      user-search-base: ou=people,dc=example,dc=org
      user-filter: uid={0}
      group-search-base: ou=groups,dc=example,dc=org
      group-filter: (member={0})
      default-password-encoder: bcrypt
      user-attributes:
         uid: uid
         password: userPassword
         language: language
         default-language: en
      group-attributes:
         name: cn
         user: member

4. SSL

The processing of the data collected by the SmartSearch requires communication between the individual components and the customer’s end application. The communication of the SmartSearch stack to the outside is thereby protected by SSL. By default, the SmartSearch server uses a self-signed certificate for this purpose, which is included in the delivery.

The self-signed certificate included in the delivery is only designed for use in local development environments. For the use of the SmartSearch stack in productive operation, the use of an officially signed certificate is therefore strongly recommended.

The execution of the following steps assumes that an officially signed certificate has already been requested and that the file server.crt is thus available. Furthermore, it is assumed that all necessary files are stored in the same directory.

To use the SmartSearch stack in a production system, a new keystore must first be created and made known to the server in order to replace the existing certificates. The following command shows an example of creating the new keystore:

Creation of the new keystore. 

openssl pkcs12 -export -in server.crt -inkey server.key -out server.p12 -name server -CAfile ca.crt -caname root

The command converts the X.509 certificate and the associated key into a PKCS12 file. The password assigned during the export must be specified in the application.yml file under the key-store-password parameter of the ssl key. In the same place, the newly created keystore must be set for the key-store parameter, as well as the keyAlias server.

SSL parameters. 

ssl:
   key-store: server.p12
   key-store-password: PASSWORD
   keyStoreType: PKCS12
   keyAlias: server
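
The newly created keystore can be checked before restarting the server. The following sketch is self-contained: it generates a throwaway self-signed key and certificate as a stand-in for the officially signed files, then performs the export and inspects the result. PASSWORD and the file names are placeholders.

```shell
# Stand-in for an officially signed certificate: generate a throwaway
# self-signed pair so the example is runnable end-to-end.
openssl req -x509 -newkey rsa:2048 -keyout server.key -out server.crt \
   -days 1 -nodes -subj "/CN=smart-search-server"
# Bundle key and certificate into the PKCS12 keystore (as in the export
# command above) and dump its certificate entries; the friendly name
# "server" assigned via -name should appear among the bag attributes.
openssl pkcs12 -export -in server.crt -inkey server.key \
   -out server.p12 -name server -passout pass:PASSWORD
openssl pkcs12 -in server.p12 -passin pass:PASSWORD -nokeys
```

In productive use, the generation step is of course replaced by the officially signed server.crt and its key.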

4.1. Unsigned SSL certificates

When crawling an https server, the error shown below may occur. It states that no valid certificate is entered in the keystore.

Possible error with unsigned SSL certificate. 

Caused by: sun.security.validator.ValidatorException:
PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

To fix the error, the SSL certificate of the crawled https server must be included in the keystore of the SmartSearch server. To do this, download the certificate file and add it to the JRE with the keytool command.
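
The certificate file can be obtained, for example, with openssl s_client. The following sketch is self-contained: it starts a throwaway local TLS server as a stand-in for the crawled https server (replace localhost:4433 with the real host and port):

```shell
# Stand-in for the crawled https server: a throwaway local TLS server.
openssl req -x509 -newkey rsa:2048 -keyout srv.key -out srv.crt \
   -days 1 -nodes -subj "/CN=localhost"
openssl s_server -accept 4433 -cert srv.crt -key srv.key -quiet &
SERVER_PID=$!
sleep 1
# Fetch the certificate presented by the server and store it in PEM
# format, ready to be imported with keytool as shown below.
openssl s_client -connect localhost:4433 -servername localhost \
   </dev/null 2>/dev/null | openssl x509 -outform PEM > certificates.file
kill $SERVER_PID
```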

An example call for macOS looks like this:

Example call. 

sudo keytool -import -alias ANY_ALIAS -keystore /PATH_TO_JAVA_JRE/lib/security/cacerts -file /PATH/TO/THE/CERTIFICATE/certificates.file

After successful integration of the certificate, the described error no longer occurs and the page is crawlable as desired.

5. Monitoring

Monitoring observes processes, procedures and states. Its main purpose is to ensure that the SmartSearch server continues to run properly and that no failures occur.

The SmartSearch also offers this possibility for its own server. A simple connection to Prometheus allows the collection and evaluation of the server data. Communication between Prometheus and the SmartSearch requires minor configuration on both sides.

On the SmartSearch side, the key management.metrics.export.prometheus.enabled must be set to true in the application.yml.

Configuration of the key. 

management:
   metrics:
      export:
         prometheus:
            enabled: true

After that, the Prometheus service is available via the following URL:
smart-search-server:port/actuator/prometheus

Finally, the SmartSearch server must be made known to Prometheus. This is done in the configuration file prometheus.yml and requires the following settings:

Settings. 

scrape_configs:
- job_name: smart-search
  metrics_path: /actuator/prometheus
  scheme: https
  basic_auth:
     username: USERNAME
     password: PASSWORD
  static_configs:
     - targets:
       - smart-search-server:PORT

5.1. Metrics

To monitor the activity of a SmartSearch server, Prometheus collects so-called metrics. These are available on the Prometheus server after the monitoring configuration described above has been performed. Metrics are parameters that describe certain measured variables of the server. They allow various conclusions to be drawn about the behavior of the SmartSearch server and are periodically collected and stored by Prometheus.

The JVM parameters, such as memory usage and the number of threads, are of particular interest. Other important metrics include CPU utilization or the performance of HTTP requests.

Prometheus metrics
Abbildung 6. Prometheus metrics


5.2. Grafana

The additional configuration of a Grafana instance allows comprehensive visualization with dashboards. Grafana is software that displays metric values as graphs. The visualization of values offers the advantage of better comparability. Information on how to install and start a Grafana server can be found on the official website.

The addition of the configured Prometheus server as a data source in the Grafana user interface allows access to the metrics provided by the SmartSearch.

Grafana graphs
Abbildung 7. Grafana graphs


5.3. Solr Prometheus exporter

As of version 8, Solr also provides a Prometheus exporter, located in the contrib subfolder of the Solr installation folder. More information about the operation and configuration of the exporter can be found in the official documentation.

6. Operation

This chapter contains hints that are necessary for the operation of the SmartSearch as well as the connected systems.

6.1. Resetting the master admin password

At the initial start of the SmartSearch server the master admin is automatically created with the data from the application.yml. The admin’s password can be changed in the SmartSearch Cockpit at any time afterwards. Starting the SmartSearch server with the optional resetAdminPassword parameter resets the master admin password to the value from the application.yml. This way, lockout from the SmartSearch Cockpit is prevented.

7. Troubleshooting

Data processing by the SmartSearch can only be done if the individual components are working properly. Therefore, if disruptions occur or an update is required, both the SmartSearch server and all ZooKeeper and Solr servers must always be considered. The following sections describe some possible solutions to known challenges.

Activation of Spring Boot actuators

Spring Boot provides a set of actuators that are enabled in the default configuration of the SmartSearch. With these it is possible to get an insight into the server at runtime. The actuators provide for example information about the current environment or about the configurations of the different log levels. Detailed background information about the actuators can be found in the Spring Boot actuators documentation.

The actuators provide their information via REST services. These are secured via http basic authentication and require valid credentials of an admin user.

The following configuration enables the default actuator endpoints, but explicitly disables the heapdump endpoint. The heapdump endpoint allows downloading a heap dump that can be several gigabytes in size, which may not be desired in a production environment.

Configuration for default endpoints. 

management:
   endpoint:
      health:
         show-details: when-authorized
      metrics:
         enabled: true
      prometheus:
         enabled: true
   endpoints:
      web:
         exposure:
            include: "*"
            exclude:
               - "heapdump"
   metrics:
      export:
         prometheus:
            enabled: false
            descriptions: false

The following URL provides an overview of the current actuator endpoints.

  • Method: GET
  • URL: /actuator

    The call requires basic authentication with an admin user. For this purpose, the user from the application.yml file can be used.

    Possible response to a query of the actuators. 

    {
       "_links": {
          "self": {
             "href": "https://smart-search-server:8181/actuator",
             "templated": false
          },
          "auditevents": {
             "href": "https://smart-search-server:8181/actuator/auditevents",
             "templated": false
          },
          "beans": {
             "href": "https://smart-search-server:8181/actuator/beans",
             "templated": false
          },
          "health": {
             "href": "https://smart-search-server:8181/actuator/health",
             "templated": false
          },
          "conditions": {
             "href": "https://smart-search-server:8181/actuator/conditions",
             "templated": false
          },
          [...]
          "mappings": {
             "href": "https://smart-search-server:8181/actuator/mappings",
             "templated": false
          }
       }
    }

Use of the logging API to adjust the log levels

A special Spring Boot actuator exists for handling the log settings. This allows the log levels to be adjusted at runtime. Calling the following relative URL on a SmartSearch instance generates the output of all currently configured log levels:

  • Method: GET
  • URL: /actuator/loggers

    It is also possible to output information about a special logger:

  • Method: GET
  • URL: /actuator/loggers/<Logger>

    Example URL to query the logger de.arithnea.haupia.Server:

  • URL: /actuator/loggers/de.arithnea.haupia.Server

    The following code snippet shows a possible response from the SmartSearch server to this query.

    Example response. 

    {
       "configuredLevel": "INFO",
       "effectiveLevel": "INFO"
    }

    A log level is adjusted by a POST request against a specific logger URL. The new log level is transmitted in the request body in JSON format.

  • Method: POST
  • URL: /actuator/loggers/<Logger>

    The body is a JSON object that has the desired log level as the value for the configuredLevel key.

    Example:

    Curl call. 

     $ curl 'https://smart-search-server:8181/actuator/loggers/de.arithnea.haupia.Server' \
        -i -u 'user:password' -X POST \
       -H 'Content-Type: application/json' \
       -d '{
          "configuredLevel" : "DEBUG"
       }'

    The returned HTTP status code 204 confirms the successful setting of the log level.
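
Since a malformed body is likely to be rejected by the server, the JSON can be syntax-checked locally before sending it. A minimal sketch, assuming python3 is available; the file name loglevel.json is illustrative:

```shell
# Write the request body to a file and validate its syntax locally;
# no connection to the server is required for this step.
cat > loglevel.json <<'EOF'
{ "configuredLevel": "DEBUG" }
EOF
python3 -m json.tool loglevel.json
```

The validated file can then be passed to curl via -d @loglevel.json instead of an inline body.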

Prometheus endpoint

Prometheus is a tool for monitoring processes. At regular intervals, the tool records the state of a process and allows an evaluation of the data over time. This makes it possible, for example, to observe memory consumption in relation to requests to the REST services. The SmartSearch already provides a preconfigured Prometheus endpoint, which is activated via the application.yml file. For activation, the following key must be set to true:

Example of activating the Prometheus endpoint. 

management:
   metrics:
      export:
         prometheus:
            enabled: true

An endpoint is then available via the following URL:

  • URL: /actuator/prometheus

    For example, the endpoint can be included in Prometheus as follows:

    Example of embedding the endpoint in Prometheus. 

     scrape_configs:
        - job_name: 'smart-search'
          metrics_path: '/actuator/prometheus'
          scheme: https
          basic_auth:
             username: admin@localhost.de
             password: admin
          static_configs:
             - targets: ['smart-search-server:8181']

8. GDPR

The General Data Protection Regulation (GDPR) is an EU regulation that protects the fundamental right of European citizens to privacy and regulates the handling of personal data. Simplified, all persons from whom data is collected have the following options, among others, via the GDPR:

  • to learn which of their personal data is stored and how it is processed (duty to inform and right to information),
  • to restrict the collection of data (right to restriction of processing),
  • to influence the data collected (right to rectification), and
  • to delete the data collected (right to be forgotten).
What is personal data?

Personal data is any information by which a natural person is directly or indirectly identifiable. This includes any potential identifiers:

  • direct identifiers, such as

    • Names
    • Email addresses
    • Telephone numbers
  • indirect identifiers, such as

    • Location data
    • Customer numbers
    • Staff numbers
  • Online identifiers, such as

    • IP addresses
    • Cookies
    • Tracking pixels

Detailed information on the GDPR can be found in the blogpost The Ultimate Resource for GDPR Readiness.

8.1. GDPR and the SmartSearch

The search engine SmartSearch stores data as documents that can be made available on various platforms. The type and scope of the data, hereinafter referred to as "collected data", depends on the purpose of the product.

The manufacturer e-Spirit expressly points out that it is the responsibility of the customer to check collected data to determine whether it contains personal data and to ensure that appropriate measures are taken.

In addition to the editorial data, the SmartSearch stores personal data (essentially the email address and, if LDAP is used, the user’s username), which is used for logging in to the system and for auditing configurations, in order to be able to contact the editor of an element if necessary. Parts of this data are kept in log files. In the following, this data is referred to as "personal system data" (see below).

8.2. Personal system data in the SmartSearch

e-Spirit AG takes the protection and security of your data very seriously. Of course, we comply with the legal data protection regulations and treat both personal and non-personal data of our users with appropriate care. We only collect personal data if it is necessary for the security and functionality of the SmartSearch.

The following subchapters provide information about the collection and handling of personal data in the SmartSearch.

8.2.1. Data for authorization and authentication of users in the SmartSearch

Why is the data needed?

The SmartSearch works with a consistent user and rights system. New users are created and managed via the user management. After creation, the user is known on the server and can log in to the SmartSearch Cockpit with valid credentials. Access to the configuration elements is granted via group rights/roles.

This ensures that only authenticated users have access to the SmartSearch and that these users may only edit elements according to the rights granted to them.

Where is the data used or displayed?

Information about the user is displayed in various places, for example:

  • when logging in to the cockpit
  • when granting group rights
  • when changing an object via auditing
  • and many more
Where is the data stored?
The credentials of the individual users are always stored in the configuration component. In the case of LDAP, the personal system data is loaded from the customer’s LDAP in read-only mode.
How long is the data stored?

When a user is removed via the user management, the user’s credentials are immediately removed from the configuration component.

Deactivating a user in the user management does not delete their data.

8.2.2. Data for error analysis and troubleshooting in the SmartSearch (logging)

Why is the data needed?
The SmartSearch uses log files to track actions and events on the SmartSearch server. Log files are collected to maintain secure operations. They can be used to analyze and troubleshoot error states. Some of the log files used by the SmartSearch record the IP address, login name, date, time, request, etc., and thus contain personal data.
Where is the data stored?
Basically, log files are written to the logs subdirectory of the SmartSearch server.
How long is the data stored?
Default behavior: When the current log file reaches a fixed size of 100 MB, it is archived. Up to nine archived files are kept; when a further file is archived, the oldest one is deleted. This behavior can be customized via the configuration file logback-spring.xml.
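
The default behavior described above corresponds, for example, to a rolling file appender like the following sketch for logback-spring.xml. The element names follow the standard Logback configuration; the appender name and file names are illustrative and not taken from the delivered configuration:

```xml
<!-- Illustrative sketch, not the delivered configuration: archive the
     current log file at 100 MB and keep at most nine archived files. -->
<appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
   <file>logs/smart-search.log</file>
   <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
      <fileNamePattern>logs/smart-search.%i.log</fileNamePattern>
      <minIndex>1</minIndex>
      <maxIndex>9</maxIndex>
   </rollingPolicy>
   <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
      <maxFileSize>100MB</maxFileSize>
   </triggeringPolicy>
   <encoder>
      <pattern>%d %-5level %logger{36} - %msg%n</pattern>
   </encoder>
</appender>
```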

8.2.3. Data for auditing configuration items

Why is the data needed?

One objective of data storage in the SmartSearch is the traceability of the last changes, but also information about the creation of configuration items. For this purpose the SmartSearch stores auditing data.

Each time an editor changes a configuration item (for example, a Prepared Search), the item records who made the last change and when. This overwrites the previously recorded last change.

When a new configuration item is created, it notes who created the item and when.

Where is the data used or displayed?
The auditing data (which user made a change, and when) is displayed in list views for the configuration items.
Where is the data stored?
The data is stored at the configuration elements in the configuration component.
How long is the data stored?
Default behavior: When a user is deleted, references to that user are anonymized. References of already deleted users can also be anonymized afterwards, in case the default behavior was not applied during the deletion. After anonymization, a report is displayed showing in which elements the user was anonymized. No information is logged for this purpose; the report is the only way to obtain information about the anonymization.

8.2.4. Usage of cookies in the cockpit

Why is the data needed?
The SmartSearch uses cookies in the cockpit to save the user’s session and for protection against XSRF attacks.
Where is the data used or displayed?
The cookies are stored in the browser and sent with every interaction with the cockpit from the moment of logging in.
How long is the data stored?
The lifetime of the cookies is set to session.