Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extended Configuration for ElasticSearch Analyzer #17

Closed
wants to merge 4 commits into from

Conversation

vkublytskyi
Copy link

@vkublytskyi vkublytskyi commented Aug 7, 2018

This is a proposal of adding new elements to esconfig.xml files to have more control on ElasticSearch analyzer settings for indexes.

Here is example of proposed changes:

<config>
    <tokenizer>
        <default>
            <type>standard</type>
        </default>
        <jp_JP>
            <type>kuromoji_tokenizer</type>
            <mode xsi:type="string">extended</mode>
            <discard_punctuation xsi:type="boolean">false</discard_punctuation>
            <user_dictionary xsi:type="string">userdict_ja.txt</user_dictionary>
        </jp_JP>
    </tokenizer>
    <toke_filters>
        <default>
            <lowercase /><!-- declare usage of standard filter -->
            <my_token_filter><!-- customized token filter -->
                <type>standard</type>
                <max_token_length xsi:type="number">5</max_token_length>
            </my_token_filter>
        </defualt>
        <en_US><!-- locale specific filters>
            <!-- ... -->
        </en_US>
    </toke_filters>
    <char_filters>
        <default>
            <html_strip /><!-- declare usage of standard filter -->
            <my_char_filter><!-- customized filter to replece "-" by "_" -->
                <type>pattern_replace</type>
                <pattern>(\\d+)-(?=\\d)</pattern>
                <replacement>$1_</replacement>
            </my_char_filter>
        </defualt>
        <en_US><!-- locale specific filters>
            <!-- ... -->
        </en_US>
    </char_filters>
</config>

This proposal is part of work on magento/magento2-jp#18 and has implemented prototype at magento/magento2-l10n#1 implemented by @HirokazuNishi from Veriteworks in collaboration wit Magento Community Engineering Team.


Extended Configuration for ElasticSearch Analyzer

Context

During work on Japanese localization community project we descovered requirement to modify default index settings created by Magento\Elasticsearch\Model\Adapter\Index\BuilderInterface::build.
First of all, this is caused by the unique nature of the Japanese writing system. Usage of standard analyzer and tokenizer do not provide sufficient search accuracy. To provide valid search results Japanese (kuromoji) Analysis Elasticsearch plugin should be used which provides set of tokenizers and token filters correctly processing Japanese texts.

Existing Elasticsearch configuration in Magento 2 allows setting only default stemmer name and list of stop words. It introduces two top-level XML elements stemmer and stopwords_file. Each element includes elements with valid locale code as a name and default element which holds configuration that should be used if locale-specific configuration is not defined.

<config>
    <stemmer>
        <type>stemmer</type>
        <default>english</default><!-- used for all locales that does not have explicit config element-->
        <de_DE>german</de_DE><!-- stemmer name for de_DE locale -->
        <!-- elements for other locales -->
    </stemmer>
    <stopwords_file>
        <default>stopwords.csv</default><!-- stop words list to be used if locale does not defined own list -->
        <de_DE>stopwords_de_DE.csv</de_DE><!-- de_DE locale stop words -->
        <!-- elements for other locales -->
    </stopwords_file>
</config>

There may be much more use-cases when SI integrators would like to change default index settings to adjust search results.

To allow complex changes of default Elasticsearch configuration in Magento 2, such us required by Japanese localization project, at the current moment we have 2 options:

  1. Use pluginization mechanism on \Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface and Magento\Elasticsearch\Model\Adapter\Index\BuilderInterface.
  2. Introduce explicit extension points and extend possibilities of Magento 2 ElasticSearch configuration.

Plugins are powerful mechanism but they push developers to explore the internals of pluginized module that violates the Open-Closed principle. Usage of plugins would require a same code from project-to-project.

Decision

Magento 2 should provide a possibility to configure analyzer for indexes. To not limit all the power of ElasticSearch without exposing all complexity of configuration should include:

  • tokenizer configuration
  • token filters configuration
    • customized filters configuration
    • list of token filters to be used by an analyzer
  • char filters configuration
    • customized filters configuration
    • list of token filters to be used by an analyzer

The extended schema should be fully backward compatible and consistent with the current implementation.

General Concepts

As all required configuration options are independent they should be expressed as top-level XML elements.

As all required configuration options have a strong dependency on a language and to be consistent with existing configuration elements each introduced elements should contain default element and elements with names corresponding to locale names.

As particular tokenizer and filters may require complex types for configuration XML schema should allow to do that and converting of available options described at ElasticSearch Documentation in JSON to XML elements should follow simple straight-forward rules.

Tokenizer Configuration

A tokenizer is configured by tokenizer element which must include default element as a container for common configuration and may include one or more elements with valid locale codes as a name that are containers for locale-specific configuration.

Each direct child of tokenizer element must contain type element with the name of tokenizer to be used.

Each direct child of tokenizer element may contain element which represent configuration parameter available for specified tokenizer type. See ElasticSearch parameters XML representation section for more information.

<config>
    <tokenizer>
        <default>
            <type>standard</type>
        </default>
        <jp_JP>
            <type>kuromoji_tokenizer</type>
            <mode xsi:type="string">extended</mode>
            <discard_punctuation xsi:type="boolean">false</discard_punctuation>
            <user_dictionary xsi:type="string">userdict_ja.txt</user_dictionary>
        </jp_JP>
    </tokenizer>
</config>

Token Filters Configuration

Token filters are configured by token_filters element which must include default element as a container for common configuration and may include one or more elements with valid locale codes as a name that are containers for locale-specific configuration.

Each direct child of token_filters holds configuration of filter:

  • standard filters that should be used by analyzer must be declared as an empty element with name equal to token filter name (e.g. <lowercase/>)
  • customized filters declared by an element with name equal to custom filter name, required type child element which contains the name of customized token and optional parameters. See ElasticSearch parameters XML representation section for more information.
<config>
    <toke_filters>
        <default>
            <lowercase /><!-- declare usage of standard filter -->
            <my_token_filter><!-- customized token filter -->
                <type>standard</type>
                <max_token_length xsi:type="number">5</max_token_length>
            </my_token_filter>
        </defualt>
        <en_US><!-- locale specific filters>
            <!-- ... -->
        </en_US>
    </toke_filters>
</config>

Char Filters Configuration

Char filters configuration is similar to token filters but declared by char_filters element

<config>
    <char_filters>
        <default>
            <html_strip /><!-- declare usage of standard filter -->
            <my_char_filter><!-- customized filter to replece "-" by "_" -->
                <type>pattern_replace</type>
                <pattern>(\\d+)-(?=\\d)</pattern>
                <replacement>$1_</replacement>
            </my_char_filter>
        </defualt>
        <en_US><!-- locale specific filters>
            <!-- ... -->
        </en_US>
    </char_filters>
</config>

ElasticSearch parameters XML representation

JSON configuration described at ElasticSearch configuration may be converted to XML configuration to specify tokenizer and filter parameters.

Objects

Objects or maps are key structure to declare ElasticSearch configuration. When converting JSON to XML keys became element names and values are presented as node values. An element may declare a type of value with xsi:type attribute but it may be autodetected by the parser.

JSON XML
{key: {}} <key xsi:type="map"><!-- ... --></key>

Strings

String values are converted to text node:

JSON XML
{key: "string value"} <key xsi:type="string">string value</key>

Numbers

Numeric values are converted to text node. This type utilize both integer and float numbers depend on provided decimal part:

JSON XML
{key: 42} <key xsi:type="number">42</key>

Booleans

Boolean values are converted to text node and may contain "true" or "false":

JSON XML
{key: true} <key xsi:type="boolean">true</key>

Least

Array values represented as series of item nodes:

JSON XML
{key: ['v1', 'v2']} <key xsi:type="list"><item>v1</item><item>v2</item></key>

Item node may declare ref attribute that should be unique inside list. ref attribute should be used to provide a possibility to override list element by Magento configuration merging mechanism. If module B want to add element to list in ElasticSearch config declared by module A then ref attribute should be used as well.

<!-- module A-->
<articles xsi:type="list">
    <item ref="overridableArticleItem" xsi:type="string">may be overridden</item>
    <item>can not be referenced so is can not be changed</item>
</articles>

<!-- module B which has module A in sequence at module.xml -->
<articles xsi:type="list">
    <item ref="overridableArticleItem" xsi:type="string">overridden value</item>
    <item ref="addedArticleItem">added value</item><!-- after merge articles list will have 3 items -->
</articles>

Disabling Configuration Element

As with proposed changes configuration may be complex and conflict with some requirements it is necessary to provide a mechanism of disabling some parts of configuration during the Magento configuration merge process.

To achieve this goal any element may declare disabled boolean attribute:

<config>
    <token_filters>
        <default>
            <my_token_filter disabled="true"/><!-- disable customized filter declared in other config file -->
        </default>
    </token_filters>
</config>

Deprecation of Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface

Current @api interface Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface violates Interface Segregation principle. It extending is impossible as that would be backward incompatible.

As a solution Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface should be deprecated and instead of it new set of interfaces should be introduced. Each new interface should be responsible for a single aspect of ElasticSearch configuration.

/**
 * @api
 * @deprecated
 */
interface EsConfigInterface extends EsStemmerConfigInterface, EsStopWordsConfigInterface
{
    public function getStemmerInfo();
    public function getStopwordsInfo();
}

/**
 * @api
 */
interface EsStemmerConfigInterface
{
    public function getStemmerInfo();
}

/**
 * @api
 */
interface EsStopWordsConfigInterface
{
    public function getStopwordsInfo();
}

/**
 * @api
 */
interface EsTokenizerConfigInterface
{
    public function getTokenizerInfo(): array;
}

/**
 * @api
 */
interface EsTokenFilterConfigInterface
{
    public function getTokenFiltersInfo(): array;
    public function getTokenFiltersList(): array;
}

/**
 * @api
 */
interface EsCharFilterConfigInterface
{
    public function getCharFiltersInfo(): array;
    public function getCharFiltersList(): array;
}

Class Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfig will implement all of these interfaces to simplify backward compatible implementation.

Prototype

Prototype of proposed changes is implemented in magento/magento2-l10n#1 by Hirokazu Nishi from Veriteworks in collaboration wit Magento Community Engineering Team.

Status

Proposed

Consequences

Proposed changes to ElasticSearch XML configuration give a possibility to fully configure ElastoicSearch analyzer for indexes. It currently not supported multiples tokenizers. The current implementation also assumes usage of unified stemmer and stop words list declared in CSV files. Fixing these limitations is out of the scope of this proposal and should be addressed if real issue reported.

The main drawback of this proposal is an introduction of XML schema that not really match XML philosophy (not strictly defined elements structure). This is done by intention to be consistent with existing configuration and to provide straightforward, not verbose conversion of configuration in JSON described at ElasticSearch documentation to XML format expected by Magento.

@kandy
Copy link
Contributor

kandy commented Aug 16, 2018

Given that changes to the search settings are a rare necessity, I recommend using the possibilities of DI Configuration instead of building a new type of configuration

@vkublytskyi
Copy link
Author

@kandy,

the search settings are a rare necessity

The main reason is adapting the search for different languages. With more presence of Magento at such huge markets as Japan, China, India, etc. this is really "must have" feature.

I recommend using the possibilities of DI Configuration instead of building a new type of configuration

First, this is not a new configuration but the existing one extended in a backward compatible way.

Second, DI configuration may be used in 2 way:

  • change implementation of interfaces Magento\Elasticsearch\Model\Adapter\Index\BuilderInterface by preference. This will not work in case if some merchant will need to adjust ElastciSearch settings for Japanese and Chinese at same Magento instance
  • pluginize Magento\Elasticsearch\Model\Adapter\Index\BuilderInterface or/and Magento\Elasticsearch\Model\Adapter\Index\Config\EsConfigInterface. Thankfully to a power of plugins, this is a doable approach. But it has also drawbacks. Each time when language-specific settings should be applied a good amount of code will be duplicated to modify config. Regardless that these interfaces marked with @api, a developer will be pushed to explore internal implementation and rely on it as plugins have to modify PHP arrays that are not defined in interfaces but are implementation-specific. Plugins are "last hope" for 3rd party developers. As soon as we detect frequent requirement to customize/extend some behavior we must introduce explicit extension points.

Third (and this one is my personal opinion), DI configuration in Magento currently abused and configure too many aspects. In fact, DI should be used to set dependencies but not to configure their behavior. Magento has a pretty flexible configuration framework that allows introducing of new config types easily and this should be considered as best practice. We should have more small config files targeted to particular feature instead of huge 500-1000 lines di.xml files


Each direct child of `tokenizer` element may contain element which represent configuration parameter available for specified tokenizer type. See [ElasticSearch parameters XML representation](#ElasticSearch-parameters-XML-representation) section for more information.

```xml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understood the intention was to make configuration similar to what we already have. Current configuration in the module is different from configuration that we have for other modules. Have you thought about unifying it? Since we don't have a lot of configuration, maybe we can introduce new format, deprecate old one and create fallback for backwards compatibility?

I need to think more about alternative format, but here is what I have so far

<config>
    <languages>
        <language name="default">
            <tokenizer name="default" type="standard"/>
            <token_filter name="default" type="lowercase"/>
            <char_filter name="default" type="html_strip"/>
        </language>
        <language name="jp_JP">
            <tokenizer name="my_tokenizer"
                       type="kuromoji_tokenizer"
                       mode="extended"
                       discard_punctuation="false"
                       user_dictionary="userdict_ja.txt"/>
            <token_filer name="my_token_filter"
                         type="standard"
                         max_token_length="5"/>
            <char_filer name="my_char_filter"
                        type="pattern_replace"
                        pattern="(\\d+)-(?=\\d)"
                        replacement="$1_"/>
        </language>
    </languages>
</config>

Main benefits

  1. Name of the language is value of an attribute instead of name of the node, more consistent with other Magento configuration. Are there other benefits in proposed configuration except consistency with existing module configuration?
  2. Configuration per language defined in one place
  3. Nodes that unlikely will have their own attributes made attributes

Don't like how defaults being set in my example, need to think more about it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main intention was to make configuration similar to existing exactly in esconfig.xml regardless it differs from main approach in other config files. I also think that supporting two types of config will create too many difficulties in maintenance and overcomplicate implementation.

I suppose in future we will need BiC in this module to eliminate support of ES 2 and full support of new features. I think that time in future would be ideal for reviewing the whole config.

Copy link
Member

@melnikovi melnikovi Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would favor introducing configuration in the format that follows general and Magento best practices because amount of configuration we going to add is pretty big. Currently we have small amount of configuration and it should be easy to support fallback to it. Having free form configuration that is not following conventions of other configuration in Magento is bad for developer experience. Would be interested in other people opinion on this.

| ------------- |-------------|
| `{key: true}` | `<key xsi:type="boolean">true</key>` |

#### Least
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be list?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

| ------------- |-------------|
| `{key: ['v1', 'v2']}` | `<key xsi:type="list"><item>v1</item><item>v2</item></key>` |

Item node may declare `ref` attribute that should be unique inside list. `ref` attribute should be used to provide a possibility to override list element by Magento configuration merging mechanism. If module B want to add element to list in ElasticSearch config declared by module A then `ref` attribute should be used as well.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ref be name or it's different from how we use name in other configuration files?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may renamed to name for consistency.


### ElasticSearch parameters XML representation

JSON configuration described at ElasticSearch configuration may be converted to XML configuration to specify tokenizer and filter parameters.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where this can be specified? Could you expand on this a little or point me to where it's described in case I'm missing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that understand the question. XML config in the proposal does not declare strict schema. Instead of that proposed straightforward rules how to represent ElasticSerach config in XML

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think there could be a benefit in making this configuration strict? Will be easier for developer to write. It's also better in case Elastic change configuration format. Could you point me to JSON configuration we trying to represent in XML?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. We will not cover all possible ES config options with specific elements so will have to introduce general elements like standard_filter, custom_filter, filter_param, etc. This will make validation of XML more strong but in cost of extra verbosity and worse readability


#### Objects

Objects or maps are key structure to declare ElasticSearch configuration. When converting JSON to XML keys became element names and values are presented as node values. An element may declare a type of value with `xsi:type` attribute but it may be autodetected by the parser.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep configuration in the flexible format, can we remove xsi:type or there are cases when it's needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is optional. If omitted type is autodetected

@melnikovi
Copy link
Member

Per @vkublytskyi proposal became obsolete.

@melnikovi melnikovi closed this Mar 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants