What is Elasticsearch, and How Can Your Business Leverage It?
Pablo Grill
April 29, 2020
In recent years, efficiently storing, analyzing, and managing data has become a high priority for most businesses. Some of today's most important tech companies reached their level of success because they prioritized data management. However, machine learning algorithms and the benefits they can bring to your company (which we explained in greater detail in a previous post) require a considerable amount of input data.
The best approach to storing your data depends on several factors. Traditional SQL databases, such as PostgreSQL (postgres) or MySQL, are a good option for storing structured data. The main advantage of this kind of database is that you have mechanisms to guarantee the quality and consistency of your data, allowing you to model complex relationships between objects. This advantage is the reason behind their popularity and is why they are currently used by most systems.
However, when it comes to storing and searching information in big datasets of unstructured text, such as user reviews, complaint emails, or logs of user activity, these databases are not the best option. Text search in SQL databases is usually slow, and it can be difficult to include linguistic concepts such as synonyms and stop words.
There are some features, such as PostgreSQL full-text search (described in a previous post), that improve the text search capabilities of databases and let you implement a kind of smart search over a SQL schema. This option works great when your data is stored in a structured way and you want to add text search functionality. However, when you have a considerable amount of unstructured data, traditional databases are usually not the best option. In those cases, it makes more sense to use a dedicated search and analytics engine such as Elasticsearch.
What is Elasticsearch?
Elasticsearch is a search and analytics engine that supports multiple types of data, including text, dates, and numbers. It's a distributed solution by default, so whenever we refer to an Elasticsearch server, we are talking about a cluster with potentially several nodes. You interact with the engine through REST APIs and a domain-specific language (DSL), usually called the Elasticsearch DSL.
In Elasticsearch, there are two important terms: documents and indexes.
- A document is any piece of information that can be mapped as a set of keys with their corresponding values. An example of a document could be a newspaper article mapped using the keys title, content, publish_date, author, and source.
- An index is a collection of documents that are related to each other. Taking into account the previous example, an index could be all the articles that belong to a specific source, or articles regarding a specific topic (sports, health, etc.).
Internally, Elasticsearch stores documents as JSON and makes them searchable through an inverted index, a data structure designed to allow efficient full-text searches. Conceptually, it is a map that lists every unique word that appears in any document and identifies all of the documents each word occurs in.
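To make the idea of an inverted index more concrete, here is a minimal sketch in Python. It is a toy illustration and not how Elasticsearch implements it internally; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted ids of the documents that contain the word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical sample data, not taken from a real index.
documents = {
    1: "elasticsearch is a search engine",
    2: "postgres is a relational database",
    3: "a search engine indexes documents",
}

index = build_inverted_index(documents)
print(search(index, "search"))    # [1, 3]
print(search(index, "postgres"))  # [2]
```

Looking up a word is a single map access, regardless of how many documents exist, which is why this structure suits full-text search.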
What is ELK?
If you start to research Elasticsearch, you will immediately run into the acronym ELK (Elasticsearch, Logstash, and Kibana), also known as the Elastic Stack. This stack is a set of tools that helps you deal with your data, covering features such as extraction, loading, storage, and visualization.
As we saw earlier, Elasticsearch is the engine responsible for storage and is the core of ELK. Logstash is the most widely used tool for loading data from the original sources into the Elasticsearch cluster. Among other features, Logstash can aggregate data from different sources and transform it. Kibana is the data visualization and management tool of the stack, giving you a web interface with graphs, dashboards, and developer tools that help you analyze your data in real time.
When to Use Elasticsearch
As Elasticsearch supports multiple types of data, it can be used for data storage in practically all situations. However, it wasn't designed to perform well in all scenarios. So, if you choose to use Elasticsearch on the wrong problem, your solution will still work, but the performance or the usability probably won't be the best.
Here are some tips to verify that Elasticsearch is a good option for your project:
Your business manages a lot of textual data.
As we mentioned previously, the main advantage of Elasticsearch is that it uses an inverted index to store information. This data structure is one of the best for storing textual information.
Your business needs to analyze and search textual data.
The main advantage of the inverted index is not only the ability to store textual information (you can get that from a NoSQL database such as MongoDB), but also the ability to search it efficiently. Elasticsearch is well-known for executing complex searches in near real time. Moreover, searches in Elasticsearch can be customized to obtain better results. For example, a fuzzy query tolerates typos in the search terms and in the data.
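As an illustration, the following sketch builds the JSON body of a fuzzy query in Python. The field name `title` and the misspelled search term are assumptions made for the example, and the helper `fuzzy_query` is hypothetical; only the `fuzzy` query and its `value` and `fuzziness` parameters come from the Elasticsearch DSL.

```python
import json

def fuzzy_query(field, value, fuzziness="AUTO"):
    """Build the request body for a fuzzy query on a single field."""
    return {
        "query": {
            "fuzzy": {
                field: {"value": value, "fuzziness": fuzziness}
            }
        }
    }

# "elasticsaerch" is deliberately misspelled; the field name is an assumption.
body = fuzzy_query("title", "elasticsaerch")
print(json.dumps(body, indent=2))
```

Sent to the search endpoint of an index, a body like this is intended to match documents whose title contains a term within a small edit distance of the misspelled value.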
Your data doesn't change frequently.
It is also important to consider the disadvantages of Elasticsearch. The price of great search performance on textual data is slower inserts and updates. Each time a record is inserted or modified, Elasticsearch needs to update the inverted index, which is a time-consuming operation. If your business modifies its data frequently and needs those writes to be fast, Elasticsearch may not be the best option.
Why use Elasticsearch?
Now that we understand what Elasticsearch is and when it can be used, let’s look at other aspects that make it a good solution for your business.
Scalability
Elasticsearch is distributed by nature, which makes the solution scalable by default. Once you have deployed your solution, increasing the number of nodes in your cluster is pretty easy. Adding nodes lets you improve performance and replicate your data across the cluster, which adds robustness to your solution.
Open Source Community
Elasticsearch is a 10-year-old open source technology with a mature community behind it. Nowadays, there are hundreds of paid ETL and business intelligence solutions that can be used as an alternative to Elasticsearch. However, some of these solutions have a reduced number of features or cannot be easily adapted to your product. Moreover, support is usually paid or the community is small, giving you few options. On the other hand, Elasticsearch has a big community, offering you lots of support. If you start a project with Elasticsearch, the forums are a good place to find the help you need.
Complete ETL Stack
As we mentioned earlier, Elasticsearch is part of the ELK stack, giving you the option of deploying a complete solution with only a small amount of configuration. Many alternatives to Elasticsearch cover only the search engine, leaving you with the responsibility of loading the data into it. In the ELK stack, this part of the process is handled by Logstash, so your only responsibility is configuring it properly.
Facilities to Transform Your Data
Something that we haven’t talked about yet is the data transformation capability built into Elasticsearch. Before adding text to the indexes, Elasticsearch processes it using analyzers. These analyzers apply transformations to the raw data, such as removing stop words, which help make searches more accurate. Elasticsearch provides a default set of analyzers that can be applied. In addition, you can create custom analyzers and add them to your indexes. An example of a custom analyzer could be one that substitutes entities in your data.
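The following toy sketch in Python mimics what a simple analyzer does: it lowercases the text, splits it into tokens, and drops stop words. It is not Elasticsearch code; the `analyze` function and the short stop word list are hypothetical and only illustrate the concept.

```python
import re

# A tiny, hypothetical stop word list; real analyzers ship longer ones.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def analyze(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]

print(analyze("The quality of the search is GREAT!"))
# ['quality', 'search', 'great']
```

Because the same steps are applied to the stored text and to the search terms, a search for "great" can find a document that contains "GREAT!".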
Specific Search DSL
In Elasticsearch, all searches are done using queries written in a DSL designed to cover the most common search situations. The Elasticsearch DSL offers query types such as match, bool, and fuzzy, among others, that let you customize your searches in order to retrieve exactly the results you are looking for.
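As a sketch of how these query types combine, the Python snippet below builds a bool query that wraps a match clause and a term filter. The field names `content` and `source` reuse the newspaper article example and are assumptions, as is the helper name `articles_query`; the term filter also assumes that `source` is indexed as an exact (keyword) value.

```python
import json

def articles_query(text, source):
    """Build a bool query: full-text match on content, exact filter on source."""
    return {
        "query": {
            "bool": {
                # "must" clauses contribute to the relevance score.
                "must": [{"match": {"content": text}}],
                # "filter" clauses only include or exclude documents.
                "filter": [{"term": {"source": source}}],
            }
        }
    }

# Hypothetical search text and source name.
body = articles_query("football results", "daily-news")
print(json.dumps(body, indent=2))
```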
Full Management Using REST APIs
All operations against the Elasticsearch cluster are done using a REST API. This architecture facilitates integration with other components, since any HTTP client can talk to the cluster without needing to install a specific package, library, or component.
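For instance, a search can be prepared using only the Python standard library. This is a minimal sketch: the cluster address `http://localhost:9200`, the index name `articles`, and the helper names are assumptions, and `run_search` only works if a cluster is actually reachable at that address.

```python
import json
import urllib.request

def build_search_request(base_url, index, query):
    """Prepare (but do not send) a POST request to the index's _search endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_search(request):
    """Send the request and decode the JSON response; needs a reachable cluster."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Assumed local cluster address and hypothetical index name.
request = build_search_request(
    "http://localhost:9200", "articles", {"match": {"title": "elasticsearch"}}
)
print(request.get_method(), request.full_url)
# POST http://localhost:9200/articles/_search
```

Building the request separately from sending it keeps the example runnable without a cluster; calling `run_search(request)` against a live cluster would return the parsed JSON response.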
In summary, Elasticsearch is a mature search technology that can help your business efficiently analyze your textual data. We hope this post has helped you understand what Elasticsearch is and why it could be a good option for your company.
Photo by Markus Winkler.