Using Apache POI to Convert Excel to Elasticsearch-Compatible JSON
Abstract
Increasingly, Elasticsearch is being used as one of the preferred solutions by teams that handle a lot of text data. In this article, we share our experience of uploading data into an Elasticsearch index and explain why a simple Logstash script was not suitable for the purpose.
Introduction
Recently, we implemented a solution for a team that wanted to improve the productivity of its incident management function. As is standard practice, the team uses a knowledge repository (an incident database) while resolving newly logged tickets. Whenever a new ticket is created, the team refers to similar tickets that already exist in the repository. The repository does provide a search feature, but the team wanted the results to be ranked by relevance.
To meet the team's requirement, we decided to implement a solution using the ELK (Elasticsearch, Logstash, and Kibana) stack of tools. We know. Your immediate reaction is "Why do you want to implement another solution and create another maintenance headache?" While this question is justified, the reason for suggesting a custom solution was simple. After a discussion with the team, we realized that ranking the search results was just the first step; the team wanted to extend the solution's functionality into analytics and other cognitive aspects. In this article, we share how we went about creating the index in Elasticsearch. A description of the complete solution is a topic for a different article.
Excel Dump
The incident management system being used by the team is a well-known and popular application. As we did not have access to an API that would allow us to pull data from this application and build a ranking system on top of the results, we decided to export all the data to Elasticsearch. It was not possible to directly export data from the incident management system into Elasticsearch. Hence, the team extracted an Excel dump from the application and shared it with us, to be uploaded to an Elasticsearch index.
Using Logstash
We did not have access to an Excel plugin for Logstash. Hence, we converted the Excel dump to an equivalent CSV, for which a Logstash plugin is available. Creating the Logstash script was a breeze, but we faced issues while uploading data to Elasticsearch. The biggest hurdle was that incident information spanned multiple lines. As the Logstash CSV plugin expects all data for one ‘record’/‘document’ to be present on a single line, it stumbles when it encounters text that spans multiple lines, populating incorrect data into the index. Thus, we were faced with a situation where Logstash would skip multiple lines of data, and we ended up with an incomplete data set in the Elasticsearch index. For example, consider the CSV fragment shown below.
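To illustrate the problem (with made-up incident data), a description field that contains a newline is written to the CSV across two physical lines, so a line-oriented reader sees two malformed records instead of one:
id,status,description,date
INC0001,Closed,"Server rebooted after
patch installation",2016-01-15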
Enter Java and Apache POI
Clearly, the Logstash CSV plugin was not working for us. Hence we decided to implement a custom utility that would read data from the input file and upload the same to the Elasticsearch index. Why would our utility be any different from Logstash? For the simple reason that our utility would read an Excel file instead of a CSV and would be cognizant of newline characters in the input data.
Presently, our programming language of choice is Java. Having used the Apache POI library before, we decided to use an Excel file as input instead of a CSV. The main reason for this was the newline characters: when an Excel file is converted to a CSV file, any newline characters in a field mean that the data in one field spreads across multiple rows, making it impossible to use a tool like Logstash that reads data line by line.
In the case of an Excel file, all the data in a cell is available as a single string, regardless of how many lines it spans. Our solution reads the data in each cell and replaces the newline (carriage-return and line-feed) characters with another character (or a string, if so desired by the team), thus converting multi-line data into a single line of text. For example, a description cell containing the two lines "Server rebooted after" and "patch installation" becomes the single line "Server rebooted after#-#patch installation" (using the substitution string from the configuration shown later). After reading and processing each cell in a row, we had all the data needed to create one document in the Elasticsearch index.
In addition to removing newline characters, we also had to remove any characters that Elasticsearch deemed incorrect. For example, we were not able to insert many Unicode characters with diacritical marks into the index. (It was only a few days and a couple of tutorials later that we understood how to handle Unicode text and store it in Elasticsearch.) Being unaware of Unicode support in Elasticsearch at the time, we simply removed the Unicode characters. We are happy to report that removing these characters did not have an adverse impact on the information stored with the incident.
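As a sketch, the clean-up can be as simple as a regular-expression replacement that drops everything outside the printable ASCII range. The method name below is ours, for illustration only.
// A minimal sketch of the character clean-up described above: drop
// every character outside the printable ASCII range
static String stripNonAscii(String text) {
    return text.replaceAll("[^\\x20-\\x7E]", "");
}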
So, after some processing and the creation of a configuration file that allowed us to control the behavior of the utility (for example, specifying the characters to be replaced and the substitution string for the newline characters), we had a working solution. Using the solution, we were able to upload all incident information from the input Excel into the Elasticsearch index, and we could move to the next stage of the solution: creating the search mechanism and the ranking system that would help the incident management team identify the most relevant incidents.
Uploading Data to Elasticsearch
For each row of data from the input Excel, the solution generates an Elasticsearch-compatible JSON document. The generated JSON is uploaded to Elasticsearch using the bulk upload facility provided by Elasticsearch. The bulk upload REST API is invoked using the curl utility. Given below is the command used to upload the generated JSON to Elasticsearch.
curl -s -XPOST "http://localhost:9200/_bulk" --data-binary @generatedJSON.txt
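For reference, the generated file follows the bulk API's newline-delimited format: an action line naming the target index precedes each document, one JSON object per line, and the file must end with a newline. The index and type names below are illustrative (recent Elasticsearch versions drop the _type field and also expect a Content-Type: application/x-ndjson header on the request):
{ "index" : { "_index" : "incidents", "_type" : "incident" } }
{ "id" : "INC0001", "status" : "Closed", "description" : "Server rebooted after#-#patch installation", "date" : "2016-01-15" }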
Logstash Once Again
It is important to note that it is not necessary for the utility to generate a JSON file. In fact, we later modified the utility to generate a CSV file instead, and we uploaded the CSV data into the Elasticsearch index using a Logstash script (which works perfectly, as it finds all the data for a record on one line). A sketch of such a script is shown below.
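As a sketch, and not the exact script we used, a minimal Logstash pipeline for the cleaned-up CSV could look like the following; the file path, column names, and index name are illustrative and mirror the configuration shown later:
input {
  file {
    path => "/path/to/incidents.csv"
    start_position => "beginning"
  }
}
filter {
  csv {
    separator => ","
    columns => ["id", "status", "description", "date"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "incidents"
  }
}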
Java Snippet
Given below is a Java snippet to illustrate how the Apache POI library can be used to extract values from an Excel file. Please note that this snippet is no different from many examples available on the Internet and has been included for illustration only.
. . .
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;

import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
. . .
File f = new File(filePathStr);
FileInputStream file = new FileInputStream(f);
// Load the Excel document (HSSFWorkbook reads the older .xls format)
HSSFWorkbook workbook = new HSSFWorkbook(file);
// Get the first sheet from the workbook
HSSFSheet sheet = workbook.getSheetAt(0);
// Get an iterator over all the rows in the current sheet
Iterator<Row> rowIterator = sheet.iterator();
while (rowIterator.hasNext()) {
    Row row = rowIterator.next();
    // Get an iterator over all the cells of the current row
    Iterator<Cell> cellIterator = row.cellIterator();
    // Load data from the cell into t
    String t = null;
    while (cellIterator.hasNext()) {
        Cell cell = cellIterator.next();
        // POI 3.x constant; newer POI versions use CellType.NUMERIC
        if (Cell.CELL_TYPE_NUMERIC == cell.getCellType()) {
            t = cell.toString();
        }
        . . .
        // Replace newlines by the string from the configuration;
        // the elided branches above handle the other cell types
        if (t != null) {
            t = t.replaceAll("\n", configData.replaceNewlineBy);
        }
    }
    . . .
}
. . .
JSON Configuration
Given below is a JSON configuration similar to the one we used in the solution. It controls the behavior of the utility.
{
  "ignoreHeaderLine": "yes",
  "useHeaderForFieldNames": "no",
  "fieldNames": [
    "id",
    "status",
    "description",
    "date"
  ],
  "replaceNewline": "yes",
  "replaceNewlineBy": "#-#",
  "replaceTabCharacter": "yes",
  "replaceTabCharacterBy": "###",
  "replaceCharacters": [],
  "replaceCharactersBy": ""
}
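To make the link between this configuration and the cell processing concrete, here is a minimal sketch of how the replacement settings might be applied to a single cell value. The ConfigData class and its field names are our own simplification of the parsed JSON above, not the utility's actual code.
// Hypothetical holder for the parsed configuration JSON
class ConfigData {
    String replaceNewline;        // "yes" or "no"
    String replaceNewlineBy;      // e.g. "#-#"
    String replaceTabCharacter;   // "yes" or "no"
    String replaceTabCharacterBy; // e.g. "###"
}

class CellSanitizer {
    // Apply the configured substitutions to one cell value
    static String sanitize(String cellValue, ConfigData config) {
        String result = cellValue;
        if ("yes".equals(config.replaceNewline)) {
            // Cover both Windows (\r\n) and Unix (\n) line endings
            result = result.replaceAll("\\r?\\n", config.replaceNewlineBy);
        }
        if ("yes".equals(config.replaceTabCharacter)) {
            result = result.replace("\t", config.replaceTabCharacterBy);
        }
        return result;
    }
}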
In Summary
In summary, the primary reason for creating a custom solution using Java and Apache POI was that we could not add Excel data directly into an Elasticsearch index. Additionally, after converting the Excel to a CSV, the Logstash CSV plugin was not able to handle the input data, as some of the fields spanned multiple physical lines. While addressing the primary need with the custom solution, we were also able to massage the data into a format acceptable to Elasticsearch, helping us move to the next stage of solution development.
References
- Elasticsearch
- Getting Started With Elasticsearch
- Getting Started with Logstash
- The Apache Software Foundation
- Apache POI - the Java API for Microsoft Documents
- BusinessPlan.java – A Sample from Apache POI