Data Science for Java Developers With Tablesaw

gowebing 2024-12-02

Data science is one of the hottest areas in computing today. Most people learn data science using either Python or R. Both are excellent languages for crunching and analyzing data.

But many Java developers feel left behind. There are great Java libraries for machine learning, especially for jobs that require distributed computing, but there's no simple path for Java developers to learn and apply data science. By minimizing the number of things you need to learn, the open-source Tablesaw provides a gateway.

Think of Tablesaw as a Java power tool for data manipulation with hooks for interactive visualization, analytics, and machine learning. Used interactively or embedded in an application, its focus is to make data science as easy in Java as in R or Python. If you've done some data science, you may think of it as a data frame.

Tablesaw is easy to learn, but it's not a toy. Tables can be large, up to two billion rows. Performance is brisk: on my laptop, I can retrieve 500 records from a table of half a billion rows in two milliseconds. It is open-sourced under the business-friendly Apache 2 license.

What Makes Tablesaw Beginner-Friendly?

It builds on what you know: For Java developers who want to do data science, it's a huge advantage to not have to also learn a new language.
It's easy to get started: Simply add Tablesaw as a Maven dependency for your project and you’re up and running. We’ll walk through an example below to show you how.
It's not distributed: Unlike many machine learning libraries, Tablesaw is not a distributed system. This removes enormous complexity and makes machine learning accessible to those without deep engineering experience or support.
The code is clear: There's a fluent API so you’ll understand your code the next time you read it.
It provides fast feedback: Tablesaw is designed to be used interactively for exploratory analysis.

Introductory Example

Here, I’ll show you some of Tablesaw’s basic data manipulation features. Future posts will address visualization, machine learning, the Kotlin API and REPL, and the Tablesaw architecture. The code for this example can be found here.

Up and Running

To begin, create a Java project and add the Tablesaw core library as a Maven dependency. The current dependency is:

<!-- https://mvnrepository.com/artifact/tech.tablesaw/tablesaw-core -->
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-core</artifactId>
    <version>0.23.3</version>
</dependency>

Next, create a class with a main method like so:

public class Foo {
    public static void main(String[] args) {
       // rest of code goes here
    }
}

The rest of our code will go in this method.

The first thing to do is add a table. Tablesaw can load data from relational databases, but we will create our table from a flat text file:

Table table1 = Table.read().csv("bush.csv");

Table objects can provide a lot of information:

table1.name(); returns bush.csv since the table name defaults to the file name.
table1.shape(); returns 323 rows X 3 cols.
table1.structure(); returns a table of column metadata:

Index Column Name Column Type 
0     date        LOCAL_DATE  
1     approval    SHORT_INT   
2     who         CATEGORY

Note that we've inferred the column types from the data.

table1.first(3); returns a new table containing only the first three rows.

BushApproval.csv
date       approval who 
2004-02-04 53       fox 
2004-01-21 53       fox 
2004-01-07 58       fox

Inevitably, we want to work with columns. Each has a data type, and usually, you’ll want it by that type and not as a generic column because typed columns have more power. For example, to get the approval column, you can use:

NumberColumn approval = table1.numberColumn("approval");

Each column sub-type supports numerous operations. As a rule, operations on a column are applied to every element without explicit loops. Some call these “vector operations.” For example, operations like count(), min(), and contains() produce a single value for a column of data:

double min = approval.min();

Other operations return a new column. The method dayOfYear() applied to a DateColumn returns a short integer column with each element the day of the year from 1 to 366.

Some column-returning operations take a scalar value as an argument:

dateColumn.plusDays(4);

This adds four days to every element. Others take a second column as an argument. These process the two columns in order, applying each integer value from the argument to the corresponding element in the receiver.
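To make the three shapes of column operation concrete, here is a minimal plain-Java sketch. It deliberately does not use Tablesaw: columns are modeled as arrays, and the names ColumnOpsSketch, min, and plusDays are hypothetical stand-ins chosen for illustration. It shows a reduction to a single value, a scalar argument applied to every element, and a second column applied element by element.

```java
import java.time.LocalDate;
import java.util.Arrays;

// Plain-Java illustration of column-wise operations; not Tablesaw code.
final class ColumnOpsSketch {

    // Reduces a whole column to one value, like min() on a number column.
    static double min(double[] column) {
        return Arrays.stream(column).min().orElse(Double.NaN);
    }

    // Scalar argument: adds the same number of days to every element.
    static LocalDate[] plusDays(LocalDate[] column, long days) {
        LocalDate[] result = new LocalDate[column.length];
        for (int i = 0; i < column.length; i++) {
            result[i] = column[i].plusDays(days);
        }
        return result;
    }

    // Column argument: adds days[i] to dates[i], pairing elements in order.
    static LocalDate[] plusDays(LocalDate[] dates, int[] days) {
        if (dates.length != days.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        LocalDate[] result = new LocalDate[dates.length];
        for (int i = 0; i < dates.length; i++) {
            result[i] = dates[i].plusDays(days[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        // The first three rows of the sample table shown earlier.
        double[] approval = {53, 53, 58};
        LocalDate[] date = {
            LocalDate.parse("2004-02-04"),
            LocalDate.parse("2004-01-21"),
            LocalDate.parse("2004-01-07")
        };
        System.out.println(min(approval));
        System.out.println(Arrays.toString(plusDays(date, 4)));
        System.out.println(Arrays.toString(plusDays(date, new int[] {1, 2, 3})));
    }
}
```

In each case the caller writes no loop; the iteration lives inside the column operation, which is the point of the vector style.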

Boolean operations like isMonday() don’t return a boolean column directly, but a Selection instead. Selections can be used to filter tables by the values in their columns, so we’ll see them again:

Selection selection = table1.dateColumn("date").isMonday();

You can, of course, get a boolean column if you want it. You simply pass the Selection and the original column length to a BooleanColumn constructor, along with a name for the new column:

BooleanColumn mondays = new BooleanColumn("mondays", selection, table1.rowCount());
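One way to picture a Selection is as a set of matching row indices. The sketch below models that idea in plain Java with java.util.BitSet. This is an assumption made for illustration, not a description of Tablesaw's internals, and the names SelectionSketch, isMonday, and toBooleans are hypothetical. It also suggests why a length is needed when converting to booleans: a set of indices does not record how many rows the column had.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.BitSet;

// Plain-Java illustration of a selection as a set of row indices; not Tablesaw code.
final class SelectionSketch {

    // Marks the index of every row whose date falls on a Monday.
    static BitSet isMonday(LocalDate[] dates) {
        BitSet selected = new BitSet(dates.length);
        for (int i = 0; i < dates.length; i++) {
            if (dates[i].getDayOfWeek() == DayOfWeek.MONDAY) {
                selected.set(i);
            }
        }
        return selected;
    }

    // Expands the selected indices into one boolean per row.
    // Indices at or beyond length are ignored.
    static boolean[] toBooleans(BitSet selection, int length) {
        boolean[] result = new boolean[length];
        for (int i = selection.nextSetBit(0); i >= 0 && i < length; i = selection.nextSetBit(i + 1)) {
            result[i] = true;
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative dates, not taken from the data file.
        LocalDate[] dates = {
            LocalDate.parse("2004-02-02"),
            LocalDate.parse("2004-02-04"),
            LocalDate.parse("2004-02-09")
        };
        BitSet mondays = isMonday(dates);
        System.out.println(mondays);
        System.out.println(Arrays.toString(toBooleans(mondays, dates.length)));
    }
}
```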

There are hundreds of methods available for column manipulation, but let's turn now to tables. Operations exist for creating, describing, modifying, sorting, querying, and summarizing tables. Here we'll cover sorting, querying, and summarizing.

Queries

Queries apply a selection to a table and return a new filtered table. The method where() is what you want.

Usually, you will pass the query as a Selection to where(). Queries can be easily created:

NumberColumn approval = table1.numberColumn("approval");
Table highApproval = table1.where(approval.isGreaterThan(80));

Here we used the same kind of Selection object we saw earlier with columns. Because each column type produces Selections through its own methods, you can use column-specific logic to query a table:

DateColumn date = table1.dateColumn("date");
Table Q3 = table1.where(date.isInQ3());
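Conceptually, a query keeps the rows that a selection marks and copies them into a new table, leaving the original untouched. A rough plain-Java sketch of that idea follows, with columns as parallel arrays. The names WhereSketch, isGreaterThan, and where are hypothetical, and the values in main are made up for illustration rather than taken from the data file.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Plain-Java illustration of filtering rows by a selection; not Tablesaw code.
final class WhereSketch {

    // Returns the indices of rows whose value exceeds the threshold.
    static int[] isGreaterThan(int[] column, int threshold) {
        return IntStream.range(0, column.length)
                .filter(i -> column[i] > threshold)
                .toArray();
    }

    // Builds a new numeric column holding only the selected rows.
    static int[] where(int[] column, int[] selectedRows) {
        return Arrays.stream(selectedRows).map(i -> column[i]).toArray();
    }

    // Same filter for a text column, so both columns stay aligned row by row.
    static String[] where(String[] column, int[] selectedRows) {
        return Arrays.stream(selectedRows)
                .mapToObj(i -> column[i])
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // Made-up rows for illustration.
        int[] approval = {53, 85, 58, 90};
        String[] who = {"fox", "gallup", "fox", "zogby"};

        int[] high = isGreaterThan(approval, 80);
        System.out.println(Arrays.toString(where(approval, high)));
        System.out.println(Arrays.toString(where(who, high)));
    }
}
```

Applying one selection to every column is what keeps the filtered rows consistent across the new table.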

Sorting

The easiest way to sort a table is sortOn(). This code gets it done:

table1.sortOn("who", "approval");

Here “who” and “approval” are column names, and the sort is ascending. To sort in descending order, use sortDescendingOn().

To sort in mixed order, you can prepend a minus sign to a column name to indicate a descending sort on that column. For example, table1.sortOn("who", "-approval"); sorts on “who” in ascending order, and on “approval” in descending order.

Finally, you can write your own sort logic as an IntComparator, giving you full control over the ordering.
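The minus-sign convention can be sketched in plain Java as a chain of comparators, one per column name, where a leading "-" reverses only that column. This is an illustration of the idea under stated assumptions, not Tablesaw's implementation: SortSketch, Row, forColumn, and sortOn are hypothetical names, and rows are modeled as small objects rather than table rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration of mixed ascending/descending sorts; not Tablesaw code.
final class SortSketch {

    // A row reduced to the two columns used for sorting.
    static final class Row {
        final String who;
        final int approval;

        Row(String who, int approval) {
            this.who = who;
            this.approval = approval;
        }

        @Override
        public String toString() {
            return who + ":" + approval;
        }
    }

    // Ascending comparator for one column; only the two sample columns are known.
    static Comparator<Row> forColumn(String name) {
        switch (name) {
            case "who":
                return Comparator.comparing((Row r) -> r.who);
            case "approval":
                return Comparator.comparingInt((Row r) -> r.approval);
            default:
                throw new IllegalArgumentException("unknown column: " + name);
        }
    }

    // Chains the columns left to right; a leading '-' reverses that column only.
    static Comparator<Row> sortOn(String... columns) {
        Comparator<Row> result = (a, b) -> 0;
        for (String column : columns) {
            boolean descending = column.startsWith("-");
            Comparator<Row> next = forColumn(descending ? column.substring(1) : column);
            result = result.thenComparing(descending ? next.reversed() : next);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration.
        List<Row> rows = new ArrayList<>(Arrays.asList(
                new Row("fox", 53), new Row("zogby", 40), new Row("fox", 58)));
        rows.sort(sortOn("who", "-approval"));
        System.out.println(rows);
    }
}
```

Earlier columns take priority: later ones only break ties, which is why "who" groups the rows before "-approval" orders them within each group.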

Summarizing

Now, we’ll cover summarization techniques like pivot tables (cross tabs). If you want to simply calculate group statistics for a table, the summarize() method works nicely. There are a large number of statistics available, including range, as shown below.

Table summary = table1.summarize("approval", range).by("who");

BushApproval.csv summary
who      Range [approval] 
fox      42.0             
gallup   41.0            
newsweek 40.0             
time.cnn 37.0             
upenn    10.0             
zogby    37.0
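For readers who want to see what a grouped range amounts to, here is a small plain-Java sketch: it tracks the minimum and maximum per group and reports their difference. It is not Tablesaw code, the names SummarizeSketch and rangeBy are hypothetical, and the values in main are made up rather than drawn from the data file.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a grouped range statistic; not Tablesaw code.
final class SummarizeSketch {

    // Returns max - min of the values in each group, with groups in sorted order.
    static Map<String, Double> rangeBy(String[] groups, double[] values) {
        if (groups.length != values.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        // For each group, slot 0 holds the running minimum and slot 1 the maximum.
        Map<String, double[]> minMax = new TreeMap<>();
        for (int i = 0; i < groups.length; i++) {
            double v = values[i];
            double[] mm = minMax.get(groups[i]);
            if (mm == null) {
                minMax.put(groups[i], new double[] {v, v});
            } else {
                mm[0] = Math.min(mm[0], v);
                mm[1] = Math.max(mm[1], v);
            }
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, double[]> e : minMax.entrySet()) {
            result.put(e.getKey(), e.getValue()[1] - e.getValue()[0]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration.
        String[] who = {"fox", "gallup", "fox", "gallup", "fox"};
        double[] approval = {53, 60, 58, 49, 50};
        System.out.println(rangeBy(who, approval));
    }
}
```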

Cross tabs are useful for producing counts or frequencies of the number of observations in a combination of categories. First, let's get two categorical columns:

CategoryColumn who = table1.categoryColumn("who");
CategoryColumn month = date.month();
table1.addColumn(month);

Now, we can calculate the raw counts for each combination:

Table xtab = CrossTab.xTabCount(table1, month, who);

Crosstab Counts: date month x who
          fox gallup newsweek time.cnn upenn zogby total 
APRIL     6   10     3        1        0     3     23    
AUGUST    3   8      2        1        0     2     16    
DECEMBER  4   9      4        3        2     5     27    
FEBRUARY  7   9      4        4        1     4     29    
JANUARY   7   13     6        3        5     8     42    
JULY      6   9      4        3        0     4     26    
JUNE      6   11     1        1        0     4     23    
MARCH     5   12     4        3        0     6     30    
MAY       4   9      5        3        0     1     22    
NOVEMBER  4   9      6        3        1     1     24    
OCTOBER   7   10     8        2        1     3     31    
SEPTEMBER 5   10     8        3        0     4     30    
Total     64  119    55       30       10    45    323

If you prefer to see the relative frequency for each combination, pass your crosstab table to the tablePercents() method:

CrossTab.tablePercents(xtab);

and it will return a table showing the relative frequency of each cell:

Crosstab Table Proportions: 
          fox         gallup      newsweek     time.cnn     upenn        zogby        total       
APRIL     0.01857585  0.030959751 0.009287925  0.0030959751 0.0          0.009287925  0.071207434 
AUGUST    0.009287925 0.024767801 0.0061919503 0.0030959751 0.0          0.0061919503 0.049535602 
DECEMBER  0.012383901 0.027863776 0.012383901  0.009287925  0.0061919503 0.015479876  0.083591335 
FEBRUARY  0.021671826 0.027863776 0.012383901  0.012383901  0.0030959751 0.012383901  0.08978328  
JANUARY   0.021671826 0.04024768  0.01857585   0.009287925  0.015479876  0.024767801  0.13003096  
JULY      0.01857585  0.027863776 0.012383901  0.009287925  0.0          0.012383901  0.08049536  
JUNE      0.01857585  0.03405573  0.0030959751 0.0030959751 0.0          0.012383901  0.071207434 
MARCH     0.015479876 0.0371517   0.012383901  0.009287925  0.0          0.01857585   0.09287926  
MAY       0.012383901 0.027863776 0.015479876  0.009287925  0.0          0.0030959751 0.06811146  
NOVEMBER  0.012383901 0.027863776 0.01857585   0.009287925  0.0030959751 0.0030959751 0.0743034   
OCTOBER   0.021671826 0.030959751 0.024767801  0.0061919503 0.0030959751 0.009287925  0.095975235 
SEPTEMBER 0.015479876 0.030959751 0.024767801  0.009287925  0.0          0.012383901  0.09287926  
Total     0.19814241  0.36842105  0.17027864   0.09287926   0.030959751  0.13931888   1.0

There are similar methods for getting the row-wise or column-wise frequencies.
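The arithmetic behind both tables is simple enough to sketch in plain Java: count each pair of categories, then divide every cell by the grand total. The code below is an illustration only, not Tablesaw's implementation. CrossTabSketch and its method names are hypothetical, the values in main are made up, and unlike the tables above, combinations that never occur are simply absent rather than shown as zero.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of cross-tab counts and proportions; not Tablesaw code.
final class CrossTabSketch {

    // Counts how often each (row label, column label) pair occurs.
    static Map<String, Map<String, Integer>> counts(String[] rowLabels, String[] colLabels) {
        if (rowLabels.length != colLabels.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        Map<String, Map<String, Integer>> table = new TreeMap<>();
        for (int i = 0; i < rowLabels.length; i++) {
            table.computeIfAbsent(rowLabels[i], k -> new TreeMap<>())
                 .merge(colLabels[i], 1, Integer::sum);
        }
        return table;
    }

    // Divides every cell by the grand total, giving table-wide proportions.
    static Map<String, Map<String, Double>> tablePercents(Map<String, Map<String, Integer>> counts) {
        int total = 0;
        for (Map<String, Integer> row : counts.values()) {
            for (int cell : row.values()) {
                total += cell;
            }
        }
        Map<String, Map<String, Double>> result = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> row : counts.entrySet()) {
            Map<String, Double> out = new TreeMap<>();
            for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                out.put(cell.getKey(), cell.getValue() / (double) total);
            }
            result.put(row.getKey(), out);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up observations for illustration.
        String[] month = {"APRIL", "APRIL", "MAY", "APRIL"};
        String[] who = {"fox", "fox", "fox", "gallup"};
        Map<String, Map<String, Integer>> xtab = counts(month, who);
        System.out.println(xtab);
        System.out.println(tablePercents(xtab));
    }
}
```

The same division explains the real output above: the APRIL/fox cell holds 6 of 323 observations, and 6 / 323 is roughly 0.0186, which agrees with the proportion shown for that cell.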

What's Next?

There is much more to Tablesaw than we covered here, but I hope this encourages you to give it a try. As I mentioned, future posts will cover visualization, machine learning, and more. You can find the code on GitHub at https://github.com/jtablesaw/tablesaw .

Since you're a Java developer, consider taking a look at our contributor's page. Tablesaw is a work in progress. Help us make Java a great platform for data science.