Data Science for Java Developers With Tablesaw
Data science is one of the hottest areas in computing today. Most people learn data science using either Python or R. Both are excellent languages for crunching and analyzing data.
But many Java developers feel left behind. There are great Java libraries for machine learning, especially for jobs that require distributed computing, but there's no simple path for Java developers to learn and apply data science. By minimizing the number of things you need to learn, the open-source Tablesaw provides a gateway.
Think of Tablesaw as a Java power tool for data manipulation with hooks for interactive visualization, analytics, and machine learning. Used interactively or embedded in an application, its focus is to make data science as easy in Java as in R or Python. If you've done some data science, you may think of it as a data frame.
Tablesaw is easy to learn, but it's not a toy. Tables can be large, up to two billion rows. Performance is brisk: on my laptop, I can retrieve 500 records from a table of half a billion rows in two milliseconds. It is open source under the business-friendly Apache 2 license.
What Makes Tablesaw Beginner-Friendly?
- It builds on what you know: For Java developers who want to do data science, it's a huge advantage to not have to also learn a new language.
- It's easy to get started: Simply add Tablesaw as a Maven dependency for your project and you’re up and running. We’ll walk through an example below to show you how.
- It's not distributed: Unlike many machine learning libraries, Tablesaw is not a distributed system. This removes enormous complexity and makes machine learning accessible to those without deep engineering experience or support.
- The code is clear: There's a fluent API so you’ll understand your code the next time you read it.
- It provides fast feedback: Tablesaw is designed to be used interactively for exploratory analysis.
Introductory Example
Here, I’ll show you some of Tablesaw’s basic data manipulation features. Future posts will address visualization, machine learning, the Kotlin API and REPL, and the Tablesaw architecture. The code for this example can be found here.
Up and Running
To begin, create a Java project and add the Tablesaw core library as a Maven dependency. The current dependency is:
<!-- https://mvnrepository.com/artifact/tech.tablesaw/tablesaw-core -->
<dependency>
    <groupId>tech.tablesaw</groupId>
    <artifactId>tablesaw-core</artifactId>
    <version>0.23.3</version>
</dependency>
Next, create a class with a main method like so:
public class Foo {
    public static void main(String[] args) {
        // rest of code goes here
    }
}
The rest of our code will go in this method.
The first thing to do is add a table. Tablesaw can load data from relational databases, but we will create our table from a flat text file:
Table table1 = Table.read().csv("bush.csv");
Table objects can provide a lot of information:
table1.name();
returns bush.csv, since the table name defaults to the file name.
table1.shape();
returns 323 rows X 3 cols.
table1.structure();
returns a table of column metadata:
Index  Column Name  Column Type
0      date         LOCAL_DATE
1      approval     SHORT_INT
2      who          CATEGORY
Note that we've inferred the column types from the data.
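If you're curious how inference like this can work, here is a minimal sketch in plain Java. It is not Tablesaw's actual algorithm: the class name ColumnTypeGuess, the method guess, and the order in which types are tried are all my assumptions for illustration. The idea is to try the most specific type first and fall back to a category column when nothing narrower fits every value.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.List;

// Hypothetical illustration only; not part of Tablesaw.
final class ColumnTypeGuess {

    // Returns a type label for a column of raw text values,
    // trying the narrowest type first.
    static String guess(List<String> values) {
        if (values.stream().allMatch(ColumnTypeGuess::isIsoDate)) {
            return "LOCAL_DATE";
        }
        if (values.stream().allMatch(ColumnTypeGuess::isShort)) {
            return "SHORT_INT";
        }
        return "CATEGORY";
    }

    private static boolean isIsoDate(String s) {
        try {
            LocalDate.parse(s); // accepts ISO-8601 dates such as 2004-02-04
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    private static boolean isShort(String s) {
        try {
            Short.parseShort(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(guess(List.of("2004-02-04", "2004-01-21")));
        System.out.println(guess(List.of("53", "58")));
        System.out.println(guess(List.of("fox", "gallup")));
    }
}
```

A real implementation would also need to handle missing values, other date formats, and wider numeric types, all of which this sketch ignores.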
table1.first(3);
returns a new table containing only the first three rows.
BushApproval.csv
date approval who
2004-02-04 53 fox
2004-01-21 53 fox
2004-01-07 58 fox
Inevitably, we want to work with columns. Each has a data type, and usually, you’ll want it by that type and not as a generic column because typed columns have more power. For example, to get the approval column, you can use:
NumberColumn approval = table1.numberColumn("approval");
Each column sub-type supports numerous operations. As a rule, operations on a column are applied to every element without explicit loops. Some call these "vector operations." For example, operations like count(), min(), and contains() produce a single value for a column of data:
double min = approval.min();
Other operations return a new column. The method dayOfYear(), applied to a DateColumn, returns a short integer column in which each element is the day of the year, from 1 to 366.
Some column-returning operations take a scalar value as an argument:
dateColumn.plusDays(4);
This adds four days to every element. Others take a second column as an argument. These process the two columns in order, applying each integer value from the argument to the corresponding element in the receiver.
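To make the difference between a scalar argument and a column argument concrete, here is a rough sketch using plain arrays. The class VectorOps and both plusDays overloads are hypothetical stand-ins written for this post, not Tablesaw code; they only show the loops a column library runs for you.

```java
import java.time.LocalDate;
import java.util.Arrays;

// Hypothetical illustration only; plain arrays stand in for columns.
final class VectorOps {

    // Scalar argument: the same number of days is added to every element.
    static LocalDate[] plusDays(LocalDate[] column, int days) {
        LocalDate[] result = new LocalDate[column.length];
        for (int i = 0; i < column.length; i++) {
            result[i] = column[i].plusDays(days);
        }
        return result;
    }

    // Column argument: element i of the argument is applied to
    // element i of the receiver, so both must have the same length.
    static LocalDate[] plusDays(LocalDate[] column, int[] days) {
        if (column.length != days.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        LocalDate[] result = new LocalDate[column.length];
        for (int i = 0; i < column.length; i++) {
            result[i] = column[i].plusDays(days[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        LocalDate[] dates = {LocalDate.of(2004, 2, 4), LocalDate.of(2004, 1, 21)};
        System.out.println(Arrays.toString(plusDays(dates, 4)));
        System.out.println(Arrays.toString(plusDays(dates, new int[] {1, 10})));
    }
}
```

Both versions return a new array and leave the input alone, which mirrors the "returns a new column" behavior described above.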
Boolean operations like isMonday() don’t return a boolean column directly, but a Selection instead. Selections can be used to filter tables by the values in their columns, so we’ll see them again:
Selection selection = table1.dateColumn("date").isMonday();
You can, of course, get a boolean column if you want it. You simply pass the Selection and the original column length to a BooleanColumn constructor, along with a name for the new column:
BooleanColumn mondays = new BooleanColumn("mondays", selection, 1000);
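One way to picture a selection is as a compact set of matching row indexes. The sketch below models that idea with java.util.BitSet; it is my own illustration, not Tablesaw's Selection class, and the names SelectionSketch, isMonday, and toBooleans are hypothetical. It also suggests why a length is needed when converting to a boolean column: the set of matching rows alone does not say how many rows there were. The dates are made up for the example.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration only; a BitSet stands in for a selection.
final class SelectionSketch {

    // Records the index of every row whose date falls on a Monday.
    static BitSet isMonday(LocalDate[] dates) {
        BitSet selected = new BitSet(dates.length);
        for (int row = 0; row < dates.length; row++) {
            if (dates[row].getDayOfWeek() == DayOfWeek.MONDAY) {
                selected.set(row);
            }
        }
        return selected;
    }

    // Expands the selection to one boolean per row. The row count is supplied
    // by the caller because the bit set does not remember the column length.
    static boolean[] toBooleans(BitSet selection, int rowCount) {
        boolean[] result = new boolean[rowCount];
        for (int row = selection.nextSetBit(0);
                row >= 0 && row < rowCount;
                row = selection.nextSetBit(row + 1)) {
            result[row] = true;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up dates: a Monday, a Wednesday, and another Monday.
        LocalDate[] dates = {
            LocalDate.of(2004, 2, 2), LocalDate.of(2004, 2, 4), LocalDate.of(2004, 2, 9)
        };
        BitSet mondays = isMonday(dates);
        System.out.println(mondays);
        System.out.println(Arrays.toString(toBooleans(mondays, dates.length)));
    }
}
```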
There are hundreds of methods available for column manipulation, but let's turn now to tables. Operations exist for creating, describing, modifying, sorting, querying, and summarizing tables. Here we'll cover sorting, querying, and summarizing.
Queries
Queries apply a selection to a table and return a new filtered table. The method where() is what you want.
Usually, you will pass the query as a Selection to where(). Queries can be easily created:
NumberColumn approval = table1.numberColumn("approval");
Table highApproval = table1.where(approval.isGreaterThan(80));
Here you used the same kind of Selection objects we saw earlier with columns. Any column method that returns a Selection can be passed to a table's where() method, which lets you use column-specific logic to query a table:
DateColumn date = table1.dateColumn("date");
Table Q3 = table1.where(date.isInQ3());
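Conceptually, a query like the ones above has two steps: a column builds a selection, and the table copies the selected rows into a new table. Here is a small sketch of those two steps on a single array. The names WhereSketch, isGreaterThan, and where are borrowed for readability, but the code is a hypothetical stand-in rather than Tablesaw's implementation, and the values are invented.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration only; an int array stands in for a column.
final class WhereSketch {

    // Step 1: build a selection of the rows whose value exceeds the threshold.
    static BitSet isGreaterThan(int[] column, int threshold) {
        BitSet selected = new BitSet(column.length);
        for (int row = 0; row < column.length; row++) {
            if (column[row] > threshold) {
                selected.set(row);
            }
        }
        return selected;
    }

    // Step 2: copy only the selected rows into a new array,
    // leaving the original column untouched.
    static int[] where(int[] column, BitSet selection) {
        int[] result = new int[selection.cardinality()];
        int next = 0;
        for (int row = selection.nextSetBit(0); row >= 0; row = selection.nextSetBit(row + 1)) {
            result[next++] = column[row];
        }
        return result;
    }

    public static void main(String[] args) {
        int[] approval = {53, 88, 58, 90}; // made-up values
        int[] high = where(approval, isGreaterThan(approval, 80));
        System.out.println(Arrays.toString(high));
    }
}
```

Keeping the two steps separate is what lets one selection be reused to filter every column of a table in the same way.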
Sorting
The easiest way to sort a table is sortOn(). This code gets it done:
table1.sortOn("who", "approval");
Here "who" and "approval" are column names, and the sort is ascending. To sort in descending order, use sortDescendingOn().
To sort in mixed order, you can prepend a minus sign to a column name to indicate a descending sort on that column. For example, table1.sortOn("who", "-approval"); sorts on "who" in ascending order, and on "approval" in descending order.
Finally, you can write your own sort logic as an IntComparator, giving you full control over the ordering.
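The minus-sign convention is easy to model with standard Java comparators. The sketch below is my own illustration under stated assumptions: the Row class, the byColumn lookup, and this sortOn are hypothetical, and it compares row objects with java.util.Comparator rather than row indexes with an IntComparator. The rows are invented.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not Tablesaw's sorting code.
final class SortSketch {

    static final class Row {
        final String who;
        final int approval;

        Row(String who, int approval) {
            this.who = who;
            this.approval = approval;
        }
    }

    // Looks up the ascending comparator for a single column by name.
    private static Comparator<Row> byColumn(String name) {
        switch (name) {
            case "who":
                return Comparator.comparing((Row r) -> r.who);
            case "approval":
                return Comparator.comparingInt((Row r) -> r.approval);
            default:
                throw new IllegalArgumentException("unknown column: " + name);
        }
    }

    // Chains one comparator per column; a leading '-' reverses that column only.
    static Comparator<Row> comparatorFor(String... columns) {
        Comparator<Row> combined = (a, b) -> 0;
        for (String column : columns) {
            boolean descending = column.startsWith("-");
            Comparator<Row> next = byColumn(descending ? column.substring(1) : column);
            if (descending) {
                next = next.reversed();
            }
            combined = combined.thenComparing(next);
        }
        return combined;
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<Row> sortOn(List<Row> rows, String... columns) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(comparatorFor(columns));
        return sorted;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row("gallup", 60), new Row("fox", 53), new Row("fox", 58));
        for (Row r : sortOn(rows, "who", "-approval")) {
            System.out.println(r.who + " " + r.approval);
        }
    }
}
```

Passing your own Comparator to sorted.sort() in place of comparatorFor() is the plain-Java analogue of supplying custom sort logic.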
Summarizing
Now, we’ll cover summarization techniques like pivot tables (cross tabs). If you want to simply calculate group statistics for a table, the summarize() method works nicely. There are a large number of statistics available, including range, as shown below.
Table summary = table1.summarize("approval", range).by("who");
BushApproval.csv summary
who Range [approval]
fox 42.0
gallup 41.0
newsweek 40.0
time.cnn 37.0
upenn 10.0
zogby 37.0
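The range statistic here is simply the maximum minus the minimum within each group. As a rough sketch of the computation behind a summary like this one, the following plain-Java code groups values by key and tracks the minimum and maximum per group. GroupRange and rangeBy are hypothetical names, and the input values are invented rather than taken from the approval data.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Tablesaw's summarize() code.
final class GroupRange {

    // Returns max - min of the values for each group key, with keys in sorted order.
    static Map<String, Double> rangeBy(String[] groups, double[] values) {
        // For each group, slot 0 holds the running minimum and slot 1 the running maximum.
        Map<String, double[]> minMax = new TreeMap<>();
        for (int i = 0; i < groups.length; i++) {
            double v = values[i];
            double[] mm = minMax.computeIfAbsent(groups[i], k -> new double[] {v, v});
            mm[0] = Math.min(mm[0], v);
            mm[1] = Math.max(mm[1], v);
        }
        Map<String, Double> ranges = new TreeMap<>();
        minMax.forEach((group, mm) -> ranges.put(group, mm[1] - mm[0]));
        return ranges;
    }

    public static void main(String[] args) {
        String[] who = {"fox", "zogby", "fox", "zogby", "fox"}; // made-up rows
        double[] approval = {53, 50, 58, 49, 88};
        System.out.println(rangeBy(who, approval));
    }
}
```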
Cross tabs are useful for producing counts or frequencies of the number of observations in a combination of categories. First, let's get two categorical columns:
CategoryColumn who = table1.categoryColumn("who");
CategoryColumn month = date.month();
table1.addColumn(month);
Now, we can calculate the raw counts for each combination:
Table xtab = CrossTab.xTabCount(table1, month, who);
Crosstab Counts: date month x who
fox gallup newsweek time.cnn upenn zogby total
APRIL 6 10 3 1 0 3 23
AUGUST 3 8 2 1 0 2 16
DECEMBER 4 9 4 3 2 5 27
FEBRUARY 7 9 4 4 1 4 29
JANUARY 7 13 6 3 5 8 42
JULY 6 9 4 3 0 4 26
JUNE 6 11 1 1 0 4 23
MARCH 5 12 4 3 0 6 30
MAY 4 9 5 3 0 1 22
NOVEMBER 4 9 6 3 1 1 24
OCTOBER 7 10 8 2 1 3 31
SEPTEMBER 5 10 8 3 0 4 30
Total 64 119 55 30 10 45 323
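Under the hood, a count cross tab is a tally of how often each pair of categories occurs together. Here is a minimal sketch of that tally with nested maps. The class CrossTabSketch is hypothetical and its xTabCount is my own stand-in, not the Tablesaw method of the same name; the sample rows are invented, and the sketch omits the totals row and column.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; nested maps stand in for the result table.
final class CrossTabSketch {

    // counts.get(row).get(col) is the number of observations with that pair.
    // Pairs that never occur are simply absent rather than stored as 0.
    static Map<String, Map<String, Integer>> xTabCount(String[] rowKeys, String[] colKeys) {
        if (rowKeys.length != colKeys.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (int i = 0; i < rowKeys.length; i++) {
            counts.computeIfAbsent(rowKeys[i], k -> new TreeMap<>())
                  .merge(colKeys[i], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] month = {"APRIL", "APRIL", "MAY", "APRIL"}; // made-up rows
        String[] who = {"fox", "fox", "gallup", "gallup"};
        System.out.println(xTabCount(month, who));
    }
}
```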
If you prefer to see the relative frequency for each combination, pass your crosstab table to the tablePercents() method:
CrossTab.tablePercents(xtab);
and it will return a table showing the relative frequency of each cell:
Crosstab Table Proportions:
fox gallup newsweek time.cnn upenn zogby total
APRIL 0.01857585 0.030959751 0.009287925 0.0030959751 0.0 0.009287925 0.071207434
AUGUST 0.009287925 0.024767801 0.0061919503 0.0030959751 0.0 0.0061919503 0.049535602
DECEMBER 0.012383901 0.027863776 0.012383901 0.009287925 0.0061919503 0.015479876 0.083591335
FEBRUARY 0.021671826 0.027863776 0.012383901 0.012383901 0.0030959751 0.012383901 0.08978328
JANUARY 0.021671826 0.04024768 0.01857585 0.009287925 0.015479876 0.024767801 0.13003096
JULY 0.01857585 0.027863776 0.012383901 0.009287925 0.0 0.012383901 0.08049536
JUNE 0.01857585 0.03405573 0.0030959751 0.0030959751 0.0 0.012383901 0.071207434
MARCH 0.015479876 0.0371517 0.012383901 0.009287925 0.0 0.01857585 0.09287926
MAY 0.012383901 0.027863776 0.015479876 0.009287925 0.0 0.0030959751 0.06811146
NOVEMBER 0.012383901 0.027863776 0.01857585 0.009287925 0.0030959751 0.0030959751 0.0743034
OCTOBER 0.021671826 0.030959751 0.024767801 0.0061919503 0.0030959751 0.009287925 0.095975235
SEPTEMBER 0.015479876 0.030959751 0.024767801 0.009287925 0.0 0.012383901 0.09287926
Total 0.19814241 0.36842105 0.17027864 0.09287926 0.030959751 0.13931888 1.0
There are similar methods for getting the row-wise or column-wise frequencies.
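The difference between table-wide and row-wise frequencies comes down to the denominator. The sketch below shows both on a small made-up matrix of counts. The class Proportions is hypothetical, its tablePercents is my own stand-in for the idea rather than Tablesaw's code, and rowPercents is a name I am assuming for the row-wise variant.

```java
import java.util.Arrays;

// Hypothetical illustration only; a 2D array stands in for the crosstab.
final class Proportions {

    // Each cell divided by the grand total, so all cells together sum to 1.
    static double[][] tablePercents(int[][] counts) {
        long total = 0;
        for (int[] row : counts) {
            for (int cell : row) {
                total += cell;
            }
        }
        double[][] result = new double[counts.length][];
        for (int r = 0; r < counts.length; r++) {
            result[r] = new double[counts[r].length];
            for (int c = 0; c < counts[r].length; c++) {
                result[r][c] = (double) counts[r][c] / total;
            }
        }
        return result;
    }

    // Each cell divided by its own row total, so every row sums to 1.
    static double[][] rowPercents(int[][] counts) {
        double[][] result = new double[counts.length][];
        for (int r = 0; r < counts.length; r++) {
            long rowTotal = 0;
            for (int cell : counts[r]) {
                rowTotal += cell;
            }
            result[r] = new double[counts[r].length];
            for (int c = 0; c < counts[r].length; c++) {
                result[r][c] = (double) counts[r][c] / rowTotal;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] counts = {{1, 3}, {2, 2}}; // made-up counts
        System.out.println(Arrays.deepToString(tablePercents(counts)));
        System.out.println(Arrays.deepToString(rowPercents(counts)));
    }
}
```

Column-wise frequencies follow the same pattern with column totals as the denominator. Note that this sketch does not guard against an empty table or an all-zero row, where the division would produce NaN.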
What's Next?
There is much more to cover, but I hope this encourages you to give Tablesaw a try. As I mentioned, future posts will cover visualization, machine learning, and more. You can find the code on GitHub at https://github.com/jtablesaw/tablesaw .
Since you're a Java developer, consider taking a look at our contributor's page. Tablesaw is a work in progress. Help us make Java a great platform for data science.