Web Scraping in Java Using jsoup and OkHttp

Web scraping is a fundamental skill for collecting data and automating tasks. The following examples show how we scrape sites such as WrapBootstrap and ThemeForest to populate the HTML/CSS Theme Templates page. We use jsoup for DOM parsing and OkHttp for HTTP. Although jsoup can make HTTP requests itself, we prefer to stick with OkHttp in case we need anything more complex than a simple GET request, such as special headers or cookies. Why learn two HTTP clients when one will do?

Model/POJO

We like to start simple, so we gather only four fields: title, URL, image URL, and the number of downloads, if available.

public class HtmlCssTheme {
    private final String title;
    private final String url;
    private final String imageUrl;
    private final int downloads;
    public HtmlCssTheme(String title, String url, String imageUrl, int downloads) {
        super();
        this.title = title;
        this.url = url;
        this.imageUrl = imageUrl;
        this.downloads = downloads;
    }
    public String getTitle() {
        return title;
    }
    public String getUrl() {
        return url;
    }
    public String getImageUrl() {
        return imageUrl;
    }
    public int getDownloads() {
        return downloads;
    }
}

Jsoup Scraper

Our scraper is fairly simple. All it needs to do is make a single GET request and extract the data we are interested in. We use Failsafe for retry logic and jOOλ for a simplified streaming API. Setting up OkHttpClient logging interceptors is very useful for tracking down bugs. Only the WrapBootstrap scraper is shown; the rest can be found here.

public class WrapBootstrapScraper {
    private static final Logger log = LoggerFactory.getLogger(WrapBootstrapScraper.class);
    private static final String affilaiteCode = "stubbornjava";
    private static final String WRAP_BOOTSTRAP_HOST = "https://wrapbootstrap.com";
    private static final String POPULAR_THEMES_URL =
        "https://wrapbootstrap.com/themes/page.1/sort.sales/order.desc";
    private static final OkHttpClient client = HttpClient.globalClient();

    public static List<HtmlCssTheme> popularThemes() {
        HttpUrl url = HttpUrl.parse(POPULAR_THEMES_URL);
        Request request = new Request.Builder().url(url).get().build();
        // Retry until the response is successful (code >= 200 && code < 300)
        String html = Retry.retryUntilSuccessfulWithBackoff(
            () -> client.newCall(request).execute()
        );

        // Select all the elements with the given CSS selector.
        Elements elements = Jsoup.parse(html).select("#themes .item");
        List<HtmlCssTheme> themes = Seq.seq(elements)
                                       .map(WrapBootstrapScraper::themeFromElement)
                                       .toList();

        return themes;
    }

    /*
     * Parse out the data from each Element
     */
    private static HtmlCssTheme themeFromElement(Element element) {
        Element titleElement = element.select(".item_head h2 a").first();
        String title = titleElement.text();
        String url = HttpUrl.parse(WRAP_BOOTSTRAP_HOST + titleElement.attr("href"))
                            .newBuilder()
                            .addQueryParameter("ref", affilaiteCode)
                            .build().toString();
        String imageUrl = element.select(".image img").attr("src");
        int downloads = Optional.of(element.select(".item_foot .purchases").text())
                                .filter(val -> !Strings.isNullOrEmpty(val))
                                .map(Integer::parseInt)
                                .orElse(0);
        return new HtmlCssTheme(title, url, imageUrl, downloads);
    }

    /*
     * Main methods everywhere! Very convenient for quick ad hoc
     * testing without spinning up an entire application.
     */
    public static void main(String[] args) {
        List<HtmlCssTheme> themes = popularThemes();
        log.debug(Json.serializer().toPrettyString(themes));
    }
}
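The `Retry.retryUntilSuccessfulWithBackoff` helper used above is not shown in this post. As a rough illustration of the idea only, here is a minimal JDK-only sketch of retrying a call with exponential backoff. The class name `BackoffRetry`, its parameters, and the doubling delay are all assumptions made for this example; this is neither the Failsafe API nor the author's helper.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

/** Hypothetical helper for illustration; not the post's Retry class and not Failsafe. */
final class BackoffRetry {
    private BackoffRetry() {}

    /**
     * Runs the call until isSuccess accepts its result or maxAttempts is reached.
     * The wait between attempts starts at initialDelay and doubles each time.
     */
    static <T> T retry(Callable<T> call, Predicate<T> isSuccess,
                       int maxAttempts, Duration initialDelay) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMillis = initialDelay.toMillis();
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                T result = call.call();
                if (isSuccess.test(result)) {
                    return result;
                }
                lastFailure = new IllegalStateException(
                    "Unsuccessful result on attempt " + attempt);
            } catch (Exception ex) {
                // Remember the failure and try again if attempts remain.
                lastFailure = ex;
            }
            if (attempt < maxAttempts) {
                sleep(delayMillis);
                delayMillis *= 2; // exponential backoff
            }
        }
        throw new IllegalStateException(
            "Gave up after " + maxAttempts + " attempts", lastFailure);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ex) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", ex);
        }
    }
}
```

With a helper like this, a scraper could pass the HTTP call as the `Callable` and a check such as `code >= 200 && code < 300` as the predicate. A production setup would more likely lean on Failsafe, as the scraper above does, which also covers jitter and maximum delays.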

Theme Service Layer

Our naming convention for the service layer is generally just the pluralized model name. Callers don't care how the service gets the data, as long as it gets it. We cache the scrapers' results so we don't upset the websites' maintainers. In an ideal world, we might periodically scrape and store the data in our own database.

public class Themes {
    private static final Logger log = LoggerFactory.getLogger(Themes.class);
    private Themes() {}

    // A list of all the theme websites we currently support.
    private static final List<Supplier<List<HtmlCssTheme>>> suppliers = Lists.newArrayList(
        BootstrapBayScraper::popularThemes,
        TemplateMonsterScraper::popularThemes,
        ThemeForestScraper::popularThemes,
        WrapBootstrapScraper::popularThemes
    );

    // Sort by downloads descending, then by title ascending.
    private static final Comparator<HtmlCssTheme> popularSort =
        Comparator.comparing(HtmlCssTheme::getDownloads).reversed()
                  .thenComparing(HtmlCssTheme::getTitle);

    // Fetch all themes and sort them together.
    private static final List<HtmlCssTheme> fetchPopularThemes() {
        return Seq.seq(suppliers)
                  .map(sup -> {
                    /*
                     *  If one fails we don't want them all to fail.
                     *  This can be handled better but good enough for now.
                     */
                    try {
                        return sup.get();
                    } catch (Exception ex) {
                        log.warn("Error fetching themes", ex);
                        return Lists.<HtmlCssTheme>newArrayList();
                    }
                  })
                  .flatMap(List::stream)
                  .sorted(popularSort)
                  .toList();
    }

    /*
     *  Fetch all themes and cache for 4 hours. It takes a little time
     *  to scrape all the sites. We also want to be nice and not spam the sites.
     */
    private static final Supplier<List<HtmlCssTheme>> themesSupplier =
        Suppliers.memoizeWithExpiration(Themes::fetchPopularThemes, 4L, TimeUnit.HOURS);

    public static List<HtmlCssTheme> getPopularThemes(int num) {
        return Seq.seq(themesSupplier.get()).limit(num).toList();
    }

    public static void main(String[] args) {
        List<HtmlCssTheme> themes = getPopularThemes(50);
        log.debug(Json.serializer().toPrettyString(themes));
    }
}
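The caching above relies on Guava's `Suppliers.memoizeWithExpiration`. To show roughly what that kind of supplier does, here is a simplified JDK-only stand-in. The class `ExpiringMemoizer` and its injectable clock are assumptions made for this sketch (the clock exists only to make expiry easy to test); it is not Guava's implementation.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified stand-in for an expiring memoizing supplier. */
final class ExpiringMemoizer<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long ttlNanos;
    private final LongSupplier nanoClock;

    private T value;
    private long expiresAtNanos;
    private boolean loaded;

    ExpiringMemoizer(Supplier<T> delegate, long duration, TimeUnit unit) {
        this(delegate, duration, unit, System::nanoTime);
    }

    // The clock parameter is an assumption of this sketch, added for testability.
    ExpiringMemoizer(Supplier<T> delegate, long duration, TimeUnit unit,
                     LongSupplier nanoClock) {
        this.delegate = delegate;
        this.ttlNanos = unit.toNanos(duration);
        this.nanoClock = nanoClock;
    }

    @Override
    public synchronized T get() {
        long now = nanoClock.getAsLong();
        // Reload on first use or once the cached value has expired.
        if (!loaded || now - expiresAtNanos >= 0) {
            value = delegate.get();
            expiresAtNanos = now + ttlNanos;
            loaded = true;
        }
        return value;
    }
}
```

Because the state is only updated after `delegate.get()` returns, a delegate that throws leaves any previous value in place and the next call simply tries again. Synchronizing `get()` keeps concurrent callers from triggering several scrapes at once, at the cost of blocking them while one reload runs.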

Theme Routes

Now we simply create a custom HttpHandler and pass the themes along to the HTML template.

public class ThemeRoutes {

    public static void popularThemes(HttpServerExchange exchange) {
        List<HtmlCssTheme> themes = Themes.getPopularThemes(96);
        int year = LocalDate.now().getYear();
        Response response = Response.fromExchange(exchange)
            .with("year", year)
            .with("themes", themes)
            .withLibCounts()
            .withRecentPosts();
        Exchange.body().sendHtmlTemplate(exchange, "templates/src/pages/popular-themes", response);
    }
}

Finally, we hook the handler into our router, and we have a functioning HTML/CSS Theme Templates page.
