Data Analysis for Live Streaming: What Happens in Real Time Is Analyzed in Real Time?

As live streaming emerges as a way of doing business, the need for data analysis follows up.

What's Different About Data Analytics in Live Streaming?

Live streaming is one typical use case for real-time data analysis because it stresses speed. Livestream organizers need to keep abreast of the latest data to see what is happening and maximize effectiveness. To realize that requires high efficiency in every step of data processing:

The rest of this post is about how a live-streaming service provider with 800 million end users found the right database to support its analytic solution.

Simplify the Components

In this case, the live streaming data analytic platform adopts the Lambda architecture, which consists of a batch processing pipeline and a streaming pipeline, the former for user profile information and the latter for real-time generated data, including metrics like real-time subscription, visitor count, comments, and responses. 

Old Architecture

The real-time metrics will be combined with the user profile information to form a flat table, and Elasticsearch will work as the query engine.

As their business burgeons, the expanding data size becomes unbearable for this platform, with problems like:

So, to sum up, the main problem for this architecture is its complexity. To reduce the components means to find a database that is not only capable of most workloads but also performant in data writing and queries. After six months of testing, they finally upgraded their live-streaming analytic platform with Apache Doris. 

They converge the streaming and batch-processing pipelines at Apache Doris. It can undertake analytic workloads and also provides a storage layer so data doesn't have to shuffle back to Elasticsearch and HBase as it did in the old architecture.

With Apache Doris as the data warehouse, the platform architecture becomes neater.

New Architecture

The above shows how Apache Doris speeds up the entire data processing pipeline with its all-in-one capabilities. Beyond that, it has some delightful features that can increase query efficiency and ensure service reliability in the case of live streaming.  

Disaster Recovery

The last thing you want in live streaming is service breakdown, so disaster recovery is necessary.

Before the live streaming platform had Apache Doris in place, they only backed up their data to object storage. It took an hour from when a failure was reported to when it was fixed. That one-hour window is fatal for live commerce because viewers will leave immediately. Thus, disaster recovery must be quick.

Now, with Apache Doris, they have a dual-cluster solution: a primary cluster and a backup cluster. This is for hot backup. Besides that, they have a cold backup plan, which is the same as what they did: backing up their everyday data to object storage via Backup and Freeze policies.

This is how they do the hot backup before Apache Doris 2.0: 

Apache Doris 2.0 supports Cross Cluster Replication (CCR), which can automate the above processes to reduce maintenance costs and inconsistency risks due to human factors.

Data Visualization

In addition to reporting, dashboarding, and ad-hoc queries, the platform also allows analysts to configure various data sources to produce their own visualized data lists. 

Apache Doris is compatible with most BI tools on the market, so the platform developers can tap on that and provide a broader set of functionalities for live streamers.

Also, built on the real-time capabilities and quick computation of Apache Doris, live streams can view data and see what happens in real time instead of waiting for a day for data analysis.

Bitmap Index to Accelerate Tag Queries

A big part of data analysis in live streaming is viewer profiling. Viewers are divided into groups based on their online footprint. They are given tags like "watched for over one minute" and "visited during the last minute". As the show goes on, viewers are constantly tagged and untagged. In the data warehouse, it means frequent data insertion and deletion. Plus, one viewer is given multiple tags. To gain an overall understanding of users entails joining queries, which is why the join performance of the data warehouse is important. 

The following snippets give you a general idea of how to tag users and conduct tag queries in Apache Doris.

Create a Tag Table

A tag table lists all the tags that are given to the viewers and maps the tags to the corresponding viewer ID.

 
create table db.tags (  
u_id string,  
version string,  
tags string
) with (  
'connector' = 'doris',  
'fenodes' = '',  
'table.identifier' = 'tags',  
'username' = '',  
'password' = '',  
'sink.properties.format' = 'json',  
'sink.properties.strip_outer_array' = 'true',  
'sink.properties.fuzzy_parse' = 'true',  
'sink.properties.columns' = 'id,u_id,version,a_tags,m_tags,a_tags=bitmap_from_string(a_tags),m_tags=bitmap_from_string(m_tags)',  
'sink.batch.interval' = '10s',  
'sink.batch.size' = '100000' 
);


Create a Tag Version Table

The tag table is constantly changing, so there are different versions of it as time goes by.

 
create table db.tags_version (  
id string,  
u_id string,  
version string  
) with (  
'connector' = 'doris',  
'fenodes' = '',  
'table.identifier' = 'db.tags_version',  
'username' = '',  
'password' = '',  
'sink.properties.format' = 'json',  
'sink.properties.strip_outer_array' = 'true',  
'sink.properties.fuzzy_parse' = 'true',  
'sink.properties.columns' = 'id,u_id,version',  
'sink.batch.interval' = '10s',  
'sink.batch.size' = '100000'  
);


Write Data Into Tag Table and Tag Version Table

 
insert into db.tags
select
u_id,  
last_timestamp as version,
tags
from db.source;  
 
insert into rtime_db.tags_version
select 
u_id,  
last_timestamp as version
from db.source;


Tag Queries Accelerated by Bitmap Index

For example, analysts need to find out the latest tags related to a certain viewer with the last name Thomas. Apache Doris will run the LIKE operator in the user information table to find all "Thomas." Then, it creates bitmap indexes for the tags. Lastly, it relates all user information tables, tag tables, and tag version tables to return the result.

Of almost a billion viewers, and each of them has over a thousand tags, the bitmap index can help reduce the query response time to less than one second.

 
with t_user as (
   select 
          u_id,
          name
   from db.user
   where partition_id = 1
   and name like '%Thomas%'
),

t_tags as (
        select 
                u_id, 
                version
        from db.tags
        where (
                  bitmap_and_count(a_tags, bitmap_from_string("123,124,125,126,333")) > 0 
          )
),

t_tag_version as (
        select id, u_id, version
        from db.tags_version
)

select 
  t1.u_id
  t1.name
from t_user t1
join t_tags t2 on t1.u_id = t2.u_id
join t_tag_version t3 on t2.u_id = t3.u_id and t2.version = t3.version
order by t1.u_id desc
limit 1,10;


Conclusion

Data analysis in live streaming is challenging for the underlying database, but it is also where the key competitiveness of Apache Doris comes into play. First of all, Apache Doris can handle most data processing workloads, so platform builders don't have to worry about putting many components together and consequential maintenance issues. Secondly, it has a lot of query-accelerating features, including but not limited to indexes. After tackling the speed issues, the Apache Doris developer community has been exploring its boundaries, such as introducing a more efficient cost-based query optimizer in version 2.0 and an inverted index for text searches, fuzzy queries, and range queries. These features are embraced by the live streaming service provider as they are actively testing them and planning to transfer their log analytic workloads to Apache Doris, too. 

 

 

 

 

Top