How to Perform Joins in Apache Hive
We will be working with two tables — customer and orders — that we imported in my sqoop import article, and we'll perform the following joins:
- INNER JOIN – Select records that have matching values in both tables.
- LEFT JOIN (LEFT OUTER JOIN) – Returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching join predicate
- RIGHT JOIN (RIGHT OUTER JOIN) A RIGHT JOIN returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate
- FULL JOIN (FULL OUTER JOIN) – Selects all records that match either left or right table records.
- LEFT SEMI JOIN: Only returns the records from the left-hand table. Hive doesn’t support IN subqueries so you can’t do
SELECT * FROM TABLE_A WHERE TABLE_A.ID IN (SELECT ID FROM TABLE_B);
Customer Table
Hive Tip: to print column headers in command line:
hive> set hive.cli.print.header=true;
hive> select * from customers;
OK
customers.id customers.name
1 John
2 Kevin
19 Alex
3 Mark
4 Jenna
5 Robert
6 Zoya
7 Sam
8 George
9 Peter
Orders Table:
hive> select * from orders;
OK
order_id orders.order_date orders.customer_id orders.amount
101 2016-01-01 7 3540
102 2016-03-01 1 240
103 2016-03-02 6 2340
104 2016-02-12 3 5000
105 2016-02-12 3 5500
106 2016-02-14 9 3005
107 2016-02-14 1 20
108 2016-02-29 2 2000
109 2016-02-29 3 2500
110 2016-02-27 1 200
INNER JOIN
Select records that have matching values in both tables.
hive> select c.id, c.name, o.order_date, o.amount from customers c inner join orders o ON (c.id = o.customer_id);
Output
c.id c.name o.order_date o.amount
7 Sam 2016-01-01 3540
1 John 2016-03-01 240
6 Zoya 2016-03-02 2340
3 Mark 2016-02-12 5000
3 Mark 2016-02-12 5500
9 Peter 2016-02-14 3005
1 John 2016-02-14 20
2 Kevin 2016-02-29 2000
3 Mark 2016-02-29 2500
1 John 2016-02-27 200
LEFT JOIN (LEFT OUTER JOIN)
Returns all the values from the left table, plus the matched values from the right table, or NULL in case of no matching join predicate
hive> select c.id, c.name, o.order_date, o.amount from customers c left outer join orders o ON (c.id = o.customer_id);
Output
c.id c.name o.order_date o.amount
1 John 2016-03-01 240
1 John 2016-02-14 20
1 John 2016-02-27 200
2 Kevin 2016-02-29 2000
19 Alex NULL NULL
3 Mark 2016-02-12 5000
3 Mark 2016-02-12 5500
3 Mark 2016-02-29 2500
4 Jenna NULL NULL
5 Robert NULL NULL
6 Zoya 2016-03-02 2340
7 Sam 2016-01-01 3540
8 George NULL NULL
9 Peter 2016-02-14 3005
Time taken: 40.462 seconds, Fetched: 14 row(s)
RIGHT JOIN (RIGHT OUTER JOIN)
Returns all the values from the right table, plus the matched values from the left table, or NULL in case of no matching join predicate
hive> select c.id, c.name, o.order_date, o.amount from customers c left outer join orders o ON (c.id = o.customer_id);
Output
c.id c.name o.order_date o.amount
7 Sam 2016-01-01 3540
1 John 2016-03-01 240
6 Zoya 2016-03-02 2340
3 Mark 2016-02-12 5000
3 Mark 2016-02-12 5500
9 Peter 2016-02-14 3005
1 John 2016-02-14 20
2 Kevin 2016-02-29 2000
3 Mark 2016-02-29 2500
1 John 2016-02-27 200
FULL JOIN (FULL OUTER JOIN)
Selects all records that match either left or right table records.
hive> select c.id, c.name, o.order_date, o.amount from customers c full outer join orders o ON (c.id = o.customer_id);
Output
c.id c.name o.order_date o.amount
1 John 2016-02-27 200
1 John 2016-02-14 20
1 John 2016-03-01 240
19 Alex NULL NULL
2 Kevin 2016-02-29 2000
3 Mark 2016-02-29 2500
3 Mark 2016-02-12 5500
3 Mark 2016-02-12 5000
4 Jenna NULL NULL
5 Robert NULL NULL
6 Zoya 2016-03-02 2340
7 Sam 2016-01-01 3540
8 George NULL NULL
9 Peter 2016-02-14 3005
LEFT SEMI JOIN
Find all the customers where at least one order exist or find all customer who has placed an order.
hive> select * from customers left semi join orders ON
(customers.id = orders.customer_id);
OUTPUT
customers.id customers.name
1 John
2 Kevin
3 Mark
6 Zoya
7 Sam
9 Peter
Time taken: 56.362 seconds, Fetched: 6 row(s)
That’s it for this article. I hope you find this useful, thank you for reading!