Learning Complex SQL Query Example
Daily Order Analysis with Postgresql
credit - https://www.databasestar.com/complex-sql-query-example/
01 Setup
Setup the database in postgres using the following scripts in order https://github.com/vijayanandrp/blog/tree/main/sql/postgres_db_setup
02 Goal
Write query for the below requirement.
- the order date
- the number of orders for that date
- the number of books ordered
- the total price of the orders
- the running total of books for the month
- the number of books from the same day last week (e.g. how this Sunday compares to last Sunday)
Note the use only single quotes for string representation
03 Order Date
Start with a simple query to see the order date.
select order_datefrom cust_order coorder by order_date desc;
04 Show Number of Orders Per Date
Whenever we need to see something like “the number of X per Y”, it usually means we need to use an aggregate function and group the results.
-- Wrong Basically We need to change the query to show only the date and not the time.select order_date, count(*)from cust_order cogroup by order_dateorder by order_date asc;
-- Right oneselect to_char(order_date, 'YYYY-MM-DD') as order_day, count(order_id) as num_ordersfrom cust_ordergroup by to_char(order_date, 'YYYY-MM-DD')order by order_day asc
05 Add Number of Books Ordered
select * from cust_order co limit 10;select * from order_line ol limit 10;
select count(distinct order_id) as num_orders, count(1) as total_records from cust_order co limit 10;select count(distinct order_id) as num_orders, count(1) as total_records from order_line co limit 10;
select 'inner' as join_type, count(1) as total_records from cust_order co inner join order_line ol on ol.order_id = co.order_idunion allselect 'left' as join_type, count(1) as total_records from cust_order co left join order_line ol on ol.order_id = co.order_idunion allselect 'right' as join_type, count(1) as total_records from cust_order co right join order_line ol on ol.order_id = co.order_id
select to_char(co.order_date, 'YYYY-MM-DD') as order_day , count(distinct co.order_id) as num_orders, count(distinct ol.book_id) as num_booksfrom cust_order coinner join order_line olon ol.order_id = co.order_idgroup by to_char(co.order_date, 'YYYY-MM-DD')order by order_day asc
06 Add Total Price of the Books
select to_char(co.order_date, 'YYYY-MM-DD') as order_day , count(distinct co.order_id) as num_orders, count(ol.book_id) as num_books, sum(ol.price) as total_pricefrom cust_order coinner join order_line olon ol.order_id = co.order_idgroup by to_char(co.order_date, 'YYYY-MM-DD')order by order_day asc
07 Add Running Total
select to_char(co.order_date, 'YYYY-MM') as order_month, to_char(co.order_date, 'YYYY-MM-DD') as order_day , count(distinct co.order_id) as num_orders, count(ol.book_id) as num_books, sum(ol.price) as total_price, sum(count(ol.book_id)) over ( partition by to_char(co.order_date, 'YYYY-MM') order by to_char(co.order_date, 'YYYY-MM-DD')) as running_total_num_booksfrom cust_order coinner join order_line ol on ol.order_id = co.order_idgroup by to_char(co.order_date, 'YYYY-MM') , to_char(co.order_date, 'YYYY-MM-DD')order by order_day asc
08 Reformat Query
select order_month , order_day, count(distinct order_id) as num_orders , count(book_id) as num_books, sum(price) as total_price, sum(count(book_id)) over (partition by order_month order by order_day) as running_total_num_booksfrom ( select to_char(co.order_date, 'YYYY-MM') as order_month, to_char(co.order_date, 'YYYY-MM-DD') as order_day, co.order_id , ol.price, ol.book_id from cust_order co inner join order_line ol on co.order_id = ol.order_id) subgroup by order_month , order_dayorder by order_day asc
09 Add Number from Last Week
LEAD lets you get data from rows further down (below) in the results,
LAG lets you get data from rows further up (above) in the results.
As we’re ordering by order date in ascending order,
the earlier orders are further up the results, so we can use LAG.
We use LAG to get a value from an order date from one week ago, or seven days.
LAG is used as an analytic function, so we specify the OVER clause and the ORDER BY clause to define how the data is searched.
select order_month , order_day, count(distinct order_id) as num_orders , sum(price) as total_price, count(book_id) as num_books, sum(count(book_id)) over (partition by order_month order by order_day) as running_total_num_books, lag(count(book_id), 7) over (order by order_day) as prev_booksfrom ( select to_char(co.order_date, 'YYYY-MM') as order_month, to_char(co.order_date, 'YYYY-MM-DD') as order_day, co.order_id , ol.price, ol.book_id from cust_order co inner join order_line ol on co.order_id = ol.order_id) subgroup by order_month , order_dayorder by order_day asc
10 Issue with Query
The LAG function will look at rows that are 7 rows earlier than the current row. Ideally, this means that it’s 7 days in the past.
However, the order_date values come from the cust_order table, and there is no guarantee that there will be an order every day.
LAG Function won't identify the missing days in past.
-- Calendar tablecreate table calendar_days ( calendar_date DATE, calendar_year INT, calendar_month INT, calendar_day INT, calendar_dayname VARCHAR(20));
create or replace procedure filldates(dateStart date, dateEnd date)language plpgsqlas $$begin while dateStart <= dateEnd loop truncate table calendar_days; insert into calendar_days (calendar_date, calendar_year, calendar_month, calendar_day, calendar_dayname) values ( dateStart, to_char(dateStart, 'YYYY')::int, to_char(dateStart, 'MM')::int, to_char(dateStart, 'DD')::int, to_char(dateStart, 'Day')::varchar); dateStart := to_char(dateStart + '1 day'::interval, 'YYYY-MM-DD')::date; end loop; commit;end;$$
truncate table calendar_days;
call filldates('2018-01-01', current_date);
select * from calendar_days;
/* UPDATED QUERY */
select c.calendar_date, c.calendar_year, c.calendar_month, c.calendar_dayname, count(distinct sub.order_id) as num_orders , sum(sub.price) as total_price, count(sub.book_id) as num_books, sum(count(sub.book_id)) over ( partition by c.calendar_year, c.calendar_month order by c.calendar_date ) as running_total_num_books, lag(count(sub.book_id), 7) over ( order by c.calendar_date ) as prev_booksfrom calendar_days cleft join ( select to_char(co.order_date, 'YYYY-MM') as order_month, to_char(co.order_date, 'YYYY-MM-DD')::date as order_day, co.order_id , ol.price, ol.book_id from cust_order co inner join order_line ol on co.order_id = ol.order_id) sub on sub.order_day = c.calendar_dategroup by c.calendar_date, c.calendar_year, c.calendar_month, c.calendar_daynamehaving count(distinct sub.order_id) > 0order by c.calendar_date asc
11 Analyse and Improve Performance
We can run an Explain Plan on this query and see what it shows.
- Create indexes
CREATE INDEX idx_calendar_date ON calendar_days(calendar_date);CREATE INDEX idx_ol_orderid ON order_line(order_id);
-- Optimize 1 to remove date formatselect c.calendar_date, c.calendar_year, c.calendar_month, c.calendar_dayname, count(distinct sub.order_id) as num_orders , sum(sub.price) as total_price, count(sub.book_id) as num_books, sum(count(sub.book_id)) over ( partition by c.calendar_year, c.calendar_month order by c.calendar_date ) as running_total_num_books, lag(count(sub.book_id), 7) over ( order by c.calendar_date ) as prev_booksfrom calendar_days cleft join ( select co.order_date, co.order_id , ol.price, ol.book_id from cust_order co inner join order_line ol on co.order_id = ol.order_id) sub on sub.order_date::date = c.calendar_dategroup by c.calendar_date, c.calendar_year, c.calendar_month, c.calendar_daynamehaving count(distinct sub.order_id) > 0order by c.calendar_date asc
-- Optimize 2 remove the sub queryselect c.calendar_date, c.calendar_year, c.calendar_month, c.calendar_dayname, count(distinct co.order_id) as num_orders , sum(ol.price) as total_price, count(ol.book_id) as num_books, sum(count(ol.book_id)) over ( partition by c.calendar_year, c.calendar_month order by c.calendar_date ) as running_total_num_books, lag(count(ol.book_id), 7) over ( order by c.calendar_date ) as prev_booksfrom calendar_days cleft join cust_order co on c.calendar_date = co.order_date::dateinner join order_line ol on co.order_id = ol.order_idgroup by c.calendar_date, c.calendar_year, c.calendar_month, c.calendar_daynameorder by c.calendar_date asc
-- Optimize 3 inner join to left joinselect c.calendar_date, c.calendar_year, c.calendar_month, c.calendar_dayname, count(distinct co.order_id) as num_orders , sum(ol.price) as total_price, count(ol.book_id) as num_books, sum(count(ol.book_id)) over ( partition by c.calendar_year, c.calendar_month order by c.calendar_date ) as running_total_num_books, lag(count(ol.book_id), 7) over ( order by c.calendar_date ) as prev_booksfrom calendar_days cleft join cust_order co on c.calendar_date = co.order_date::dateleft join order_line ol using(order_id)where co.order_id is not nullgroup by c.calendar_date, c.calendar_year, c.calendar_month, c.calendar_daynameorder by c.calendar_date asc