  • Welly Tambunan

Modern Data Lake Engine using Dremio, Arrow and TimescaleDB

Updated: Apr 12

I just finished my first presentation as CTO of SciFin Technologies at the AWS Meetup User Group.


It was a really fun experience for me to get back into pure engineering. The presentation material was basically created by the Dremio team, so thanks a lot for the help!


I would also love to ask them for permission.


For the demo, I followed the TimescaleDB tutorial on taxi rides. I put all of the queries into Dremio and let Dremio Data Reflections handle the optimization and acceleration.
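
To give a feel for the kind of query involved, here is a minimal sketch of a daily ride count written in the ANSI-style SQL that Dremio accepts. The dataset path `"timescale"."public"."rides"` and the column `pickup_datetime` are assumptions modelled on the taxi-rides tutorial, not the exact names used in the demo, and the helper function itself is hypothetical.

```python
def rides_per_day_sql(table: str = '"timescale"."public"."rides"') -> str:
    """Build a daily ride-count query as a SQL string.

    The default table path and the pickup_datetime column are assumed
    names; adjust them to match the dataset registered in Dremio.
    """
    return (
        "SELECT DATE_TRUNC('DAY', pickup_datetime) AS ride_day, "
        "COUNT(*) AS ride_count "
        f"FROM {table} "
        "GROUP BY DATE_TRUNC('DAY', pickup_datetime) "
        "ORDER BY ride_day"
    )


if __name__ == "__main__":
    # Print the query so it can be pasted into a SQL editor.
    print(rides_per_day_sql())
```

An aggregation like this, run repeatedly, is the sort of workload an aggregation reflection is meant to accelerate.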


Here's the Jupyter Notebook for the presentation. It can be optimized even more by replacing the Dremio JDBC connector with Arrow Flight in the Spark SQL source, but the Dremio Flight connector is still in its early stages. With that in place, we can get the data using the Spark Flight Connector. It still targets Spark 2.x, so that one needs an upgrade.
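
For reference, the JDBC path can be sketched roughly as below. This is an illustration under assumptions, not the notebook's actual code: the helper names are hypothetical, the host, credentials, and port 31010 are placeholders, and the Dremio JDBC driver jar is assumed to be on the Spark classpath. The query is wrapped as a subquery in `dbtable`, which works across Spark 2.x releases.

```python
from typing import Dict


def dremio_jdbc_options(host: str, user: str, password: str,
                        query: str, port: int = 31010) -> Dict[str, str]:
    """Assemble Spark JDBC reader options for a Dremio coordinator.

    Host, port, and credentials are placeholders supplied by the caller.
    """
    return {
        "url": f"jdbc:dremio:direct={host}:{port}",
        "driver": "com.dremio.jdbc.Driver",
        "user": user,
        "password": password,
        # Wrap the query as an aliased subquery so Spark treats it as a table.
        "dbtable": f"({query}) AS dremio_query",
    }


def read_from_dremio(spark, options: Dict[str, str]):
    """Load the query result as a Spark DataFrame.

    `spark` is an existing SparkSession created elsewhere.
    """
    return spark.read.format("jdbc").options(**options).load()
```

Usage would look like `read_from_dremio(spark, dremio_jdbc_options("localhost", "user", "secret", "SELECT 1"))`. Swapping in Arrow Flight would replace only this reader, leaving the rest of the pipeline unchanged.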


Once the data is in Spark as a DataFrame, we can transform it into a pandas DataFrame using pyarrow. Then machine learning on the GPU? No problem, sir. We've got it covered with BlazingSQL and rapids.ai.
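
The Spark-to-pandas step can be sketched as follows, assuming pyarrow is installed alongside PySpark. The function name and the `spark_major_version` parameter are hypothetical; the configuration key is `spark.sql.execution.arrow.enabled` on Spark 2.x and `spark.sql.execution.arrow.pyspark.enabled` on Spark 3.x.

```python
def to_pandas_with_arrow(spark, spark_df, spark_major_version: int = 2):
    """Convert a Spark DataFrame to pandas with Arrow-based transfer enabled.

    `spark` is an existing SparkSession and `spark_df` a Spark DataFrame.
    """
    if spark_major_version < 3:
        key = "spark.sql.execution.arrow.enabled"
    else:
        key = "spark.sql.execution.arrow.pyspark.enabled"
    # Ask Spark to use Arrow for the columnar transfer to Python.
    spark.conf.set(key, "true")
    return spark_df.toPandas()
```

With Arrow enabled, Spark can hand the data over in columnar batches instead of serializing row by row, which is where most of the speedup in `toPandas()` comes from.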


I can say without doubt that Dremio on AWS is the best version, as it offers lots of elasticity and manageable operations.


Here's the recording if you are interested in learning more (in Indonesian!):


https://www.youtube.com/watch?v=V7WfIQHUsuY


The future is bright, and it's becoming real because of Apache Arrow.


Cheers
