Modern Data Lake Engine using Dremio, Arrow and TimescaleDB
Just finished my first presentation as CTO of SciFin Technologies at the AWS User Group Meetup.
It was a really fun experience to get back into pure engineering. The presentation material was largely created by the Dremio team, so thanks a lot for the help! I would also love to ask them for permission to use it.
For the demo I followed the TimescaleDB tutorial on taxi rides. I put all of the queries into Dremio and let Dremio Data Reflections handle the optimization and acceleration.
Here's the Jupyter Notebook for the presentation. It could be optimized even further by replacing the Dremio JDBC connector with Arrow Flight in the Spark SQL source, but the Dremio Flight connector is still in its early stages. We could then fetch the data using the Spark Flight Connector, though it is still on Spark 2.x and needs to be upgraded.
Once the data is in Spark as a DataFrame, we can convert it into a pandas DataFrame using pyarrow, and then do machine learning on the GPU. No problem, sir. We've got that covered with BlazingSQL and rapids.ai.
I can say without a doubt that Dremio on AWS is the best version, given its elasticity and manageable operations.
Here's the recording if you are interested in learning more (in Indonesian!).
The future is bright, and it's becoming real because of Apache Arrow.