Financial and Investment Research Factory - Data Curation
I recently watched a presentation by Prof. Marcos Lopez de Prado on YouTube. The picture below shows the framework of thinking for building a research factory.
The responsibility of each component in the assembly line is described in the first chapter of Advances in Financial Machine Learning, which is also available for free on SSRN.
From my software engineering perspective, I've already shared part of our job in my previous post. Now I will explain in more detail how big data technology and data engineering play a big part in a large financial factory.
First, before anything else, we need clean data, whether structured, semi-structured, or unstructured. We can get it from a data vendor like Databento or CSI Data, or we can build our own ingestion strategy, either real-time or batch. Quantamental strategies with nowcasting capabilities will require lots of unstructured real-time data. This falls into the alternative data category: satellite images, documents, and other non-conventional data types.
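As a minimal sketch of the batch side of ingestion, here is a loader for a hypothetical vendor CSV of daily bars (the column names and values are illustrative, not any specific vendor's schema). Note that it types the fields it understands but also keeps the untouched source row:

```python
import csv
import io

# Hypothetical daily-bar CSV as a vendor might deliver it
# (schema and values are illustrative only).
RAW = """symbol,date,open,high,low,close,volume
AAPL,2024-01-02,187.15,188.44,183.89,185.64,82488700
AAPL,2024-01-03,184.22,185.88,183.43,184.25,58414500
"""

def ingest_batch(fileobj):
    """Parse one vendor file into typed records, keeping the raw row too."""
    records = []
    for row in csv.DictReader(fileobj):
        records.append({
            "symbol": row["symbol"],
            "date": row["date"],
            "close": float(row["close"]),
            "volume": int(row["volume"]),
            "_raw": dict(row),  # untouched source row, kept for reprocessing
        })
    return records

bars = ingest_batch(io.StringIO(RAW))
print(bars[0]["close"])  # 185.64
```

A real-time path would replace the file object with a socket or message-queue consumer, but the same parse-and-preserve shape applies.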
Each dataset is different and has its own characteristics. Stock data has earnings, stock splits, dividends, and other asset-specific behaviour. Futures and options are different again. We can get these data from vendors, but we need to cross-check multiple sources and merge them so we end up with high-quality data. If we have only a single data source and trust it blindly, we might mislead the researchers with bad data.
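A toy sketch of that cross-checking step, assuming two hypothetical vendor feeds of closing prices: values that agree pass through, while conflicts beyond a tolerance are flagged for review against a third source.

```python
# Toy close prices from two hypothetical vendors for the same symbol.
vendor_a = {"2024-01-02": 185.64, "2024-01-03": 184.25, "2024-01-04": 181.91}
vendor_b = {"2024-01-02": 185.64, "2024-01-03": 184.19, "2024-01-04": 181.91}

def reconcile(a, b, tol=0.01):
    """Merge two sources; agreements pass through, conflicts get flagged."""
    merged, conflicts = {}, []
    for date in sorted(set(a) | set(b)):
        pa, pb = a.get(date), b.get(date)
        if pa is not None and pb is not None and abs(pa - pb) > tol:
            conflicts.append((date, pa, pb))  # route to a human / third source
        merged[date] = pa if pa is not None else pb
    return merged, conflicts

merged, conflicts = reconcile(vendor_a, vendor_b)
print(conflicts)  # [('2024-01-03', 184.25, 184.19)]
```

Production reconciliation would also handle corporate actions and timestamp alignment, but the flag-don't-guess principle is the same.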
Time series data alone no longer provides a competitive edge, because everyone can get it at low cost or even for free on the internet. If everyone can get the data easily, then everyone can potentially find the same alpha; it's just a matter of time if they mine hard enough. So we need new data sources that may never have been ingested before.
Ingestion is just one component of the data curation part. After we ingest the data, we need to store it, and we should store it as raw as possible. Don't ever throw data away, even if we think it is flawed or bad: we might need it later, or we may come up with new ideas that turn the noise into signal. We don't know what we don't know.
There are many kinds of storage, from raw object storage to file systems, file formats, and stores optimized for specific purposes. If we want to model a graph, we need a graph database. If we have text data, we need to store it as raw as possible and then be able to query it later via indexing or full-text search.
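The core of full-text search is an inverted index: a map from token to the documents containing it. A minimal sketch over a few made-up headlines (a real system would add stemming, ranking, and a proper engine, but the data structure is this):

```python
from collections import defaultdict

# Made-up headlines standing in for stored raw text documents.
docs = {
    1: "central bank raises interest rates",
    2: "earnings beat estimates as rates stay flat",
    3: "satellite images show rising port activity",
}

# Inverted index: token -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    """Return ids of docs containing every query token (AND semantics)."""
    token_sets = [index[t] for t in query.lower().split()]
    return sorted(set.intersection(*token_sets)) if token_sets else []

print(search("rates"))  # [1, 2]
```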
One use case is to construct a knowledge graph (also called a network) from text data. This is a powerful tool that requires multiple skills and disciplines, like NLP/NLU and semantic web technology. Graphs enable us to model complex interactions between economic agents, and to build cause-and-effect, forward-looking models grounded in financial theory.
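A knowledge graph is commonly stored as (subject, relation, object) triples. The triples below are invented for illustration; the traversal shows the kind of forward-looking question a graph makes cheap to ask, namely what a shock to one node could propagate to:

```python
from collections import defaultdict

# Hypothetical triples an NLP pipeline might extract from news text.
triples = [
    ("Fed", "raises", "interest_rates"),
    ("interest_rates", "pressure", "tech_stocks"),
    ("tech_stocks", "weigh_heavily_in", "SP500"),
]

# Adjacency-list view of the graph: subject -> [(relation, object), ...]
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[subj].append((rel, obj))

def downstream(node):
    """Depth-first walk: everything a shock to `node` could touch."""
    seen, stack = set(), [node]
    while stack:
        for _rel, nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(sorted(downstream("Fed")))
# ['SP500', 'interest_rates', 'tech_stocks']
```

At scale this lives in a graph database with typed edges and provenance back to the source documents.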
AI and machine learning also play a huge part in this factory, but we need to use them differently: not blindly predicting price movements or other things that have been tried without success. A lot of quant funds went bust because of these mistakes. We need to use ML as a scientist would: as a tool to discover the needle in the haystack. This means feature selection, feature importance, and dimensionality reduction on high-dimensional datasets, and these tasks are where machine learning does its best work.
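A stripped-down sketch of the idea on synthetic data: three candidate features, of which only two actually drive the target, ranked by absolute correlation. Real pipelines use richer importance measures (e.g. mean decrease in impurity, permutation importance), but the goal is the same: separate informative features from noise before modeling.

```python
import random
import statistics

random.seed(0)
n = 500

# Synthetic dataset: only "value" and "momentum" drive the target;
# "noise" is pure distraction. All names are illustrative.
value    = [random.gauss(0, 1) for _ in range(n)]
momentum = [random.gauss(0, 1) for _ in range(n)]
noise    = [random.gauss(0, 1) for _ in range(n)]
target   = [0.8 * v + 0.5 * m + random.gauss(0, 0.5)
            for v, m in zip(value, momentum)]

def corr(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

features = {"value": value, "momentum": momentum, "noise": noise}
ranking = sorted(features,
                 key=lambda f: abs(corr(features[f], target)),
                 reverse=True)
print(ranking)  # the noise feature should rank last
```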
The results of machine learning can be stored in a feature store like Feast or Tecton, or even a vector database, so we don't process the same data over and over again. Online learning and incremental improvement for streaming data also need to be considered, to find the best and most efficient way to leverage the hot data.
The data curator needs to know a lot of protocols like FIX and ITCH, basic computer science like networking, UDP, and message passing, and also data formats and efficient serialization like SBE and other industry standards. The ability to read a specification and implement it without flaws is the key skill at this station.
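Much of that work is decoding fixed-layout binary messages exactly as a specification dictates. A sketch with a made-up, loosely ITCH-flavoured trade message (this layout is invented for illustration; a real implementation must follow the exchange's actual specification byte for byte):

```python
import struct

# Hypothetical fixed-width trade message: 1-byte type, 8-byte padded
# symbol, uint32 shares, uint32 price in 1/10000 dollars, big-endian.
# Illustrative layout only -- not a real exchange format.
FMT = ">c8sII"

def encode_trade(symbol, shares, price):
    return struct.pack(FMT, b"T", symbol.ljust(8).encode(),
                       shares, int(round(price * 10_000)))

def decode_trade(buf):
    msg_type, symbol, shares, raw_price = struct.unpack(FMT, buf)
    assert msg_type == b"T"  # reject anything but a trade message
    return symbol.decode().strip(), shares, raw_price / 10_000

wire = encode_trade("AAPL", 200, 185.64)
print(decode_trade(wire))  # ('AAPL', 200, 185.64)
```

Note the fixed-point price: real market-data protocols avoid floating point on the wire, and getting such details exactly right is what "implement the spec without flaws" means in practice.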
Proper research needs proper structure. Given that the research factory is huge, coordination needs to be efficient, and data governance plays an important role here. A big organization needs proper structure and policy. That doesn't mean we need to sacrifice agility; we just need to structure our data and research so they can be easily productionized. Ideas need to be tested and brought into the real world with proper methodology.