SBIR Phase I: Automatic Extraction of Financial Data from Text

Period of Performance: 01/01/2013 - 12/31/2013


Phase 1 SBIR

Recipient Firm

BCL Technologies
3031 Tisch Way, Suite 1000
San Jose, CA 95128
Principal Investigator, Firm POC


The innovation is the development of a linguistically driven machine learning system that will extract financial data from financial text, such as 10-Q documents, with an accuracy of over 85%. To be useful to analysts, each piece of financial data needs to be a triple of "Financial Concept", "Numeric Value" and "Date Range". Because sentences in the financial domain are complex, detecting the Financial Concept and attaching it to the correct Numeric Value and Date Range remains a challenge. Current financial extraction systems record an accuracy of less than 50%. The proposed method will use a combination of Financial Named Entity Recognition, Semantic Nearest Neighbor location and Support Vector Machines to improve the accuracy of Financial Concept detection, attachment and semantic tagging to 85%. In Phase II research, these methods will be combined into an end-to-end 'Automatic Extraction of Financial Data from Text' system whose output is usable by computerized systems. At the end of Phase I, the proposed method will demonstrate the feasibility of financial data extraction on the Notes section of 10-Q documents. The Phase II system will be designed to scale up to handle very large data sets, including non-American English documents, in near real time.

The broader/commercial impact of the Automatic Extraction of Financial Data from Text system is the availability of relevant financial information in computer-readable format, with high accuracy, in near real time. Currently, data embedded in financial text are extracted manually by hundreds of people working for data warehouses. This manual effort takes on the order of weeks, making the bulk of the data unavailable in computer-usable form in real time. The benefit of Automatic Extraction of Financial Data from Text will be in three areas: 1. Algorithmic trading programs will be able to use all data published worldwide immediately after the data is published; 2. Financial data warehouses will be able to provide many more data concepts (there are 18,498 concepts in the US Generally Accepted Accounting Principles taxonomy versus fewer than 180 available in commercial data warehouses); 3. There will be increased transparency in the financial market as financial information embedded in text becomes computer readable. Algorithmic trading was estimated to reach over $5 trillion with 750 billion shares traded, generating a profit of over $600 million in 2012. The impact of financial transparency is an intangible benefit that will improve financial market efficiency.
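
To make the target output concrete, the following is a minimal Python sketch of the "Financial Concept", "Numeric Value", "Date Range" triple and of the attachment problem the abstract describes. It is not the proposed system: the three-entry concept lexicon, the regular expressions, and the names FinancialFact and extract_facts are all hypothetical, and plain character distance stands in for the Semantic Nearest Neighbor step. No Named Entity Recognition model or Support Vector Machine is involved.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon for illustration; the US GAAP taxonomy is far larger.
CONCEPTS = ["net revenue", "operating income", "interest expense"]

# Assumed surface patterns: dollar amounts and "N months ended <date>" periods.
VALUE_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d+)?)\s*(million|billion)?", re.I)
PERIOD_RE = re.compile(r"(?:three|six|nine) months ended \w+ \d{1,2}, \d{4}", re.I)
SCALE = {"million": 1e6, "billion": 1e9}


@dataclass(frozen=True)
class FinancialFact:
    """One extracted triple: concept, numeric value, date range."""
    concept: str
    value: float
    date_range: str


def _nearest(pos, spans):
    """Return the (start, text) span whose start is closest to pos, or None."""
    return min(spans, key=lambda s: abs(s[0] - pos), default=None)


def extract_facts(sentence):
    """Attach each dollar amount to the closest concept and closest period."""
    lowered = sentence.lower()
    concepts = [
        (m.start(), c)
        for c in CONCEPTS
        for m in re.finditer(re.escape(c), lowered)
    ]
    periods = [(m.start(), m.group(0)) for m in PERIOD_RE.finditer(sentence)]

    facts = []
    for m in VALUE_RE.finditer(sentence):
        concept = _nearest(m.start(), concepts)
        period = _nearest(m.start(), periods)
        if concept is None or period is None:
            continue  # an incomplete triple is not useful to an analyst
        number = float(m.group(1).replace(",", ""))
        scale = SCALE.get((m.group(2) or "").lower(), 1.0)
        facts.append(FinancialFact(concept[1], number * scale, period[1]))
    return facts


if __name__ == "__main__":
    text = (
        "Net revenue was $12.5 million for the three months ended March 31, "
        "2012, and interest expense was $1.2 million for the three months "
        "ended March 31, 2011."
    )
    for fact in extract_facts(text):
        print(fact)
```

Distance-based attachment of this kind breaks down quickly on the long, nested sentences typical of 10-Q Notes sections, which is the difficulty the abstract points to as motivation for a learned, linguistically driven approach.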