Indonesia Stock Exchange, Data Preprocessing
Processing IDX Financial Statement Spreadsheet Files with Pandas
Data preparation steps for data science
Indonesia Stock Exchange or IDX provides quarterly data of financial statements for companies that are listed in the exchange as a spreadsheet. As an individual who quel in learning data science the past few years, this data might be a stepping stone to learn more about data science. Unfortunately this spreadsheet cannot be feeded into statistical analysis or machine learning algorithms directly as in its raw form. This data needs to be processed into somewhat a tabular format where the columns represent attributes such as assets, liability, equities, etc and the row represent an object such as the company’s financial profile for a particular accounting period.
Let’s peek out at one of the company’s financial statements to figure out what we’re facing here.
Whew that’s not what we’re expecting. We want to make those columns transposed (switch between row and columns) for each sheet. And for the equity statement the problem is much worse. The equity statement is already in a tabular form but we cannot stack those unless we stack it in sorta 3d table configuration or we can flatten those into 1d tables.
First step, reading the spreadsheet files. Fortunately we can use python pandas to read spreadsheets using the read_excel function for specific sheets. Lets create a class to load every sheet into an ordered dictionary so we can use it later and maybe transposed (flip the row and column) along the way.
After the dictionary is prepared, the next step is cleaning up the dataframe assigned for each value inside the dictionary. Each sheet is unique therefore different treatments are required especially for equity statement sheets. Those sheets need to be flatten in order to stack them. So we can use multi indexing and later on appending 2 indexes together into 1 single index.
After processing all of the dataframe into 1D dataframe inside the dictionary we can proceed to append all of those dataframe into 1 long dataframe.
We can call it a day and loop all of those processes for each “long dataframe” that we’ve made. But after iterating a bunch of times resulting in error I found out that there’s a repetition or duplicates for column names. At this point I am already exhausted so I use one of the solutions in stack overflow to make duplicates into unique values.
After duplicated columns are “unique-fy” we can append the rest of spreadsheet files using pandas concatenate function. So let’s funnel all of the function into pipelines and loop through all of the spreadsheet files.
Are we done ? surely but appending 9890 spreadsheets is gonna take a while. So instead of running and iterating through all of those we can use multiprocessing to accelerate the process. We can divide the process into a batch of processes and append the final result later. *I have an old laptop :(
Using multiprocessing with 5 tasks running simultaneously my old laptop can accelerate the process twice as fast. After all the tasks are done and saved I append all of those data into 1 single dataframe and we get our financial statement data.
That’s all folks, Hope you learn somethin’.
And for anyone want to evaluate my code here’s my GitHub repo.
Parse excel file downloaded from IDX into structured form - MuhHanif/bei_excel_parser
For anyone who wants to use my data here’s the Kaggle link for it.