Currently, I work at a research institution as a scientific programmer/data analyst. I study and implement machine learning and causal discovery algorithms, and design and conduct experiments to evaluate their performance on real and simulated high-dimensional datasets. The research focuses mainly on data from the biomedical field.

In my spare time, I work on some side projects, providing consultation and developing applications to help out my nerdy friends and their nerdy friends with their data analysis. And sometimes I just analyse some publicly available data for fun!

I will gradually update this page to share some of my past and current projects with you.

Information Content and Analysis Methods for Multi-Modal High-Throughput Biomedical Data

The Question: Does integrating data from multiple modalities (e.g. gene expression, protein expression, family history, and imaging) improve the accuracy of diagnosis and prognosis?

The Experimental Design: Obtained and structured forty-seven datasets/predictive tasks regarding risk, recurrence, treatment response and survival in cancer patients, with features spanning nine modalities. For each prediction task, the performance of various algorithms using information from individual modalities was compared with their performance using integrated information from multiple modalities.
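
As a rough illustration of this kind of paired comparison (not the analysis used in the study), the sketch below runs an exact one-sided sign-flip permutation test on per-task scores. The function name paired_sign_flip_pvalue, the choice of test, and the toy scores are all assumptions made for the example; the actual statistical procedure is not described on this page.

```python
from itertools import product


def paired_sign_flip_pvalue(single_scores, integrated_scores):
    """Exact one-sided sign-flip test on paired per-task scores.

    Hypothetical helper: asks whether the mean gain of the integrated
    model over the single-modality model is larger than expected if the
    sign of each per-task difference were arbitrary.
    Enumerates 2**n sign patterns, so it only suits a small number of tasks.
    """
    diffs = [b - a for a, b in zip(single_scores, integrated_scores)]
    n = len(diffs)
    observed = sum(diffs) / n

    at_least_as_large = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        flipped_mean = sum(s * d for s, d in zip(signs, diffs)) / n
        # Small tolerance guards against floating-point round-off.
        if flipped_mean >= observed - 1e-12:
            at_least_as_large += 1
        total += 1
    return observed, at_least_as_large / total


if __name__ == "__main__":
    # Toy, made-up scores for six tasks (not results from the study).
    single = [0.71, 0.65, 0.80, 0.58, 0.74, 0.69]
    integrated = [0.72, 0.64, 0.81, 0.57, 0.75, 0.69]
    gain, p_value = paired_sign_flip_pvalue(single, integrated)
    print(f"mean gain = {gain:.3f}, one-sided p = {p_value:.3f}")
```

A large p-value here would mean the integrated scores are not distinguishable from the single-modality scores across tasks, which is the kind of outcome the design above is set up to detect.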

The Result: Integrating data from multiple modalities does not significantly improve predictive performance in most of the tasks examined in the current experimental setting.

De-novo Reconstruction of Genome-Scale Regulatory Networks in Yeast

The Question: To what extent can genome-scale regulatory networks (relationships between hundreds of transcription factors and thousands of genes) be reconstructed from data? What types of data and causal discovery algorithms are best suited for this purpose?

The Experimental Design: Seven performance metrics were used to assess the accuracy of eighteen statistical association-based approaches for de-novo network reverse-engineering on thirteen datasets spanning multiple data types (observational, semi-experimental, and experimental data).
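
To give a flavour of how a reconstructed network can be scored, here is a minimal sketch that compares predicted transcription factor to gene edges against a gold-standard edge set. Precision, recall and F1 are used purely as familiar examples; the seven metrics used in the study are not listed on this page, and the function name score_network and the toy edges are hypothetical.

```python
def score_network(predicted_edges, gold_edges):
    """Compare predicted (regulator, target) edges with a gold standard.

    Hypothetical helper for illustration. Edges are treated as directed
    pairs, so ("TF1", "G1") and ("G1", "TF1") are different edges.
    """
    predicted = set(predicted_edges)
    gold = set(gold_edges)
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy, made-up edges (not data from the study).
    gold = [("TF1", "G1"), ("TF1", "G2"), ("TF2", "G3"), ("TF2", "G4")]
    predicted = [("TF1", "G1"), ("TF2", "G3"), ("TF3", "G9")]
    print(score_network(predicted, gold))
```

Judging whether a score like this is statistically significant would additionally need a null distribution, for example from randomly generated networks of the same size, which this sketch leaves out.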

The Result: Most reconstructed networks had statistically significant accuracy. Considering the cost efficiency of the various data types, observational data with changes in environment or time is preferable for network reconstruction.

Spike Train Reconstruction in the Dorsolateral Striatum

The Question: The dorsolateral striatum (DLS) is critically involved in a number of diseases that feature motor system deficits. This study aims to understand the processes, represented by the spike train, in DLS neurons. It addresses the following question: Can the spike train of a DLS neuron be reconstructed?

The Experimental Design: Two data modalities were used to reconstruct spike trains in DLS neurons: (1) movement data and (2) spike history data. Reconstruction accuracy using each modality alone and the two modalities combined was evaluated for individual neurons. A Linear-Nonlinear-Poisson model was used for spike train reconstruction.
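
For readers unfamiliar with the Linear-Nonlinear-Poisson idea, the sketch below shows its three stages on toy inputs: a weighted sum of movement features and recent spike counts, an exponential nonlinearity that turns the sum into a firing rate, and a Poisson draw for the spike count in each time bin. It is a minimal forward simulation under assumed weights, not the fitted model from the study; the exponential nonlinearity, the function names, and all parameter values are assumptions for the example.

```python
import math
import random


def lnp_rate(movement, history, w_move, w_hist, bias):
    """Linear stage plus exponential nonlinearity: expected spikes per bin."""
    drive = bias
    drive += sum(w * x for w, x in zip(w_move, movement))
    drive += sum(w * h for w, h in zip(w_hist, history))
    return math.exp(drive)


def poisson_sample(rate, rng):
    """Draw a Poisson count with Knuth's multiplication method (small rates)."""
    threshold = math.exp(-rate)
    count = 0
    running_product = rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


def simulate_spike_train(movement_series, w_move, w_hist, bias, seed=0):
    """Generate spike counts bin by bin from movement and spike history."""
    rng = random.Random(seed)
    n_lags = len(w_hist)
    spikes = []
    for movement in movement_series:
        # Most recent bin first, zero-padded at the start of the recording.
        recent = spikes[::-1][:n_lags]
        history = recent + [0] * (n_lags - len(recent))
        rate = lnp_rate(movement, history, w_move, w_hist, bias)
        spikes.append(poisson_sample(rate, rng))
    return spikes


if __name__ == "__main__":
    # Toy movement features per bin, e.g. (speed, direction); made-up values.
    movement_series = [(0.2, 0.0), (0.9, 0.1), (1.0, 0.3), (0.4, 0.0), (0.1, 0.0)]
    train = simulate_spike_train(
        movement_series,
        w_move=[1.0, 0.5],    # hypothetical movement weights
        w_hist=[-1.0, -0.5],  # negative weights mimic refractoriness
        bias=-0.5,
    )
    print(train)
```

Setting w_move or w_hist to all zeros gives a movement-only or history-only variant, which mirrors the single-modality versus combined comparison described above.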

The Result: Spike trains can be reconstructed with statistically significant accuracy in most neurons. The feature modalities contribute differently to spike train reconstruction for individual neurons. The relative importance of the feature modalities provides insight into the response characteristics of individual neurons.