Research Blog for Department of Computer Science @ Rensselaer Polytechnic Institute

Understanding Tuberculosis with Machine Learning

Amina Shabbeer, PhD Student in RPI/CS

My name is Amina Shabbeer. I am a PhD student in the Computer Science Department working with Prof. Kristin Bennett. I completed my B.E. in Computer Engineering at the University of Mumbai in 2006. During my PhD, I have also worked closely with Prof. Bulent Yener. My research interests are in Machine Learning, Mathematical Programming and Bioinformatics. I have done internships at Argonne National Laboratory (ANL) and at IBM Research, Zurich, exploring some of these areas. At ANL I worked with Dr. Chris Henry on metabolic network reconstruction.

Much remains unknown about how variation among strains of the causative agent, the M. tuberculosis complex (MTBC), influences the way tuberculosis (TB) presents itself. The advent of genome sequencing techniques has greatly improved our understanding of the genetic diversity of MTBC. There is growing evidence that these variations in the MTBC genome have significant consequences for the phenotype. For example, different strains are known to have varying levels of virulence, i.e. how severe or fatal the infection can be. Studies have also shown that certain strains tend to affect particular groups of people, based on ethnicity or country of origin. Armed with this information, epidemiologists are better equipped to develop effective control measures with the limited resources available.

The goal of my research group (http://tbinsight.cs.rpi.edu/) is to develop tools and techniques that help researchers understand the variations in the MTBC population. Machine Learning can play a fundamental role in systematically analyzing the relevance of MTBC genetic variability. We have developed classification and clustering tools that exploit what we already know about the MTBC genome to identify “groups of interest”. For example, we use DNA fingerprint information to identify the genetic family or subfamily that a particular strain belongs to: http://tbinsight.cs.rpi.edu/run_spotclust.html We also use visual analytics to identify patterns, trends and anomalies in associations between host (patient) groups and strain groups: http://tbinsight.cs.rpi.edu/about_tb_vis.html

We use mathematical programming to develop effective ways of visually representing biogeographic diversity data so that it can be easily understood and interpreted. This requires expressing the desired characteristics of a visualization in a precise mathematical form. It turns out that some of the optimization problems we are trying to solve can be extremely difficult (NP-complete). We develop scalable algorithms for these nonconvex, nonsmooth problems to create an optimal layout that satisfies our specified requirements. You can see our algorithm working live on high-dimensional genotype data here: http://www.cs.rpi.edu/~shabba/FinalGD/ Also take a look at this 5-minute tech talk on mathematical programming for graph drawing, entitled “Preserving Proximity Relations and Minimizing Edge-crossings in Graph Embeddings”: http://techtalks.tv/events/68/. This was a Spotlight talk at the NYAS Machine Learning Symposium.
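As an illustration only, here is a minimal sketch of what "a desired characteristic in precise mathematical form" can look like. It uses the stress function, a standard multidimensional-scaling objective that measures how well a 2-D layout preserves a set of target distances. This objective is an assumption chosen for the sketch and is not taken from the work described in this post; the function name and the three-strain toy data are hypothetical.

```python
import math
from itertools import combinations

def stress(positions, target):
    """Sum of squared gaps between layout distances and target distances.

    positions: list of (x, y) node coordinates in the drawing.
    target:    symmetric matrix of desired (e.g. genetic) distances.
    A value of 0 means the layout reproduces every target distance exactly.
    """
    total = 0.0
    for i, j in combinations(range(len(positions)), 2):
        layout_dist = math.dist(positions[i], positions[j])
        total += (layout_dist - target[i][j]) ** 2
    return total

# Hypothetical pairwise distances between three strains.
target = [[0, 3, 4],
          [3, 0, 5],
          [4, 5, 0]]

# A right triangle with sides 3, 4, 5 matches the target distances.
faithful = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
# A compact layout may look tidy but distorts the distances.
distorted = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

print(stress(faithful, target))
print(stress(distorted, target))
```

A layout algorithm would search over `positions` to make such a score small; the lower score of the first layout reflects that it is the more faithful of the two.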

The seven images (a)-(g) are different visual representations of the same information: spoligoforests visualizing the genetic diversity of MTBC strains. Note that while images (e)-(g) appear uncluttered and are visually appealing, they misrepresent the underlying information because they do not faithfully represent the genetic distances between strains. Images (b)-(d) are generated by techniques designed to preserve proximity relations. The relative placement of nodes and subgraphs shows that strains that are genetically similar, e.g. strains that belong to the same genetic family, are placed close to each other. This is a desired quality, but it inadvertently introduces other problems: edge crossings and node overlaps. Several cognitive science studies have shown that crossing minimization is the most important criterion in making a graph easy to understand and interpret. However, crossing minimization is NP-complete and very difficult in practice too. Image (a), generated using our approach, achieves a nice trade-off between crossing minimization and proximity preservation. This is made possible by a novel formulation of the problem: we develop a precise mathematical condition that defines an edge crossing and that can be easily incorporated into any continuous embedding objective.
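To show what "a condition that defines an edge crossing" means in the simplest terms, the sketch below counts crossings in a straight-line drawing using the textbook orientation (signed-area) test from computational geometry. This is a discrete check swapped in for illustration; it is not the continuous condition used in the approach described above, and the function names and four-node example are hypothetical.

```python
from itertools import combinations

def orient(p, q, r):
    """Signed area test: > 0 if p -> q -> r turns counter-clockwise,
    < 0 if clockwise, 0 if the three points are collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def edges_cross(a, b, c, d):
    """True if segments ab and cd cross at a point interior to both."""
    return (orient(a, b, c) * orient(a, b, d) < 0 and
            orient(c, d, a) * orient(c, d, b) < 0)

def count_crossings(positions, edges):
    """Count pairs of edges that cross in the given layout."""
    count = 0
    for (u1, v1), (u2, v2) in combinations(edges, 2):
        if len({u1, v1, u2, v2}) < 4:
            continue  # edges that share a node are not counted as crossing
        if edges_cross(positions[u1], positions[v1],
                       positions[u2], positions[v2]):
            count += 1
    return count

edges = [(0, 1), (2, 3)]
# Both edges drawn as diagonals of a unit square: they cross.
tangled = {0: (0, 0), 1: (1, 1), 2: (0, 1), 3: (1, 0)}
# Same graph, nodes moved so the edges are parallel: no crossing.
untangled = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}

print(count_crossings(tangled, edges))    # 1
print(count_crossings(untangled, edges))  # 0
```

A count like this jumps in whole steps as nodes move, so it cannot be dropped directly into a continuous objective; that is why a crossing condition compatible with continuous embedding objectives matters.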

If you have any questions or comments, feel free to contact me: http://www.cs.rpi.edu/~shabba/



This entry was posted on May 14, 2012 in Data Science, Machine Learning.

