Steven P. Crain

Hi. I'm a PhD student in Computational Science and Engineering at the Georgia Institute of Technology.


I am interested in helping people use the Internet to solve problems. Modern Web search is already very useful for many activities, but new developments will greatly expand what is possible. Current Web search solutions use a uniform approach to compare a document with a user's query. When the user already has a good mental model of the target document, this strategy works pretty well. However, it fails badly when the user has only a vague idea of the information need. The problem is especially significant when the user's attempt at a query includes slang words or when the target documents use specialized language, both of which are common when a consumer searches for health information. We seek to automatically model each user's capabilities, understanding and language so that we can help the user find health resources (including documents, discussion groups and social contacts) and facilitate interactions with those resources.

Also see my curriculum vitae.


I am teaching Numerical Analysis I (CS 4642) during Summer 2012.



Steven P. Crain, Shuang-Hong Yang and Hongyuan Zha. Understanding group dynamics in health forums. In submission.
Health forums play an important role in helping patients cope with disease. We analyze the vitality of groups, in terms of their ability to attract new members and stimulate discussion. Using event history analysis, we identify the factors that contribute to group vitality in a diabetes community. Our results show that standard policies that promote active groups uniformly to everyone resulted in a strong core set of groups with little diversity of participants across groups. Furthermore, we identified a dramatic change in user behavior after eight months of participation. Most old users preferred interacting amongst themselves, while a few sought opportunities to invest in the community.

Steven P. Crain, Ke Zhou, Shuang-Hong Yang and Hongyuan Zha. Dimensionality reduction and topic modeling: from latent semantic indexing to latent Dirichlet allocation and beyond. In C.C. Aggarwal and C. Zhai, eds. Mining Text Data. To appear.
The bag-of-words representation commonly used in text analysis can be analyzed very efficiently and retains a great deal of useful information, but it is also troublesome because the same thought can be expressed using many different terms and one term can have very different meanings. Dimension reduction can collapse together terms that have the same semantics, identify and disambiguate terms with multiple meanings, and provide a lower-dimensional representation of documents that reflects concepts instead of raw terms. In this chapter, we survey two influential forms of dimension reduction. Latent semantic indexing uses spectral decomposition to identify a lower-dimensional representation that maintains semantic properties of the documents. Topic modeling, including probabilistic latent semantic indexing and latent Dirichlet allocation, is a form of dimension reduction that uses a probabilistic model to find the co-occurrence patterns of terms that correspond to semantic topics in a collection of documents. We describe the basic technologies in detail and expose the underlying mechanisms. We also discuss recent advances that have made it possible to apply these techniques to very large and evolving text collections and to incorporate network structure or other contextual information.
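As a rough illustration of the latent semantic indexing idea described in this abstract, here is a minimal sketch using a truncated singular value decomposition. The toy term-document matrix, the term list and the helper names (lsi, cosine) are hypothetical and are not taken from the chapter. In the raw bag-of-words space, "car" and "automobile" look dissimilar because they rarely appear in the same document; in the reduced space they collapse onto the same concept.

```python
import numpy as np

# Hypothetical toy term-document count matrix (rows: terms, columns: documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

def lsi(X, k):
    """Return k-dimensional term and document vectors from a truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]      # one row per term
    doc_vecs = Vt[:k, :].T * s[:k]    # one row per document
    return term_vecs, doc_vecs

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

term_vecs, doc_vecs = lsi(X, k=2)

# Similarity of "car" and "automobile" before and after reduction.
raw_sim = cosine(X[0], X[1])
lsi_sim = cosine(term_vecs[0], term_vecs[1])
print(f"raw: {raw_sim:.2f}  reduced: {lsi_sim:.2f}")
```

The choice of k=2 is arbitrary here; in practice the number of retained dimensions is a tuning parameter.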

Shuang-Hong Yang, Steven P. Crain and Hongyuan Zha. Bridging the language gap: topic level adaptation for cross-domain knowledge transfer. AISTATS 2011.
The language gap, for example between low-literacy laypersons and highly technical expert documents, is a fundamental barrier for cross-domain knowledge transfer. This paper seeks to close the gap at the thematic level via topic adaptation, i.e., adjusting the topical structures for cross-domain documents according to a domain factor such as technicality. We present a probabilistic model for this purpose based on joint modeling of topic and technicality. The proposed tauLDA model explicitly encodes the interplay between topic and technicality hierarchies, providing an effective topic-level bridge between lay and expert documents. We demonstrate the usefulness of tauLDA with an application to consumer medical informatics.

Steven P. Crain, Shuang-Hong Yang, Hongyuan Zha and Yu Jiao. Dialect topic modeling for improved consumer medical search. In Proceedings of the American Medical Informatics Association 34th Annual Symposium on Biomedical and Health Informatics. 2010. p. 132-136.
Access to health information by consumers is hampered by a fundamental language gap. Current attempts to close the gap leverage consumer-oriented health information, which does not, however, have good coverage of slang medical terminology. In this paper, we present a Bayesian model to automatically align documents with different dialects (slang, common and technical) while extracting their semantic topics. The proposed diaTM model enables effective information retrieval, even when the query contains slang words, by explicitly modeling the mixtures of dialects in documents and the joint influence of dialects and topics on word selection. Simulations using consumer questions to retrieve medical information from a corpus of medical documents show that diaTM achieves a 25% improvement in information retrieval relevance by nDCG@5 over an LDA baseline.
Presentation slides.
The data used in this study will be made available shortly. Please contact me if you would like further information.
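
For readers unfamiliar with the nDCG@5 measure mentioned in the abstract above, here is a minimal sketch of one common formulation (linear gain with a log2 rank discount). The paper's exact variant is not stated on this page, so this choice is an assumption, and the function names and example relevance grades are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance judgments, listed in the order a system ranked them.
ranking = [1, 3, 0, 2, 0]
score = ndcg_at_k(ranking, k=5)
print(f"nDCG@5 = {score:.3f}")
```

A score of 1.0 means the top five results are in the best possible order; placing less relevant documents ahead of more relevant ones lowers the score.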

In the news

Steven P. Crain and Yu Jiao. It's time you DROVE: Deep retrieval with ontological visualization and exploration. Oak Ridge National Laboratory technical report. 2009.
Traditional information retrieval focuses primarily on the task of finding a small number of relevant documents to address a specific information need. In this project we address a more challenging task: enabling efficient dark web information retrieval and knowledge exploration through ontology and machine learning techniques. We present an application called DROVE that integrates search and exploration. A user can perform a broad query using shallow and deep searches to identify a large collection of relevant documents. A visual representation of an ontology is available to assist the user with query formulation and with understanding the relationships of the important concepts in the collection.

Short Papers

Shabbir Syed-Abdul, Luis Fernandez Luque, Steven P. Crain, Min-Huei Hsu, Yu-Chuan Li, Wen-Shan Jian, Yao-Chin Wang, Khandregzen Dorjsuren, Zaya Chuluunbaatar and Alex Nguyen. Social Media Promoting Anorexia: The YouTube case.
Steven P. Crain, Jian Huang and Hongyuan Zha. A scalable assistant librarian: hierarchical subject classification of books. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval.
In this paper, we discuss our work in progress towards a scalable hierarchical classification system for books using the Library of Congress subject hierarchy. We examine the characteristics of this domain which make the problem very challenging, and we look at several appropriate performance measurements. We show that both Hieron and Hierarchical Support Vector Machines perform moderately well.


Burning Your Security At Three Ends: Security, Trust and Privacy in the Age of Disclosure.
This is a talk delivered at the Web Science Tea on 4/18/2008.
I looked at the issues relating to security, trust and privacy. We discussed how security, trust and privacy are being melted from three directions: the human drive for community; the very attempts to improve security, trust and privacy; and corporate profit.
Our discussion was guided by several different security models. The economic model, which holds that we "buy" trust with "payments" of private information, is currently very popular. We also used models from information security (for example, the characteristics of a good password).

Contact Information:

click for address@gatech.edu
Steven P. Crain
Computational Science and Engineering
Georgia Institute of Technology
Atlanta, GA 30332-0280

Last Modified: November 12, 2007