Steven P. Crain
Hi. I'm a PhD student in Computational Science and Engineering at
Georgia Institute of Technology.
I am interested in helping people use the Internet to solve problems. Modern Web search is already very useful for many activities, but new developments will greatly expand what is possible. Current Web search solutions use a uniform approach to compare a document with a user's query. When the user already has a good mental model of the target document, this strategy works pretty well. However, it fails badly when the user has only a vague idea of the information need. The problem is especially significant when the user's query includes slang words or when the target documents use specialized language, both of which are common when a consumer searches for health information. We seek to automatically model each user's capabilities, understanding and language so that we can help the user find health resources (including documents, discussion groups and social contacts) and facilitate interactions with those resources.
Also see my curriculum vitae.
I am teaching Numerical Analysis I (CS 4642) during Summer 2012.
- Steven P. Crain, Shuang-Hong Yang and Hongyuan Zha.
Understanding group dynamics in health forums. In submission.
Health forums play an important role in helping patients cope with
disease. We analyze the vitality of groups, in terms of their
ability to attract new members and stimulate discussion. Using event
history analysis, we identify the factors that contribute to group
vitality in a diabetes community. Our results show that standard
policies that promote active groups uniformly to everyone resulted
in a strong core set of groups with little diversity of participants
across groups. Furthermore, we identified a dramatic change in user
behavior after eight months of participation. Most old users
preferred interacting amongst themselves, while a few sought
opportunities to invest in the community.
- Steven P. Crain, Ke Zhou, Shuang-Hong Yang and Hongyuan Zha. Dimensionality reduction and topic modeling: from latent semantic indexing to latent Dirichlet allocation and beyond. In C.C. Aggarwal and C. Zhai, eds. Mining Text Data. To appear.
The bag-of-words representation commonly used in text analysis
can be analyzed very efficiently and retains a great deal of useful
information, but it is also troublesome because
the same thought can be expressed using many different terms or one
term can have very different meanings. Dimension
reduction can collapse together terms that have the same
semantics, identify and disambiguate terms with multiple
meanings, and provide a lower-dimensional representation of
documents that reflects concepts instead of raw terms.
In this chapter, we survey two influential forms of dimension
reduction. Latent semantic indexing uses spectral decomposition to identify a
lower-dimensional representation that maintains
semantic properties of the documents. Topic modeling, including probabilistic latent semantic indexing and latent Dirichlet allocation, is a form of dimension reduction that
uses a probabilistic model to find the co-occurrence patterns of
terms that correspond to semantic topics
in a collection of documents. We describe the basic technologies in detail and
expose the underlying mechanism.
We also discuss recent advances that have made it possible to apply these techniques to very large and evolving text collections and to incorporate network structure or other
- Shuang-Hong Yang, Steven P. Crain and Hongyuan Zha. Bridging the language gap:
topic level adaptation for cross-domain knowledge transfer. AI Stat. 2011.
The language gap, for example between low-literacy
laypersons and highly technical expert
documents, is a fundamental barrier for
cross-domain knowledge transfer. This paper
seeks to close the gap at the thematic level
via topic adaptation, i.e., adjusting the topical
structures for cross-domain documents
according to a domain factor such as technicality.
We present a probabilistic model for
this purpose based on joint modeling of topic
and technicality. The proposed tauLDA model
explicitly encodes the interplay between topic
and technicality hierarchies, providing an effective
topic-level bridge between lay and expert
documents. We demonstrate the usefulness
of tauLDA with an application to consumer
- Steven P. Crain, Shuang-Hong Yang, Hongyuan Zha and Yu Jiao.
Dialect topic modeling for improved consumer medical search. In the proceedings of the American Medical Informatics Association 34th Annual Symposium on Biomedical and Health Informatics. 2010. p. 132-136.
- Access to health information by consumers is hampered by a fundamental language gap. Current attempts to close the gap leverage consumer-oriented
health information, which does not, however, have
good coverage of slang medical terminology. In this
paper, we present a Bayesian model to automatically
align documents with different dialects (slang, common and technical) while extracting their semantic
topics. The proposed diaTM model enables effective
information retrieval, even when the query contains
slang words, by explicitly modeling the mixtures of
dialects in documents and the joint influence of dialects and topics on word selection. Simulations using consumer questions to retrieve medical information from a corpus of medical documents show that
diaTM achieves a 25% improvement in information retrieval relevance by nDCG@5 over an LDA baseline.
- The data used in this study will be made available shortly. Please contact me if you would like further information.
- In the news
- Steven P. Crain and Yu Jiao. It's time you DROVE: Deep retrieval with ontological visualization and exploration. Oak Ridge National Laboratory technical report. 2009.
Traditional information retrieval focuses primarily on the task of finding a small number of relevant documents to address a specific information need. In this project we address a more challenging task: enabling efficient dark web information retrieval and knowledge exploration through ontology and machine learning techniques. We present an application called DROVE that integrates search and exploration. A user can perform a broad query using shallow and deep searches to identify a large collection of relevant documents. A visual representation of an ontology is available to assist the user with query formulation and with understanding the relationships of the important concepts in the collection.
- Shabbir Syed-Abdul, Luis Fernandez Luque, Steven P. Crain, Min-Huei Hsu, Yu-Chuan Li, Wen-Shan Jian, Yao-Chin Wang, Khandregzen Dorjsuren, Zaya Chuluunbaatar and Alex Nguyen. Social Media Promoting Anorexia: The YouTube case.
- Steven P. Crain, Jian Huang and Hongyuan Zha. A scalable assistant librarian: hierarchical subject classification of books. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval.
- In this paper, we discuss our work in progress towards a scalable hierarchical classification system for books using the Library of Congress subject hierarchy. We examine the characteristics of this domain which make the problem very challenging, and we look at several appropriate performance measurements. We show that both Hieron and Hierarchical Support Vector Machines perform moderately well.
- Burning Your Security At Three Ends: Security, Trust and Privacy in the Age of Disclosure.
This is a talk delivered at the Web Science Tea on 4/18/2008.
- I looked at the issues relating to security, trust and privacy. We
discussed how security, trust and privacy are being melted from three
directions: the human drive for community; the very attempts to improve
security, trust and privacy; and corporate profit.
Our discussion was guided by several different security models. The
economic model, which considers that we "buy" trust with "payments" of
private information is currently very popular. We also used models
from information security (for example, the characteristics of a good
click for email@example.com
Steven P. Crain
Computational Science and Engineering
Georgia Institute of Technology
Atlanta, GA 30332-0280
Last Modified: November 12, 2007