--

 There are all kinds of supervised and
 semi-supervised content analysis approaches that
 might be applicable - their effectiveness depends
 on the nature of your data, the level of detail
 you require, and your ability to design training
 data. For example, if you have a superset of
 health information topics and keywords associated
 with them you could probably do a simple keyword
 analysis to give you an idea of which posts are
 talking about which topic. The advantage is that's
 pretty easy to do even if the data is stored as
 text blobs in your MySQL database. However, if you
 need to do serious contextual analysis or you
 don't already know the scope and range of possible
 health topics and keywords that's a more difficult
 problem. You'll probably need to try some
 different machine learning approaches and see what
 works best for you needs. You can do literature
 searches for papers published at ICWSM, KDD, or
 WWW for examples of some approaches. There are
 other conferences that will be applicable, but
 those are good places to start. Follow the
 citation trail from the relevant pieces and you
 should get some idea of where to look for useful
 references.

 Network analysis of these things can either be
 pretty straightforward, moderately complicated, or
 totally impossible. It all depends on how your
 data is structured and what type of information is
 available. If you've got a standard relational
 database structure designed to drive the content
 in the online community then at best you're in for
 some work to reformat the data tables into
 something useful for actual analysis. What you
 hope for are tables with various combinations of
 userid, postid, time stamp, user_data, and
 post_data. You'll probably need to join several
 tables to get the actual data files you need, and
 although MySQL is not optimized for joins it
 should still be manageable.

 Assuming you're new to this type of analysis, my
 advice is to spend a lot of time familiarizing
 yourself with the underlying data. Do some basic
 distributions and see how much noise you've got in
 the system, find out the most efficient routes to
 generating the output you need, and discover what
 information is available in which table and how
 those tables are keyed and indexed. Make sure
 you're on the lookout for garbage data - a lot of
 the great data you get from various online sources
 is basically bad, either because of system errors
 (rare and usually easy to find) or "bad" users
 (common and not so easy to find - there's a whole
 literature on spam detection algorithms out
 there). You'll need to make decisions about your
 error tolerance and what types of behaviors you
 wish to ignore, and you can only do that
 effectively if you understand your data.

 I don't know how much experience you have with
 database queries, but assuming you can only handle
 moderately complex queries my advice is to use the
 database as a source for your final dataset and
 then conduct your analysis in some other tool. My
 typical approach to this situation is to write
 database queries that produce flat text files with
 one row per observation (typically per user, or
 per user/time_period combination, but this
 obviously depends on your research question) with
 one column for each metric. Then I load the data
 into R or Stata or whatever else and build models.
 You can do a fair amount of work in MySQL, but
 this is typically slower and more difficult than
 exporting and using an actual statistical package.

 The drawback of exporting data is that you're
 limited to whatever your stats package can hold in
 memory. If your data set is large (hundreds of
 thousands or millions of observations) then you
 need a fair amount of memory to run any kind of
 complex model. If you're dealing with 10s or 100s
 of millions of observations in your model then
 things get really interesting - I suggest
 sampling, but there are other more difficult
 options.

 If you don't know MySQL at all, you need to learn
 it. There are a plethora of books on MySQL out
 there - I like O'Reilly for reference and SAMS for
 instruction, so if you get something from one of
 those publishers you should be ok. You will also
 want to learn how to do some scripting in Python
 or Perl. For the rank beginner, I recommend going
 with Python and learning by working through the
 chapters and exercises in How To Think Like a
 Computer Scientist (free online at
 http://www.greenteapress.com/thinkpython/thinkCSpy/html/
 ). If you know how to program in general then
 diveintopython.org <http://diveintopython.org>  is
 your best bet. I'm not a Perl guy, but I'm sure
 someone can point you to resources.

 For the actual network analysis, I'd first look
 into NodeXL since it's easy to use, and if that
 doesn't meet your needs I'd go with something like
 igraph (see http://igraph.sourceforge.net/ ),
 which works with both R and Python.

 Best of luck.

 -Tom

 On Thu, Feb 25, 2010 at 6:14 PM, Tim Hale
 <timhale@uab.edu> wrote:

   Hi everyone,

   I am working with others on a project that will
   examine health communication among members of an
   online community who post to an online forum. We
   have three primary goals: (1) to conduct a
   content analysis to understand the types of
   health information that is communicated; (2)
   identify the context of the discussions,
   including understanding the characteristics of
   the individuals who initiate and disseminate
   health information; and (3) to conduct a social
   network analysis to examine the larger
   structures of health information sharing among
   community members.

   Although this type of research could be
   conducted by manually collecting posts from the
   online forum, coding for content, and the
   creation of a data set for social network
   analysis -- we are interested in other
   approaches that make better use of the forum
   database files. We have the cooperation of the
   website owner and administrator to access the
   MySQL database.

   I am seeking advice from anyone with experience
   working on similar research questions involving
   online forums and especially, making use of the
   original forum database files. All
   recommendations, suggestions, and pointers to
   articles, books, and appropriate tools are
   welcome and greatly appreciated.

   Thank you,
   Tim Hale

   ------------------------------------------------------------
   Timothy M. Hale, MA
   University of Alabama at Birmingham
   Department of Sociology
   Heritage Hall 460E
   1401 University Boulevard
   Birmingham, AL 35294-1152
   205.222.8108 (cell)
   timhale@uab.edu

   _______________________________________________
   CITASA mailing list
   CITASA@list.citasa.org
   http://list.citasa.org/mailman/listinfo/citasa_list.citasa.org

-------------------------------------------------

 _______________________________________________
 CITASA mailing list
 CITASA@list.citasa.org
 http://list.citasa.org/mailman/listinfo/citasa_list.citasa.org

citams@list.citams.org

[CITASA] seeking suggestions about research methodology for online forums

--

netdynam email list
[group dynamics on the internet:
now blogging at www.netdynam.org]

citams@list.citams.org

[CITASA] seeking suggestions about research methodology for online forums

--

netdynam email list [group dynamics on the internet: now blogging at www.netdynam.org]

netdynam email list
[group dynamics on the internet:
now blogging at www.netdynam.org]