[CITASA] seeking suggestions about research methodology for online forums

TH
Tim Hale
Fri, Feb 26, 2010 2:14 AM

Hi everyone,

I am working with others on a project that will examine health communication among members of an online community who post to an online forum. We have three primary goals: (1) to conduct a content analysis to understand the types of health information that are communicated; (2) to identify the context of the discussions, including understanding the characteristics of the individuals who initiate and disseminate health information; and (3) to conduct a social network analysis to examine the larger structures of health information sharing among community members.

Although this type of research could be conducted by manually collecting posts from the online forum, coding them for content, and creating a data set for social network analysis, we are interested in approaches that make better use of the forum database files. We have the cooperation of the website owner and administrator to access the MySQL database.

I am seeking advice from anyone with experience working on similar research questions involving online forums, especially anyone who has made use of the original forum database files. All recommendations, suggestions, and pointers to articles, books, and appropriate tools are welcome and greatly appreciated.

Thank you,
Tim Hale


Timothy M. Hale, MA
University of Alabama at Birmingham
Department of Sociology
Heritage Hall 460E
1401 University Boulevard
Birmingham, AL 35294-1152
205.222.8108 (cell)
timhale@uab.edu

TM
Thomas M. Lento
Fri, Feb 26, 2010 2:54 AM

There are all kinds of supervised and semi-supervised content analysis
approaches that might be applicable - their effectiveness depends on the
nature of your data, the level of detail you require, and your ability to
design training data. For example, if you have a superset of health
information topics and keywords associated with them you could probably do a
simple keyword analysis to give you an idea of which posts are talking about
which topic. The advantage is that this is pretty easy to do even if the data
is stored as text blobs in your MySQL database. However, if you need to do
serious contextual analysis or you don't already know the scope and range of
possible health topics and keywords that's a more difficult problem. You'll
probably need to try some different machine learning approaches and see what
works best for your needs. You can do literature searches for papers
published at ICWSM, KDD, or WWW for examples of some approaches. There are
other conferences that will be applicable, but those are good places to
start. Follow the citation trail from the relevant pieces and you should get
some idea of where to look for useful references.
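
The keyword approach described above can be sketched in a few lines of Python. The topic names, keyword lists, and sample posts here are invented placeholders, not anything from the actual project; matching is naive substring search, which is exactly the kind of crude-but-workable first pass being suggested.

```python
# Map each (hypothetical) health topic to a hand-built keyword set.
TOPIC_KEYWORDS = {
    "medication": {"dosage", "prescription", "side effects", "pill"},
    "diet": {"nutrition", "calories", "sugar", "diet"},
}

def tag_post(text):
    """Return the set of topics whose keywords appear in the post.

    Naive substring matching: good enough for a first look, but it
    will happily match "pill" inside "caterpillar".
    """
    lowered = text.lower()
    return {topic for topic, words in TOPIC_KEYWORDS.items()
            if any(w in lowered for w in words)}

posts = ["Has anyone had side effects from this prescription?",
         "Cutting sugar helped my diet a lot."]
tags = [tag_post(p) for p in posts]
```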

Network analysis of these things can either be pretty straightforward,
moderately complicated, or totally impossible. It all depends on how your
data is structured and what type of information is available. If you've got
a standard relational database structure designed to drive the content in
the online community then at best you're in for some work to reformat the
data tables into something useful for actual analysis. What you hope for are
tables with various combinations of userid, postid, timestamp, user_data,
and post_data. You'll probably need to join several tables to get the actual
data files you need, and although MySQL is not optimized for joins it should
still be manageable.
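
The kind of join described above might look like the sketch below. It uses the stdlib sqlite3 module purely as a stand-in so the SQL can be run anywhere; the same query shape works against MySQL through a connector library. The table and column names (users, posts, userid, and so on) are assumptions about a typical forum schema, not the actual database.

```python
import sqlite3

# In-memory toy database standing in for the forum's MySQL schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userid INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE posts (postid INTEGER PRIMARY KEY, userid INTEGER,
                    posted_at TEXT, body TEXT);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO posts VALUES (10, 1, '2010-02-01', 'first post'),
                         (11, 2, '2010-02-02', 'a reply');
""")

# One row per post with the poster's name attached -- the shape you
# want before exporting for content or network analysis.
rows = conn.execute("""
    SELECT p.postid, u.username, p.posted_at, p.body
    FROM posts AS p JOIN users AS u ON p.userid = u.userid
    ORDER BY p.postid
""").fetchall()
```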

Assuming you're new to this type of analysis, my advice is to spend a lot of
time familiarizing yourself with the underlying data. Do some basic
distributions and see how much noise you've got in the system, find out the
most efficient routes to generating the output you need, and discover what
information is available in which table and how those tables are keyed and
indexed. Make sure you're on the lookout for garbage data - a lot of the
great data you get from various online sources is basically bad, either
because of system errors (rare and usually easy to find) or "bad" users
(common and not so easy to find - there's a whole literature on spam
detection algorithms out there). You'll need to make decisions about your
error tolerance and what types of behaviors you wish to ignore, and you can
only do that effectively if you understand your data.
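
A first-pass look at the posting distribution, in the spirit of the advice above: count posts per user and flag heavy outliers for manual inspection as possible bots or spammers. The (userid, postid) pairs and the 2x-the-mean threshold are invented for illustration; any real cutoff should come from eyeballing your own distribution.

```python
from collections import Counter

# Invented (userid, postid) sample data.
post_log = [(1, 10), (1, 11), (2, 12), (3, 13), (3, 14), (3, 15),
            (3, 16), (3, 17), (3, 18), (3, 19), (3, 20)]

posts_per_user = Counter(uid for uid, _ in post_log)
mean = sum(posts_per_user.values()) / len(posts_per_user)

# Crude screen: anyone posting more than twice the mean gets a manual
# look before they can dominate the network statistics.
suspects = [uid for uid, n in posts_per_user.items() if n > 2 * mean]
```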

I don't know how much experience you have with database queries, but
assuming you can only handle moderately complex queries my advice is to use
the database as a source for your final dataset and then conduct your
analysis in some other tool. My typical approach to this situation is to
write database queries that produce flat text files with one row per
observation (typically per user, or per user/time_period combination, but
this obviously depends on your research question) with one column for each
metric. Then I load the data into R or Stata or whatever else and build
models. You can do a fair amount of work in MySQL, but this is typically
slower and more difficult than exporting and using an actual statistical
package.
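
The "one row per observation" export described above can be sketched like this: aggregate per-user metrics in Python, then write a flat CSV that R or Stata loads directly. The input records and metric names are hypothetical.

```python
import csv
from collections import defaultdict

# Invented post records; in practice these come out of the database.
posts = [
    {"userid": 1, "thread": "a", "words": 120},
    {"userid": 1, "thread": "b", "words": 45},
    {"userid": 2, "thread": "a", "words": 300},
]

# Accumulate one metrics dict per user.
metrics = defaultdict(lambda: {"n_posts": 0, "threads": set(), "words": 0})
for p in posts:
    m = metrics[p["userid"]]
    m["n_posts"] += 1
    m["threads"].add(p["thread"])
    m["words"] += p["words"]

# One row per user, one column per metric.
with open("user_metrics.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["userid", "n_posts", "n_threads", "total_words"])
    for uid, m in sorted(metrics.items()):
        w.writerow([uid, m["n_posts"], len(m["threads"]), m["words"]])
```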

The drawback of exporting data is that you're limited to whatever your stats
package can hold in memory. If your data set is large (hundreds of thousands
or millions of observations) then you need a fair amount of memory to run
any kind of complex model. If you're dealing with 10s or 100s of millions of
observations in your model then things get really interesting - I suggest
sampling, but there are other more difficult options.
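
One standard way to do the sampling suggested above is reservoir sampling: it draws a uniform sample of k observations in a single pass without needing to know the total row count up front, which suits streaming rows out of a large table. A minimal sketch:

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Uniformly sample k items from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # replace with decreasing probability
            if j < k:
                sample[j] = row
    return sample

subset = reservoir_sample(range(1_000_000), 1000)
```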

If you don't know MySQL at all, you need to learn it. There are a plethora
of books on MySQL out there - I like O'Reilly for reference and SAMS for
instruction, so if you get something from one of those publishers you should
be ok. You will also want to learn how to do some scripting in Python or
Perl. For the rank beginner, I recommend going with Python and learning by
working through the chapters and exercises in How To Think Like a Computer
Scientist (free online at
http://www.greenteapress.com/thinkpython/thinkCSpy/html/ ). If you know how
to program in general then diveintopython.org is your best bet. I'm not a
Perl guy, but I'm sure someone can point you to resources.

For the actual network analysis, I'd first look into NodeXL since it's easy
to use, and if that doesn't meet your needs I'd go with something like
igraph (see http://igraph.sourceforge.net/ ), which works with both R and
Python.
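
Before reaching for NodeXL or igraph, note that the network itself is just an edge list you can build from the posts table. One common construction (an assumption here, not something specified in the original question) connects every pair of users who posted in the same thread; the thread memberships below are invented sample data.

```python
from collections import Counter
from itertools import combinations

# Invented thread -> set-of-posters data.
thread_posters = {
    "t1": {"alice", "bob", "carol"},
    "t2": {"bob", "carol"},
}

# Weighted co-posting edges: weight = number of shared threads.
edges = Counter()
for posters in thread_posters.values():
    for u, v in combinations(sorted(posters), 2):
        edges[(u, v)] += 1

# Unweighted degree per user, straight from the edge list.
degree = Counter()
for (u, v), w in edges.items():
    degree[u] += 1
    degree[v] += 1
```

This edge list is exactly what NodeXL or igraph would ingest for the heavier analysis.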

Best of luck.

-Tom

CITASA mailing list
CITASA@list.citasa.org
http://list.citasa.org/mailman/listinfo/citasa_list.citasa.org

CM
Christine Morton
Fri, Feb 26, 2010 3:34 AM

Wow. Thank you for the question, Tim, and for sharing the outline of your
research. It sounds very interesting and relevant. Would you be interested
in sharing more details with me offlist? And thanks, Tom, for your
thoughtful and thorough reply. I've learned a lot.
Regards,
Christine


CMQCC:  Transforming Maternity Care 

Christine H. Morton, PhD
Program Manager/Research Sociologist
California Maternal Quality Care Collaborative

Stanford University                p. 650-725-6108    f.  650-721-5751
Medical School Office Building            d. 650-721-2187    c. 650-995-4550
251 Campus Drive
Palo Alto, CA  94305-5415

cmorton@stanford.edu        www.cmqcc.org

CH
Caroline Haythornthwaite
Fri, Feb 26, 2010 11:30 AM

I forwarded Tim Hale's question to Anatoliy Gruzd at Dalhousie, but others may also be interested in his "TextAnalytics" system (see http://anatoliygruzd.com/home/?page_id=27), which combines text analysis and network analysis of threaded discussions.

/Caroline

---- Original message ----

Date: Thu, 25 Feb 2010 19:34:05 -0800
From: Christine Morton christine@christinemorton.com
Subject: Re: [CITASA] seeking suggestions about research methodology for online forums
To: "Thomas M. Lento" thomas.lento@gmail.com, Tim Hale timhale@uab.edu
Cc: CITASA@list.citasa.org

Wow.  Thank you for the question, Tim, and for
sharing the outline of your research.  It sounds
very interesting and relevant.  Would you be
interested in sharing more details with me offlist?
And thanks, Tom, for your thoughtful and thorough
reply. I've learned a lot.
Regards,
Christine

On 2/25/10 6:54 PM, "Thomas M. Lento"
thomas.lento@gmail.com wrote:

 There are all kinds of supervised and
 semi-supervised content analysis approaches that
 might be applicable - their effectiveness depends
 on the nature of your data, the level of detail
 you require, and your ability to design training
 data. For example, if you have a superset of
 health information topics and keywords associated
 with them you could probably do a simple keyword
 analysis to give you an idea of which posts are
 talking about which topic. The advantage is that it's
 pretty easy to do even if the data is stored as
 text blobs in your MySQL database. However, if you
 need to do serious contextual analysis or you
 don't already know the scope and range of possible
 health topics and keywords that's a more difficult
 problem. You'll probably need to try some
 different machine learning approaches and see what
 works best for your needs. You can do literature
 searches for papers published at ICWSM, KDD, or
 WWW for examples of some approaches. There are
 other conferences that will be applicable, but
 those are good places to start. Follow the
 citation trail from the relevant pieces and you
 should get some idea of where to look for useful
 references.
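
The keyword approach Tom describes can be sketched in a few lines of Python. The topic lists below are invented placeholders; in a real study, the superset of topics and keywords would come from the project's coding scheme.

```python
from collections import Counter

# Hypothetical keyword lists per health topic -- the real superset
# would come from the research team's coding scheme, not these guesses.
TOPIC_KEYWORDS = {
    "medication": {"dosage", "prescription", "side effect", "pill"},
    "diagnosis": {"symptom", "test result", "biopsy", "scan"},
    "support": {"thank", "hang in there", "hope", "prayers"},
}

def tag_post(text):
    """Return the set of topics whose keywords appear in a post."""
    lowered = text.lower()
    return {topic for topic, words in TOPIC_KEYWORDS.items()
            if any(w in lowered for w in words)}

def topic_counts(posts):
    """Count how many posts mention each topic."""
    counts = Counter()
    for post in posts:
        counts.update(tag_post(post))
    return counts
```

This simple substring matching is exactly the "pretty easy" baseline Tom mentions; it runs fine over text pulled straight out of MySQL, and its misses tell you whether you need a real machine-learning approach.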

 Network analysis of these things can either be
 pretty straightforward, moderately complicated, or
 totally impossible. It all depends on how your
 data is structured and what type of information is
 available. If you've got a standard relational
 database structure designed to drive the content
 in the online community then at best you're in for
 some work to reformat the data tables into
 something useful for actual analysis. What you
 hope for are tables with various combinations of
 userid, postid, time stamp, user_data, and
 post_data. You'll probably need to join several
 tables to get the actual data files you need, and
 although MySQL is not optimized for joins it
 should still be manageable.
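
The kind of join Tom hopes for might look like the sketch below. The schema here (table and column names like `posts`, `users`, `threadid`) is an assumption standing in for whatever the forum software actually uses, and sqlite3 stands in for MySQL so the example is self-contained.

```python
import sqlite3

# sqlite3 stands in for MySQL here; the schema is an assumption --
# real forum software will have its own table and column names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (userid INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE posts (postid INTEGER PRIMARY KEY, userid INTEGER,
                        threadid INTEGER, posted_at TEXT, body TEXT);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO posts VALUES
        (10, 1, 100, '2010-02-01', 'original question'),
        (11, 2, 100, '2010-02-02', 'a reply');
""")

# Join users to their posts: the combination of userid, postid,
# timestamp, and post data that analysis needs.
rows = conn.execute("""
    SELECT u.username, p.threadid, p.posted_at, p.body
    FROM posts p JOIN users u ON u.userid = p.userid
    ORDER BY p.posted_at
""").fetchall()
```

The same SELECT, pointed at the real MySQL database, would be the starting point for reshaping the production tables into analysis-ready records.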

 Assuming you're new to this type of analysis, my
 advice is to spend a lot of time familiarizing
 yourself with the underlying data. Do some basic
 distributions and see how much noise you've got in
 the system, find out the most efficient routes to
 generating the output you need, and discover what
 information is available in which table and how
 those tables are keyed and indexed. Make sure
 you're on the lookout for garbage data - a lot of
 the great data you get from various online sources
 is basically bad, either because of system errors
 (rare and usually easy to find) or "bad" users
 (common and not so easy to find - there's a whole
 literature on spam detection algorithms out
 there). You'll need to make decisions about your
 error tolerance and what types of behaviors you
 wish to ignore, and you can only do that
 effectively if you understand your data.
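
A first pass at the "basic distributions" step might look like this. The data and the outlier threshold are both illustrative assumptions; the point is that a posts-per-user distribution is cheap to compute and immediately surfaces accounts worth inspecting by hand.

```python
from collections import Counter

# (userid, postid) pairs -- illustrative records, not real forum data.
post_log = [(1, 10), (1, 11), (2, 12), (3, 13), (3, 14), (3, 15), (3, 16)]

posts_per_user = Counter(uid for uid, _ in post_log)

# A crude garbage filter: flag users posting far above the median
# as candidates for manual inspection (possible spam accounts).
# The 1.5x threshold is an arbitrary assumption to tune per dataset.
counts = sorted(posts_per_user.values())
median = counts[len(counts) // 2]
suspects = [uid for uid, n in posts_per_user.items() if n > 1.5 * median]
```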

 I don't know how much experience you have with
 database queries, but assuming you can only handle
 moderately complex queries my advice is to use the
 database as a source for your final dataset and
 then conduct your analysis in some other tool. My
 typical approach to this situation is to write
 database queries that produce flat text files with
 one row per observation (typically per user, or
 per user/time_period combination, but this
 obviously depends on your research question) with
 one column for each metric. Then I load the data
 into R or Stata or whatever else and build models.
 You can do a fair amount of work in MySQL, but
 this is typically slower and more difficult than
 exporting and using an actual statistical package.
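
Tom's one-row-per-observation export can be sketched like this. The two metrics (post count, distinct threads) are placeholders for whatever the research questions actually require, and the output goes to an in-memory buffer here where a real run would write a file for R or Stata to read.

```python
import csv
import io
from collections import defaultdict

# Illustrative post records; in practice these come from the database query.
posts = [
    {"userid": 1, "threadid": 100},
    {"userid": 1, "threadid": 101},
    {"userid": 2, "threadid": 100},
]

# Aggregate to one row per user, one column per metric.
metrics = defaultdict(lambda: {"post_count": 0, "threads": set()})
for p in posts:
    m = metrics[p["userid"]]
    m["post_count"] += 1
    m["threads"].add(p["threadid"])

buf = io.StringIO()  # in practice: open("user_metrics.csv", "w", newline="")
writer = csv.writer(buf)
writer.writerow(["userid", "post_count", "thread_count"])
for uid in sorted(metrics):
    writer.writerow([uid, metrics[uid]["post_count"],
                     len(metrics[uid]["threads"])])

flat_file = buf.getvalue()
```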

 The drawback of exporting data is that you're
 limited to whatever your stats package can hold in
 memory. If your data set is large (hundreds of
 thousands or millions of observations) then you
 need a fair amount of memory to run any kind of
 complex model. If you're dealing with 10s or 100s
 of millions of observations in your model then
 things get really interesting - I suggest
 sampling, but there are other more difficult
 options.
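
The sampling Tom suggests is a one-liner. Sampling user ids rather than individual rows keeps each sampled user's complete posting history intact; the sample size is an assumption to tune against the stats package's memory budget.

```python
import random

random.seed(42)  # fix the seed so the sample is reproducible
all_users = list(range(1_000))  # stand-in for the real userid list
k = 100  # assumption: a size the stats package can hold in memory
sampled = random.sample(all_users, k)
```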

 If you don't know MySQL at all, you need to learn
 it. There are a plethora of books on MySQL out
 there - I like O'Reilly for reference and SAMS for
 instruction, so if you get something from one of
 those publishers you should be ok. You will also
 want to learn how to do some scripting in Python
 or Perl. For the rank beginner, I recommend going
 with Python and learning by working through the
 chapters and exercises in How To Think Like a
 Computer Scientist (free online at
 http://www.greenteapress.com/thinkpython/thinkCSpy/html/
 ). If you know how to program in general then
 diveintopython.org is
 your best bet. I'm not a Perl guy, but I'm sure
 someone can point you to resources.

 For the actual network analysis, I'd first look
 into NodeXL since it's easy to use, and if that
 doesn't meet your needs I'd go with something like
 igraph (see http://igraph.sourceforge.net/ ),
 which works with both R and Python.
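
Before NodeXL or igraph can do anything, the forum records have to become an edge list. One sketch, under the assumption that a reply is directed at the thread's original poster (one of several defensible network definitions for forum data):

```python
from collections import Counter

# (postid, userid, threadid) records -- illustrative, not real data.
posts = [
    (10, 1, 100), (11, 2, 100), (12, 3, 100),
    (20, 2, 200), (21, 1, 200),
]

# Assumption: each reply is an edge from the replier to the thread
# starter. Other definitions (reply-to-previous-post, co-participation)
# would give different networks.
thread_starter = {}
edges = []
for postid, userid, threadid in sorted(posts):
    if threadid not in thread_starter:
        thread_starter[threadid] = userid
    elif userid != thread_starter[threadid]:
        edges.append((userid, thread_starter[threadid]))

# The edge list can be exported for NodeXL or fed to igraph; a
# quick degree count gives a first look at who answers whom.
out_degree = Counter(src for src, _ in edges)
```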

 Best of luck.

 -Tom

 On Thu, Feb 25, 2010 at 6:14 PM, Tim Hale
 <timhale@uab.edu> wrote:

   Hi everyone,

   I am working with others on a project that will
   examine health communication among members of an
   online community who post to an online forum. We
   have three primary goals: (1) to conduct a
   content analysis to understand the types of
   health information that is communicated; (2)
   identify the context of the discussions,
   including understanding the characteristics of
   the individuals who initiate and disseminate
   health information; and (3) to conduct a social
   network analysis to examine the larger
   structures of health information sharing among
   community members.

   Although this type of research could be
   conducted by manually collecting posts from the
   online forum, coding for content, and the
   creation of a data set for social network
   analysis -- we are interested in other
   approaches that make better use of the forum
   database files. We have the cooperation of the
   website owner and administrator to access the
   MySQL database.

   I am seeking advice from anyone with experience
   working on similar research questions involving
   online forums and especially, making use of the
   original forum database files. All
   recommendations, suggestions, and pointers to
   articles, books, and appropriate tools are
   welcome and greatly appreciated.

   Thank you,
   Tim Hale

   ------------------------------------------------------------
   Timothy M. Hale, MA
   University of Alabama at Birmingham
   Department of Sociology
   Heritage Hall 460E
   1401 University Boulevard
   Birmingham, AL 35294-1152
   205.222.8108 (cell)
   timhale@uab.edu

   _______________________________________________
   CITASA mailing list
   CITASA@list.citasa.org
   http://list.citasa.org/mailman/listinfo/citasa_list.citasa.org

-------------------------------------------------


CMQCC:  Transforming Maternity Care 

-------------------------------------------------

Christine H. Morton, PhD
Program Manager/Research Sociologist
California Maternal Quality Care Collaborative

Stanford University                p. 650-725-6108
f.  650-721-5751
Medical School Office Building            d.
650-721-2187    c. 650-995-4550
251 Campus Drive
Palo Alto, CA  94305-5415

cmorton@stanford.edu        www.cmqcc.org





Caroline Haythornthwaite

Leverhulme Visiting Professor, Institute of Education, University of London (2009-10)

Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 East Daniel St., Champaign IL 61820 (haythorn@illinois.edu)

ED
el don
Tue, Mar 2, 2010 6:49 AM

a late entry but perhaps of interest.

as a linguist studying written interaction, i found sandra harrison's
work to be readable and relevant. our focus is not so much on
automatic data mining, but on what goes on in groups where
interaction occurs via language (and other modalities at times). one
of her papers outlines an approach that can be used to represent both
the posts in chronological order and who posted them ... it's a
simple framework (lots of manual labour i'm afraid), but it works
very well to show in summary what has been going on in long
multi-party discussions.

it's in a collection looking at communities of practice:
Harrison, S. (2003) "Computer-mediated interaction: Using discourse
maps to represent multi-party, multi-topic asynchronous discussions"
in Sarangi, S. & T. van Leeuwen: Applied Linguistics and Communities
of Practice.

the book can be viewed via google books, and most of the chapter can
be read online.

best,
alex

At 8:14 PM -0600 25/2/10, Tim Hale wrote:

Hi everyone,

I am working with others on a project that will examine health
communication among members of an online community who post to an
online forum. We have three primary goals: (1) to conduct a content
analysis to understand the types of health information that is
communicated; (2) identify the context of the discussions, including
understanding the characteristics of the individuals who initiate
and disseminate health information; and (3) to conduct a social
network analysis to examine the larger structures of health
information sharing among community members.

Although this type of research could be conducted by manually
collecting posts from the online forum, coding for content, and the
creation of a data set for social network analysis -- we are
interested in other approaches that make better use of the forum
database files. We have the cooperation of the website owner and
administrator to access the MySQL database.

I am seeking advice from anyone with experience working on similar
research questions involving online forums and especially, making
use of the original forum database files. All recommendations,
suggestions, and pointers to articles, books, and appropriate tools
are welcome and greatly appreciated.

Thank you,
Tim Hale


Timothy M. Hale, MA
University of Alabama at Birmingham
Department of Sociology
Heritage Hall 460E
1401 University Boulevard
Birmingham, AL 35294-1152
205.222.8108 (cell)
timhale@uab.edu

--

netdynam email list
[group dynamics on the internet:
now blogging at www.netdynam.org]
