Tim Hale
Fri, Feb 26, 2010 2:14 AM
Hi everyone,
I am working with others on a project that will examine health communication among members of an online community who post to an online forum. We have three primary goals: (1) to conduct a content analysis to understand the types of health information that are communicated; (2) to identify the context of the discussions, including the characteristics of the individuals who initiate and disseminate health information; and (3) to conduct a social network analysis to examine the larger structures of health information sharing among community members.
Although this type of research could be conducted by manually collecting posts from the online forum, coding them for content, and creating a data set for social network analysis, we are interested in other approaches that make better use of the forum database files. We have the cooperation of the website owner and administrator to access the MySQL database.
I am seeking advice from anyone with experience working on similar research questions involving online forums, and especially with making use of the original forum database files. All recommendations, suggestions, and pointers to articles, books, and appropriate tools are welcome and greatly appreciated.
Thank you,
Tim Hale
Timothy M. Hale, MA
University of Alabama at Birmingham
Department of Sociology
Heritage Hall 460E
1401 University Boulevard
Birmingham, AL 35294-1152
205.222.8108 (cell)
timhale@uab.edu
Thomas M. Lento
Fri, Feb 26, 2010 2:54 AM
There are all kinds of supervised and semi-supervised content analysis
approaches that might be applicable - their effectiveness depends on the
nature of your data, the level of detail you require, and your ability to
design training data. For example, if you have a superset of health
information topics and keywords associated with them you could probably do a
simple keyword analysis to give you an idea of which posts are talking about
which topic. The advantage is that it's pretty easy to do even if the data is
stored as text blobs in your MySQL database. However, if you need to do
serious contextual analysis or you don't already know the scope and range of
possible health topics and keywords, that's a more difficult problem. You'll
probably need to try some different machine learning approaches and see what
works best for your needs. You can do literature searches for papers
published at ICWSM, KDD, or WWW for examples of some approaches. There are
other conferences that will be applicable, but those are good places to
start. Follow the citation trail from the relevant pieces and you should get
some idea of where to look for useful references.
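To make that concrete, here's a rough sketch of the keyword approach in Python, pulling post text straight out of MySQL. The table and column names (posts, postid, body), the connection details, and the topic keywords are all placeholders for whatever the real forum schema and coding scheme turn out to be; I'm using the PyMySQL driver here, but MySQLdb works much the same way.

# Hypothetical keyword tagging over forum posts stored in MySQL.
# Table/column names and keywords are placeholders, not the real schema.
import pymysql

TOPIC_KEYWORDS = {
    "medication": ["prescription", "dosage", "side effect"],
    "nutrition": ["diet", "vitamin", "calorie"],
}

conn = pymysql.connect(host="localhost", user="researcher",
                       password="secret", db="forum")
with conn.cursor() as cur:
    cur.execute("SELECT postid, body FROM posts")
    for postid, body in cur:
        text = (body or "").lower()
        topics = [name for name, kws in TOPIC_KEYWORDS.items()
                  if any(kw in text for kw in kws)]
        if topics:
            print(postid, ";".join(topics))
conn.close()

You'd obviously want stemming, phrase matching, and a validated keyword list before treating the output as codes, but this is enough to get a first read on topic prevalence.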
Network analysis of these things can be pretty straightforward,
moderately complicated, or totally impossible. It all depends on how your
data is structured and what type of information is available. If you've got
a standard relational database structure designed to drive the content in
the online community then at best you're in for some work to reformat the
data tables into something useful for actual analysis. What you hope for are
tables with various combinations of userid, postid, time stamp, user_data,
and post_data. You'll probably need to join several tables to get the actual
data files you need, and although MySQL is not optimized for joins it should
still be manageable.
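As a hypothetical illustration of that reshaping: if replies carry a pointer to the post they respond to, a self-join turns the posts table into a reply edge list you can feed to a network package. The schema here (posts with postid, userid, parent_postid, created_at) is a guess at a typical forum layout, not the actual database.

# Sketch: build a (replier, replied-to, timestamp) edge list via a self-join.
# Schema is hypothetical; adjust names to the real forum tables.
import csv
import pymysql

conn = pymysql.connect(host="localhost", user="researcher",
                       password="secret", db="forum")
with conn.cursor() as cur:
    cur.execute("""
        SELECT child.userid, parent.userid, child.created_at
        FROM posts AS child
        JOIN posts AS parent ON child.parent_postid = parent.postid
    """)
    edges = cur.fetchall()
conn.close()

with open("reply_edges.csv", "w", newline="") as f:
    csv.writer(f).writerows(edges)

If the forum only records thread membership rather than explicit replies, you'd instead build co-participation edges (two users linked if they posted in the same thread), which is a different and weaker signal.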
Assuming you're new to this type of analysis, my advice is to spend a lot of
time familiarizing yourself with the underlying data. Do some basic
distributions and see how much noise you've got in the system, find out the
most efficient routes to generating the output you need, and discover what
information is available in which table and how those tables are keyed and
indexed. Make sure you're on the lookout for garbage data - a lot of the
great data you get from various online sources is basically bad, either
because of system errors (rare and usually easy to find) or "bad" users
(common and not so easy to find - there's a whole literature on spam
detection algorithms out there). You'll need to make decisions about your
error tolerance and what types of behaviors you wish to ignore, and you can
only do that effectively if you understand your data.
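For example, a quick pass like this, again against a hypothetical posts table, gives you the posts-per-user distribution, which is usually enough to spot spam accounts, bots, and other outliers worth inspecting by hand.

# Sketch: posts-per-user distribution as a first sanity check.
from collections import Counter
import pymysql

conn = pymysql.connect(host="localhost", user="researcher",
                       password="secret", db="forum")
with conn.cursor() as cur:
    cur.execute("SELECT userid, COUNT(*) FROM posts GROUP BY userid")
    posts_per_user = dict(cur.fetchall())
conn.close()

histogram = Counter(posts_per_user.values())
for n_posts, n_users in sorted(histogram.items()):
    print(n_users, "users wrote", n_posts, "posts")
print("heaviest poster wrote", max(posts_per_user.values()), "posts")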
I don't know how much experience you have with database queries, but
assuming you can only handle moderately complex queries my advice is to use
the database as a source for your final dataset and then conduct your
analysis in some other tool. My typical approach to this situation is to
write database queries that produce flat text files with one row per
observation (typically per user, or per user/time_period combination, but
this obviously depends on your research question) with one column for each
metric. Then I load the data into R or Stata or whatever else and build
models. You can do a fair amount of work in MySQL, but this is typically
slower and more difficult than exporting and using an actual statistical
package.
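A minimal version of that export step might look like the sketch below. The particular metrics and column names are placeholders; the point is the shape of the output, one row per user and one column per metric, written to a flat file that read.csv() in R (or insheet in Stata) will pick up directly.

# Sketch: one row per user, one column per metric, exported to CSV.
# Metrics and schema are hypothetical examples.
import csv
import pymysql

conn = pymysql.connect(host="localhost", user="researcher",
                       password="secret", db="forum")
with conn.cursor() as cur:
    cur.execute("""
        SELECT userid,
               COUNT(*)                 AS n_posts,
               COUNT(DISTINCT threadid) AS n_threads,
               MIN(created_at)          AS first_post
        FROM posts
        GROUP BY userid
    """)
    rows = cur.fetchall()
conn.close()

with open("user_metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["userid", "n_posts", "n_threads", "first_post"])
    writer.writerows(rows)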
The drawback of exporting data is that you're limited to whatever your stats
package can hold in memory. If your data set is large (hundreds of thousands
or millions of observations) then you need a fair amount of memory to run
any kind of complex model. If you're dealing with 10s or 100s of millions of
observations in your model then things get really interesting - I suggest
sampling, but there are other more difficult options.
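If you do end up sampling, the simplest version is to draw a random set of users up front and restrict every later query and export to that set, something like the sketch below (assuming the user ids fit comfortably in memory and there are at least as many users as the sample size).

# Sketch: reproducible random sample of users to keep exports manageable.
import random
import pymysql

conn = pymysql.connect(host="localhost", user="researcher",
                       password="secret", db="forum")
with conn.cursor() as cur:
    cur.execute("SELECT DISTINCT userid FROM posts")
    all_users = [row[0] for row in cur.fetchall()]
conn.close()

random.seed(42)  # fix the seed so the sample can be regenerated
sampled_users = set(random.sample(all_users, 10000))
# ...then filter later queries/exports to userids in sampled_users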
If you don't know MySQL at all, you need to learn it. There are a plethora
of books on MySQL out there - I like O'Reilly for reference and SAMS for
instruction, so if you get something from one of those publishers you should
be ok. You will also want to learn how to do some scripting in Python or
Perl. For the rank beginner, I recommend going with Python and learning by
working through the chapters and exercises in How To Think Like a Computer
Scientist (free online at
http://www.greenteapress.com/thinkpython/thinkCSpy/html/ ). If you know how
to program in general then diveintopython.org is your best bet. I'm not a
Perl guy, but I'm sure someone can point you to resources.
For the actual network analysis, I'd first look into NodeXL since it's easy
to use, and if that doesn't meet your needs I'd go with something like
igraph (see http://igraph.sourceforge.net/ ), which works with both R and
Python.
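For the igraph route, here's a small python-igraph sketch that reads a reply edge list (the same shape as the hypothetical export earlier) and computes a couple of basic measures; from there you can move on to components, centrality, community detection, and so on.

# Sketch: load a directed reply network and compute simple measures
# with python-igraph. Input matches the hypothetical reply_edges.csv above.
import csv
import igraph

with open("reply_edges.csv") as f:
    edges = [(row[0], row[1]) for row in csv.reader(f)]

g = igraph.Graph.TupleList(edges, directed=True)
print("users:", g.vcount(), "reply ties:", g.ecount())
print("density:", g.density())

# Which users receive the most replies?
in_degree = dict(zip(g.vs["name"], g.degree(mode="in")))
top10 = sorted(in_degree.items(), key=lambda kv: kv[1], reverse=True)[:10]
print("most replied-to users:", top10)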
Best of luck.
-Tom
Christine Morton
Fri, Feb 26, 2010 3:34 AM
Wow. Thank you for the question, Tim, and for sharing the outline of your
research. It sounds very interesting and relevant. Would you be interested
in sharing more details with me offlist? And thanks, Tom, for your thoughtful
and thorough reply. I've learned a lot.
Regards,
Christine
CMQCC: Transforming Maternity Care
Christine H. Morton, PhD
Program Manager/Research Sociologist
California Maternal Quality Care Collaborative
Stanford University p. 650-725-6108 f. 650-721-5751
Medical School Office Building d. 650-721-2187 c. 650-995-4550
251 Campus Drive
Palo Alto, CA 94305-5415
cmorton@stanford.edu www.cmqcc.org
Caroline Haythornthwaite
Fri, Feb 26, 2010 11:30 AM
I forwarded Tim Hale's question to Anatoliy Gruzd at Dalhousie, but others may also be interested in his "TextAnalytics" system (see http://anatoliygruzd.com/home/?page_id=27). This provides a home for text analysis and network analysis of threaded discussions.
/Caroline
Caroline Haythornthwaite
Leverhulme Visiting Professor, Institute of Education, University of London (2009-10)
Professor, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 East Daniel St., Champaign IL 61820 (haythorn@illinois.edu)
el don
Tue, Mar 2, 2010 6:49 AM
a late entry but perhaps of interest.
as a linguist studying written interaction, i found sandra harrison's
work to be readable and relevant. our focus is not so much on
automatic data mining, but on what goes on in groups where
interaction occurs via language (and other modalities at times). one
of her papers outlines an approach that can be used to represent both
the posts in chronological order and who posted them ... it's a
simple framework (lots of manual labour i'm afraid), but it works
very well to show in summary what has been going on in long
multi-party discussions.
it's in a collection looking at communities of practice:
Harrison, S. (2003). "Computer-mediated interaction: Using discourse maps to represent multi-party, multi-topic asynchronous discussions." In Sarangi, S. & van Leeuwen, T. (eds.), Applied Linguistics and Communities of Practice.
the book can be viewed via google books, and most of the chapter can
be read online.
best,
alex