Christopher Pott
Fri, Jul 2, 2010 12:35 PM
Hi Susan,
So far I've only used Talend for generating CollectionObjects (no repeated fields, relations or any other schemas yet). The AdvancedXML output module does not seem to directly support multiple repeated fields but I think it's possible to find a way around this limitation (by formatting them before sending to the module). Talend can be configured to generate complete xml files (one for each object) ready for import to nuxeo (see attachment).
Post-processing is limited to this: to satisfy the Nuxeo import format, each generated XML file is placed in its own unique directory and renamed to document.xml. This directory path (relative to the Nuxeo directory tree) is inserted into the XML document in the <document><system><path> element.
CSIDs: The final node of the <path> (see above) becomes the CSID, so it is in fact whatever name I've chosen for the directory directly holding the XML file. At the moment, I'm not using particularly well-thought-out CSIDs (they are not UUIDs).
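A minimal sketch of the post-processing step described above, written in Python rather than the bash/awk script actually used (which is not shown in this thread). The function name `stage_for_import`, its parameters, and the example directory names are hypothetical; the sketch also assumes the generated file has a plain <document> root with no XML namespaces on <system> or <path>.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def stage_for_import(xml_file: Path, import_root: Path,
                     parent_path: str, dir_name: str) -> Path:
    """Place one generated record in its own directory as document.xml and
    write that directory's relative path into <document><system><path>.

    The last segment of the path (dir_name) is what ends up as the CSID.
    """
    target_dir = import_root / parent_path / dir_name
    # exist_ok=False makes a name collision fail loudly instead of
    # silently overwriting an earlier record.
    target_dir.mkdir(parents=True, exist_ok=False)

    tree = ET.parse(xml_file)
    root = tree.getroot()  # assumed to be <document>
    system = root.find("system")
    if system is None:
        system = ET.SubElement(root, "system")
    path_el = system.find("path")
    if path_el is None:
        path_el = ET.SubElement(system, "path")
    path_el.text = f"{parent_path}/{dir_name}"

    out = target_dir / "document.xml"
    tree.write(out, encoding="utf-8", xml_declaration=True)
    xml_file.unlink()  # the original file has been replaced by document.xml
    return out
```

Any unique string can be passed as `dir_name`. Passing `str(uuid.uuid4())` from the standard `uuid` module would give UUID-style names, though whether CSIDs need that form is exactly the open question raised here.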
But I'm glad you asked this because it makes me think... Is it a requirement that Services generates the CSID? In fact, it would be interesting to hear from someone whether it is a requirement to use the Services APIs to migrate data in general, or whether nuxeo shell is also an approved approach.
If it proves necessary to avoid nuxeo shell I think it would be optimal to access the service APIs directly from the ETL (requires building a new java component in Talend). Otherwise, I guess your java client could be adapted to handle the Talend (xml) output instead.
Regards,
Chris
Developer, Corpus Project
Statens Museum for Kunst (National Gallery of Denmark)
-----Original Message-----
From: Susan Stone [mailto:sstone@socrates.berkeley.edu]
Sent: July 1, 2010 20:08
To: Christopher Pott
Cc: Richard Millet; Glen Jackson; Chris Hoffman
Subject: Re: [Talk] Deployment experience at SMK?
Chris,
I'm working on data migration for the Hearst Museum at Berkeley, and I
guess you've seen some of the documentation of our project that Chris
Hoffman has put up. So far I am using the Pentaho (Kettle) tool to
create text files that I load using the CSpace services java client API.
I expect I am going to need to start creating XML when it becomes
possible to load "repeating" fields and more complex field groupings,
and I'm not sure if Talend will be better than Kettle for this.
Could you send me a sample of the "advanced XML" output for loading into
CSpace that you get out of the Talend ETL tool (before you do further
manipulation)? Does it output a series of XML records in a single file,
multiple files, or are you able to pipe (or something) output from
Talend to the Nuxeo shell that you are using?
Also, how are you creating Collection Space IDs (csids) in your load
process?
Thanks,
Susan Stone
Informatics Group, Data Services, UC Berkeley
Christopher Pott wrote:
Hi Chris,
I intend soon to follow your example and move some SMK documentation to the CS wiki. But in the meantime, here's a brief description of our experiences.
We've currently got two CollectionSpace deployments running, version 0.6 with ~33,000 collection objects and version 0.7 with just a few thousand.
The main part of our data currently resides in two systems: an art database which is (indirectly) served by an MS SQL Server, and an exhibition management system based upon a Visual FoxPro database.
I'm using the Talend ETL tool for data mapping, and so far it has provided the necessary functionality. For input, it can connect directly to our MS SQL Server, but connecting to FoxPro has not been so straightforward. After much experimentation, I migrated the FoxPro database to MySQL and then used a Talend MySQL input module. For output, Talend provides an output format named "Advanced XML" which can be used to generate a series of XML records. I then run a small (bash/awk) script on these files to format them for importing to Nuxeo. (This combination is not perfect. Ideally, it would be nice to use an ETL tool with a dedicated Nuxeo output module.)
I've been using the 'nuxeo shell' command line tool in interactive mode to load data to CollectionSpace. Our CollectionSpace deployments are currently on two VMware virtual servers running Debian Linux. I've been loading data via nuxeo shell remote connections to these servers and this process is slow (the time for transferring 33,000 records approached a couple of hours, but I've only once imported this amount of data and have not yet experimented with other ways to do this). From a single-user perspective, CollectionSpace performance with this amount of data is generally fine (except for a pagination issue on the Find and Edit main page). I've not yet run any stress tests or performance/load analysis on the servers and would welcome recommendations on the best tools/approach to accomplish these.
I'm keen to discuss other tools, experiences and ideas related to all aspects of deploying CollectionSpace so please feel free to get in touch.
Best Regards,
Chris Pott
Developer, Corpus Project
Statens Museum for Kunst (National Gallery of Denmark)
-----Original Message-----
From: talk-bounces@lists.collectionspace.org [mailto:talk-bounces@lists.collectionspace.org] On Behalf Of Chris Hoffman
Sent: June 12, 2010 01:37
To: talk@lists.collectionspace.org
Subject: [Talk] Deployment experience at SMK?
Hi Angela and others,
In a CollectionSpace meeting this week, I heard that SMK is working on data mapping and that they might have already loaded a significant volume of records (45,000 records). I'd love to hear confirmation and more details! We're documenting our experience for the prototype deployment for the University and Jepson Herbaria (UC Berkeley) at
https://wikihub.berkeley.edu/display/istds/Herbaria+CollectionSpace+Deployment
and for the Phoebe A. Hearst Museum of Anthropology at
http://wiki.collectionspace.org/display/collectionspace/The+PAHMA+CollectionSpace+Deployment
We're just starting up Jira projects for the next round of work we're doing tied to the 0.7 release.
I'd especially like to hear about data loading -- how you are doing this, how long it took, how the system performs, where you are running the system (dedicated hardware or VMs), and so on.
Many thanks,
Chris Hoffman
Manager, Informatics Services
UC Berkeley
Talk mailing list
Talk@lists.collectionspace.org
http://lists.collectionspace.org/mailman/listinfo/talk_lists.collectionspace.org
Patrick Schmitz
Fri, Jul 2, 2010 4:29 PM
This is great info, Chris. Thanks for sharing it.
On a separate thread of work, I am exploring how we can do bulk import to
Nuxeo, creating both our CSIDs as well as their internal UUID values.
Initially, I will be doing this for bulk import of vocabularies and
authorities, but what we learn should of course apply to import in general.
I'll keep the list posted on progress.
Thanks - Patrick
Susan Stone
Fri, Jul 2, 2010 5:46 PM
Thanks so much, Chris. A very different approach from what I'm doing so
far. Let's continue to exchange notes.
I hope the services group will be able to answer your questions on what
data loading methods are likely to be OK.
I may be trying out various things using the CSpace java and REST client
APIs with XML payloads (with guidance from the services group). Kettle
will let me run shell script jobs using the output of a transformation,
and I'll try using that before I think about a CSpace plug-in. I'll also
take a look at Talend and see what it offers.
Since CSpace is generating csids for me, it is important that I am able
to get them back so that I can use them later in relations and in some
cases updates (I'm doing some of my loading as updates). So part of my
approach is associating generated csids with IDs in the original data.
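One way the association between source IDs and generated csids might be kept is sketched below. This is only an illustration, not the loader described in this message: the function names and the CSV layout are hypothetical, and `csid_from_location` assumes the csid can be read as the last path segment of the URL of the newly created record, which the thread does not confirm.

```python
import csv
from pathlib import Path
from urllib.parse import urlparse


def csid_from_location(location: str) -> str:
    """Return the last path segment of a resource URL.

    Assumption: the URL of a newly created record ends in its csid.
    """
    return urlparse(location).path.rstrip("/").rsplit("/", 1)[-1]


def save_id_map(id_map: dict[str, str], csv_path: Path) -> None:
    """Write source-ID -> csid pairs to a two-column CSV file."""
    with csv_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["source_id", "csid"])
        writer.writerows(sorted(id_map.items()))


def load_id_map(csv_path: Path) -> dict[str, str]:
    """Read the CSV back so later relation or update jobs can look up csids."""
    with csv_path.open(newline="", encoding="utf-8") as fh:
        return {row["source_id"]: row["csid"] for row in csv.DictReader(fh)}
```

Keeping the map in a plain file means a later relations or update run can start from it without querying the server for every record.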
Thanks again for describing your approach.
Susan
Christopher Pott wrote:
Hi Susan,
So far I've only used Talend for generating CollectionObjects (no repeated fields, relations or any other schemas yet). The AdvancedXML output module does not seem to directly support multiple repeated fields but I think it's possible to find a way around this limitation (by formatting them before sending to the module). Talend can be configured to generate complete xml files (one for each object) ready for import to nuxeo (see attachment).
Post processing is limited to this: To satisfy the nuxeo import format, each generated xml file is placed in it's own unique directory and renamed (to document.xml). This directory path (relative to the nuxeo directory tree) is inserted into the xml document in the <document><system><path> part of the xml.
CSIDs: The final node of the <path> (see above) becomes the CSID (so is in fact whatever name I've chosen to give the directory directly holding the xml file). At the moment, I'm not using particularly well thought out csid's (not UUIDs).
But I'm glad you asked this because it makes me think.... Is it a requirement that Services generates the CSID? In fact it would be interesting to hear from someone whether is it a requirement to use Services APIs to migrate data in general, or is nuxeo shell also an approved approach?
If it proves necessary to avoid nuxeo shell I think it would be optimal to access the service APIs directly from the ETL (requires building a new java component in Talend). Otherwise, I guess your java client could be adapted to handle the Talend (xml) output instead.
Regards,
Chris
Developer, Corpus Project
Statens Museum for Kunst (National Gallery of Denmark)
-----Oprindelig meddelelse-----
Fra: Susan Stone [mailto:sstone@socrates.berkeley.edu]
Sendt: 1. juli 2010 20:08
Til: Christopher Pott
Cc: Richard Millet; Glen Jackson; Chris Hoffman
Emne: Re: [Talk] Deployment experience at SMK?
Chris,
I'm working on data migration for the Hearst Museum at Berkeley, and I
guess you've seen some of the documentation of our project that Chris
Hoffman has put up. So far I am using the Pentaho (Kettle) tool to
create text files that I load using the CSpace services java client API.
I expect I am going to need to start creating XML when it becomes
possible to load "repeating" fields and more complex field groupings,
and I'm not sure if Talend will be better than Kettle for this.
Could you send me a sample of the "advanced XML" output for loading into
CSpace that you get out of the Talend ETL tool (before you do further
manipulation)? Does it output a series of XML records in a single file,
multiple files, or are you able to pipe (or something) output from
Talend to the Nuxeo shell that you are using?
Also, how are you creating Collection Space IDs (csids) in your load
process?
Thanks,
Susan Stone
Informatics Group, Data Services, UC Berkeley
Christopher Pott wrote:
Hi Chris,
I intend soon to follow your example and move some SMK documentation to the CS wiki. But in the meantime, here's a brief description of our experiences.
We've currently got two CollectionSpace deployments running: version 0.6 with ~33,000 collection objects and version 0.7 with just a few thousand.
The main part of our data currently resides in two systems: an art database which is (indirectly) served by an MS SQL Server, and an exhibition management system based on a Visual FoxPro database.
I'm using the Talend ETL tool for data mapping, and so far it has provided the necessary functionality. For input, it can connect directly to our MS SQL Server, but connecting to FoxPro has not been so straightforward. After much experimentation, I migrated the FoxPro database to MySQL and then used a Talend MySQL input module. For output, Talend provides an output format named "Advanced XML" which can be used to generate a series of XML records. I then run a small (bash/awk) script on these files to format them for importing to Nuxeo. (This combination is not perfect. Ideally, it would be nice to use an ETL tool with a dedicated Nuxeo output module.)
I've been using the 'nuxeo shell' command line tool in interactive mode to load data to CollectionSpace. Our CollectionSpace deployments are currently on two VMware virtual servers running Debian Linux. I've been loading data via nuxeo shell remote connections to these servers and this process is slow (transferring 33,000 records took close to a couple of hours, but I've only once imported this amount of data and have not yet experimented with other ways to do this). From a single-user perspective, CollectionSpace performance with this amount of data is generally fine (except for a pagination issue on the Find and Edit main page). I've not yet run any stress tests or performance/load analysis on the servers and would welcome recommendations on the best tools/approach to accomplish these.
I'm keen to discuss other tools, experiences and ideas related to all aspects of deploying CollectionSpace so please feel free to get in touch.
Best Regards,
Chris Pott
Developer, Corpus Project
Statens Museum for Kunst (National Gallery of Denmark)
-----Original message-----
From: talk-bounces@lists.collectionspace.org [mailto:talk-bounces@lists.collectionspace.org] On behalf of Chris Hoffman
Sent: 12 June 2010 01:37
To: talk@lists.collectionspace.org
Subject: [Talk] Deployment experience at SMK?
Hi Angela and others,
In a CollectionSpace meeting this week, I heard that SMK is working on data mapping and that they might have already loaded a significant volume of records (45,000 records). I'd love to hear confirmation and more details! We're documenting our experience for the prototype deployment for the University and Jepson Herbaria (UC Berkeley) at
https://wikihub.berkeley.edu/display/istds/Herbaria+CollectionSpace+Deployment
and for the Phoebe A. Hearst Museum of Anthropology at
http://wiki.collectionspace.org/display/collectionspace/The+PAHMA+CollectionSpace+Deployment
We're just starting up Jira projects for the next round of work we're doing tied to the 0.7 release.
I'd especially like to hear about data loading -- how you are doing this, how long it took, how the system performs, where you are running the system (dedicated hardware or VMs), and so on.
Many thanks,
Chris Hoffman
Manager, Informatics Services
UC Berkeley
Talk mailing list
Talk@lists.collectionspace.org
http://lists.collectionspace.org/mailman/listinfo/talk_lists.collectionspace.org
Thanks so much, Chris. A very different approach from what I'm doing so
far. Let's continue to exchange notes.
I hope the services group will be able to answer your questions on what
data loading methods are likely to be OK.
I may be trying out various things using the CSpace java and REST client
APIs with XML payloads (with guidance from the services group). Kettle
will let me run shell script jobs using the output of a transformation,
and I'll try using that before I think about a CSpace plug-in. I'll also
take a look at Talend and see what it offers.
Since CSpace is generating csids for me, it is important that I am able
to get them back so that I can use them later in relations and in some
cases updates (I'm doing some of my loading as updates). So part of my
approach is associating generated csids with IDs in the original data.
Thanks again for describing your approach.
Susan
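As a rough illustration of the bookkeeping Susan describes (associating generated csids with IDs in the original data), here is a minimal Java sketch. It is not her actual code: the class name `CsidLookup`, its methods, the sample IDs and the CSV layout are all hypothetical, and the random UUID in `main` only stands in for whatever csid the services would return when a record is created.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper: remembers which csid was returned for each original record ID.
class CsidLookup {
    // Insertion order is kept so an exported file follows load order.
    private final Map<String, String> csidByLegacyId = new LinkedHashMap<>();

    // Store the csid returned when the record with this original ID was created.
    void record(String legacyId, String csid) {
        String previous = csidByLegacyId.putIfAbsent(legacyId, csid);
        if (previous != null && !previous.equals(csid)) {
            throw new IllegalStateException("Two csids for original ID " + legacyId);
        }
    }

    // Returns the csid for an original ID, or null if that record was never loaded.
    String csidFor(String legacyId) {
        return csidByLegacyId.get(legacyId);
    }

    // Simple two-column CSV; assumes IDs contain no commas or line breaks.
    String toCsv() {
        StringBuilder out = new StringBuilder("legacy_id,csid\n");
        for (Map.Entry<String, String> e : csidByLegacyId.entrySet()) {
            out.append(e.getKey()).append(',').append(e.getValue()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        CsidLookup lookup = new CsidLookup();
        // Stand-in for the csid a create call would return (assumption for the demo only).
        lookup.record("1-1000", UUID.randomUUID().toString());
        lookup.record("1-1001", UUID.randomUUID().toString());
        System.out.print(lookup.toCsv());
    }
}
```

Keeping such a table on disk would let later relation or update passes look up the csid for a source record without querying the system again.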
AR
Aron Roberts
Fri, Jul 2, 2010 7:10 PM
Hi Chris,
At 14:35 +0200 2010-07-02, Christopher Pott wrote:
>CSIDs: The final node of the <path> (see above)
>becomes the CSID (so is in fact whatever name
>I've chosen to give the directory directly
>holding the xml file). At the moment, I'm not
>using particularly well thought out csid's (not
>UUIDs).
My sense - and this is personal opinion only -
is that externally-assigned CSIDs for records in
CollectionSpace should ideally be UUIDs. There
is additional merit in using Type 4 pseudorandom
UUIDs, as generated by Java's UUID.randomUUID()
method or comparable code in other programming or
scripting languages, which match the identifiers
assigned to records by the CollectionSpace
services.
Doing so:
- Helps ensure uniqueness - at least to an extraordinarily high
probability.
http://en.wikipedia.org/wiki/Universally_Unique_Identifier#Random_UUID_probability_of_duplicates
- Invokes code in core Java, with a (presumably) well-designed
and tested algorithm - not to mention a robust code maintenance
infrastructure - to generate identifiers.
- Standardizes on a single format that can be relied upon when
and if it may be necessary to validate CSIDs (e.g. see regex below).
Using other types of identifiers *may* not offer the above benefits.
Aron
P.S. These UUIDs are simple to generate, as in this Java utility class:
import java.util.UUID;

public class GenerateCSID {
    public static void main(String[] args) {
        // randomUUID() returns a Type 4 (pseudorandom) UUID
        UUID id = UUID.randomUUID();
        System.out.println(id.toString());
    }
}
The generated CSIDs can also be matched by this regex:
([a-z0-9\-]{8}-[a-z0-9\-]{4}-4[a-z0-9\-]{3}-[89ab][a-z0-9\-]{3}-[a-z0-9\-]{12})
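A minimal sketch of how that validation might look in Java, assuming CSIDs are the lowercase string form returned by `UUID.randomUUID().toString()`. The class name `CsidCheck` and the stricter hex-only pattern are illustrative additions, not part of CollectionSpace. The regex quoted above also accepts the letters g-z and extra hyphens inside each group, so the sketch checks against both patterns.

```java
import java.util.UUID;
import java.util.regex.Pattern;

// Hypothetical helper for generating and checking Type 4 UUID-style CSIDs.
class CsidCheck {
    // The regex quoted in the message above, unchanged.
    static final Pattern LOOSE = Pattern.compile(
        "([a-z0-9\\-]{8}-[a-z0-9\\-]{4}-4[a-z0-9\\-]{3}-[89ab][a-z0-9\\-]{3}-[a-z0-9\\-]{12})");

    // Stricter variant (an assumption of this sketch): lowercase hex digits only.
    static final Pattern STRICT = Pattern.compile(
        "[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}");

    // Generate a new identifier the same way as the utility class above.
    static String newCsid() {
        return UUID.randomUUID().toString();
    }

    // True only if the whole string is a lowercase Type 4 UUID.
    static boolean looksLikeCsid(String candidate) {
        return candidate != null && STRICT.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String csid = newCsid();
        System.out.println(csid + " valid=" + looksLikeCsid(csid));
        // A directory-style name is rejected.
        System.out.println("object-0001 valid=" + looksLikeCsid("object-0001"));
    }
}
```

Using `matches()` rather than `find()` requires the entire string to fit the pattern, which matters when the identifier is taken from a directory name.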
AR
Aron Roberts
Fri, Jul 2, 2010 7:15 PM
At 12:10 -0700 2010-07-02, Aron Roberts wrote:
> My sense ... is that externally-assigned CSIDs for records in
>CollectionSpace should ideally be UUIDs. There is additional merit
>in using [CSIDs] ... which match the identifiers assigned to records
>by the CollectionSpace services.
>
> Doing so:
>
> - Helps ensure uniqueness - at least to an extraordinarily high
> probability. ...
>
> - Invokes code in core Java ...
>
> - Standardizes on a single format ...
And a fourth benefit: helps ensure consistency of identifiers with
those assigned in any future bulk import mechanism in
CollectionSpace, such as that which Patrick is currently exploring:
At 09:29 -0700 2010-07-02, Patrick Schmitz wrote:
>I am exploring how we can do bulk import to Nuxeo, creating both our
>CSIDs as well as their internal UUID values.
Aron
CP
Christopher Pott
Mon, Jul 12, 2010 2:32 PM
Hi,
I'm experiencing a problem running v0.8 locally. Everything seems to go
smoothly with the installation and logon but when it comes to creating a
new collection object, the template for object entry is not displayed by
the UI. The same problem exists with Loan Out, but the rest of the UI is
working as usual - I can successfully create a new Person, for example.
I've tried the v0.8 archives available on the ftp site and also checking
out from source and building, but I get the same result. Anyone have any
ideas what I'm missing?
Thanks,
Chris
CM
Chris Martin
Mon, Jul 12, 2010 2:51 PM
I have an idea that you might need to initialise the controlled lists.
If you ran the tests with mvn for the app layer, they should have done
it for you. But if you skipped the tests, you can run the initialization
from a browser.
First go to
/cspace-ui/html and log in.
Then go to
/chain/authorities/vocab/initialize
Unfortunately I haven't set it up to give any useful messages on the
screen, but it should only take a moment.
Then log back in and see if the pages work.
If you are missing the default person and org authorities (and if you
need them),
you can get them by going to
/chain/reset
(Once again, you must have logged in to the UI or else the authorization
will fail.)
Tell me how it goes.
Chris M
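The order of those browser steps can be written down as a small Java sketch. It only builds the URLs to visit in sequence and does not log in or send any requests. The class name `InitUrls` and the base URL `http://localhost:8180` are assumptions made for illustration, so substitute the host and port of your own deployment.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that lists, in order, the pages named in the message above.
class InitUrls {
    // baseUrl is scheme://host:port of the deployment (an assumption; adjust as needed).
    static List<URI> stepsFor(String baseUrl, boolean resetAuthorities) {
        String base = baseUrl.endsWith("/")
            ? baseUrl.substring(0, baseUrl.length() - 1)
            : baseUrl;
        List<URI> steps = new ArrayList<>();
        steps.add(URI.create(base + "/cspace-ui/html"));                     // log in here first
        steps.add(URI.create(base + "/chain/authorities/vocab/initialize")); // controlled lists
        if (resetAuthorities) {
            steps.add(URI.create(base + "/chain/reset"));                    // default person/org authorities
        }
        return steps;
    }

    public static void main(String[] args) {
        for (URI step : stepsFor("http://localhost:8180", true)) {
            System.out.println(step);
        }
    }
}
```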
CP
Christopher Pott
Tue, Jul 13, 2010 12:09 PM
Hi Chris,
> if you skipped the tests you can run the initialization
> from a browser
In my initial attempt I downloaded the chain.war from the
collectionspace ftp server, so I had no interaction with mvn with regard
to the application layer. Maybe there is a problem with this war?
The good news is that '/chain/authorities/vocab/initialize' did the
trick for making the missing templates visible (although it returned a
400 error). However, there is now quite a noticeable delay (10-20
seconds) when the UI attempts to GET /chain/objects/uispec.
I also tried chain/reset, which worked fine, but I think it was not needed
in my case, as I will in any case be populating these objects with our own
data.
Thanks for the help,
Chris
-----Original message-----
From: Chris Martin [mailto:csm22@caret.cam.ac.uk]
Sent: 12 July 2010 16:52
To: Christopher Pott
Cc: talk@lists.collectionspace.org
Subject: Re: [Talk] problem running v0.8
I have an idea that you might need to initialise the controlled lists
if you ran the tests on mvn for the app layer then it should have done
it for you. But if you skipped the tests you can run the initialization
from a browser
first go to
/cspace-ui/html and login
then go to
/chain/authorities/vocab/initialize
unfortunately I haven't set it up to give any useful messages on the
screen but it should only take a moment
then log back in and see if the pages work.
if you are missing the default person and org authorities (and if you
need them)
you can get them by going to
/chain/reset
(once again you must have logged in to the ui or else the authorization
will fail)
Tell me how it goes
Chris M
On 12/07/2010 15:32, Christopher Pott wrote:
Hi,
I'm experiencing a problem running v0.8 locally. Everything seems to go
smoothly with the installation and logon but when it comes to creating a
new collection object, the template for object entry is not displayed by
the UI. The same problem exists with Loan Out, but the rest of the UI is
working as usual - I can successfully create a new Person, for example.
I've tried the v0.8 archives available on the ftp site and also checking
out from source and building, but I get the same result. Anyone have any
ideas what I'm missing?
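The delay on GET /chain/objects/uispec reported in this message could be measured outside the UI. This is a rough sketch only, not something from the thread: the base URL is an assumption, and the request is sent without a login session, so on a real install it may be refused rather than show the full delay.

```python
import time
import urllib.error
import urllib.request

# Assumption: base URL of a local install; the path is the one from the message.
BASE_URL = "http://localhost:8180"
UISPEC_PATH = "/chain/objects/uispec"


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def fetch_status(url, timeout=60):
    """GET url, read the whole body, and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include the body download in the measured time
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def measure_uispec(base_url=BASE_URL):
    """Return (status, elapsed_seconds) for one GET of the uispec path."""
    return timed(fetch_status, base_url.rstrip("/") + UISPEC_PATH)
```

Running measure_uispec() a few times and comparing the elapsed values would show whether the 10-20 second figure is consistent or only affects the first request.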
CM
Chris Martin
Tue, Jul 13, 2010 12:13 PM
On 13/07/2010 13:09, Christopher Pott wrote:
>> if you skipped the tests you can run the initialization
>> from a browser
> In my initial attempt I downloaded the chain.war from the
> collectionspace ftp server, so I had no interaction with mvn with regard
> to the application layer. Maybe there is a problem with this war?
No, the war is fine.
However, if you want to auto-populate data like the controlled
vocabulary lists, then a script needs to be run once the war is in place
and can talk to the service layer.
We need to update the documentation to add this stage - sorry for the
confusion.
> The good news is that '/chain/authorities/vocab/initialize' did the
> trick for making the missing templates visible (although it returned a
> 400 error). However, there is now quite a noticeable delay (10-20
> seconds) when the UI attempts to GET /chain/objects/uispec.
Yep, the time taken for objects/uispec is a known issue and should be
resolved in v1.0a, as controlled lists will be handled differently and so
should not need to worry the uispec as much.
> I also tried chain/reset, which worked fine, but I think it was not needed
> in my case, as I will in any case be populating these objects with our own
> data.
> Thanks for the help,
> Chris