Hi Angela and others,
In a CollectionSpace meeting this week, I heard that SMK is working on data mapping and that they might have already loaded a significant volume of records (45,000 records). I'd love to hear confirmation and more details! We're documenting our experience for the prototype deployment for the University and Jepson Herbaria (UC Berkeley) at
https://wikihub.berkeley.edu/display/istds/Herbaria+CollectionSpace+Deployment
and for the Phoebe A. Hearst Museum of Anthropology at
http://wiki.collectionspace.org/display/collectionspace/The+PAHMA+CollectionSpace+Deployment
We're just starting up Jira projects for the next round of work tied to the 0.7 release.
I'd especially like to hear about data loading -- how you are doing this, how long it took, how the system performs, where you are running the system (dedicated hardware or VMs), and so on.
Many thanks,
Chris Hoffman
Manager, Informatics Services
UC Berkeley
Hi Chris,
I intend soon to follow your example and move some SMK documentation to the CS wiki. But in the meantime, here's a brief description of our experiences.
We've currently got two CollectionSpace deployments running: version 0.6 with ~33,000 collection objects and version 0.7 with just a few thousand.
The main part of our data currently resides in two systems: an art database which is (indirectly) served by MS SQL Server, and an exhibition management system based on a Visual FoxPro database.
I'm using the Talend ETL tool for data mapping, and so far it has provided the necessary functionality. For input, it can connect directly to our MS SQL Server, but connecting to FoxPro has not been so straightforward. After much experimentation, I migrated the FoxPro database to MySQL and then used a Talend MySQL input module. For output, Talend provides an output format named "Advanced XML", which can be used to generate a series of XML records. I then run a small bash/awk script on these files to format them for importing to Nuxeo. (This combination is not perfect; ideally, it would be nice to use an ETL tool with a dedicated Nuxeo output module.)
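To give a flavour of that formatting step, here's a simplified sketch in Python of how standalone record fragments could be wrapped into a single import document. The element names are placeholders for illustration, not our actual schema or the exact format Nuxeo expects.

```python
# Sketch of the post-processing step: Talend's "Advanced XML" output yields
# one standalone <record> fragment per collection object, and these are
# wrapped into a single document for a bulk import. Element names here are
# illustrative placeholders, not a real SMK or Nuxeo schema.
import xml.etree.ElementTree as ET

def wrap_records(record_xml_strings, root_tag="import"):
    """Combine standalone record fragments into one import document."""
    root = ET.Element(root_tag)
    for fragment in record_xml_strings:
        root.append(ET.fromstring(fragment))
    return ET.tostring(root, encoding="unicode")

records = [
    "<record><objectNumber>SMK-0001</objectNumber></record>",
    "<record><objectNumber>SMK-0002</objectNumber></record>",
]
combined = wrap_records(records)
```

In practice our script is bash/awk rather than Python, but the transformation is of this shape: collect fragments, wrap them, and write out one file per batch.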
I've been using the Nuxeo Shell command-line tool in interactive mode to load data into CollectionSpace. Our CollectionSpace deployments are currently on two VMware virtual servers running Debian Linux. I've been loading data via Nuxeo Shell remote connections to these servers, and this process is slow (transferring 33,000 records took close to a couple of hours, but I've only imported this amount of data once and have not yet experimented with other approaches). From a single-user perspective, CollectionSpace performance with this amount of data is generally fine (except for a pagination issue on the Find and Edit main page). I've not yet run any stress tests or performance/load analysis on the servers and would welcome recommendations on the best tools and approach for these.
I'm keen to discuss other tools, experiences and ideas related to all aspects of deploying CollectionSpace so please feel free to get in touch.
Best Regards,
Chris Pott
Developer, Corpus Project
Statens Museum for Kunst (National Gallery of Denmark)
-----Original Message-----
From: talk-bounces@lists.collectionspace.org [mailto:talk-bounces@lists.collectionspace.org] On Behalf Of Chris Hoffman
Sent: 12 June 2010 01:37
To: talk@lists.collectionspace.org
Subject: [Talk] Deployment experience at SMK?
Talk mailing list
Talk@lists.collectionspace.org
http://lists.collectionspace.org/mailman/listinfo/talk_lists.collectionspace.org
Hi Chris,
Thanks for posting this information -- very interesting. I think we might experiment with your approach to data loading (creating XML records via an ETL tool, loading via Nuxeo). We've been using Kettle, though I have worked a little bit with Talend (mainly their data quality tool). We have been loading data by running a small Java program that calls the services layer, row by row. However, the small Java program needs maintenance and updates as the schema changes, and I think it takes a bit longer for us to load data. I'll check with Glen Jackson and Susan Stone on our team to get some better numbers posted. We have talked with the UC Berkeley team about developing a better way to load large amounts of data.
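For reference, the shape of that row-by-row loader looks roughly like this. This is a simplified sketch in Python rather than our actual Java, and the endpoint URL, port, and element names are placeholders, so check the services schemas for the release you're running:

```python
# Sketch of a row-by-row loader: map each source row to a services-layer
# XML payload and POST it. The URL and element names below are assumed
# placeholders for illustration, not the exact CollectionSpace schema.
import urllib.request
from xml.sax.saxutils import escape

SERVICES_URL = "http://localhost:8180/cspace-services/collectionobjects"  # assumed

def build_payload(row):
    """Map one source row (a dict) to a simplified XML payload."""
    return (
        "<document>"
        f"<objectNumber>{escape(row['objectNumber'])}</objectNumber>"
        f"<title>{escape(row['title'])}</title>"
        "</document>"
    )

def post_record(row):
    """POST a single record to the services layer and return the HTTP status."""
    req = urllib.request.Request(
        SERVICES_URL,
        data=build_payload(row).encode("utf-8"),
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = build_payload({"objectNumber": "2010.1.1", "title": "Untitled"})
```

Because the field mapping is hard-coded, any schema change means editing and redeploying the loader, which is the maintenance cost I mentioned, and one reason a generic bulk-load path appeals to us.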
It will be good to talk about performance. We are getting 0.7 installed on a couple of SliceHost VMs and will be doing some stress testing there. Some of our collections have hundreds of thousands of records in their current system.
http://wiki.collectionspace.org/display/collectionspace/Data+volumes+for+potential+adopters
We are also going to get a VM running in our local UCB data center so we can test local performance.
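As a starting point for that testing, the Python sketch below shows the kind of measurement we have in mind: fire concurrent reads at a placeholder URL and summarize per-request latency. For real stress testing a dedicated tool such as JMeter or ab would be more appropriate; this just illustrates the shape of the probe.

```python
# Minimal concurrency probe: time N GET requests issued by a thread pool
# and report min/median/max latency. The URL is a placeholder.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def timed_get(url):
    """Fetch the URL once and return the elapsed time in seconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        resp.read()
    return time.monotonic() - start

def summarize(latencies):
    """Return (min, median, max) over a list of per-request latencies."""
    ordered = sorted(latencies)
    return ordered[0], ordered[len(ordered) // 2], ordered[-1]

def probe(url, workers=10, requests_total=100):
    """Run requests_total GETs across a pool of workers and summarize."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return summarize(list(pool.map(timed_get, [url] * requests_total)))
```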
Regards,
Chris
On Jun 14, 2010, at 3:19 AM, Christopher Pott wrote: