talk@lists.collectionspace.org

WE HAVE SUNSET THIS LISTSERV - Join us at collectionspace@lyrasislists.org

UTF-8 error on import

Nate Solas
Wed, Feb 15, 2012 9:11 PM

Hello! I'm working on importing into the Persons service, and it's going
pretty well. The import is choking on what appears to me to be valid UTF-8,
but I'm willing to be wrong on that... I just can't seem to prove that it's
NOT UTF-8. Anyway, I'm attaching the file to try to preserve the encoding,
so that someone with fresh eyes can tell me if that's it. Surely someone has
successfully imported UTF-8 characters using curl?

curl -X POST http://localhost:8180/cspace-services/imports -i \
  -u "admin@walkerart.org:Administrator" \
  -F "file=@./creator_import.xml;type=text/xml;"

[ the first Person imports fine: ]

HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=D683F5751AFD9B9C44D923D001090355; Path=/cspace-services
Content-Type: application/xml
Content-Length: 265
Date: Wed, 15 Feb 2012 21:03:12 GMT

<?xml ?><import><msg>SUCCESS</msg><report></report>READ:

/home/usr/local/share/apache-tomcat-6.0.33/temp/imports-882d44c8-753c-4adb-831d-378f0fa46899/Persons/7d24e866-cc78-49a8-9830-595b9ae373e5/document.xml
/Persons/7d24e866-cc78-49a8-9830-595b9ae373e5

[ ... and it ends there. Nothing useful in collectionspace-services.log,
but catalina.out shows this: ]

FATAL_ERROR:org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.collectionspace.services.common.XmlSaxFragmenter.parse(XmlSaxFragmenter.java:294)
at org.collectionspace.services.imports.TemplateExpander.expandInputSource(TemplateExpander.java:126)
at org.collectionspace.services.imports.ImportsResource.expandXmlPayloadToDir(ImportsResource.java:288)
at org.collectionspace.services.imports.ImportsResource.createFromInputSource(ImportsResource.java:182)
at org.collectionspace.services.imports.ImportsResource.acceptUpload(ImportsResource.java:313)

....

It seems to be saying that I'm claiming UTF-8 but the content isn't actually
UTF-8? I can successfully import the doc by removing the a-acute character
(á) in the XML, so that's the culprit for sure. A hex editor shows me the
character is a valid multibyte sequence in UTF-8: C3 A1. References:
http://en.wikipedia.org/wiki/UTF-8#Codepage_layout
http://www.utf8-chartable.de/

Help! I'm so close to slamming (a version of) our entire artist list into
CS, but I can't get past the encoding...
Thanks,
Nate
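
One way to settle whether a file really is valid UTF-8 is to decode it strictly and look at where the first failure occurs. The following is a minimal Python sketch, assuming the file fits in memory; the function names and the sample bytes are illustrative and are not part of CollectionSpace or of the attached file.

```python
from typing import Optional, Tuple


def first_invalid_utf8(data: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (offset, nearby bytes) for the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on any malformed sequence
    except UnicodeDecodeError as err:
        start = max(err.start - 8, 0)
        return err.start, data[start:err.start + 8]
    return None


def check_file(path: str) -> Optional[Tuple[int, bytes]]:
    """Read a file as raw bytes (no implicit decoding) and validate it."""
    with open(path, "rb") as handle:
        return first_invalid_utf8(handle.read())


if __name__ == "__main__":
    # Hypothetical sample data: a-acute encoded as UTF-8 (C3 A1) is valid ...
    print(first_invalid_utf8(b"<x>\xc3\xa1</x>"))
    # ... while the same character as a single Latin-1 byte (E1) is not.
    print(first_invalid_utf8(b"<x>\xe1</x>"))
```

Running `check_file("creator_import.xml")` on the file being uploaded would return None if every byte sequence is well-formed UTF-8. In that case the corruption, if any, would have to happen somewhere after the file leaves disk.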

Chris Hoffman
Wed, Feb 15, 2012 10:12 PM

Hey Nate,
I'm living the data import dream right now as well! What platform are you importing into? There's some extra config needed, for example, if your CSpace stack is running on a Mac. We are importing UTF-8 successfully.
Chris

Aron Roberts
Wed, Feb 15, 2012 10:24 PM

On Wed, Feb 15, 2012 at 2:12 PM, Chris Hoffman
<chris.hoffman@berkeley.edu> wrote:

> I'm living the data import dream right now as well!

;-)

> What platform are you inputing into?  There's some extra config for example
> if your CSpace stack is running on a Mac.  We are importing UTF8 successfully.

What Chris is referring to here may be what's discussed in
<http://issues.collectionspace.org/browse/CSPACE-4447?focusedCommentId=25126&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-25126>,
and in the subsequent comments in that issue.

Aron

Nate Solas
Wed, Feb 15, 2012 10:40 PM

Linux. I'll take a look tonight with fresh eyes; good to hear
someone's getting UTF-8 in without problems. Chris Potts, any Talend
gotchas? I've got it set to UTF-8 at every step, as far as I can
tell.

Thanks,
Nate
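
To check the "UTF-8 at every step" assumption on the final file itself, one can compare the encoding named in the XML declaration with what the bytes actually decode as. The following Python sketch is illustrative only: the helper names are hypothetical, it handles only ASCII-compatible encodings (not UTF-16), and it says nothing about how Talend or the imports service treat the file.

```python
import re
from typing import Optional

# Matches the encoding pseudo-attribute of an XML declaration, e.g.
# <?xml version="1.0" encoding="UTF-8"?>
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        data = data[3:]
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def decodes_as_declared(data: bytes) -> bool:
    """True if the bytes decode under the declared encoding (UTF-8 if none)."""
    # Assumption: with no declaration, treat the document as UTF-8.
    encoding = declared_encoding(data) or "utf-8"
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers an encoding name Python does not recognise.
        return False
    return True
```

A file that declares UTF-8 but contains a-acute as the single byte E1 would make `decodes_as_declared` return False, whereas the two-byte form C3 A1 would pass.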

Nate Solas
Wed, Feb 15, 2012 10:41 PM

Oops. Reading my email out of order... :)

Jesse Martinez
Wed, Feb 15, 2012 10:51 PM

Hi Nate,

It might be a very minor thing and completely inconsequential, but
the imports example on the wiki (http://wiki.collectionspace.org/x/joE9B)
shows the curl URL with an appended "type" parameter: ?type=xml

- Jesse

I'm not all that familiar with the imports service and the new &
improved way in which imports are invoked via curl as a form POST, so
take this with a grain of salt.

SS
Susan Stone
Thu, Feb 16, 2012 12:20 AM

Nate,

The UTF-8 looks good and the POST command looks right. I'm on Linux and
I've had no trouble with UTF-8, so I think the problem is probably on the
server or database side.

That said, I don't usually include the line

<?xml version="1.0" encoding="UTF-8"?>

in my import files. They just start with <imports> and end with
</imports>; the import service takes care of the rest and, I guess,
assumes UTF-8 as well. I'm not sure how I came to do it this way, though.

Susan
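
For anyone who wants to double-check an import file's encoding independently of the server, here is a minimal Python sketch (not part of the original thread). It reports the offset of the first byte that is not valid UTF-8, and whether the payload begins with an XML declaration. The function names are hypothetical and the sample documents are made up for illustration.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def has_xml_declaration(data: bytes) -> bool:
    """True if the payload starts with an XML declaration (after an optional BOM)."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        data = data[3:]
    return data.startswith(b"<?xml")


# Made-up sample: an <imports> document containing a-acute encoded as C3 A1.
good = "<imports><name>Tom\u00e1s</name></imports>".encode("utf-8")
# The same text encoded as Latin-1 has a lone E1 byte where UTF-8 expects C3 A1.
bad = "<imports><name>Tom\u00e1s</name></imports>".encode("latin-1")

print(first_invalid_utf8_offset(good))
print(first_invalid_utf8_offset(bad))
print(has_xml_declaration(good))
```

To run it on a real file, read the file in binary mode (for example `open(path, "rb").read()`) and pass the bytes to `first_invalid_utf8_offset`.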

SS
Susan Stone
Thu, Feb 16, 2012 12:26 AM

Oh, I just noticed the difference Jesse mentions. I also use
imports?type=xml when I use the -F form of curl (but I still usually use
...imports -i -u "whatever:whatever" -H "Content-Type: application/xml"
-T xxx) and it still works for me.

Susan
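
To make the difference between the two curl forms concrete, here is a rough Python sketch of what each one puts on the wire. It only builds headers and bodies and never contacts a server. The field name "file" and the part type text/xml come from Nate's command; the boundary string, the function names, and the sample payload are assumptions for illustration. The point to notice is that with -F the file's content type travels in a per-part header inside the body, separate from the request's own Content-Type.

```python
from typing import Dict, Tuple


def raw_upload(xml: bytes) -> Tuple[Dict[str, str], bytes]:
    """Roughly what -H "Content-Type: application/xml" -T file sends:
    the file is the entire request body."""
    return {"Content-Type": "application/xml"}, xml


def form_upload(xml: bytes, filename: str, part_type: str,
                boundary: str = "example-boundary") -> Tuple[Dict[str, str], bytes]:
    """Roughly what -F "file=@...;type=..." sends:
    the file is one part of a multipart/form-data body."""
    part_headers = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {part_type}\r\n"
        "\r\n"
    ).encode("ascii")
    closing = f"\r\n--{boundary}--\r\n".encode("ascii")
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return headers, part_headers + xml + closing


# Made-up payload containing a-acute as the UTF-8 bytes C3 A1.
payload = "<imports><name>Tom\u00e1s</name></imports>".encode("utf-8")

raw_headers, raw_body = raw_upload(payload)
form_headers, form_body = form_upload(payload, "creator_import.xml", "text/xml")

print(raw_headers["Content-Type"])
print(form_headers["Content-Type"])
print(b"Content-Type: text/xml" in form_body)
```

In both shapes the file's bytes are carried unchanged; what differs is the metadata the server has available when it decides how to decode them.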

NS
Nate Solas
Thu, Feb 16, 2012 3:23 AM

Thanks, everyone, for your input. I don't quite have the WHY figured out, but
I have figured out WHEN it happens: specifically, with the change Ray
recommends here, where we use curl to send a multipart/form-data upload:
http://wiki.collectionspace.org/display/collectionspace/Imports+Service+Home#ImportsServiceHome-Sendingtherequest%3A

If I revert to the old, now deprecated, method of using -T to upload the
file, it works. The multipart failure matches what happens when uploading
the file manually through the web interface, which also fails.

I can't figure out all the differences in the code between those methods,
but I notice it's a pretty different sequence for -T: it calls
"payloadToFilename" and writes the payload out, THEN opens that and runs
the import. The failing method uses "createFromInputSource", and in the
debugger it looks like the inputSource has a byteStream with an
InputStreamReader, which in turn has a StreamDecoder with two fields that
seem suspicious in this context: cs and decoder, both of which refer to
US_ASCII. Should they be UTF-8? Who knows.

I'm tired and have a working alternative, so... I'll file a JIRA in the
morning? Seems pretty clear the new upload-to-import functionality is
breaking UTF-8.

Thanks again for your efforts,
Nate
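
As a side illustration of why a US-ASCII decoder would be suspicious here (a Python sketch, not the CollectionSpace code): the two bytes C3 A1 form a single character under UTF-8, but neither byte is valid US-ASCII. Python's errors="replace" stands in for a lenient decoder; whether the Java stream seen in the debugger behaves the same way is an assumption.

```python
A_ACUTE = b"\xc3\xa1"  # the byte pair found in the hex editor

# Decoded as UTF-8, the pair is one character: U+00E1 (a-acute).
as_utf8 = A_ACUTE.decode("utf-8")

# A strict US-ASCII decoder rejects the first byte outright.
try:
    A_ACUTE.decode("ascii")
    strict_ascii_ok = True
except UnicodeDecodeError:
    strict_ascii_ok = False

# A lenient decoder substitutes U+FFFD for each byte it cannot map,
# so the original character is lost before any XML parser sees it.
as_lenient_ascii = A_ACUTE.decode("ascii", errors="replace")

print(len(as_utf8), hex(ord(as_utf8)))
print(strict_ascii_ok)
print([hex(ord(ch)) for ch in as_lenient_ascii])
```

For context, RFC 3023 specifies US-ASCII as the default charset for text/xml when no charset parameter is given. That may or may not be what is happening with the type=text/xml part in the -F upload; the connection is a guess, not something established in this thread.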

S
sstone@socrates.berkeley.edu
Thu, Feb 16, 2012 5:09 AM

Nate,

Thanks for discovering this problem. If the -F/input form actually doesn't
handle UTF-8 correctly, I'm just lucky I've been using the -T way all
along.

So far I haven't encountered a case where -F works and -T fails to import.
However, they may fail differently: -T fails obviously, while -F seems to
work but actually fails. Restarting Tomcat usually helps.

Susan

Thanks everyone for your input. I don't quite have the WHY figured out,
but
I have figured out WHEN it happens. Specifically, the change Ray
recommends
here, where we use curl to send a multipart/form-data upload:
http://wiki.collectionspace.org/display/collectionspace/Imports+Service+Home#ImportsServiceHome-Sendingtherequest%3A

If I revert to the old, now deprecated, method of using -T to upload the
file, it works. This reflects the situation using the web interface to
manually upload the file, where it also fails.

I can't figure out all the differences in the code between those methods,
but I notice it's a pretty different sequence for the -T: it calls
"payloadToFilename" and writes it out, THEN opens that and runs the
import.
The failing method uses "createFromInputSource", and from the debugger it
seems like maybe the inputSource has a byteStream with an
InputStreamReader
which finally has a StreamDecoder with two variables that seem suspicious
in this context: cs, and decoder, both of which refer to US_ASCII. Should
be UTF-8? Who knows.

I'm tired and have a working alternative, so... I'll file a JIRA in the
morning? Seems pretty clear the new upload-to-import functionality is
breaking UTF-8.

Thanks again for your efforts,
Nate

On Wed, Feb 15, 2012 at 6:26 PM, Susan Stone
sstone@socrates.berkeley.eduwrote:

Oh, I just noticed the difference Jesse mentions. I also use
imports?type=xml when I use the -F form of curl (but I still usually use
...imports -i -u "whatever:whatever" -H "Content-Type: application/xml" -T
xxx) and it still works for me.

Susan

On 02/15/2012 01:11 PM, Nate Solas wrote:

Hello! I'm working on importing into the Persons service, and it's going
pretty well. It's choking on what appears to me to be valid UTF-8, but
I'm willing to be wrong on that... I just can't seem to prove that it's
NOT UTF-8. Anyway, I'm attaching the file to try to preserve the
encoding and someone with fresh eyes can tell me if that's it. Surely
someone has successfully imported UTF-8 characters using curl?

curl -X POST http://localhost:8180/cspace-services/imports -i -u
"admin@walkerart.org:Administrator" -F
"file=@./creator_import.xml;type=text/xml;"

[ the first Person imports fine: ]

HTTP/1.1 100 Continue

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Set-Cookie: JSESSIONID=D683F5751AFD9B9C44D923D001090355;
Path=/cspace-services
Content-Type: application/xml
Content-Length: 265
Date: Wed, 15 Feb 2012 21:03:12 GMT

<?xml ?><import><msg>SUCCESS</msg><report></report>READ:

/home/usr/local/share/apache-tomcat-6.0.33/temp/imports-882d44c8-753c-4adb-831d-378f0fa46899/Persons/7d24e866-cc78-49a8-9830-595b9ae373e5/document.xml
/Persons/7d24e866-cc78-49a8-9830-595b9ae373e5

[ ... and it ends there. Nothing useful in collectionspace-services.log,
but catalina.out shows this: ]

FATAL_ERROR:org.apache.xerces.impl.io.MalformedByteSequenceException:
Invalid byte 1 of 1-byte UTF-8 sequence.
org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 1
of 1-byte UTF-8 sequence.
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at org.collectionspace.services.common.XmlSaxFragmenter.parse(XmlSaxFragmenter.java:294)
at org.collectionspace.services.imports.TemplateExpander.expandInputSource(TemplateExpander.java:126)
at org.collectionspace.services.imports.ImportsResource.expandXmlPayloadToDir(ImportsResource.java:288)
at org.collectionspace.services.imports.ImportsResource.createFromInputSource(ImportsResource.java:182)
at org.collectionspace.services.imports.ImportsResource.acceptUpload(ImportsResource.java:313)

....

Seems like it says I'm claiming UTF-8 but it's not actually UTF-8? I can
successfully import the doc by removing the a-acute character in the
XML, so that's the culprit for sure. A hex editor shows me the character
is a valid multibyte sequence in UTF-8: C3-A1
http://en.wikipedia.org/wiki/UTF-8#Codepage_layout
http://www.utf8-chartable.de/
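
The hex-editor check can also be done programmatically. The Python sketch below confirms that a-acute encodes to C3-A1 in UTF-8 and includes a small helper (a hypothetical name, not part of any CollectionSpace tooling) that reports where a byte string first stops being valid UTF-8. Reading the import file in binary mode and passing its contents to the helper would be one way to rule the file itself in or out.

```python
# Sketch only: a quick way to check whether bytes are valid UTF-8.
# The helper name is made up for illustration.

def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence,
    or None if the whole input decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


if __name__ == "__main__":
    # a-acute in UTF-8 is the two-byte sequence C3 A1.
    print("á".encode("utf-8").hex())
    # Valid UTF-8: no offset reported.
    print(first_invalid_utf8_offset(b"Tom\xc3\xa1s"))
    # The Latin-1 byte for a-acute (E1) is not valid UTF-8 in this position.
    print(first_invalid_utf8_offset(b"Tom\xe1s"))
```

If the helper returns None for the file on disk, the file is valid UTF-8 and the damage must be happening somewhere after it leaves curl.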

Help! I'm so close to slamming (a version of) our entire artist list
into CS, but I can't get past the encoding...
Thanks,
Nate

_______________________________________________
Talk mailing list
Talk@lists.collectionspace.org
http://lists.collectionspace.org/mailman/listinfo/talk_lists.collectionspace.org
